diff mbox series

[qemu,v20] spapr: Implement Open Firmware client interface

Message ID 20210520090557.435689-1-aik@ozlabs.ru (mailing list archive)
State New, archived
Headers show
Series [qemu,v20] spapr: Implement Open Firmware client interface | expand

Commit Message

Alexey Kardashevskiy May 20, 2021, 9:05 a.m. UTC
The PAPR platform describes an OS environment that's presented by
a combination of a hypervisor and firmware. The features it specifies
require collaboration between the firmware and the hypervisor.

Since the beginning, the runtime component of the firmware (RTAS) has
been implemented as a 20 byte shim which simply forwards it to
a hypercall implemented in qemu. The boot time firmware component is
SLOF - but a build that's specific to qemu, and has always needed to be
updated in sync with it. Even though we've managed to limit the amount
of runtime communication we need between qemu and SLOF, there's some,
and it has become increasingly awkward to handle as we've implemented
new features.

This implements a boot time OF client interface (CI) which is
enabled by a new "x-vof" pseries machine option (stands for "Virtual Open
Firmware). When enabled, QEMU implements the custom H_OF_CLIENT hcall
which implements Open Firmware Client Interface (OF CI). This allows
using a smaller stateless firmware which does not have to manage
the device tree.

The new "vof.bin" firmware image is included with source code under
pc-bios/. It also includes RTAS blob.

This implements a handful of CI methods just to get -kernel/-initrd
working. In particular, this implements the device tree fetching and
simple memory allocator - "claim" (an OF CI memory allocator) and updates
"/memory@0/available" to report the client about available memory.

This implements changing some device tree properties which we know how
to deal with, the rest is ignored. To allow changes, this skips
fdt_pack() when x-vof=on as not packing the blob leaves some room for
appending.

In absence of SLOF, this assigns phandles to device tree nodes to make
device tree traversing work.

When x-vof=on, this adds "/chosen" every time QEMU (re)builds a tree.

This adds basic instances support which are managed by a hash map
ihandle -> [phandle].

Before the guest started, the used memory is:
0..e60 - the initial firmware
8000..10000 - stack
400000.. - kernel
3ea0000.. - initramdisk

This OF CI does not implement "interpret".

Unlike SLOF, this does not format uninitialized nvram. Instead, this
includes a disk image with pre-formatted nvram.

With this basic support, this can only boot into kernel directly.
However this is just enough for the petitboot kernel and initradmdisk to
boot from any possible source. Note this requires reasonably recent guest
kernel with:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df5be5be8735

The immediate benefit is much faster booting time which especially
crucial with fully emulated early CPU bring up environments. Also this
may come handy when/if GRUB-in-the-userspace sees light of the day.

This separates VOF and sPAPR in a hope that VOF bits may be reused by
other POWERPC boards which do not support pSeries.

This is coded in assumption that later on we might be adding support for
booting from QEMU backends (blockdev is the first candidate) without
devices/drivers in between as OF1275 does not require that and
it is quite easy to so.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---

The example command line is:

/home/aik/pbuild/qemu-killslof-localhost-ppc64/qemu-system-ppc64 \
-nodefaults \
-chardev stdio,id=STDIO0,signal=off,mux=on \
-device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
-mon id=MON0,chardev=STDIO0,mode=readline \
-nographic \
-vga none \
-enable-kvm \
-m 8G \
-machine pseries,x-vof=on,cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off \
-kernel pbuild/kernel-le-guest/vmlinux \
-initrd pb/rootfs.cpio.xz \
-drive id=DRIVE0,if=none,file=./p/qemu-killslof/pc-bios/vof-nvram.bin,format=raw \
-global spapr-nvram.drive=DRIVE0 \
-snapshot \
-smp 8,threads=8 \
-L /home/aik/t/qemu-ppc64-bios/ \
-trace events=qemu_trace_events \
-d guest_errors \
-chardev socket,id=SOCKET0,server,nowait,path=qemu.mon.tmux26 \
-mon chardev=SOCKET0,mode=control

---
Changes:
v20:
* compile vof.bin with -mcpu=power4 for better compatibility
* s/std/stw/ in entry.S to make it work on ppc32
* fixed dt_available property to support both 32 and 64bit
* shuffled prom_args handling code
* do not enforce 32bit in MSR (again, to support 32bit platforms)

v19:
* put bootargs in the FDT
* moved setting properties to a VOF machine hook
* moved fw_size and claim for it to vof_init()
* added CROSS to the VOF's makefile
* simplified phandles assigning
* pass MachineState to all machine hooks instead of calling
qdev_get_machine (following QOM)
* bunch of smaller changes and added comments
* added simple test to attempt to start with x-vof=on

v18:
* fixed top addr (max address for "claim") on radix - it equals to ram_size
and vof->top_addr was uint32_t
* fixed "available" property which got broken in v14 but it is only visible
to clients which care (== grub)
* reshuffled vof_dt_memory_available() calls, added vof_init() to allow
vof_claim() before rendering the FDT

v17:
* mv hw/ppc/vof.h include/hw/ppc/vof.h
* VofMachineIfClass -> VofMachineClass; it is not VofMachineInterface as
nobody used this scheme, usually "Interface" is dropped, a couple of times
it is "xxxInterfaceClass" or "xxxIfClass", as used the latter as it is
used by include/hw/vmstate-if.h
* added SPDX
* other fixes from v16 review

v16:
* rebased on dwg/ppc-for-6.1
* s/SpaprVofInterface/VofMachineInterface/

v15:
* bugfix: claimed memory for the VOF itself
* ditched OF_STACK_ADDR and allocate one instead, now it starts from 0x8000
because it is aligned to its size (no particular reason though)
* coding style
* moved nvram.bin up one level
* ditched bool in the firmware
* made debugging code conditional using trace_event_get_state() + qemu_loglevel_mask()
* renamed the CAS interface to SpaprVofInterface
* added "write" which for now dumps the message and ihandle via
trace point for early debug assistance
* commented on when we allocate of_instances in vof_build_dt()
* store fw_size is SpaprMachine to let spapr_vof_reset() claim it
* many small fixes from v14's review

v14:
* check for truncates in readstr()
* ditched a separate vof_reset()
* spapr->vof is a pointer now, dropped the "on" field
* removed rtas_base from vof and updated comment why we allow setting it
* added myself to maintainers
* updated commit log about blockdev and other possible platforms
* added a note why new hcall is 0x5
* no in place endianness convertion in spapr_h_vof_client
* converted all cpu_physical_memory_read/write to address_space_rw
* git mv hw/ppc/spapr_vof_client.c hw/ppc/spapr_vof.c

v13:
* rebase on latest ppc-for-6.0
* shuffled code around to touch spapr.c less

v12:
* split VOF and SPAPR

v11:
* added g_autofree
* fixed gcc warnings
* fixed few leaks
* added nvram image to make "nvram --print-config" not crash;
Note that contrary to  MIN_NVRAM_SIZE (8 * KiB), the actual minimum size
is 16K, or it just does not work (empty output from "nvram")

v10:
* now rebased to compile with meson

v9:
* remove special handling of /rtas/rtas-size as now we always add it in QEMU
* removed leftovers from scsi/grub/stdout/stdin/...

v8:
* no read/write/seek
* no @dev in instances
* the machine flag is "x-vof" for now

v7:
* now we have a small firmware which loads at 0 as SLOF and starts from
0x100 as SLOF
* no MBR/ELF/GRUB business in QEMU anymore
* blockdev is a separate patch
* networking is a separate patch

v6:
* borrowed a big chunk of commit log introduction from David
* fixed initial stack pointer (points to the highest address of stack)
* traces for "interpret" and others
* disabled  translate_kernel_address() hack so grub can load (work in
progress)
* added "milliseconds" for grub
* fixed "claim" allocator again
* moved FDT_MAX_SIZE to spapr.h as spapr_of_client.c wants it too for CAS
* moved the most code possible from spapr.c to spapr_of_client.c, such as
RTAS, prom entry and FDT build/finalize
* separated blobs
* GRUB now proceeds to its console prompt (there are still other issues)
* parse MBR/GPT to find PReP and load GRUB

v5:
* made instances keep device and chardev pointers
* removed VIO dependencies
* print error if RTAS memory is not claimed as it should have been
* pack FDT as "quiesce"

v4:
* fixed open
* validate ihandles in "call-method"

v3:
* fixed phandles allocation
* s/__be32/uint32_t/ as we do not normally have __be32 type in qemu
* fixed size of /chosen/stdout
* bunch of renames
* do not create rtas properties at all, let the client deal with it;
instead setprop allows changing these in the FDT
* no more packing FDT when bios=off - nobody needs it and getprop does not
work otherwise
* allow updating initramdisk device tree properties (for zImage)
* added instances
* fixed stdout on OF's "write"
* removed special handling for stdout in OF client, spapr-vty handles it
instead

v2:
* fixed claim()
* added "setprop"
* cleaner client interface and RTAS blobs management
* boots to petitboot and further to the target system
* more trace points

v20

v20!
---
 pc-bios/vof/Makefile                      |   23 +
 include/hw/ppc/spapr.h                    |   19 +-
 include/hw/ppc/vof.h                      |   42 +
 pc-bios/vof/vof.h                         |   43 +
 hw/ppc/spapr.c                            |   68 +-
 hw/ppc/spapr_hcall.c                      |   21 +-
 hw/ppc/spapr_vof.c                        |  156 ++++
 hw/ppc/vof.c                              | 1021 +++++++++++++++++++++
 pc-bios/vof/bootmem.c                     |   14 +
 pc-bios/vof/ci.c                          |   91 ++
 pc-bios/vof/libc.c                        |   92 ++
 pc-bios/vof/main.c                        |   21 +
 tests/qtest/rtas-test.c                   |   15 +-
 MAINTAINERS                               |   12 +
 default-configs/devices/ppc64-softmmu.mak |    1 +
 hw/ppc/Kconfig                            |    3 +
 hw/ppc/meson.build                        |    3 +
 hw/ppc/trace-events                       |   24 +
 pc-bios/README                            |    2 +
 pc-bios/vof-nvram.bin                     |  Bin 0 -> 16384 bytes
 pc-bios/vof.bin                           |  Bin 0 -> 3784 bytes
 pc-bios/vof/entry.S                       |   51 +
 pc-bios/vof/l.lds                         |   48 +
 23 files changed, 1759 insertions(+), 11 deletions(-)
 create mode 100644 pc-bios/vof/Makefile
 create mode 100644 include/hw/ppc/vof.h
 create mode 100644 pc-bios/vof/vof.h
 create mode 100644 hw/ppc/spapr_vof.c
 create mode 100644 hw/ppc/vof.c
 create mode 100644 pc-bios/vof/bootmem.c
 create mode 100644 pc-bios/vof/ci.c
 create mode 100644 pc-bios/vof/libc.c
 create mode 100644 pc-bios/vof/main.c
 create mode 100644 pc-bios/vof-nvram.bin
 create mode 100755 pc-bios/vof.bin
 create mode 100644 pc-bios/vof/entry.S
 create mode 100644 pc-bios/vof/l.lds

Comments

BALATON Zoltan May 20, 2021, 9:59 p.m. UTC | #1
On Thu, 20 May 2021, Alexey Kardashevskiy wrote:
> The PAPR platform describes an OS environment that's presented by
> a combination of a hypervisor and firmware. The features it specifies
> require collaboration between the firmware and the hypervisor.
> 
> Since the beginning, the runtime component of the firmware (RTAS) has
> been implemented as a 20 byte shim which simply forwards it to
> a hypercall implemented in qemu. The boot time firmware component is
> SLOF - but a build that's specific to qemu, and has always needed to be
> updated in sync with it. Even though we've managed to limit the amount
> of runtime communication we need between qemu and SLOF, there's some,
> and it has become increasingly awkward to handle as we've implemented
> new features.
> 
> This implements a boot time OF client interface (CI) which is
> enabled by a new "x-vof" pseries machine option (stands for "Virtual Open
> Firmware). When enabled, QEMU implements the custom H_OF_CLIENT hcall
> which implements Open Firmware Client Interface (OF CI). This allows
> using a smaller stateless firmware which does not have to manage
> the device tree.
> 
> The new "vof.bin" firmware image is included with source code under
> pc-bios/. It also includes RTAS blob.
> 
> This implements a handful of CI methods just to get -kernel/-initrd
> working. In particular, this implements the device tree fetching and
> simple memory allocator - "claim" (an OF CI memory allocator) and updates
> "/memory@0/available" to report the client about available memory.
> 
> This implements changing some device tree properties which we know how
> to deal with, the rest is ignored. To allow changes, this skips
> fdt_pack() when x-vof=on as not packing the blob leaves some room for
> appending.
> 
> In absence of SLOF, this assigns phandles to device tree nodes to make
> device tree traversing work.
> 
> When x-vof=on, this adds "/chosen" every time QEMU (re)builds a tree.
> 
> This adds basic instances support which are managed by a hash map
> ihandle -> [phandle].
> 
> Before the guest started, the used memory is:
> 0..e60 - the initial firmware
> 8000..10000 - stack
> 400000.. - kernel
> 3ea0000.. - initramdisk
> 
> This OF CI does not implement "interpret".
> 
> Unlike SLOF, this does not format uninitialized nvram. Instead, this
> includes a disk image with pre-formatted nvram.
> 
> With this basic support, this can only boot into kernel directly.
> However this is just enough for the petitboot kernel and initradmdisk to
> boot from any possible source. Note this requires reasonably recent guest
> kernel with:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df5be5be8735
> 
> The immediate benefit is much faster booting time which especially
> crucial with fully emulated early CPU bring up environments. Also this
> may come handy when/if GRUB-in-the-userspace sees light of the day.
> 
> This separates VOF and sPAPR in a hope that VOF bits may be reused by
> other POWERPC boards which do not support pSeries.
> 
> This is coded in assumption that later on we might be adding support for
> booting from QEMU backends (blockdev is the first candidate) without
> devices/drivers in between as OF1275 does not require that and
> it is quite easy to so.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> 
> The example command line is:
> 
> /home/aik/pbuild/qemu-killslof-localhost-ppc64/qemu-system-ppc64 \
> -nodefaults \
> -chardev stdio,id=STDIO0,signal=off,mux=on \
> -device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
> -mon id=MON0,chardev=STDIO0,mode=readline \
> -nographic \
> -vga none \
> -enable-kvm \
> -m 8G \
> -machine pseries,x-vof=on,cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off \
> -kernel pbuild/kernel-le-guest/vmlinux \
> -initrd pb/rootfs.cpio.xz \
> -drive id=DRIVE0,if=none,file=./p/qemu-killslof/pc-bios/vof-nvram.bin,format=raw \
> -global spapr-nvram.drive=DRIVE0 \
> -snapshot \
> -smp 8,threads=8 \
> -L /home/aik/t/qemu-ppc64-bios/ \
> -trace events=qemu_trace_events \
> -d guest_errors \
> -chardev socket,id=SOCKET0,server,nowait,path=qemu.mon.tmux26 \
> -mon chardev=SOCKET0,mode=control
> 
> ---
> Changes:
> v20:
> * compile vof.bin with -mcpu=power4 for better compatibility
> * s/std/stw/ in entry.S to make it work on ppc32
> * fixed dt_available property to support both 32 and 64bit
> * shuffled prom_args handling code
> * do not enforce 32bit in MSR (again, to support 32bit platforms)
>

[...]

> diff --git a/default-configs/devices/ppc64-softmmu.mak b/default-configs/devices/ppc64-softmmu.mak
> index ae0841fa3a18..9fb201dfacfa 100644
> --- a/default-configs/devices/ppc64-softmmu.mak
> +++ b/default-configs/devices/ppc64-softmmu.mak
> @@ -9,3 +9,4 @@ CONFIG_POWERNV=y
>  # For pSeries
>  CONFIG_PSERIES=y
>  CONFIG_NVDIMM=y
> +CONFIG_VOF=y
> diff --git a/hw/ppc/Kconfig b/hw/ppc/Kconfig
> index e51e0e5e5ac6..964510dfc73d 100644
> --- a/hw/ppc/Kconfig
> +++ b/hw/ppc/Kconfig
> @@ -143,3 +143,6 @@ config FW_CFG_PPC
>
>  config FDT_PPC
>      bool
> +
> +config VOF
> +    bool

I think you should just add "select VOF" to config PSERIES section in 
Kconfig instead of adding it to default-configs/devices/ppc64-softmmu.mak. 
That should do it, it works in my updated pegasos2 patch:

https://osdn.net/projects/qmiga/scm/git/qemu/commits/3c1fad08469b4d3c04def22044e52b2d27774a61

[...]
> diff --git a/pc-bios/vof/entry.S b/pc-bios/vof/entry.S
> new file mode 100644
> index 000000000000..569688714c91
> --- /dev/null
> +++ b/pc-bios/vof/entry.S
> @@ -0,0 +1,51 @@
> +#define LOAD32(rn, name)    \
> +	lis     rn,name##@h;    \
> +	ori     rn,rn,name##@l
> +
> +#define ENTRY(func_name)    \
> +	.text;                  \
> +	.align  2;              \
> +	.globl  .func_name;     \
> +	.func_name:             \
> +	.globl  func_name;      \
> +	func_name:
> +
> +#define KVMPPC_HCALL_BASE       0xf000
> +#define KVMPPC_H_RTAS           (KVMPPC_HCALL_BASE + 0x0)
> +#define KVMPPC_H_VOF_CLIENT     (KVMPPC_HCALL_BASE + 0x5)
> +
> +	. = 0x100 /* Do exactly as SLOF does */
> +
> +ENTRY(_start)
> +#	LOAD32(%r31, 0) /* Go 32bit mode */
> +#	mtmsrd %r31,0
> +	LOAD32(2, __toc_start)
> +	b entry_c
> +
> +ENTRY(_prom_entry)
> +	LOAD32(2, __toc_start)
> +	stwu    %r1,-112(%r1)
> +	stw     %r31,104(%r1)
> +	mflr    %r31
> +	bl prom_entry
> +	nop
> +	mtlr    %r31
> +	ld      %r31,104(%r1)

It's getting there, now I see the first client call from the guest boot 
code but then it crashes on this ld opcode which apparently is 64 bit 
only:

IN:
0x00c00214:  9421ffd0  stwu     r1, -0x30(r1)
0x00c00218:  7c691b78  mr       r9, r3
0x00c0021c:  7c0802a6  mflr     r0
0x00c00220:  7d2903a6  mtctr    r9
0x00c00224:  3d2000d9  lis      r9, 0xd9
0x00c00228:  39400001  li       r10, 1
0x00c0022c:  3929fc58  addi     r9, r9, 0xfc58
0x00c00230:  90010034  stw      r0, 0x34(r1)
0x00c00234:  38610008  addi     r3, r1, 8
0x00c00238:  39000000  li       r8, 0
0x00c0023c:  90810014  stw      r4, 0x14(r1)
0x00c00240:  91210008  stw      r9, 8(r1)
0x00c00244:  9141000c  stw      r10, 0xc(r1)
0x00c00248:  91410010  stw      r10, 0x10(r1)
0x00c0024c:  91010018  stw      r8, 0x18(r1)
0x00c00250:  4e800421  bctrl

----------------
IN:
0x0000010c:  3c400000  lis      r2, 0
0x00000110:  60428e00  ori      r2, r2, 0x8e00
0x00000114:  9421ff90  stwu     r1, -0x70(r1)
0x00000118:  93e10068  stw      r31, 0x68(r1)
0x0000011c:  7fe802a6  mflr     r31
0x00000120:  4800028d  bl       0x3ac
[...]

IN:
0x000003e4:  7c691b78  mr       r9, r3
0x000003e8:  2c090000  cmpwi    r9, 0
0x000003ec:  4182000c  beq      0x3f8

----------------
IN:
0x000003f0:  807f0008  lwz      r3, 8(r31)
0x000003f4:  4bfffd45  bl       0x138

Raise exception at 00000144 => 00000008 (01)
hypercall r3=000000000000f005 r4=000000000000fae8 r5=000000000000010c r6=0000000000000005 r7=0000000000000e80 r8=0000000000000000 r9=00000000ffffffff r10=0000000000000063 r11=000000000000fa50 r12=0000000000000040 nip=00000144
vof_finddevice "/" => ph=0x1
----------------
IN:
0x000003f8:  60000000  nop
0x000003fc:  397f0020  addi     r11, r31, 0x20
0x00000400:  800b0004  lwz      r0, 4(r11)
0x00000404:  7c0803a6  mtlr     r0
0x00000408:  83cbfff8  lwz      r30, -8(r11)
0x0000040c:  83ebfffc  lwz      r31, -4(r11)
0x00000410:  7d615b78  mr       r1, r11
0x00000414:  4e800020  blr

invalid/unsupported opcode: 3a - 14 - 01 - 01 (ebe10068) 0000012c
----------------
IN:
0x00000124:  60000000  nop
0x00000128:  7fe803a6  mtlr     r31
0x0000012c:  ebe10068  ld       r31, 0x68(r1)

Raise exception at 0000012c => 00000060 (21)
invalid/unsupported opcode: 00 - 00 - 00 - 00 (00000000) fff00700
----------------
IN:
0xfff00700:  00000000  .byte    0x00, 0x00, 0x00, 0x00

Hopefully this is the last such opcode left before I can really test this.

Do you have some info on how the stdout works in VOF? I think I'll need 
that to test with Linux and get output but I'm not sure what's needed on 
the machine side.

Regards,
BALATON Zoltan

> +	addi    %r1,%r1,112
> +	blr
> +
> +ENTRY(ci_entry)
> +	mr	4,3
> +	LOAD32(3,KVMPPC_H_VOF_CLIENT)
> +	sc	1
> +	blr
> +
> +/* This is the actual RTAS blob copied to the OS at instantiate-rtas */
> +ENTRY(hv_rtas)
> +	mr      %r4,%r3
> +	LOAD32(3,KVMPPC_H_RTAS)
> +	sc	1
> +	blr
> +	.globl hv_rtas_size
> +hv_rtas_size:
> +	.long . - hv_rtas;
> diff --git a/pc-bios/vof/l.lds b/pc-bios/vof/l.lds
> new file mode 100644
> index 000000000000..10b557a81f78
> --- /dev/null
> +++ b/pc-bios/vof/l.lds
> @@ -0,0 +1,48 @@
> +OUTPUT_FORMAT("elf32-powerpc", "elf32-powerpc", "elf32-powerpc")
> +OUTPUT_ARCH(powerpc:common)
> +
> +/* set the entry point */
> +ENTRY ( __start )
> +
> +SECTIONS {
> +	__executable_start = .;
> +
> +	.text : {
> +		*(.text)
> +	}
> +
> +	__etext = .;
> +
> +	. = ALIGN(8);
> +
> +	.data : {
> +		*(.data)
> +		*(.rodata .rodata.*)
> +		*(.got1)
> +		*(.sdata)
> +		*(.opd)
> +	}
> +
> +	/* FIXME bss at end ??? */
> +
> +	. = ALIGN(8);
> +	__bss_start = .;
> +	.bss : {
> +		*(.sbss) *(.scommon)
> +		*(.dynbss)
> +		*(.bss)
> +	}
> +
> +	. = ALIGN(8);
> +	__bss_end = .;
> +	__bss_size = (__bss_end - __bss_start);
> +
> +	. = ALIGN(256);
> +	__toc_start = DEFINED (.TOC.) ? .TOC. : ADDR (.got) + 0x8000;
> +	.got :
> +	{
> +		 *(.toc .got)
> +	}
> +	. = ALIGN(8);
> +	__toc_end = .;
> +}
> -- 
> 2.30.2
> 
>
Alexey Kardashevskiy May 21, 2021, 12:25 a.m. UTC | #2
On 21/05/2021 07:59, BALATON Zoltan wrote:
> On Thu, 20 May 2021, Alexey Kardashevskiy wrote:
>> The PAPR platform describes an OS environment that's presented by
>> a combination of a hypervisor and firmware. The features it specifies
>> require collaboration between the firmware and the hypervisor.
>>
>> Since the beginning, the runtime component of the firmware (RTAS) has
>> been implemented as a 20 byte shim which simply forwards it to
>> a hypercall implemented in qemu. The boot time firmware component is
>> SLOF - but a build that's specific to qemu, and has always needed to be
>> updated in sync with it. Even though we've managed to limit the amount
>> of runtime communication we need between qemu and SLOF, there's some,
>> and it has become increasingly awkward to handle as we've implemented
>> new features.
>>
>> This implements a boot time OF client interface (CI) which is
>> enabled by a new "x-vof" pseries machine option (stands for "Virtual Open
>> Firmware). When enabled, QEMU implements the custom H_OF_CLIENT hcall
>> which implements Open Firmware Client Interface (OF CI). This allows
>> using a smaller stateless firmware which does not have to manage
>> the device tree.
>>
>> The new "vof.bin" firmware image is included with source code under
>> pc-bios/. It also includes RTAS blob.
>>
>> This implements a handful of CI methods just to get -kernel/-initrd
>> working. In particular, this implements the device tree fetching and
>> simple memory allocator - "claim" (an OF CI memory allocator) and updates
>> "/memory@0/available" to report the client about available memory.
>>
>> This implements changing some device tree properties which we know how
>> to deal with, the rest is ignored. To allow changes, this skips
>> fdt_pack() when x-vof=on as not packing the blob leaves some room for
>> appending.
>>
>> In absence of SLOF, this assigns phandles to device tree nodes to make
>> device tree traversing work.
>>
>> When x-vof=on, this adds "/chosen" every time QEMU (re)builds a tree.
>>
>> This adds basic instances support which are managed by a hash map
>> ihandle -> [phandle].
>>
>> Before the guest started, the used memory is:
>> 0..e60 - the initial firmware
>> 8000..10000 - stack
>> 400000.. - kernel
>> 3ea0000.. - initramdisk
>>
>> This OF CI does not implement "interpret".
>>
>> Unlike SLOF, this does not format uninitialized nvram. Instead, this
>> includes a disk image with pre-formatted nvram.
>>
>> With this basic support, this can only boot into kernel directly.
>> However this is just enough for the petitboot kernel and initradmdisk to
>> boot from any possible source. Note this requires reasonably recent guest
>> kernel with:
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df5be5be8735 
>>
>>
>> The immediate benefit is much faster booting time which especially
>> crucial with fully emulated early CPU bring up environments. Also this
>> may come handy when/if GRUB-in-the-userspace sees light of the day.
>>
>> This separates VOF and sPAPR in a hope that VOF bits may be reused by
>> other POWERPC boards which do not support pSeries.
>>
>> This is coded in assumption that later on we might be adding support for
>> booting from QEMU backends (blockdev is the first candidate) without
>> devices/drivers in between as OF1275 does not require that and
>> it is quite easy to so.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>
>> The example command line is:
>>
>> /home/aik/pbuild/qemu-killslof-localhost-ppc64/qemu-system-ppc64 \
>> -nodefaults \
>> -chardev stdio,id=STDIO0,signal=off,mux=on \
>> -device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
>> -mon id=MON0,chardev=STDIO0,mode=readline \
>> -nographic \
>> -vga none \
>> -enable-kvm \
>> -m 8G \
>> -machine 
>> pseries,x-vof=on,cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off 
>> \
>> -kernel pbuild/kernel-le-guest/vmlinux \
>> -initrd pb/rootfs.cpio.xz \
>> -drive 
>> id=DRIVE0,if=none,file=./p/qemu-killslof/pc-bios/vof-nvram.bin,format=raw 
>> \
>> -global spapr-nvram.drive=DRIVE0 \
>> -snapshot \
>> -smp 8,threads=8 \
>> -L /home/aik/t/qemu-ppc64-bios/ \
>> -trace events=qemu_trace_events \
>> -d guest_errors \
>> -chardev socket,id=SOCKET0,server,nowait,path=qemu.mon.tmux26 \
>> -mon chardev=SOCKET0,mode=control
>>
>> ---
>> Changes:
>> v20:
>> * compile vof.bin with -mcpu=power4 for better compatibility
>> * s/std/stw/ in entry.S to make it work on ppc32
>> * fixed dt_available property to support both 32 and 64bit
>> * shuffled prom_args handling code
>> * do not enforce 32bit in MSR (again, to support 32bit platforms)
>>
> 
> [...]
> 
>> diff --git a/default-configs/devices/ppc64-softmmu.mak 
>> b/default-configs/devices/ppc64-softmmu.mak
>> index ae0841fa3a18..9fb201dfacfa 100644
>> --- a/default-configs/devices/ppc64-softmmu.mak
>> +++ b/default-configs/devices/ppc64-softmmu.mak
>> @@ -9,3 +9,4 @@ CONFIG_POWERNV=y
>>  # For pSeries
>>  CONFIG_PSERIES=y
>>  CONFIG_NVDIMM=y
>> +CONFIG_VOF=y
>> diff --git a/hw/ppc/Kconfig b/hw/ppc/Kconfig
>> index e51e0e5e5ac6..964510dfc73d 100644
>> --- a/hw/ppc/Kconfig
>> +++ b/hw/ppc/Kconfig
>> @@ -143,3 +143,6 @@ config FW_CFG_PPC
>>
>>  config FDT_PPC
>>      bool
>> +
>> +config VOF
>> +    bool
> 
> I think you should just add "select VOF" to config PSERIES section in 
> Kconfig instead of adding it to 
> default-configs/devices/ppc64-softmmu.mak. 

oh well, can do that too.

>  That should do it, it works 
> in my updated pegasos2 patch:
> 
> https://osdn.net/projects/qmiga/scm/git/qemu/commits/3c1fad08469b4d3c04def22044e52b2d27774a61 
> 
> 
> [...]
>> diff --git a/pc-bios/vof/entry.S b/pc-bios/vof/entry.S
>> new file mode 100644
>> index 000000000000..569688714c91
>> --- /dev/null
>> +++ b/pc-bios/vof/entry.S
>> @@ -0,0 +1,51 @@
>> +#define LOAD32(rn, name)    \
>> +    lis     rn,name##@h;    \
>> +    ori     rn,rn,name##@l
>> +
>> +#define ENTRY(func_name)    \
>> +    .text;                  \
>> +    .align  2;              \
>> +    .globl  .func_name;     \
>> +    .func_name:             \
>> +    .globl  func_name;      \
>> +    func_name:
>> +
>> +#define KVMPPC_HCALL_BASE       0xf000
>> +#define KVMPPC_H_RTAS           (KVMPPC_HCALL_BASE + 0x0)
>> +#define KVMPPC_H_VOF_CLIENT     (KVMPPC_HCALL_BASE + 0x5)
>> +
>> +    . = 0x100 /* Do exactly as SLOF does */
>> +
>> +ENTRY(_start)
>> +#    LOAD32(%r31, 0) /* Go 32bit mode */
>> +#    mtmsrd %r31,0
>> +    LOAD32(2, __toc_start)
>> +    b entry_c
>> +
>> +ENTRY(_prom_entry)
>> +    LOAD32(2, __toc_start)
>> +    stwu    %r1,-112(%r1)
>> +    stw     %r31,104(%r1)
>> +    mflr    %r31
>> +    bl prom_entry
>> +    nop
>> +    mtlr    %r31
>> +    ld      %r31,104(%r1)
> 
> It's getting there, now I see the first client call from the guest boot 
> code but then it crashes on this ld opcode which apparently is 64 bit only:

Oh right.


> Hopefully this is the last such opcode left before I can really test this.

Make it lwz, and test it?

> Do you have some info on how the stdout works in VOF? I think I'll need 
> that to test with Linux and get output but I'm not sure what's needed on 
> the machine side.

VOF opens stsout and stores the ihandle (in fdt) which the client 
(==kernel) uses for writing. To make it work properly, you need to hook 
up that instance to a device backend similar to what I have for spapr-vty:

https://github.com/aik/qemu/commit/a381a5b50c23c74013e2bd39cc5dad5b6385965d

This is not a part of this patch as I'm trying to keep things simpler 
and accessing backends from VOF is still unsettled. But there is a 
workaround which  is trace_vof_write, I use this. Thanks,
BALATON Zoltan May 21, 2021, 9:05 a.m. UTC | #3
On Fri, 21 May 2021, Alexey Kardashevskiy wrote:
> On 21/05/2021 07:59, BALATON Zoltan wrote:
>> On Thu, 20 May 2021, Alexey Kardashevskiy wrote:
>>> The PAPR platform describes an OS environment that's presented by
>>> a combination of a hypervisor and firmware. The features it specifies
>>> require collaboration between the firmware and the hypervisor.
>>> 
>>> Since the beginning, the runtime component of the firmware (RTAS) has
>>> been implemented as a 20 byte shim which simply forwards it to
>>> a hypercall implemented in qemu. The boot time firmware component is
>>> SLOF - but a build that's specific to qemu, and has always needed to be
>>> updated in sync with it. Even though we've managed to limit the amount
>>> of runtime communication we need between qemu and SLOF, there's some,
>>> and it has become increasingly awkward to handle as we've implemented
>>> new features.
>>> 
>>> This implements a boot time OF client interface (CI) which is
>>> enabled by a new "x-vof" pseries machine option (stands for "Virtual Open
>>> Firmware). When enabled, QEMU implements the custom H_OF_CLIENT hcall
>>> which implements Open Firmware Client Interface (OF CI). This allows
>>> using a smaller stateless firmware which does not have to manage
>>> the device tree.
>>> 
>>> The new "vof.bin" firmware image is included with source code under
>>> pc-bios/. It also includes RTAS blob.
>>> 
>>> This implements a handful of CI methods just to get -kernel/-initrd
>>> working. In particular, this implements the device tree fetching and
>>> simple memory allocator - "claim" (an OF CI memory allocator) and updates
>>> "/memory@0/available" to report the client about available memory.
>>> 
>>> This implements changing some device tree properties which we know how
>>> to deal with, the rest is ignored. To allow changes, this skips
>>> fdt_pack() when x-vof=on as not packing the blob leaves some room for
>>> appending.
>>> 
>>> In absence of SLOF, this assigns phandles to device tree nodes to make
>>> device tree traversing work.
>>> 
>>> When x-vof=on, this adds "/chosen" every time QEMU (re)builds a tree.
>>> 
>>> This adds basic instances support which are managed by a hash map
>>> ihandle -> [phandle].
>>> 
>>> Before the guest started, the used memory is:
>>> 0..e60 - the initial firmware
>>> 8000..10000 - stack
>>> 400000.. - kernel
>>> 3ea0000.. - initramdisk
>>> 
>>> This OF CI does not implement "interpret".
>>> 
>>> Unlike SLOF, this does not format uninitialized nvram. Instead, this
>>> includes a disk image with pre-formatted nvram.
>>> 
>>> With this basic support, this can only boot into kernel directly.
>>> However this is just enough for the petitboot kernel and initradmdisk to
>>> boot from any possible source. Note this requires reasonably recent guest
>>> kernel with:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df5be5be8735 
>>> 
>>> The immediate benefit is much faster booting time which especially
>>> crucial with fully emulated early CPU bring up environments. Also this
>>> may come handy when/if GRUB-in-the-userspace sees light of the day.
>>> 
>>> This separates VOF and sPAPR in a hope that VOF bits may be reused by
>>> other POWERPC boards which do not support pSeries.
>>> 
>>> This is coded in assumption that later on we might be adding support for
>>> booting from QEMU backends (blockdev is the first candidate) without
>>> devices/drivers in between as OF1275 does not require that and
>>> it is quite easy to so.
>>> 
>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>> ---
>>> 
>>> The example command line is:
>>> 
>>> /home/aik/pbuild/qemu-killslof-localhost-ppc64/qemu-system-ppc64 \
>>> -nodefaults \
>>> -chardev stdio,id=STDIO0,signal=off,mux=on \
>>> -device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
>>> -mon id=MON0,chardev=STDIO0,mode=readline \
>>> -nographic \
>>> -vga none \
>>> -enable-kvm \
>>> -m 8G \
>>> -machine 
>>> pseries,x-vof=on,cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off 
>>> \
>>> -kernel pbuild/kernel-le-guest/vmlinux \
>>> -initrd pb/rootfs.cpio.xz \
>>> -drive 
>>> id=DRIVE0,if=none,file=./p/qemu-killslof/pc-bios/vof-nvram.bin,format=raw 
>>> \
>>> -global spapr-nvram.drive=DRIVE0 \
>>> -snapshot \
>>> -smp 8,threads=8 \
>>> -L /home/aik/t/qemu-ppc64-bios/ \
>>> -trace events=qemu_trace_events \
>>> -d guest_errors \
>>> -chardev socket,id=SOCKET0,server,nowait,path=qemu.mon.tmux26 \
>>> -mon chardev=SOCKET0,mode=control
>>> 
>>> ---
>>> Changes:
>>> v20:
>>> * compile vof.bin with -mcpu=power4 for better compatibility
>>> * s/std/stw/ in entry.S to make it work on ppc32
>>> * fixed dt_available property to support both 32 and 64bit
>>> * shuffled prom_args handling code
>>> * do not enforce 32bit in MSR (again, to support 32bit platforms)
>>> 
>> 
>> [...]
>> 
>>> diff --git a/default-configs/devices/ppc64-softmmu.mak 
>>> b/default-configs/devices/ppc64-softmmu.mak
>>> index ae0841fa3a18..9fb201dfacfa 100644
>>> --- a/default-configs/devices/ppc64-softmmu.mak
>>> +++ b/default-configs/devices/ppc64-softmmu.mak
>>> @@ -9,3 +9,4 @@ CONFIG_POWERNV=y
>>>  # For pSeries
>>>  CONFIG_PSERIES=y
>>>  CONFIG_NVDIMM=y
>>> +CONFIG_VOF=y
>>> diff --git a/hw/ppc/Kconfig b/hw/ppc/Kconfig
>>> index e51e0e5e5ac6..964510dfc73d 100644
>>> --- a/hw/ppc/Kconfig
>>> +++ b/hw/ppc/Kconfig
>>> @@ -143,3 +143,6 @@ config FW_CFG_PPC
>>> 
>>>  config FDT_PPC
>>>      bool
>>> +
>>> +config VOF
>>> +    bool
>> 
>> I think you should just add "select VOF" to config PSERIES section in 
>> Kconfig instead of adding it to default-configs/devices/ppc64-softmmu.mak. 
>
> oh well, can do that too.

I think most config options should be selected by KConfig and the default 
config should only include machines, otherwise VOF would be added also 
when you don't compile PSERIES or PEGASOS2. With select in Kconfig it will 
be added when needed. That's why it's better to use select in this case.

>>  That should do it, it works in my updated pegasos2 patch:
>> 
>> https://osdn.net/projects/qmiga/scm/git/qemu/commits/3c1fad08469b4d3c04def22044e52b2d27774a61 
>> 
>> [...]
>>> diff --git a/pc-bios/vof/entry.S b/pc-bios/vof/entry.S
>>> new file mode 100644
>>> index 000000000000..569688714c91
>>> --- /dev/null
>>> +++ b/pc-bios/vof/entry.S
>>> @@ -0,0 +1,51 @@
>>> +#define LOAD32(rn, name)    \
>>> +    lis     rn,name##@h;    \
>>> +    ori     rn,rn,name##@l
>>> +
>>> +#define ENTRY(func_name)    \
>>> +    .text;                  \
>>> +    .align  2;              \
>>> +    .globl  .func_name;     \
>>> +    .func_name:             \
>>> +    .globl  func_name;      \
>>> +    func_name:
>>> +
>>> +#define KVMPPC_HCALL_BASE       0xf000
>>> +#define KVMPPC_H_RTAS           (KVMPPC_HCALL_BASE + 0x0)
>>> +#define KVMPPC_H_VOF_CLIENT     (KVMPPC_HCALL_BASE + 0x5)
>>> +
>>> +    . = 0x100 /* Do exactly as SLOF does */
>>> +
>>> +ENTRY(_start)
>>> +#    LOAD32(%r31, 0) /* Go 32bit mode */
>>> +#    mtmsrd %r31,0
>>> +    LOAD32(2, __toc_start)
>>> +    b entry_c
>>> +
>>> +ENTRY(_prom_entry)
>>> +    LOAD32(2, __toc_start)
>>> +    stwu    %r1,-112(%r1)
>>> +    stw     %r31,104(%r1)
>>> +    mflr    %r31
>>> +    bl prom_entry
>>> +    nop
>>> +    mtlr    %r31
>>> +    ld      %r31,104(%r1)
>> 
>> It's getting there, now I see the first client call from the guest boot 
>> code but then it crashes on this ld opcode which apparently is 64 bit only:
>
> Oh right.
>
>
>> Hopefully this is the last such opcode left before I can really test this.
>
> Make it lwz, and test it?

Yes, figured that out too after sending this message. Replacing with lwz 
works but I wonder that now you have stwu lwz do the stack offsets need 
adjusting too or you just waste 4 bytes now? With lwz here I found no 
further 64 bit opcodes and the guest boot code could walk the device tree. 
It failed later but I think that's because I'll need to fill more info 
about the machine in the device tree. I'll experiment with that but it 
looks like it could work at least for MorphOS. I'll have to try Linux too.

>> Do you have some info on how the stdout works in VOF? I think I'll need 
>> that to test with Linux and get output but I'm not sure what's needed on 
>> the machine side.
>
> VOF opens stsout and stores the ihandle (in fdt) which the client (==kernel) 
> uses for writing. To make it work properly, you need to hook up that instance 
> to a device backend similar to what I have for spapr-vty:
>
> https://github.com/aik/qemu/commit/a381a5b50c23c74013e2bd39cc5dad5b6385965d
>
> This is not a part of this patch as I'm trying to keep things simpler and 
> accessing backends from VOF is still unsettled. But there is a workaround 
> which  is trace_vof_write, I use this. Thanks,

The above patch is about stdin but stdout seems to be added by the current 
vof patch. What is spapr-vty? I don't think I have something similar in 
pegasos2 where I just have a normal serial port created by ISASuperIO in 
the vt8231 model. Can I use that backend somehow or have to create some 
other serial device to connect to stdout? Does trace_vof_write work for 
stuff output by the guest? I guess that's only for things printed by VOF 
itself but to see Linux output do I need a stdout in VOF or it will just 
open the serial with its own driver and use that? So I'm not sure what's 
the stdout parts in the current vof patch does and if I need that for 
anything. I'll try to experiment with it some more but fixing the ld and 
Kconfig seems to be enough to get it work for me.

Regards,
BALATON Zoltan
BALATON Zoltan May 21, 2021, 7:57 p.m. UTC | #4
On Fri, 21 May 2021, BALATON Zoltan wrote:
> On Fri, 21 May 2021, Alexey Kardashevskiy wrote:
>> On 21/05/2021 07:59, BALATON Zoltan wrote:
>>> On Thu, 20 May 2021, Alexey Kardashevskiy wrote:
>>>> The PAPR platform describes an OS environment that's presented by
>>>> a combination of a hypervisor and firmware. The features it specifies
>>>> require collaboration between the firmware and the hypervisor.
>>>> 
>>>> Since the beginning, the runtime component of the firmware (RTAS) has
>>>> been implemented as a 20 byte shim which simply forwards it to
>>>> a hypercall implemented in qemu. The boot time firmware component is
>>>> SLOF - but a build that's specific to qemu, and has always needed to be
>>>> updated in sync with it. Even though we've managed to limit the amount
>>>> of runtime communication we need between qemu and SLOF, there's some,
>>>> and it has become increasingly awkward to handle as we've implemented
>>>> new features.
>>>> 
>>>> This implements a boot time OF client interface (CI) which is
>>>> enabled by a new "x-vof" pseries machine option (stands for "Virtual Open
>>>> Firmware). When enabled, QEMU implements the custom H_OF_CLIENT hcall
>>>> which implements Open Firmware Client Interface (OF CI). This allows
>>>> using a smaller stateless firmware which does not have to manage
>>>> the device tree.
>>>> 
>>>> The new "vof.bin" firmware image is included with source code under
>>>> pc-bios/. It also includes RTAS blob.
>>>> 
>>>> This implements a handful of CI methods just to get -kernel/-initrd
>>>> working. In particular, this implements the device tree fetching and
>>>> simple memory allocator - "claim" (an OF CI memory allocator) and updates
>>>> "/memory@0/available" to report the client about available memory.
>>>> 
>>>> This implements changing some device tree properties which we know how
>>>> to deal with, the rest is ignored. To allow changes, this skips
>>>> fdt_pack() when x-vof=on as not packing the blob leaves some room for
>>>> appending.
>>>> 
>>>> In absence of SLOF, this assigns phandles to device tree nodes to make
>>>> device tree traversing work.
>>>> 
>>>> When x-vof=on, this adds "/chosen" every time QEMU (re)builds a tree.
>>>> 
>>>> This adds basic instances support which are managed by a hash map
>>>> ihandle -> [phandle].
>>>> 
>>>> Before the guest started, the used memory is:
>>>> 0..e60 - the initial firmware
>>>> 8000..10000 - stack
>>>> 400000.. - kernel
>>>> 3ea0000.. - initramdisk
>>>> 
>>>> This OF CI does not implement "interpret".
>>>> 
>>>> Unlike SLOF, this does not format uninitialized nvram. Instead, this
>>>> includes a disk image with pre-formatted nvram.
>>>> 
>>>> With this basic support, this can only boot into kernel directly.
>>>> However this is just enough for the petitboot kernel and initradmdisk to
>>>> boot from any possible source. Note this requires reasonably recent guest
>>>> kernel with:
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df5be5be8735 
>>>> The immediate benefit is much faster booting time which especially
>>>> crucial with fully emulated early CPU bring up environments. Also this
>>>> may come handy when/if GRUB-in-the-userspace sees light of the day.
>>>> 
>>>> This separates VOF and sPAPR in a hope that VOF bits may be reused by
>>>> other POWERPC boards which do not support pSeries.
>>>> 
>>>> This is coded in assumption that later on we might be adding support for
>>>> booting from QEMU backends (blockdev is the first candidate) without
>>>> devices/drivers in between as OF1275 does not require that and
>>>> it is quite easy to so.
>>>> 
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>> 
>>>> The example command line is:
>>>> 
>>>> /home/aik/pbuild/qemu-killslof-localhost-ppc64/qemu-system-ppc64 \
>>>> -nodefaults \
>>>> -chardev stdio,id=STDIO0,signal=off,mux=on \
>>>> -device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
>>>> -mon id=MON0,chardev=STDIO0,mode=readline \
>>>> -nographic \
>>>> -vga none \
>>>> -enable-kvm \
>>>> -m 8G \
>>>> -machine 
>>>> pseries,x-vof=on,cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off 
>>>> \
>>>> -kernel pbuild/kernel-le-guest/vmlinux \
>>>> -initrd pb/rootfs.cpio.xz \
>>>> -drive 
>>>> id=DRIVE0,if=none,file=./p/qemu-killslof/pc-bios/vof-nvram.bin,format=raw 
>>>> \
>>>> -global spapr-nvram.drive=DRIVE0 \
>>>> -snapshot \
>>>> -smp 8,threads=8 \
>>>> -L /home/aik/t/qemu-ppc64-bios/ \
>>>> -trace events=qemu_trace_events \
>>>> -d guest_errors \
>>>> -chardev socket,id=SOCKET0,server,nowait,path=qemu.mon.tmux26 \
>>>> -mon chardev=SOCKET0,mode=control
>>>> 
>>>> ---
>>>> Changes:
>>>> v20:
>>>> * compile vof.bin with -mcpu=power4 for better compatibility
>>>> * s/std/stw/ in entry.S to make it work on ppc32
>>>> * fixed dt_available property to support both 32 and 64bit
>>>> * shuffled prom_args handling code
>>>> * do not enforce 32bit in MSR (again, to support 32bit platforms)
>>>> 
>>> 
>>> [...]
>>> 
>>>> diff --git a/default-configs/devices/ppc64-softmmu.mak 
>>>> b/default-configs/devices/ppc64-softmmu.mak
>>>> index ae0841fa3a18..9fb201dfacfa 100644
>>>> --- a/default-configs/devices/ppc64-softmmu.mak
>>>> +++ b/default-configs/devices/ppc64-softmmu.mak
>>>> @@ -9,3 +9,4 @@ CONFIG_POWERNV=y
>>>>  # For pSeries
>>>>  CONFIG_PSERIES=y
>>>>  CONFIG_NVDIMM=y
>>>> +CONFIG_VOF=y
>>>> diff --git a/hw/ppc/Kconfig b/hw/ppc/Kconfig
>>>> index e51e0e5e5ac6..964510dfc73d 100644
>>>> --- a/hw/ppc/Kconfig
>>>> +++ b/hw/ppc/Kconfig
>>>> @@ -143,3 +143,6 @@ config FW_CFG_PPC
>>>> 
>>>>  config FDT_PPC
>>>>      bool
>>>> +
>>>> +config VOF
>>>> +    bool
>>> 
>>> I think you should just add "select VOF" to config PSERIES section in 
>>> Kconfig instead of adding it to default-configs/devices/ppc64-softmmu.mak. 
>> 
>> oh well, can do that too.
>
> I think most config options should be selected by KConfig and the default 
> config should only include machines, otherwise VOF would be added also when 
> you don't compile PSERIES or PEGASOS2. With select in Kconfig it will be 
> added when needed. That's why it's better to use select in this case.
>
>>>  That should do it, it works in my updated pegasos2 patch:
>>> 
>>> https://osdn.net/projects/qmiga/scm/git/qemu/commits/3c1fad08469b4d3c04def22044e52b2d27774a61 
>>> [...]
>>>> diff --git a/pc-bios/vof/entry.S b/pc-bios/vof/entry.S
>>>> new file mode 100644
>>>> index 000000000000..569688714c91
>>>> --- /dev/null
>>>> +++ b/pc-bios/vof/entry.S
>>>> @@ -0,0 +1,51 @@
>>>> +#define LOAD32(rn, name)    \
>>>> +    lis     rn,name##@h;    \
>>>> +    ori     rn,rn,name##@l
>>>> +
>>>> +#define ENTRY(func_name)    \
>>>> +    .text;                  \
>>>> +    .align  2;              \
>>>> +    .globl  .func_name;     \
>>>> +    .func_name:             \
>>>> +    .globl  func_name;      \
>>>> +    func_name:
>>>> +
>>>> +#define KVMPPC_HCALL_BASE       0xf000
>>>> +#define KVMPPC_H_RTAS           (KVMPPC_HCALL_BASE + 0x0)
>>>> +#define KVMPPC_H_VOF_CLIENT     (KVMPPC_HCALL_BASE + 0x5)
>>>> +
>>>> +    . = 0x100 /* Do exactly as SLOF does */
>>>> +
>>>> +ENTRY(_start)
>>>> +#    LOAD32(%r31, 0) /* Go 32bit mode */
>>>> +#    mtmsrd %r31,0
>>>> +    LOAD32(2, __toc_start)
>>>> +    b entry_c
>>>> +
>>>> +ENTRY(_prom_entry)
>>>> +    LOAD32(2, __toc_start)
>>>> +    stwu    %r1,-112(%r1)
>>>> +    stw     %r31,104(%r1)
>>>> +    mflr    %r31
>>>> +    bl prom_entry
>>>> +    nop
>>>> +    mtlr    %r31
>>>> +    ld      %r31,104(%r1)
>>> 
>>> It's getting there, now I see the first client call from the guest boot 
>>> code but then it crashes on this ld opcode which apparently is 64 bit 
>>> only:
>> 
>> Oh right.
>> 
>> 
>>> Hopefully this is the last such opcode left before I can really test this.
>> 
>> Make it lwz, and test it?
>
> Yes, figured that out too after sending this message. Replacing with lwz 
> works but I wonder that now you have stwu lwz do the stack offsets need 
> adjusting too or you just waste 4 bytes now? With lwz here I found no further 
> 64 bit opcodes and the guest boot code could walk the device tree. It failed 
> later but I think that's because I'll need to fill more info about the 
> machine in the device tree. I'll experiment with that but it looks like it 
> could work at least for MorphOS. I'll have to try Linux too.

I was trying to get a linux kernel from a debian powerpc iso to do 
something (debian before 10.0 has Pegasos support) but I've run into the 
problem that the kernel is loaded at 0x400000 but the start address is at 
some offset from that. How do I set qemu,boot-kernel in this case? Because 
when I set it to the address/size where the kernel is loaded it jumps to 
the beginnig not the correct start address. If I set the address to the 
start address then size will be wrong so I don't know how to set 
qemu,boot-kernel in this case or is there another property to tell the 
start address? (Vof does not seem to check any other property and seems to 
assume the entry point is the same as the load address but for this linux 
kernel it's not.)

Regards,
BALATON Zoltan
Alexey Kardashevskiy May 22, 2021, 6:22 a.m. UTC | #5
On 21/05/2021 19:05, BALATON Zoltan wrote:
> On Fri, 21 May 2021, Alexey Kardashevskiy wrote:
>> On 21/05/2021 07:59, BALATON Zoltan wrote:
>>> On Thu, 20 May 2021, Alexey Kardashevskiy wrote:
>>>> The PAPR platform describes an OS environment that's presented by
>>>> a combination of a hypervisor and firmware. The features it specifies
>>>> require collaboration between the firmware and the hypervisor.
>>>>
>>>> Since the beginning, the runtime component of the firmware (RTAS) has
>>>> been implemented as a 20 byte shim which simply forwards it to
>>>> a hypercall implemented in qemu. The boot time firmware component is
>>>> SLOF - but a build that's specific to qemu, and has always needed to be
>>>> updated in sync with it. Even though we've managed to limit the amount
>>>> of runtime communication we need between qemu and SLOF, there's some,
>>>> and it has become increasingly awkward to handle as we've implemented
>>>> new features.
>>>>
>>>> This implements a boot time OF client interface (CI) which is
>>>> enabled by a new "x-vof" pseries machine option (stands for "Virtual 
>>>> Open
>>>> Firmware). When enabled, QEMU implements the custom H_OF_CLIENT hcall
>>>> which implements Open Firmware Client Interface (OF CI). This allows
>>>> using a smaller stateless firmware which does not have to manage
>>>> the device tree.
>>>>
>>>> The new "vof.bin" firmware image is included with source code under
>>>> pc-bios/. It also includes RTAS blob.
>>>>
>>>> This implements a handful of CI methods just to get -kernel/-initrd
>>>> working. In particular, this implements the device tree fetching and
>>>> simple memory allocator - "claim" (an OF CI memory allocator) and 
>>>> updates
>>>> "/memory@0/available" to report the client about available memory.
>>>>
>>>> This implements changing some device tree properties which we know how
>>>> to deal with, the rest is ignored. To allow changes, this skips
>>>> fdt_pack() when x-vof=on as not packing the blob leaves some room for
>>>> appending.
>>>>
>>>> In absence of SLOF, this assigns phandles to device tree nodes to make
>>>> device tree traversing work.
>>>>
>>>> When x-vof=on, this adds "/chosen" every time QEMU (re)builds a tree.
>>>>
>>>> This adds basic instances support which are managed by a hash map
>>>> ihandle -> [phandle].
>>>>
>>>> Before the guest started, the used memory is:
>>>> 0..e60 - the initial firmware
>>>> 8000..10000 - stack
>>>> 400000.. - kernel
>>>> 3ea0000.. - initramdisk
>>>>
>>>> This OF CI does not implement "interpret".
>>>>
>>>> Unlike SLOF, this does not format uninitialized nvram. Instead, this
>>>> includes a disk image with pre-formatted nvram.
>>>>
>>>> With this basic support, this can only boot into kernel directly.
>>>> However this is just enough for the petitboot kernel and 
>>>> initradmdisk to
>>>> boot from any possible source. Note this requires reasonably recent 
>>>> guest
>>>> kernel with:
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df5be5be8735 
>>>>
>>>> The immediate benefit is much faster booting time which especially
>>>> crucial with fully emulated early CPU bring up environments. Also this
>>>> may come handy when/if GRUB-in-the-userspace sees light of the day.
>>>>
>>>> This separates VOF and sPAPR in a hope that VOF bits may be reused by
>>>> other POWERPC boards which do not support pSeries.
>>>>
>>>> This is coded in assumption that later on we might be adding support 
>>>> for
>>>> booting from QEMU backends (blockdev is the first candidate) without
>>>> devices/drivers in between as OF1275 does not require that and
>>>> it is quite easy to so.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>>
>>>> The example command line is:
>>>>
>>>> /home/aik/pbuild/qemu-killslof-localhost-ppc64/qemu-system-ppc64 \
>>>> -nodefaults \
>>>> -chardev stdio,id=STDIO0,signal=off,mux=on \
>>>> -device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
>>>> -mon id=MON0,chardev=STDIO0,mode=readline \
>>>> -nographic \
>>>> -vga none \
>>>> -enable-kvm \
>>>> -m 8G \
>>>> -machine 
>>>> pseries,x-vof=on,cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off 
>>>> \
>>>> -kernel pbuild/kernel-le-guest/vmlinux \
>>>> -initrd pb/rootfs.cpio.xz \
>>>> -drive 
>>>> id=DRIVE0,if=none,file=./p/qemu-killslof/pc-bios/vof-nvram.bin,format=raw 
>>>> \
>>>> -global spapr-nvram.drive=DRIVE0 \
>>>> -snapshot \
>>>> -smp 8,threads=8 \
>>>> -L /home/aik/t/qemu-ppc64-bios/ \
>>>> -trace events=qemu_trace_events \
>>>> -d guest_errors \
>>>> -chardev socket,id=SOCKET0,server,nowait,path=qemu.mon.tmux26 \
>>>> -mon chardev=SOCKET0,mode=control
>>>>
>>>> ---
>>>> Changes:
>>>> v20:
>>>> * compile vof.bin with -mcpu=power4 for better compatibility
>>>> * s/std/stw/ in entry.S to make it work on ppc32
>>>> * fixed dt_available property to support both 32 and 64bit
>>>> * shuffled prom_args handling code
>>>> * do not enforce 32bit in MSR (again, to support 32bit platforms)
>>>>
>>>
>>> [...]
>>>
>>>> diff --git a/default-configs/devices/ppc64-softmmu.mak 
>>>> b/default-configs/devices/ppc64-softmmu.mak
>>>> index ae0841fa3a18..9fb201dfacfa 100644
>>>> --- a/default-configs/devices/ppc64-softmmu.mak
>>>> +++ b/default-configs/devices/ppc64-softmmu.mak
>>>> @@ -9,3 +9,4 @@ CONFIG_POWERNV=y
>>>>  # For pSeries
>>>>  CONFIG_PSERIES=y
>>>>  CONFIG_NVDIMM=y
>>>> +CONFIG_VOF=y
>>>> diff --git a/hw/ppc/Kconfig b/hw/ppc/Kconfig
>>>> index e51e0e5e5ac6..964510dfc73d 100644
>>>> --- a/hw/ppc/Kconfig
>>>> +++ b/hw/ppc/Kconfig
>>>> @@ -143,3 +143,6 @@ config FW_CFG_PPC
>>>>
>>>>  config FDT_PPC
>>>>      bool
>>>> +
>>>> +config VOF
>>>> +    bool
>>>
>>> I think you should just add "select VOF" to config PSERIES section in 
>>> Kconfig instead of adding it to 
>>> default-configs/devices/ppc64-softmmu.mak. 
>>
>> oh well, can do that too.
> 
> I think most config options should be selected by KConfig and the 
> default config should only include machines, otherwise VOF would be 
> added also when you don't compile PSERIES or PEGASOS2. With select in 
> Kconfig it will be added when needed. That's why it's better to use 
> select in this case.
> 
>>>  That should do it, it works in my updated pegasos2 patch:
>>>
>>> https://osdn.net/projects/qmiga/scm/git/qemu/commits/3c1fad08469b4d3c04def22044e52b2d27774a61 
>>>
>>> [...]
>>>> diff --git a/pc-bios/vof/entry.S b/pc-bios/vof/entry.S
>>>> new file mode 100644
>>>> index 000000000000..569688714c91
>>>> --- /dev/null
>>>> +++ b/pc-bios/vof/entry.S
>>>> @@ -0,0 +1,51 @@
>>>> +#define LOAD32(rn, name)    \
>>>> +    lis     rn,name##@h;    \
>>>> +    ori     rn,rn,name##@l
>>>> +
>>>> +#define ENTRY(func_name)    \
>>>> +    .text;                  \
>>>> +    .align  2;              \
>>>> +    .globl  .func_name;     \
>>>> +    .func_name:             \
>>>> +    .globl  func_name;      \
>>>> +    func_name:
>>>> +
>>>> +#define KVMPPC_HCALL_BASE       0xf000
>>>> +#define KVMPPC_H_RTAS           (KVMPPC_HCALL_BASE + 0x0)
>>>> +#define KVMPPC_H_VOF_CLIENT     (KVMPPC_HCALL_BASE + 0x5)
>>>> +
>>>> +    . = 0x100 /* Do exactly as SLOF does */
>>>> +
>>>> +ENTRY(_start)
>>>> +#    LOAD32(%r31, 0) /* Go 32bit mode */
>>>> +#    mtmsrd %r31,0
>>>> +    LOAD32(2, __toc_start)
>>>> +    b entry_c
>>>> +
>>>> +ENTRY(_prom_entry)
>>>> +    LOAD32(2, __toc_start)
>>>> +    stwu    %r1,-112(%r1)
>>>> +    stw     %r31,104(%r1)
>>>> +    mflr    %r31
>>>> +    bl prom_entry
>>>> +    nop
>>>> +    mtlr    %r31
>>>> +    ld      %r31,104(%r1)
>>>
>>> It's getting there, now I see the first client call from the guest 
>>> boot code but then it crashes on this ld opcode which apparently is 
>>> 64 bit only:
>>
>> Oh right.
>>
>>
>>> Hopefully this is the last such opcode left before I can really test 
>>> this.
>>
>> Make it lwz, and test it?
> 
> Yes, figured that out too after sending this message. Replacing with lwz 
> works but I wonder that now you have stwu lwz do the stack offsets need 
> adjusting too or you just waste 4 bytes now?

Well, this assumes the 64bit client and that ABI. I think ideally the 
firmware is supposed to use its own stack but I did not bother here. I 
do not know 32bit ABI at all so say whether the existing code should 
just work or not :-/


> With lwz here I found no 
> further 64 bit opcodes and the guest boot code could walk the device 
> tree. It failed later but I think that's because I'll need to fill more 
> info about the machine in the device tree. I'll experiment with that but 
> it looks like it could work at least for MorphOS. I'll have to try Linux 
> too.


There are plenty of tracepoints, enable them all.

> 
>>> Do you have some info on how the stdout works in VOF? I think I'll 
>>> need that to test with Linux and get output but I'm not sure what's 
>>> needed on the machine side.
>>
>> VOF opens stsout and stores the ihandle (in fdt) which the client 
>> (==kernel) uses for writing. To make it work properly, you need to 
>> hook up that instance to a device backend similar to what I have for 
>> spapr-vty:
>>
>> https://github.com/aik/qemu/commit/a381a5b50c23c74013e2bd39cc5dad5b6385965d 
>>
>>
>> This is not a part of this patch as I'm trying to keep things simpler 
>> and accessing backends from VOF is still unsettled. But there is a 
>> workaround which  is trace_vof_write, I use this. Thanks,
> 
> The above patch is about stdin but stdout seems to be added by the 
> current vof patch. What is spapr-vty?

It is pseries' paravirtual serial device, pegasos does not have it.

> I don't think I have something 
> similar in pegasos2 where I just have a normal serial port created by 
> ISASuperIO in the vt8231 model.

Correct.

> Can I use that backend somehow or have 
> to create some other serial device to connect to stdout?
> Does 
> trace_vof_write work for stuff output by the guest?
> I guess that's only 
> for things printed by VOF itself

VOF itself does not prints anything in this patch.

> but to see Linux output do I need a 
> stdout in VOF 
> or it will just open the serial with its own driver and 
> use that?
> So I'm not sure what's the stdout parts in the current vof 
> patch does and if I need that for anything. I'll try to experiment with 
> it some more but fixing the ld and Kconfig seems to be enough to get it 
> work for me.

So for the client to print something, /chosen/stdout needs to have a 
valid ihandle.
The only way to get a valid ihandle is having a valid phandle which 
vof_client_open() can open.
A valid phandle is a phandle of any node in the device tree. On spapr we 
pick some spapr-vty, open it and store in /chosen/stdout.

 From this point output from the client can be seen via a tracepoint.

Now if we want proper output without tracepoints - we need to hook it up 
with some chardev backend (not a device such a vt8231 or spapr-vty but 
backend).

https://github.com/aik/qemu/commit/a381a5b50c23c74013e2bd3 does this:
1. when a phandle is open, QEMU will search for DeviceState* for the 
specific FDT node and get a chardev from the device.
2. when write() is called, QEMU calls qemu_chr_fe_write_all() on chardev 
from 1.

 From this point you do not need a tracepoint and the output will 
appears in the console you set up for stdout.

Now if you want input from this console, things get tricky. First, on 
powernv/pseries we only need this for grub as otherwise the kernel has 
all the drivers needed and will not use the client interface. For the 
grub, we need to provide a valid ihandle for /chosen/stdin which is easy 
but implementing read() on this is not as there is no simple 
device-type-independend way of reading from chardev. I hacked it for 
spapr-tvy but other serial devices will need special handling, or we'll 
have to introduce some VOF_SERIAL_READ interface for those which will 
face opposition :)

Makes sense?
Alexey Kardashevskiy May 22, 2021, 6:39 a.m. UTC | #6
On 22/05/2021 05:57, BALATON Zoltan wrote:
> On Fri, 21 May 2021, BALATON Zoltan wrote:
>> On Fri, 21 May 2021, Alexey Kardashevskiy wrote:
>>> On 21/05/2021 07:59, BALATON Zoltan wrote:
>>>> On Thu, 20 May 2021, Alexey Kardashevskiy wrote:
>>>>> The PAPR platform describes an OS environment that's presented by
>>>>> a combination of a hypervisor and firmware. The features it specifies
>>>>> require collaboration between the firmware and the hypervisor.
>>>>>
>>>>> Since the beginning, the runtime component of the firmware (RTAS) has
>>>>> been implemented as a 20 byte shim which simply forwards it to
>>>>> a hypercall implemented in qemu. The boot time firmware component is
>>>>> SLOF - but a build that's specific to qemu, and has always needed 
>>>>> to be
>>>>> updated in sync with it. Even though we've managed to limit the amount
>>>>> of runtime communication we need between qemu and SLOF, there's some,
>>>>> and it has become increasingly awkward to handle as we've implemented
>>>>> new features.
>>>>>
>>>>> This implements a boot time OF client interface (CI) which is
>>>>> enabled by a new "x-vof" pseries machine option (stands for 
>>>>> "Virtual Open
>>>>> Firmware). When enabled, QEMU implements the custom H_OF_CLIENT hcall
>>>>> which implements Open Firmware Client Interface (OF CI). This allows
>>>>> using a smaller stateless firmware which does not have to manage
>>>>> the device tree.
>>>>>
>>>>> The new "vof.bin" firmware image is included with source code under
>>>>> pc-bios/. It also includes RTAS blob.
>>>>>
>>>>> This implements a handful of CI methods just to get -kernel/-initrd
>>>>> working. In particular, this implements the device tree fetching and
>>>>> simple memory allocator - "claim" (an OF CI memory allocator) and 
>>>>> updates
>>>>> "/memory@0/available" to report the client about available memory.
>>>>>
>>>>> This implements changing some device tree properties which we know how
>>>>> to deal with, the rest is ignored. To allow changes, this skips
>>>>> fdt_pack() when x-vof=on as not packing the blob leaves some room for
>>>>> appending.
>>>>>
>>>>> In absence of SLOF, this assigns phandles to device tree nodes to make
>>>>> device tree traversing work.
>>>>>
>>>>> When x-vof=on, this adds "/chosen" every time QEMU (re)builds a tree.
>>>>>
>>>>> This adds basic instances support which are managed by a hash map
>>>>> ihandle -> [phandle].
>>>>>
>>>>> Before the guest started, the used memory is:
>>>>> 0..e60 - the initial firmware
>>>>> 8000..10000 - stack
>>>>> 400000.. - kernel
>>>>> 3ea0000.. - initramdisk
>>>>>
>>>>> This OF CI does not implement "interpret".
>>>>>
>>>>> Unlike SLOF, this does not format uninitialized nvram. Instead, this
>>>>> includes a disk image with pre-formatted nvram.
>>>>>
>>>>> With this basic support, this can only boot into kernel directly.
>>>>> However this is just enough for the petitboot kernel and 
>>>>> initradmdisk to
>>>>> boot from any possible source. Note this requires reasonably recent 
>>>>> guest
>>>>> kernel with:
>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df5be5be8735 
>>>>> The immediate benefit is much faster booting time which especially
>>>>> crucial with fully emulated early CPU bring up environments. Also this
>>>>> may come handy when/if GRUB-in-the-userspace sees light of the day.
>>>>>
>>>>> This separates VOF and sPAPR in a hope that VOF bits may be reused by
>>>>> other POWERPC boards which do not support pSeries.
>>>>>
>>>>> This is coded in assumption that later on we might be adding 
>>>>> support for
>>>>> booting from QEMU backends (blockdev is the first candidate) without
>>>>> devices/drivers in between as OF1275 does not require that and
>>>>> it is quite easy to so.
>>>>>
>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>> ---
>>>>>
>>>>> The example command line is:
>>>>>
>>>>> /home/aik/pbuild/qemu-killslof-localhost-ppc64/qemu-system-ppc64 \
>>>>> -nodefaults \
>>>>> -chardev stdio,id=STDIO0,signal=off,mux=on \
>>>>> -device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
>>>>> -mon id=MON0,chardev=STDIO0,mode=readline \
>>>>> -nographic \
>>>>> -vga none \
>>>>> -enable-kvm \
>>>>> -m 8G \
>>>>> -machine 
>>>>> pseries,x-vof=on,cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off 
>>>>> \
>>>>> -kernel pbuild/kernel-le-guest/vmlinux \
>>>>> -initrd pb/rootfs.cpio.xz \
>>>>> -drive 
>>>>> id=DRIVE0,if=none,file=./p/qemu-killslof/pc-bios/vof-nvram.bin,format=raw 
>>>>> \
>>>>> -global spapr-nvram.drive=DRIVE0 \
>>>>> -snapshot \
>>>>> -smp 8,threads=8 \
>>>>> -L /home/aik/t/qemu-ppc64-bios/ \
>>>>> -trace events=qemu_trace_events \
>>>>> -d guest_errors \
>>>>> -chardev socket,id=SOCKET0,server,nowait,path=qemu.mon.tmux26 \
>>>>> -mon chardev=SOCKET0,mode=control
>>>>>
>>>>> ---
>>>>> Changes:
>>>>> v20:
>>>>> * compile vof.bin with -mcpu=power4 for better compatibility
>>>>> * s/std/stw/ in entry.S to make it work on ppc32
>>>>> * fixed dt_available property to support both 32 and 64bit
>>>>> * shuffled prom_args handling code
>>>>> * do not enforce 32bit in MSR (again, to support 32bit platforms)
>>>>>
>>>>
>>>> [...]
>>>>
>>>>> diff --git a/default-configs/devices/ppc64-softmmu.mak 
>>>>> b/default-configs/devices/ppc64-softmmu.mak
>>>>> index ae0841fa3a18..9fb201dfacfa 100644
>>>>> --- a/default-configs/devices/ppc64-softmmu.mak
>>>>> +++ b/default-configs/devices/ppc64-softmmu.mak
>>>>> @@ -9,3 +9,4 @@ CONFIG_POWERNV=y
>>>>>  # For pSeries
>>>>>  CONFIG_PSERIES=y
>>>>>  CONFIG_NVDIMM=y
>>>>> +CONFIG_VOF=y
>>>>> diff --git a/hw/ppc/Kconfig b/hw/ppc/Kconfig
>>>>> index e51e0e5e5ac6..964510dfc73d 100644
>>>>> --- a/hw/ppc/Kconfig
>>>>> +++ b/hw/ppc/Kconfig
>>>>> @@ -143,3 +143,6 @@ config FW_CFG_PPC
>>>>>
>>>>>  config FDT_PPC
>>>>>      bool
>>>>> +
>>>>> +config VOF
>>>>> +    bool
>>>>
>>>> I think you should just add "select VOF" to config PSERIES section 
>>>> in Kconfig instead of adding it to 
>>>> default-configs/devices/ppc64-softmmu.mak. 
>>>
>>> oh well, can do that too.
>>
>> I think most config options should be selected by KConfig and the 
>> default config should only include machines, otherwise VOF would be 
>> added also when you don't compile PSERIES or PEGASOS2. With select in 
>> Kconfig it will be added when needed. That's why it's better to use 
>> select in this case.
>>
>>>>  That should do it, it works in my updated pegasos2 patch:
>>>>
>>>> https://osdn.net/projects/qmiga/scm/git/qemu/commits/3c1fad08469b4d3c04def22044e52b2d27774a61 
>>>> [...]
>>>>> diff --git a/pc-bios/vof/entry.S b/pc-bios/vof/entry.S
>>>>> new file mode 100644
>>>>> index 000000000000..569688714c91
>>>>> --- /dev/null
>>>>> +++ b/pc-bios/vof/entry.S
>>>>> @@ -0,0 +1,51 @@
>>>>> +#define LOAD32(rn, name)    \
>>>>> +    lis     rn,name##@h;    \
>>>>> +    ori     rn,rn,name##@l
>>>>> +
>>>>> +#define ENTRY(func_name)    \
>>>>> +    .text;                  \
>>>>> +    .align  2;              \
>>>>> +    .globl  .func_name;     \
>>>>> +    .func_name:             \
>>>>> +    .globl  func_name;      \
>>>>> +    func_name:
>>>>> +
>>>>> +#define KVMPPC_HCALL_BASE       0xf000
>>>>> +#define KVMPPC_H_RTAS           (KVMPPC_HCALL_BASE + 0x0)
>>>>> +#define KVMPPC_H_VOF_CLIENT     (KVMPPC_HCALL_BASE + 0x5)
>>>>> +
>>>>> +    . = 0x100 /* Do exactly as SLOF does */
>>>>> +
>>>>> +ENTRY(_start)
>>>>> +#    LOAD32(%r31, 0) /* Go 32bit mode */
>>>>> +#    mtmsrd %r31,0
>>>>> +    LOAD32(2, __toc_start)
>>>>> +    b entry_c
>>>>> +
>>>>> +ENTRY(_prom_entry)
>>>>> +    LOAD32(2, __toc_start)
>>>>> +    stwu    %r1,-112(%r1)
>>>>> +    stw     %r31,104(%r1)
>>>>> +    mflr    %r31
>>>>> +    bl prom_entry
>>>>> +    nop
>>>>> +    mtlr    %r31
>>>>> +    ld      %r31,104(%r1)
>>>>
>>>> It's getting there, now I see the first client call from the guest 
>>>> boot code but then it crashes on this ld opcode which apparently is 
>>>> 64 bit only:
>>>
>>> Oh right.
>>>
>>>
>>>> Hopefully this is the last such opcode left before I can really test 
>>>> this.
>>>
>>> Make it lwz, and test it?
>>
>> Yes, figured that out too after sending this message. Replacing with 
>> lwz works but I wonder that now you have stwu lwz do the stack offsets 
>> need adjusting too or you just waste 4 bytes now? With lwz here I 
>> found no further 64 bit opcodes and the guest boot code could walk the 
>> device tree. It failed later but I think that's because I'll need to 
>> fill more info about the machine in the device tree. I'll experiment 
>> with that but it looks like it could work at least for MorphOS. I'll 
>> have to try Linux too.
> 
> I was trying to get a linux kernel from a debian powerpc iso to do 
> something (debian before 10.0 has Pegasos support) but I've run into the 
> problem that the kernel is loaded at 0x400000 but the start address is 
> at some offset from that. How do I set qemu,boot-kernel in this case?


The pseries kernel can work from any location (and it relocates itself 
to 0 at some point) even though it is linked at c000.0000.0000.0000, and 
there is no start address offset:

===
 > objdump -D ~/pbuild/kernel-le/vmlinux
/home/aik/pbuild/kernel-le/vmlinux:     file format elf64-powerpcle


Disassembly of section .head.text:

c000000000000000 <__start>:
c000000000000000:       48 00 00 08     tdi     0,r0,72
c000000000000004:       2c 00 00 48     b       c000000000000030 
<__start+0x30>
...
===

Not sure about pegasos2 kernels (or any ppc32 really), sorry.


> Because when I set it to the address/size where the kernel is loaded it 
> jumps to the beginnig not the correct start address. If I set the 
> address to the start address then size will be wrong so I don't know how 
> to set qemu,boot-kernel in this case or is there another property to 
> tell the start address?
> (Vof does not seem to check any other property 
> and seems to assume the entry point is the same as the load address but 
> for this linux kernel it's not.)

I guess if you really need an offset, you'll have to add a new property 
("qemu,boot-kernel-start"?) and look for it in the firmware. Or, say, 
put in gpr5 in your version of spapr_cpu_set_entry_state() and make 
boot_from_memory() use it.
BALATON Zoltan May 22, 2021, 1:01 p.m. UTC | #7
On Sat, 22 May 2021, Alexey Kardashevskiy wrote:
> On 21/05/2021 19:05, BALATON Zoltan wrote:
>> On Fri, 21 May 2021, Alexey Kardashevskiy wrote:
>>> On 21/05/2021 07:59, BALATON Zoltan wrote:
>>>> On Thu, 20 May 2021, Alexey Kardashevskiy wrote:
>>>>> The PAPR platform describes an OS environment that's presented by
>>>>> a combination of a hypervisor and firmware. The features it specifies
>>>>> require collaboration between the firmware and the hypervisor.
>>>>> 
>>>>> Since the beginning, the runtime component of the firmware (RTAS) has
>>>>> been implemented as a 20 byte shim which simply forwards it to
>>>>> a hypercall implemented in qemu. The boot time firmware component is
>>>>> SLOF - but a build that's specific to qemu, and has always needed to be
>>>>> updated in sync with it. Even though we've managed to limit the amount
>>>>> of runtime communication we need between qemu and SLOF, there's some,
>>>>> and it has become increasingly awkward to handle as we've implemented
>>>>> new features.
>>>>> 
>>>>> This implements a boot time OF client interface (CI) which is
>>>>> enabled by a new "x-vof" pseries machine option (stands for "Virtual 
>>>>> Open
>>>>> Firmware). When enabled, QEMU implements the custom H_OF_CLIENT hcall
>>>>> which implements Open Firmware Client Interface (OF CI). This allows
>>>>> using a smaller stateless firmware which does not have to manage
>>>>> the device tree.
>>>>> 
>>>>> The new "vof.bin" firmware image is included with source code under
>>>>> pc-bios/. It also includes RTAS blob.
>>>>> 
>>>>> This implements a handful of CI methods just to get -kernel/-initrd
>>>>> working. In particular, this implements the device tree fetching and
>>>>> simple memory allocator - "claim" (an OF CI memory allocator) and 
>>>>> updates
>>>>> "/memory@0/available" to report the client about available memory.
>>>>> 
>>>>> This implements changing some device tree properties which we know how
>>>>> to deal with, the rest is ignored. To allow changes, this skips
>>>>> fdt_pack() when x-vof=on as not packing the blob leaves some room for
>>>>> appending.
>>>>> 
>>>>> In absence of SLOF, this assigns phandles to device tree nodes to make
>>>>> device tree traversing work.
>>>>> 
>>>>> When x-vof=on, this adds "/chosen" every time QEMU (re)builds a tree.
>>>>> 
>>>>> This adds basic instances support which are managed by a hash map
>>>>> ihandle -> [phandle].
>>>>> 
>>>>> Before the guest started, the used memory is:
>>>>> 0..e60 - the initial firmware
>>>>> 8000..10000 - stack
>>>>> 400000.. - kernel
>>>>> 3ea0000.. - initramdisk
>>>>> 
>>>>> This OF CI does not implement "interpret".
>>>>> 
>>>>> Unlike SLOF, this does not format uninitialized nvram. Instead, this
>>>>> includes a disk image with pre-formatted nvram.
>>>>> 
>>>>> With this basic support, this can only boot into kernel directly.
>>>>> However this is just enough for the petitboot kernel and initradmdisk to
>>>>> boot from any possible source. Note this requires reasonably recent 
>>>>> guest
>>>>> kernel with:
>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df5be5be8735 
>>>>> The immediate benefit is much faster booting time which especially
>>>>> crucial with fully emulated early CPU bring up environments. Also this
>>>>> may come handy when/if GRUB-in-the-userspace sees light of the day.
>>>>> 
>>>>> This separates VOF and sPAPR in a hope that VOF bits may be reused by
>>>>> other POWERPC boards which do not support pSeries.
>>>>> 
>>>>> This is coded in assumption that later on we might be adding support for
>>>>> booting from QEMU backends (blockdev is the first candidate) without
>>>>> devices/drivers in between as OF1275 does not require that and
>>>>> it is quite easy to so.
>>>>> 
>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>> ---
>>>>> 
>>>>> The example command line is:
>>>>> 
>>>>> /home/aik/pbuild/qemu-killslof-localhost-ppc64/qemu-system-ppc64 \
>>>>> -nodefaults \
>>>>> -chardev stdio,id=STDIO0,signal=off,mux=on \
>>>>> -device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
>>>>> -mon id=MON0,chardev=STDIO0,mode=readline \
>>>>> -nographic \
>>>>> -vga none \
>>>>> -enable-kvm \
>>>>> -m 8G \
>>>>> -machine 
>>>>> pseries,x-vof=on,cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off 
>>>>> \
>>>>> -kernel pbuild/kernel-le-guest/vmlinux \
>>>>> -initrd pb/rootfs.cpio.xz \
>>>>> -drive 
>>>>> id=DRIVE0,if=none,file=./p/qemu-killslof/pc-bios/vof-nvram.bin,format=raw 
>>>>> \
>>>>> -global spapr-nvram.drive=DRIVE0 \
>>>>> -snapshot \
>>>>> -smp 8,threads=8 \
>>>>> -L /home/aik/t/qemu-ppc64-bios/ \
>>>>> -trace events=qemu_trace_events \
>>>>> -d guest_errors \
>>>>> -chardev socket,id=SOCKET0,server,nowait,path=qemu.mon.tmux26 \
>>>>> -mon chardev=SOCKET0,mode=control
>>>>> 
>>>>> ---
>>>>> Changes:
>>>>> v20:
>>>>> * compile vof.bin with -mcpu=power4 for better compatibility
>>>>> * s/std/stw/ in entry.S to make it work on ppc32
>>>>> * fixed dt_available property to support both 32 and 64bit
>>>>> * shuffled prom_args handling code
>>>>> * do not enforce 32bit in MSR (again, to support 32bit platforms)
>>>>> 
>>>> 
>>>> [...]
>>>> 
>>>>> diff --git a/default-configs/devices/ppc64-softmmu.mak 
>>>>> b/default-configs/devices/ppc64-softmmu.mak
>>>>> index ae0841fa3a18..9fb201dfacfa 100644
>>>>> --- a/default-configs/devices/ppc64-softmmu.mak
>>>>> +++ b/default-configs/devices/ppc64-softmmu.mak
>>>>> @@ -9,3 +9,4 @@ CONFIG_POWERNV=y
>>>>>  # For pSeries
>>>>>  CONFIG_PSERIES=y
>>>>>  CONFIG_NVDIMM=y
>>>>> +CONFIG_VOF=y
>>>>> diff --git a/hw/ppc/Kconfig b/hw/ppc/Kconfig
>>>>> index e51e0e5e5ac6..964510dfc73d 100644
>>>>> --- a/hw/ppc/Kconfig
>>>>> +++ b/hw/ppc/Kconfig
>>>>> @@ -143,3 +143,6 @@ config FW_CFG_PPC
>>>>> 
>>>>>  config FDT_PPC
>>>>>      bool
>>>>> +
>>>>> +config VOF
>>>>> +    bool
>>>> 
>>>> I think you should just add "select VOF" to config PSERIES section in 
>>>> Kconfig instead of adding it to 
>>>> default-configs/devices/ppc64-softmmu.mak. 
>>> 
>>> oh well, can do that too.
>> 
>> I think most config options should be selected by KConfig and the default 
>> config should only include machines, otherwise VOF would be added also when 
>> you don't compile PSERIES or PEGASOS2. With select in Kconfig it will be 
>> added when needed. That's why it's better to use select in this case.
>> 
>>>>  That should do it, it works in my updated pegasos2 patch:
>>>> 
>>>> https://osdn.net/projects/qmiga/scm/git/qemu/commits/3c1fad08469b4d3c04def22044e52b2d27774a61 
>>>> [...]
>>>>> diff --git a/pc-bios/vof/entry.S b/pc-bios/vof/entry.S
>>>>> new file mode 100644
>>>>> index 000000000000..569688714c91
>>>>> --- /dev/null
>>>>> +++ b/pc-bios/vof/entry.S
>>>>> @@ -0,0 +1,51 @@
>>>>> +#define LOAD32(rn, name)    \
>>>>> +    lis     rn,name##@h;    \
>>>>> +    ori     rn,rn,name##@l
>>>>> +
>>>>> +#define ENTRY(func_name)    \
>>>>> +    .text;                  \
>>>>> +    .align  2;              \
>>>>> +    .globl  .func_name;     \
>>>>> +    .func_name:             \
>>>>> +    .globl  func_name;      \
>>>>> +    func_name:
>>>>> +
>>>>> +#define KVMPPC_HCALL_BASE       0xf000
>>>>> +#define KVMPPC_H_RTAS           (KVMPPC_HCALL_BASE + 0x0)
>>>>> +#define KVMPPC_H_VOF_CLIENT     (KVMPPC_HCALL_BASE + 0x5)
>>>>> +
>>>>> +    . = 0x100 /* Do exactly as SLOF does */
>>>>> +
>>>>> +ENTRY(_start)
>>>>> +#    LOAD32(%r31, 0) /* Go 32bit mode */
>>>>> +#    mtmsrd %r31,0
>>>>> +    LOAD32(2, __toc_start)
>>>>> +    b entry_c
>>>>> +
>>>>> +ENTRY(_prom_entry)
>>>>> +    LOAD32(2, __toc_start)
>>>>> +    stwu    %r1,-112(%r1)
>>>>> +    stw     %r31,104(%r1)
>>>>> +    mflr    %r31
>>>>> +    bl prom_entry
>>>>> +    nop
>>>>> +    mtlr    %r31
>>>>> +    ld      %r31,104(%r1)
>>>> 
>>>> It's getting there, now I see the first client call from the guest boot 
>>>> code but then it crashes on this ld opcode which apparently is 64 bit 
>>>> only:
>>> 
>>> Oh right.
>>> 
>>> 
>>>> Hopefully this is the last such opcode left before I can really test 
>>>> this.
>>> 
>>> Make it lwz, and test it?
>> 
>> Yes, figured that out too after sending this message. Replacing with lwz 
>> works but I wonder that now you have stwu lwz do the stack offsets need 
>> adjusting too or you just waste 4 bytes now?
>
> Well, this assumes the 64bit client and that ABI. I think ideally the 
> firmware is supposed to use its own stack but I did not bother here. I do not 
> know 32bit ABI at all so say whether the existing code should just work or 
> not :-/

It seems to work so that's OK, just thought if the firmware is 32 bit it 
does not need 64 bit values on stack but if that's also potentially used 
by a 64 bit kernel then it may be better to keep it that way to avoid 
confusion. With the 64 bit opcodes replaced it seems to work on pegasos2 
and the guest can call CI functions and get a reply so maybe it's just a 
few wasted bytes that's not a big deal.

>> With lwz here I found no further 64 bit opcodes and the guest boot code 
>> could walk the device tree. It failed later but I think that's because I'll 
>> need to fill more info about the machine in the device tree. I'll 
>> experiment with that but it looks like it could work at least for MorphOS. 
>> I'll have to try Linux too.
>
> There are plenty of tracepoints, enable them all.

I'm running with -trace enable="vof*" but it does not give me too much 
info as a lot of calls (such as peer, child, etc.) don't log anything 
other than there was a hypercall so only get info about opening paths and 
querying some props. The MorphOS boot.img just walks the device tree 
gathering some data about the machine then calls quiesce and boot into the 
OS that later tries to use the gathered info at which point it crashes 
without any logs if some info is not as expected. This does not make it 
easy to debug but I think once I fill the device tree enough with all 
needed info it should work. Currently I'm missing info about PCI devices 
that it may need.

>>>> Do you have some info on how the stdout works in VOF? I think I'll need 
>>>> that to test with Linux and get output but I'm not sure what's needed on 
>>>> the machine side.
>>> 
>>> VOF opens stsout and stores the ihandle (in fdt) which the client 
>>> (==kernel) uses for writing. To make it work properly, you need to hook up 
>>> that instance to a device backend similar to what I have for spapr-vty:
>>> 
>>> https://github.com/aik/qemu/commit/a381a5b50c23c74013e2bd39cc5dad5b6385965d 
>>> 
>>> This is not a part of this patch as I'm trying to keep things simpler and 
>>> accessing backends from VOF is still unsettled. But there is a workaround 
>>> which  is trace_vof_write, I use this. Thanks,
>> 
>> The above patch is about stdin but stdout seems to be added by the current 
>> vof patch. What is spapr-vty?
>
> It is pseries' paravirtual serial device, pegasos does not have it.
>
>> I don't think I have something similar in pegasos2 where I just have a 
>> normal serial port created by ISASuperIO in the vt8231 model.
>
> Correct.
>
>> Can I use that backend somehow or have to create some other serial device 
>> to connect to stdout?
>> Does trace_vof_write work for stuff output by the guest?
>> I guess that's only for things printed by VOF itself
>
> VOF itself does not prints anything in this patch.

However it seems to be needed for linux as the first thing it does seems 
to be getting /chosen/stdout and calls exit if it returns nothing. So I'll 
need this at least for linux. (I think MorphOS may also query it to print 
a banner or some messages but not sure it needs it, at least it does not 
abort right away if not found.)

>> but to see Linux output do I need a stdout in VOF or it will just open the 
>> serial with its own driver and use that?
>> So I'm not sure what's the stdout parts in the current vof patch does and 
>> if I need that for anything. I'll try to experiment with it some more but 
>> fixing the ld and Kconfig seems to be enough to get it work for me.
>
> So for the client to print something, /chosen/stdout needs to have a valid 
> ihandle.
> The only way to get a valid ihandle is having a valid phandle which 
> vof_client_open() can open.
> A valid phandle is a phandle of any node in the device tree. On spapr we pick 
> some spapr-vty, open it and store in /chosen/stdout.
>
> From this point output from the client can be seen via a tracepoint.
>
> Now if we want proper output without tracepoints - we need to hook it up with 
> some chardev backend (not a device such a vt8231 or spapr-vty but backend).

I don't know much about it but devices are also connected to some backend 
so is it possible to use the same backend for VOF as used for the normal 
serial port? But I need a way to find that and connect it to VOF and I'm 
not qure how to do that yet. Or do I need to create a separate serial 
backend and connect that to VOF? I'll try to look at spapr-vty to see what 
it does.

> https://github.com/aik/qemu/commit/a381a5b50c23c74013e2bd3 does this:
> 1. when a phandle is open, QEMU will search for DeviceState* for the specific 
> FDT node and get a chardev from the device.
> 2. when write() is called, QEMU calls qemu_chr_fe_write_all() on chardev from 
> 1.
>
> From this point you do not need a tracepoint and the output will appears in 
> the console you set up for stdout.
>
> Now if you want input from this console, things get tricky. First, on 
> powernv/pseries we only need this for grub as otherwise the kernel has all 
> the drivers needed and will not use the client interface. For the grub, we 
> need to provide a valid ihandle for /chosen/stdin which is easy but 
> implementing read() on this is not as there is no simple 
> device-type-independend way of reading from chardev. I hacked it for 
> spapr-tvy but other serial devices will need special handling, or we'll have 
> to introduce some VOF_SERIAL_READ interface for those which will face 
> opposition :)
>
> Makes sense?

It explains things a bit but still not entirely clear how can I get 
something to add as a stdout. With the pegasos2 firmware it puts the 
serial device there normally that it inits and opens. Without that 
firmware we have to somehow do that from QEMU so find the serial backend 
used by the serial device within the vt8231 model (or use a different 
backend just for this?) then open it and put it in the device tree. If 
that's correct or how to do it is not clear yet.

Regards.
BALATON Zoltan
BALATON Zoltan May 22, 2021, 1:08 p.m. UTC | #8
On Sat, 22 May 2021, Alexey Kardashevskiy wrote:
> On 22/05/2021 05:57, BALATON Zoltan wrote:
>> On Fri, 21 May 2021, BALATON Zoltan wrote:
>>> On Fri, 21 May 2021, Alexey Kardashevskiy wrote:
>>>> On 21/05/2021 07:59, BALATON Zoltan wrote:
>>>>> On Thu, 20 May 2021, Alexey Kardashevskiy wrote:
>>>>>> The PAPR platform describes an OS environment that's presented by
>>>>>> a combination of a hypervisor and firmware. The features it specifies
>>>>>> require collaboration between the firmware and the hypervisor.
>>>>>> 
>>>>>> Since the beginning, the runtime component of the firmware (RTAS) has
>>>>>> been implemented as a 20 byte shim which simply forwards it to
>>>>>> a hypercall implemented in qemu. The boot time firmware component is
>>>>>> SLOF - but a build that's specific to qemu, and has always needed to be
>>>>>> updated in sync with it. Even though we've managed to limit the amount
>>>>>> of runtime communication we need between qemu and SLOF, there's some,
>>>>>> and it has become increasingly awkward to handle as we've implemented
>>>>>> new features.
>>>>>> 
>>>>>> This implements a boot time OF client interface (CI) which is
>>>>>> enabled by a new "x-vof" pseries machine option (stands for "Virtual 
>>>>>> Open
>>>>>> Firmware). When enabled, QEMU implements the custom H_OF_CLIENT hcall
>>>>>> which implements Open Firmware Client Interface (OF CI). This allows
>>>>>> using a smaller stateless firmware which does not have to manage
>>>>>> the device tree.
>>>>>> 
>>>>>> The new "vof.bin" firmware image is included with source code under
>>>>>> pc-bios/. It also includes RTAS blob.
>>>>>> 
>>>>>> This implements a handful of CI methods just to get -kernel/-initrd
>>>>>> working. In particular, this implements the device tree fetching and
>>>>>> simple memory allocator - "claim" (an OF CI memory allocator) and 
>>>>>> updates
>>>>>> "/memory@0/available" to report the client about available memory.
>>>>>> 
>>>>>> This implements changing some device tree properties which we know how
>>>>>> to deal with, the rest is ignored. To allow changes, this skips
>>>>>> fdt_pack() when x-vof=on as not packing the blob leaves some room for
>>>>>> appending.
>>>>>> 
>>>>>> In absence of SLOF, this assigns phandles to device tree nodes to make
>>>>>> device tree traversing work.
>>>>>> 
>>>>>> When x-vof=on, this adds "/chosen" every time QEMU (re)builds a tree.
>>>>>> 
>>>>>> This adds basic instances support which are managed by a hash map
>>>>>> ihandle -> [phandle].
>>>>>> 
>>>>>> Before the guest started, the used memory is:
>>>>>> 0..e60 - the initial firmware
>>>>>> 8000..10000 - stack
>>>>>> 400000.. - kernel
>>>>>> 3ea0000.. - initramdisk
>>>>>> 
>>>>>> This OF CI does not implement "interpret".
>>>>>> 
>>>>>> Unlike SLOF, this does not format uninitialized nvram. Instead, this
>>>>>> includes a disk image with pre-formatted nvram.
>>>>>> 
>>>>>> With this basic support, this can only boot into kernel directly.
>>>>>> However this is just enough for the petitboot kernel and initradmdisk 
>>>>>> to
>>>>>> boot from any possible source. Note this requires reasonably recent 
>>>>>> guest
>>>>>> kernel with:
>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df5be5be8735 
>>>>>> The immediate benefit is much faster booting time which especially
>>>>>> crucial with fully emulated early CPU bring up environments. Also this
>>>>>> may come handy when/if GRUB-in-the-userspace sees light of the day.
>>>>>> 
>>>>>> This separates VOF and sPAPR in a hope that VOF bits may be reused by
>>>>>> other POWERPC boards which do not support pSeries.
>>>>>> 
>>>>>> This is coded in assumption that later on we might be adding support 
>>>>>> for
>>>>>> booting from QEMU backends (blockdev is the first candidate) without
>>>>>> devices/drivers in between as OF1275 does not require that and
>>>>>> it is quite easy to so.
>>>>>> 
>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>> ---
>>>>>> 
>>>>>> The example command line is:
>>>>>> 
>>>>>> /home/aik/pbuild/qemu-killslof-localhost-ppc64/qemu-system-ppc64 \
>>>>>> -nodefaults \
>>>>>> -chardev stdio,id=STDIO0,signal=off,mux=on \
>>>>>> -device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
>>>>>> -mon id=MON0,chardev=STDIO0,mode=readline \
>>>>>> -nographic \
>>>>>> -vga none \
>>>>>> -enable-kvm \
>>>>>> -m 8G \
>>>>>> -machine 
>>>>>> pseries,x-vof=on,cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off 
>>>>>> \
>>>>>> -kernel pbuild/kernel-le-guest/vmlinux \
>>>>>> -initrd pb/rootfs.cpio.xz \
>>>>>> -drive 
>>>>>> id=DRIVE0,if=none,file=./p/qemu-killslof/pc-bios/vof-nvram.bin,format=raw 
>>>>>> \
>>>>>> -global spapr-nvram.drive=DRIVE0 \
>>>>>> -snapshot \
>>>>>> -smp 8,threads=8 \
>>>>>> -L /home/aik/t/qemu-ppc64-bios/ \
>>>>>> -trace events=qemu_trace_events \
>>>>>> -d guest_errors \
>>>>>> -chardev socket,id=SOCKET0,server,nowait,path=qemu.mon.tmux26 \
>>>>>> -mon chardev=SOCKET0,mode=control
>>>>>> 
>>>>>> ---
>>>>>> Changes:
>>>>>> v20:
>>>>>> * compile vof.bin with -mcpu=power4 for better compatibility
>>>>>> * s/std/stw/ in entry.S to make it work on ppc32
>>>>>> * fixed dt_available property to support both 32 and 64bit
>>>>>> * shuffled prom_args handling code
>>>>>> * do not enforce 32bit in MSR (again, to support 32bit platforms)
>>>>>> 
>>>>> 
>>>>> [...]
>>>>> 
>>>>>> diff --git a/default-configs/devices/ppc64-softmmu.mak 
>>>>>> b/default-configs/devices/ppc64-softmmu.mak
>>>>>> index ae0841fa3a18..9fb201dfacfa 100644
>>>>>> --- a/default-configs/devices/ppc64-softmmu.mak
>>>>>> +++ b/default-configs/devices/ppc64-softmmu.mak
>>>>>> @@ -9,3 +9,4 @@ CONFIG_POWERNV=y
>>>>>>  # For pSeries
>>>>>>  CONFIG_PSERIES=y
>>>>>>  CONFIG_NVDIMM=y
>>>>>> +CONFIG_VOF=y
>>>>>> diff --git a/hw/ppc/Kconfig b/hw/ppc/Kconfig
>>>>>> index e51e0e5e5ac6..964510dfc73d 100644
>>>>>> --- a/hw/ppc/Kconfig
>>>>>> +++ b/hw/ppc/Kconfig
>>>>>> @@ -143,3 +143,6 @@ config FW_CFG_PPC
>>>>>> 
>>>>>>  config FDT_PPC
>>>>>>      bool
>>>>>> +
>>>>>> +config VOF
>>>>>> +    bool
>>>>> 
>>>>> I think you should just add "select VOF" to config PSERIES section in 
>>>>> Kconfig instead of adding it to 
>>>>> default-configs/devices/ppc64-softmmu.mak. 
>>>> 
>>>> oh well, can do that too.
>>> 
>>> I think most config options should be selected by KConfig and the default 
>>> config should only include machines, otherwise VOF would be added also 
>>> when you don't compile PSERIES or PEGASOS2. With select in Kconfig it will 
>>> be added when needed. That's why it's better to use select in this case.
>>> 
>>>>>  That should do it, it works in my updated pegasos2 patch:
>>>>> 
>>>>> https://osdn.net/projects/qmiga/scm/git/qemu/commits/3c1fad08469b4d3c04def22044e52b2d27774a61 
>>>>> [...]
>>>>>> diff --git a/pc-bios/vof/entry.S b/pc-bios/vof/entry.S
>>>>>> new file mode 100644
>>>>>> index 000000000000..569688714c91
>>>>>> --- /dev/null
>>>>>> +++ b/pc-bios/vof/entry.S
>>>>>> @@ -0,0 +1,51 @@
>>>>>> +#define LOAD32(rn, name)    \
>>>>>> +    lis     rn,name##@h;    \
>>>>>> +    ori     rn,rn,name##@l
>>>>>> +
>>>>>> +#define ENTRY(func_name)    \
>>>>>> +    .text;                  \
>>>>>> +    .align  2;              \
>>>>>> +    .globl  .func_name;     \
>>>>>> +    .func_name:             \
>>>>>> +    .globl  func_name;      \
>>>>>> +    func_name:
>>>>>> +
>>>>>> +#define KVMPPC_HCALL_BASE       0xf000
>>>>>> +#define KVMPPC_H_RTAS           (KVMPPC_HCALL_BASE + 0x0)
>>>>>> +#define KVMPPC_H_VOF_CLIENT     (KVMPPC_HCALL_BASE + 0x5)
>>>>>> +
>>>>>> +    . = 0x100 /* Do exactly as SLOF does */
>>>>>> +
>>>>>> +ENTRY(_start)
>>>>>> +#    LOAD32(%r31, 0) /* Go 32bit mode */
>>>>>> +#    mtmsrd %r31,0
>>>>>> +    LOAD32(2, __toc_start)
>>>>>> +    b entry_c
>>>>>> +
>>>>>> +ENTRY(_prom_entry)
>>>>>> +    LOAD32(2, __toc_start)
>>>>>> +    stwu    %r1,-112(%r1)
>>>>>> +    stw     %r31,104(%r1)
>>>>>> +    mflr    %r31
>>>>>> +    bl prom_entry
>>>>>> +    nop
>>>>>> +    mtlr    %r31
>>>>>> +    ld      %r31,104(%r1)
>>>>> 
>>>>> It's getting there, now I see the first client call from the guest boot 
>>>>> code but then it crashes on this ld opcode which apparently is 64 bit 
>>>>> only:
>>>> 
>>>> Oh right.
>>>> 
>>>> 
>>>>> Hopefully this is the last such opcode left before I can really test 
>>>>> this.
>>>> 
>>>> Make it lwz, and test it?
>>> 
>>> Yes, figured that out too after sending this message. Replacing with lwz 
>>> works but I wonder that now you have stwu lwz do the stack offsets need 
>>> adjusting too or you just waste 4 bytes now? With lwz here I found no 
>>> further 64 bit opcodes and the guest boot code could walk the device tree. 
>>> It failed later but I think that's because I'll need to fill more info 
>>> about the machine in the device tree. I'll experiment with that but it 
>>> looks like it could work at least for MorphOS. I'll have to try Linux too.
>> 
>> I was trying to get a linux kernel from a debian powerpc iso to do 
>> something (debian before 10.0 has Pegasos support) but I've run into the 
>> problem that the kernel is loaded at 0x400000 but the start address is at 
>> some offset from that. How do I set qemu,boot-kernel in this case?
>
>
> The pseries kernel can work from any location (and it relocates itself to 0 
> at some point) even though it is linked at c000.0000.0000.0000, and there is 
> no start address offset:
>
> ===
>> objdump -D ~/pbuild/kernel-le/vmlinux
> /home/aik/pbuild/kernel-le/vmlinux:     file format elf64-powerpcle
>
>
> Disassembly of section .head.text:
>
> c000000000000000 <__start>:
> c000000000000000:       48 00 00 08     tdi     0,r0,72
> c000000000000004:       2c 00 00 48     b       c000000000000030 
> <__start+0x30>
> ...
> ===
>
> Not sure about pegasos2 kernels (or any ppc32 really), sorry.

The kernel from Debian 10.0 powerpc used on pegasos looks like this:

vmlinuz-chrp.initrd:     file format elf32-powerpc
vmlinuz-chrp.initrd
architecture: powerpc:common, flags 0x00000112:
EXEC_P, HAS_SYMS, D_PAGED
start address 0x004002fc

Program Header:
     LOAD off    0x00010000 vaddr 0x00400000 paddr 0x00400000 align 2**16
          filesz 0x0127b72a memsz 0x0127d5d8 flags rwx
    STACK off    0x00000000 vaddr 0x00000000 paddr 0x00000000 align 2**4
          filesz 0x00000000 memsz 0x00000000 flags rwx
     NOTE off    0x000000b4 vaddr 0x00000000 paddr 0x00000000 align 2**0
          filesz 0x0000002c memsz 0x00000000 flags ---
     NOTE off    0x000000e0 vaddr 0x00000000 paddr 0x00000000 align 2**0
          filesz 0x0000002c memsz 0x00000000 flags ---

Sections:
Idx Name          Size      VMA       LMA       File off  Algn
   0 .text         00008588  00400000  00400000  00010000  2**2
                   CONTENTS, ALLOC, LOAD, READONLY, CODE
   1 .text.unlikely 00000078  00408588  00408588  00018588  2**2
                   CONTENTS, ALLOC, LOAD, READONLY, CODE
   2 .data         00001bec  00409000  00409000  00019000  2**2
                   CONTENTS, ALLOC, LOAD, DATA
   3 .got          0000000c  0040abec  0040abec  0001abec  2**2
                   CONTENTS, ALLOC, LOAD, DATA
   4 __builtin_cmdline 00000800  0040abf8  0040abf8  0001abf8  2**2
                   CONTENTS, ALLOC, LOAD, DATA
   5 .kernel:vmlinux.strip 0047658e  0040c000  0040c000  0001c000  2**0
                   CONTENTS, ALLOC, LOAD, READONLY, DATA
   6 .kernel:initrd 00df872a  00883000  00883000  00493000  2**0
                   CONTENTS, ALLOC, LOAD, READONLY, DATA
   7 .bss          000015d8  0167c000  0167c000  0128b72a  2**2
                   ALLOC
   8 .debug_info   0000e7fd  00000000  00000000  0128b72a  2**0
                   CONTENTS, READONLY, DEBUGGING
   9 .debug_abbrev 00002a4f  00000000  00000000  01299f27  2**0
                   CONTENTS, READONLY, DEBUGGING
  10 .debug_loc    00009df1  00000000  00000000  0129c976  2**0
                   CONTENTS, READONLY, DEBUGGING
  11 .debug_aranges 00000250  00000000  00000000  012a6767  2**0
                   CONTENTS, READONLY, DEBUGGING
  12 .debug_line   000026b8  00000000  00000000  012a69b7  2**0
                   CONTENTS, READONLY, DEBUGGING
  13 .debug_str    00001d9c  00000000  00000000  012a906f  2**0
                   CONTENTS, READONLY, DEBUGGING
  14 .comment      0000001d  00000000  00000000  012aae0b  2**0
                   CONTENTS, READONLY
  15 .gnu.attributes 00000010  00000000  00000000  012aae28  2**0
                   CONTENTS, READONLY
  16 .debug_frame  00001c88  00000000  00000000  012aae38  2**2
                   CONTENTS, READONLY, DEBUGGING
  17 .debug_ranges 00000740  00000000  00000000  012acac0  2**0
                   CONTENTS, READONLY, DEBUGGING

It even seems to have the initrd embedded in it. If I just use 0x400000 as 
start address it does not work, has to jump to the start address for it 
to start correctly.

>> Because when I set it to the address/size where the kernel is loaded it 
>> jumps to the beginnig not the correct start address. If I set the address 
>> to the start address then size will be wrong so I don't know how to set 
>> qemu,boot-kernel in this case or is there another property to tell the 
>> start address?
>> (Vof does not seem to check any other property and seems to assume the 
>> entry point is the same as the load address but for this linux kernel it's 
>> not.)
>
> I guess if you really need an offset, you'll have to add a new property 
> ("qemu,boot-kernel-start"?) and look for it in the firmware. Or, say, put in 
> gpr5 in your version of spapr_cpu_set_entry_state() and make 
> boot_from_memory() use it.

Either way would work but I don't want to recompile vof.bin so if you 
implement any of these in the next version I can use that. For now I've 
just set kernel address to the start address and decreased size a bit, the 
memory for the kernel is still claimed correctly when it's loaded so 
unless something relies on the size in qemu,boot-kernel it does not matter 
and this way the kernel starts but only gets to finding no /chosen/stdout 
and exit there so I can't try it until I resolve that.

Regards,
BALATON Zoltan
BALATON Zoltan May 22, 2021, 3:02 p.m. UTC | #9
On Sat, 22 May 2021, BALATON Zoltan wrote:
> On Sat, 22 May 2021, Alexey Kardashevskiy wrote:
>> VOF itself does not prints anything in this patch.
>
> However it seems to be needed for linux as the first thing it does seems to 
> be getting /chosen/stdout and calls exit if it returns nothing. So I'll need 
> this at least for linux. (I think MorphOS may also query it to print a banner 
> or some messages but not sure it needs it, at least it does not abort right 
> away if not found.)
>
>>> but to see Linux output do I need a stdout in VOF or it will just open the 
>>> serial with its own driver and use that?
>>> So I'm not sure what's the stdout parts in the current vof patch does and 
>>> if I need that for anything. I'll try to experiment with it some more but 
>>> fixing the ld and Kconfig seems to be enough to get it work for me.
>> 
>> So for the client to print something, /chosen/stdout needs to have a valid 
>> ihandle.
>> The only way to get a valid ihandle is having a valid phandle which 
>> vof_client_open() can open.
>> A valid phandle is a phandle of any node in the device tree. On spapr we 
>> pick some spapr-vty, open it and store in /chosen/stdout.
>> 
>> From this point output from the client can be seen via a tracepoint.

I've got it now. Looking at the original firmware device tree dump:

https://osdn.net/projects/qmiga/wiki/SubprojectPegasos2/attach/PegasosII_OFW-Dump.txt

I see that /chosen/stdout points to "screen" which is an alias to 
/bootconsole. Just adding an empty /bootconsole node in the device tree 
and vof_client_open_store() that as /chosen/stdout works and I get output 
via vof_write traces so this is enough for now to test Linux. Properly 
connecting a serial backend can thus be postponed.

So with this the Linux kernel does not abort on the first device tree 
access but starts to decompress itself then the embedded initrd and 
crashes at calling setprop:

[...]
vof_client_handle: setprop

Thread 4 "qemu-system-ppc" received signal SIGSEGV, Segmentation fault.
(gdb) bt
#0  0x0000000000000000 in  ()
#1  0x0000555555a5c2bf in vof_setprop
     (vof=0x7ffff48e9420, vallen=4, valaddr=<optimized out>, pname=<optimized out>, nodeph=8, fdt=0x7fff8aaff010, ms=0x5555564f8800)
     at ../hw/ppc/vof.c:308
#2  0x0000555555a5c2bf in vof_client_handle
     (nrets=1, rets=0x7ffff48e93f0, nargs=4, args=0x7ffff48e93c0, service=0x7ffff48e9460 "setprop",
      vof=0x7ffff48e9420, fdt=0x7fff8aaff010, ms=0x5555564f8800) at ../hw/ppc/vof.c:842
#3  0x0000555555a5c2bf in vof_client_call
     (ms=0x5555564f8800, vof=vof@entry=0x55555662a3d0, fdt=fdt@entry=0x7fff8aaff010, args_real=args_real@entry=23580472)
     at ../hw/ppc/vof.c:935

loooks like it's trying to set /chosen/linux,initrd-start:

(gdb) up
#1  0x0000555555a5c2bf in vof_setprop (vof=0x7ffff48e9420, vallen=4, valaddr=<optimized out>, pname=<optimized out>, nodeph=8,
     fdt=0x7fff8aaff010, ms=0x5555564f8800) at ../hw/ppc/vof.c:308
308	        if (!vmc->setprop(ms, nodepath, propname, val, vallen)) {
(gdb) p nodepath
$1 = "/chosen\000\060/rPC,750CXE/", '\000' <repeats 234 times>
(gdb) p propname
$2 = "linux,initrd-start\000linux,initrd-end\000linux,cmdline-timeout\000bootarg"
(gdb) p val
$3 = <optimized out>

I think I need the callback for setprop in TYPE_VOF_MACHINE_IF. I can copy 
spapr_vof_setprop() but some explanation on why that's needed might help. 
Ciould I just do fdt_setprop in my callback as vof_setprop() would do 
without a machine callback or is there some special handling needed for 
these properties?

Regards.
BALATON Zoltan
BALATON Zoltan May 22, 2021, 4:46 p.m. UTC | #10
On Sat, 22 May 2021, BALATON Zoltan wrote:
> On Sat, 22 May 2021, BALATON Zoltan wrote:
>> On Sat, 22 May 2021, Alexey Kardashevskiy wrote:
>>> VOF itself does not prints anything in this patch.
>> 
>> However it seems to be needed for linux as the first thing it does seems to 
>> be getting /chosen/stdout and calls exit if it returns nothing. So I'll 
>> need this at least for linux. (I think MorphOS may also query it to print a 
>> banner or some messages but not sure it needs it, at least it does not 
>> abort right away if not found.)
>> 
>>>> but to see Linux output do I need a stdout in VOF or it will just open 
>>>> the serial with its own driver and use that?
>>>> So I'm not sure what's the stdout parts in the current vof patch does and 
>>>> if I need that for anything. I'll try to experiment with it some more but 
>>>> fixing the ld and Kconfig seems to be enough to get it work for me.
>>> 
>>> So for the client to print something, /chosen/stdout needs to have a valid 
>>> ihandle.
>>> The only way to get a valid ihandle is having a valid phandle which 
>>> vof_client_open() can open.
>>> A valid phandle is a phandle of any node in the device tree. On spapr we 
>>> pick some spapr-vty, open it and store in /chosen/stdout.
>>> 
>>> From this point output from the client can be seen via a tracepoint.
>
> I've got it now. Looking at the original firmware device tree dump:
>
> https://osdn.net/projects/qmiga/wiki/SubprojectPegasos2/attach/PegasosII_OFW-Dump.txt
>
> I see that /chosen/stdout points to "screen" which is an alias to 
> /bootconsole. Just adding an empty /bootconsole node in the device tree and 
> vof_client_open_store() that as /chosen/stdout works and I get output via 
> vof_write traces so this is enough for now to test Linux. Properly connecting 
> a serial backend can thus be postponed.

Using /failsafe instead of /bootconsole is even better because Linux then 
adds console=ttyS0 to the bootargs by default as it knows that's a serial 
port.

> So with this the Linux kernel does not abort on the first device tree access 
> but starts to decompress itself then the embedded initrd and crashes at 
> calling setprop:
>
> [...]
> vof_client_handle: setprop
>
> Thread 4 "qemu-system-ppc" received signal SIGSEGV, Segmentation fault.
> (gdb) bt
> #0  0x0000000000000000 in  ()
> #1  0x0000555555a5c2bf in vof_setprop
>    (vof=0x7ffff48e9420, vallen=4, valaddr=<optimized out>, pname=<optimized 
> out>, nodeph=8, fdt=0x7fff8aaff010, ms=0x5555564f8800)
>    at ../hw/ppc/vof.c:308
> #2  0x0000555555a5c2bf in vof_client_handle
>    (nrets=1, rets=0x7ffff48e93f0, nargs=4, args=0x7ffff48e93c0, 
> service=0x7ffff48e9460 "setprop",
>     vof=0x7ffff48e9420, fdt=0x7fff8aaff010, ms=0x5555564f8800) at 
> ../hw/ppc/vof.c:842
> #3  0x0000555555a5c2bf in vof_client_call
>    (ms=0x5555564f8800, vof=vof@entry=0x55555662a3d0, 
> fdt=fdt@entry=0x7fff8aaff010, args_real=args_real@entry=23580472)
>    at ../hw/ppc/vof.c:935
>
> loooks like it's trying to set /chosen/linux,initrd-start:
>
> (gdb) up
> #1  0x0000555555a5c2bf in vof_setprop (vof=0x7ffff48e9420, vallen=4, 
> valaddr=<optimized out>, pname=<optimized out>, nodeph=8,
>    fdt=0x7fff8aaff010, ms=0x5555564f8800) at ../hw/ppc/vof.c:308
> 308	        if (!vmc->setprop(ms, nodepath, propname, val, vallen)) {
> (gdb) p nodepath
> $1 = "/chosen\000\060/rPC,750CXE/", '\000' <repeats 234 times>
> (gdb) p propname
> $2 = 
> "linux,initrd-start\000linux,initrd-end\000linux,cmdline-timeout\000bootarg"
> (gdb) p val
> $3 = <optimized out>
>
> I think I need the callback for setprop in TYPE_VOF_MACHINE_IF. I can copy 
> spapr_vof_setprop() but some explanation on why that's needed might help. 
> Ciould I just do fdt_setprop in my callback as vof_setprop() would do without 
> a machine callback or is there some special handling needed for these 
> properties?

Just returning true from the setprop callback of the VofMachineIfClass for 
now to see what it would do and then it gets to all the way of calling 
quiesce. Unfortunately it then tries to call prom_printf on Pegasos2 as 
seen here:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/powerpc/kernel/prom_init.c?h=v4.14.233#n3261

which does not work because I have to shut down vhyp at quiesce otherwise 
it trips an assert on writing sdr1 (and may also interfere with the 
guest's usage of syscalls). So I need a way to not generate an exception 
if the guest calls back into OF after quiesce. A hacky solution is to 
patch out the sc 1 or _prom_entry point to just return after quiesce but 
maybe a better way is needed such as a switch in vof.bin that it checks 
before doing a syscall. Other than this problem it seems to work for the 
most part so maybe making the _prom_entry check some global value that I 
can set from quiesce to stop it doing syscalls and just return would be 
the simplest way to avoid this crash in Linux and not need a special 
version of vof for pegasos2. (MorphOS does not seem to call OF after 
quiesce which seems safer to do anyway, don't know why Linux does that. 
It could just print that one line before quiesce and then it would work, 
unfortunately that's not what they did.)

Regards.
BALATON Zoltan
Alexey Kardashevskiy May 23, 2021, 3:20 a.m. UTC | #11
On 22/05/2021 23:01, BALATON Zoltan wrote:
> On Sat, 22 May 2021, Alexey Kardashevskiy wrote:
>> On 21/05/2021 19:05, BALATON Zoltan wrote:
>>> On Fri, 21 May 2021, Alexey Kardashevskiy wrote:
>>>> On 21/05/2021 07:59, BALATON Zoltan wrote:
>>>>> On Thu, 20 May 2021, Alexey Kardashevskiy wrote:
>>>>>> The PAPR platform describes an OS environment that's presented by
>>>>>> a combination of a hypervisor and firmware. The features it specifies
>>>>>> require collaboration between the firmware and the hypervisor.
>>>>>>
>>>>>> Since the beginning, the runtime component of the firmware (RTAS) has
>>>>>> been implemented as a 20 byte shim which simply forwards it to
>>>>>> a hypercall implemented in qemu. The boot time firmware component is
>>>>>> SLOF - but a build that's specific to qemu, and has always needed 
>>>>>> to be
>>>>>> updated in sync with it. Even though we've managed to limit the 
>>>>>> amount
>>>>>> of runtime communication we need between qemu and SLOF, there's some,
>>>>>> and it has become increasingly awkward to handle as we've implemented
>>>>>> new features.
>>>>>>
>>>>>> This implements a boot time OF client interface (CI) which is
>>>>>> enabled by a new "x-vof" pseries machine option (stands for 
>>>>>> "Virtual Open
>>>>>> Firmware). When enabled, QEMU implements the custom H_OF_CLIENT hcall
>>>>>> which implements Open Firmware Client Interface (OF CI). This allows
>>>>>> using a smaller stateless firmware which does not have to manage
>>>>>> the device tree.
>>>>>>
>>>>>> The new "vof.bin" firmware image is included with source code under
>>>>>> pc-bios/. It also includes RTAS blob.
>>>>>>
>>>>>> This implements a handful of CI methods just to get -kernel/-initrd
>>>>>> working. In particular, this implements the device tree fetching and
>>>>>> simple memory allocator - "claim" (an OF CI memory allocator) and 
>>>>>> updates
>>>>>> "/memory@0/available" to report the client about available memory.
>>>>>>
>>>>>> This implements changing some device tree properties which we know 
>>>>>> how
>>>>>> to deal with, the rest is ignored. To allow changes, this skips
>>>>>> fdt_pack() when x-vof=on as not packing the blob leaves some room for
>>>>>> appending.
>>>>>>
>>>>>> In absence of SLOF, this assigns phandles to device tree nodes to 
>>>>>> make
>>>>>> device tree traversing work.
>>>>>>
>>>>>> When x-vof=on, this adds "/chosen" every time QEMU (re)builds a tree.
>>>>>>
>>>>>> This adds basic instances support which are managed by a hash map
>>>>>> ihandle -> [phandle].
>>>>>>
>>>>>> Before the guest started, the used memory is:
>>>>>> 0..e60 - the initial firmware
>>>>>> 8000..10000 - stack
>>>>>> 400000.. - kernel
>>>>>> 3ea0000.. - initramdisk
>>>>>>
>>>>>> This OF CI does not implement "interpret".
>>>>>>
>>>>>> Unlike SLOF, this does not format uninitialized nvram. Instead, this
>>>>>> includes a disk image with pre-formatted nvram.
>>>>>>
>>>>>> With this basic support, this can only boot into kernel directly.
>>>>>> However this is just enough for the petitboot kernel and 
>>>>>> initradmdisk to
>>>>>> boot from any possible source. Note this requires reasonably 
>>>>>> recent guest
>>>>>> kernel with:
>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df5be5be8735 
>>>>>> The immediate benefit is much faster booting time which especially
>>>>>> crucial with fully emulated early CPU bring up environments. Also 
>>>>>> this
>>>>>> may come handy when/if GRUB-in-the-userspace sees light of the day.
>>>>>>
>>>>>> This separates VOF and sPAPR in a hope that VOF bits may be reused by
>>>>>> other POWERPC boards which do not support pSeries.
>>>>>>
>>>>>> This is coded in assumption that later on we might be adding 
>>>>>> support for
>>>>>> booting from QEMU backends (blockdev is the first candidate) without
>>>>>> devices/drivers in between as OF1275 does not require that and
>>>>>> it is quite easy to so.
>>>>>>
>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>> ---
>>>>>>
>>>>>> The example command line is:
>>>>>>
>>>>>> /home/aik/pbuild/qemu-killslof-localhost-ppc64/qemu-system-ppc64 \
>>>>>> -nodefaults \
>>>>>> -chardev stdio,id=STDIO0,signal=off,mux=on \
>>>>>> -device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
>>>>>> -mon id=MON0,chardev=STDIO0,mode=readline \
>>>>>> -nographic \
>>>>>> -vga none \
>>>>>> -enable-kvm \
>>>>>> -m 8G \
>>>>>> -machine 
>>>>>> pseries,x-vof=on,cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off 
>>>>>> \
>>>>>> -kernel pbuild/kernel-le-guest/vmlinux \
>>>>>> -initrd pb/rootfs.cpio.xz \
>>>>>> -drive 
>>>>>> id=DRIVE0,if=none,file=./p/qemu-killslof/pc-bios/vof-nvram.bin,format=raw 
>>>>>> \
>>>>>> -global spapr-nvram.drive=DRIVE0 \
>>>>>> -snapshot \
>>>>>> -smp 8,threads=8 \
>>>>>> -L /home/aik/t/qemu-ppc64-bios/ \
>>>>>> -trace events=qemu_trace_events \
>>>>>> -d guest_errors \
>>>>>> -chardev socket,id=SOCKET0,server,nowait,path=qemu.mon.tmux26 \
>>>>>> -mon chardev=SOCKET0,mode=control
>>>>>>
>>>>>> ---
>>>>>> Changes:
>>>>>> v20:
>>>>>> * compile vof.bin with -mcpu=power4 for better compatibility
>>>>>> * s/std/stw/ in entry.S to make it work on ppc32
>>>>>> * fixed dt_available property to support both 32 and 64bit
>>>>>> * shuffled prom_args handling code
>>>>>> * do not enforce 32bit in MSR (again, to support 32bit platforms)
>>>>>>
>>>>>
>>>>> [...]
>>>>>
>>>>>> diff --git a/default-configs/devices/ppc64-softmmu.mak 
>>>>>> b/default-configs/devices/ppc64-softmmu.mak
>>>>>> index ae0841fa3a18..9fb201dfacfa 100644
>>>>>> --- a/default-configs/devices/ppc64-softmmu.mak
>>>>>> +++ b/default-configs/devices/ppc64-softmmu.mak
>>>>>> @@ -9,3 +9,4 @@ CONFIG_POWERNV=y
>>>>>>  # For pSeries
>>>>>>  CONFIG_PSERIES=y
>>>>>>  CONFIG_NVDIMM=y
>>>>>> +CONFIG_VOF=y
>>>>>> diff --git a/hw/ppc/Kconfig b/hw/ppc/Kconfig
>>>>>> index e51e0e5e5ac6..964510dfc73d 100644
>>>>>> --- a/hw/ppc/Kconfig
>>>>>> +++ b/hw/ppc/Kconfig
>>>>>> @@ -143,3 +143,6 @@ config FW_CFG_PPC
>>>>>>
>>>>>>  config FDT_PPC
>>>>>>      bool
>>>>>> +
>>>>>> +config VOF
>>>>>> +    bool
>>>>>
>>>>> I think you should just add "select VOF" to config PSERIES section 
>>>>> in Kconfig instead of adding it to 
>>>>> default-configs/devices/ppc64-softmmu.mak. 
>>>>
>>>> oh well, can do that too.
>>>
>>> I think most config options should be selected by KConfig and the 
>>> default config should only include machines, otherwise VOF would be 
>>> added also when you don't compile PSERIES or PEGASOS2. With select in 
>>> Kconfig it will be added when needed. That's why it's better to use 
>>> select in this case.
>>>
>>>>>  That should do it, it works in my updated pegasos2 patch:
>>>>>
>>>>> https://osdn.net/projects/qmiga/scm/git/qemu/commits/3c1fad08469b4d3c04def22044e52b2d27774a61 
>>>>> [...]
>>>>>> diff --git a/pc-bios/vof/entry.S b/pc-bios/vof/entry.S
>>>>>> new file mode 100644
>>>>>> index 000000000000..569688714c91
>>>>>> --- /dev/null
>>>>>> +++ b/pc-bios/vof/entry.S
>>>>>> @@ -0,0 +1,51 @@
>>>>>> +#define LOAD32(rn, name)    \
>>>>>> +    lis     rn,name##@h;    \
>>>>>> +    ori     rn,rn,name##@l
>>>>>> +
>>>>>> +#define ENTRY(func_name)    \
>>>>>> +    .text;                  \
>>>>>> +    .align  2;              \
>>>>>> +    .globl  .func_name;     \
>>>>>> +    .func_name:             \
>>>>>> +    .globl  func_name;      \
>>>>>> +    func_name:
>>>>>> +
>>>>>> +#define KVMPPC_HCALL_BASE       0xf000
>>>>>> +#define KVMPPC_H_RTAS           (KVMPPC_HCALL_BASE + 0x0)
>>>>>> +#define KVMPPC_H_VOF_CLIENT     (KVMPPC_HCALL_BASE + 0x5)
>>>>>> +
>>>>>> +    . = 0x100 /* Do exactly as SLOF does */
>>>>>> +
>>>>>> +ENTRY(_start)
>>>>>> +#    LOAD32(%r31, 0) /* Go 32bit mode */
>>>>>> +#    mtmsrd %r31,0
>>>>>> +    LOAD32(2, __toc_start)
>>>>>> +    b entry_c
>>>>>> +
>>>>>> +ENTRY(_prom_entry)
>>>>>> +    LOAD32(2, __toc_start)
>>>>>> +    stwu    %r1,-112(%r1)
>>>>>> +    stw     %r31,104(%r1)
>>>>>> +    mflr    %r31
>>>>>> +    bl prom_entry
>>>>>> +    nop
>>>>>> +    mtlr    %r31
>>>>>> +    ld      %r31,104(%r1)
>>>>>
>>>>> It's getting there, now I see the first client call from the guest 
>>>>> boot code but then it crashes on this ld opcode which apparently is 
>>>>> 64 bit only:
>>>>
>>>> Oh right.
>>>>
>>>>
>>>>> Hopefully this is the last such opcode left before I can really 
>>>>> test this.
>>>>
>>>> Make it lwz, and test it?
>>>
>>> Yes, figured that out too after sending this message. Replacing with 
>>> lwz works but I wonder that now you have stwu lwz do the stack 
>>> offsets need adjusting too or you just waste 4 bytes now?
>>
>> Well, this assumes the 64bit client and that ABI. I think ideally the 
>> firmware is supposed to use its own stack but I did not bother here. I 
>> do not know 32bit ABI at all so say whether the existing code should 
>> just work or not :-/
> 
> It seems to work so that's OK, just thought if the firmware is 32 bit it 
> does not need 64 bit values on stack but if that's also potentially used 
> by a 64 bit kernel then it may be better to keep it that way to avoid 
> confusion. With the 64 bit opcodes replaced it seems to work on pegasos2 
> and the guest can call CI functions and get a reply so maybe it's just a 
> few wasted bytes that's not a big deal.
> 
>>> With lwz here I found no further 64 bit opcodes and the guest boot 
>>> code could walk the device tree. It failed later but I think that's 
>>> because I'll need to fill more info about the machine in the device 
>>> tree. I'll experiment with that but it looks like it could work at 
>>> least for MorphOS. I'll have to try Linux too.
>>
>> There are plenty of tracepoints, enable them all.
> 
> I'm running with -trace enable="vof*" but it does not give me too much 
> info as a lot of calls (such as peer, child, etc.) don't log anything 
> other than there was a hypercall so only get info about opening paths 
> and querying some props. The MorphOS boot.img just walks the device tree 
> gathering some data about the machine then calls quiesce and boot into 
> the OS that later tries to use the gathered info at which point it 
> crashes without any logs if some info is not as expected. This does not 
> make it easy to debug but I think once I fill the device tree enough 
> with all needed info it should work. Currently I'm missing info about 
> PCI devices that it may need.


One thing to note about PCI is that normally I think the client expects 
the firmware to do PCI probing and SLOF does it. But VOF does not and 
Linux scans PCI bus(es) itself. Might be a problem for you kernel.


> 
>>>>> Do you have some info on how the stdout works in VOF? I think I'll 
>>>>> need that to test with Linux and get output but I'm not sure what's 
>>>>> needed on the machine side.
>>>>
>>>> VOF opens stsout and stores the ihandle (in fdt) which the client 
>>>> (==kernel) uses for writing. To make it work properly, you need to 
>>>> hook up that instance to a device backend similar to what I have for 
>>>> spapr-vty:
>>>>
>>>> https://github.com/aik/qemu/commit/a381a5b50c23c74013e2bd39cc5dad5b6385965d 
>>>>
>>>> This is not a part of this patch as I'm trying to keep things 
>>>> simpler and accessing backends from VOF is still unsettled. But 
>>>> there is a workaround which  is trace_vof_write, I use this. Thanks,
>>>
>>> The above patch is about stdin but stdout seems to be added by the 
>>> current vof patch. What is spapr-vty?
>>
>> It is pseries' paravirtual serial device, pegasos does not have it.
>>
>>> I don't think I have something similar in pegasos2 where I just have 
>>> a normal serial port created by ISASuperIO in the vt8231 model.
>>
>> Correct.
>>
>>> Can I use that backend somehow or have to create some other serial 
>>> device to connect to stdout?
>>> Does trace_vof_write work for stuff output by the guest?
>>> I guess that's only for things printed by VOF itself
>>
>> VOF itself does not prints anything in this patch.
> 
> However it seems to be needed for linux as the first thing it does seems 
> to be getting /chosen/stdout and calls exit if it returns nothing. So 

Right, Linux does but VOF (==vof.bin) does not.

> I'll need this at least for linux. (I think MorphOS may also query it to 
> print a banner or some messages but not sure it needs it, at least it 
> does not abort right away if not found.)

Tracepoints print this :)

>>> but to see Linux output do I need a stdout in VOF or it will just 
>>> open the serial with its own driver and use that?
>>> So I'm not sure what's the stdout parts in the current vof patch does 
>>> and if I need that for anything. I'll try to experiment with it some 
>>> more but fixing the ld and Kconfig seems to be enough to get it work 
>>> for me.
>>
>> So for the client to print something, /chosen/stdout needs to have a 
>> valid ihandle.
>> The only way to get a valid ihandle is having a valid phandle which 
>> vof_client_open() can open.
>> A valid phandle is a phandle of any node in the device tree. On spapr 
>> we pick some spapr-vty, open it and store in /chosen/stdout.
>>
>> From this point output from the client can be seen via a tracepoint.
>>
>> Now if we want proper output without tracepoints - we need to hook it 
>> up with some chardev backend (not a device such a vt8231 or spapr-vty 
>> but backend).
> 
> I don't know much about it but devices are also connected to some 
> backend so is it possible to use the same backend for VOF as used for 
> the normal serial port?

Yes but with this initial patch there is no backend support, you only 
get tracepoints.

> But I need a way to find that and connect it to 
> VOF and I'm not qure how to do that yet.

Pick some device in the machine reset code (or you can open the root - 
"/"), resolve its FW (==FDT) path, call vof_client_open_store() on it, 
it will store ihandle in the FDT. This will enable stdout and the output 
can be seen via tracepoint.


> Or do I need to create a 
> separate serial backend and connect that to VOF? I'll try to look at 
> spapr-vty to see what it does.

No additional devices needed.


> 
>> https://github.com/aik/qemu/commit/a381a5b50c23c74013e2bd3 does this:
>> 1. when a phandle is open, QEMU will search for DeviceState* for the 
>> specific FDT node and get a chardev from the device.
>> 2. when write() is called, QEMU calls qemu_chr_fe_write_all() on 
>> chardev from 1.
>>
>> From this point you do not need a tracepoint and the output will 
>> appears in the console you set up for stdout.
>>
>> Now if you want input from this console, things get tricky. First, on 
>> powernv/pseries we only need this for grub as otherwise the kernel has 
>> all the drivers needed and will not use the client interface. For the 
>> grub, we need to provide a valid ihandle for /chosen/stdin which is 
>> easy but implementing read() on this is not as there is no simple 
>> device-type-independend way of reading from chardev. I hacked it for 
>> spapr-tvy but other serial devices will need special handling, or 
>> we'll have to introduce some VOF_SERIAL_READ interface for those which 
>> will face opposition :)
>>
>> Makes sense?
> 
> It explains things a bit but still not entirely clear how can I get 
> something to add as a stdout. With the pegasos2 firmware it puts the 
> serial device there normally that it inits and opens. Without that 
> firmware we have to somehow do that from QEMU so find the serial backend 
> used by the serial device within the vt8231 model (or use a different 
> backend just for this?) then open it and put it in the device tree. If 
> that's correct or how to do it is not clear yet.

spapr looks through all spapr-vty and picks one with the lowest @reg. 
You can do a similar thing. Or add a machine option with a serial device 
id which you want to be the default console. So many options :)
Alexey Kardashevskiy May 23, 2021, 3:31 a.m. UTC | #12
On 23/05/2021 01:02, BALATON Zoltan wrote:
> On Sat, 22 May 2021, BALATON Zoltan wrote:
>> On Sat, 22 May 2021, Alexey Kardashevskiy wrote:
>>> VOF itself does not prints anything in this patch.
>>
>> However it seems to be needed for linux as the first thing it does 
>> seems to be getting /chosen/stdout and calls exit if it returns 
>> nothing. So I'll need this at least for linux. (I think MorphOS may 
>> also query it to print a banner or some messages but not sure it needs 
>> it, at least it does not abort right away if not found.)
>>
>>>> but to see Linux output do I need a stdout in VOF or it will just 
>>>> open the serial with its own driver and use that?
>>>> So I'm not sure what's the stdout parts in the current vof patch 
>>>> does and if I need that for anything. I'll try to experiment with it 
>>>> some more but fixing the ld and Kconfig seems to be enough to get it 
>>>> work for me.
>>>
>>> So for the client to print something, /chosen/stdout needs to have a 
>>> valid ihandle.
>>> The only way to get a valid ihandle is having a valid phandle which 
>>> vof_client_open() can open.
>>> A valid phandle is a phandle of any node in the device tree. On spapr 
>>> we pick some spapr-vty, open it and store in /chosen/stdout.
>>>
>>> From this point output from the client can be seen via a tracepoint.
> 
> I've got it now. Looking at the original firmware device tree dump:
> 
> https://osdn.net/projects/qmiga/wiki/SubprojectPegasos2/attach/PegasosII_OFW-Dump.txt 
> 
> 
> I see that /chosen/stdout points to "screen" which is an alias to 
> /bootconsole. Just adding an empty /bootconsole node in the device tree 
> and vof_client_open_store() that as /chosen/stdout works and I get 
> output via vof_write traces so this is enough for now to test Linux. 
> Properly connecting a serial backend can thus be postponed.
> 
> So with this the Linux kernel does not abort on the first device tree 
> access but starts to decompress itself then the embedded initrd and 
> crashes at calling setprop:
> 
> [...]
> vof_client_handle: setprop
> 
> Thread 4 "qemu-system-ppc" received signal SIGSEGV, Segmentation fault.
> (gdb) bt
> #0  0x0000000000000000 in  ()
> #1  0x0000555555a5c2bf in vof_setprop
>      (vof=0x7ffff48e9420, vallen=4, valaddr=<optimized out>, 
> pname=<optimized out>, nodeph=8, fdt=0x7fff8aaff010, ms=0x5555564f8800)
>      at ../hw/ppc/vof.c:308
> #2  0x0000555555a5c2bf in vof_client_handle
>      (nrets=1, rets=0x7ffff48e93f0, nargs=4, args=0x7ffff48e93c0, 
> service=0x7ffff48e9460 "setprop",
>       vof=0x7ffff48e9420, fdt=0x7fff8aaff010, ms=0x5555564f8800) at 
> ../hw/ppc/vof.c:842
> #3  0x0000555555a5c2bf in vof_client_call
>      (ms=0x5555564f8800, vof=vof@entry=0x55555662a3d0, 
> fdt=fdt@entry=0x7fff8aaff010, args_real=args_real@entry=23580472)
>      at ../hw/ppc/vof.c:935
> 
> loooks like it's trying to set /chosen/linux,initrd-start:

It is not horribly clear why it crashed though.

> 
> (gdb) up
> #1  0x0000555555a5c2bf in vof_setprop (vof=0x7ffff48e9420, vallen=4, 
> valaddr=<optimized out>, pname=<optimized out>, nodeph=8,
>      fdt=0x7fff8aaff010, ms=0x5555564f8800) at ../hw/ppc/vof.c:308
> 308            if (!vmc->setprop(ms, nodepath, propname, val, vallen)) {
> (gdb) p nodepath
> $1 = "/chosen\000\060/rPC,750CXE/", '\000' <repeats 234 times>
> (gdb) p propname
> $2 = 
> "linux,initrd-start\000linux,initrd-end\000linux,cmdline-timeout\000bootarg" 
> 
> (gdb) p val
> $3 = <optimized out>
> 
> I think I need the callback for setprop in TYPE_VOF_MACHINE_IF. I can 
> copy spapr_vof_setprop() but some explanation on why that's needed might 
> help. Ciould I just do fdt_setprop in my callback as vof_setprop() would 
> do without a machine callback or is there some special handling needed 
> for these properties?

The short answer is yes, you do not need TYPE_VOF_MACHINE_IF.

The long answer is that we build the FDT on spapr twice:
1. at the reset time and
2. after "ibm,client-arhitecture-support" (early in the boot the spapr 
paravirtual client says what it supports - ISA level, MMU features, etc)

Between 1 and 2 the kernel moves initrd and we do not update the QEMU's 
version of its location, the tree at 2) will have the old values.

So for that reason I have TYPE_VOF_MACHINE_IF. You most definitely do 
not need it.
Alexey Kardashevskiy May 23, 2021, 3:41 a.m. UTC | #13
On 23/05/2021 02:46, BALATON Zoltan wrote:
> On Sat, 22 May 2021, BALATON Zoltan wrote:
>> On Sat, 22 May 2021, BALATON Zoltan wrote:
>>> On Sat, 22 May 2021, Alexey Kardashevskiy wrote:
>>>> VOF itself does not prints anything in this patch.
>>>
>>> However it seems to be needed for linux as the first thing it does 
>>> seems to be getting /chosen/stdout and calls exit if it returns 
>>> nothing. So I'll need this at least for linux. (I think MorphOS may 
>>> also query it to print a banner or some messages but not sure it 
>>> needs it, at least it does not abort right away if not found.)
>>>
>>>>> but to see Linux output do I need a stdout in VOF or it will just 
>>>>> open the serial with its own driver and use that?
>>>>> So I'm not sure what's the stdout parts in the current vof patch 
>>>>> does and if I need that for anything. I'll try to experiment with 
>>>>> it some more but fixing the ld and Kconfig seems to be enough to 
>>>>> get it work for me.
>>>>
>>>> So for the client to print something, /chosen/stdout needs to have a 
>>>> valid ihandle.
>>>> The only way to get a valid ihandle is having a valid phandle which 
>>>> vof_client_open() can open.
>>>> A valid phandle is a phandle of any node in the device tree. On 
>>>> spapr we pick some spapr-vty, open it and store in /chosen/stdout.
>>>>
>>>> From this point output from the client can be seen via a tracepoint.
>>
>> I've got it now. Looking at the original firmware device tree dump:
>>
>> https://osdn.net/projects/qmiga/wiki/SubprojectPegasos2/attach/PegasosII_OFW-Dump.txt 
>>
>>
>> I see that /chosen/stdout points to "screen" which is an alias to 
>> /bootconsole. Just adding an empty /bootconsole node in the device 
>> tree and vof_client_open_store() that as /chosen/stdout works and I 
>> get output via vof_write traces so this is enough for now to test 
>> Linux. Properly connecting a serial backend can thus be postponed.
> 
> Using /failsafe instead of /bootconsole is even better because Linux 
> then adds console=ttyS0 to the bootargs by default as it knows that's a 
> serial port.

When linux boots so far that it can use whatever is passed in "console=" 
- the client interface is done pretty much and the output happens 
without it.


> 
>> So with this the Linux kernel does not abort on the first device tree 
>> access but starts to decompress itself then the embedded initrd and 
>> crashes at calling setprop:
>>
>> [...]
>> vof_client_handle: setprop
>>
>> Thread 4 "qemu-system-ppc" received signal SIGSEGV, Segmentation fault.
>> (gdb) bt
>> #0  0x0000000000000000 in  ()
>> #1  0x0000555555a5c2bf in vof_setprop
>>    (vof=0x7ffff48e9420, vallen=4, valaddr=<optimized out>, 
>> pname=<optimized out>, nodeph=8, fdt=0x7fff8aaff010, ms=0x5555564f8800)
>>    at ../hw/ppc/vof.c:308
>> #2  0x0000555555a5c2bf in vof_client_handle
>>    (nrets=1, rets=0x7ffff48e93f0, nargs=4, args=0x7ffff48e93c0, 
>> service=0x7ffff48e9460 "setprop",
>>     vof=0x7ffff48e9420, fdt=0x7fff8aaff010, ms=0x5555564f8800) at 
>> ../hw/ppc/vof.c:842
>> #3  0x0000555555a5c2bf in vof_client_call
>>    (ms=0x5555564f8800, vof=vof@entry=0x55555662a3d0, 
>> fdt=fdt@entry=0x7fff8aaff010, args_real=args_real@entry=23580472)
>>    at ../hw/ppc/vof.c:935
>>
>> loooks like it's trying to set /chosen/linux,initrd-start:
>>
>> (gdb) up
>> #1  0x0000555555a5c2bf in vof_setprop (vof=0x7ffff48e9420, vallen=4, 
>> valaddr=<optimized out>, pname=<optimized out>, nodeph=8,
>>    fdt=0x7fff8aaff010, ms=0x5555564f8800) at ../hw/ppc/vof.c:308
>> 308            if (!vmc->setprop(ms, nodepath, propname, val, vallen)) {
>> (gdb) p nodepath
>> $1 = "/chosen\000\060/rPC,750CXE/", '\000' <repeats 234 times>
>> (gdb) p propname
>> $2 = 
>> "linux,initrd-start\000linux,initrd-end\000linux,cmdline-timeout\000bootarg" 
>>
>> (gdb) p val
>> $3 = <optimized out>
>>
>> I think I need the callback for setprop in TYPE_VOF_MACHINE_IF. I can 
>> copy spapr_vof_setprop() but some explanation on why that's needed 
>> might help. Ciould I just do fdt_setprop in my callback as 
>> vof_setprop() would do without a machine callback or is there some 
>> special handling needed for these properties?
> 
> Just returning true from the setprop callback of the VofMachineIfClass 
> for now to see what it would do and then it gets to all the way of 
> calling quiesce. Unfortunately it then tries to call prom_printf on 
> Pegasos2 as seen here:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/powerpc/kernel/prom_init.c?h=v4.14.233#n3261 
> 
> 
> which does not work because I have to shut down vhyp at quiesce 

What is vhyp and why do you have to shut it down?


> otherwise it trips an assert on writing sdr1 (and may also interfere 
> with the guest's usage of syscalls).

Where is that assert?

I am a bit lost here. Nothing in the current VOF should touch any actual 
device, it prints via tracepoints or (with that additional patch) to a 
chardev backend.


> So I need a way to not generate an 
> exception if the guest calls back into OF after quiesce. A hacky 
> solution is to patch out the sc 1 or _prom_entry point to just return 
> after quiesce but maybe a better way is needed such as a switch in 
> vof.bin that it checks before doing a syscall. Other than this problem 
> it seems to work for the most part so maybe making the _prom_entry check 
> some global value that I can set from quiesce to stop it doing syscalls 
> and just return would be the simplest way to avoid this crash in Linux 
> and not need a special version of vof for pegasos2. (MorphOS does not 
> seem to call OF after quiesce which seems safer to do anyway, don't know 
> why Linux does that. It could just print that one line before quiesce 
> and then it would work, unfortunately that's not what they did.)

quiesce is supposed to wait until ongoing DMA is finished (or something 
like that), it was (people say) a request from Apple back then and was 
never really architected.
Alexey Kardashevskiy May 23, 2021, 3:47 a.m. UTC | #14
On 22/05/2021 23:08, BALATON Zoltan wrote:
> On Sat, 22 May 2021, Alexey Kardashevskiy wrote:
>> On 22/05/2021 05:57, BALATON Zoltan wrote:
>>> On Fri, 21 May 2021, BALATON Zoltan wrote:
>>>> On Fri, 21 May 2021, Alexey Kardashevskiy wrote:
>>>>> On 21/05/2021 07:59, BALATON Zoltan wrote:
>>>>>> On Thu, 20 May 2021, Alexey Kardashevskiy wrote:
>>>>>>> The PAPR platform describes an OS environment that's presented by
>>>>>>> a combination of a hypervisor and firmware. The features it 
>>>>>>> specifies
>>>>>>> require collaboration between the firmware and the hypervisor.
>>>>>>>
>>>>>>> Since the beginning, the runtime component of the firmware (RTAS) 
>>>>>>> has
>>>>>>> been implemented as a 20 byte shim which simply forwards it to
>>>>>>> a hypercall implemented in qemu. The boot time firmware component is
>>>>>>> SLOF - but a build that's specific to qemu, and has always needed 
>>>>>>> to be
>>>>>>> updated in sync with it. Even though we've managed to limit the 
>>>>>>> amount
>>>>>>> of runtime communication we need between qemu and SLOF, there's 
>>>>>>> some,
>>>>>>> and it has become increasingly awkward to handle as we've 
>>>>>>> implemented
>>>>>>> new features.
>>>>>>>
>>>>>>> This implements a boot time OF client interface (CI) which is
>>>>>>> enabled by a new "x-vof" pseries machine option (stands for 
>>>>>>> "Virtual Open
>>>>>>> Firmware). When enabled, QEMU implements the custom H_OF_CLIENT 
>>>>>>> hcall
>>>>>>> which implements Open Firmware Client Interface (OF CI). This allows
>>>>>>> using a smaller stateless firmware which does not have to manage
>>>>>>> the device tree.
>>>>>>>
>>>>>>> The new "vof.bin" firmware image is included with source code under
>>>>>>> pc-bios/. It also includes RTAS blob.
>>>>>>>
>>>>>>> This implements a handful of CI methods just to get -kernel/-initrd
>>>>>>> working. In particular, this implements the device tree fetching and
>>>>>>> simple memory allocator - "claim" (an OF CI memory allocator) and 
>>>>>>> updates
>>>>>>> "/memory@0/available" to report the client about available memory.
>>>>>>>
>>>>>>> This implements changing some device tree properties which we 
>>>>>>> know how
>>>>>>> to deal with, the rest is ignored. To allow changes, this skips
>>>>>>> fdt_pack() when x-vof=on as not packing the blob leaves some room 
>>>>>>> for
>>>>>>> appending.
>>>>>>>
>>>>>>> In absence of SLOF, this assigns phandles to device tree nodes to 
>>>>>>> make
>>>>>>> device tree traversing work.
>>>>>>>
>>>>>>> When x-vof=on, this adds "/chosen" every time QEMU (re)builds a 
>>>>>>> tree.
>>>>>>>
>>>>>>> This adds basic instances support which are managed by a hash map
>>>>>>> ihandle -> [phandle].
>>>>>>>
>>>>>>> Before the guest started, the used memory is:
>>>>>>> 0..e60 - the initial firmware
>>>>>>> 8000..10000 - stack
>>>>>>> 400000.. - kernel
>>>>>>> 3ea0000.. - initramdisk
>>>>>>>
>>>>>>> This OF CI does not implement "interpret".
>>>>>>>
>>>>>>> Unlike SLOF, this does not format uninitialized nvram. Instead, this
>>>>>>> includes a disk image with pre-formatted nvram.
>>>>>>>
>>>>>>> With this basic support, this can only boot into kernel directly.
>>>>>>> However this is just enough for the petitboot kernel and 
>>>>>>> initradmdisk to
>>>>>>> boot from any possible source. Note this requires reasonably 
>>>>>>> recent guest
>>>>>>> kernel with:
>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df5be5be8735 
>>>>>>> The immediate benefit is much faster booting time which especially
>>>>>>> crucial with fully emulated early CPU bring up environments. Also 
>>>>>>> this
>>>>>>> may come handy when/if GRUB-in-the-userspace sees light of the day.
>>>>>>>
>>>>>>> This separates VOF and sPAPR in a hope that VOF bits may be 
>>>>>>> reused by
>>>>>>> other POWERPC boards which do not support pSeries.
>>>>>>>
>>>>>>> This is coded in assumption that later on we might be adding 
>>>>>>> support for
>>>>>>> booting from QEMU backends (blockdev is the first candidate) without
>>>>>>> devices/drivers in between as OF1275 does not require that and
>>>>>>> it is quite easy to so.
>>>>>>>
>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>> ---
>>>>>>>
>>>>>>> The example command line is:
>>>>>>>
>>>>>>> /home/aik/pbuild/qemu-killslof-localhost-ppc64/qemu-system-ppc64 \
>>>>>>> -nodefaults \
>>>>>>> -chardev stdio,id=STDIO0,signal=off,mux=on \
>>>>>>> -device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
>>>>>>> -mon id=MON0,chardev=STDIO0,mode=readline \
>>>>>>> -nographic \
>>>>>>> -vga none \
>>>>>>> -enable-kvm \
>>>>>>> -m 8G \
>>>>>>> -machine 
>>>>>>> pseries,x-vof=on,cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off 
>>>>>>> \
>>>>>>> -kernel pbuild/kernel-le-guest/vmlinux \
>>>>>>> -initrd pb/rootfs.cpio.xz \
>>>>>>> -drive 
>>>>>>> id=DRIVE0,if=none,file=./p/qemu-killslof/pc-bios/vof-nvram.bin,format=raw 
>>>>>>> \
>>>>>>> -global spapr-nvram.drive=DRIVE0 \
>>>>>>> -snapshot \
>>>>>>> -smp 8,threads=8 \
>>>>>>> -L /home/aik/t/qemu-ppc64-bios/ \
>>>>>>> -trace events=qemu_trace_events \
>>>>>>> -d guest_errors \
>>>>>>> -chardev socket,id=SOCKET0,server,nowait,path=qemu.mon.tmux26 \
>>>>>>> -mon chardev=SOCKET0,mode=control
>>>>>>>
>>>>>>> ---
>>>>>>> Changes:
>>>>>>> v20:
>>>>>>> * compile vof.bin with -mcpu=power4 for better compatibility
>>>>>>> * s/std/stw/ in entry.S to make it work on ppc32
>>>>>>> * fixed dt_available property to support both 32 and 64bit
>>>>>>> * shuffled prom_args handling code
>>>>>>> * do not enforce 32bit in MSR (again, to support 32bit platforms)
>>>>>>>
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>>> diff --git a/default-configs/devices/ppc64-softmmu.mak 
>>>>>>> b/default-configs/devices/ppc64-softmmu.mak
>>>>>>> index ae0841fa3a18..9fb201dfacfa 100644
>>>>>>> --- a/default-configs/devices/ppc64-softmmu.mak
>>>>>>> +++ b/default-configs/devices/ppc64-softmmu.mak
>>>>>>> @@ -9,3 +9,4 @@ CONFIG_POWERNV=y
>>>>>>>  # For pSeries
>>>>>>>  CONFIG_PSERIES=y
>>>>>>>  CONFIG_NVDIMM=y
>>>>>>> +CONFIG_VOF=y
>>>>>>> diff --git a/hw/ppc/Kconfig b/hw/ppc/Kconfig
>>>>>>> index e51e0e5e5ac6..964510dfc73d 100644
>>>>>>> --- a/hw/ppc/Kconfig
>>>>>>> +++ b/hw/ppc/Kconfig
>>>>>>> @@ -143,3 +143,6 @@ config FW_CFG_PPC
>>>>>>>
>>>>>>>  config FDT_PPC
>>>>>>>      bool
>>>>>>> +
>>>>>>> +config VOF
>>>>>>> +    bool
>>>>>>
>>>>>> I think you should just add "select VOF" to config PSERIES section 
>>>>>> in Kconfig instead of adding it to 
>>>>>> default-configs/devices/ppc64-softmmu.mak. 
>>>>>
>>>>> oh well, can do that too.
>>>>
>>>> I think most config options should be selected by KConfig and the 
>>>> default config should only include machines, otherwise VOF would be 
>>>> added also when you don't compile PSERIES or PEGASOS2. With select 
>>>> in Kconfig it will be added when needed. That's why it's better to 
>>>> use select in this case.
>>>>
>>>>>>  That should do it, it works in my updated pegasos2 patch:
>>>>>>
>>>>>> https://osdn.net/projects/qmiga/scm/git/qemu/commits/3c1fad08469b4d3c04def22044e52b2d27774a61 
>>>>>> [...]
>>>>>>> diff --git a/pc-bios/vof/entry.S b/pc-bios/vof/entry.S
>>>>>>> new file mode 100644
>>>>>>> index 000000000000..569688714c91
>>>>>>> --- /dev/null
>>>>>>> +++ b/pc-bios/vof/entry.S
>>>>>>> @@ -0,0 +1,51 @@
>>>>>>> +#define LOAD32(rn, name)    \
>>>>>>> +    lis     rn,name##@h;    \
>>>>>>> +    ori     rn,rn,name##@l
>>>>>>> +
>>>>>>> +#define ENTRY(func_name)    \
>>>>>>> +    .text;                  \
>>>>>>> +    .align  2;              \
>>>>>>> +    .globl  .func_name;     \
>>>>>>> +    .func_name:             \
>>>>>>> +    .globl  func_name;      \
>>>>>>> +    func_name:
>>>>>>> +
>>>>>>> +#define KVMPPC_HCALL_BASE       0xf000
>>>>>>> +#define KVMPPC_H_RTAS           (KVMPPC_HCALL_BASE + 0x0)
>>>>>>> +#define KVMPPC_H_VOF_CLIENT     (KVMPPC_HCALL_BASE + 0x5)
>>>>>>> +
>>>>>>> +    . = 0x100 /* Do exactly as SLOF does */
>>>>>>> +
>>>>>>> +ENTRY(_start)
>>>>>>> +#    LOAD32(%r31, 0) /* Go 32bit mode */
>>>>>>> +#    mtmsrd %r31,0
>>>>>>> +    LOAD32(2, __toc_start)
>>>>>>> +    b entry_c
>>>>>>> +
>>>>>>> +ENTRY(_prom_entry)
>>>>>>> +    LOAD32(2, __toc_start)
>>>>>>> +    stwu    %r1,-112(%r1)
>>>>>>> +    stw     %r31,104(%r1)
>>>>>>> +    mflr    %r31
>>>>>>> +    bl prom_entry
>>>>>>> +    nop
>>>>>>> +    mtlr    %r31
>>>>>>> +    ld      %r31,104(%r1)
>>>>>>
>>>>>> It's getting there, now I see the first client call from the guest 
>>>>>> boot code but then it crashes on this ld opcode which apparently 
>>>>>> is 64 bit only:
>>>>>
>>>>> Oh right.
>>>>>
>>>>>
>>>>>> Hopefully this is the last such opcode left before I can really 
>>>>>> test this.
>>>>>
>>>>> Make it lwz, and test it?
>>>>
>>>> Yes, figured that out too after sending this message. Replacing with 
>>>> lwz works but I wonder that now you have stwu lwz do the stack 
>>>> offsets need adjusting too or you just waste 4 bytes now? With lwz 
>>>> here I found no further 64 bit opcodes and the guest boot code could 
>>>> walk the device tree. It failed later but I think that's because 
>>>> I'll need to fill more info about the machine in the device tree. 
>>>> I'll experiment with that but it looks like it could work at least 
>>>> for MorphOS. I'll have to try Linux too.
>>>
>>> I was trying to get a linux kernel from a debian powerpc iso to do 
>>> something (debian before 10.0 has Pegasos support) but I've run into 
>>> the problem that the kernel is loaded at 0x400000 but the start 
>>> address is at some offset from that. How do I set qemu,boot-kernel in 
>>> this case?
>>
>>
>> The pseries kernel can work from any location (and it relocates itself 
>> to 0 at some point) even though it is linked at c000.0000.0000.0000, 
>> and there is no start address offset:
>>
>> ===
>>> objdump -D ~/pbuild/kernel-le/vmlinux
>> /home/aik/pbuild/kernel-le/vmlinux:     file format elf64-powerpcle
>>
>>
>> Disassembly of section .head.text:
>>
>> c000000000000000 <__start>:
>> c000000000000000:       48 00 00 08     tdi     0,r0,72
>> c000000000000004:       2c 00 00 48     b       c000000000000030 
>> <__start+0x30>
>> ...
>> ===
>>
>> Not sure about pegasos2 kernels (or any ppc32 really), sorry.
> 
> The kernel from Debian 10.0 powerpc used on pegasos looks like this:
> 
> vmlinuz-chrp.initrd:     file format elf32-powerpc
> vmlinuz-chrp.initrd
> architecture: powerpc:common, flags 0x00000112:
> EXEC_P, HAS_SYMS, D_PAGED
> start address 0x004002fc
> 
> Program Header:
>      LOAD off    0x00010000 vaddr 0x00400000 paddr 0x00400000 align 2**16
>           filesz 0x0127b72a memsz 0x0127d5d8 flags rwx
>     STACK off    0x00000000 vaddr 0x00000000 paddr 0x00000000 align 2**4
>           filesz 0x00000000 memsz 0x00000000 flags rwx
>      NOTE off    0x000000b4 vaddr 0x00000000 paddr 0x00000000 align 2**0
>           filesz 0x0000002c memsz 0x00000000 flags ---
>      NOTE off    0x000000e0 vaddr 0x00000000 paddr 0x00000000 align 2**0
>           filesz 0x0000002c memsz 0x00000000 flags ---
> 
> Sections:
> Idx Name          Size      VMA       LMA       File off  Algn
>    0 .text         00008588  00400000  00400000  00010000  2**2
>                    CONTENTS, ALLOC, LOAD, READONLY, CODE
>    1 .text.unlikely 00000078  00408588  00408588  00018588  2**2
>                    CONTENTS, ALLOC, LOAD, READONLY, CODE
>    2 .data         00001bec  00409000  00409000  00019000  2**2
>                    CONTENTS, ALLOC, LOAD, DATA
>    3 .got          0000000c  0040abec  0040abec  0001abec  2**2
>                    CONTENTS, ALLOC, LOAD, DATA
>    4 __builtin_cmdline 00000800  0040abf8  0040abf8  0001abf8  2**2
>                    CONTENTS, ALLOC, LOAD, DATA
>    5 .kernel:vmlinux.strip 0047658e  0040c000  0040c000  0001c000  2**0
>                    CONTENTS, ALLOC, LOAD, READONLY, DATA
>    6 .kernel:initrd 00df872a  00883000  00883000  00493000  2**0
>                    CONTENTS, ALLOC, LOAD, READONLY, DATA
>    7 .bss          000015d8  0167c000  0167c000  0128b72a  2**2
>                    ALLOC
>    8 .debug_info   0000e7fd  00000000  00000000  0128b72a  2**0
>                    CONTENTS, READONLY, DEBUGGING
>    9 .debug_abbrev 00002a4f  00000000  00000000  01299f27  2**0
>                    CONTENTS, READONLY, DEBUGGING
>   10 .debug_loc    00009df1  00000000  00000000  0129c976  2**0
>                    CONTENTS, READONLY, DEBUGGING
>   11 .debug_aranges 00000250  00000000  00000000  012a6767  2**0
>                    CONTENTS, READONLY, DEBUGGING
>   12 .debug_line   000026b8  00000000  00000000  012a69b7  2**0
>                    CONTENTS, READONLY, DEBUGGING
>   13 .debug_str    00001d9c  00000000  00000000  012a906f  2**0
>                    CONTENTS, READONLY, DEBUGGING
>   14 .comment      0000001d  00000000  00000000  012aae0b  2**0
>                    CONTENTS, READONLY
>   15 .gnu.attributes 00000010  00000000  00000000  012aae28  2**0
>                    CONTENTS, READONLY
>   16 .debug_frame  00001c88  00000000  00000000  012aae38  2**2
>                    CONTENTS, READONLY, DEBUGGING
>   17 .debug_ranges 00000740  00000000  00000000  012acac0  2**0
>                    CONTENTS, READONLY, DEBUGGING
> 
> It even seems to have the initrd embedded in it. If I just use 0x400000 
> as start address it does not work, has to jump to the start address for 
> it to start correctly.
> 
>>> Because when I set it to the address/size where the kernel is loaded 
>>> it jumps to the beginnig not the correct start address. If I set the 
>>> address to the start address then size will be wrong so I don't know 
>>> how to set qemu,boot-kernel in this case or is there another property 
>>> to tell the start address?
>>> (Vof does not seem to check any other property and seems to assume 
>>> the entry point is the same as the load address but for this linux 
>>> kernel it's not.)
>>
>> I guess if you really need an offset, you'll have to add a new 
>> property ("qemu,boot-kernel-start"?) and look for it in the firmware. 
>> Or, say, put in gpr5 in your version of spapr_cpu_set_entry_state() 
>> and make boot_from_memory() use it.
> 
> Either way would work but I don't want to recompile vof.bin so if you 

I really do not want to add features with no user for it; and having 
this added with pegasos2 support make it clear why it is there. Also 
recompile is really simple :)


> implement any of these in the next version I can use that. For now I've 
> just set kernel address to the start address and decreased size a bit, 
> the memory for the kernel is still claimed correctly when it's loaded so 
> unless something relies on the size in qemu,boot-kernel it does not 
> matter and this way the kernel starts but only gets to finding no 
> /chosen/stdout and exit there so I can't try it until I resolve that.
BALATON Zoltan May 23, 2021, 11:19 a.m. UTC | #15
On Sun, 23 May 2021, Alexey Kardashevskiy wrote:
> On 22/05/2021 23:01, BALATON Zoltan wrote:
>> On Sat, 22 May 2021, Alexey Kardashevskiy wrote:
>>> On 21/05/2021 19:05, BALATON Zoltan wrote:
>>>> On Fri, 21 May 2021, Alexey Kardashevskiy wrote:
>>>>> On 21/05/2021 07:59, BALATON Zoltan wrote:
>>>>>> On Thu, 20 May 2021, Alexey Kardashevskiy wrote:
>>>>>>> The PAPR platform describes an OS environment that's presented by
>>>>>>> a combination of a hypervisor and firmware. The features it specifies
>>>>>>> require collaboration between the firmware and the hypervisor.
>>>>>>> 
>>>>>>> Since the beginning, the runtime component of the firmware (RTAS) has
>>>>>>> been implemented as a 20 byte shim which simply forwards it to
>>>>>>> a hypercall implemented in qemu. The boot time firmware component is
>>>>>>> SLOF - but a build that's specific to qemu, and has always needed to 
>>>>>>> be
>>>>>>> updated in sync with it. Even though we've managed to limit the amount
>>>>>>> of runtime communication we need between qemu and SLOF, there's some,
>>>>>>> and it has become increasingly awkward to handle as we've implemented
>>>>>>> new features.
>>>>>>> 
>>>>>>> This implements a boot time OF client interface (CI) which is
>>>>>>> enabled by a new "x-vof" pseries machine option (stands for "Virtual 
>>>>>>> Open
>>>>>>> Firmware). When enabled, QEMU implements the custom H_OF_CLIENT hcall
>>>>>>> which implements Open Firmware Client Interface (OF CI). This allows
>>>>>>> using a smaller stateless firmware which does not have to manage
>>>>>>> the device tree.
>>>>>>> 
>>>>>>> The new "vof.bin" firmware image is included with source code under
>>>>>>> pc-bios/. It also includes RTAS blob.
>>>>>>> 
>>>>>>> This implements a handful of CI methods just to get -kernel/-initrd
>>>>>>> working. In particular, this implements the device tree fetching and
>>>>>>> simple memory allocator - "claim" (an OF CI memory allocator) and 
>>>>>>> updates
>>>>>>> "/memory@0/available" to report the client about available memory.
>>>>>>> 
>>>>>>> This implements changing some device tree properties which we know how
>>>>>>> to deal with, the rest is ignored. To allow changes, this skips
>>>>>>> fdt_pack() when x-vof=on as not packing the blob leaves some room for
>>>>>>> appending.
>>>>>>> 
>>>>>>> In absence of SLOF, this assigns phandles to device tree nodes to make
>>>>>>> device tree traversing work.
>>>>>>> 
>>>>>>> When x-vof=on, this adds "/chosen" every time QEMU (re)builds a tree.
>>>>>>> 
>>>>>>> This adds basic instances support which are managed by a hash map
>>>>>>> ihandle -> [phandle].
>>>>>>> 
>>>>>>> Before the guest started, the used memory is:
>>>>>>> 0..e60 - the initial firmware
>>>>>>> 8000..10000 - stack
>>>>>>> 400000.. - kernel
>>>>>>> 3ea0000.. - initramdisk
>>>>>>> 
>>>>>>> This OF CI does not implement "interpret".
>>>>>>> 
>>>>>>> Unlike SLOF, this does not format uninitialized nvram. Instead, this
>>>>>>> includes a disk image with pre-formatted nvram.
>>>>>>> 
>>>>>>> With this basic support, this can only boot into kernel directly.
>>>>>>> However this is just enough for the petitboot kernel and initradmdisk 
>>>>>>> to
>>>>>>> boot from any possible source. Note this requires reasonably recent 
>>>>>>> guest
>>>>>>> kernel with:
>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df5be5be8735 
>>>>>>> The immediate benefit is much faster booting time which especially
>>>>>>> crucial with fully emulated early CPU bring up environments. Also this
>>>>>>> may come handy when/if GRUB-in-the-userspace sees light of the day.
>>>>>>> 
>>>>>>> This separates VOF and sPAPR in a hope that VOF bits may be reused by
>>>>>>> other POWERPC boards which do not support pSeries.
>>>>>>> 
>>>>>>> This is coded in assumption that later on we might be adding support 
>>>>>>> for
>>>>>>> booting from QEMU backends (blockdev is the first candidate) without
>>>>>>> devices/drivers in between as OF1275 does not require that and
>>>>>>> it is quite easy to so.
>>>>>>> 
>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>> ---
>>>>>>> 
>>>>>>> The example command line is:
>>>>>>> 
>>>>>>> /home/aik/pbuild/qemu-killslof-localhost-ppc64/qemu-system-ppc64 \
>>>>>>> -nodefaults \
>>>>>>> -chardev stdio,id=STDIO0,signal=off,mux=on \
>>>>>>> -device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
>>>>>>> -mon id=MON0,chardev=STDIO0,mode=readline \
>>>>>>> -nographic \
>>>>>>> -vga none \
>>>>>>> -enable-kvm \
>>>>>>> -m 8G \
>>>>>>> -machine 
>>>>>>> pseries,x-vof=on,cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off 
>>>>>>> \
>>>>>>> -kernel pbuild/kernel-le-guest/vmlinux \
>>>>>>> -initrd pb/rootfs.cpio.xz \
>>>>>>> -drive 
>>>>>>> id=DRIVE0,if=none,file=./p/qemu-killslof/pc-bios/vof-nvram.bin,format=raw 
>>>>>>> \
>>>>>>> -global spapr-nvram.drive=DRIVE0 \
>>>>>>> -snapshot \
>>>>>>> -smp 8,threads=8 \
>>>>>>> -L /home/aik/t/qemu-ppc64-bios/ \
>>>>>>> -trace events=qemu_trace_events \
>>>>>>> -d guest_errors \
>>>>>>> -chardev socket,id=SOCKET0,server,nowait,path=qemu.mon.tmux26 \
>>>>>>> -mon chardev=SOCKET0,mode=control
>>>>>>> 
>>>>>>> ---
>>>>>>> Changes:
>>>>>>> v20:
>>>>>>> * compile vof.bin with -mcpu=power4 for better compatibility
>>>>>>> * s/std/stw/ in entry.S to make it work on ppc32
>>>>>>> * fixed dt_available property to support both 32 and 64bit
>>>>>>> * shuffled prom_args handling code
>>>>>>> * do not enforce 32bit in MSR (again, to support 32bit platforms)
>>>>>>> 
>>>>>> 
>>>>>> [...]
>>>>>> 
>>>>>>> diff --git a/default-configs/devices/ppc64-softmmu.mak 
>>>>>>> b/default-configs/devices/ppc64-softmmu.mak
>>>>>>> index ae0841fa3a18..9fb201dfacfa 100644
>>>>>>> --- a/default-configs/devices/ppc64-softmmu.mak
>>>>>>> +++ b/default-configs/devices/ppc64-softmmu.mak
>>>>>>> @@ -9,3 +9,4 @@ CONFIG_POWERNV=y
>>>>>>>  # For pSeries
>>>>>>>  CONFIG_PSERIES=y
>>>>>>>  CONFIG_NVDIMM=y
>>>>>>> +CONFIG_VOF=y
>>>>>>> diff --git a/hw/ppc/Kconfig b/hw/ppc/Kconfig
>>>>>>> index e51e0e5e5ac6..964510dfc73d 100644
>>>>>>> --- a/hw/ppc/Kconfig
>>>>>>> +++ b/hw/ppc/Kconfig
>>>>>>> @@ -143,3 +143,6 @@ config FW_CFG_PPC
>>>>>>> 
>>>>>>>  config FDT_PPC
>>>>>>>      bool
>>>>>>> +
>>>>>>> +config VOF
>>>>>>> +    bool
>>>>>> 
>>>>>> I think you should just add "select VOF" to config PSERIES section in 
>>>>>> Kconfig instead of adding it to 
>>>>>> default-configs/devices/ppc64-softmmu.mak. 
>>>>> 
>>>>> oh well, can do that too.
>>>> 
>>>> I think most config options should be selected by KConfig and the default 
>>>> config should only include machines, otherwise VOF would be added also 
>>>> when you don't compile PSERIES or PEGASOS2. With select in Kconfig it 
>>>> will be added when needed. That's why it's better to use select in this 
>>>> case.
>>>> 
>>>>>>  That should do it, it works in my updated pegasos2 patch:
>>>>>> 
>>>>>> https://osdn.net/projects/qmiga/scm/git/qemu/commits/3c1fad08469b4d3c04def22044e52b2d27774a61 
>>>>>> [...]
>>>>>>> diff --git a/pc-bios/vof/entry.S b/pc-bios/vof/entry.S
>>>>>>> new file mode 100644
>>>>>>> index 000000000000..569688714c91
>>>>>>> --- /dev/null
>>>>>>> +++ b/pc-bios/vof/entry.S
>>>>>>> @@ -0,0 +1,51 @@
>>>>>>> +#define LOAD32(rn, name)    \
>>>>>>> +    lis     rn,name##@h;    \
>>>>>>> +    ori     rn,rn,name##@l
>>>>>>> +
>>>>>>> +#define ENTRY(func_name)    \
>>>>>>> +    .text;                  \
>>>>>>> +    .align  2;              \
>>>>>>> +    .globl  .func_name;     \
>>>>>>> +    .func_name:             \
>>>>>>> +    .globl  func_name;      \
>>>>>>> +    func_name:
>>>>>>> +
>>>>>>> +#define KVMPPC_HCALL_BASE       0xf000
>>>>>>> +#define KVMPPC_H_RTAS           (KVMPPC_HCALL_BASE + 0x0)
>>>>>>> +#define KVMPPC_H_VOF_CLIENT     (KVMPPC_HCALL_BASE + 0x5)
>>>>>>> +
>>>>>>> +    . = 0x100 /* Do exactly as SLOF does */
>>>>>>> +
>>>>>>> +ENTRY(_start)
>>>>>>> +#    LOAD32(%r31, 0) /* Go 32bit mode */
>>>>>>> +#    mtmsrd %r31,0
>>>>>>> +    LOAD32(2, __toc_start)
>>>>>>> +    b entry_c
>>>>>>> +
>>>>>>> +ENTRY(_prom_entry)
>>>>>>> +    LOAD32(2, __toc_start)
>>>>>>> +    stwu    %r1,-112(%r1)
>>>>>>> +    stw     %r31,104(%r1)
>>>>>>> +    mflr    %r31
>>>>>>> +    bl prom_entry
>>>>>>> +    nop
>>>>>>> +    mtlr    %r31
>>>>>>> +    ld      %r31,104(%r1)
>>>>>> 
>>>>>> It's getting there, now I see the first client call from the guest boot 
>>>>>> code but then it crashes on this ld opcode which apparently is 64 bit 
>>>>>> only:
>>>>> 
>>>>> Oh right.
>>>>> 
>>>>> 
>>>>>> Hopefully this is the last such opcode left before I can really test 
>>>>>> this.
>>>>> 
>>>>> Make it lwz, and test it?
>>>> 
>>>> Yes, figured that out too after sending this message. Replacing with lwz 
>>>> works but I wonder that now you have stwu lwz do the stack offsets need 
>>>> adjusting too or you just waste 4 bytes now?
>>> 
>>> Well, this assumes the 64bit client and that ABI. I think ideally the 
>>> firmware is supposed to use its own stack but I did not bother here. I do 
>>> not know 32bit ABI at all so say whether the existing code should just 
>>> work or not :-/
>> 
>> It seems to work so that's OK, just thought if the firmware is 32 bit it 
>> does not need 64 bit values on stack but if that's also potentially used by 
>> a 64 bit kernel then it may be better to keep it that way to avoid 
>> confusion. With the 64 bit opcodes replaced it seems to work on pegasos2 
>> and the guest can call CI functions and get a reply so maybe it's just a 
>> few wasted bytes that's not a big deal.
>> 
>>>> With lwz here I found no further 64 bit opcodes and the guest boot code 
>>>> could walk the device tree. It failed later but I think that's because 
>>>> I'll need to fill more info about the machine in the device tree. I'll 
>>>> experiment with that but it looks like it could work at least for 
>>>> MorphOS. I'll have to try Linux too.
>>> 
>>> There are plenty of tracepoints, enable them all.
>> 
>> I'm running with -trace enable="vof*" but it does not give me too much info 
>> as a lot of calls (such as peer, child, etc.) don't log anything other than 
>> there was a hypercall so only get info about opening paths and querying 
>> some props. The MorphOS boot.img just walks the device tree gathering some 
>> data about the machine then calls quiesce and boot into the OS that later 
>> tries to use the gathered info at which point it crashes without any logs 
>> if some info is not as expected. This does not make it easy to debug but I 
>> think once I fill the device tree enough with all needed info it should 
>> work. Currently I'm missing info about PCI devices that it may need.
>
>
> One thing to note about PCI is that normally I think the client expects the 
> firmware to do PCI probing and SLOF does it. But VOF does not and Linux scans 
> PCI bus(es) itself. Might be a problem for you kernel.

I'm not sure what info does MorphOS get from the device tree and what it 
probes itself but I think it may at least need device ids and info about 
the PCI bus to be able to access the config regs, after that it should set 
the devices up hopefully. I could add these from the board code to device 
tree so VOF does not need to do anything about it. However I'm not getting 
to that point yet because it crashes on something that it's missing and 
couldn't yet find out what is that.

I'd like to get Linux working now as that would be enough to test this and 
then if for MorphOS we still need a ROM it's not a problem if at least we 
can boot Linux without the original firmware. But I can't make Linux open 
a serial console and I don't know what it needs for that. Do you happen to 
know? I've looked at the sources in Linux/arch/powerpc but not sure how it 
would find and open a serial port on pegasos2. It seems to work with the 
board firmware and now I can get it to boot with VOF but then it does not 
open serial so it probably needs something in the device tree or expects 
the firmware to set something up that we should add in pegasos2.c when 
using VOF.

>>>>>> Do you have some info on how the stdout works in VOF? I think I'll need 
>>>>>> that to test with Linux and get output but I'm not sure what's needed 
>>>>>> on the machine side.
>>>>> 
>>>>> VOF opens stsout and stores the ihandle (in fdt) which the client 
>>>>> (==kernel) uses for writing. To make it work properly, you need to hook 
>>>>> up that instance to a device backend similar to what I have for 
>>>>> spapr-vty:
>>>>> 
>>>>> https://github.com/aik/qemu/commit/a381a5b50c23c74013e2bd39cc5dad5b6385965d 
>>>>> This is not a part of this patch as I'm trying to keep things simpler 
>>>>> and accessing backends from VOF is still unsettled. But there is a 
>>>>> workaround which  is trace_vof_write, I use this. Thanks,
>>>> 
>>>> The above patch is about stdin but stdout seems to be added by the 
>>>> current vof patch. What is spapr-vty?
>>> 
>>> It is pseries' paravirtual serial device, pegasos does not have it.
>>> 
>>>> I don't think I have something similar in pegasos2 where I just have a 
>>>> normal serial port created by ISASuperIO in the vt8231 model.
>>> 
>>> Correct.
>>> 
>>>> Can I use that backend somehow or have to create some other serial device 
>>>> to connect to stdout?
>>>> Does trace_vof_write work for stuff output by the guest?
>>>> I guess that's only for things printed by VOF itself
>>> 
>>> VOF itself does not prints anything in this patch.
>> 
>> However it seems to be needed for linux as the first thing it does seems to 
>> be getting /chosen/stdout and calls exit if it returns nothing. So 
>
> Right, Linux does but VOF (==vof.bin) does not.
>
>> I'll need this at least for linux. (I think MorphOS may also query it to 
>> print a banner or some messages but not sure it needs it, at least it does 
>> not abort right away if not found.)
>
> Tracepoints print this :)

The vof_write tracepoints only work until the guest calls quiesce, after 
that it should open the serial and use that or init the screen but it does 
not seem to work yet.

>>>> but to see Linux output do I need a stdout in VOF or it will just open 
>>>> the serial with its own driver and use that?
>>>> So I'm not sure what's the stdout parts in the current vof patch does and 
>>>> if I need that for anything. I'll try to experiment with it some more but 
>>>> fixing the ld and Kconfig seems to be enough to get it work for me.
>>> 
>>> So for the client to print something, /chosen/stdout needs to have a valid 
>>> ihandle.
>>> The only way to get a valid ihandle is having a valid phandle which 
>>> vof_client_open() can open.
>>> A valid phandle is a phandle of any node in the device tree. On spapr we 
>>> pick some spapr-vty, open it and store in /chosen/stdout.
>>> 
>>> From this point output from the client can be seen via a tracepoint.
>>> 
>>> Now if we want proper output without tracepoints - we need to hook it up 
>>> with some chardev backend (not a device such a vt8231 or spapr-vty but 
>>> backend).
>> 
>> I don't know much about it but devices are also connected to some backend 
>> so is it possible to use the same backend for VOF as used for the normal 
>> serial port?
>
> Yes but with this initial patch there is no backend support, you only get 
> tracepoints.

OK, I've got that now, traces work and if Linux would open the serial with 
its own driver then that would be enough for now.

>> But I need a way to find that and connect it to VOF and I'm not qure how to 
>> do that yet.
>
> Pick some device in the machine reset code (or you can open the root - "/"), 
> resolve its FW (==FDT) path, call vof_client_open_store() on it, it will 
> store ihandle in the FDT. This will enable stdout and the output can be seen 
> via tracepoint.
>
>
>> Or do I need to create a separate serial backend and connect that to VOF? 
>> I'll try to look at spapr-vty to see what it does.
>
> No additional devices needed.

Yes, as I wrote in a subsequent message I've figured this out.

>>> https://github.com/aik/qemu/commit/a381a5b50c23c74013e2bd3 does this:
>>> 1. when a phandle is open, QEMU will search for DeviceState* for the 
>>> specific FDT node and get a chardev from the device.
>>> 2. when write() is called, QEMU calls qemu_chr_fe_write_all() on chardev 
>>> from 1.
>>> 
>>> From this point you do not need a tracepoint and the output will appears 
>>> in the console you set up for stdout.
>>> 
>>> Now if you want input from this console, things get tricky. First, on 
>>> powernv/pseries we only need this for grub as otherwise the kernel has all 
>>> the drivers needed and will not use the client interface. For the grub, we 
>>> need to provide a valid ihandle for /chosen/stdin which is easy but 
>>> implementing read() on this is not as there is no simple 
>>> device-type-independend way of reading from chardev. I hacked it for 
>>> spapr-tvy but other serial devices will need special handling, or we'll 
>>> have to introduce some VOF_SERIAL_READ interface for those which will face 
>>> opposition :)
>>> 
>>> Makes sense?
>> 
>> It explains things a bit but still not entirely clear how can I get 
>> something to add as a stdout. With the pegasos2 firmware it puts the serial 
>> device there normally that it inits and opens. Without that firmware we 
>> have to somehow do that from QEMU so find the serial backend used by the 
>> serial device within the vt8231 model (or use a different backend just for 
>> this?) then open it and put it in the device tree. If that's correct or how 
>> to do it is not clear yet.
>
> spapr looks through all spapr-vty and picks one with the lowest @reg. You can 
> do a similar thing. Or add a machine option with a serial device id which you 
> want to be the default console. So many options :)

Fortunately pegasos2 has a single serial port so that's easy to find. For 
now I'm using what the board firmware does and add a /failsafe node with 
device_type serial and open that which works for vof_write traces and 
Linux finds it as a serial console and adds console=ttyS0 to command 
line if it's not there yet so it should work but then it does not seem to 
find the serial device so I get no output. I don't know what Linux needs 
from the device tree to find the serial. I've tried adding it and some 
properties I say it querying but could not make it work yet.

Regards,
BALATON Zoltan
BALATON Zoltan May 23, 2021, 11:24 a.m. UTC | #16
On Sun, 23 May 2021, Alexey Kardashevskiy wrote:
> On 23/05/2021 01:02, BALATON Zoltan wrote:
>> On Sat, 22 May 2021, BALATON Zoltan wrote:
>>> On Sat, 22 May 2021, Alexey Kardashevskiy wrote:
>>>> VOF itself does not prints anything in this patch.
>>> 
>>> However it seems to be needed for linux as the first thing it does seems 
>>> to be getting /chosen/stdout and calls exit if it returns nothing. So I'll 
>>> need this at least for linux. (I think MorphOS may also query it to print 
>>> a banner or some messages but not sure it needs it, at least it does not 
>>> abort right away if not found.)
>>> 
>>>>> but to see Linux output do I need a stdout in VOF or it will just open 
>>>>> the serial with its own driver and use that?
>>>>> So I'm not sure what's the stdout parts in the current vof patch does 
>>>>> and if I need that for anything. I'll try to experiment with it some 
>>>>> more but fixing the ld and Kconfig seems to be enough to get it work for 
>>>>> me.
>>>> 
>>>> So for the client to print something, /chosen/stdout needs to have a 
>>>> valid ihandle.
>>>> The only way to get a valid ihandle is having a valid phandle which 
>>>> vof_client_open() can open.
>>>> A valid phandle is a phandle of any node in the device tree. On spapr we 
>>>> pick some spapr-vty, open it and store in /chosen/stdout.
>>>> 
>>>> From this point output from the client can be seen via a tracepoint.
>> 
>> I've got it now. Looking at the original firmware device tree dump:
>> 
>> https://osdn.net/projects/qmiga/wiki/SubprojectPegasos2/attach/PegasosII_OFW-Dump.txt 
>> 
>> I see that /chosen/stdout points to "screen" which is an alias to 
>> /bootconsole. Just adding an empty /bootconsole node in the device tree and 
>> vof_client_open_store() that as /chosen/stdout works and I get output via 
>> vof_write traces so this is enough for now to test Linux. Properly 
>> connecting a serial backend can thus be postponed.
>> 
>> So with this the Linux kernel does not abort on the first device tree 
>> access but starts to decompress itself then the embedded initrd and crashes 
>> at calling setprop:
>> 
>> [...]
>> vof_client_handle: setprop
>> 
>> Thread 4 "qemu-system-ppc" received signal SIGSEGV, Segmentation fault.
>> (gdb) bt
>> #0  0x0000000000000000 in  ()
>> #1  0x0000555555a5c2bf in vof_setprop
>>      (vof=0x7ffff48e9420, vallen=4, valaddr=<optimized out>, 
>> pname=<optimized out>, nodeph=8, fdt=0x7fff8aaff010, ms=0x5555564f8800)
>>      at ../hw/ppc/vof.c:308
>> #2  0x0000555555a5c2bf in vof_client_handle
>>      (nrets=1, rets=0x7ffff48e93f0, nargs=4, args=0x7ffff48e93c0, 
>> service=0x7ffff48e9460 "setprop",
>>       vof=0x7ffff48e9420, fdt=0x7fff8aaff010, ms=0x5555564f8800) at 
>> ../hw/ppc/vof.c:842
>> #3  0x0000555555a5c2bf in vof_client_call
>>      (ms=0x5555564f8800, vof=vof@entry=0x55555662a3d0, 
>> fdt=fdt@entry=0x7fff8aaff010, args_real=args_real@entry=23580472)
>>      at ../hw/ppc/vof.c:935
>> 
>> loooks like it's trying to set /chosen/linux,initrd-start:
>
> It is not horribly clear why it crashed though.

It crashed becuase I had TYPE_VOF_MACHINE_IF but did not set a setprop 
callback and it tried to call that here. Adding a {return true;} empty 
callback avoids this.

>> (gdb) up
>> #1  0x0000555555a5c2bf in vof_setprop (vof=0x7ffff48e9420, vallen=4, 
>> valaddr=<optimized out>, pname=<optimized out>, nodeph=8,
>>      fdt=0x7fff8aaff010, ms=0x5555564f8800) at ../hw/ppc/vof.c:308
>> 308            if (!vmc->setprop(ms, nodepath, propname, val, vallen)) {
>> (gdb) p nodepath
>> $1 = "/chosen\000\060/rPC,750CXE/", '\000' <repeats 234 times>
>> (gdb) p propname
>> $2 = 
>> "linux,initrd-start\000linux,initrd-end\000linux,cmdline-timeout\000bootarg" 
>> (gdb) p val
>> $3 = <optimized out>
>> 
>> I think I need the callback for setprop in TYPE_VOF_MACHINE_IF. I can copy 
>> spapr_vof_setprop() but some explanation on why that's needed might help. 
>> Ciould I just do fdt_setprop in my callback as vof_setprop() would do 
>> without a machine callback or is there some special handling needed for 
>> these properties?
>
> The short answer is yes, you do not need TYPE_VOF_MACHINE_IF.
>
> The long answer is that we build the FDT on spapr twice:
> 1. at the reset time and
> 2. after "ibm,client-arhitecture-support" (early in the boot the spapr 
> paravirtual client says what it supports - ISA level, MMU features, etc)
>
> Between 1 and 2 the kernel moves initrd and we do not update the QEMU's 
> version of its location, the tree at 2) will have the old values.
>
> So for that reason I have TYPE_VOF_MACHINE_IF. You most definitely do not 
> need it.

I need TYPE_VOF_MACHINE_IF because that has the quiesce callback that I 
need to shut VOF down when the guest is finished with it otherwise it 
would crash later (more on this in next message). But since I shut down 
VOF here I don't need to remember changes to the FDT so I can just use an 
empty setprop callback. (I wouldn't even need that if VOF would check that 
a callback is non-NULL before calling it.)

Regards,
BALATON Zoltan
BALATON Zoltan May 23, 2021, 12:02 p.m. UTC | #17
On Sun, 23 May 2021, Alexey Kardashevskiy wrote:
> On 23/05/2021 02:46, BALATON Zoltan wrote:
>> On Sat, 22 May 2021, BALATON Zoltan wrote:
>>> On Sat, 22 May 2021, BALATON Zoltan wrote:
>>>> On Sat, 22 May 2021, Alexey Kardashevskiy wrote:
>>>>> VOF itself does not prints anything in this patch.
>>>> 
>>>> However it seems to be needed for linux as the first thing it does seems 
>>>> to be getting /chosen/stdout and calls exit if it returns nothing. So 
>>>> I'll need this at least for linux. (I think MorphOS may also query it to 
>>>> print a banner or some messages but not sure it needs it, at least it 
>>>> does not abort right away if not found.)
>>>> 
>>>>>> but to see Linux output do I need a stdout in VOF or it will just open 
>>>>>> the serial with its own driver and use that?
>>>>>> So I'm not sure what's the stdout parts in the current vof patch does 
>>>>>> and if I need that for anything. I'll try to experiment with it some 
>>>>>> more but fixing the ld and Kconfig seems to be enough to get it work 
>>>>>> for me.
>>>>> 
>>>>> So for the client to print something, /chosen/stdout needs to have a 
>>>>> valid ihandle.
>>>>> The only way to get a valid ihandle is having a valid phandle which 
>>>>> vof_client_open() can open.
>>>>> A valid phandle is a phandle of any node in the device tree. On spapr we 
>>>>> pick some spapr-vty, open it and store in /chosen/stdout.
>>>>> 
>>>>> From this point output from the client can be seen via a tracepoint.
>>> 
>>> I've got it now. Looking at the original firmware device tree dump:
>>> 
>>> https://osdn.net/projects/qmiga/wiki/SubprojectPegasos2/attach/PegasosII_OFW-Dump.txt 
>>> 
>>> I see that /chosen/stdout points to "screen" which is an alias to 
>>> /bootconsole. Just adding an empty /bootconsole node in the device tree 
>>> and vof_client_open_store() that as /chosen/stdout works and I get output 
>>> via vof_write traces so this is enough for now to test Linux. Properly 
>>> connecting a serial backend can thus be postponed.
>> 
>> Using /failsafe instead of /bootconsole is even better because Linux then 
>> adds console=ttyS0 to the bootargs by default as it knows that's a serial 
>> port.
>
> When linux boots so far that it can use whatever is passed in "console=" - 
> the client interface is done pretty much and the output happens without it.

That's the problem that Linux does not open serial yet when booting with 
VOF but I don't have everyhing in the device tree yet and devices may be 
set up differently when the board firmware haven't run so I'm not sure 
what's missing for Linux to find and open serial. Does anybody happen to 
know?

>>> So with this the Linux kernel does not abort on the first device tree 
>>> access but starts to decompress itself then the embedded initrd and 
>>> crashes at calling setprop:
>>> 
>>> [...]
>>> vof_client_handle: setprop
>>> 
>>> Thread 4 "qemu-system-ppc" received signal SIGSEGV, Segmentation fault.
>>> (gdb) bt
>>> #0  0x0000000000000000 in  ()
>>> #1  0x0000555555a5c2bf in vof_setprop
>>>    (vof=0x7ffff48e9420, vallen=4, valaddr=<optimized out>, 
>>> pname=<optimized out>, nodeph=8, fdt=0x7fff8aaff010, ms=0x5555564f8800)
>>>    at ../hw/ppc/vof.c:308
>>> #2  0x0000555555a5c2bf in vof_client_handle
>>>    (nrets=1, rets=0x7ffff48e93f0, nargs=4, args=0x7ffff48e93c0, 
>>> service=0x7ffff48e9460 "setprop",
>>>     vof=0x7ffff48e9420, fdt=0x7fff8aaff010, ms=0x5555564f8800) at 
>>> ../hw/ppc/vof.c:842
>>> #3  0x0000555555a5c2bf in vof_client_call
>>>    (ms=0x5555564f8800, vof=vof@entry=0x55555662a3d0, 
>>> fdt=fdt@entry=0x7fff8aaff010, args_real=args_real@entry=23580472)
>>>    at ../hw/ppc/vof.c:935
>>> 
>>> loooks like it's trying to set /chosen/linux,initrd-start:
>>> 
>>> (gdb) up
>>> #1  0x0000555555a5c2bf in vof_setprop (vof=0x7ffff48e9420, vallen=4, 
>>> valaddr=<optimized out>, pname=<optimized out>, nodeph=8,
>>>    fdt=0x7fff8aaff010, ms=0x5555564f8800) at ../hw/ppc/vof.c:308
>>> 308            if (!vmc->setprop(ms, nodepath, propname, val, vallen)) {
>>> (gdb) p nodepath
>>> $1 = "/chosen\000\060/rPC,750CXE/", '\000' <repeats 234 times>
>>> (gdb) p propname
>>> $2 = 
>>> "linux,initrd-start\000linux,initrd-end\000linux,cmdline-timeout\000bootarg" 
>>> (gdb) p val
>>> $3 = <optimized out>
>>> 
>>> I think I need the callback for setprop in TYPE_VOF_MACHINE_IF. I can copy 
>>> spapr_vof_setprop() but some explanation on why that's needed might help. 
>>> Ciould I just do fdt_setprop in my callback as vof_setprop() would do 
>>> without a machine callback or is there some special handling needed for 
>>> these properties?
>> 
>> Just returning true from the setprop callback of the VofMachineIfClass for 
>> now to see what it would do and then it gets to all the way of calling 
>> quiesce. Unfortunately it then tries to call prom_printf on Pegasos2 as 
>> seen here:
>> 
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/powerpc/kernel/prom_init.c?h=v4.14.233#n3261 
>> 
>> which does not work because I have to shut down vhyp at quiesce 
>
> What is vhyp and why do you have to shut it down?

The vhyp is the TYPE_PPC_VIRTUAL_HYPERVISOR interface that I need to get 
hypercalls working as I don't normally have it on pegasos2 so I need to 
install that for VOF but have to tear it down on quiece otherwise it would 
conflict with things later (at least the assert below but guests also use 
syscalls and I'm not sure that would also conflict). It works though early 
in the boot when VOF and guest code using VOF runs which is before the 
guest takes over the CPU and no syscalls are used by guests yet at this 
point.

This is the current version of the patch I'm experimenting with:

https://osdn.net/projects/qmiga/scm/git/qemu/commits/dd4ed0901501e12921cbdbe9e1f918167b168197

and the pegasos2.c after the patch:

https://osdn.net/projects/qmiga/scm/git/qemu/blobs/pegasos2/hw/ppc/pegasos2.c

maybe it explains more what I'm talking about.

>> otherwise it trips an assert on writing sdr1 (and may also interfere with 
>> the guest's usage of syscalls).
>
> Where is that assert?

It's here on line 73 in ppc_store_sdr1():

https://git.qemu.org/?p=qemu.git;a=blob;f=target/ppc/cpu.c;h=d957d1a687bf8ade79b5f466dd696b56f63d7e1e;hb=HEAD#l73

which is called when the guest tries to set up the MMU I think and if I 
still have vhyp set at that point. So I have to remove that on quiesce but 
then any further CI call will cause an exception due to sc 1 being a 
normal syscall again but we don't have exception handlers yet so it will 
be a run away exception first due to sc 1 then due to invalid opcode at 
the handler address.

> I am a bit lost here. Nothing in the current VOF should touch any actual 
> device, it prints via tracepoints or (with that additional patch) to a 
> chardev backend.
>
>
>> So I need a way to not generate an exception if the guest calls back into 
>> OF after quiesce. A hacky solution is to patch out the sc 1 or _prom_entry 
>> point to just return after quiesce but maybe a better way is needed such as 
>> a switch in vof.bin that it checks before doing a syscall. Other than this 
>> problem it seems to work for the most part so maybe making the _prom_entry 
>> check some global value that I can set from quiesce to stop it doing 
>> syscalls and just return would be the simplest way to avoid this crash in 
>> Linux and not need a special version of vof for pegasos2. (MorphOS does not 
>> seem to call OF after quiesce which seems safer to do anyway, don't know 
>> why Linux does that. It could just print that one line before quiesce and 
>> then it would work, unfortunately that's not what they did.)
>
> quiesce is supposed to wait until ongoing DMA is finished (or something like 
> that), it was (people say) a request from Apple back then and was never 
> really architected.

Still it's used by guests to signal that they're finshed with OF calls so 
it's a convenient place to shut down VOF. Unfortunately Linux does another 
write call after quiesce which is silly as it does not even work on the 
real firmware (I'm not seeing the output of that call even with 
pegasos2.rom just does not crash) and the comment in the kernel says that 
some firmwares do crash so I don't know why they put it there but it's 
there and since there are binaries out there with this bug/feature we 
should handle that somehow. I can think of two ways:

One is patching the ci_entry to just return after quiesce without doing 
the hypercall that I've done in the patch above but instead of the hack 
binary patching. a better way would be to have a known address in VOF 
holding a flag that I can flip to disable ci_entry so it would check the 
flag and return if it's set then I would not need to modify the binary and 
know the address of ci_entry.

Or second option would be to have dummy exception handlers in VOF that 
ignores this exception so it won't crash on this CI call after quiesce 
that Linux does.

Does it make sense? Do you have other idea?

Regards,
BALATON Zoltan
BALATON Zoltan May 23, 2021, 12:12 p.m. UTC | #18
On Sun, 23 May 2021, Alexey Kardashevskiy wrote:
> On 22/05/2021 23:08, BALATON Zoltan wrote:
>> On Sat, 22 May 2021, Alexey Kardashevskiy wrote:
>>> On 22/05/2021 05:57, BALATON Zoltan wrote:
>>>> On Fri, 21 May 2021, BALATON Zoltan wrote:
>>>>> On Fri, 21 May 2021, Alexey Kardashevskiy wrote:
>>>>>> On 21/05/2021 07:59, BALATON Zoltan wrote:
>>>>>>> On Thu, 20 May 2021, Alexey Kardashevskiy wrote:
>>>>>>>> The PAPR platform describes an OS environment that's presented by
>>>>>>>> a combination of a hypervisor and firmware. The features it specifies
>>>>>>>> require collaboration between the firmware and the hypervisor.
>>>>>>>> 
>>>>>>>> Since the beginning, the runtime component of the firmware (RTAS) has
>>>>>>>> been implemented as a 20 byte shim which simply forwards it to
>>>>>>>> a hypercall implemented in qemu. The boot time firmware component is
>>>>>>>> SLOF - but a build that's specific to qemu, and has always needed to 
>>>>>>>> be
>>>>>>>> updated in sync with it. Even though we've managed to limit the 
>>>>>>>> amount
>>>>>>>> of runtime communication we need between qemu and SLOF, there's some,
>>>>>>>> and it has become increasingly awkward to handle as we've implemented
>>>>>>>> new features.
>>>>>>>> 
>>>>>>>> This implements a boot time OF client interface (CI) which is
>>>>>>>> enabled by a new "x-vof" pseries machine option (stands for "Virtual 
>>>>>>>> Open
>>>>>>>> Firmware). When enabled, QEMU implements the custom H_OF_CLIENT hcall
>>>>>>>> which implements Open Firmware Client Interface (OF CI). This allows
>>>>>>>> using a smaller stateless firmware which does not have to manage
>>>>>>>> the device tree.
>>>>>>>> 
>>>>>>>> The new "vof.bin" firmware image is included with source code under
>>>>>>>> pc-bios/. It also includes RTAS blob.
>>>>>>>> 
>>>>>>>> This implements a handful of CI methods just to get -kernel/-initrd
>>>>>>>> working. In particular, this implements the device tree fetching and
>>>>>>>> simple memory allocator - "claim" (an OF CI memory allocator) and 
>>>>>>>> updates
>>>>>>>> "/memory@0/available" to report the client about available memory.
>>>>>>>> 
>>>>>>>> This implements changing some device tree properties which we know 
>>>>>>>> how
>>>>>>>> to deal with, the rest is ignored. To allow changes, this skips
>>>>>>>> fdt_pack() when x-vof=on as not packing the blob leaves some room for
>>>>>>>> appending.
>>>>>>>> 
>>>>>>>> In absence of SLOF, this assigns phandles to device tree nodes to 
>>>>>>>> make
>>>>>>>> device tree traversing work.
>>>>>>>> 
>>>>>>>> When x-vof=on, this adds "/chosen" every time QEMU (re)builds a tree.
>>>>>>>> 
>>>>>>>> This adds basic instances support which are managed by a hash map
>>>>>>>> ihandle -> [phandle].
>>>>>>>> 
>>>>>>>> Before the guest started, the used memory is:
>>>>>>>> 0..e60 - the initial firmware
>>>>>>>> 8000..10000 - stack
>>>>>>>> 400000.. - kernel
>>>>>>>> 3ea0000.. - initramdisk
>>>>>>>> 
>>>>>>>> This OF CI does not implement "interpret".
>>>>>>>> 
>>>>>>>> Unlike SLOF, this does not format uninitialized nvram. Instead, this
>>>>>>>> includes a disk image with pre-formatted nvram.
>>>>>>>> 
>>>>>>>> With this basic support, this can only boot into kernel directly.
>>>>>>>> However this is just enough for the petitboot kernel and initradmdisk 
>>>>>>>> to
>>>>>>>> boot from any possible source. Note this requires reasonably recent 
>>>>>>>> guest
>>>>>>>> kernel with:
>>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df5be5be8735 
>>>>>>>> The immediate benefit is much faster booting time which especially
>>>>>>>> crucial with fully emulated early CPU bring up environments. Also 
>>>>>>>> this
>>>>>>>> may come handy when/if GRUB-in-the-userspace sees light of the day.
>>>>>>>> 
>>>>>>>> This separates VOF and sPAPR in a hope that VOF bits may be reused by
>>>>>>>> other POWERPC boards which do not support pSeries.
>>>>>>>> 
>>>>>>>> This is coded in assumption that later on we might be adding support 
>>>>>>>> for
>>>>>>>> booting from QEMU backends (blockdev is the first candidate) without
>>>>>>>> devices/drivers in between as OF1275 does not require that and
>>>>>>>> it is quite easy to so.
>>>>>>>> 
>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>> ---
>>>>>>>> 
>>>>>>>> The example command line is:
>>>>>>>> 
>>>>>>>> /home/aik/pbuild/qemu-killslof-localhost-ppc64/qemu-system-ppc64 \
>>>>>>>> -nodefaults \
>>>>>>>> -chardev stdio,id=STDIO0,signal=off,mux=on \
>>>>>>>> -device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
>>>>>>>> -mon id=MON0,chardev=STDIO0,mode=readline \
>>>>>>>> -nographic \
>>>>>>>> -vga none \
>>>>>>>> -enable-kvm \
>>>>>>>> -m 8G \
>>>>>>>> -machine 
>>>>>>>> pseries,x-vof=on,cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off 
>>>>>>>> \
>>>>>>>> -kernel pbuild/kernel-le-guest/vmlinux \
>>>>>>>> -initrd pb/rootfs.cpio.xz \
>>>>>>>> -drive 
>>>>>>>> id=DRIVE0,if=none,file=./p/qemu-killslof/pc-bios/vof-nvram.bin,format=raw 
>>>>>>>> \
>>>>>>>> -global spapr-nvram.drive=DRIVE0 \
>>>>>>>> -snapshot \
>>>>>>>> -smp 8,threads=8 \
>>>>>>>> -L /home/aik/t/qemu-ppc64-bios/ \
>>>>>>>> -trace events=qemu_trace_events \
>>>>>>>> -d guest_errors \
>>>>>>>> -chardev socket,id=SOCKET0,server,nowait,path=qemu.mon.tmux26 \
>>>>>>>> -mon chardev=SOCKET0,mode=control
>>>>>>>> 
>>>>>>>> ---
>>>>>>>> Changes:
>>>>>>>> v20:
>>>>>>>> * compile vof.bin with -mcpu=power4 for better compatibility
>>>>>>>> * s/std/stw/ in entry.S to make it work on ppc32
>>>>>>>> * fixed dt_available property to support both 32 and 64bit
>>>>>>>> * shuffled prom_args handling code
>>>>>>>> * do not enforce 32bit in MSR (again, to support 32bit platforms)
>>>>>>>> 
>>>>>>> 
>>>>>>> [...]
>>>>>>> 
>>>>>>>> diff --git a/default-configs/devices/ppc64-softmmu.mak 
>>>>>>>> b/default-configs/devices/ppc64-softmmu.mak
>>>>>>>> index ae0841fa3a18..9fb201dfacfa 100644
>>>>>>>> --- a/default-configs/devices/ppc64-softmmu.mak
>>>>>>>> +++ b/default-configs/devices/ppc64-softmmu.mak
>>>>>>>> @@ -9,3 +9,4 @@ CONFIG_POWERNV=y
>>>>>>>>  # For pSeries
>>>>>>>>  CONFIG_PSERIES=y
>>>>>>>>  CONFIG_NVDIMM=y
>>>>>>>> +CONFIG_VOF=y
>>>>>>>> diff --git a/hw/ppc/Kconfig b/hw/ppc/Kconfig
>>>>>>>> index e51e0e5e5ac6..964510dfc73d 100644
>>>>>>>> --- a/hw/ppc/Kconfig
>>>>>>>> +++ b/hw/ppc/Kconfig
>>>>>>>> @@ -143,3 +143,6 @@ config FW_CFG_PPC
>>>>>>>> 
>>>>>>>>  config FDT_PPC
>>>>>>>>      bool
>>>>>>>> +
>>>>>>>> +config VOF
>>>>>>>> +    bool
>>>>>>> 
>>>>>>> I think you should just add "select VOF" to config PSERIES section in 
>>>>>>> Kconfig instead of adding it to 
>>>>>>> default-configs/devices/ppc64-softmmu.mak. 
>>>>>> 
>>>>>> oh well, can do that too.
>>>>> 
>>>>> I think most config options should be selected by KConfig and the 
>>>>> default config should only include machines, otherwise VOF would be 
>>>>> added also when you don't compile PSERIES or PEGASOS2. With select in 
>>>>> Kconfig it will be added when needed. That's why it's better to use 
>>>>> select in this case.
>>>>> 
>>>>>>>  That should do it, it works in my updated pegasos2 patch:
>>>>>>> 
>>>>>>> https://osdn.net/projects/qmiga/scm/git/qemu/commits/3c1fad08469b4d3c04def22044e52b2d27774a61 
>>>>>>> [...]
>>>>>>>> diff --git a/pc-bios/vof/entry.S b/pc-bios/vof/entry.S
>>>>>>>> new file mode 100644
>>>>>>>> index 000000000000..569688714c91
>>>>>>>> --- /dev/null
>>>>>>>> +++ b/pc-bios/vof/entry.S
>>>>>>>> @@ -0,0 +1,51 @@
>>>>>>>> +#define LOAD32(rn, name)    \
>>>>>>>> +    lis     rn,name##@h;    \
>>>>>>>> +    ori     rn,rn,name##@l
>>>>>>>> +
>>>>>>>> +#define ENTRY(func_name)    \
>>>>>>>> +    .text;                  \
>>>>>>>> +    .align  2;              \
>>>>>>>> +    .globl  .func_name;     \
>>>>>>>> +    .func_name:             \
>>>>>>>> +    .globl  func_name;      \
>>>>>>>> +    func_name:
>>>>>>>> +
>>>>>>>> +#define KVMPPC_HCALL_BASE       0xf000
>>>>>>>> +#define KVMPPC_H_RTAS           (KVMPPC_HCALL_BASE + 0x0)
>>>>>>>> +#define KVMPPC_H_VOF_CLIENT     (KVMPPC_HCALL_BASE + 0x5)
>>>>>>>> +
>>>>>>>> +    . = 0x100 /* Do exactly as SLOF does */
>>>>>>>> +
>>>>>>>> +ENTRY(_start)
>>>>>>>> +#    LOAD32(%r31, 0) /* Go 32bit mode */
>>>>>>>> +#    mtmsrd %r31,0
>>>>>>>> +    LOAD32(2, __toc_start)
>>>>>>>> +    b entry_c
>>>>>>>> +
>>>>>>>> +ENTRY(_prom_entry)
>>>>>>>> +    LOAD32(2, __toc_start)
>>>>>>>> +    stwu    %r1,-112(%r1)
>>>>>>>> +    stw     %r31,104(%r1)
>>>>>>>> +    mflr    %r31
>>>>>>>> +    bl prom_entry
>>>>>>>> +    nop
>>>>>>>> +    mtlr    %r31
>>>>>>>> +    ld      %r31,104(%r1)
>>>>>>> 
>>>>>>> It's getting there, now I see the first client call from the guest 
>>>>>>> boot code but then it crashes on this ld opcode which apparently is 64 
>>>>>>> bit only:
>>>>>> 
>>>>>> Oh right.
>>>>>> 
>>>>>> 
>>>>>>> Hopefully this is the last such opcode left before I can really test 
>>>>>>> this.
>>>>>> 
>>>>>> Make it lwz, and test it?
>>>>> 
>>>>> Yes, figured that out too after sending this message. Replacing with lwz 
>>>>> works but I wonder that now you have stwu lwz do the stack offsets need 
>>>>> adjusting too or you just waste 4 bytes now? With lwz here I found no 
>>>>> further 64 bit opcodes and the guest boot code could walk the device 
>>>>> tree. It failed later but I think that's because I'll need to fill more 
>>>>> info about the machine in the device tree. I'll experiment with that but 
>>>>> it looks like it could work at least for MorphOS. I'll have to try Linux 
>>>>> too.
>>>> 
>>>> I was trying to get a linux kernel from a debian powerpc iso to do 
>>>> something (debian before 10.0 has Pegasos support) but I've run into the 
>>>> problem that the kernel is loaded at 0x400000 but the start address is at 
>>>> some offset from that. How do I set qemu,boot-kernel in this case?
>>> 
>>> 
>>> The pseries kernel can work from any location (and it relocates itself to 
>>> 0 at some point) even though it is linked at c000.0000.0000.0000, and 
>>> there is no start address offset:
>>> 
>>> ===
>>>> objdump -D ~/pbuild/kernel-le/vmlinux
>>> /home/aik/pbuild/kernel-le/vmlinux:     file format elf64-powerpcle
>>> 
>>> 
>>> Disassembly of section .head.text:
>>> 
>>> c000000000000000 <__start>:
>>> c000000000000000:       48 00 00 08     tdi     0,r0,72
>>> c000000000000004:       2c 00 00 48     b       c000000000000030 
>>> <__start+0x30>
>>> ...
>>> ===
>>> 
>>> Not sure about pegasos2 kernels (or any ppc32 really), sorry.
>> 
>> The kernel from Debian 10.0 powerpc used on pegasos looks like this:
>> 
>> vmlinuz-chrp.initrd:     file format elf32-powerpc
>> vmlinuz-chrp.initrd
>> architecture: powerpc:common, flags 0x00000112:
>> EXEC_P, HAS_SYMS, D_PAGED
>> start address 0x004002fc
>> 
>> Program Header:
>>      LOAD off    0x00010000 vaddr 0x00400000 paddr 0x00400000 align 2**16
>>           filesz 0x0127b72a memsz 0x0127d5d8 flags rwx
>>     STACK off    0x00000000 vaddr 0x00000000 paddr 0x00000000 align 2**4
>>           filesz 0x00000000 memsz 0x00000000 flags rwx
>>      NOTE off    0x000000b4 vaddr 0x00000000 paddr 0x00000000 align 2**0
>>           filesz 0x0000002c memsz 0x00000000 flags ---
>>      NOTE off    0x000000e0 vaddr 0x00000000 paddr 0x00000000 align 2**0
>>           filesz 0x0000002c memsz 0x00000000 flags ---
>> 
>> Sections:
>> Idx Name          Size      VMA       LMA       File off  Algn
>>    0 .text         00008588  00400000  00400000  00010000  2**2
>>                    CONTENTS, ALLOC, LOAD, READONLY, CODE
>>    1 .text.unlikely 00000078  00408588  00408588  00018588  2**2
>>                    CONTENTS, ALLOC, LOAD, READONLY, CODE
>>    2 .data         00001bec  00409000  00409000  00019000  2**2
>>                    CONTENTS, ALLOC, LOAD, DATA
>>    3 .got          0000000c  0040abec  0040abec  0001abec  2**2
>>                    CONTENTS, ALLOC, LOAD, DATA
>>    4 __builtin_cmdline 00000800  0040abf8  0040abf8  0001abf8  2**2
>>                    CONTENTS, ALLOC, LOAD, DATA
>>    5 .kernel:vmlinux.strip 0047658e  0040c000  0040c000  0001c000  2**0
>>                    CONTENTS, ALLOC, LOAD, READONLY, DATA
>>    6 .kernel:initrd 00df872a  00883000  00883000  00493000  2**0
>>                    CONTENTS, ALLOC, LOAD, READONLY, DATA
>>    7 .bss          000015d8  0167c000  0167c000  0128b72a  2**2
>>                    ALLOC
>>    8 .debug_info   0000e7fd  00000000  00000000  0128b72a  2**0
>>                    CONTENTS, READONLY, DEBUGGING
>>    9 .debug_abbrev 00002a4f  00000000  00000000  01299f27  2**0
>>                    CONTENTS, READONLY, DEBUGGING
>>   10 .debug_loc    00009df1  00000000  00000000  0129c976  2**0
>>                    CONTENTS, READONLY, DEBUGGING
>>   11 .debug_aranges 00000250  00000000  00000000  012a6767  2**0
>>                    CONTENTS, READONLY, DEBUGGING
>>   12 .debug_line   000026b8  00000000  00000000  012a69b7  2**0
>>                    CONTENTS, READONLY, DEBUGGING
>>   13 .debug_str    00001d9c  00000000  00000000  012a906f  2**0
>>                    CONTENTS, READONLY, DEBUGGING
>>   14 .comment      0000001d  00000000  00000000  012aae0b  2**0
>>                    CONTENTS, READONLY
>>   15 .gnu.attributes 00000010  00000000  00000000  012aae28  2**0
>>                    CONTENTS, READONLY
>>   16 .debug_frame  00001c88  00000000  00000000  012aae38  2**2
>>                    CONTENTS, READONLY, DEBUGGING
>>   17 .debug_ranges 00000740  00000000  00000000  012acac0  2**0
>>                    CONTENTS, READONLY, DEBUGGING
>> 
>> It even seems to have the initrd embedded in it. If I just use 0x400000 as 
>> start address it does not work, has to jump to the start address for it to 
>> start correctly.
>> 
>>>> Because when I set it to the address/size where the kernel is loaded it 
>>>> jumps to the beginnig not the correct start address. If I set the address 
>>>> to the start address then size will be wrong so I don't know how to set 
>>>> qemu,boot-kernel in this case or is there another property to tell the 
>>>> start address?
>>>> (Vof does not seem to check any other property and seems to assume the 
>>>> entry point is the same as the load address but for this linux kernel 
>>>> it's not.)
>>> 
>>> I guess if you really need an offset, you'll have to add a new property 
>>> ("qemu,boot-kernel-start"?) and look for it in the firmware. Or, say, put 
>>> in gpr5 in your version of spapr_cpu_set_entry_state() and make 
>>> boot_from_memory() use it.
>> 
>> Either way would work but I don't want to recompile vof.bin so if you 
>
> I really do not want to add features with no user for it; and having this 
> added with pegasos2 support make it clear why it is there. Also recompile is 
> really simple :)

Provided you have a cross compiler set up and do not run into problems 
you've mentioned before. So I'd prefer to not increase the source of bugs 
by also modifying VOF. This is not only for pegasos2. An ELF file does not 
necessarily have the entry point equal to its load address so while you 
happen to have a kernel that does that now you could have another later 
that won't so supporting it in some way would be the right thing to do 
anyway. Also I can't decide what is better, using a gpr or a device tree 
property so whatever you prefer. You probably need a respin anyway so 
adding it to that seems to be simpler than for me to try starting 
compiling and testing VOF too (also I can't test with spapr so can't make 
sure the changes I make won't break something). So you seem to be a better 
position for VOF changes. I think I only need these:

1. Get rid of the ld 64 bit opcode in _prom_entry
2. Support ELF with entry point != load address
3. Have a way to disable ci_entry after quiesce so it won't do sc 1 that 
would generate exception in my case otherwise or ignore that exception 
within VOF.

I think other issues have been already resolved with your latest patch 
version unless I forgot something.

Regards,
BALATON Zoltan
BALATON Zoltan May 23, 2021, 5:09 p.m. UTC | #19
On Sun, 23 May 2021, BALATON Zoltan wrote:
> On Sun, 23 May 2021, Alexey Kardashevskiy wrote:
>> One thing to note about PCI is that normally I think the client expects the 
>> firmware to do PCI probing and SLOF does it. But VOF does not and Linux 
>> scans PCI bus(es) itself. Might be a problem for you kernel.
>
> I'm not sure what info does MorphOS get from the device tree and what it 
> probes itself but I think it may at least need device ids and info about the 
> PCI bus to be able to access the config regs, after that it should set the 
> devices up hopefully. I could add these from the board code to device tree so 
> VOF does not need to do anything about it. However I'm not getting to that 
> point yet because it crashes on something that it's missing and couldn't yet 
> find out what is that.
>
> I'd like to get Linux working now as that would be enough to test this and 
> then if for MorphOS we still need a ROM it's not a problem if at least we can 
> boot Linux without the original firmware. But I can't make Linux open a 
> serial console and I don't know what it needs for that. Do you happen to 
> know? I've looked at the sources in Linux/arch/powerpc but not sure how it 
> would find and open a serial port on pegasos2. It seems to work with the 
> board firmware and now I can get it to boot with VOF but then it does not 
> open serial so it probably needs something in the device tree or expects the 
> firmware to set something up that we should add in pegasos2.c when using VOF.

I've now found that Linux uses rtas methods read-pci-config and 
write-pci-config for PCI access on pegasos2 so this means that we'll 
probably need rtas too (I hoped we could get away without it if it were 
only used for shutdown/reboot or so but seems Linux needs it for PCI as 
well and does not scan the bus and won't find some devices without it).

While VOF can do rtas, this causes a problem with the hypercall method 
using sc 1 that goes through vhyp but trips the assert in ppc_store_sdr1() 
so cannot work after guest is past quiesce. So the question is why is that 
assert there and would using sc 1 for hypercalls on pegasos2 cause other 
problems later even if the assert could be removed? Can somebody who knows 
more about it explain this please? If this cannot be resolved then we may 
need a different hypercall method on pegasos2 (I've considered MOL OSI or 
are there other options? I may use some advice from people who know it 
better, especially the possible interaction with KVM later as the long 
term goal with pegasos2 is to be able to run with KVM on PPC hardware 
eventually.) But this also means that if that assert cannot be dropped or 
there may be other problems with sc 1 hypercalls then we maybe cannot have 
the same vof.bin and we'll need a separate version that I would like to 
avoid if possible so if there's a simple way to keep it working or make 
vof.bin use alternate hypercall method without needing a separate binary 
that would be the direction I'd tend to go. Even if we need a seoarate 
version I'd like to keep as much common as possible.

I've tested that the missing rtas is not the reason for getting no output 
via serial though, as even when disabling rtas on pegasos2.rom it boots 
and I still get serial output just some PCI devices are not detected (such 
as USB, the video card and the not emulated ethernet port but these are 
not fatal so it might even work as a first try without rtas, just to boot 
a Linux kernel for testing it would be enough if I can fix the serial 
output). I still don't know why it's not finding serial but I think it may 
be some missing or wrong info in the device tree I generat. I'll try to 
focus on this for now and leave the above rtas question for later.

Regards,
BALATON Zoltan
Alexey Kardashevskiy May 24, 2021, 4:26 a.m. UTC | #20
On 5/23/21 21:24, BALATON Zoltan wrote:
> On Sun, 23 May 2021, Alexey Kardashevskiy wrote:
>> On 23/05/2021 01:02, BALATON Zoltan wrote:
>>> On Sat, 22 May 2021, BALATON Zoltan wrote:
>>>> On Sat, 22 May 2021, Alexey Kardashevskiy wrote:
>>>>> VOF itself does not prints anything in this patch.
>>>>
>>>> However it seems to be needed for linux as the first thing it does 
>>>> seems to be getting /chosen/stdout and calls exit if it returns 
>>>> nothing. So I'll need this at least for linux. (I think MorphOS may 
>>>> also query it to print a banner or some messages but not sure it 
>>>> needs it, at least it does not abort right away if not found.)
>>>>
>>>>>> but to see Linux output do I need a stdout in VOF or it will just 
>>>>>> open the serial with its own driver and use that?
>>>>>> So I'm not sure what's the stdout parts in the current vof patch 
>>>>>> does and if I need that for anything. I'll try to experiment with 
>>>>>> it some more but fixing the ld and Kconfig seems to be enough to 
>>>>>> get it work for me.
>>>>>
>>>>> So for the client to print something, /chosen/stdout needs to have 
>>>>> a valid ihandle.
>>>>> The only way to get a valid ihandle is having a valid phandle which 
>>>>> vof_client_open() can open.
>>>>> A valid phandle is a phandle of any node in the device tree. On 
>>>>> spapr we pick some spapr-vty, open it and store in /chosen/stdout.
>>>>>
>>>>> From this point output from the client can be seen via a tracepoint.
>>>
>>> I've got it now. Looking at the original firmware device tree dump:
>>>
>>> https://osdn.net/projects/qmiga/wiki/SubprojectPegasos2/attach/PegasosII_OFW-Dump.txt 
>>>
>>> I see that /chosen/stdout points to "screen" which is an alias to 
>>> /bootconsole. Just adding an empty /bootconsole node in the device 
>>> tree and vof_client_open_store() that as /chosen/stdout works and I 
>>> get output via vof_write traces so this is enough for now to test 
>>> Linux. Properly connecting a serial backend can thus be postponed.
>>>
>>> So with this the Linux kernel does not abort on the first device tree 
>>> access but starts to decompress itself then the embedded initrd and 
>>> crashes at calling setprop:
>>>
>>> [...]
>>> vof_client_handle: setprop
>>>
>>> Thread 4 "qemu-system-ppc" received signal SIGSEGV, Segmentation fault.
>>> (gdb) bt
>>> #0  0x0000000000000000 in  ()
>>> #1  0x0000555555a5c2bf in vof_setprop
>>>      (vof=0x7ffff48e9420, vallen=4, valaddr=<optimized out>, 
>>> pname=<optimized out>, nodeph=8, fdt=0x7fff8aaff010, ms=0x5555564f8800)
>>>      at ../hw/ppc/vof.c:308
>>> #2  0x0000555555a5c2bf in vof_client_handle
>>>      (nrets=1, rets=0x7ffff48e93f0, nargs=4, args=0x7ffff48e93c0, 
>>> service=0x7ffff48e9460 "setprop",
>>>       vof=0x7ffff48e9420, fdt=0x7fff8aaff010, ms=0x5555564f8800) at 
>>> ../hw/ppc/vof.c:842
>>> #3  0x0000555555a5c2bf in vof_client_call
>>>      (ms=0x5555564f8800, vof=vof@entry=0x55555662a3d0, 
>>> fdt=fdt@entry=0x7fff8aaff010, args_real=args_real@entry=23580472)
>>>      at ../hw/ppc/vof.c:935
>>>
>>> loooks like it's trying to set /chosen/linux,initrd-start:
>>
>> It is not horribly clear why it crashed though.
> 
> It crashed becuase I had TYPE_VOF_MACHINE_IF but did not set a setprop 
> callback and it tried to call that here. Adding a {return true;} empty 
> callback avoids this.


Ah ok.

> 
>>> (gdb) up
>>> #1  0x0000555555a5c2bf in vof_setprop (vof=0x7ffff48e9420, vallen=4, 
>>> valaddr=<optimized out>, pname=<optimized out>, nodeph=8,
>>>      fdt=0x7fff8aaff010, ms=0x5555564f8800) at ../hw/ppc/vof.c:308
>>> 308            if (!vmc->setprop(ms, nodepath, propname, val, vallen)) {
>>> (gdb) p nodepath
>>> $1 = "/chosen\000\060/rPC,750CXE/", '\000' <repeats 234 times>
>>> (gdb) p propname
>>> $2 = 
>>> "linux,initrd-start\000linux,initrd-end\000linux,cmdline-timeout\000bootarg" 
>>> (gdb) p val
>>> $3 = <optimized out>
>>>
>>> I think I need the callback for setprop in TYPE_VOF_MACHINE_IF. I can 
>>> copy spapr_vof_setprop() but some explanation on why that's needed 
>>> might help. Ciould I just do fdt_setprop in my callback as 
>>> vof_setprop() would do without a machine callback or is there some 
>>> special handling needed for these properties?
>>
>> The short answer is yes, you do not need TYPE_VOF_MACHINE_IF.
>>
>> The long answer is that we build the FDT on spapr twice:
>> 1. at the reset time and
>> 2. after "ibm,client-arhitecture-support" (early in the boot the spapr 
>> paravirtual client says what it supports - ISA level, MMU features, etc)
>>
>> Between 1 and 2 the kernel moves initrd and we do not update the 
>> QEMU's version of its location, the tree at 2) will have the old values.
>>
>> So for that reason I have TYPE_VOF_MACHINE_IF. You most definitely do 
>> not need it.
> 
> I need TYPE_VOF_MACHINE_IF because that has the quiesce callback that I 
> need to shut VOF down when the guest is finished with it otherwise it 
> would crash later (more on this in next message).

Nah, quiesce() only means stopping IO in VOF. VOF is shut down when the 
client decides to stop using it (and zero that memory).

> But since I shut down 
> VOF here I don't need to remember changes to the FDT so I can just use 
> an empty setprop callback. (I wouldn't even need that if VOF would check 
> that a callback is non-NULL before calling it.)

I'll add the check.

I'll need some time to go though the other mails, closer to the weekend, 
there are too many gaps in my knowledge about those 32bit systems.

I am really not sure that you need TYPE_PPC_VIRTUAL_HYPERVISOR (is this 
just to make "sc 1" work? there should be a better way) or RTAS 
(although it looks like you need it for PCI, you likely do not need it 
for your serial device which is ISA which I have no idea how it works). 
Do you have an actual machine? Can you dump its device tree to see what 
yours is missing?
David Gibson May 24, 2021, 5:23 a.m. UTC | #21
On Thu, May 20, 2021 at 11:59:07PM +0200, BALATON Zoltan wrote:
> On Thu, 20 May 2021, Alexey Kardashevskiy wrote:
> > The PAPR platform describes an OS environment that's presented by
> > a combination of a hypervisor and firmware. The features it specifies
> > require collaboration between the firmware and the hypervisor.
> > 
> > Since the beginning, the runtime component of the firmware (RTAS) has
> > been implemented as a 20 byte shim which simply forwards it to
> > a hypercall implemented in qemu. The boot time firmware component is
> > SLOF - but a build that's specific to qemu, and has always needed to be
> > updated in sync with it. Even though we've managed to limit the amount
> > of runtime communication we need between qemu and SLOF, there's some,
> > and it has become increasingly awkward to handle as we've implemented
> > new features.
> > 
> > This implements a boot time OF client interface (CI) which is
> > enabled by a new "x-vof" pseries machine option (stands for "Virtual Open
> > Firmware). When enabled, QEMU implements the custom H_OF_CLIENT hcall
> > which implements Open Firmware Client Interface (OF CI). This allows
> > using a smaller stateless firmware which does not have to manage
> > the device tree.
> > 
> > The new "vof.bin" firmware image is included with source code under
> > pc-bios/. It also includes RTAS blob.
> > 
> > This implements a handful of CI methods just to get -kernel/-initrd
> > working. In particular, this implements the device tree fetching and
> > simple memory allocator - "claim" (an OF CI memory allocator) and updates
> > "/memory@0/available" to report the client about available memory.
> > 
> > This implements changing some device tree properties which we know how
> > to deal with, the rest is ignored. To allow changes, this skips
> > fdt_pack() when x-vof=on as not packing the blob leaves some room for
> > appending.
> > 
> > In absence of SLOF, this assigns phandles to device tree nodes to make
> > device tree traversing work.
> > 
> > When x-vof=on, this adds "/chosen" every time QEMU (re)builds a tree.
> > 
> > This adds basic instances support which are managed by a hash map
> > ihandle -> [phandle].
> > 
> > Before the guest started, the used memory is:
> > 0..e60 - the initial firmware
> > 8000..10000 - stack
> > 400000.. - kernel
> > 3ea0000.. - initramdisk
> > 
> > This OF CI does not implement "interpret".
> > 
> > Unlike SLOF, this does not format uninitialized nvram. Instead, this
> > includes a disk image with pre-formatted nvram.
> > 
> > With this basic support, this can only boot into kernel directly.
> > However this is just enough for the petitboot kernel and initradmdisk to
> > boot from any possible source. Note this requires reasonably recent guest
> > kernel with:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df5be5be8735
> > 
> > The immediate benefit is much faster booting time which especially
> > crucial with fully emulated early CPU bring up environments. Also this
> > may come handy when/if GRUB-in-the-userspace sees light of the day.
> > 
> > This separates VOF and sPAPR in a hope that VOF bits may be reused by
> > other POWERPC boards which do not support pSeries.
> > 
> > This is coded in assumption that later on we might be adding support for
> > booting from QEMU backends (blockdev is the first candidate) without
> > devices/drivers in between as OF1275 does not require that and
> > it is quite easy to so.
> > 
> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > ---
> > 
> > The example command line is:
> > 
> > /home/aik/pbuild/qemu-killslof-localhost-ppc64/qemu-system-ppc64 \
> > -nodefaults \
> > -chardev stdio,id=STDIO0,signal=off,mux=on \
> > -device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
> > -mon id=MON0,chardev=STDIO0,mode=readline \
> > -nographic \
> > -vga none \
> > -enable-kvm \
> > -m 8G \
> > -machine pseries,x-vof=on,cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off \
> > -kernel pbuild/kernel-le-guest/vmlinux \
> > -initrd pb/rootfs.cpio.xz \
> > -drive id=DRIVE0,if=none,file=./p/qemu-killslof/pc-bios/vof-nvram.bin,format=raw \
> > -global spapr-nvram.drive=DRIVE0 \
> > -snapshot \
> > -smp 8,threads=8 \
> > -L /home/aik/t/qemu-ppc64-bios/ \
> > -trace events=qemu_trace_events \
> > -d guest_errors \
> > -chardev socket,id=SOCKET0,server,nowait,path=qemu.mon.tmux26 \
> > -mon chardev=SOCKET0,mode=control
> > 
> > ---
> > Changes:
> > v20:
> > * compile vof.bin with -mcpu=power4 for better compatibility
> > * s/std/stw/ in entry.S to make it work on ppc32
> > * fixed dt_available property to support both 32 and 64bit
> > * shuffled prom_args handling code
> > * do not enforce 32bit in MSR (again, to support 32bit platforms)
> > 
> 
> [...]
> 
> > diff --git a/default-configs/devices/ppc64-softmmu.mak b/default-configs/devices/ppc64-softmmu.mak
> > index ae0841fa3a18..9fb201dfacfa 100644
> > --- a/default-configs/devices/ppc64-softmmu.mak
> > +++ b/default-configs/devices/ppc64-softmmu.mak
> > @@ -9,3 +9,4 @@ CONFIG_POWERNV=y
> >  # For pSeries
> >  CONFIG_PSERIES=y
> >  CONFIG_NVDIMM=y
> > +CONFIG_VOF=y
> > diff --git a/hw/ppc/Kconfig b/hw/ppc/Kconfig
> > index e51e0e5e5ac6..964510dfc73d 100644
> > --- a/hw/ppc/Kconfig
> > +++ b/hw/ppc/Kconfig
> > @@ -143,3 +143,6 @@ config FW_CFG_PPC
> > 
> >  config FDT_PPC
> >      bool
> > +
> > +config VOF
> > +    bool
> 
> I think you should just add "select VOF" to config PSERIES section in
> Kconfig instead of adding it to default-configs/devices/ppc64-softmmu.mak.
> That should do it, it works in my updated pegasos2 patch:

No, we don't want a "select": PSERIES doesn't require VOF while we
still support SLOF, and indeed we're quite a ways from being ready to
even make VOF the default pseries firmware.
David Gibson May 24, 2021, 5:40 a.m. UTC | #22
On Mon, May 24, 2021 at 02:26:42PM +1000, Alexey Kardashevskiy wrote:
> 
> 
> On 5/23/21 21:24, BALATON Zoltan wrote:
> > On Sun, 23 May 2021, Alexey Kardashevskiy wrote:
> > > On 23/05/2021 01:02, BALATON Zoltan wrote:
> > > > On Sat, 22 May 2021, BALATON Zoltan wrote:
> > > > > On Sat, 22 May 2021, Alexey Kardashevskiy wrote:
> > > > > > VOF itself does not prints anything in this patch.
> > > > > 
> > > > > However it seems to be needed for linux as the first thing
> > > > > it does seems to be getting /chosen/stdout and calls exit if
> > > > > it returns nothing. So I'll need this at least for linux. (I
> > > > > think MorphOS may also query it to print a banner or some
> > > > > messages but not sure it needs it, at least it does not
> > > > > abort right away if not found.)
> > > > > 
> > > > > > > but to see Linux output do I need a stdout in VOF or
> > > > > > > it will just open the serial with its own driver and
> > > > > > > use that?
> > > > > > > So I'm not sure what's the stdout parts in the
> > > > > > > current vof patch does and if I need that for
> > > > > > > anything. I'll try to experiment with it some more
> > > > > > > but fixing the ld and Kconfig seems to be enough to
> > > > > > > get it work for me.
> > > > > > 
> > > > > > So for the client to print something, /chosen/stdout
> > > > > > needs to have a valid ihandle.
> > > > > > The only way to get a valid ihandle is having a valid
> > > > > > phandle which vof_client_open() can open.
> > > > > > A valid phandle is a phandle of any node in the device
> > > > > > tree. On spapr we pick some spapr-vty, open it and store
> > > > > > in /chosen/stdout.
> > > > > > 
> > > > > > From this point output from the client can be seen via a tracepoint.
> > > > 
> > > > I've got it now. Looking at the original firmware device tree dump:
> > > > 
> > > > https://osdn.net/projects/qmiga/wiki/SubprojectPegasos2/attach/PegasosII_OFW-Dump.txt
> > > > 
> > > > I see that /chosen/stdout points to "screen" which is an alias
> > > > to /bootconsole. Just adding an empty /bootconsole node in the
> > > > device tree and vof_client_open_store() that as /chosen/stdout
> > > > works and I get output via vof_write traces so this is enough
> > > > for now to test Linux. Properly connecting a serial backend can
> > > > thus be postponed.
> > > > 
> > > > So with this the Linux kernel does not abort on the first device
> > > > tree access but starts to decompress itself then the embedded
> > > > initrd and crashes at calling setprop:
> > > > 
> > > > [...]
> > > > vof_client_handle: setprop
> > > > 
> > > > Thread 4 "qemu-system-ppc" received signal SIGSEGV, Segmentation fault.
> > > > (gdb) bt
> > > > #0  0x0000000000000000 in  ()
> > > > #1  0x0000555555a5c2bf in vof_setprop
> > > >      (vof=0x7ffff48e9420, vallen=4, valaddr=<optimized out>,
> > > > pname=<optimized out>, nodeph=8, fdt=0x7fff8aaff010,
> > > > ms=0x5555564f8800)
> > > >      at ../hw/ppc/vof.c:308
> > > > #2  0x0000555555a5c2bf in vof_client_handle
> > > >      (nrets=1, rets=0x7ffff48e93f0, nargs=4,
> > > > args=0x7ffff48e93c0, service=0x7ffff48e9460 "setprop",
> > > >       vof=0x7ffff48e9420, fdt=0x7fff8aaff010, ms=0x5555564f8800)
> > > > at ../hw/ppc/vof.c:842
> > > > #3  0x0000555555a5c2bf in vof_client_call
> > > >      (ms=0x5555564f8800, vof=vof@entry=0x55555662a3d0,
> > > > fdt=fdt@entry=0x7fff8aaff010,
> > > > args_real=args_real@entry=23580472)
> > > >      at ../hw/ppc/vof.c:935
> > > > 
> > > > loooks like it's trying to set /chosen/linux,initrd-start:
> > > 
> > > It is not horribly clear why it crashed though.
> > 
> > It crashed becuase I had TYPE_VOF_MACHINE_IF but did not set a setprop
> > callback and it tried to call that here. Adding a {return true;} empty
> > callback avoids this.
> 
> 
> Ah ok.
> 
> > 
> > > > (gdb) up
> > > > #1  0x0000555555a5c2bf in vof_setprop (vof=0x7ffff48e9420,
> > > > vallen=4, valaddr=<optimized out>, pname=<optimized out>,
> > > > nodeph=8,
> > > >      fdt=0x7fff8aaff010, ms=0x5555564f8800) at ../hw/ppc/vof.c:308
> > > > 308            if (!vmc->setprop(ms, nodepath, propname, val, vallen)) {
> > > > (gdb) p nodepath
> > > > $1 = "/chosen\000\060/rPC,750CXE/", '\000' <repeats 234 times>
> > > > (gdb) p propname
> > > > $2 = "linux,initrd-start\000linux,initrd-end\000linux,cmdline-timeout\000bootarg"
> > > > (gdb) p val
> > > > $3 = <optimized out>
> > > > 
> > > > I think I need the callback for setprop in TYPE_VOF_MACHINE_IF.
> > > > I can copy spapr_vof_setprop() but some explanation on why
> > > > that's needed might help. Ciould I just do fdt_setprop in my
> > > > callback as vof_setprop() would do without a machine callback or
> > > > is there some special handling needed for these properties?
> > > 
> > > The short answer is yes, you do not need TYPE_VOF_MACHINE_IF.
> > > 
> > > The long answer is that we build the FDT on spapr twice:
> > > 1. at the reset time and
> > > 2. after "ibm,client-arhitecture-support" (early in the boot the
> > > spapr paravirtual client says what it supports - ISA level, MMU
> > > features, etc)
> > > 
> > > Between 1 and 2 the kernel moves initrd and we do not update the
> > > QEMU's version of its location, the tree at 2) will have the old
> > > values.
> > > 
> > > So for that reason I have TYPE_VOF_MACHINE_IF. You most definitely
> > > do not need it.
> > 
> > I need TYPE_VOF_MACHINE_IF because that has the quiesce callback that I
> > need to shut VOF down when the guest is finished with it otherwise it
> > would crash later (more on this in next message).
> 
> Nah, quiesce() only means stopping IO in VOF. VOF is shut down when the
> client decides to stop using it (and zero that memory).
> 
> > But since I shut down VOF here I don't need to remember changes to the
> > FDT so I can just use an empty setprop callback. (I wouldn't even need
> > that if VOF would check that a callback is non-NULL before calling it.)
> 
> I'll add the check.
> 
> I'll need some time to go though the other mails, closer to the weekend,
> there are too many gaps in my knowledge about those 32bit systems.
> 
> I am really not sure that you need TYPE_PPC_VIRTUAL_HYPERVISOR (is this just
> to make "sc 1" work? there should be a better way) or RTAS (although it
> looks like you need it for PCI, you likely do not need it for your serial
> device which is ISA which I have no idea how it works). Do you have an
> actual machine? Can you dump its device tree to see what yours is missing?

IIUC, it's basicaly so that the 'sc 1' instructions can be routed
through to VOF.  'sc 1' is an illegal instruction on ppc32, AFAIK, so
we need some sort of hack here.

vhyp wasn't really designed for this, but I suspect it is the simplest
way to intercept those 'sc 1' calls.

Unfortunately, shutting it down presents a real problem.  Currently
you're relying on quiesce being the last call to OF the client makes.
That's often the case in practice, but not necessarily in all cases,
as you've seen.  However, there's no alternative point at which we can
determine that we're done with the client interface.

My inclination for now would be to just leave the vhyp handler in
place.  Strictly speaking it won't give you correct behaviour: later
calls to 'sc 1' will invoke VOF rather than giving a 0x700 exception.
But nothing on a 32-bit system should be attempting 'sc 1' anyway, so
I think it will probably work in practice.
David Gibson May 24, 2021, 6:01 a.m. UTC | #23
On Sun, May 23, 2021 at 07:09:26PM +0200, BALATON Zoltan wrote:
> On Sun, 23 May 2021, BALATON Zoltan wrote:
> > On Sun, 23 May 2021, Alexey Kardashevskiy wrote:
> > > One thing to note about PCI is that normally I think the client
> > > expects the firmware to do PCI probing and SLOF does it. But VOF
> > > does not and Linux scans PCI bus(es) itself. Might be a problem for
> > > you kernel.
> > 
> > I'm not sure what info does MorphOS get from the device tree and what it
> > probes itself but I think it may at least need device ids and info about
> > the PCI bus to be able to access the config regs, after that it should
> > set the devices up hopefully. I could add these from the board code to
> > device tree so VOF does not need to do anything about it. However I'm
> > not getting to that point yet because it crashes on something that it's
> > missing and couldn't yet find out what is that.
> > 
> > I'd like to get Linux working now as that would be enough to test this
> > and then if for MorphOS we still need a ROM it's not a problem if at
> > least we can boot Linux without the original firmware. But I can't make
> > Linux open a serial console and I don't know what it needs for that. Do
> > you happen to know? I've looked at the sources in Linux/arch/powerpc but
> > not sure how it would find and open a serial port on pegasos2. It seems
> > to work with the board firmware and now I can get it to boot with VOF
> > but then it does not open serial so it probably needs something in the
> > device tree or expects the firmware to set something up that we should
> > add in pegasos2.c when using VOF.
> 
> I've now found that Linux uses rtas methods read-pci-config and
> write-pci-config for PCI access on pegasos2 so this means that we'll
> probably need rtas too (I hoped we could get away without it if it were only
> used for shutdown/reboot or so but seems Linux needs it for PCI as well and
> does not scan the bus and won't find some devices without it).

Yes, definitely sounds like you'll need an RTAS implementation.

> While VOF can do rtas, this causes a problem with the hypercall method using
> sc 1 that goes through vhyp but trips the assert in ppc_store_sdr1() so
> cannot work after guest is past quiesce.

> So the question is why is that
> assert there

Ah.. right.  So, vhyp was designed for the PAPR use case, where we
want to model the CPU when it's in supervisor and user mode, but not
when it's in hypervisor mode.  We want qemu to mimic the behaviour of
the hypervisor, rather than attempting to actually execute hypervisor
code in the virtual CPU.

On systems that have a hypervisor mode, SDR1 is hypervisor privileged,
so it makes no sense for the guest to attempt to set it.  That should
be caught by the general SPR code and turned into a 0x700, hence the
assert() if we somehow reach ppc_store_sdr1().

So, we are seeing a problem here because you want the 'sc 1'
interception of vhyp, but not the rest of the stuff that goes with it.

> and would using sc 1 for hypercalls on pegasos2 cause other
> problems later even if the assert could be removed?

At least in the short term, I think you probably can remove the
assert.  In your case the 'sc 1' calls aren't truly to a hypervisor,
but a special case escape to qemu for the firmware emulation.  I think
it's unlikely to cause problems later, because nothing on a 32-bit
system should be attempting an 'sc 1'.  The only thing I can think of
that would fail is some test case which explicitly verified that 'sc
1' triggered a 0x700 (SIGILL from userspace).

> Can somebody who knows
> more about it explain this please? If this cannot be resolved then we may
> need a different hypercall method on pegasos2 (I've considered MOL OSI or
> are there other options? I may use some advice from people who know it
> better, especially the possible interaction with KVM later as the long term
> goal with pegasos2 is to be able to run with KVM on PPC hardware
> eventually.)

Right, you might need an alternative method eventually.  Really any
illegal instruction for your cpu is a possible candidate.  Bear in
mind that this is *not* truly a hypercall interface, instead it's
something we're special casing for the purposes of faking the
firmware.

The "attn" instruction used on BookE might be a reasonable candidate
(assuming it doesn't conflict with something on 32-bit BookS) - that's
often used for things like signalling the attention of hardware
debuggers, and this is somewhat akin.

Mostly it's just a matter of working out what would be least messy to
intercept in the TCG instruction decoding path.


> But this also means that if that assert cannot be dropped or
> there may be other problems with sc 1 hypercalls then we maybe cannot have
> the same vof.bin and we'll need a separate version that I would like to
> avoid if possible so if there's a simple way to keep it working or make
> vof.bin use alternate hypercall method without needing a separate binary
> that would be the direction I'd tend to go. Even if we need a seoarate
> version I'd like to keep as much common as possible.
> 
> I've tested that the missing rtas is not the reason for getting no output
> via serial though, as even when disabling rtas on pegasos2.rom it boots and
> I still get serial output just some PCI devices are not detected (such as
> USB, the video card and the not emulated ethernet port but these are not
> fatal so it might even work as a first try without rtas, just to boot a
> Linux kernel for testing it would be enough if I can fix the serial output).
> I still don't know why it's not finding serial but I think it may be some
> missing or wrong info in the device tree I generat. I'll try to focus on
> this for now and leave the above rtas question for later.

Oh.. another thought on that.  You have an ISA serial port on Pegasos,
I believe.  I wonder if the PCI->ISA bridge needs some configuration /
initialization that the firmware is expected to do.  If so you'll need
to mimic that setup in qemu for the VOF case.
BALATON Zoltan May 24, 2021, 9:57 a.m. UTC | #24
On Mon, 24 May 2021, David Gibson wrote:
> On Thu, May 20, 2021 at 11:59:07PM +0200, BALATON Zoltan wrote:
>> On Thu, 20 May 2021, Alexey Kardashevskiy wrote:
>>> The PAPR platform describes an OS environment that's presented by
>>> a combination of a hypervisor and firmware. The features it specifies
>>> require collaboration between the firmware and the hypervisor.
>>>
>>> Since the beginning, the runtime component of the firmware (RTAS) has
>>> been implemented as a 20 byte shim which simply forwards it to
>>> a hypercall implemented in qemu. The boot time firmware component is
>>> SLOF - but a build that's specific to qemu, and has always needed to be
>>> updated in sync with it. Even though we've managed to limit the amount
>>> of runtime communication we need between qemu and SLOF, there's some,
>>> and it has become increasingly awkward to handle as we've implemented
>>> new features.
>>>
>>> This implements a boot time OF client interface (CI) which is
>>> enabled by a new "x-vof" pseries machine option (stands for "Virtual Open
>>> Firmware). When enabled, QEMU implements the custom H_OF_CLIENT hcall
>>> which implements Open Firmware Client Interface (OF CI). This allows
>>> using a smaller stateless firmware which does not have to manage
>>> the device tree.
>>>
>>> The new "vof.bin" firmware image is included with source code under
>>> pc-bios/. It also includes RTAS blob.
>>>
>>> This implements a handful of CI methods just to get -kernel/-initrd
>>> working. In particular, this implements the device tree fetching and
>>> simple memory allocator - "claim" (an OF CI memory allocator) and updates
>>> "/memory@0/available" to report the client about available memory.
>>>
>>> This implements changing some device tree properties which we know how
>>> to deal with, the rest is ignored. To allow changes, this skips
>>> fdt_pack() when x-vof=on as not packing the blob leaves some room for
>>> appending.
>>>
>>> In absence of SLOF, this assigns phandles to device tree nodes to make
>>> device tree traversing work.
>>>
>>> When x-vof=on, this adds "/chosen" every time QEMU (re)builds a tree.
>>>
>>> This adds basic instances support which are managed by a hash map
>>> ihandle -> [phandle].
>>>
>>> Before the guest started, the used memory is:
>>> 0..e60 - the initial firmware
>>> 8000..10000 - stack
>>> 400000.. - kernel
>>> 3ea0000.. - initramdisk
>>>
>>> This OF CI does not implement "interpret".
>>>
>>> Unlike SLOF, this does not format uninitialized nvram. Instead, this
>>> includes a disk image with pre-formatted nvram.
>>>
>>> With this basic support, this can only boot into kernel directly.
>>> However this is just enough for the petitboot kernel and initradmdisk to
>>> boot from any possible source. Note this requires reasonably recent guest
>>> kernel with:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df5be5be8735
>>>
>>> The immediate benefit is much faster booting time which especially
>>> crucial with fully emulated early CPU bring up environments. Also this
>>> may come handy when/if GRUB-in-the-userspace sees light of the day.
>>>
>>> This separates VOF and sPAPR in a hope that VOF bits may be reused by
>>> other POWERPC boards which do not support pSeries.
>>>
>>> This is coded in assumption that later on we might be adding support for
>>> booting from QEMU backends (blockdev is the first candidate) without
>>> devices/drivers in between as OF1275 does not require that and
>>> it is quite easy to so.
>>>
>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>> ---
>>>
>>> The example command line is:
>>>
>>> /home/aik/pbuild/qemu-killslof-localhost-ppc64/qemu-system-ppc64 \
>>> -nodefaults \
>>> -chardev stdio,id=STDIO0,signal=off,mux=on \
>>> -device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
>>> -mon id=MON0,chardev=STDIO0,mode=readline \
>>> -nographic \
>>> -vga none \
>>> -enable-kvm \
>>> -m 8G \
>>> -machine pseries,x-vof=on,cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off \
>>> -kernel pbuild/kernel-le-guest/vmlinux \
>>> -initrd pb/rootfs.cpio.xz \
>>> -drive id=DRIVE0,if=none,file=./p/qemu-killslof/pc-bios/vof-nvram.bin,format=raw \
>>> -global spapr-nvram.drive=DRIVE0 \
>>> -snapshot \
>>> -smp 8,threads=8 \
>>> -L /home/aik/t/qemu-ppc64-bios/ \
>>> -trace events=qemu_trace_events \
>>> -d guest_errors \
>>> -chardev socket,id=SOCKET0,server,nowait,path=qemu.mon.tmux26 \
>>> -mon chardev=SOCKET0,mode=control
>>>
>>> ---
>>> Changes:
>>> v20:
>>> * compile vof.bin with -mcpu=power4 for better compatibility
>>> * s/std/stw/ in entry.S to make it work on ppc32
>>> * fixed dt_available property to support both 32 and 64bit
>>> * shuffled prom_args handling code
>>> * do not enforce 32bit in MSR (again, to support 32bit platforms)
>>>
>>
>> [...]
>>
>>> diff --git a/default-configs/devices/ppc64-softmmu.mak b/default-configs/devices/ppc64-softmmu.mak
>>> index ae0841fa3a18..9fb201dfacfa 100644
>>> --- a/default-configs/devices/ppc64-softmmu.mak
>>> +++ b/default-configs/devices/ppc64-softmmu.mak
>>> @@ -9,3 +9,4 @@ CONFIG_POWERNV=y
>>>  # For pSeries
>>>  CONFIG_PSERIES=y
>>>  CONFIG_NVDIMM=y
>>> +CONFIG_VOF=y
>>> diff --git a/hw/ppc/Kconfig b/hw/ppc/Kconfig
>>> index e51e0e5e5ac6..964510dfc73d 100644
>>> --- a/hw/ppc/Kconfig
>>> +++ b/hw/ppc/Kconfig
>>> @@ -143,3 +143,6 @@ config FW_CFG_PPC
>>>
>>>  config FDT_PPC
>>>      bool
>>> +
>>> +config VOF
>>> +    bool
>>
>> I think you should just add "select VOF" to config PSERIES section in
>> Kconfig instead of adding it to default-configs/devices/ppc64-softmmu.mak.
>> That should do it, it works in my updated pegasos2 patch:
>
> No, we don't want a "select": PSERIES doesn't require VOF while we
> still support SLOF, and indeed we're quite a ways from being ready to
> even make VOF the default pseries firmware.

Shouldn't you then also need to make code in spapr adding x-vof 
conditional on CONFIG_VOF or make sure it cannot be enabled if not 
compiled in? Otherwise it means VOF is always an option so spapr depends 
on VOF for which select is the way to describe that.

Regards,
BALATON Zoltan
David Gibson May 24, 2021, 10:50 a.m. UTC | #25
On Mon, May 24, 2021 at 11:57:27AM +0200, BALATON Zoltan wrote:
> On Mon, 24 May 2021, David Gibson wrote:
> > On Thu, May 20, 2021 at 11:59:07PM +0200, BALATON Zoltan wrote:
> > > On Thu, 20 May 2021, Alexey Kardashevskiy wrote:
> > > > The PAPR platform describes an OS environment that's presented by
> > > > a combination of a hypervisor and firmware. The features it specifies
> > > > require collaboration between the firmware and the hypervisor.
> > > > 
> > > > Since the beginning, the runtime component of the firmware (RTAS) has
> > > > been implemented as a 20 byte shim which simply forwards it to
> > > > a hypercall implemented in qemu. The boot time firmware component is
> > > > SLOF - but a build that's specific to qemu, and has always needed to be
> > > > updated in sync with it. Even though we've managed to limit the amount
> > > > of runtime communication we need between qemu and SLOF, there's some,
> > > > and it has become increasingly awkward to handle as we've implemented
> > > > new features.
> > > > 
> > > > This implements a boot time OF client interface (CI) which is
> > > > enabled by a new "x-vof" pseries machine option (stands for "Virtual Open
> > > > Firmware). When enabled, QEMU implements the custom H_OF_CLIENT hcall
> > > > which implements Open Firmware Client Interface (OF CI). This allows
> > > > using a smaller stateless firmware which does not have to manage
> > > > the device tree.
> > > > 
> > > > The new "vof.bin" firmware image is included with source code under
> > > > pc-bios/. It also includes RTAS blob.
> > > > 
> > > > This implements a handful of CI methods just to get -kernel/-initrd
> > > > working. In particular, this implements the device tree fetching and
> > > > simple memory allocator - "claim" (an OF CI memory allocator) and updates
> > > > "/memory@0/available" to report the client about available memory.
> > > > 
> > > > This implements changing some device tree properties which we know how
> > > > to deal with, the rest is ignored. To allow changes, this skips
> > > > fdt_pack() when x-vof=on as not packing the blob leaves some room for
> > > > appending.
> > > > 
> > > > In absence of SLOF, this assigns phandles to device tree nodes to make
> > > > device tree traversing work.
> > > > 
> > > > When x-vof=on, this adds "/chosen" every time QEMU (re)builds a tree.
> > > > 
> > > > This adds basic instances support which are managed by a hash map
> > > > ihandle -> [phandle].
> > > > 
> > > > Before the guest started, the used memory is:
> > > > 0..e60 - the initial firmware
> > > > 8000..10000 - stack
> > > > 400000.. - kernel
> > > > 3ea0000.. - initramdisk
> > > > 
> > > > This OF CI does not implement "interpret".
> > > > 
> > > > Unlike SLOF, this does not format uninitialized nvram. Instead, this
> > > > includes a disk image with pre-formatted nvram.
> > > > 
> > > > With this basic support, this can only boot into kernel directly.
> > > > However this is just enough for the petitboot kernel and initradmdisk to
> > > > boot from any possible source. Note this requires reasonably recent guest
> > > > kernel with:
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df5be5be8735
> > > > 
> > > > The immediate benefit is much faster booting time which especially
> > > > crucial with fully emulated early CPU bring up environments. Also this
> > > > may come handy when/if GRUB-in-the-userspace sees light of the day.
> > > > 
> > > > This separates VOF and sPAPR in a hope that VOF bits may be reused by
> > > > other POWERPC boards which do not support pSeries.
> > > > 
> > > > This is coded in assumption that later on we might be adding support for
> > > > booting from QEMU backends (blockdev is the first candidate) without
> > > > devices/drivers in between as OF1275 does not require that and
> > > > it is quite easy to so.
> > > > 
> > > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > > > ---
> > > > 
> > > > The example command line is:
> > > > 
> > > > /home/aik/pbuild/qemu-killslof-localhost-ppc64/qemu-system-ppc64 \
> > > > -nodefaults \
> > > > -chardev stdio,id=STDIO0,signal=off,mux=on \
> > > > -device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
> > > > -mon id=MON0,chardev=STDIO0,mode=readline \
> > > > -nographic \
> > > > -vga none \
> > > > -enable-kvm \
> > > > -m 8G \
> > > > -machine pseries,x-vof=on,cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off \
> > > > -kernel pbuild/kernel-le-guest/vmlinux \
> > > > -initrd pb/rootfs.cpio.xz \
> > > > -drive id=DRIVE0,if=none,file=./p/qemu-killslof/pc-bios/vof-nvram.bin,format=raw \
> > > > -global spapr-nvram.drive=DRIVE0 \
> > > > -snapshot \
> > > > -smp 8,threads=8 \
> > > > -L /home/aik/t/qemu-ppc64-bios/ \
> > > > -trace events=qemu_trace_events \
> > > > -d guest_errors \
> > > > -chardev socket,id=SOCKET0,server,nowait,path=qemu.mon.tmux26 \
> > > > -mon chardev=SOCKET0,mode=control
> > > > 
> > > > ---
> > > > Changes:
> > > > v20:
> > > > * compile vof.bin with -mcpu=power4 for better compatibility
> > > > * s/std/stw/ in entry.S to make it work on ppc32
> > > > * fixed dt_available property to support both 32 and 64bit
> > > > * shuffled prom_args handling code
> > > > * do not enforce 32bit in MSR (again, to support 32bit platforms)
> > > > 
> > > 
> > > [...]
> > > 
> > > > diff --git a/default-configs/devices/ppc64-softmmu.mak b/default-configs/devices/ppc64-softmmu.mak
> > > > index ae0841fa3a18..9fb201dfacfa 100644
> > > > --- a/default-configs/devices/ppc64-softmmu.mak
> > > > +++ b/default-configs/devices/ppc64-softmmu.mak
> > > > @@ -9,3 +9,4 @@ CONFIG_POWERNV=y
> > > >  # For pSeries
> > > >  CONFIG_PSERIES=y
> > > >  CONFIG_NVDIMM=y
> > > > +CONFIG_VOF=y
> > > > diff --git a/hw/ppc/Kconfig b/hw/ppc/Kconfig
> > > > index e51e0e5e5ac6..964510dfc73d 100644
> > > > --- a/hw/ppc/Kconfig
> > > > +++ b/hw/ppc/Kconfig
> > > > @@ -143,3 +143,6 @@ config FW_CFG_PPC
> > > > 
> > > >  config FDT_PPC
> > > >      bool
> > > > +
> > > > +config VOF
> > > > +    bool
> > > 
> > > I think you should just add "select VOF" to config PSERIES section in
> > > Kconfig instead of adding it to default-configs/devices/ppc64-softmmu.mak.
> > > That should do it, it works in my updated pegasos2 patch:
> > 
> > No, we don't want a "select": PSERIES doesn't require VOF while we
> > still support SLOF, and indeed we're quite a ways from being ready to
> > even make VOF the default pseries firmware.
> 
> Shouldn't you then also need to make code in spapr adding x-vof conditional
> on CONFIG_VOF or make sure it cannot be enabled if not compiled in?
> Otherwise it means VOF is always an option so spapr depends on VOF for which
> select is the way to describe that.

Uh, yes, we probably should.
BALATON Zoltan May 24, 2021, 10:55 a.m. UTC | #26
On Mon, 24 May 2021, David Gibson wrote:
> On Sun, May 23, 2021 at 07:09:26PM +0200, BALATON Zoltan wrote:
>> On Sun, 23 May 2021, BALATON Zoltan wrote:
>>> On Sun, 23 May 2021, Alexey Kardashevskiy wrote:
>>>> One thing to note about PCI is that normally I think the client
>>>> expects the firmware to do PCI probing and SLOF does it. But VOF
>>>> does not and Linux scans PCI bus(es) itself. Might be a problem for
>>>> you kernel.
>>>
>>> I'm not sure what info does MorphOS get from the device tree and what it
>>> probes itself but I think it may at least need device ids and info about
>>> the PCI bus to be able to access the config regs, after that it should
>>> set the devices up hopefully. I could add these from the board code to
>>> device tree so VOF does not need to do anything about it. However I'm
>>> not getting to that point yet because it crashes on something that it's
>>> missing and couldn't yet find out what is that.
>>>
>>> I'd like to get Linux working now as that would be enough to test this
>>> and then if for MorphOS we still need a ROM it's not a problem if at
>>> least we can boot Linux without the original firmware. But I can't make
>>> Linux open a serial console and I don't know what it needs for that. Do
>>> you happen to know? I've looked at the sources in Linux/arch/powerpc but
>>> not sure how it would find and open a serial port on pegasos2. It seems
>>> to work with the board firmware and now I can get it to boot with VOF
>>> but then it does not open serial so it probably needs something in the
>>> device tree or expects the firmware to set something up that we should
>>> add in pegasos2.c when using VOF.
>>
>> I've now found that Linux uses rtas methods read-pci-config and
>> write-pci-config for PCI access on pegasos2 so this means that we'll
>> probably need rtas too (I hoped we could get away without it if it were only
>> used for shutdown/reboot or so but seems Linux needs it for PCI as well and
>> does not scan the bus and won't find some devices without it).
>
> Yes, definitely sounds like you'll need an RTAS implementation.

I plan to fix that after managed to get serial working as that seems to 
not need it. If I delete the rtas-size property from /rtas on the original 
firmware that makes Linux skip instantiating rtas, but I still get serial 
output just not accessing PCI devices. So I think it should work and keeps 
things simpler at first. Then I'll try rtas later.

>> While VOF can do rtas, this causes a problem with the hypercall method using
>> sc 1 that goes through vhyp but trips the assert in ppc_store_sdr1() so
>> cannot work after guest is past quiesce.
>
>> So the question is why is that
>> assert there
>
> Ah.. right.  So, vhyp was designed for the PAPR use case, where we
> want to model the CPU when it's in supervisor and user mode, but not
> when it's in hypervisor mode.  We want qemu to mimic the behaviour of
> the hypervisor, rather than attempting to actually execute hypervisor
> code in the virtual CPU.
>
> On systems that have a hypervisor mode, SDR1 is hypervisor privileged,
> so it makes no sense for the guest to attempt to set it.  That should
> be caught by the general SPR code and turned into a 0x700, hence the
> assert() if we somehow reach ppc_store_sdr1().
>
> So, we are seeing a problem here because you want the 'sc 1'
> interception of vhyp, but not the rest of the stuff that goes with it.
>
>> and would using sc 1 for hypercalls on pegasos2 cause other
>> problems later even if the assert could be removed?
>
> At least in the short term, I think you probably can remove the
> assert.  In your case the 'sc 1' calls aren't truly to a hypervisor,
> but a special case escape to qemu for the firmware emulation.  I think
> it's unlikely to cause problems later, because nothing on a 32-bit
> system should be attempting an 'sc 1'.  The only thing I can think of
> that would fail is some test case which explicitly verified that 'sc
> 1' triggered a 0x700 (SIGILL from userspace).

OK so the assert should check if the CPU has an HV bit. I think there was 
a #detine for that somewhere that I can add to the assert then I can try 
that. What I wasn't sure about is that sc 1 would conflict with the 
guest's usage of normal sc calls or are these going through different 
paths and only sc 1 will trigger vhyp callback not affecting notmal sc 
calls? (Or if this causes an otherwise unnecessary VM exit on KVM even 
when it works then maybe looking for a different way in the future might 
be needed. But for now if this works with modifying the assert to allow 
this on ppc32 then I go for that as that's the simplest way for now.)

>> Can somebody who knows
>> more about it explain this please? If this cannot be resolved then we may
>> need a different hypercall method on pegasos2 (I've considered MOL OSI or
>> are there other options? I may use some advice from people who know it
>> better, especially the possible interaction with KVM later as the long term
>> goal with pegasos2 is to be able to run with KVM on PPC hardware
>> eventually.)
>
> Right, you might need an alternative method eventually.  Really any
> illegal instruction for your cpu is a possible candidate.  Bear in
> mind that this is *not* truly a hypercall interface, instead it's
> something we're special casing for the purposes of faking the
> firmware.
>
> The "attn" instruction used on BookE might be a reasonable candidate
> (assuming it doesn't conflict with something on 32-bit BookS) - that's
> often used for things like signalling the attention of hardware
> debuggers, and this is somewhat akin.
>
> Mostly it's just a matter of working out what would be least messy to
> intercept in the TCG instruction decoding path.

I'll wait for the current ongoing reorganisations to settle for that. If 
an alternative is needed I was considering the interface used by Mac on 
Linux:

https://lists.nongnu.org/archive/html/qemu-ppc/2021-03/msg00047.html

becuase there are some paravirtual drivers I think that use these on Mac 
OS X so this might also be useful for that use case for Mac emulation. But 
that seems very similar just checking for magic values at a normal syscall 
which means all syscalls will be intercepted anyway. In that case if sc 1 
does not interfere with normal sc instructions then it may be better to 
keep that as the invalid instruction we trap on.

>> But this also means that if that assert cannot be dropped or
>> there may be other problems with sc 1 hypercalls then we maybe cannot have
>> the same vof.bin and we'll need a separate version that I would like to
>> avoid if possible so if there's a simple way to keep it working or make
>> vof.bin use alternate hypercall method without needing a separate binary
>> that would be the direction I'd tend to go. Even if we need a seoarate
>> version I'd like to keep as much common as possible.
>>
>> I've tested that the missing rtas is not the reason for getting no output
>> via serial though, as even when disabling rtas on pegasos2.rom it boots and
>> I still get serial output just some PCI devices are not detected (such as
>> USB, the video card and the not emulated ethernet port but these are not
>> fatal so it might even work as a first try without rtas, just to boot a
>> Linux kernel for testing it would be enough if I can fix the serial output).
>> I still don't know why it's not finding serial but I think it may be some
>> missing or wrong info in the device tree I generat. I'll try to focus on
>> this for now and leave the above rtas question for later.
>
> Oh.. another thought on that.  You have an ISA serial port on Pegasos,
> I believe.  I wonder if the PCI->ISA bridge needs some configuration /
> initialization that the firmware is expected to do.  If so you'll need
> to mimic that setup in qemu for the VOF case.

That's what I begin to think because I've added everything to the device 
tree that I thought could be needed and I still don't get it working so it 
may need some config from the firmware. But how do I access device 
registers from board code? I've tried adding a machine reset method and 
write to memory mapped device registers but all my attempts failed. I've 
tried cpu_stl_le_data and even memory_region_dispatch_write but these did 
not get to the device. What's the way to access guest mmio regs from QEMU?

Regards,
BALATON Zoltan
BALATON Zoltan May 24, 2021, 11:56 a.m. UTC | #27
On Mon, 24 May 2021, David Gibson wrote:
> On Mon, May 24, 2021 at 02:26:42PM +1000, Alexey Kardashevskiy wrote:
>> On 5/23/21 21:24, BALATON Zoltan wrote:
>>> On Sun, 23 May 2021, Alexey Kardashevskiy wrote:
>>>> On 23/05/2021 01:02, BALATON Zoltan wrote:
>>>>> On Sat, 22 May 2021, BALATON Zoltan wrote:
>>>>>> On Sat, 22 May 2021, Alexey Kardashevskiy wrote:
>>>>>>> VOF itself does not prints anything in this patch.
>>>>>>
>>>>>> However it seems to be needed for linux as the first thing
>>>>>> it does seems to be getting /chosen/stdout and calls exit if
>>>>>> it returns nothing. So I'll need this at least for linux. (I
>>>>>> think MorphOS may also query it to print a banner or some
>>>>>> messages but not sure it needs it, at least it does not
>>>>>> abort right away if not found.)
>>>>>>
>>>>>>>> but to see Linux output do I need a stdout in VOF or
>>>>>>>> it will just open the serial with its own driver and
>>>>>>>> use that?
>>>>>>>> So I'm not sure what's the stdout parts in the
>>>>>>>> current vof patch does and if I need that for
>>>>>>>> anything. I'll try to experiment with it some more
>>>>>>>> but fixing the ld and Kconfig seems to be enough to
>>>>>>>> get it work for me.
>>>>>>>
>>>>>>> So for the client to print something, /chosen/stdout
>>>>>>> needs to have a valid ihandle.
>>>>>>> The only way to get a valid ihandle is having a valid
>>>>>>> phandle which vof_client_open() can open.
>>>>>>> A valid phandle is a phandle of any node in the device
>>>>>>> tree. On spapr we pick some spapr-vty, open it and store
>>>>>>> in /chosen/stdout.
>>>>>>>
>>>>>>> From this point output from the client can be seen via a tracepoint.
>>>>>
>>>>> I've got it now. Looking at the original firmware device tree dump:
>>>>>
>>>>> https://osdn.net/projects/qmiga/wiki/SubprojectPegasos2/attach/PegasosII_OFW-Dump.txt
>>>>>
>>>>> I see that /chosen/stdout points to "screen" which is an alias
>>>>> to /bootconsole. Just adding an empty /bootconsole node in the
>>>>> device tree and vof_client_open_store() that as /chosen/stdout
>>>>> works and I get output via vof_write traces so this is enough
>>>>> for now to test Linux. Properly connecting a serial backend can
>>>>> thus be postponed.
>>>>>
>>>>> So with this the Linux kernel does not abort on the first device
>>>>> tree access but starts to decompress itself then the embedded
>>>>> initrd and crashes at calling setprop:
>>>>>
>>>>> [...]
>>>>> vof_client_handle: setprop
>>>>>
>>>>> Thread 4 "qemu-system-ppc" received signal SIGSEGV, Segmentation fault.
>>>>> (gdb) bt
>>>>> #0  0x0000000000000000 in  ()
>>>>> #1  0x0000555555a5c2bf in vof_setprop
>>>>>      (vof=0x7ffff48e9420, vallen=4, valaddr=<optimized out>,
>>>>> pname=<optimized out>, nodeph=8, fdt=0x7fff8aaff010,
>>>>> ms=0x5555564f8800)
>>>>>      at ../hw/ppc/vof.c:308
>>>>> #2  0x0000555555a5c2bf in vof_client_handle
>>>>>      (nrets=1, rets=0x7ffff48e93f0, nargs=4,
>>>>> args=0x7ffff48e93c0, service=0x7ffff48e9460 "setprop",
>>>>>       vof=0x7ffff48e9420, fdt=0x7fff8aaff010, ms=0x5555564f8800)
>>>>> at ../hw/ppc/vof.c:842
>>>>> #3  0x0000555555a5c2bf in vof_client_call
>>>>>      (ms=0x5555564f8800, vof=vof@entry=0x55555662a3d0,
>>>>> fdt=fdt@entry=0x7fff8aaff010,
>>>>> args_real=args_real@entry=23580472)
>>>>>      at ../hw/ppc/vof.c:935
>>>>>
>>>>> loooks like it's trying to set /chosen/linux,initrd-start:
>>>>
>>>> It is not horribly clear why it crashed though.
>>>
>>> It crashed becuase I had TYPE_VOF_MACHINE_IF but did not set a setprop
>>> callback and it tried to call that here. Adding a {return true;} empty
>>> callback avoids this.
>>
>>
>> Ah ok.
>>
>>>
>>>>> (gdb) up
>>>>> #1  0x0000555555a5c2bf in vof_setprop (vof=0x7ffff48e9420,
>>>>> vallen=4, valaddr=<optimized out>, pname=<optimized out>,
>>>>> nodeph=8,
>>>>>      fdt=0x7fff8aaff010, ms=0x5555564f8800) at ../hw/ppc/vof.c:308
>>>>> 308            if (!vmc->setprop(ms, nodepath, propname, val, vallen)) {
>>>>> (gdb) p nodepath
>>>>> $1 = "/chosen\000\060/rPC,750CXE/", '\000' <repeats 234 times>
>>>>> (gdb) p propname
>>>>> $2 = "linux,initrd-start\000linux,initrd-end\000linux,cmdline-timeout\000bootarg"
>>>>> (gdb) p val
>>>>> $3 = <optimized out>
>>>>>
>>>>> I think I need the callback for setprop in TYPE_VOF_MACHINE_IF.
>>>>> I can copy spapr_vof_setprop() but some explanation on why
>>>>> that's needed might help. Ciould I just do fdt_setprop in my
>>>>> callback as vof_setprop() would do without a machine callback or
>>>>> is there some special handling needed for these properties?
>>>>
>>>> The short answer is yes, you do not need TYPE_VOF_MACHINE_IF.
>>>>
>>>> The long answer is that we build the FDT on spapr twice:
>>>> 1. at the reset time and
>>>> 2. after "ibm,client-arhitecture-support" (early in the boot the
>>>> spapr paravirtual client says what it supports - ISA level, MMU
>>>> features, etc)
>>>>
>>>> Between 1 and 2 the kernel moves initrd and we do not update the
>>>> QEMU's version of its location, the tree at 2) will have the old
>>>> values.
>>>>
>>>> So for that reason I have TYPE_VOF_MACHINE_IF. You most definitely
>>>> do not need it.
>>>
>>> I need TYPE_VOF_MACHINE_IF because that has the quiesce callback that I
>>> need to shut VOF down when the guest is finished with it otherwise it
>>> would crash later (more on this in next message).
>>
>> Nah, quiesce() only means stopping IO in VOF. VOF is shut down when the
>> client decides to stop using it (and zero that memory).
>>
>>> But since I shut down VOF here I don't need to remember changes to the
>>> FDT so I can just use an empty setprop callback. (I wouldn't even need
>>> that if VOF would check that a callback is non-NULL before calling it.)
>>
>> I'll add the check.
>>
>> I'll need some time to go though the other mails, closer to the weekend,
>> there are too many gaps in my knowledge about those 32bit systems.
>>
>> I am really not sure that you need TYPE_PPC_VIRTUAL_HYPERVISOR (is this just
>> to make "sc 1" work? there should be a better way) or RTAS (although it
>> looks like you need it for PCI, you likely do not need it for your serial
>> device which is ISA which I have no idea how it works). Do you have an
>> actual machine? Can you dump its device tree to see what yours is missing?
>
> IIUC, it's basicaly so that the 'sc 1' instructions can be routed
> through to VOF.  'sc 1' is an illegal instruction on ppc32, AFAIK, so
> we need some sort of hack here.

Yes correct, I'm just using vhyp as that was already there and is the 
simplest way to get it working without any other changes needed to 
target/ppc or vof (apart from small changes to vof to make it work on 
ppc32 and correctly handle ELF with entry != load address that the 
pegasos2 kernel happens to do which are probably bugs in vof anyway so 
could be fixed).

> vhyp wasn't really designed for this, but I suspect it is the simplest
> way to intercept those 'sc 1' calls.
>
> Unfortunately, shutting it down presents a real problem.  Currently
> you're relying on quiesce being the last call to OF the client makes.
> That's often the case in practice, but not necessarily in all cases,
> as you've seen.  However, there's no alternative point at which we can
> determine that we're done with the client interface.
>
> My inclination for now would be to just leave the vhyp handler in
> place.  Strictly speaking it won't give you correct behaviour: later
> calls to 'sc 1' will invoke VOF rather than giving a 0x700 exception.
> But nothing on a 32-bit system should be attempting 'sc 1' anyway, so
> I think it will probably work in practice.

I agree with that if it works. Until we find a reason to replace it I 
think this is the simplest way and so far I could get it mostly working. 
I'll keep trying.

Regards,
BALATON Zoltan
BALATON Zoltan May 24, 2021, 12:42 p.m. UTC | #28
On Mon, 24 May 2021, David Gibson wrote:
> On Sun, May 23, 2021 at 07:09:26PM +0200, BALATON Zoltan wrote:
>> On Sun, 23 May 2021, BALATON Zoltan wrote:
>>> On Sun, 23 May 2021, Alexey Kardashevskiy wrote:
>>>> One thing to note about PCI is that normally I think the client
>>>> expects the firmware to do PCI probing and SLOF does it. But VOF
>>>> does not and Linux scans PCI bus(es) itself. Might be a problem for
>>>> you kernel.
>>>
>>> I'm not sure what info does MorphOS get from the device tree and what it
>>> probes itself but I think it may at least need device ids and info about
>>> the PCI bus to be able to access the config regs, after that it should
>>> set the devices up hopefully. I could add these from the board code to
>>> device tree so VOF does not need to do anything about it. However I'm
>>> not getting to that point yet because it crashes on something that it's
>>> missing and couldn't yet find out what is that.
>>>
>>> I'd like to get Linux working now as that would be enough to test this
>>> and then if for MorphOS we still need a ROM it's not a problem if at
>>> least we can boot Linux without the original firmware. But I can't make
>>> Linux open a serial console and I don't know what it needs for that. Do
>>> you happen to know? I've looked at the sources in Linux/arch/powerpc but
>>> not sure how it would find and open a serial port on pegasos2. It seems
>>> to work with the board firmware and now I can get it to boot with VOF
>>> but then it does not open serial so it probably needs something in the
>>> device tree or expects the firmware to set something up that we should
>>> add in pegasos2.c when using VOF.
>>
>> I've now found that Linux uses rtas methods read-pci-config and
>> write-pci-config for PCI access on pegasos2 so this means that we'll
>> probably need rtas too (I hoped we could get away without it if it were only
>> used for shutdown/reboot or so but seems Linux needs it for PCI as well and
>> does not scan the bus and won't find some devices without it).
>
> Yes, definitely sounds like you'll need an RTAS implementation.
>
>> While VOF can do rtas, this causes a problem with the hypercall method using
>> sc 1 that goes through vhyp but trips the assert in ppc_store_sdr1() so
>> cannot work after guest is past quiesce.
>
>> So the question is why is that
>> assert there
>
> Ah.. right.  So, vhyp was designed for the PAPR use case, where we
> want to model the CPU when it's in supervisor and user mode, but not
> when it's in hypervisor mode.  We want qemu to mimic the behaviour of
> the hypervisor, rather than attempting to actually execute hypervisor
> code in the virtual CPU.
>
> On systems that have a hypervisor mode, SDR1 is hypervisor privileged,
> so it makes no sense for the guest to attempt to set it.  That should
> be caught by the general SPR code and turned into a 0x700, hence the
> assert() if we somehow reach ppc_store_sdr1().

This seems to work to avoid my problem so I can leave vhyp enabled after 
qiuesce for now:

diff --git a/target/ppc/cpu.c b/target/ppc/cpu.c
index d957d1a687..13b87b9b36 100644
--- a/target/ppc/cpu.c
+++ b/target/ppc/cpu.c
@@ -70,7 +70,7 @@ void ppc_store_sdr1(CPUPPCState *env, target_ulong value)
  {
      PowerPCCPU *cpu = env_archcpu(env);
      qemu_log_mask(CPU_LOG_MMU, "%s: " TARGET_FMT_lx "\n", __func__, value);
-    assert(!cpu->vhyp);
+    assert(!cpu->env.has_hv_mode || !cpu->vhyp);
  #if defined(TARGET_PPC64)
      if (mmu_is_64bit(env->mmu_model)) {
          target_ulong sdr_mask = SDR_64_HTABORG | SDR_64_HTABSIZE;

But I wonder if the assert should also be moved within the TARGET_PPC64 
block and if we may need to generate some exception here instead. Not sure 
what a real CPU would do in this case but if accessing sdr1 is privileged 
in HV mode then there should be an exception or if that's catched 
elsewhere then this assert may not be needed at all. I can make a patch if 
you tell me what should it do.

Regards,
BALATON Zoltan
Alexey Kardashevskiy May 24, 2021, 12:46 p.m. UTC | #29
On 24/05/2021 20:55, BALATON Zoltan wrote:
> On Mon, 24 May 2021, David Gibson wrote:
>> On Sun, May 23, 2021 at 07:09:26PM +0200, BALATON Zoltan wrote:
>>> On Sun, 23 May 2021, BALATON Zoltan wrote:
>>>> On Sun, 23 May 2021, Alexey Kardashevskiy wrote:
>>>>> One thing to note about PCI is that normally I think the client
>>>>> expects the firmware to do PCI probing and SLOF does it. But VOF
>>>>> does not and Linux scans PCI bus(es) itself. Might be a problem for
>>>>> you kernel.
>>>>
>>>> I'm not sure what info does MorphOS get from the device tree and 
>>>> what it
>>>> probes itself but I think it may at least need device ids and info 
>>>> about
>>>> the PCI bus to be able to access the config regs, after that it should
>>>> set the devices up hopefully. I could add these from the board code to
>>>> device tree so VOF does not need to do anything about it. However I'm
>>>> not getting to that point yet because it crashes on something that it's
>>>> missing and couldn't yet find out what is that.
>>>>
>>>> I'd like to get Linux working now as that would be enough to test this
>>>> and then if for MorphOS we still need a ROM it's not a problem if at
>>>> least we can boot Linux without the original firmware. But I can't make
>>>> Linux open a serial console and I don't know what it needs for that. Do
>>>> you happen to know? I've looked at the sources in Linux/arch/powerpc 
>>>> but
>>>> not sure how it would find and open a serial port on pegasos2. It seems
>>>> to work with the board firmware and now I can get it to boot with VOF
>>>> but then it does not open serial so it probably needs something in the
>>>> device tree or expects the firmware to set something up that we should
>>>> add in pegasos2.c when using VOF.
>>>
>>> I've now found that Linux uses rtas methods read-pci-config and
>>> write-pci-config for PCI access on pegasos2 so this means that we'll
>>> probably need rtas too (I hoped we could get away without it if it 
>>> were only
>>> used for shutdown/reboot or so but seems Linux needs it for PCI as 
>>> well and
>>> does not scan the bus and won't find some devices without it).
>>
>> Yes, definitely sounds like you'll need an RTAS implementation.
> 
> I plan to fix that after managed to get serial working as that seems to 
> not need it. If I delete the rtas-size property from /rtas on the 
> original firmware that makes Linux skip instantiating rtas, but I still 
> get serial output just not accessing PCI devices. So I think it should 
> work and keeps things simpler at first. Then I'll try rtas later.
> 
>>> While VOF can do rtas, this causes a problem with the hypercall 
>>> method using
>>> sc 1 that goes through vhyp but trips the assert in ppc_store_sdr1() so
>>> cannot work after guest is past quiesce.
>>
>>> So the question is why is that
>>> assert there
>>
>> Ah.. right.  So, vhyp was designed for the PAPR use case, where we
>> want to model the CPU when it's in supervisor and user mode, but not
>> when it's in hypervisor mode.  We want qemu to mimic the behaviour of
>> the hypervisor, rather than attempting to actually execute hypervisor
>> code in the virtual CPU.
>>
>> On systems that have a hypervisor mode, SDR1 is hypervisor privileged,
>> so it makes no sense for the guest to attempt to set it.  That should
>> be caught by the general SPR code and turned into a 0x700, hence the
>> assert() if we somehow reach ppc_store_sdr1().
>>
>> So, we are seeing a problem here because you want the 'sc 1'
>> interception of vhyp, but not the rest of the stuff that goes with it.
>>
>>> and would using sc 1 for hypercalls on pegasos2 cause other
>>> problems later even if the assert could be removed?
>>
>> At least in the short term, I think you probably can remove the
>> assert.  In your case the 'sc 1' calls aren't truly to a hypervisor,
>> but a special case escape to qemu for the firmware emulation.  I think
>> it's unlikely to cause problems later, because nothing on a 32-bit
>> system should be attempting an 'sc 1'.  The only thing I can think of
>> that would fail is some test case which explicitly verified that 'sc
>> 1' triggered a 0x700 (SIGILL from userspace).
> 
> OK so the assert should check if the CPU has an HV bit. I think there 
> was a #detine for that somewhere that I can add to the assert then I can 
> try that. What I wasn't sure about is that sc 1 would conflict with the 
> guest's usage of normal sc calls or are these going through different 
> paths and only sc 1 will trigger vhyp callback not affecting notmal sc 
> calls? (Or if this causes an otherwise unnecessary VM exit on KVM even 
> when it works then maybe looking for a different way in the future might 
> be needed. But for now if this works with modifying the assert to allow 
> this on ppc32 then I go for that as that's the simplest way for now.)
> 
>>> Can somebody who knows
>>> more about it explain this please? If this cannot be resolved then we 
>>> may
>>> need a different hypercall method on pegasos2 (I've considered MOL 
>>> OSI or
>>> are there other options? I may use some advice from people who know it
>>> better, especially the possible interaction with KVM later as the 
>>> long term
>>> goal with pegasos2 is to be able to run with KVM on PPC hardware
>>> eventually.)
>>
>> Right, you might need an alternative method eventually.  Really any
>> illegal instruction for your cpu is a possible candidate.  Bear in
>> mind that this is *not* truly a hypercall interface, instead it's
>> something we're special casing for the purposes of faking the
>> firmware.
>>
>> The "attn" instruction used on BookE might be a reasonable candidate
>> (assuming it doesn't conflict with something on 32-bit BookS) - that's
>> often used for things like signalling the attention of hardware
>> debuggers, and this is somewhat akin.
>>
>> Mostly it's just a matter of working out what would be least messy to
>> intercept in the TCG instruction decoding path.
> 
> I'll wait for the current ongoing reorganisations to settle for that. If 
> an alternative is needed I was considering the interface used by Mac on 
> Linux:
> 
> https://lists.nongnu.org/archive/html/qemu-ppc/2021-03/msg00047.html
> 
> becuase there are some paravirtual drivers I think that use these on Mac 
> OS X so this might also be useful for that use case for Mac emulation. 
> But that seems very similar just checking for magic values at a normal 
> syscall which means all syscalls will be intercepted anyway. In that 
> case if sc 1 does not interfere with normal sc instructions then it may 
> be better to keep that as the invalid instruction we trap on.
> 
>>> But this also means that if that assert cannot be dropped or
>>> there may be other problems with sc 1 hypercalls then we maybe cannot 
>>> have
>>> the same vof.bin and we'll need a separate version that I would like to
>>> avoid if possible so if there's a simple way to keep it working or make
>>> vof.bin use alternate hypercall method without needing a separate binary
>>> that would be the direction I'd tend to go. Even if we need a seoarate
>>> version I'd like to keep as much common as possible.
>>>
>>> I've tested that the missing rtas is not the reason for getting no 
>>> output
>>> via serial though, as even when disabling rtas on pegasos2.rom it 
>>> boots and
>>> I still get serial output just some PCI devices are not detected 
>>> (such as
>>> USB, the video card and the not emulated ethernet port but these are not
>>> fatal so it might even work as a first try without rtas, just to boot a
>>> Linux kernel for testing it would be enough if I can fix the serial 
>>> output).
>>> I still don't know why it's not finding serial but I think it may be 
>>> some
>>> missing or wrong info in the device tree I generat. I'll try to focus on
>>> this for now and leave the above rtas question for later.
>>
>> Oh.. another thought on that.  You have an ISA serial port on Pegasos,
>> I believe.  I wonder if the PCI->ISA bridge needs some configuration /
>> initialization that the firmware is expected to do.  If so you'll need
>> to mimic that setup in qemu for the VOF case.
> 
> That's what I begin to think because I've added everything to the device 
> tree that I thought could be needed and I still don't get it working so 
> it may need some config from the firmware. But how do I access device 
> registers from board code? I've tried adding a machine reset method and 
> write to memory mapped device registers but all my attempts failed. I've 
> tried cpu_stl_le_data and even memory_region_dispatch_write but these 
> did not get to the device. What's the way to access guest mmio regs from 
> QEMU?

If we know that that serial is sitting behind PCI->ISA bridge (is it?), 
I think you need to assign a BAR to that bridge, do some ISA setup (no 
idea which) and enable that bridge (write MEMORY to PCI_COMMAND), this 
should enable its registers.

In pseries we add "linux,pci-probe-only"=0 which makes Linux do all the 
above instead of relying on the firmware doing BAR assignment.
BALATON Zoltan May 24, 2021, 10:34 p.m. UTC | #30
On Mon, 24 May 2021, Alexey Kardashevskiy wrote:
> On 24/05/2021 20:55, BALATON Zoltan wrote:
>> On Mon, 24 May 2021, David Gibson wrote:
>>> On Sun, May 23, 2021 at 07:09:26PM +0200, BALATON Zoltan wrote:
>>>> On Sun, 23 May 2021, BALATON Zoltan wrote:
>>>>> On Sun, 23 May 2021, Alexey Kardashevskiy wrote:
>>>>>> One thing to note about PCI is that normally I think the client
>>>>>> expects the firmware to do PCI probing and SLOF does it. But VOF
>>>>>> does not and Linux scans PCI bus(es) itself. Might be a problem for
>>>>>> you kernel.
>>>>> 
>>>>> I'm not sure what info does MorphOS get from the device tree and what it
>>>>> probes itself but I think it may at least need device ids and info about
>>>>> the PCI bus to be able to access the config regs, after that it should
>>>>> set the devices up hopefully. I could add these from the board code to
>>>>> device tree so VOF does not need to do anything about it. However I'm
>>>>> not getting to that point yet because it crashes on something that it's
>>>>> missing and couldn't yet find out what is that.
>>>>> 
>>>>> I'd like to get Linux working now as that would be enough to test this
>>>>> and then if for MorphOS we still need a ROM it's not a problem if at
>>>>> least we can boot Linux without the original firmware. But I can't make
>>>>> Linux open a serial console and I don't know what it needs for that. Do
>>>>> you happen to know? I've looked at the sources in Linux/arch/powerpc but
>>>>> not sure how it would find and open a serial port on pegasos2. It seems
>>>>> to work with the board firmware and now I can get it to boot with VOF
>>>>> but then it does not open serial so it probably needs something in the
>>>>> device tree or expects the firmware to set something up that we should
>>>>> add in pegasos2.c when using VOF.
>>>> 
>>>> I've now found that Linux uses rtas methods read-pci-config and
>>>> write-pci-config for PCI access on pegasos2 so this means that we'll
>>>> probably need rtas too (I hoped we could get away without it if it were 
>>>> only
>>>> used for shutdown/reboot or so but seems Linux needs it for PCI as well 
>>>> and
>>>> does not scan the bus and won't find some devices without it).
>>> 
>>> Yes, definitely sounds like you'll need an RTAS implementation.
>> 
>> I plan to fix that after managed to get serial working as that seems to not 
>> need it. If I delete the rtas-size property from /rtas on the original 
>> firmware that makes Linux skip instantiating rtas, but I still get serial 
>> output just not accessing PCI devices. So I think it should work and keeps 
>> things simpler at first. Then I'll try rtas later.
>> 
>>>> While VOF can do rtas, this causes a problem with the hypercall method 
>>>> using
>>>> sc 1 that goes through vhyp but trips the assert in ppc_store_sdr1() so
>>>> cannot work after guest is past quiesce.
>>> 
>>>> So the question is why is that
>>>> assert there
>>> 
>>> Ah.. right.  So, vhyp was designed for the PAPR use case, where we
>>> want to model the CPU when it's in supervisor and user mode, but not
>>> when it's in hypervisor mode.  We want qemu to mimic the behaviour of
>>> the hypervisor, rather than attempting to actually execute hypervisor
>>> code in the virtual CPU.
>>> 
>>> On systems that have a hypervisor mode, SDR1 is hypervisor privileged,
>>> so it makes no sense for the guest to attempt to set it.  That should
>>> be caught by the general SPR code and turned into a 0x700, hence the
>>> assert() if we somehow reach ppc_store_sdr1().
>>> 
>>> So, we are seeing a problem here because you want the 'sc 1'
>>> interception of vhyp, but not the rest of the stuff that goes with it.
>>> 
>>>> and would using sc 1 for hypercalls on pegasos2 cause other
>>>> problems later even if the assert could be removed?
>>> 
>>> At least in the short term, I think you probably can remove the
>>> assert.  In your case the 'sc 1' calls aren't truly to a hypervisor,
>>> but a special case escape to qemu for the firmware emulation.  I think
>>> it's unlikely to cause problems later, because nothing on a 32-bit
>>> system should be attempting an 'sc 1'.  The only thing I can think of
>>> that would fail is some test case which explicitly verified that 'sc
>>> 1' triggered a 0x700 (SIGILL from userspace).
>> 
>> OK so the assert should check if the CPU has an HV bit. I think there was a 
>> #detine for that somewhere that I can add to the assert then I can try 
>> that. What I wasn't sure about is that sc 1 would conflict with the guest's 
>> usage of normal sc calls or are these going through different paths and 
>> only sc 1 will trigger vhyp callback not affecting notmal sc calls? (Or if 
>> this causes an otherwise unnecessary VM exit on KVM even when it works then 
>> maybe looking for a different way in the future might be needed. But for 
>> now if this works with modifying the assert to allow this on ppc32 then I 
>> go for that as that's the simplest way for now.)
>> 
>>>> Can somebody who knows
>>>> more about it explain this please? If this cannot be resolved then we may
>>>> need a different hypercall method on pegasos2 (I've considered MOL OSI or
>>>> are there other options? I may use some advice from people who know it
>>>> better, especially the possible interaction with KVM later as the long 
>>>> term
>>>> goal with pegasos2 is to be able to run with KVM on PPC hardware
>>>> eventually.)
>>> 
>>> Right, you might need an alternative method eventually.  Really any
>>> illegal instruction for your cpu is a possible candidate.  Bear in
>>> mind that this is *not* truly a hypercall interface, instead it's
>>> something we're special casing for the purposes of faking the
>>> firmware.
>>> 
>>> The "attn" instruction used on BookE might be a reasonable candidate
>>> (assuming it doesn't conflict with something on 32-bit BookS) - that's
>>> often used for things like signalling the attention of hardware
>>> debuggers, and this is somewhat akin.
>>> 
>>> Mostly it's just a matter of working out what would be least messy to
>>> intercept in the TCG instruction decoding path.
>> 
>> I'll wait for the current ongoing reorganisations to settle for that. If an 
>> alternative is needed I was considering the interface used by Mac on Linux:
>> 
>> https://lists.nongnu.org/archive/html/qemu-ppc/2021-03/msg00047.html
>> 
>> becuase there are some paravirtual drivers I think that use these on Mac OS 
>> X so this might also be useful for that use case for Mac emulation. But 
>> that seems very similar just checking for magic values at a normal syscall 
>> which means all syscalls will be intercepted anyway. In that case if sc 1 
>> does not interfere with normal sc instructions then it may be better to 
>> keep that as the invalid instruction we trap on.
>> 
>>>> But this also means that if that assert cannot be dropped or
>>>> there may be other problems with sc 1 hypercalls then we maybe cannot 
>>>> have
>>>> the same vof.bin and we'll need a separate version that I would like to
>>>> avoid if possible so if there's a simple way to keep it working or make
>>>> vof.bin use alternate hypercall method without needing a separate binary
>>>> that would be the direction I'd tend to go. Even if we need a seoarate
>>>> version I'd like to keep as much common as possible.
>>>> 
>>>> I've tested that the missing rtas is not the reason for getting no output
>>>> via serial though, as even when disabling rtas on pegasos2.rom it boots 
>>>> and
>>>> I still get serial output just some PCI devices are not detected (such as
>>>> USB, the video card and the not emulated ethernet port but these are not
>>>> fatal so it might even work as a first try without rtas, just to boot a
>>>> Linux kernel for testing it would be enough if I can fix the serial 
>>>> output).
>>>> I still don't know why it's not finding serial but I think it may be some
>>>> missing or wrong info in the device tree I generat. I'll try to focus on
>>>> this for now and leave the above rtas question for later.
>>> 
>>> Oh.. another thought on that.  You have an ISA serial port on Pegasos,
>>> I believe.  I wonder if the PCI->ISA bridge needs some configuration /
>>> initialization that the firmware is expected to do.  If so you'll need
>>> to mimic that setup in qemu for the VOF case.
>> 
>> That's what I begin to think because I've added everything to the device 
>> tree that I thought could be needed and I still don't get it working so it 
>> may need some config from the firmware. But how do I access device 
>> registers from board code? I've tried adding a machine reset method and 
>> write to memory mapped device registers but all my attempts failed. I've 
>> tried cpu_stl_le_data and even memory_region_dispatch_write but these did 
>> not get to the device. What's the way to access guest mmio regs from QEMU?
>
> If we know that that serial is sitting behind PCI->ISA bridge (is it?), I 
> think you need to assign a BAR to that bridge, do some ISA setup (no idea 
> which) and enable that bridge (write MEMORY to PCI_COMMAND), this should 
> enable its registers.
>
> In pseries we add "linux,pci-probe-only"=0 which makes Linux do all the above 
> instead of relying on the firmware doing BAR assignment.

Turns out it was not that. Configuring the serial device is not 
implemented in the ISA bridge model because it could not be done cleanly 
(I had a hacky way that was rejected) so the port is always enabled and 
the other defaults seem to work at least for getting serial without 
further config of devices. What was missing is a bus-range property in the 
device tree for /pci nodes that's seemingly unrelated but Linux needs this 
to get past trying the detect PCI even if it can't probe devices without 
rtas and without that it never gets to write anything to serial even if it 
detects it. Also the order in which PCI busses are added to the device 
tree seem to matter regardless of their properties or there's still some 
problem with this bus-range property that I'll need to check again.

But now I can get serial output with Linux under VOF and it boots but will 
need to implement rtas for PCI devices and RTC access. I've started with 
the general rtas callbacks infrastructure but I'll need to implement PCI 
access methods. (MorphOS is still not happy with it, maybe it needs more 
device infos in the device tree but as long as Linux boots with it I don't 
care as those who want MorphOS could use a firmware rom image.)

Regards,
BALATON Zoltan
David Gibson May 25, 2021, 5:23 a.m. UTC | #31
On Mon, May 24, 2021 at 12:55:07PM +0200, BALATON Zoltan wrote:
> On Mon, 24 May 2021, David Gibson wrote:
> > On Sun, May 23, 2021 at 07:09:26PM +0200, BALATON Zoltan wrote:
> > > On Sun, 23 May 2021, BALATON Zoltan wrote:
> > > > On Sun, 23 May 2021, Alexey Kardashevskiy wrote:
> > > > > One thing to note about PCI is that normally I think the client
> > > > > expects the firmware to do PCI probing and SLOF does it. But VOF
> > > > > does not and Linux scans PCI bus(es) itself. Might be a problem for
> > > > > you kernel.
> > > > 
> > > > I'm not sure what info does MorphOS get from the device tree and what it
> > > > probes itself but I think it may at least need device ids and info about
> > > > the PCI bus to be able to access the config regs, after that it should
> > > > set the devices up hopefully. I could add these from the board code to
> > > > device tree so VOF does not need to do anything about it. However I'm
> > > > not getting to that point yet because it crashes on something that it's
> > > > missing and couldn't yet find out what is that.
> > > > 
> > > > I'd like to get Linux working now as that would be enough to test this
> > > > and then if for MorphOS we still need a ROM it's not a problem if at
> > > > least we can boot Linux without the original firmware. But I can't make
> > > > Linux open a serial console and I don't know what it needs for that. Do
> > > > you happen to know? I've looked at the sources in Linux/arch/powerpc but
> > > > not sure how it would find and open a serial port on pegasos2. It seems
> > > > to work with the board firmware and now I can get it to boot with VOF
> > > > but then it does not open serial so it probably needs something in the
> > > > device tree or expects the firmware to set something up that we should
> > > > add in pegasos2.c when using VOF.
> > > 
> > > I've now found that Linux uses rtas methods read-pci-config and
> > > write-pci-config for PCI access on pegasos2 so this means that we'll
> > > probably need rtas too (I hoped we could get away without it if it were only
> > > used for shutdown/reboot or so but seems Linux needs it for PCI as well and
> > > does not scan the bus and won't find some devices without it).
> > 
> > Yes, definitely sounds like you'll need an RTAS implementation.
> 
> I plan to fix that after managed to get serial working as that seems to not
> need it. If I delete the rtas-size property from /rtas on the original
> firmware that makes Linux skip instantiating rtas, but I still get serial
> output just not accessing PCI devices. So I think it should work and keeps
> things simpler at first. Then I'll try rtas later.
> 
> > > While VOF can do rtas, this causes a problem with the hypercall method using
> > > sc 1 that goes through vhyp but trips the assert in ppc_store_sdr1() so
> > > cannot work after guest is past quiesce.
> > 
> > > So the question is why is that
> > > assert there
> > 
> > Ah.. right.  So, vhyp was designed for the PAPR use case, where we
> > want to model the CPU when it's in supervisor and user mode, but not
> > when it's in hypervisor mode.  We want qemu to mimic the behaviour of
> > the hypervisor, rather than attempting to actually execute hypervisor
> > code in the virtual CPU.
> > 
> > On systems that have a hypervisor mode, SDR1 is hypervisor privileged,
> > so it makes no sense for the guest to attempt to set it.  That should
> > be caught by the general SPR code and turned into a 0x700, hence the
> > assert() if we somehow reach ppc_store_sdr1().
> > 
> > So, we are seeing a problem here because you want the 'sc 1'
> > interception of vhyp, but not the rest of the stuff that goes with it.
> > 
> > > and would using sc 1 for hypercalls on pegasos2 cause other
> > > problems later even if the assert could be removed?
> > 
> > At least in the short term, I think you probably can remove the
> > assert.  In your case the 'sc 1' calls aren't truly to a hypervisor,
> > but a special case escape to qemu for the firmware emulation.  I think
> > it's unlikely to cause problems later, because nothing on a 32-bit
> > system should be attempting an 'sc 1'.  The only thing I can think of
> > that would fail is some test case which explicitly verified that 'sc
> > 1' triggered a 0x700 (SIGILL from userspace).
> 
> OK so the assert should check if the CPU has an HV bit. I think there was a
> #detine for that somewhere that I can add to the assert then I can try that.
> What I wasn't sure about is that sc 1 would conflict with the guest's usage
> of normal sc calls or are these going through different paths and only sc 1
> will trigger vhyp callback not affecting notmal sc calls?

The vhyp shouldn't affect normal system calls, 'sc 1' is specifically
for hypercalls, as opposed to normal 'sc' (a.k.a. 'sc 0'), and the
vhyp only intercepts the hypercall version (after all Linux on PAPR
certainly uses its own system calls, and hypercalls are active for the
lifetime of the guest there).

> (Or if this causes
> an otherwise unnecessary VM exit on KVM even when it works then maybe
> looking for a different way in the future might be needed.

What you're doing here won't work with KVM as it stands.  There are
basically two paths into the vhyp hypercall path: 1) from TCG, if we
interpret an 'sc 1' instruction we enter vhyp, 2) from KVM, if we get
a KVM_EXIT_PAPR_HCALL KVM exit then we also go to the vhyp path.

The second path is specific to the PAPR (ppc64) implementation of KVM,
and will not work for a non-PAPR platform without substantial
modification of the KVM code.

> But for now if
> this works with modifying the assert to allow this on ppc32 then I go for
> that as that's the simplest way for now.)
> 
> > > Can somebody who knows
> > > more about it explain this please? If this cannot be resolved then we may
> > > need a different hypercall method on pegasos2 (I've considered MOL OSI or
> > > are there other options? I may use some advice from people who know it
> > > better, especially the possible interaction with KVM later as the long term
> > > goal with pegasos2 is to be able to run with KVM on PPC hardware
> > > eventually.)
> > 
> > Right, you might need an alternative method eventually.  Really any
> > illegal instruction for your cpu is a possible candidate.  Bear in
> > mind that this is *not* truly a hypercall interface, instead it's
> > something we're special casing for the purposes of faking the
> > firmware.
> > 
> > The "attn" instruction used on BookE might be a reasonable candidate
> > (assuming it doesn't conflict with something on 32-bit BookS) - that's
> > often used for things like signalling the attention of hardware
> > debuggers, and this is somewhat akin.
> > 
> > Mostly it's just a matter of working out what would be least messy to
> > intercept in the TCG instruction decoding path.
> 
> I'll wait for the current ongoing reorganisations to settle for that. If an
> alternative is needed I was considering the interface used by Mac on Linux:
> 
> https://lists.nongnu.org/archive/html/qemu-ppc/2021-03/msg00047.html
> 
> becuase there are some paravirtual drivers I think that use these on Mac OS
> X so this might also be useful for that use case for Mac emulation. But that
> seems very similar just checking for magic values at a normal syscall which
> means all syscalls will be intercepted anyway. In that case if sc 1 does not
> interfere with normal sc instructions then it may be better to keep that as
> the invalid instruction we trap on.
> 
> > > But this also means that if that assert cannot be dropped or
> > > there may be other problems with sc 1 hypercalls then we maybe cannot have
> > > the same vof.bin and we'll need a separate version that I would like to
> > > avoid if possible so if there's a simple way to keep it working or make
> > > vof.bin use alternate hypercall method without needing a separate binary
> > > that would be the direction I'd tend to go. Even if we need a seoarate
> > > version I'd like to keep as much common as possible.
> > > 
> > > I've tested that the missing rtas is not the reason for getting no output
> > > via serial though, as even when disabling rtas on pegasos2.rom it boots and
> > > I still get serial output just some PCI devices are not detected (such as
> > > USB, the video card and the not emulated ethernet port but these are not
> > > fatal so it might even work as a first try without rtas, just to boot a
> > > Linux kernel for testing it would be enough if I can fix the serial output).
> > > I still don't know why it's not finding serial but I think it may be some
> > > missing or wrong info in the device tree I generat. I'll try to focus on
> > > this for now and leave the above rtas question for later.
> > 
> > Oh.. another thought on that.  You have an ISA serial port on Pegasos,
> > I believe.  I wonder if the PCI->ISA bridge needs some configuration /
> > initialization that the firmware is expected to do.  If so you'll need
> > to mimic that setup in qemu for the VOF case.
> 
> That's what I begin to think because I've added everything to the device
> tree that I thought could be needed and I still don't get it working so it
> may need some config from the firmware. But how do I access device registers
> from board code? I've tried adding a machine reset method and write to
> memory mapped device registers but all my attempts failed. I've tried
> cpu_stl_le_data and even memory_region_dispatch_write but these did not get
> to the device. What's the way to access guest mmio regs from QEMU?

That's odd, cpu_stl() and memory_region_dispatch_write() should work
from board code (after the relevant memory regions are configured, of
course).  As an ISA serial port, it's probably accessed through IO
space, not memory space though, so you'd need &address_space_io.  And
if there is some bridge configuration then it's the bridge control
registers you need to look at not the serial registers - you'd have to
look at the bridge documentation for that.  Or, I guess the bridge
implementation in qemu, which you wrote part of.
David Gibson May 25, 2021, 5:24 a.m. UTC | #32
On Mon, May 24, 2021 at 10:46:26PM +1000, Alexey Kardashevskiy wrote:
> 
> 
> On 24/05/2021 20:55, BALATON Zoltan wrote:
> > On Mon, 24 May 2021, David Gibson wrote:
> > > On Sun, May 23, 2021 at 07:09:26PM +0200, BALATON Zoltan wrote:
> > > > On Sun, 23 May 2021, BALATON Zoltan wrote:
> > > > > On Sun, 23 May 2021, Alexey Kardashevskiy wrote:
> > > > > > One thing to note about PCI is that normally I think the client
> > > > > > expects the firmware to do PCI probing and SLOF does it. But VOF
> > > > > > does not and Linux scans PCI bus(es) itself. Might be a problem for
> > > > > > you kernel.
> > > > > 
> > > > > I'm not sure what info does MorphOS get from the device tree
> > > > > and what it
> > > > > probes itself but I think it may at least need device ids
> > > > > and info about
> > > > > the PCI bus to be able to access the config regs, after that it should
> > > > > set the devices up hopefully. I could add these from the board code to
> > > > > device tree so VOF does not need to do anything about it. However I'm
> > > > > not getting to that point yet because it crashes on something that it's
> > > > > missing and couldn't yet find out what is that.
> > > > > 
> > > > > I'd like to get Linux working now as that would be enough to test this
> > > > > and then if for MorphOS we still need a ROM it's not a problem if at
> > > > > least we can boot Linux without the original firmware. But I can't make
> > > > > Linux open a serial console and I don't know what it needs for that. Do
> > > > > you happen to know? I've looked at the sources in
> > > > > Linux/arch/powerpc but
> > > > > not sure how it would find and open a serial port on pegasos2. It seems
> > > > > to work with the board firmware and now I can get it to boot with VOF
> > > > > but then it does not open serial so it probably needs something in the
> > > > > device tree or expects the firmware to set something up that we should
> > > > > add in pegasos2.c when using VOF.
> > > > 
> > > > I've now found that Linux uses rtas methods read-pci-config and
> > > > write-pci-config for PCI access on pegasos2 so this means that we'll
> > > > probably need rtas too (I hoped we could get away without it if
> > > > it were only
> > > > used for shutdown/reboot or so but seems Linux needs it for PCI
> > > > as well and
> > > > does not scan the bus and won't find some devices without it).
> > > 
> > > Yes, definitely sounds like you'll need an RTAS implementation.
> > 
> > I plan to fix that after managed to get serial working as that seems to
> > not need it. If I delete the rtas-size property from /rtas on the
> > original firmware that makes Linux skip instantiating rtas, but I still
> > get serial output just not accessing PCI devices. So I think it should
> > work and keeps things simpler at first. Then I'll try rtas later.
> > 
> > > > While VOF can do rtas, this causes a problem with the hypercall
> > > > method using
> > > > sc 1 that goes through vhyp but trips the assert in ppc_store_sdr1() so
> > > > cannot work after guest is past quiesce.
> > > 
> > > > So the question is why is that
> > > > assert there
> > > 
> > > Ah.. right.  So, vhyp was designed for the PAPR use case, where we
> > > want to model the CPU when it's in supervisor and user mode, but not
> > > when it's in hypervisor mode.  We want qemu to mimic the behaviour of
> > > the hypervisor, rather than attempting to actually execute hypervisor
> > > code in the virtual CPU.
> > > 
> > > On systems that have a hypervisor mode, SDR1 is hypervisor privileged,
> > > so it makes no sense for the guest to attempt to set it.  That should
> > > be caught by the general SPR code and turned into a 0x700, hence the
> > > assert() if we somehow reach ppc_store_sdr1().
> > > 
> > > So, we are seeing a problem here because you want the 'sc 1'
> > > interception of vhyp, but not the rest of the stuff that goes with it.
> > > 
> > > > and would using sc 1 for hypercalls on pegasos2 cause other
> > > > problems later even if the assert could be removed?
> > > 
> > > At least in the short term, I think you probably can remove the
> > > assert.  In your case the 'sc 1' calls aren't truly to a hypervisor,
> > > but a special case escape to qemu for the firmware emulation.  I think
> > > it's unlikely to cause problems later, because nothing on a 32-bit
> > > system should be attempting an 'sc 1'.  The only thing I can think of
> > > that would fail is some test case which explicitly verified that 'sc
> > > 1' triggered a 0x700 (SIGILL from userspace).
> > 
> > OK so the assert should check if the CPU has an HV bit. I think there
> > was a #detine for that somewhere that I can add to the assert then I can
> > try that. What I wasn't sure about is that sc 1 would conflict with the
> > guest's usage of normal sc calls or are these going through different
> > paths and only sc 1 will trigger vhyp callback not affecting notmal sc
> > calls? (Or if this causes an otherwise unnecessary VM exit on KVM even
> > when it works then maybe looking for a different way in the future might
> > be needed. But for now if this works with modifying the assert to allow
> > this on ppc32 then I go for that as that's the simplest way for now.)
> > 
> > > > Can somebody who knows
> > > > more about it explain this please? If this cannot be resolved
> > > > then we may
> > > > need a different hypercall method on pegasos2 (I've considered
> > > > MOL OSI or
> > > > are there other options? I may use some advice from people who know it
> > > > better, especially the possible interaction with KVM later as
> > > > the long term
> > > > goal with pegasos2 is to be able to run with KVM on PPC hardware
> > > > eventually.)
> > > 
> > > Right, you might need an alternative method eventually.  Really any
> > > illegal instruction for your cpu is a possible candidate.  Bear in
> > > mind that this is *not* truly a hypercall interface, instead it's
> > > something we're special casing for the purposes of faking the
> > > firmware.
> > > 
> > > The "attn" instruction used on BookE might be a reasonable candidate
> > > (assuming it doesn't conflict with something on 32-bit BookS) - that's
> > > often used for things like signalling the attention of hardware
> > > debuggers, and this is somewhat akin.
> > > 
> > > Mostly it's just a matter of working out what would be least messy to
> > > intercept in the TCG instruction decoding path.
> > 
> > I'll wait for the current ongoing reorganisations to settle for that. If
> > an alternative is needed I was considering the interface used by Mac on
> > Linux:
> > 
> > https://lists.nongnu.org/archive/html/qemu-ppc/2021-03/msg00047.html
> > 
> > becuase there are some paravirtual drivers I think that use these on Mac
> > OS X so this might also be useful for that use case for Mac emulation.
> > But that seems very similar just checking for magic values at a normal
> > syscall which means all syscalls will be intercepted anyway. In that
> > case if sc 1 does not interfere with normal sc instructions then it may
> > be better to keep that as the invalid instruction we trap on.
> > 
> > > > But this also means that if that assert cannot be dropped or
> > > > there may be other problems with sc 1 hypercalls then we maybe
> > > > cannot have
> > > > the same vof.bin and we'll need a separate version that I would like to
> > > > avoid if possible so if there's a simple way to keep it working or make
> > > > vof.bin use alternate hypercall method without needing a separate binary
> > > > that would be the direction I'd tend to go. Even if we need a seoarate
> > > > version I'd like to keep as much common as possible.
> > > > 
> > > > I've tested that the missing rtas is not the reason for getting
> > > > no output
> > > > via serial though, as even when disabling rtas on pegasos2.rom
> > > > it boots and
> > > > I still get serial output just some PCI devices are not detected
> > > > (such as
> > > > USB, the video card and the not emulated ethernet port but these are not
> > > > fatal so it might even work as a first try without rtas, just to boot a
> > > > Linux kernel for testing it would be enough if I can fix the
> > > > serial output).
> > > > I still don't know why it's not finding serial but I think it
> > > > may be some
> > > > missing or wrong info in the device tree I generat. I'll try to focus on
> > > > this for now and leave the above rtas question for later.
> > > 
> > > Oh.. another thought on that.  You have an ISA serial port on Pegasos,
> > > I believe.  I wonder if the PCI->ISA bridge needs some configuration /
> > > initialization that the firmware is expected to do.  If so you'll need
> > > to mimic that setup in qemu for the VOF case.
> > 
> > That's what I begin to think because I've added everything to the device
> > tree that I thought could be needed and I still don't get it working so
> > it may need some config from the firmware. But how do I access device
> > registers from board code? I've tried adding a machine reset method and
> > write to memory mapped device registers but all my attempts failed. I've
> > tried cpu_stl_le_data and even memory_region_dispatch_write but these
> > did not get to the device. What's the way to access guest mmio regs from
> > QEMU?
> 
> If we know that that serial is sitting behind PCI->ISA bridge (is it?), I

Uh.. maybe.  I think ISA bridges at least sometimes behave differently
from regular PCI devices or bridges, because legacy.  Also note that
it's probably IO space you need to map in, not MMIO space.

> think you need to assign a BAR to that bridge, do some ISA setup (no idea
> which) and enable that bridge (write MEMORY to PCI_COMMAND), this should
> enable its registers.
> 
> In pseries we add "linux,pci-probe-only"=0 which makes Linux do all the
> above instead of relying on the firmware doing BAR assignment.
> 
>
David Gibson May 25, 2021, 5:29 a.m. UTC | #33
On Mon, May 24, 2021 at 02:42:30PM +0200, BALATON Zoltan wrote:
> On Mon, 24 May 2021, David Gibson wrote:
> > On Sun, May 23, 2021 at 07:09:26PM +0200, BALATON Zoltan wrote:
> > > On Sun, 23 May 2021, BALATON Zoltan wrote:
> > > > On Sun, 23 May 2021, Alexey Kardashevskiy wrote:
> > > > > One thing to note about PCI is that normally I think the client
> > > > > expects the firmware to do PCI probing and SLOF does it. But VOF
> > > > > does not and Linux scans PCI bus(es) itself. Might be a problem for
> > > > > you kernel.
> > > > 
> > > > I'm not sure what info does MorphOS get from the device tree and what it
> > > > probes itself but I think it may at least need device ids and info about
> > > > the PCI bus to be able to access the config regs, after that it should
> > > > set the devices up hopefully. I could add these from the board code to
> > > > device tree so VOF does not need to do anything about it. However I'm
> > > > not getting to that point yet because it crashes on something that it's
> > > > missing and couldn't yet find out what is that.
> > > > 
> > > > I'd like to get Linux working now as that would be enough to test this
> > > > and then if for MorphOS we still need a ROM it's not a problem if at
> > > > least we can boot Linux without the original firmware. But I can't make
> > > > Linux open a serial console and I don't know what it needs for that. Do
> > > > you happen to know? I've looked at the sources in Linux/arch/powerpc but
> > > > not sure how it would find and open a serial port on pegasos2. It seems
> > > > to work with the board firmware and now I can get it to boot with VOF
> > > > but then it does not open serial so it probably needs something in the
> > > > device tree or expects the firmware to set something up that we should
> > > > add in pegasos2.c when using VOF.
> > > 
> > > I've now found that Linux uses rtas methods read-pci-config and
> > > write-pci-config for PCI access on pegasos2 so this means that we'll
> > > probably need rtas too (I hoped we could get away without it if it were only
> > > used for shutdown/reboot or so but seems Linux needs it for PCI as well and
> > > does not scan the bus and won't find some devices without it).
> > 
> > Yes, definitely sounds like you'll need an RTAS implementation.
> > 
> > > While VOF can do rtas, this causes a problem with the hypercall method using
> > > sc 1 that goes through vhyp but trips the assert in ppc_store_sdr1() so
> > > cannot work after guest is past quiesce.
> > 
> > > So the question is why is that
> > > assert there
> > 
> > Ah.. right.  So, vhyp was designed for the PAPR use case, where we
> > want to model the CPU when it's in supervisor and user mode, but not
> > when it's in hypervisor mode.  We want qemu to mimic the behaviour of
> > the hypervisor, rather than attempting to actually execute hypervisor
> > code in the virtual CPU.
> > 
> > On systems that have a hypervisor mode, SDR1 is hypervisor privileged,
> > so it makes no sense for the guest to attempt to set it.  That should
> > be caught by the general SPR code and turned into a 0x700, hence the
> > assert() if we somehow reach ppc_store_sdr1().
> 
> This seems to work to avoid my problem so I can leave vhyp enabled after
> qiuesce for now:
> 
> diff --git a/target/ppc/cpu.c b/target/ppc/cpu.c
> index d957d1a687..13b87b9b36 100644
> --- a/target/ppc/cpu.c
> +++ b/target/ppc/cpu.c
> @@ -70,7 +70,7 @@ void ppc_store_sdr1(CPUPPCState *env, target_ulong value)
>  {
>      PowerPCCPU *cpu = env_archcpu(env);
>      qemu_log_mask(CPU_LOG_MMU, "%s: " TARGET_FMT_lx "\n", __func__, value);
> -    assert(!cpu->vhyp);
> +    assert(!cpu->env.has_hv_mode || !cpu->vhyp);
>  #if defined(TARGET_PPC64)
>      if (mmu_is_64bit(env->mmu_model)) {
>          target_ulong sdr_mask = SDR_64_HTABORG | SDR_64_HTABSIZE;
> 
> But I wonder if the assert should also be moved within the TARGET_PPC64
> block and if we may need to generate some exception here instead. Not sure
> what a real CPU would do in this case but if accessing sdr1 is privileged in
> HV mode then there should be an exception or if that's catched
> elsewhere

It should be caught elsehwere.  Specifically, when the SDR1 SPR is
registered, on CPUs with a hypervisor mode it should be registered as
hypervisor privileged, so the general mtspr dispatch logic should
generate the exception if it's called from !HV code.  The assert here
is just to sanity check that it has done so before we enter the actual
softmmu code.

> then this assert may not be needed at all. I can make a patch if you tell me
> what should it do.
> 
> Regards,
> BALATON Zoltan
>
BALATON Zoltan May 25, 2021, 9:55 a.m. UTC | #34
On Tue, 25 May 2021, David Gibson wrote:
> On Mon, May 24, 2021 at 02:42:30PM +0200, BALATON Zoltan wrote:
>> On Mon, 24 May 2021, David Gibson wrote:
>>> On Sun, May 23, 2021 at 07:09:26PM +0200, BALATON Zoltan wrote:
>>>> On Sun, 23 May 2021, BALATON Zoltan wrote:
>>>>> On Sun, 23 May 2021, Alexey Kardashevskiy wrote:
>>>>>> One thing to note about PCI is that normally I think the client
>>>>>> expects the firmware to do PCI probing and SLOF does it. But VOF
>>>>>> does not and Linux scans PCI bus(es) itself. Might be a problem for
>>>>>> you kernel.
>>>>>
>>>>> I'm not sure what info does MorphOS get from the device tree and what it
>>>>> probes itself but I think it may at least need device ids and info about
>>>>> the PCI bus to be able to access the config regs, after that it should
>>>>> set the devices up hopefully. I could add these from the board code to
>>>>> device tree so VOF does not need to do anything about it. However I'm
>>>>> not getting to that point yet because it crashes on something that it's
>>>>> missing and couldn't yet find out what is that.
>>>>>
>>>>> I'd like to get Linux working now as that would be enough to test this
>>>>> and then if for MorphOS we still need a ROM it's not a problem if at
>>>>> least we can boot Linux without the original firmware. But I can't make
>>>>> Linux open a serial console and I don't know what it needs for that. Do
>>>>> you happen to know? I've looked at the sources in Linux/arch/powerpc but
>>>>> not sure how it would find and open a serial port on pegasos2. It seems
>>>>> to work with the board firmware and now I can get it to boot with VOF
>>>>> but then it does not open serial so it probably needs something in the
>>>>> device tree or expects the firmware to set something up that we should
>>>>> add in pegasos2.c when using VOF.
>>>>
>>>> I've now found that Linux uses rtas methods read-pci-config and
>>>> write-pci-config for PCI access on pegasos2 so this means that we'll
>>>> probably need rtas too (I hoped we could get away without it if it were only
>>>> used for shutdown/reboot or so but seems Linux needs it for PCI as well and
>>>> does not scan the bus and won't find some devices without it).
>>>
>>> Yes, definitely sounds like you'll need an RTAS implementation.
>>>
>>>> While VOF can do rtas, this causes a problem with the hypercall method using
>>>> sc 1 that goes through vhyp but trips the assert in ppc_store_sdr1() so
>>>> cannot work after guest is past quiesce.
>>>
>>>> So the question is why is that
>>>> assert there
>>>
>>> Ah.. right.  So, vhyp was designed for the PAPR use case, where we
>>> want to model the CPU when it's in supervisor and user mode, but not
>>> when it's in hypervisor mode.  We want qemu to mimic the behaviour of
>>> the hypervisor, rather than attempting to actually execute hypervisor
>>> code in the virtual CPU.
>>>
>>> On systems that have a hypervisor mode, SDR1 is hypervisor privileged,
>>> so it makes no sense for the guest to attempt to set it.  That should
>>> be caught by the general SPR code and turned into a 0x700, hence the
>>> assert() if we somehow reach ppc_store_sdr1().
>>
>> This seems to work to avoid my problem so I can leave vhyp enabled after
>> qiuesce for now:
>>
>> diff --git a/target/ppc/cpu.c b/target/ppc/cpu.c
>> index d957d1a687..13b87b9b36 100644
>> --- a/target/ppc/cpu.c
>> +++ b/target/ppc/cpu.c
>> @@ -70,7 +70,7 @@ void ppc_store_sdr1(CPUPPCState *env, target_ulong value)
>>  {
>>      PowerPCCPU *cpu = env_archcpu(env);
>>      qemu_log_mask(CPU_LOG_MMU, "%s: " TARGET_FMT_lx "\n", __func__, value);
>> -    assert(!cpu->vhyp);
>> +    assert(!cpu->env.has_hv_mode || !cpu->vhyp);
>>  #if defined(TARGET_PPC64)
>>      if (mmu_is_64bit(env->mmu_model)) {
>>          target_ulong sdr_mask = SDR_64_HTABORG | SDR_64_HTABSIZE;
>>
>> But I wonder if the assert should also be moved within the TARGET_PPC64
>> block and if we may need to generate some exception here instead. Not sure
>> what a real CPU would do in this case but if accessing sdr1 is privileged in
>> HV mode then there should be an exception or if that's catched
>> elsewhere
>
> It should be caught elsehwere.  Specifically, when the SDR1 SPR is
> registered, on CPUs with a hypervisor mode it should be registered as
> hypervisor privileged, so the general mtspr dispatch logic should
> generate the exception if it's called from !HV code.  The assert here
> is just to sanity check that it has done so before we enter the actual
> softmmu code.

So what's the decision then? Remove this assert or modify it like above 
and move it to the TARGET_PPC64 block (as no 32 bit CPU should have an HV 
bit anyway).

Regards,
BALATON Zoltan
BALATON Zoltan May 25, 2021, 10:08 a.m. UTC | #35
On Tue, 25 May 2021, David Gibson wrote:
> On Mon, May 24, 2021 at 12:55:07PM +0200, BALATON Zoltan wrote:
>> On Mon, 24 May 2021, David Gibson wrote:
>>> On Sun, May 23, 2021 at 07:09:26PM +0200, BALATON Zoltan wrote:
>>>> On Sun, 23 May 2021, BALATON Zoltan wrote:
>>>>> On Sun, 23 May 2021, Alexey Kardashevskiy wrote:
>>>>>> One thing to note about PCI is that normally I think the client
>>>>>> expects the firmware to do PCI probing and SLOF does it. But VOF
>>>>>> does not and Linux scans PCI bus(es) itself. Might be a problem for
>>>>>> you kernel.
>>>>>
>>>>> I'm not sure what info does MorphOS get from the device tree and what it
>>>>> probes itself but I think it may at least need device ids and info about
>>>>> the PCI bus to be able to access the config regs, after that it should
>>>>> set the devices up hopefully. I could add these from the board code to
>>>>> device tree so VOF does not need to do anything about it. However I'm
>>>>> not getting to that point yet because it crashes on something that it's
>>>>> missing and couldn't yet find out what is that.
>>>>>
>>>>> I'd like to get Linux working now as that would be enough to test this
>>>>> and then if for MorphOS we still need a ROM it's not a problem if at
>>>>> least we can boot Linux without the original firmware. But I can't make
>>>>> Linux open a serial console and I don't know what it needs for that. Do
>>>>> you happen to know? I've looked at the sources in Linux/arch/powerpc but
>>>>> not sure how it would find and open a serial port on pegasos2. It seems
>>>>> to work with the board firmware and now I can get it to boot with VOF
>>>>> but then it does not open serial so it probably needs something in the
>>>>> device tree or expects the firmware to set something up that we should
>>>>> add in pegasos2.c when using VOF.
>>>>
>>>> I've now found that Linux uses rtas methods read-pci-config and
>>>> write-pci-config for PCI access on pegasos2 so this means that we'll
>>>> probably need rtas too (I hoped we could get away without it if it were only
>>>> used for shutdown/reboot or so but seems Linux needs it for PCI as well and
>>>> does not scan the bus and won't find some devices without it).
>>>
>>> Yes, definitely sounds like you'll need an RTAS implementation.
>>
>> I plan to fix that after managed to get serial working as that seems to not
>> need it. If I delete the rtas-size property from /rtas on the original
>> firmware that makes Linux skip instantiating rtas, but I still get serial
>> output just not accessing PCI devices. So I think it should work and keeps
>> things simpler at first. Then I'll try rtas later.
>>
>>>> While VOF can do rtas, this causes a problem with the hypercall method using
>>>> sc 1 that goes through vhyp but trips the assert in ppc_store_sdr1() so
>>>> cannot work after guest is past quiesce.
>>>
>>>> So the question is why is that
>>>> assert there
>>>
>>> Ah.. right.  So, vhyp was designed for the PAPR use case, where we
>>> want to model the CPU when it's in supervisor and user mode, but not
>>> when it's in hypervisor mode.  We want qemu to mimic the behaviour of
>>> the hypervisor, rather than attempting to actually execute hypervisor
>>> code in the virtual CPU.
>>>
>>> On systems that have a hypervisor mode, SDR1 is hypervisor privileged,
>>> so it makes no sense for the guest to attempt to set it.  That should
>>> be caught by the general SPR code and turned into a 0x700, hence the
>>> assert() if we somehow reach ppc_store_sdr1().
>>>
>>> So, we are seeing a problem here because you want the 'sc 1'
>>> interception of vhyp, but not the rest of the stuff that goes with it.
>>>
>>>> and would using sc 1 for hypercalls on pegasos2 cause other
>>>> problems later even if the assert could be removed?
>>>
>>> At least in the short term, I think you probably can remove the
>>> assert.  In your case the 'sc 1' calls aren't truly to a hypervisor,
>>> but a special case escape to qemu for the firmware emulation.  I think
>>> it's unlikely to cause problems later, because nothing on a 32-bit
>>> system should be attempting an 'sc 1'.  The only thing I can think of
>>> that would fail is some test case which explicitly verified that 'sc
>>> 1' triggered a 0x700 (SIGILL from userspace).
>>
>> OK so the assert should check if the CPU has an HV bit. I think there was a
>> #detine for that somewhere that I can add to the assert then I can try that.
>> What I wasn't sure about is that sc 1 would conflict with the guest's usage
>> of normal sc calls or are these going through different paths and only sc 1
>> will trigger vhyp callback not affecting notmal sc calls?
>
> The vhyp shouldn't affect normal system calls, 'sc 1' is specifically
> for hypercalls, as opposed to normal 'sc' (a.k.a. 'sc 0'), and the
> vhyp only intercepts the hypercall version (after all Linux on PAPR
> certainly uses its own system calls, and hypercalls are active for the
> lifetime of the guest there).
>
>> (Or if this causes
>> an otherwise unnecessary VM exit on KVM even when it works then maybe
>> looking for a different way in the future might be needed.
>
> What you're doing here won't work with KVM as it stands.  There are
> basically two paths into the vhyp hypercall path: 1) from TCG, if we
> interpret an 'sc 1' instruction we enter vhyp, 2) from KVM, if we get
> a KVM_EXIT_PAPR_HCALL KVM exit then we also go to the vhyp path.
>
> The second path is specific to the PAPR (ppc64) implementation of KVM,
> and will not work for a non-PAPR platform without substantial
> modification of the KVM code.

OK so then at that point when we try KVM we'll need to look at alternative 
ways, I think MOL OSI worked with KVM at least in MOL but will probably 
make all syscalls exit KVM but since we'll probably need to use KVM PR it 
will exit anyway. For now I keep this vhyp as it does not run with KVM for 
other reasons yet so that's another area to clean up so as a proof of 
concept first version of using VOF vhyp will do.

[...]
>>>> I've tested that the missing rtas is not the reason for getting no output
>>>> via serial though, as even when disabling rtas on pegasos2.rom it boots and
>>>> I still get serial output just some PCI devices are not detected (such as
>>>> USB, the video card and the not emulated ethernet port but these are not
>>>> fatal so it might even work as a first try without rtas, just to boot a
>>>> Linux kernel for testing it would be enough if I can fix the serial output).
>>>> I still don't know why it's not finding serial but I think it may be some
>>>> missing or wrong info in the device tree I generat. I'll try to focus on
>>>> this for now and leave the above rtas question for later.
>>>
>>> Oh.. another thought on that.  You have an ISA serial port on Pegasos,
>>> I believe.  I wonder if the PCI->ISA bridge needs some configuration /
>>> initialization that the firmware is expected to do.  If so you'll need
>>> to mimic that setup in qemu for the VOF case.
>>
>> That's what I begin to think because I've added everything to the device
>> tree that I thought could be needed and I still don't get it working so it
>> may need some config from the firmware. But how do I access device registers
>> from board code? I've tried adding a machine reset method and write to
>> memory mapped device registers but all my attempts failed. I've tried
>> cpu_stl_le_data and even memory_region_dispatch_write but these did not get
>> to the device. What's the way to access guest mmio regs from QEMU?
>
> That's odd, cpu_stl() and memory_region_dispatch_write() should work
> from board code (after the relevant memory regions are configured, of
> course).  As an ISA serial port, it's probably accessed through IO
> space, not memory space though, so you'd need &address_space_io.  And
> if there is some bridge configuration then it's the bridge control
> registers you need to look at not the serial registers - you'd have to
> look at the bridge documentation for that.  Or, I guess the bridge
> implementation in qemu, which you wrote part of.

I've found at last that stl_le_phys() works. There are so many of these 
that I never know when to use which.

I think the address_space_rw calls in vof_client_call() in vof.c could 
also use these for somewhat shorter code. I've ended up with 
stl_le_phys(CPU(cpu)->as, addr, val) in my machine reset methodbut I don't 
even need that now as it works without additional setup. Also VOF's memory 
access is basically the same as the already existing rtas_st() and co. so 
maybe that could be reused to make code smaller?

Regards,
BALATON Zoltan
David Gibson May 27, 2021, 5:31 a.m. UTC | #36
On Tue, May 25, 2021 at 11:55:43AM +0200, BALATON Zoltan wrote:
> On Tue, 25 May 2021, David Gibson wrote:
> > On Mon, May 24, 2021 at 02:42:30PM +0200, BALATON Zoltan wrote:
> > > On Mon, 24 May 2021, David Gibson wrote:
> > > > On Sun, May 23, 2021 at 07:09:26PM +0200, BALATON Zoltan wrote:
> > > > > On Sun, 23 May 2021, BALATON Zoltan wrote:
> > > > > > On Sun, 23 May 2021, Alexey Kardashevskiy wrote:
> > > > > > > One thing to note about PCI is that normally I think the client
> > > > > > > expects the firmware to do PCI probing and SLOF does it. But VOF
> > > > > > > does not and Linux scans PCI bus(es) itself. Might be a problem for
> > > > > > > you kernel.
> > > > > > 
> > > > > > I'm not sure what info does MorphOS get from the device tree and what it
> > > > > > probes itself but I think it may at least need device ids and info about
> > > > > > the PCI bus to be able to access the config regs, after that it should
> > > > > > set the devices up hopefully. I could add these from the board code to
> > > > > > device tree so VOF does not need to do anything about it. However I'm
> > > > > > not getting to that point yet because it crashes on something that it's
> > > > > > missing and couldn't yet find out what is that.
> > > > > > 
> > > > > > I'd like to get Linux working now as that would be enough to test this
> > > > > > and then if for MorphOS we still need a ROM it's not a problem if at
> > > > > > least we can boot Linux without the original firmware. But I can't make
> > > > > > Linux open a serial console and I don't know what it needs for that. Do
> > > > > > you happen to know? I've looked at the sources in Linux/arch/powerpc but
> > > > > > not sure how it would find and open a serial port on pegasos2. It seems
> > > > > > to work with the board firmware and now I can get it to boot with VOF
> > > > > > but then it does not open serial so it probably needs something in the
> > > > > > device tree or expects the firmware to set something up that we should
> > > > > > add in pegasos2.c when using VOF.
> > > > > 
> > > > > I've now found that Linux uses rtas methods read-pci-config and
> > > > > write-pci-config for PCI access on pegasos2 so this means that we'll
> > > > > probably need rtas too (I hoped we could get away without it if it were only
> > > > > used for shutdown/reboot or so but seems Linux needs it for PCI as well and
> > > > > does not scan the bus and won't find some devices without it).
> > > > 
> > > > Yes, definitely sounds like you'll need an RTAS implementation.
> > > > 
> > > > > While VOF can do rtas, this causes a problem with the hypercall method using
> > > > > sc 1 that goes through vhyp but trips the assert in ppc_store_sdr1() so
> > > > > cannot work after guest is past quiesce.
> > > > 
> > > > > So the question is why is that
> > > > > assert there
> > > > 
> > > > Ah.. right.  So, vhyp was designed for the PAPR use case, where we
> > > > want to model the CPU when it's in supervisor and user mode, but not
> > > > when it's in hypervisor mode.  We want qemu to mimic the behaviour of
> > > > the hypervisor, rather than attempting to actually execute hypervisor
> > > > code in the virtual CPU.
> > > > 
> > > > On systems that have a hypervisor mode, SDR1 is hypervisor privileged,
> > > > so it makes no sense for the guest to attempt to set it.  That should
> > > > be caught by the general SPR code and turned into a 0x700, hence the
> > > > assert() if we somehow reach ppc_store_sdr1().
> > > 
> > > This seems to work to avoid my problem so I can leave vhyp enabled after
> > > qiuesce for now:
> > > 
> > > diff --git a/target/ppc/cpu.c b/target/ppc/cpu.c
> > > index d957d1a687..13b87b9b36 100644
> > > --- a/target/ppc/cpu.c
> > > +++ b/target/ppc/cpu.c
> > > @@ -70,7 +70,7 @@ void ppc_store_sdr1(CPUPPCState *env, target_ulong value)
> > >  {
> > >      PowerPCCPU *cpu = env_archcpu(env);
> > >      qemu_log_mask(CPU_LOG_MMU, "%s: " TARGET_FMT_lx "\n", __func__, value);
> > > -    assert(!cpu->vhyp);
> > > +    assert(!cpu->env.has_hv_mode || !cpu->vhyp);
> > >  #if defined(TARGET_PPC64)
> > >      if (mmu_is_64bit(env->mmu_model)) {
> > >          target_ulong sdr_mask = SDR_64_HTABORG | SDR_64_HTABSIZE;
> > > 
> > > But I wonder if the assert should also be moved within the TARGET_PPC64
> > > block and if we may need to generate some exception here instead. Not sure
> > > what a real CPU would do in this case but if accessing sdr1 is privileged in
> > > HV mode then there should be an exception or if that's catched
> > > elsewhere
> > 
> > It should be caught elsehwere.  Specifically, when the SDR1 SPR is
> > registered, on CPUs with a hypervisor mode it should be registered as
> > hypervisor privileged, so the general mtspr dispatch logic should
> > generate the exception if it's called from !HV code.  The assert here
> > is just to sanity check that it has done so before we enter the actual
> > softmmu code.
> 
> So what's the decision then? Remove this assert or modify it like above and
> move it to the TARGET_PPC64 block (as no 32 bit CPU should have an HV bit
> anyway).

Uh, I guess modify it with the if-hv-available thing.  Don't move it
under the ifdef, it still makes logical sense for 32-bit systems, even
though the HV available side should never trip.
David Gibson May 27, 2021, 5:34 a.m. UTC | #37
On Tue, May 25, 2021 at 12:08:45PM +0200, BALATON Zoltan wrote:
> On Tue, 25 May 2021, David Gibson wrote:
> > On Mon, May 24, 2021 at 12:55:07PM +0200, BALATON Zoltan wrote:
> > > On Mon, 24 May 2021, David Gibson wrote:
> > > > On Sun, May 23, 2021 at 07:09:26PM +0200, BALATON Zoltan wrote:
> > > > > On Sun, 23 May 2021, BALATON Zoltan wrote:
> > > > > > On Sun, 23 May 2021, Alexey Kardashevskiy wrote:
> > > > > > > One thing to note about PCI is that normally I think the client
> > > > > > > expects the firmware to do PCI probing and SLOF does it. But VOF
> > > > > > > does not and Linux scans PCI bus(es) itself. Might be a problem for
> > > > > > > you kernel.
> > > > > > 
> > > > > > I'm not sure what info does MorphOS get from the device tree and what it
> > > > > > probes itself but I think it may at least need device ids and info about
> > > > > > the PCI bus to be able to access the config regs, after that it should
> > > > > > set the devices up hopefully. I could add these from the board code to
> > > > > > device tree so VOF does not need to do anything about it. However I'm
> > > > > > not getting to that point yet because it crashes on something that it's
> > > > > > missing and couldn't yet find out what is that.
> > > > > > 
> > > > > > I'd like to get Linux working now as that would be enough to test this
> > > > > > and then if for MorphOS we still need a ROM it's not a problem if at
> > > > > > least we can boot Linux without the original firmware. But I can't make
> > > > > > Linux open a serial console and I don't know what it needs for that. Do
> > > > > > you happen to know? I've looked at the sources in Linux/arch/powerpc but
> > > > > > not sure how it would find and open a serial port on pegasos2. It seems
> > > > > > to work with the board firmware and now I can get it to boot with VOF
> > > > > > but then it does not open serial so it probably needs something in the
> > > > > > device tree or expects the firmware to set something up that we should
> > > > > > add in pegasos2.c when using VOF.
> > > > > 
> > > > > I've now found that Linux uses rtas methods read-pci-config and
> > > > > write-pci-config for PCI access on pegasos2 so this means that we'll
> > > > > probably need rtas too (I hoped we could get away without it if it were only
> > > > > used for shutdown/reboot or so but seems Linux needs it for PCI as well and
> > > > > does not scan the bus and won't find some devices without it).
> > > > 
> > > > Yes, definitely sounds like you'll need an RTAS implementation.
> > > 
> > > I plan to fix that after managed to get serial working as that seems to not
> > > need it. If I delete the rtas-size property from /rtas on the original
> > > firmware that makes Linux skip instantiating rtas, but I still get serial
> > > output just not accessing PCI devices. So I think it should work and keeps
> > > things simpler at first. Then I'll try rtas later.
> > > 
> > > > > While VOF can do rtas, this causes a problem with the hypercall method using
> > > > > sc 1 that goes through vhyp but trips the assert in ppc_store_sdr1() so
> > > > > cannot work after guest is past quiesce.
> > > > 
> > > > > So the question is why is that
> > > > > assert there
> > > > 
> > > > Ah.. right.  So, vhyp was designed for the PAPR use case, where we
> > > > want to model the CPU when it's in supervisor and user mode, but not
> > > > when it's in hypervisor mode.  We want qemu to mimic the behaviour of
> > > > the hypervisor, rather than attempting to actually execute hypervisor
> > > > code in the virtual CPU.
> > > > 
> > > > On systems that have a hypervisor mode, SDR1 is hypervisor privileged,
> > > > so it makes no sense for the guest to attempt to set it.  That should
> > > > be caught by the general SPR code and turned into a 0x700, hence the
> > > > assert() if we somehow reach ppc_store_sdr1().
> > > > 
> > > > So, we are seeing a problem here because you want the 'sc 1'
> > > > interception of vhyp, but not the rest of the stuff that goes with it.
> > > > 
> > > > > and would using sc 1 for hypercalls on pegasos2 cause other
> > > > > problems later even if the assert could be removed?
> > > > 
> > > > At least in the short term, I think you probably can remove the
> > > > assert.  In your case the 'sc 1' calls aren't truly to a hypervisor,
> > > > but a special case escape to qemu for the firmware emulation.  I think
> > > > it's unlikely to cause problems later, because nothing on a 32-bit
> > > > system should be attempting an 'sc 1'.  The only thing I can think of
> > > > that would fail is some test case which explicitly verified that 'sc
> > > > 1' triggered a 0x700 (SIGILL from userspace).
> > > 
> > > OK so the assert should check if the CPU has an HV bit. I think there was a
> > > #detine for that somewhere that I can add to the assert then I can try that.
> > > What I wasn't sure about is that sc 1 would conflict with the guest's usage
> > > of normal sc calls or are these going through different paths and only sc 1
> > > will trigger vhyp callback not affecting notmal sc calls?
> > 
> > The vhyp shouldn't affect normal system calls, 'sc 1' is specifically
> > for hypercalls, as opposed to normal 'sc' (a.k.a. 'sc 0'), and the
> > vhyp only intercepts the hypercall version (after all Linux on PAPR
> > certainly uses its own system calls, and hypercalls are active for the
> > lifetime of the guest there).
> > 
> > > (Or if this causes
> > > an otherwise unnecessary VM exit on KVM even when it works then maybe
> > > looking for a different way in the future might be needed.
> > 
> > What you're doing here won't work with KVM as it stands.  There are
> > basically two paths into the vhyp hypercall path: 1) from TCG, if we
> > interpret an 'sc 1' instruction we enter vhyp, 2) from KVM, if we get
> > a KVM_EXIT_PAPR_HCALL KVM exit then we also go to the vhyp path.
> > 
> > The second path is specific to the PAPR (ppc64) implementation of KVM,
> > and will not work for a non-PAPR platform without substantial
> > modification of the KVM code.
> 
> OK so then at that point when we try KVM we'll need to look at alternative
> ways, I think MOL OSI worked with KVM at least in MOL but will probably make
> all syscalls exit KVM but since we'll probably need to use KVM PR it will
> exit anyway. For now I keep this vhyp as it does not run with KVM for other
> reasons yet so that's another area to clean up so as a proof of concept
> first version of using VOF vhyp will do.

Eh, since you'll need to modify KVM anyway, it probably makes just as
much sense to modify it to catch the 'sc 1' as MoL's magic thingy.

> [...]
> > > > > I've tested that the missing rtas is not the reason for getting no output
> > > > > via serial though, as even when disabling rtas on pegasos2.rom it boots and
> > > > > I still get serial output just some PCI devices are not detected (such as
> > > > > USB, the video card and the not emulated ethernet port but these are not
> > > > > fatal so it might even work as a first try without rtas, just to boot a
> > > > > Linux kernel for testing it would be enough if I can fix the serial output).
> > > > > I still don't know why it's not finding serial but I think it may be some
> > > > > missing or wrong info in the device tree I generat. I'll try to focus on
> > > > > this for now and leave the above rtas question for later.
> > > > 
> > > > Oh.. another thought on that.  You have an ISA serial port on Pegasos,
> > > > I believe.  I wonder if the PCI->ISA bridge needs some configuration /
> > > > initialization that the firmware is expected to do.  If so you'll need
> > > > to mimic that setup in qemu for the VOF case.
> > > 
> > > That's what I begin to think because I've added everything to the device
> > > tree that I thought could be needed and I still don't get it working so it
> > > may need some config from the firmware. But how do I access device registers
> > > from board code? I've tried adding a machine reset method and write to
> > > memory mapped device registers but all my attempts failed. I've tried
> > > cpu_stl_le_data and even memory_region_dispatch_write but these did not get
> > > to the device. What's the way to access guest mmio regs from QEMU?
> > 
> > That's odd, cpu_stl() and memory_region_dispatch_write() should work
> > from board code (after the relevant memory regions are configured, of
> > course).  As an ISA serial port, it's probably accessed through IO
> > space, not memory space though, so you'd need &address_space_io.  And
> > if there is some bridge configuration then it's the bridge control
> > registers you need to look at not the serial registers - you'd have to
> > look at the bridge documentation for that.  Or, I guess the bridge
> > implementation in qemu, which you wrote part of.
> 
> I've found at last that stl_le_phys() works. There are so many of these that
> I never know when to use which.
> 
> I think the address_space_rw calls in vof_client_call() in vof.c could also
> use these for somewhat shorter code. I've ended up with
> stl_le_phys(CPU(cpu)->as, addr, val) in my machine reset methodbut I don't
> even need that now as it works without additional setup. Also VOF's memory
> access is basically the same as the already existing rtas_st() and co. so
> maybe that could be reused to make code smaller?

rtas_ld() and rtas_st() should only be used for reading/writing RTAS
parameters to and from memory.  Accessing IO shouldn't be done with
those.

For IO you probably want the cpu_st*() variants in most cases, since
you're trying to emulate an IO access from the virtual cpu.
BALATON Zoltan May 27, 2021, 12:42 p.m. UTC | #38
On Thu, 27 May 2021, David Gibson wrote:
> On Tue, May 25, 2021 at 12:08:45PM +0200, BALATON Zoltan wrote:
>> On Tue, 25 May 2021, David Gibson wrote:
>>> On Mon, May 24, 2021 at 12:55:07PM +0200, BALATON Zoltan wrote:
>>>> On Mon, 24 May 2021, David Gibson wrote:
>>>>> On Sun, May 23, 2021 at 07:09:26PM +0200, BALATON Zoltan wrote:
>>>>>> On Sun, 23 May 2021, BALATON Zoltan wrote:
>>>>>>> On Sun, 23 May 2021, Alexey Kardashevskiy wrote:
>>>>>>>> One thing to note about PCI is that normally I think the client
>>>>>>>> expects the firmware to do PCI probing and SLOF does it. But VOF
>>>>>>>> does not and Linux scans PCI bus(es) itself. Might be a problem for
>>>>>>>> you kernel.
>>>>>>>
>>>>>>> I'm not sure what info does MorphOS get from the device tree and what it
>>>>>>> probes itself but I think it may at least need device ids and info about
>>>>>>> the PCI bus to be able to access the config regs, after that it should
>>>>>>> set the devices up hopefully. I could add these from the board code to
>>>>>>> device tree so VOF does not need to do anything about it. However I'm
>>>>>>> not getting to that point yet because it crashes on something that it's
>>>>>>> missing and couldn't yet find out what is that.
>>>>>>>
>>>>>>> I'd like to get Linux working now as that would be enough to test this
>>>>>>> and then if for MorphOS we still need a ROM it's not a problem if at
>>>>>>> least we can boot Linux without the original firmware. But I can't make
>>>>>>> Linux open a serial console and I don't know what it needs for that. Do
>>>>>>> you happen to know? I've looked at the sources in Linux/arch/powerpc but
>>>>>>> not sure how it would find and open a serial port on pegasos2. It seems
>>>>>>> to work with the board firmware and now I can get it to boot with VOF
>>>>>>> but then it does not open serial so it probably needs something in the
>>>>>>> device tree or expects the firmware to set something up that we should
>>>>>>> add in pegasos2.c when using VOF.
>>>>>>
>>>>>> I've now found that Linux uses rtas methods read-pci-config and
>>>>>> write-pci-config for PCI access on pegasos2 so this means that we'll
>>>>>> probably need rtas too (I hoped we could get away without it if it were only
>>>>>> used for shutdown/reboot or so but seems Linux needs it for PCI as well and
>>>>>> does not scan the bus and won't find some devices without it).
>>>>>
>>>>> Yes, definitely sounds like you'll need an RTAS implementation.
>>>>
>>>> I plan to fix that after managed to get serial working as that seems to not
>>>> need it. If I delete the rtas-size property from /rtas on the original
>>>> firmware that makes Linux skip instantiating rtas, but I still get serial
>>>> output just not accessing PCI devices. So I think it should work and keeps
>>>> things simpler at first. Then I'll try rtas later.
>>>>
>>>>>> While VOF can do rtas, this causes a problem with the hypercall method using
>>>>>> sc 1 that goes through vhyp but trips the assert in ppc_store_sdr1() so
>>>>>> cannot work after guest is past quiesce.
>>>>>
>>>>>> So the question is why is that
>>>>>> assert there
>>>>>
>>>>> Ah.. right.  So, vhyp was designed for the PAPR use case, where we
>>>>> want to model the CPU when it's in supervisor and user mode, but not
>>>>> when it's in hypervisor mode.  We want qemu to mimic the behaviour of
>>>>> the hypervisor, rather than attempting to actually execute hypervisor
>>>>> code in the virtual CPU.
>>>>>
>>>>> On systems that have a hypervisor mode, SDR1 is hypervisor privileged,
>>>>> so it makes no sense for the guest to attempt to set it.  That should
>>>>> be caught by the general SPR code and turned into a 0x700, hence the
>>>>> assert() if we somehow reach ppc_store_sdr1().
>>>>>
>>>>> So, we are seeing a problem here because you want the 'sc 1'
>>>>> interception of vhyp, but not the rest of the stuff that goes with it.
>>>>>
>>>>>> and would using sc 1 for hypercalls on pegasos2 cause other
>>>>>> problems later even if the assert could be removed?
>>>>>
>>>>> At least in the short term, I think you probably can remove the
>>>>> assert.  In your case the 'sc 1' calls aren't truly to a hypervisor,
>>>>> but a special case escape to qemu for the firmware emulation.  I think
>>>>> it's unlikely to cause problems later, because nothing on a 32-bit
>>>>> system should be attempting an 'sc 1'.  The only thing I can think of
>>>>> that would fail is some test case which explicitly verified that 'sc
>>>>> 1' triggered a 0x700 (SIGILL from userspace).
>>>>
>>>> OK so the assert should check if the CPU has an HV bit. I think there was a
>>>> #detine for that somewhere that I can add to the assert then I can try that.
>>>> What I wasn't sure about is that sc 1 would conflict with the guest's usage
>>>> of normal sc calls or are these going through different paths and only sc 1
>>>> will trigger vhyp callback not affecting notmal sc calls?
>>>
>>> The vhyp shouldn't affect normal system calls, 'sc 1' is specifically
>>> for hypercalls, as opposed to normal 'sc' (a.k.a. 'sc 0'), and the
>>> vhyp only intercepts the hypercall version (after all Linux on PAPR
>>> certainly uses its own system calls, and hypercalls are active for the
>>> lifetime of the guest there).
>>>
>>>> (Or if this causes
>>>> an otherwise unnecessary VM exit on KVM even when it works then maybe
>>>> looking for a different way in the future might be needed.
>>>
>>> What you're doing here won't work with KVM as it stands.  There are
>>> basically two paths into the vhyp hypercall path: 1) from TCG, if we
>>> interpret an 'sc 1' instruction we enter vhyp, 2) from KVM, if we get
>>> a KVM_EXIT_PAPR_HCALL KVM exit then we also go to the vhyp path.
>>>
>>> The second path is specific to the PAPR (ppc64) implementation of KVM,
>>> and will not work for a non-PAPR platform without substantial
>>> modification of the KVM code.
>>
>> OK so then at that point when we try KVM we'll need to look at alternative
>> ways, I think MOL OSI worked with KVM at least in MOL but will probably make
>> all syscalls exit KVM but since we'll probably need to use KVM PR it will
>> exit anyway. For now I keep this vhyp as it does not run with KVM for other
>> reasons yet so that's another area to clean up so as a proof of concept
>> first version of using VOF vhyp will do.
>
> Eh, since you'll need to modify KVM anyway, it probably makes just as
> much sense to modify it to catch the 'sc 1' as MoL's magic thingy.

I'm not sure how KVM works for this case so I also don't know why and what 
would need to be modified. I think we'll only have KVM PR working as newer 
POWER CPUs having HV (besides being rare among potential users) are 
probably too different to run the OSes that expect at most a G4 on 
pegasos2 so likely it won't work with KVM HV. If we have KVM PR doesn't sc 
already trap so we could add MOL OSI without further modification to KVM 
itself only needing change in QEMU? I also hope that MOL OSI could be 
useful for porting some paravirt drivers from MOL for running Mac OS X on 
Mac emulation but I don't know about that for sure so I'm open to any 
other solution too. For now I'm going with vhyp which is enough fot 
testing with TCG and if somebody wants KVM they could use he original 
firmware for now so this could be improved in a later version unless a 
simple solution is found before the freeze for 6.1. If we're in KVM PR 
what happens for sc 1 could that be used too so maybe what we have now 
could work?

>> [...]
>>>>>> I've tested that the missing rtas is not the reason for getting no output
>>>>>> via serial though, as even when disabling rtas on pegasos2.rom it boots and
>>>>>> I still get serial output just some PCI devices are not detected (such as
>>>>>> USB, the video card and the not emulated ethernet port but these are not
>>>>>> fatal so it might even work as a first try without rtas, just to boot a
>>>>>> Linux kernel for testing it would be enough if I can fix the serial output).
>>>>>> I still don't know why it's not finding serial but I think it may be some
>>>>>> missing or wrong info in the device tree I generat. I'll try to focus on
>>>>>> this for now and leave the above rtas question for later.
>>>>>
>>>>> Oh.. another thought on that.  You have an ISA serial port on Pegasos,
>>>>> I believe.  I wonder if the PCI->ISA bridge needs some configuration /
>>>>> initialization that the firmware is expected to do.  If so you'll need
>>>>> to mimic that setup in qemu for the VOF case.
>>>>
>>>> That's what I begin to think because I've added everything to the device
>>>> tree that I thought could be needed and I still don't get it working so it
>>>> may need some config from the firmware. But how do I access device registers
>>>> from board code? I've tried adding a machine reset method and write to
>>>> memory mapped device registers but all my attempts failed. I've tried
>>>> cpu_stl_le_data and even memory_region_dispatch_write but these did not get
>>>> to the device. What's the way to access guest mmio regs from QEMU?
>>>
>>> That's odd, cpu_stl() and memory_region_dispatch_write() should work
>>> from board code (after the relevant memory regions are configured, of
>>> course).  As an ISA serial port, it's probably accessed through IO
>>> space, not memory space though, so you'd need &address_space_io.  And
>>> if there is some bridge configuration then it's the bridge control
>>> registers you need to look at not the serial registers - you'd have to
>>> look at the bridge documentation for that.  Or, I guess the bridge
>>> implementation in qemu, which you wrote part of.
>>
>> I've found at last that stl_le_phys() works. There are so many of these that
>> I never know when to use which.
>>
>> I think the address_space_rw calls in vof_client_call() in vof.c could also
>> use these for somewhat shorter code. I've ended up with
>> stl_le_phys(CPU(cpu)->as, addr, val) in my machine reset methodbut I don't
>> even need that now as it works without additional setup. Also VOF's memory
>> access is basically the same as the already existing rtas_st() and co. so
>> maybe that could be reused to make code smaller?
>
> rtas_ld() and rtas_st() should only be used for reading/writing RTAS
> parameters to and from memory.  Accessing IO shouldn't be done with
> those.
>
> For IO you probably want the cpu_st*() variants in most cases, since
> you're trying to emulate an IO access from the virtual cpu.

I think I've tried that but what worked to access mmio device registers 
are stl_le_phys and similar that are wrappers around address_space_stl_*. 
But I did not mean that for rtas_ld/_st but the part when vof accessing 
the parameters passed by its hypercall which is memory access:

https://github.com/patchew-project/qemu/blob/patchew/20210520090557.435689-1-aik%40ozlabs.ru/hw/ppc/vof.c

line 893, and vof_client_call before that is very similar to what h_rtas 
does here:

https://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/spapr_hcall.c;h=f25014afda408002ee1ec1027a0dd7a6025eca61;hb=HEAD#l639

and I also need to do the same for rtas in pegasos2 for which I'm just 
using ldl_be_phys for now but I wonder if we really need 3 ways to do the 
same or the rtas_ld/_st could be made more generic and reused here?

Regards,
BALATON Zoltan
BALATON Zoltan May 29, 2021, 6:10 p.m. UTC | #39
On Thu, 20 May 2021, Alexey Kardashevskiy wrote:
> diff --git a/hw/ppc/spapr_vof.c b/hw/ppc/spapr_vof.c
> new file mode 100644
> index 000000000000..5e34d5402abf
> --- /dev/null
> +++ b/hw/ppc/spapr_vof.c
> @@ -0,0 +1,156 @@
> +/*
> + * SPAPR machine hooks to Virtual Open Firmware,
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +#include "qemu/osdep.h"
> +#include "qemu-common.h"
> +#include <sys/ioctl.h>
> +#include "qapi/error.h"
> +#include "hw/ppc/spapr.h"
> +#include "hw/ppc/spapr_vio.h"
> +#include "hw/ppc/fdt.h"
> +#include "sysemu/sysemu.h"
> +#include "qom/qom-qobject.h"
> +#include "trace.h"
> +
> +/* Copied from SLOF, and 4K is definitely not enough for GRUB */
> +#define OF_STACK_SIZE       0x8000

I found a reference explaining its value better than the comment above. 
Section 8.2.2 here:

https://www.devicetree.org/open-firmware/bindings/ppc/release/ppc-2_1.html

says it should be at least 32k. This define should be in vof.h so I don't 
have to duplicate it in pegasos2.c. Or vof_init could allocate and claim 
the stack so board code doesn't have to do that either. Maybe taking a 
pointer argument for preferred stack address as input and could return 
the aligned address where the stack was allocated or just store stack_base 
in struct vof where tha board code could get it for adding to r1 on 
calling the guest code.

Regards,
BALATON Zoltan
BALATON Zoltan May 30, 2021, 5:33 p.m. UTC | #40
Hello,

Two more problems I've found while testing with pegasos2 but I'm not sure 
how to fix them:

On Thu, 20 May 2021, Alexey Kardashevskiy wrote:
> diff --git a/hw/ppc/vof.c b/hw/ppc/vof.c
> new file mode 100644
> index 000000000000..a283b7d251a7
> --- /dev/null
> +++ b/hw/ppc/vof.c
> @@ -0,0 +1,1021 @@
> +/*
> + * QEMU PowerPC Virtual Open Firmware.
> + *
> + * This implements client interface from OpenFirmware IEEE1275 on the QEMU
> + * side to leave only a very basic firmware in the VM.
> + *
> + * Copyright (c) 2021 IBM Corporation.
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu-common.h"
> +#include "qemu/timer.h"
> +#include "qemu/range.h"
> +#include "qemu/units.h"
> +#include "qapi/error.h"
> +#include <sys/ioctl.h>
> +#include "exec/ram_addr.h"
> +#include "exec/address-spaces.h"
> +#include "hw/ppc/vof.h"
> +#include "hw/ppc/fdt.h"
> +#include "sysemu/runstate.h"
> +#include "qom/qom-qobject.h"
> +#include "trace.h"
> +
> +#include <libfdt.h>
> +
> +/*
> + * OF 1275 "nextprop" description suggests is it 32 bytes max but
> + * LoPAPR defines "ibm,query-interrupt-source-number" which is 33 chars long.
> + */
> +#define OF_PROPNAME_LEN_MAX 64
> +
> +#define VOF_MAX_PATH        256
> +#define VOF_MAX_SETPROPLEN  2048
> +#define VOF_MAX_METHODLEN   256
> +#define VOF_MAX_FORTHCODE   256
> +#define VOF_VTY_BUF_SIZE    256
> +
> +typedef struct {
> +    uint64_t start;
> +    uint64_t size;
> +} OfClaimed;
> +
> +typedef struct {
> +    char *path; /* the path used to open the instance */
> +    uint32_t phandle;
> +} OfInstance;
> +
> +#define VOF_MEM_READ(pa, buf, size) \
> +    address_space_read_full(&address_space_memory, \
> +    (pa), MEMTXATTRS_UNSPECIFIED, (buf), (size))
> +#define VOF_MEM_WRITE(pa, buf, size) \
> +    address_space_write(&address_space_memory, \
> +    (pa), MEMTXATTRS_UNSPECIFIED, (buf), (size))
> +
> +static int readstr(hwaddr pa, char *buf, int size)
> +{
> +    if (VOF_MEM_READ(pa, buf, size) != MEMTX_OK) {
> +        return -1;
> +    }
> +    if (strnlen(buf, size) == size) {
> +        buf[size - 1] = '\0';
> +        trace_vof_error_str_truncated(buf, size);
> +        return -1;
> +    }
> +    return 0;
> +}
> +
> +static bool cmpservice(const char *s, unsigned nargs, unsigned nret,
> +                       const char *s1, unsigned nargscheck, unsigned nretcheck)
> +{
> +    if (strcmp(s, s1)) {
> +        return false;
> +    }
> +    if ((nargscheck && (nargs != nargscheck)) ||
> +        (nretcheck && (nret != nretcheck))) {
> +        trace_vof_error_param(s, nargscheck, nretcheck, nargs, nret);
> +        return false;
> +    }
> +
> +    return true;
> +}
> +
> +static void prop_format(char *tval, int tlen, const void *prop, int len)
> +{
> +    int i;
> +    const unsigned char *c;
> +    char *t;
> +    const char bin[] = "...";
> +
> +    for (i = 0, c = prop; i < len; ++i, ++c) {
> +        if (*c == '\0' && i == len - 1) {
> +            strncpy(tval, prop, tlen - 1);
> +            return;
> +        }
> +        if (*c < 0x20 || *c >= 0x80) {
> +            break;
> +        }
> +    }
> +
> +    for (i = 0, c = prop, t = tval; i < len; ++i, ++c) {
> +        if (t >= tval + tlen - sizeof(bin) - 1 - 2 - 1) {
> +            strcpy(t, bin);
> +            return;
> +        }
> +        if (i && i % 4 == 0 && i != len - 1) {
> +            strcat(t, " ");
> +            ++t;
> +        }
> +        t += sprintf(t, "%02X", *c & 0xFF);
> +    }
> +}
> +
> +static int get_path(const void *fdt, int offset, char *buf, int len)
> +{
> +    int ret;
> +
> +    ret = fdt_get_path(fdt, offset, buf, len - 1);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    buf[len - 1] = '\0';
> +
> +    return strlen(buf) + 1;
> +}
> +
> +static int phandle_to_path(const void *fdt, uint32_t ph, char *buf, int len)
> +{
> +    int ret;
> +
> +    ret = fdt_node_offset_by_phandle(fdt, ph);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    return get_path(fdt, ret, buf, len);
> +}
> +
> +static uint32_t vof_finddevice(const void *fdt, uint32_t nodeaddr)
> +{
> +    char fullnode[VOF_MAX_PATH];
> +    uint32_t ret = -1;
> +    int offset;
> +
> +    if (readstr(nodeaddr, fullnode, sizeof(fullnode))) {
> +        return (uint32_t) ret;
> +    }
> +
> +    offset = fdt_path_offset(fdt, fullnode);
> +    if (offset >= 0) {
> +        ret = fdt_get_phandle(fdt, offset);
> +    }
> +    trace_vof_finddevice(fullnode, ret);
> +    return (uint32_t) ret;
> +}

The Linux init function that runs on pegasos2 here:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/powerpc/kernel/prom_init.c?h=v4.14.234#n2658

calls finddevice once with isa@c and next with isa@C (small and capital C) 
both of which works with the board firmware but with vof the comparison is 
case sensitive and one of these fails so I can't make it work. I don't 
know if this is a problem in libfdt or the vof_finddevice above should do 
something else to get case insensitive comparison.

> +
> +static const void *getprop(const void *fdt, int nodeoff, const char *propname,
> +                           int *proplen, bool *write0)
> +{
> +    const char *unit, *prop;
> +
> +    /*
> +     * The "name" property is not actually stored as a property in the FDT,
> +     * we emulate it by returning a pointer to the node's name and adjust
> +     * proplen to include only the name but not the unit.
> +     */
> +    if (strcmp(propname, "name") == 0) {
> +        prop = fdt_get_name(fdt, nodeoff, proplen);
> +        if (!prop) {
> +            *proplen = 0;
> +            return NULL;
> +        }
> +
> +        unit = memchr(prop, '@', *proplen);
> +        if (unit) {
> +            *proplen = unit - prop;
> +        }
> +        *proplen += 1;
> +
> +        /*
> +         * Since it might be cut at "@" and there will be no trailing zero
> +         * in the prop buffer, tell the caller to write zero at the end.
> +         */
> +        if (write0) {
> +            *write0 = true;
> +        }
> +        return prop;
> +    }
> +
> +    if (write0) {
> +        *write0 = false;
> +    }
> +    return fdt_getprop(fdt, nodeoff, propname, proplen);
> +}

MorphOS checks the name property of the root node ("/") to decide what 
platform it runs on so we may need to be able to set this property on / 
where it should return "bplan,Pegasos2", therefore the above maybe should 
do getprop first and only generate name property if it's not set (or at 
least check if we're on the root node and allow setting name property 
there). (On Macs the root node is named "device-tree" and this was before 
found to be needed for MorphOS.)

Other than the above two problems, I've found that getting the device tree 
from vof returns it in reverse order compared to the board firmware if I 
add it the expected order. This may or may not be a problem but to avoid 
it I can build the tree in reverse order then it comes out right so unless 
there's an easy fix this should not cause a problem but may worth a 
comment somewhere.

Regards,
BALATON Zoltan
BALATON Zoltan May 31, 2021, 1:07 p.m. UTC | #41
On Sun, 30 May 2021, BALATON Zoltan wrote:
> On Thu, 20 May 2021, Alexey Kardashevskiy wrote:
>> diff --git a/hw/ppc/vof.c b/hw/ppc/vof.c
>> new file mode 100644
>> index 000000000000..a283b7d251a7
>> --- /dev/null
>> +++ b/hw/ppc/vof.c
>> @@ -0,0 +1,1021 @@
>> +/*
>> + * QEMU PowerPC Virtual Open Firmware.
>> + *
>> + * This implements client interface from OpenFirmware IEEE1275 on the QEMU
>> + * side to leave only a very basic firmware in the VM.
>> + *
>> + * Copyright (c) 2021 IBM Corporation.
>> + *
>> + * SPDX-License-Identifier: GPL-2.0-or-later
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu-common.h"
>> +#include "qemu/timer.h"
>> +#include "qemu/range.h"
>> +#include "qemu/units.h"
>> +#include "qapi/error.h"
>> +#include <sys/ioctl.h>
>> +#include "exec/ram_addr.h"
>> +#include "exec/address-spaces.h"
>> +#include "hw/ppc/vof.h"
>> +#include "hw/ppc/fdt.h"
>> +#include "sysemu/runstate.h"
>> +#include "qom/qom-qobject.h"
>> +#include "trace.h"
>> +
>> +#include <libfdt.h>
>> +
>> +/*
>> + * OF 1275 "nextprop" description suggests is it 32 bytes max but
>> + * LoPAPR defines "ibm,query-interrupt-source-number" which is 33 chars 
>> long.
>> + */
>> +#define OF_PROPNAME_LEN_MAX 64
>> +
>> +#define VOF_MAX_PATH        256
>> +#define VOF_MAX_SETPROPLEN  2048
>> +#define VOF_MAX_METHODLEN   256
>> +#define VOF_MAX_FORTHCODE   256
>> +#define VOF_VTY_BUF_SIZE    256
>> +
>> +typedef struct {
>> +    uint64_t start;
>> +    uint64_t size;
>> +} OfClaimed;
>> +
>> +typedef struct {
>> +    char *path; /* the path used to open the instance */
>> +    uint32_t phandle;
>> +} OfInstance;
>> +
>> +#define VOF_MEM_READ(pa, buf, size) \
>> +    address_space_read_full(&address_space_memory, \
>> +    (pa), MEMTXATTRS_UNSPECIFIED, (buf), (size))
>> +#define VOF_MEM_WRITE(pa, buf, size) \
>> +    address_space_write(&address_space_memory, \
>> +    (pa), MEMTXATTRS_UNSPECIFIED, (buf), (size))
>> +
>> +static int readstr(hwaddr pa, char *buf, int size)
>> +{
>> +    if (VOF_MEM_READ(pa, buf, size) != MEMTX_OK) {
>> +        return -1;
>> +    }
>> +    if (strnlen(buf, size) == size) {
>> +        buf[size - 1] = '\0';
>> +        trace_vof_error_str_truncated(buf, size);
>> +        return -1;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static bool cmpservice(const char *s, unsigned nargs, unsigned nret,
>> +                       const char *s1, unsigned nargscheck, unsigned 
>> nretcheck)
>> +{
>> +    if (strcmp(s, s1)) {
>> +        return false;
>> +    }
>> +    if ((nargscheck && (nargs != nargscheck)) ||
>> +        (nretcheck && (nret != nretcheck))) {
>> +        trace_vof_error_param(s, nargscheck, nretcheck, nargs, nret);
>> +        return false;
>> +    }
>> +
>> +    return true;
>> +}
>> +
>> +static void prop_format(char *tval, int tlen, const void *prop, int len)
>> +{
>> +    int i;
>> +    const unsigned char *c;
>> +    char *t;
>> +    const char bin[] = "...";
>> +
>> +    for (i = 0, c = prop; i < len; ++i, ++c) {
>> +        if (*c == '\0' && i == len - 1) {
>> +            strncpy(tval, prop, tlen - 1);
>> +            return;
>> +        }
>> +        if (*c < 0x20 || *c >= 0x80) {
>> +            break;
>> +        }
>> +    }
>> +
>> +    for (i = 0, c = prop, t = tval; i < len; ++i, ++c) {
>> +        if (t >= tval + tlen - sizeof(bin) - 1 - 2 - 1) {
>> +            strcpy(t, bin);
>> +            return;
>> +        }
>> +        if (i && i % 4 == 0 && i != len - 1) {
>> +            strcat(t, " ");
>> +            ++t;
>> +        }
>> +        t += sprintf(t, "%02X", *c & 0xFF);
>> +    }
>> +}
>> +
>> +static int get_path(const void *fdt, int offset, char *buf, int len)
>> +{
>> +    int ret;
>> +
>> +    ret = fdt_get_path(fdt, offset, buf, len - 1);
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    buf[len - 1] = '\0';
>> +
>> +    return strlen(buf) + 1;
>> +}
>> +
>> +static int phandle_to_path(const void *fdt, uint32_t ph, char *buf, int 
>> len)
>> +{
>> +    int ret;
>> +
>> +    ret = fdt_node_offset_by_phandle(fdt, ph);
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    return get_path(fdt, ret, buf, len);
>> +}
>> +
>> +static uint32_t vof_finddevice(const void *fdt, uint32_t nodeaddr)
>> +{
>> +    char fullnode[VOF_MAX_PATH];
>> +    uint32_t ret = -1;
>> +    int offset;
>> +
>> +    if (readstr(nodeaddr, fullnode, sizeof(fullnode))) {
>> +        return (uint32_t) ret;
>> +    }
>> +
>> +    offset = fdt_path_offset(fdt, fullnode);
>> +    if (offset >= 0) {
>> +        ret = fdt_get_phandle(fdt, offset);
>> +    }
>> +    trace_vof_finddevice(fullnode, ret);
>> +    return (uint32_t) ret;
>> +}
>
> The Linux init function that runs on pegasos2 here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/powerpc/kernel/prom_init.c?h=v4.14.234#n2658
>
> calls finddevice once with isa@c and next with isa@C (small and capital C) 
> both of which works with the board firmware but with vof the comparison is 
> case sensitive and one of these fails so I can't make it work. I don't know 
> if this is a problem in libfdt or the vof_finddevice above should do 
> something else to get case insensitive comparison.

This fixes the issue with Linux but I'm not sure if there's any better 
solution or would it break anything else.

Regards,
BALATON Zoltan

>diff --git a/hw/ppc/vof.c b/hw/ppc/vof.c
index a283b7d251..b47bbd509d 100644
--- a/hw/ppc/vof.c
+++ b/hw/ppc/vof.c
@@ -144,12 +144,15 @@ static uint32_t vof_finddevice(const void *fdt, 
uint32_t nodeaddr)
      char fullnode[VOF_MAX_PATH];
      uint32_t ret = -1;
      int offset;
+    gchar *p;

      if (readstr(nodeaddr, fullnode, sizeof(fullnode))) {
          return (uint32_t) ret;
      }

-    offset = fdt_path_offset(fdt, fullnode);
+    p = g_ascii_strdown(fullnode, -1);
+    offset = fdt_path_offset(fdt, p);
+    g_free(p);
      if (offset >= 0) {
          ret = fdt_get_phandle(fdt, offset);
      }
Alexey Kardashevskiy June 1, 2021, 12:02 p.m. UTC | #42
On 31/05/2021 23:07, BALATON Zoltan wrote:
> On Sun, 30 May 2021, BALATON Zoltan wrote:
>> On Thu, 20 May 2021, Alexey Kardashevskiy wrote:
>>> diff --git a/hw/ppc/vof.c b/hw/ppc/vof.c
>>> new file mode 100644
>>> index 000000000000..a283b7d251a7
>>> --- /dev/null
>>> +++ b/hw/ppc/vof.c
>>> @@ -0,0 +1,1021 @@
>>> +/*
>>> + * QEMU PowerPC Virtual Open Firmware.
>>> + *
>>> + * This implements client interface from OpenFirmware IEEE1275 on 
>>> the QEMU
>>> + * side to leave only a very basic firmware in the VM.
>>> + *
>>> + * Copyright (c) 2021 IBM Corporation.
>>> + *
>>> + * SPDX-License-Identifier: GPL-2.0-or-later
>>> + */
>>> +
>>> +#include "qemu/osdep.h"
>>> +#include "qemu-common.h"
>>> +#include "qemu/timer.h"
>>> +#include "qemu/range.h"
>>> +#include "qemu/units.h"
>>> +#include "qapi/error.h"
>>> +#include <sys/ioctl.h>
>>> +#include "exec/ram_addr.h"
>>> +#include "exec/address-spaces.h"
>>> +#include "hw/ppc/vof.h"
>>> +#include "hw/ppc/fdt.h"
>>> +#include "sysemu/runstate.h"
>>> +#include "qom/qom-qobject.h"
>>> +#include "trace.h"
>>> +
>>> +#include <libfdt.h>
>>> +
>>> +/*
>>> + * OF 1275 "nextprop" description suggests is it 32 bytes max but
>>> + * LoPAPR defines "ibm,query-interrupt-source-number" which is 33 
>>> chars long.
>>> + */
>>> +#define OF_PROPNAME_LEN_MAX 64
>>> +
>>> +#define VOF_MAX_PATH        256
>>> +#define VOF_MAX_SETPROPLEN  2048
>>> +#define VOF_MAX_METHODLEN   256
>>> +#define VOF_MAX_FORTHCODE   256
>>> +#define VOF_VTY_BUF_SIZE    256
>>> +
>>> +typedef struct {
>>> +    uint64_t start;
>>> +    uint64_t size;
>>> +} OfClaimed;
>>> +
>>> +typedef struct {
>>> +    char *path; /* the path used to open the instance */
>>> +    uint32_t phandle;
>>> +} OfInstance;
>>> +
>>> +#define VOF_MEM_READ(pa, buf, size) \
>>> +    address_space_read_full(&address_space_memory, \
>>> +    (pa), MEMTXATTRS_UNSPECIFIED, (buf), (size))
>>> +#define VOF_MEM_WRITE(pa, buf, size) \
>>> +    address_space_write(&address_space_memory, \
>>> +    (pa), MEMTXATTRS_UNSPECIFIED, (buf), (size))
>>> +
>>> +static int readstr(hwaddr pa, char *buf, int size)
>>> +{
>>> +    if (VOF_MEM_READ(pa, buf, size) != MEMTX_OK) {
>>> +        return -1;
>>> +    }
>>> +    if (strnlen(buf, size) == size) {
>>> +        buf[size - 1] = '\0';
>>> +        trace_vof_error_str_truncated(buf, size);
>>> +        return -1;
>>> +    }
>>> +    return 0;
>>> +}
>>> +
>>> +static bool cmpservice(const char *s, unsigned nargs, unsigned nret,
>>> +                       const char *s1, unsigned nargscheck, unsigned 
>>> nretcheck)
>>> +{
>>> +    if (strcmp(s, s1)) {
>>> +        return false;
>>> +    }
>>> +    if ((nargscheck && (nargs != nargscheck)) ||
>>> +        (nretcheck && (nret != nretcheck))) {
>>> +        trace_vof_error_param(s, nargscheck, nretcheck, nargs, nret);
>>> +        return false;
>>> +    }
>>> +
>>> +    return true;
>>> +}
>>> +
>>> +static void prop_format(char *tval, int tlen, const void *prop, int 
>>> len)
>>> +{
>>> +    int i;
>>> +    const unsigned char *c;
>>> +    char *t;
>>> +    const char bin[] = "...";
>>> +
>>> +    for (i = 0, c = prop; i < len; ++i, ++c) {
>>> +        if (*c == '\0' && i == len - 1) {
>>> +            strncpy(tval, prop, tlen - 1);
>>> +            return;
>>> +        }
>>> +        if (*c < 0x20 || *c >= 0x80) {
>>> +            break;
>>> +        }
>>> +    }
>>> +
>>> +    for (i = 0, c = prop, t = tval; i < len; ++i, ++c) {
>>> +        if (t >= tval + tlen - sizeof(bin) - 1 - 2 - 1) {
>>> +            strcpy(t, bin);
>>> +            return;
>>> +        }
>>> +        if (i && i % 4 == 0 && i != len - 1) {
>>> +            strcat(t, " ");
>>> +            ++t;
>>> +        }
>>> +        t += sprintf(t, "%02X", *c & 0xFF);
>>> +    }
>>> +}
>>> +
>>> +static int get_path(const void *fdt, int offset, char *buf, int len)
>>> +{
>>> +    int ret;
>>> +
>>> +    ret = fdt_get_path(fdt, offset, buf, len - 1);
>>> +    if (ret < 0) {
>>> +        return ret;
>>> +    }
>>> +
>>> +    buf[len - 1] = '\0';
>>> +
>>> +    return strlen(buf) + 1;
>>> +}
>>> +
>>> +static int phandle_to_path(const void *fdt, uint32_t ph, char *buf, 
>>> int len)
>>> +{
>>> +    int ret;
>>> +
>>> +    ret = fdt_node_offset_by_phandle(fdt, ph);
>>> +    if (ret < 0) {
>>> +        return ret;
>>> +    }
>>> +
>>> +    return get_path(fdt, ret, buf, len);
>>> +}
>>> +
>>> +static uint32_t vof_finddevice(const void *fdt, uint32_t nodeaddr)
>>> +{
>>> +    char fullnode[VOF_MAX_PATH];
>>> +    uint32_t ret = -1;
>>> +    int offset;
>>> +
>>> +    if (readstr(nodeaddr, fullnode, sizeof(fullnode))) {
>>> +        return (uint32_t) ret;
>>> +    }
>>> +
>>> +    offset = fdt_path_offset(fdt, fullnode);
>>> +    if (offset >= 0) {
>>> +        ret = fdt_get_phandle(fdt, offset);
>>> +    }
>>> +    trace_vof_finddevice(fullnode, ret);
>>> +    return (uint32_t) ret;
>>> +}
>>
>> The Linux init function that runs on pegasos2 here:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/powerpc/kernel/prom_init.c?h=v4.14.234#n2658 
>>
>>
>> calls finddevice once with isa@c and next with isa@C (small and 
>> capital C) both of which works with the board firmware but with vof 
>> the comparison is case sensitive and one of these fails so I can't 
>> make it work. I don't know if this is a problem in libfdt or the 
>> vof_finddevice above should do something else to get case insensitive 
>> comparison.
> 
> This fixes the issue with Linux but I'm not sure if there's any better 
> solution or would it break anything else.

The bit after "@" is an address and needs to be case insensitive and 
I'll fix this indeed. I'm not so sure about the part before "@", I 
cannot imagine what could break if I made search insensitive to case. Hm :-/

> 
> Regards,
> BALATON Zoltan
> 
>> diff --git a/hw/ppc/vof.c b/hw/ppc/vof.c
> index a283b7d251..b47bbd509d 100644
> --- a/hw/ppc/vof.c
> +++ b/hw/ppc/vof.c
> @@ -144,12 +144,15 @@ static uint32_t vof_finddevice(const void *fdt, 
> uint32_t nodeaddr)
>      char fullnode[VOF_MAX_PATH];
>      uint32_t ret = -1;
>      int offset;
> +    gchar *p;
> 
>      if (readstr(nodeaddr, fullnode, sizeof(fullnode))) {
>          return (uint32_t) ret;
>      }
> 
> -    offset = fdt_path_offset(fdt, fullnode);
> +    p = g_ascii_strdown(fullnode, -1);
> +    offset = fdt_path_offset(fdt, p);
> +    g_free(p);
>      if (offset >= 0) {
>          ret = fdt_get_phandle(fdt, offset);
>      }
BALATON Zoltan June 1, 2021, 2:12 p.m. UTC | #43
On Tue, 1 Jun 2021, Alexey Kardashevskiy wrote:
> On 31/05/2021 23:07, BALATON Zoltan wrote:
>> On Sun, 30 May 2021, BALATON Zoltan wrote:
>>> On Thu, 20 May 2021, Alexey Kardashevskiy wrote:
>>>> diff --git a/hw/ppc/vof.c b/hw/ppc/vof.c
>>>> new file mode 100644
>>>> index 000000000000..a283b7d251a7
>>>> --- /dev/null
>>>> +++ b/hw/ppc/vof.c
>>>> @@ -0,0 +1,1021 @@
>>>> +/*
>>>> + * QEMU PowerPC Virtual Open Firmware.
>>>> + *
>>>> + * This implements client interface from OpenFirmware IEEE1275 on the 
>>>> QEMU
>>>> + * side to leave only a very basic firmware in the VM.
>>>> + *
>>>> + * Copyright (c) 2021 IBM Corporation.
>>>> + *
>>>> + * SPDX-License-Identifier: GPL-2.0-or-later
>>>> + */
>>>> +
>>>> +#include "qemu/osdep.h"
>>>> +#include "qemu-common.h"
>>>> +#include "qemu/timer.h"
>>>> +#include "qemu/range.h"
>>>> +#include "qemu/units.h"
>>>> +#include "qapi/error.h"
>>>> +#include <sys/ioctl.h>
>>>> +#include "exec/ram_addr.h"
>>>> +#include "exec/address-spaces.h"
>>>> +#include "hw/ppc/vof.h"
>>>> +#include "hw/ppc/fdt.h"
>>>> +#include "sysemu/runstate.h"
>>>> +#include "qom/qom-qobject.h"
>>>> +#include "trace.h"
>>>> +
>>>> +#include <libfdt.h>
>>>> +
>>>> +/*
>>>> + * OF 1275 "nextprop" description suggests is it 32 bytes max but
>>>> + * LoPAPR defines "ibm,query-interrupt-source-number" which is 33 chars 
>>>> long.
>>>> + */
>>>> +#define OF_PROPNAME_LEN_MAX 64
>>>> +
>>>> +#define VOF_MAX_PATH        256
>>>> +#define VOF_MAX_SETPROPLEN  2048
>>>> +#define VOF_MAX_METHODLEN   256
>>>> +#define VOF_MAX_FORTHCODE   256
>>>> +#define VOF_VTY_BUF_SIZE    256
>>>> +
>>>> +typedef struct {
>>>> +    uint64_t start;
>>>> +    uint64_t size;
>>>> +} OfClaimed;
>>>> +
>>>> +typedef struct {
>>>> +    char *path; /* the path used to open the instance */
>>>> +    uint32_t phandle;
>>>> +} OfInstance;
>>>> +
>>>> +#define VOF_MEM_READ(pa, buf, size) \
>>>> +    address_space_read_full(&address_space_memory, \
>>>> +    (pa), MEMTXATTRS_UNSPECIFIED, (buf), (size))
>>>> +#define VOF_MEM_WRITE(pa, buf, size) \
>>>> +    address_space_write(&address_space_memory, \
>>>> +    (pa), MEMTXATTRS_UNSPECIFIED, (buf), (size))
>>>> +
>>>> +static int readstr(hwaddr pa, char *buf, int size)
>>>> +{
>>>> +    if (VOF_MEM_READ(pa, buf, size) != MEMTX_OK) {
>>>> +        return -1;
>>>> +    }
>>>> +    if (strnlen(buf, size) == size) {
>>>> +        buf[size - 1] = '\0';
>>>> +        trace_vof_error_str_truncated(buf, size);
>>>> +        return -1;
>>>> +    }
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static bool cmpservice(const char *s, unsigned nargs, unsigned nret,
>>>> +                       const char *s1, unsigned nargscheck, unsigned 
>>>> nretcheck)
>>>> +{
>>>> +    if (strcmp(s, s1)) {
>>>> +        return false;
>>>> +    }
>>>> +    if ((nargscheck && (nargs != nargscheck)) ||
>>>> +        (nretcheck && (nret != nretcheck))) {
>>>> +        trace_vof_error_param(s, nargscheck, nretcheck, nargs, nret);
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    return true;
>>>> +}
>>>> +
>>>> +static void prop_format(char *tval, int tlen, const void *prop, int len)
>>>> +{
>>>> +    int i;
>>>> +    const unsigned char *c;
>>>> +    char *t;
>>>> +    const char bin[] = "...";
>>>> +
>>>> +    for (i = 0, c = prop; i < len; ++i, ++c) {
>>>> +        if (*c == '\0' && i == len - 1) {
>>>> +            strncpy(tval, prop, tlen - 1);
>>>> +            return;
>>>> +        }
>>>> +        if (*c < 0x20 || *c >= 0x80) {
>>>> +            break;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    for (i = 0, c = prop, t = tval; i < len; ++i, ++c) {
>>>> +        if (t >= tval + tlen - sizeof(bin) - 1 - 2 - 1) {
>>>> +            strcpy(t, bin);
>>>> +            return;
>>>> +        }
>>>> +        if (i && i % 4 == 0 && i != len - 1) {
>>>> +            strcat(t, " ");
>>>> +            ++t;
>>>> +        }
>>>> +        t += sprintf(t, "%02X", *c & 0xFF);
>>>> +    }
>>>> +}
>>>> +
>>>> +static int get_path(const void *fdt, int offset, char *buf, int len)
>>>> +{
>>>> +    int ret;
>>>> +
>>>> +    ret = fdt_get_path(fdt, offset, buf, len - 1);
>>>> +    if (ret < 0) {
>>>> +        return ret;
>>>> +    }
>>>> +
>>>> +    buf[len - 1] = '\0';
>>>> +
>>>> +    return strlen(buf) + 1;
>>>> +}
>>>> +
>>>> +static int phandle_to_path(const void *fdt, uint32_t ph, char *buf, int 
>>>> len)
>>>> +{
>>>> +    int ret;
>>>> +
>>>> +    ret = fdt_node_offset_by_phandle(fdt, ph);
>>>> +    if (ret < 0) {
>>>> +        return ret;
>>>> +    }
>>>> +
>>>> +    return get_path(fdt, ret, buf, len);
>>>> +}
>>>> +
>>>> +static uint32_t vof_finddevice(const void *fdt, uint32_t nodeaddr)
>>>> +{
>>>> +    char fullnode[VOF_MAX_PATH];
>>>> +    uint32_t ret = -1;
>>>> +    int offset;
>>>> +
>>>> +    if (readstr(nodeaddr, fullnode, sizeof(fullnode))) {
>>>> +        return (uint32_t) ret;
>>>> +    }
>>>> +
>>>> +    offset = fdt_path_offset(fdt, fullnode);
>>>> +    if (offset >= 0) {
>>>> +        ret = fdt_get_phandle(fdt, offset);
>>>> +    }
>>>> +    trace_vof_finddevice(fullnode, ret);
>>>> +    return (uint32_t) ret;
>>>> +}
>>> 
>>> The Linux init function that runs on pegasos2 here:
>>> 
>>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/powerpc/kernel/prom_init.c?h=v4.14.234#n2658 
>>> 
>>> calls finddevice once with isa@c and next with isa@C (small and capital C) 
>>> both of which works with the board firmware but with vof the comparison is 
>>> case sensitive and one of these fails so I can't make it work. I don't 
>>> know if this is a problem in libfdt or the vof_finddevice above should do 
>>> something else to get case insensitive comparison.
>> 
>> This fixes the issue with Linux but I'm not sure if there's any better 
>> solution or would it break anything else.
>
> The bit after "@" is an address and needs to be case insensitive and I'll fix 
> this indeed. I'm not so sure about the part before "@", I cannot imagine what 
> could break if I made search insensitive to case. Hm :-/

Fixing the match in the address part is probably enough as the name sent 
by guests is probably always lower case but the address could be formatted 
differently and that's what caused the problem. The patch below was only a 
quick fix to be able to test it further but your fix should work too.

With this and the ld replaced in entry.S I can now boot Linux which is 
enough to submit the pegasos2 vof patch after an updated patch from you 
fixes these in vof.

MorphOS still misses something but I'm not sure what as it uses the data 
gathered from the device tree later without printing diagnostics and fails 
due to a NULL dereference much after that so it seems to assume some value 
should exist but I'm not sure what value it needs and where that should 
come from. Maybe I'll try some more to find out just to make it simpler to 
boot but since it boots with the board firmware it's enough if Linux works 
with vof for now.

Regards,
BALATON Zoltan

>>> diff --git a/hw/ppc/vof.c b/hw/ppc/vof.c
>> index a283b7d251..b47bbd509d 100644
>> --- a/hw/ppc/vof.c
>> +++ b/hw/ppc/vof.c
>> @@ -144,12 +144,15 @@ static uint32_t vof_finddevice(const void *fdt, 
>> uint32_t nodeaddr)
>>      char fullnode[VOF_MAX_PATH];
>>      uint32_t ret = -1;
>>      int offset;
>> +    gchar *p;
>>
>>      if (readstr(nodeaddr, fullnode, sizeof(fullnode))) {
>>          return (uint32_t) ret;
>>      }
>> 
>> -    offset = fdt_path_offset(fdt, fullnode);
>> +    p = g_ascii_strdown(fullnode, -1);
>> +    offset = fdt_path_offset(fdt, p);
>> +    g_free(p);
>>      if (offset >= 0) {
>>          ret = fdt_get_phandle(fdt, offset);
>>      }
>
>
David Gibson June 2, 2021, 7:57 a.m. UTC | #44
On Thu, May 27, 2021 at 02:42:39PM +0200, BALATON Zoltan wrote:
> On Thu, 27 May 2021, David Gibson wrote:
> > On Tue, May 25, 2021 at 12:08:45PM +0200, BALATON Zoltan wrote:
> > > On Tue, 25 May 2021, David Gibson wrote:
> > > > On Mon, May 24, 2021 at 12:55:07PM +0200, BALATON Zoltan wrote:
> > > > > On Mon, 24 May 2021, David Gibson wrote:
> > > > > > On Sun, May 23, 2021 at 07:09:26PM +0200, BALATON Zoltan wrote:
> > > > > > > On Sun, 23 May 2021, BALATON Zoltan wrote:
> > > > > > > > On Sun, 23 May 2021, Alexey Kardashevskiy wrote:
> > > > > > > > > One thing to note about PCI is that normally I think the client
> > > > > > > > > expects the firmware to do PCI probing and SLOF does it. But VOF
> > > > > > > > > does not and Linux scans PCI bus(es) itself. Might be a problem for
> > > > > > > > > you kernel.
> > > > > > > > 
> > > > > > > > I'm not sure what info does MorphOS get from the device tree and what it
> > > > > > > > probes itself but I think it may at least need device ids and info about
> > > > > > > > the PCI bus to be able to access the config regs, after that it should
> > > > > > > > set the devices up hopefully. I could add these from the board code to
> > > > > > > > device tree so VOF does not need to do anything about it. However I'm
> > > > > > > > not getting to that point yet because it crashes on something that it's
> > > > > > > > missing and couldn't yet find out what is that.
> > > > > > > > 
> > > > > > > > I'd like to get Linux working now as that would be enough to test this
> > > > > > > > and then if for MorphOS we still need a ROM it's not a problem if at
> > > > > > > > least we can boot Linux without the original firmware. But I can't make
> > > > > > > > Linux open a serial console and I don't know what it needs for that. Do
> > > > > > > > you happen to know? I've looked at the sources in Linux/arch/powerpc but
> > > > > > > > not sure how it would find and open a serial port on pegasos2. It seems
> > > > > > > > to work with the board firmware and now I can get it to boot with VOF
> > > > > > > > but then it does not open serial so it probably needs something in the
> > > > > > > > device tree or expects the firmware to set something up that we should
> > > > > > > > add in pegasos2.c when using VOF.
> > > > > > > 
> > > > > > > I've now found that Linux uses rtas methods read-pci-config and
> > > > > > > write-pci-config for PCI access on pegasos2 so this means that we'll
> > > > > > > probably need rtas too (I hoped we could get away without it if it were only
> > > > > > > used for shutdown/reboot or so but seems Linux needs it for PCI as well and
> > > > > > > does not scan the bus and won't find some devices without it).
> > > > > > 
> > > > > > Yes, definitely sounds like you'll need an RTAS implementation.
> > > > > 
> > > > > I plan to fix that after managed to get serial working as that seems to not
> > > > > need it. If I delete the rtas-size property from /rtas on the original
> > > > > firmware that makes Linux skip instantiating rtas, but I still get serial
> > > > > output just not accessing PCI devices. So I think it should work and keeps
> > > > > things simpler at first. Then I'll try rtas later.
> > > > > 
> > > > > > > While VOF can do rtas, this causes a problem with the hypercall method using
> > > > > > > sc 1 that goes through vhyp but trips the assert in ppc_store_sdr1() so
> > > > > > > cannot work after guest is past quiesce.
> > > > > > 
> > > > > > > So the question is why is that
> > > > > > > assert there
> > > > > > 
> > > > > > Ah.. right.  So, vhyp was designed for the PAPR use case, where we
> > > > > > want to model the CPU when it's in supervisor and user mode, but not
> > > > > > when it's in hypervisor mode.  We want qemu to mimic the behaviour of
> > > > > > the hypervisor, rather than attempting to actually execute hypervisor
> > > > > > code in the virtual CPU.
> > > > > > 
> > > > > > On systems that have a hypervisor mode, SDR1 is hypervisor privileged,
> > > > > > so it makes no sense for the guest to attempt to set it.  That should
> > > > > > be caught by the general SPR code and turned into a 0x700, hence the
> > > > > > assert() if we somehow reach ppc_store_sdr1().
> > > > > > 
> > > > > > So, we are seeing a problem here because you want the 'sc 1'
> > > > > > interception of vhyp, but not the rest of the stuff that goes with it.
> > > > > > 
> > > > > > > and would using sc 1 for hypercalls on pegasos2 cause other
> > > > > > > problems later even if the assert could be removed?
> > > > > > 
> > > > > > At least in the short term, I think you probably can remove the
> > > > > > assert.  In your case the 'sc 1' calls aren't truly to a hypervisor,
> > > > > > but a special case escape to qemu for the firmware emulation.  I think
> > > > > > it's unlikely to cause problems later, because nothing on a 32-bit
> > > > > > system should be attempting an 'sc 1'.  The only thing I can think of
> > > > > > that would fail is some test case which explicitly verified that 'sc
> > > > > > 1' triggered a 0x700 (SIGILL from userspace).
> > > > > 
> > > > > OK so the assert should check if the CPU has an HV bit. I think there was a
> > > > > #detine for that somewhere that I can add to the assert then I can try that.
> > > > > What I wasn't sure about is that sc 1 would conflict with the guest's usage
> > > > > of normal sc calls or are these going through different paths and only sc 1
> > > > > will trigger vhyp callback not affecting notmal sc calls?
> > > > 
> > > > The vhyp shouldn't affect normal system calls, 'sc 1' is specifically
> > > > for hypercalls, as opposed to normal 'sc' (a.k.a. 'sc 0'), and the
> > > > vhyp only intercepts the hypercall version (after all Linux on PAPR
> > > > certainly uses its own system calls, and hypercalls are active for the
> > > > lifetime of the guest there).
> > > > 
> > > > > (Or if this causes
> > > > > an otherwise unnecessary VM exit on KVM even when it works then maybe
> > > > > looking for a different way in the future might be needed.
> > > > 
> > > > What you're doing here won't work with KVM as it stands.  There are
> > > > basically two paths into the vhyp hypercall path: 1) from TCG, if we
> > > > interpret an 'sc 1' instruction we enter vhyp, 2) from KVM, if we get
> > > > a KVM_EXIT_PAPR_HCALL KVM exit then we also go to the vhyp path.
> > > > 
> > > > The second path is specific to the PAPR (ppc64) implementation of KVM,
> > > > and will not work for a non-PAPR platform without substantial
> > > > modification of the KVM code.
> > > 
> > > OK so then at that point when we try KVM we'll need to look at alternative
> > > ways, I think MOL OSI worked with KVM at least in MOL but will probably make
> > > all syscalls exit KVM but since we'll probably need to use KVM PR it will
> > > exit anyway. For now I keep this vhyp as it does not run with KVM for other
> > > reasons yet so that's another area to clean up so as a proof of concept
> > > first version of using VOF vhyp will do.
> > 
> > Eh, since you'll need to modify KVM anyway, it probably makes just as
> > much sense to modify it to catch the 'sc 1' as MoL's magic thingy.
> 
> I'm not sure how KVM works for this case so I also don't know why and what
> would need to be modified. I think we'll only have KVM PR working as newer
> POWER CPUs having HV (besides being rare among potential users) are probably
> too different to run the OSes that expect at most a G4 on pegasos2 so likely
> it won't work with KVM HV.

Oh, it definitely won't work with KVM HV.

> If we have KVM PR doesn't sc already trap so we
> could add MOL OSI without further modification to KVM itself only needing
> change in QEMU?

Uh... I guess so?

> I also hope that MOL OSI could be useful for porting some
> paravirt drivers from MOL for running Mac OS X on Mac emulation but I don't
> know about that for sure so I'm open to any other solution too.

Maybe.  I never know much about MOL to begin with, and anything I did
know was a decade or more ago so I've probably forgotten.

> For now I'm
> going with vhyp which is enough fot testing with TCG and if somebody wants
> KVM they could use he original firmware for now so this could be improved in
> a later version unless a simple solution is found before the freeze for 6.1.
> If we're in KVM PR what happens for sc 1 could that be used too so maybe
> what we have now could work?

Note that if you do go down the MOL path it wouldn't be that complex
to make a "vMOL" interface so you can use the same mechanism for KVM
and TCG.

> > > [...]
> > > > > > > I've tested that the missing rtas is not the reason for getting no output
> > > > > > > via serial though, as even when disabling rtas on pegasos2.rom it boots and
> > > > > > > I still get serial output just some PCI devices are not detected (such as
> > > > > > > USB, the video card and the not emulated ethernet port but these are not
> > > > > > > fatal so it might even work as a first try without rtas, just to boot a
> > > > > > > Linux kernel for testing it would be enough if I can fix the serial output).
> > > > > > > I still don't know why it's not finding serial but I think it may be some
> > > > > > > missing or wrong info in the device tree I generat. I'll try to focus on
> > > > > > > this for now and leave the above rtas question for later.
> > > > > > 
> > > > > > Oh.. another thought on that.  You have an ISA serial port on Pegasos,
> > > > > > I believe.  I wonder if the PCI->ISA bridge needs some configuration /
> > > > > > initialization that the firmware is expected to do.  If so you'll need
> > > > > > to mimic that setup in qemu for the VOF case.
> > > > > 
> > > > > That's what I begin to think because I've added everything to the device
> > > > > tree that I thought could be needed and I still don't get it working so it
> > > > > may need some config from the firmware. But how do I access device registers
> > > > > from board code? I've tried adding a machine reset method and write to
> > > > > memory mapped device registers but all my attempts failed. I've tried
> > > > > cpu_stl_le_data and even memory_region_dispatch_write but these did not get
> > > > > to the device. What's the way to access guest mmio regs from QEMU?
> > > > 
> > > > That's odd, cpu_stl() and memory_region_dispatch_write() should work
> > > > from board code (after the relevant memory regions are configured, of
> > > > course).  As an ISA serial port, it's probably accessed through IO
> > > > space, not memory space though, so you'd need &address_space_io.  And
> > > > if there is some bridge configuration then it's the bridge control
> > > > registers you need to look at not the serial registers - you'd have to
> > > > look at the bridge documentation for that.  Or, I guess the bridge
> > > > implementation in qemu, which you wrote part of.
> > > 
> > > I've found at last that stl_le_phys() works. There are so many of these that
> > > I never know when to use which.
> > > 
> > > I think the address_space_rw calls in vof_client_call() in vof.c could also
> > > use these for somewhat shorter code. I've ended up with
> > > stl_le_phys(CPU(cpu)->as, addr, val) in my machine reset methodbut I don't
> > > even need that now as it works without additional setup. Also VOF's memory
> > > access is basically the same as the already existing rtas_st() and co. so
> > > maybe that could be reused to make code smaller?
> > 
> > rtas_ld() and rtas_st() should only be used for reading/writing RTAS
> > parameters to and from memory.  Accessing IO shouldn't be done with
> > those.
> > 
> > For IO you probably want the cpu_st*() variants in most cases, since
> > you're trying to emulate an IO access from the virtual cpu.
> 
> I think I've tried that but what worked to access mmio device registers are
> stl_le_phys and similar that are wrappers around address_space_stl_*. But I
> did not mean that for rtas_ld/_st but the part when vof accessing the
> parameters passed by its hypercall which is memory access:
> 
> https://github.com/patchew-project/qemu/blob/patchew/20210520090557.435689-1-aik%40ozlabs.ru/hw/ppc/vof.c
> 
> line 893, and vof_client_call before that is very similar to what h_rtas
> does here:
> 
> https://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/spapr_hcall.c;h=f25014afda408002ee1ec1027a0dd7a6025eca61;hb=HEAD#l639
> 
> and I also need to do the same for rtas in pegasos2 for which I'm just using
> ldl_be_phys for now but I wonder if we really need 3 ways to do the same or
> the rtas_ld/_st could be made more generic and reused here?

For your rtas implementation you could definitely re-use them.  For
the client call I'm a bit less confident, but if the in-guest-memory
structures are really the same, then it would make sense.
BALATON Zoltan June 2, 2021, 12:29 p.m. UTC | #45
On Wed, 2 Jun 2021, David Gibson wrote:
> On Thu, May 27, 2021 at 02:42:39PM +0200, BALATON Zoltan wrote:
>> On Thu, 27 May 2021, David Gibson wrote:
>>> On Tue, May 25, 2021 at 12:08:45PM +0200, BALATON Zoltan wrote:
>>>> On Tue, 25 May 2021, David Gibson wrote:
>>>>> On Mon, May 24, 2021 at 12:55:07PM +0200, BALATON Zoltan wrote:
>>>>>> On Mon, 24 May 2021, David Gibson wrote:
>>>>>>> On Sun, May 23, 2021 at 07:09:26PM +0200, BALATON Zoltan wrote:
>>>>>>>> On Sun, 23 May 2021, BALATON Zoltan wrote:
>>>>>>>>> On Sun, 23 May 2021, Alexey Kardashevskiy wrote:
>>>>>>>>>> One thing to note about PCI is that normally I think the client
>>>>>>>>>> expects the firmware to do PCI probing and SLOF does it. But VOF
>>>>>>>>>> does not and Linux scans PCI bus(es) itself. Might be a problem for
>>>>>>>>>> you kernel.
>>>>>>>>>
>>>>>>>>> I'm not sure what info does MorphOS get from the device tree and what it
>>>>>>>>> probes itself but I think it may at least need device ids and info about
>>>>>>>>> the PCI bus to be able to access the config regs, after that it should
>>>>>>>>> set the devices up hopefully. I could add these from the board code to
>>>>>>>>> device tree so VOF does not need to do anything about it. However I'm
>>>>>>>>> not getting to that point yet because it crashes on something that it's
>>>>>>>>> missing and couldn't yet find out what is that.
>>>>>>>>>
>>>>>>>>> I'd like to get Linux working now as that would be enough to test this
>>>>>>>>> and then if for MorphOS we still need a ROM it's not a problem if at
>>>>>>>>> least we can boot Linux without the original firmware. But I can't make
>>>>>>>>> Linux open a serial console and I don't know what it needs for that. Do
>>>>>>>>> you happen to know? I've looked at the sources in Linux/arch/powerpc but
>>>>>>>>> not sure how it would find and open a serial port on pegasos2. It seems
>>>>>>>>> to work with the board firmware and now I can get it to boot with VOF
>>>>>>>>> but then it does not open serial so it probably needs something in the
>>>>>>>>> device tree or expects the firmware to set something up that we should
>>>>>>>>> add in pegasos2.c when using VOF.
>>>>>>>>
>>>>>>>> I've now found that Linux uses rtas methods read-pci-config and
>>>>>>>> write-pci-config for PCI access on pegasos2 so this means that we'll
>>>>>>>> probably need rtas too (I hoped we could get away without it if it were only
>>>>>>>> used for shutdown/reboot or so but seems Linux needs it for PCI as well and
>>>>>>>> does not scan the bus and won't find some devices without it).
>>>>>>>
>>>>>>> Yes, definitely sounds like you'll need an RTAS implementation.
>>>>>>
>>>>>> I plan to fix that after managed to get serial working as that seems to not
>>>>>> need it. If I delete the rtas-size property from /rtas on the original
>>>>>> firmware that makes Linux skip instantiating rtas, but I still get serial
>>>>>> output just not accessing PCI devices. So I think it should work and keeps
>>>>>> things simpler at first. Then I'll try rtas later.
>>>>>>
>>>>>>>> While VOF can do rtas, this causes a problem with the hypercall method using
>>>>>>>> sc 1 that goes through vhyp but trips the assert in ppc_store_sdr1() so
>>>>>>>> cannot work after guest is past quiesce.
>>>>>>>
>>>>>>>> So the question is why is that
>>>>>>>> assert there
>>>>>>>
>>>>>>> Ah.. right.  So, vhyp was designed for the PAPR use case, where we
>>>>>>> want to model the CPU when it's in supervisor and user mode, but not
>>>>>>> when it's in hypervisor mode.  We want qemu to mimic the behaviour of
>>>>>>> the hypervisor, rather than attempting to actually execute hypervisor
>>>>>>> code in the virtual CPU.
>>>>>>>
>>>>>>> On systems that have a hypervisor mode, SDR1 is hypervisor privileged,
>>>>>>> so it makes no sense for the guest to attempt to set it.  That should
>>>>>>> be caught by the general SPR code and turned into a 0x700, hence the
>>>>>>> assert() if we somehow reach ppc_store_sdr1().
>>>>>>>
>>>>>>> So, we are seeing a problem here because you want the 'sc 1'
>>>>>>> interception of vhyp, but not the rest of the stuff that goes with it.
>>>>>>>
>>>>>>>> and would using sc 1 for hypercalls on pegasos2 cause other
>>>>>>>> problems later even if the assert could be removed?
>>>>>>>
>>>>>>> At least in the short term, I think you probably can remove the
>>>>>>> assert.  In your case the 'sc 1' calls aren't truly to a hypervisor,
>>>>>>> but a special case escape to qemu for the firmware emulation.  I think
>>>>>>> it's unlikely to cause problems later, because nothing on a 32-bit
>>>>>>> system should be attempting an 'sc 1'.  The only thing I can think of
>>>>>>> that would fail is some test case which explicitly verified that 'sc
>>>>>>> 1' triggered a 0x700 (SIGILL from userspace).
>>>>>>
>>>>>> OK so the assert should check if the CPU has an HV bit. I think there was a
>>>>>> #detine for that somewhere that I can add to the assert then I can try that.
>>>>>> What I wasn't sure about is that sc 1 would conflict with the guest's usage
>>>>>> of normal sc calls or are these going through different paths and only sc 1
>>>>>> will trigger vhyp callback not affecting notmal sc calls?
>>>>>
>>>>> The vhyp shouldn't affect normal system calls, 'sc 1' is specifically
>>>>> for hypercalls, as opposed to normal 'sc' (a.k.a. 'sc 0'), and the
>>>>> vhyp only intercepts the hypercall version (after all Linux on PAPR
>>>>> certainly uses its own system calls, and hypercalls are active for the
>>>>> lifetime of the guest there).
>>>>>
>>>>>> (Or if this causes
>>>>>> an otherwise unnecessary VM exit on KVM even when it works then maybe
>>>>>> looking for a different way in the future might be needed.
>>>>>
>>>>> What you're doing here won't work with KVM as it stands.  There are
>>>>> basically two paths into the vhyp hypercall path: 1) from TCG, if we
>>>>> interpret an 'sc 1' instruction we enter vhyp, 2) from KVM, if we get
>>>>> a KVM_EXIT_PAPR_HCALL KVM exit then we also go to the vhyp path.
>>>>>
>>>>> The second path is specific to the PAPR (ppc64) implementation of KVM,
>>>>> and will not work for a non-PAPR platform without substantial
>>>>> modification of the KVM code.
>>>>
>>>> OK so then at that point when we try KVM we'll need to look at alternative
>>>> ways, I think MOL OSI worked with KVM at least in MOL but will probably make
>>>> all syscalls exit KVM but since we'll probably need to use KVM PR it will
>>>> exit anyway. For now I keep this vhyp as it does not run with KVM for other
>>>> reasons yet so that's another area to clean up so as a proof of concept
>>>> first version of using VOF vhyp will do.
>>>
>>> Eh, since you'll need to modify KVM anyway, it probably makes just as
>>> much sense to modify it to catch the 'sc 1' as MoL's magic thingy.
>>
>> I'm not sure how KVM works for this case so I also don't know why and what
>> would need to be modified. I think we'll only have KVM PR working as newer
>> POWER CPUs having HV (besides being rare among potential users) are probably
>> too different to run the OSes that expect at most a G4 on pegasos2 so likely
>> it won't work with KVM HV.
>
> Oh, it definitely won't work with KVM HV.
>
>> If we have KVM PR doesn't sc already trap so we
>> could add MOL OSI without further modification to KVM itself only needing
>> change in QEMU?
>
> Uh... I guess so?
>
>> I also hope that MOL OSI could be useful for porting some
>> paravirt drivers from MOL for running Mac OS X on Mac emulation but I don't
>> know about that for sure so I'm open to any other solution too.
>
> Maybe.  I never know much about MOL to begin with, and anything I did
> know was a decade or more ago so I've probably forgotten.

That may still be more than what I know about it since I never had any 
knowledge about PPC KVM and don't have any PPC hardware to test with so 
I'm mostly guessing. (I could test with KVM emulated in QEMU and I did set 
up an environment for that but that's a bit slow and inconvenient so I'd 
leave KVM support to those interested and have more knowledge and hardware 
for it.)

>> For now I'm
>> going with vhyp which is enough fot testing with TCG and if somebody wants
>> KVM they could use he original firmware for now so this could be improved in
>> a later version unless a simple solution is found before the freeze for 6.1.
>> If we're in KVM PR what happens for sc 1 could that be used too so maybe
>> what we have now could work?
>
> Note that if you do go down the MOL path it wouldn't be that complex
> to make a "vMOL" interface so you can use the same mechanism for KVM
> and TCG.

Not sure what you mean by VMOL. Is it modifying MOL to use sc 1 like VOF 
instead of its OSI way for hypercalls? That would lose the advantage of 
being able to reuse MOL guest drivers without modification (which might be 
useful for running OS X guest on Mac emulation) so if we can't use vhyp 
then maybe using OSI would be the next choice for this reason but for now 
vhyp seems to be working for what I could test so unless somebody here 
sees a problem with it and has a better idea I'm going with vhyp for now 
just because that's what VOF uses and I don't want to modify VOF to reuse 
it as it is so I don't need to maintain a separate version and also get 
any enhancements without further need to sync with spapr VOF.

I've found this document about possible hypercall interfaces on KVM (see 
Hypercall ABIs at the end):

https://www.kernel.org/doc/html/latest/virt/kvm/ppc-pv.html

Having both ePAPR (1.) and PAPR (2.) hypercalls is a bit confusing. Does 
vhyp correspond to 2. PAPR? The ePAPR (1.) seems to be preferred by KVM 
and MOL OSI supported for compatibility. So if we need something else 
instead of 2. PAPR hypercalls there seems to be two options: ePAPR and 
MOL OSI which should work with KVM but then I'm not sure how to handle 
those on TCG.

>>>> [...]
>>>>>>>> I've tested that the missing rtas is not the reason for getting no output
>>>>>>>> via serial though, as even when disabling rtas on pegasos2.rom it boots and
>>>>>>>> I still get serial output just some PCI devices are not detected (such as
>>>>>>>> USB, the video card and the not emulated ethernet port but these are not
>>>>>>>> fatal so it might even work as a first try without rtas, just to boot a
>>>>>>>> Linux kernel for testing it would be enough if I can fix the serial output).
>>>>>>>> I still don't know why it's not finding serial but I think it may be some
>>>>>>>> missing or wrong info in the device tree I generat. I'll try to focus on
>>>>>>>> this for now and leave the above rtas question for later.
>>>>>>>
>>>>>>> Oh.. another thought on that.  You have an ISA serial port on Pegasos,
>>>>>>> I believe.  I wonder if the PCI->ISA bridge needs some configuration /
>>>>>>> initialization that the firmware is expected to do.  If so you'll need
>>>>>>> to mimic that setup in qemu for the VOF case.
>>>>>>
>>>>>> That's what I begin to think because I've added everything to the device
>>>>>> tree that I thought could be needed and I still don't get it working so it
>>>>>> may need some config from the firmware. But how do I access device registers
>>>>>> from board code? I've tried adding a machine reset method and write to
>>>>>> memory mapped device registers but all my attempts failed. I've tried
>>>>>> cpu_stl_le_data and even memory_region_dispatch_write but these did not get
>>>>>> to the device. What's the way to access guest mmio regs from QEMU?
>>>>>
>>>>> That's odd, cpu_stl() and memory_region_dispatch_write() should work
>>>>> from board code (after the relevant memory regions are configured, of
>>>>> course).  As an ISA serial port, it's probably accessed through IO
>>>>> space, not memory space though, so you'd need &address_space_io.  And
>>>>> if there is some bridge configuration then it's the bridge control
>>>>> registers you need to look at not the serial registers - you'd have to
>>>>> look at the bridge documentation for that.  Or, I guess the bridge
>>>>> implementation in qemu, which you wrote part of.
>>>>
>>>> I've found at last that stl_le_phys() works. There are so many of these that
>>>> I never know when to use which.
>>>>
>>>> I think the address_space_rw calls in vof_client_call() in vof.c could also
>>>> use these for somewhat shorter code. I've ended up with
>>>> stl_le_phys(CPU(cpu)->as, addr, val) in my machine reset methodbut I don't
>>>> even need that now as it works without additional setup. Also VOF's memory
>>>> access is basically the same as the already existing rtas_st() and co. so
>>>> maybe that could be reused to make code smaller?
>>>
>>> rtas_ld() and rtas_st() should only be used for reading/writing RTAS
>>> parameters to and from memory.  Accessing IO shouldn't be done with
>>> those.
>>>
>>> For IO you probably want the cpu_st*() variants in most cases, since
>>> you're trying to emulate an IO access from the virtual cpu.
>>
>> I think I've tried that but what worked to access mmio device registers are
>> stl_le_phys and similar that are wrappers around address_space_stl_*. But I
>> did not mean that for rtas_ld/_st but the part when vof accessing the
>> parameters passed by its hypercall which is memory access:
>>
>> https://github.com/patchew-project/qemu/blob/patchew/20210520090557.435689-1-aik%40ozlabs.ru/hw/ppc/vof.c
>>
>> line 893, and vof_client_call before that is very similar to what h_rtas
>> does here:
>>
>> https://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/spapr_hcall.c;h=f25014afda408002ee1ec1027a0dd7a6025eca61;hb=HEAD#l639
>>
>> and I also need to do the same for rtas in pegasos2 for which I'm just using
>> ldl_be_phys for now but I wonder if we really need 3 ways to do the same or
>> the rtas_ld/_st could be made more generic and reused here?
>
> For your rtas implementation you could definitely re-use them.  For
> the client call I'm a bit less confident, but if the in-guest-memory
> structures are really the same, then it would make sense.

The memory structure seems very similar to me, the only difference is 
calling the first field service in VOF instead of token in RTAS. Both are 
just an array of big endian unit32_t with token, nargs, nret at the front 
followed by args and rets. Since these rtas_ld/st are defined in spapr.h I 
did not bother to split them off, so for pegasos2 rtas I'm just using the 
ldl_be_* functions directly for which these are a shorthand for. If these 
were split off for sharing between spapr rtas and VOF I may be able to 
reuse them as well but it's not that important so just mentioned it as a 
possible later clean up.

Regards,
BALATON Zoltan
David Gibson June 4, 2021, 6:19 a.m. UTC | #46
On Sun, May 30, 2021 at 07:33:01PM +0200, BALATON Zoltan wrote:
> Hello,
> 
> Two more problems I've found while testing with pegasos2 but I'm not sure
> how to fix them:
> 
> On Thu, 20 May 2021, Alexey Kardashevskiy wrote:
> > diff --git a/hw/ppc/vof.c b/hw/ppc/vof.c
> > new file mode 100644
> > index 000000000000..a283b7d251a7
> > --- /dev/null
> > +++ b/hw/ppc/vof.c
> > @@ -0,0 +1,1021 @@
> > +/*
> > + * QEMU PowerPC Virtual Open Firmware.
> > + *
> > + * This implements client interface from OpenFirmware IEEE1275 on the QEMU
> > + * side to leave only a very basic firmware in the VM.
> > + *
> > + * Copyright (c) 2021 IBM Corporation.
> > + *
> > + * SPDX-License-Identifier: GPL-2.0-or-later
> > + */
> > +
> > +#include "qemu/osdep.h"
> > +#include "qemu-common.h"
> > +#include "qemu/timer.h"
> > +#include "qemu/range.h"
> > +#include "qemu/units.h"
> > +#include "qapi/error.h"
> > +#include <sys/ioctl.h>
> > +#include "exec/ram_addr.h"
> > +#include "exec/address-spaces.h"
> > +#include "hw/ppc/vof.h"
> > +#include "hw/ppc/fdt.h"
> > +#include "sysemu/runstate.h"
> > +#include "qom/qom-qobject.h"
> > +#include "trace.h"
> > +
> > +#include <libfdt.h>
> > +
> > +/*
> > + * OF 1275 "nextprop" description suggests is it 32 bytes max but
> > + * LoPAPR defines "ibm,query-interrupt-source-number" which is 33 chars long.
> > + */
> > +#define OF_PROPNAME_LEN_MAX 64
> > +
> > +#define VOF_MAX_PATH        256
> > +#define VOF_MAX_SETPROPLEN  2048
> > +#define VOF_MAX_METHODLEN   256
> > +#define VOF_MAX_FORTHCODE   256
> > +#define VOF_VTY_BUF_SIZE    256
> > +
> > +typedef struct {
> > +    uint64_t start;
> > +    uint64_t size;
> > +} OfClaimed;
> > +
> > +typedef struct {
> > +    char *path; /* the path used to open the instance */
> > +    uint32_t phandle;
> > +} OfInstance;
> > +
> > +#define VOF_MEM_READ(pa, buf, size) \
> > +    address_space_read_full(&address_space_memory, \
> > +    (pa), MEMTXATTRS_UNSPECIFIED, (buf), (size))
> > +#define VOF_MEM_WRITE(pa, buf, size) \
> > +    address_space_write(&address_space_memory, \
> > +    (pa), MEMTXATTRS_UNSPECIFIED, (buf), (size))
> > +
> > +static int readstr(hwaddr pa, char *buf, int size)
> > +{
> > +    if (VOF_MEM_READ(pa, buf, size) != MEMTX_OK) {
> > +        return -1;
> > +    }
> > +    if (strnlen(buf, size) == size) {
> > +        buf[size - 1] = '\0';
> > +        trace_vof_error_str_truncated(buf, size);
> > +        return -1;
> > +    }
> > +    return 0;
> > +}
> > +
> > +static bool cmpservice(const char *s, unsigned nargs, unsigned nret,
> > +                       const char *s1, unsigned nargscheck, unsigned nretcheck)
> > +{
> > +    if (strcmp(s, s1)) {
> > +        return false;
> > +    }
> > +    if ((nargscheck && (nargs != nargscheck)) ||
> > +        (nretcheck && (nret != nretcheck))) {
> > +        trace_vof_error_param(s, nargscheck, nretcheck, nargs, nret);
> > +        return false;
> > +    }
> > +
> > +    return true;
> > +}
> > +
> > +static void prop_format(char *tval, int tlen, const void *prop, int len)
> > +{
> > +    int i;
> > +    const unsigned char *c;
> > +    char *t;
> > +    const char bin[] = "...";
> > +
> > +    for (i = 0, c = prop; i < len; ++i, ++c) {
> > +        if (*c == '\0' && i == len - 1) {
> > +            strncpy(tval, prop, tlen - 1);
> > +            return;
> > +        }
> > +        if (*c < 0x20 || *c >= 0x80) {
> > +            break;
> > +        }
> > +    }
> > +
> > +    for (i = 0, c = prop, t = tval; i < len; ++i, ++c) {
> > +        if (t >= tval + tlen - sizeof(bin) - 1 - 2 - 1) {
> > +            strcpy(t, bin);
> > +            return;
> > +        }
> > +        if (i && i % 4 == 0 && i != len - 1) {
> > +            strcat(t, " ");
> > +            ++t;
> > +        }
> > +        t += sprintf(t, "%02X", *c & 0xFF);
> > +    }
> > +}
> > +
> > +static int get_path(const void *fdt, int offset, char *buf, int len)
> > +{
> > +    int ret;
> > +
> > +    ret = fdt_get_path(fdt, offset, buf, len - 1);
> > +    if (ret < 0) {
> > +        return ret;
> > +    }
> > +
> > +    buf[len - 1] = '\0';
> > +
> > +    return strlen(buf) + 1;
> > +}
> > +
> > +static int phandle_to_path(const void *fdt, uint32_t ph, char *buf, int len)
> > +{
> > +    int ret;
> > +
> > +    ret = fdt_node_offset_by_phandle(fdt, ph);
> > +    if (ret < 0) {
> > +        return ret;
> > +    }
> > +
> > +    return get_path(fdt, ret, buf, len);
> > +}
> > +
> > +static uint32_t vof_finddevice(const void *fdt, uint32_t nodeaddr)
> > +{
> > +    char fullnode[VOF_MAX_PATH];
> > +    uint32_t ret = -1;
> > +    int offset;
> > +
> > +    if (readstr(nodeaddr, fullnode, sizeof(fullnode))) {
> > +        return (uint32_t) ret;
> > +    }
> > +
> > +    offset = fdt_path_offset(fdt, fullnode);
> > +    if (offset >= 0) {
> > +        ret = fdt_get_phandle(fdt, offset);
> > +    }
> > +    trace_vof_finddevice(fullnode, ret);
> > +    return (uint32_t) ret;
> > +}
> 
> The Linux init function that runs on pegasos2 here:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/powerpc/kernel/prom_init.c?h=v4.14.234#n2658
> 
> calls finddevice once with isa@c and next with isa@C (small and capital C)
> both of which works with the board firmware but with vof the comparison is
> case sensitive and one of these fails so I can't make it work. I don't know
> if this is a problem in libfdt or the vof_finddevice above should do
> something else to get case insensitive comparison.

This is kind of a subtle incompatibility between the traditional OF
world and the flat tree world.  In traditional OF, the unit address
(bit after the @) doesn't exist as a string.  Instead when you do the
finddevice it will parse that address and compare it against the 'reg'
properties for each of the relevant nodes.  Since that's an integer
comparison, case doesn't enter into it.

But, how to parse (and write) addresses depends on the bus, so the
firmware has to understand each bus type and act accordingly.  That
doesn't really work in the world of minimal firmwares dor the flat
tree.  So instead, we just incorporate a pre-formatted unit address in
the flat tree directly.  Most of the time that works fine, but there
are some edge cases like the one you've hit.

> > +static const void *getprop(const void *fdt, int nodeoff, const char *propname,
> > +                           int *proplen, bool *write0)
> > +{
> > +    const char *unit, *prop;
> > +
> > +    /*
> > +     * The "name" property is not actually stored as a property in the FDT,
> > +     * we emulate it by returning a pointer to the node's name and adjust
> > +     * proplen to include only the name but not the unit.
> > +     */
> > +    if (strcmp(propname, "name") == 0) {
> > +        prop = fdt_get_name(fdt, nodeoff, proplen);
> > +        if (!prop) {
> > +            *proplen = 0;
> > +            return NULL;
> > +        }
> > +
> > +        unit = memchr(prop, '@', *proplen);
> > +        if (unit) {
> > +            *proplen = unit - prop;
> > +        }
> > +        *proplen += 1;
> > +
> > +        /*
> > +         * Since it might be cut at "@" and there will be no trailing zero
> > +         * in the prop buffer, tell the caller to write zero at the end.
> > +         */
> > +        if (write0) {
> > +            *write0 = true;
> > +        }
> > +        return prop;
> > +    }
> > +
> > +    if (write0) {
> > +        *write0 = false;
> > +    }
> > +    return fdt_getprop(fdt, nodeoff, propname, proplen);
> > +}
> 
> MorphOS checks the name property of the root node ("/") to decide what
> platform it runs on so we may need to be able to set this property on /
> where it should return "bplan,Pegasos2", therefore the above maybe should do
> getprop first and only generate name property if it's not set (or at least
> check if we're on the root node and allow setting name property there). (On
> Macs the root node is named "device-tree" and this was before found to be
> needed for MorphOS.)

Ah.  Hrm.  Have to think about what to do about that.

> Other than the above two problems, I've found that getting the device tree
> from vof returns it in reverse order compared to the board firmware if I add
> it the expected order. This may or may not be a problem but to avoid it I
> can build the tree in reverse order then it comes out right so unless
> there's an easy fix this should not cause a problem but may worth a comment
> somewhere.

The order of things in the device tree *should* never matter.  If it
does, that's definitely a client bug... but of course that doesn't
necessarily mean we won't have to work around it in practice.
David Gibson June 4, 2021, 6:21 a.m. UTC | #47
On Tue, Jun 01, 2021 at 04:12:44PM +0200, BALATON Zoltan wrote:
> On Tue, 1 Jun 2021, Alexey Kardashevskiy wrote:
> > On 31/05/2021 23:07, BALATON Zoltan wrote:
> > > On Sun, 30 May 2021, BALATON Zoltan wrote:
> > > > On Thu, 20 May 2021, Alexey Kardashevskiy wrote:
> > > > > diff --git a/hw/ppc/vof.c b/hw/ppc/vof.c
> > > > > new file mode 100644
> > > > > index 000000000000..a283b7d251a7
> > > > > --- /dev/null
> > > > > +++ b/hw/ppc/vof.c
> > > > > @@ -0,0 +1,1021 @@
> > > > > +/*
> > > > > + * QEMU PowerPC Virtual Open Firmware.
> > > > > + *
> > > > > + * This implements client interface from OpenFirmware
> > > > > IEEE1275 on the QEMU
> > > > > + * side to leave only a very basic firmware in the VM.
> > > > > + *
> > > > > + * Copyright (c) 2021 IBM Corporation.
> > > > > + *
> > > > > + * SPDX-License-Identifier: GPL-2.0-or-later
> > > > > + */
> > > > > +
> > > > > +#include "qemu/osdep.h"
> > > > > +#include "qemu-common.h"
> > > > > +#include "qemu/timer.h"
> > > > > +#include "qemu/range.h"
> > > > > +#include "qemu/units.h"
> > > > > +#include "qapi/error.h"
> > > > > +#include <sys/ioctl.h>
> > > > > +#include "exec/ram_addr.h"
> > > > > +#include "exec/address-spaces.h"
> > > > > +#include "hw/ppc/vof.h"
> > > > > +#include "hw/ppc/fdt.h"
> > > > > +#include "sysemu/runstate.h"
> > > > > +#include "qom/qom-qobject.h"
> > > > > +#include "trace.h"
> > > > > +
> > > > > +#include <libfdt.h>
> > > > > +
> > > > > +/*
> > > > > + * OF 1275 "nextprop" description suggests is it 32 bytes max but
> > > > > + * LoPAPR defines "ibm,query-interrupt-source-number" which
> > > > > is 33 chars long.
> > > > > + */
> > > > > +#define OF_PROPNAME_LEN_MAX 64
> > > > > +
> > > > > +#define VOF_MAX_PATH        256
> > > > > +#define VOF_MAX_SETPROPLEN  2048
> > > > > +#define VOF_MAX_METHODLEN   256
> > > > > +#define VOF_MAX_FORTHCODE   256
> > > > > +#define VOF_VTY_BUF_SIZE    256
> > > > > +
> > > > > +typedef struct {
> > > > > +    uint64_t start;
> > > > > +    uint64_t size;
> > > > > +} OfClaimed;
> > > > > +
> > > > > +typedef struct {
> > > > > +    char *path; /* the path used to open the instance */
> > > > > +    uint32_t phandle;
> > > > > +} OfInstance;
> > > > > +
> > > > > +#define VOF_MEM_READ(pa, buf, size) \
> > > > > +    address_space_read_full(&address_space_memory, \
> > > > > +    (pa), MEMTXATTRS_UNSPECIFIED, (buf), (size))
> > > > > +#define VOF_MEM_WRITE(pa, buf, size) \
> > > > > +    address_space_write(&address_space_memory, \
> > > > > +    (pa), MEMTXATTRS_UNSPECIFIED, (buf), (size))
> > > > > +
> > > > > +static int readstr(hwaddr pa, char *buf, int size)
> > > > > +{
> > > > > +    if (VOF_MEM_READ(pa, buf, size) != MEMTX_OK) {
> > > > > +        return -1;
> > > > > +    }
> > > > > +    if (strnlen(buf, size) == size) {
> > > > > +        buf[size - 1] = '\0';
> > > > > +        trace_vof_error_str_truncated(buf, size);
> > > > > +        return -1;
> > > > > +    }
> > > > > +    return 0;
> > > > > +}
> > > > > +
> > > > > +static bool cmpservice(const char *s, unsigned nargs, unsigned nret,
> > > > > +                       const char *s1, unsigned nargscheck,
> > > > > unsigned nretcheck)
> > > > > +{
> > > > > +    if (strcmp(s, s1)) {
> > > > > +        return false;
> > > > > +    }
> > > > > +    if ((nargscheck && (nargs != nargscheck)) ||
> > > > > +        (nretcheck && (nret != nretcheck))) {
> > > > > +        trace_vof_error_param(s, nargscheck, nretcheck, nargs, nret);
> > > > > +        return false;
> > > > > +    }
> > > > > +
> > > > > +    return true;
> > > > > +}
> > > > > +
> > > > > +static void prop_format(char *tval, int tlen, const void *prop, int len)
> > > > > +{
> > > > > +    int i;
> > > > > +    const unsigned char *c;
> > > > > +    char *t;
> > > > > +    const char bin[] = "...";
> > > > > +
> > > > > +    for (i = 0, c = prop; i < len; ++i, ++c) {
> > > > > +        if (*c == '\0' && i == len - 1) {
> > > > > +            strncpy(tval, prop, tlen - 1);
> > > > > +            return;
> > > > > +        }
> > > > > +        if (*c < 0x20 || *c >= 0x80) {
> > > > > +            break;
> > > > > +        }
> > > > > +    }
> > > > > +
> > > > > +    for (i = 0, c = prop, t = tval; i < len; ++i, ++c) {
> > > > > +        if (t >= tval + tlen - sizeof(bin) - 1 - 2 - 1) {
> > > > > +            strcpy(t, bin);
> > > > > +            return;
> > > > > +        }
> > > > > +        if (i && i % 4 == 0 && i != len - 1) {
> > > > > +            strcat(t, " ");
> > > > > +            ++t;
> > > > > +        }
> > > > > +        t += sprintf(t, "%02X", *c & 0xFF);
> > > > > +    }
> > > > > +}
> > > > > +
> > > > > +static int get_path(const void *fdt, int offset, char *buf, int len)
> > > > > +{
> > > > > +    int ret;
> > > > > +
> > > > > +    ret = fdt_get_path(fdt, offset, buf, len - 1);
> > > > > +    if (ret < 0) {
> > > > > +        return ret;
> > > > > +    }
> > > > > +
> > > > > +    buf[len - 1] = '\0';
> > > > > +
> > > > > +    return strlen(buf) + 1;
> > > > > +}
> > > > > +
> > > > > +static int phandle_to_path(const void *fdt, uint32_t ph,
> > > > > char *buf, int len)
> > > > > +{
> > > > > +    int ret;
> > > > > +
> > > > > +    ret = fdt_node_offset_by_phandle(fdt, ph);
> > > > > +    if (ret < 0) {
> > > > > +        return ret;
> > > > > +    }
> > > > > +
> > > > > +    return get_path(fdt, ret, buf, len);
> > > > > +}
> > > > > +
> > > > > +static uint32_t vof_finddevice(const void *fdt, uint32_t nodeaddr)
> > > > > +{
> > > > > +    char fullnode[VOF_MAX_PATH];
> > > > > +    uint32_t ret = -1;
> > > > > +    int offset;
> > > > > +
> > > > > +    if (readstr(nodeaddr, fullnode, sizeof(fullnode))) {
> > > > > +        return (uint32_t) ret;
> > > > > +    }
> > > > > +
> > > > > +    offset = fdt_path_offset(fdt, fullnode);
> > > > > +    if (offset >= 0) {
> > > > > +        ret = fdt_get_phandle(fdt, offset);
> > > > > +    }
> > > > > +    trace_vof_finddevice(fullnode, ret);
> > > > > +    return (uint32_t) ret;
> > > > > +}
> > > > 
> > > > The Linux init function that runs on pegasos2 here:
> > > > 
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/powerpc/kernel/prom_init.c?h=v4.14.234#n2658
> > > > 
> > > > calls finddevice once with isa@c and next with isa@C (small and
> > > > capital C) both of which works with the board firmware but with
> > > > vof the comparison is case sensitive and one of these fails so I
> > > > can't make it work. I don't know if this is a problem in libfdt
> > > > or the vof_finddevice above should do something else to get case
> > > > insensitive comparison.
> > > 
> > > This fixes the issue with Linux but I'm not sure if there's any
> > > better solution or would it break anything else.
> > 
> > The bit after "@" is an address and needs to be case insensitive and
> > I'll fix this indeed. I'm not so sure about the part before "@", I
> > cannot imagine what could break if I made search insensitive to case. Hm
> > :-/
> 
> Fixing the match in the address part is probably enough as the name sent by
> guests is probably always lower case

I'm confused, I thought you just said that it looked for both isa@c
and isa@C, which seems to contradict guests always using lower case.

> but the address could be formatted
> differently and that's what caused the problem. The patch below was only a
> quick fix to be able to test it further but your fix should work too.
> 
> With this and the ld replaced in entry.S I can now boot Linux which is
> enough to submit the pegasos2 vof patch after an updated patch from you
> fixes these in vof.
> 
> MorphOS still misses something but I'm not sure what as it uses the data
> gathered from the device tree later without printing diagnostics and fails
> due to a NULL dereference much after that so it seems to assume some value
> should exist but I'm not sure what value it needs and where that should come
> from. Maybe I'll try some more to find out just to make it simpler to boot
> but since it boots with the board firmware it's enough if Linux works with
> vof for now.
David Gibson June 4, 2021, 6:29 a.m. UTC | #48
On Wed, Jun 02, 2021 at 02:29:29PM +0200, BALATON Zoltan wrote:
> On Wed, 2 Jun 2021, David Gibson wrote:
> > On Thu, May 27, 2021 at 02:42:39PM +0200, BALATON Zoltan wrote:
> > > On Thu, 27 May 2021, David Gibson wrote:
> > > > On Tue, May 25, 2021 at 12:08:45PM +0200, BALATON Zoltan wrote:
> > > > > On Tue, 25 May 2021, David Gibson wrote:
> > > > > > On Mon, May 24, 2021 at 12:55:07PM +0200, BALATON Zoltan wrote:
> > > > > > > On Mon, 24 May 2021, David Gibson wrote:
> > > > > > > > On Sun, May 23, 2021 at 07:09:26PM +0200, BALATON Zoltan wrote:
> > > > > > > > > On Sun, 23 May 2021, BALATON Zoltan wrote:
> > > > > > > > > > On Sun, 23 May 2021, Alexey Kardashevskiy wrote:
> > > > > > > > > > > One thing to note about PCI is that normally I think the client
> > > > > > > > > > > expects the firmware to do PCI probing and SLOF does it. But VOF
> > > > > > > > > > > does not and Linux scans PCI bus(es) itself. Might be a problem for
> > > > > > > > > > > you kernel.
> > > > > > > > > > 
> > > > > > > > > > I'm not sure what info does MorphOS get from the device tree and what it
> > > > > > > > > > probes itself but I think it may at least need device ids and info about
> > > > > > > > > > the PCI bus to be able to access the config regs, after that it should
> > > > > > > > > > set the devices up hopefully. I could add these from the board code to
> > > > > > > > > > device tree so VOF does not need to do anything about it. However I'm
> > > > > > > > > > not getting to that point yet because it crashes on something that it's
> > > > > > > > > > missing and couldn't yet find out what is that.
> > > > > > > > > > 
> > > > > > > > > > I'd like to get Linux working now as that would be enough to test this
> > > > > > > > > > and then if for MorphOS we still need a ROM it's not a problem if at
> > > > > > > > > > least we can boot Linux without the original firmware. But I can't make
> > > > > > > > > > Linux open a serial console and I don't know what it needs for that. Do
> > > > > > > > > > you happen to know? I've looked at the sources in Linux/arch/powerpc but
> > > > > > > > > > not sure how it would find and open a serial port on pegasos2. It seems
> > > > > > > > > > to work with the board firmware and now I can get it to boot with VOF
> > > > > > > > > > but then it does not open serial so it probably needs something in the
> > > > > > > > > > device tree or expects the firmware to set something up that we should
> > > > > > > > > > add in pegasos2.c when using VOF.
> > > > > > > > > 
> > > > > > > > > I've now found that Linux uses rtas methods read-pci-config and
> > > > > > > > > write-pci-config for PCI access on pegasos2 so this means that we'll
> > > > > > > > > probably need rtas too (I hoped we could get away without it if it were only
> > > > > > > > > used for shutdown/reboot or so but seems Linux needs it for PCI as well and
> > > > > > > > > does not scan the bus and won't find some devices without it).
> > > > > > > > 
> > > > > > > > Yes, definitely sounds like you'll need an RTAS implementation.
> > > > > > > 
> > > > > > > I plan to fix that after managed to get serial working as that seems to not
> > > > > > > need it. If I delete the rtas-size property from /rtas on the original
> > > > > > > firmware that makes Linux skip instantiating rtas, but I still get serial
> > > > > > > output just not accessing PCI devices. So I think it should work and keeps
> > > > > > > things simpler at first. Then I'll try rtas later.
> > > > > > > 
> > > > > > > > > While VOF can do rtas, this causes a problem with the hypercall method using
> > > > > > > > > sc 1 that goes through vhyp but trips the assert in ppc_store_sdr1() so
> > > > > > > > > cannot work after guest is past quiesce.
> > > > > > > > 
> > > > > > > > > So the question is why is that
> > > > > > > > > assert there
> > > > > > > > 
> > > > > > > > Ah.. right.  So, vhyp was designed for the PAPR use case, where we
> > > > > > > > want to model the CPU when it's in supervisor and user mode, but not
> > > > > > > > when it's in hypervisor mode.  We want qemu to mimic the behaviour of
> > > > > > > > the hypervisor, rather than attempting to actually execute hypervisor
> > > > > > > > code in the virtual CPU.
> > > > > > > > 
> > > > > > > > On systems that have a hypervisor mode, SDR1 is hypervisor privileged,
> > > > > > > > so it makes no sense for the guest to attempt to set it.  That should
> > > > > > > > be caught by the general SPR code and turned into a 0x700, hence the
> > > > > > > > assert() if we somehow reach ppc_store_sdr1().
> > > > > > > > 
> > > > > > > > So, we are seeing a problem here because you want the 'sc 1'
> > > > > > > > interception of vhyp, but not the rest of the stuff that goes with it.
> > > > > > > > 
> > > > > > > > > and would using sc 1 for hypercalls on pegasos2 cause other
> > > > > > > > > problems later even if the assert could be removed?
> > > > > > > > 
> > > > > > > > At least in the short term, I think you probably can remove the
> > > > > > > > assert.  In your case the 'sc 1' calls aren't truly to a hypervisor,
> > > > > > > > but a special case escape to qemu for the firmware emulation.  I think
> > > > > > > > it's unlikely to cause problems later, because nothing on a 32-bit
> > > > > > > > system should be attempting an 'sc 1'.  The only thing I can think of
> > > > > > > > that would fail is some test case which explicitly verified that 'sc
> > > > > > > > 1' triggered a 0x700 (SIGILL from userspace).
> > > > > > > 
> > > > > > > OK so the assert should check if the CPU has an HV bit. I think there was a
> > > > > > > #detine for that somewhere that I can add to the assert then I can try that.
> > > > > > > What I wasn't sure about is that sc 1 would conflict with the guest's usage
> > > > > > > of normal sc calls or are these going through different paths and only sc 1
> > > > > > > will trigger vhyp callback not affecting notmal sc calls?
> > > > > > 
> > > > > > The vhyp shouldn't affect normal system calls, 'sc 1' is specifically
> > > > > > for hypercalls, as opposed to normal 'sc' (a.k.a. 'sc 0'), and the
> > > > > > vhyp only intercepts the hypercall version (after all Linux on PAPR
> > > > > > certainly uses its own system calls, and hypercalls are active for the
> > > > > > lifetime of the guest there).
> > > > > > 
> > > > > > > (Or if this causes
> > > > > > > an otherwise unnecessary VM exit on KVM even when it works then maybe
> > > > > > > looking for a different way in the future might be needed.
> > > > > > 
> > > > > > What you're doing here won't work with KVM as it stands.  There are
> > > > > > basically two paths into the vhyp hypercall path: 1) from TCG, if we
> > > > > > interpret an 'sc 1' instruction we enter vhyp, 2) from KVM, if we get
> > > > > > a KVM_EXIT_PAPR_HCALL KVM exit then we also go to the vhyp path.
> > > > > > 
> > > > > > The second path is specific to the PAPR (ppc64) implementation of KVM,
> > > > > > and will not work for a non-PAPR platform without substantial
> > > > > > modification of the KVM code.
> > > > > 
> > > > > OK so then at that point when we try KVM we'll need to look at alternative
> > > > > ways, I think MOL OSI worked with KVM at least in MOL but will probably make
> > > > > all syscalls exit KVM but since we'll probably need to use KVM PR it will
> > > > > exit anyway. For now I keep this vhyp as it does not run with KVM for other
> > > > > reasons yet so that's another area to clean up so as a proof of concept
> > > > > first version of using VOF vhyp will do.
> > > > 
> > > > Eh, since you'll need to modify KVM anyway, it probably makes just as
> > > > much sense to modify it to catch the 'sc 1' as MoL's magic thingy.
> > > 
> > > I'm not sure how KVM works for this case so I also don't know why and what
> > > would need to be modified. I think we'll only have KVM PR working as newer
> > > POWER CPUs having HV (besides being rare among potential users) are probably
> > > too different to run the OSes that expect at most a G4 on pegasos2 so likely
> > > it won't work with KVM HV.
> > 
> > Oh, it definitely won't work with KVM HV.
> > 
> > > If we have KVM PR doesn't sc already trap so we
> > > could add MOL OSI without further modification to KVM itself only needing
> > > change in QEMU?
> > 
> > Uh... I guess so?
> > 
> > > I also hope that MOL OSI could be useful for porting some
> > > paravirt drivers from MOL for running Mac OS X on Mac emulation but I don't
> > > know about that for sure so I'm open to any other solution too.
> > 
> > Maybe.  I never know much about MOL to begin with, and anything I did
> > know was a decade or more ago so I've probably forgotten.
> 
> That may still be more than what I know about it since I never had any
> knowledge about PPC KVM and don't have any PPC hardware to test with so I'm
> mostly guessing. (I could test with KVM emulated in QEMU and I did set up an
> environment for that but that's a bit slow and inconvenient so I'd leave KVM
> support to those interested and have more knowledge and hardware for it.)

Sounds like a problem for someone else another time, then.

> > > For now I'm
> > > going with vhyp which is enough fot testing with TCG and if somebody wants
> > > KVM they could use he original firmware for now so this could be improved in
> > > a later version unless a simple solution is found before the freeze for 6.1.
> > > If we're in KVM PR what happens for sc 1 could that be used too so maybe
> > > what we have now could work?
> > 
> > Note that if you do go down the MOL path it wouldn't be that complex
> > to make a "vMOL" interface so you can use the same mechanism for KVM
> > and TCG.
> 
> Not sure what you mean by VMOL. Is it modifying MOL to use sc 1 like VOF
> instead of its OSI way for hypercalls?

No, I mean on the qemu side adding an optional hook which will
intercept sc 0 instructions with the MOL magic register values and
redirect them to a machine registered callback, rather than emulating
the CPU's behaviour of jumping to the system call vector in guest
space.

Basically an equivalent of vhyp, but for MOL magic syscalls, instead
of hypercalls.

> That would lose the advantage of
> being able to reuse MOL guest drivers without modification (which might be
> useful for running OS X guest on Mac emulation) so if we can't use vhyp then
> maybe using OSI would be the next choice for this reason but for now vhyp
> seems to be working for what I could test so unless somebody here sees a
> problem with it and has a better idea I'm going with vhyp for now just
> because that's what VOF uses and I don't want to modify VOF to reuse it as
> it is so I don't need to maintain a separate version and also get any
> enhancements without further need to sync with spapr VOF.
> 
> I've found this document about possible hypercall interfaces on KVM (see
> Hypercall ABIs at the end):
> 
> https://www.kernel.org/doc/html/latest/virt/kvm/ppc-pv.html
> 
> Having both ePAPR (1.) and PAPR (2.) hypercalls is a bit confusing. Does
> vhyp correspond to 2. PAPR?

Yes.

> The ePAPR (1.) seems to be preferred by KVM and
> MOL OSI supported for compatibility.

That document looks pretty out of date.  Most of it is only discussing
KVM PR, which is now barely maintained.  KVM HV only works with PAPR
hypercalls.

> So if we need something else instead of
> 2. PAPR hypercalls there seems to be two options: ePAPR and MOL OSI which
> should work with KVM but then I'm not sure how to handle those on TCG.
> 
> > > > > [...]
> > > > > > > > > I've tested that the missing rtas is not the reason for getting no output
> > > > > > > > > via serial though, as even when disabling rtas on pegasos2.rom it boots and
> > > > > > > > > I still get serial output just some PCI devices are not detected (such as
> > > > > > > > > USB, the video card and the not emulated ethernet port but these are not
> > > > > > > > > fatal so it might even work as a first try without rtas, just to boot a
> > > > > > > > > Linux kernel for testing it would be enough if I can fix the serial output).
> > > > > > > > > I still don't know why it's not finding serial but I think it may be some
> > > > > > > > > missing or wrong info in the device tree I generat. I'll try to focus on
> > > > > > > > > this for now and leave the above rtas question for later.
> > > > > > > > 
> > > > > > > > Oh.. another thought on that.  You have an ISA serial port on Pegasos,
> > > > > > > > I believe.  I wonder if the PCI->ISA bridge needs some configuration /
> > > > > > > > initialization that the firmware is expected to do.  If so you'll need
> > > > > > > > to mimic that setup in qemu for the VOF case.
> > > > > > > 
> > > > > > > That's what I begin to think because I've added everything to the device
> > > > > > > tree that I thought could be needed and I still don't get it working so it
> > > > > > > may need some config from the firmware. But how do I access device registers
> > > > > > > from board code? I've tried adding a machine reset method and write to
> > > > > > > memory mapped device registers but all my attempts failed. I've tried
> > > > > > > cpu_stl_le_data and even memory_region_dispatch_write but these did not get
> > > > > > > to the device. What's the way to access guest mmio regs from QEMU?
> > > > > > 
> > > > > > That's odd, cpu_stl() and memory_region_dispatch_write() should work
> > > > > > from board code (after the relevant memory regions are configured, of
> > > > > > course).  As an ISA serial port, it's probably accessed through IO
> > > > > > space, not memory space though, so you'd need &address_space_io.  And
> > > > > > if there is some bridge configuration then it's the bridge control
> > > > > > registers you need to look at not the serial registers - you'd have to
> > > > > > look at the bridge documentation for that.  Or, I guess the bridge
> > > > > > implementation in qemu, which you wrote part of.
> > > > > 
> > > > > I've found at last that stl_le_phys() works. There are so many of these that
> > > > > I never know when to use which.
> > > > > 
> > > > > I think the address_space_rw calls in vof_client_call() in vof.c could also
> > > > > use these for somewhat shorter code. I've ended up with
> > > > > stl_le_phys(CPU(cpu)->as, addr, val) in my machine reset methodbut I don't
> > > > > even need that now as it works without additional setup. Also VOF's memory
> > > > > access is basically the same as the already existing rtas_st() and co. so
> > > > > maybe that could be reused to make code smaller?
> > > > 
> > > > rtas_ld() and rtas_st() should only be used for reading/writing RTAS
> > > > parameters to and from memory.  Accessing IO shouldn't be done with
> > > > those.
> > > > 
> > > > For IO you probably want the cpu_st*() variants in most cases, since
> > > > you're trying to emulate an IO access from the virtual cpu.
> > > 
> > > I think I've tried that but what worked to access mmio device registers are
> > > stl_le_phys and similar that are wrappers around address_space_stl_*. But I
> > > did not mean that for rtas_ld/_st but the part when vof accessing the
> > > parameters passed by its hypercall which is memory access:
> > > 
> > > https://github.com/patchew-project/qemu/blob/patchew/20210520090557.435689-1-aik%40ozlabs.ru/hw/ppc/vof.c
> > > 
> > > line 893, and vof_client_call before that is very similar to what h_rtas
> > > does here:
> > > 
> > > https://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/spapr_hcall.c;h=f25014afda408002ee1ec1027a0dd7a6025eca61;hb=HEAD#l639
> > > 
> > > and I also need to do the same for rtas in pegasos2 for which I'm just using
> > > ldl_be_phys for now but I wonder if we really need 3 ways to do the same or
> > > the rtas_ld/_st could be made more generic and reused here?
> > 
> > For your rtas implementation you could definitely re-use them.  For
> > the client call I'm a bit less confident, but if the in-guest-memory
> > structures are really the same, then it would make sense.
> 
> The memory structure seems very similar to me, the only difference is
> calling the first field service in VOF instead of token in RTAS. Both are
> just an array of big endian unit32_t with token, nargs, nret at the front
> followed by args and rets. Since these rtas_ld/st are defined in spapr.h I
> did not bother to split them off, so for pegasos2 rtas I'm just using the
> ldl_be_* functions directly for which these are a shorthand for. If these
> were split off for sharing between spapr rtas and VOF I may be able to reuse
> them as well but it's not that important so just mentioned it as a possible
> later clean up.

Ok, sounds reasonable to re-use them then, though maybe add an aliased
name for clarity ofci_{ld,st}(), maybe?  (for "Open Firmware Client
Interface")
BALATON Zoltan June 4, 2021, 1:27 p.m. UTC | #49
On Fri, 4 Jun 2021, David Gibson wrote:
> On Tue, Jun 01, 2021 at 04:12:44PM +0200, BALATON Zoltan wrote:
>> On Tue, 1 Jun 2021, Alexey Kardashevskiy wrote:
>>> On 31/05/2021 23:07, BALATON Zoltan wrote:
>>>> On Sun, 30 May 2021, BALATON Zoltan wrote:
>>>>> On Thu, 20 May 2021, Alexey Kardashevskiy wrote:
>>>>>> diff --git a/hw/ppc/vof.c b/hw/ppc/vof.c
>>>>>> new file mode 100644
>>>>>> index 000000000000..a283b7d251a7
>>>>>> --- /dev/null
>>>>>> +++ b/hw/ppc/vof.c
>>>>>> @@ -0,0 +1,1021 @@
>>>>>> +/*
>>>>>> + * QEMU PowerPC Virtual Open Firmware.
>>>>>> + *
>>>>>> + * This implements client interface from OpenFirmware
>>>>>> IEEE1275 on the QEMU
>>>>>> + * side to leave only a very basic firmware in the VM.
>>>>>> + *
>>>>>> + * Copyright (c) 2021 IBM Corporation.
>>>>>> + *
>>>>>> + * SPDX-License-Identifier: GPL-2.0-or-later
>>>>>> + */
>>>>>> +
>>>>>> +#include "qemu/osdep.h"
>>>>>> +#include "qemu-common.h"
>>>>>> +#include "qemu/timer.h"
>>>>>> +#include "qemu/range.h"
>>>>>> +#include "qemu/units.h"
>>>>>> +#include "qapi/error.h"
>>>>>> +#include <sys/ioctl.h>
>>>>>> +#include "exec/ram_addr.h"
>>>>>> +#include "exec/address-spaces.h"
>>>>>> +#include "hw/ppc/vof.h"
>>>>>> +#include "hw/ppc/fdt.h"
>>>>>> +#include "sysemu/runstate.h"
>>>>>> +#include "qom/qom-qobject.h"
>>>>>> +#include "trace.h"
>>>>>> +
>>>>>> +#include <libfdt.h>
>>>>>> +
>>>>>> +/*
>>>>>> + * OF 1275 "nextprop" description suggests is it 32 bytes max but
>>>>>> + * LoPAPR defines "ibm,query-interrupt-source-number" which
>>>>>> is 33 chars long.
>>>>>> + */
>>>>>> +#define OF_PROPNAME_LEN_MAX 64
>>>>>> +
>>>>>> +#define VOF_MAX_PATH        256
>>>>>> +#define VOF_MAX_SETPROPLEN  2048
>>>>>> +#define VOF_MAX_METHODLEN   256
>>>>>> +#define VOF_MAX_FORTHCODE   256
>>>>>> +#define VOF_VTY_BUF_SIZE    256
>>>>>> +
>>>>>> +typedef struct {
>>>>>> +    uint64_t start;
>>>>>> +    uint64_t size;
>>>>>> +} OfClaimed;
>>>>>> +
>>>>>> +typedef struct {
>>>>>> +    char *path; /* the path used to open the instance */
>>>>>> +    uint32_t phandle;
>>>>>> +} OfInstance;
>>>>>> +
>>>>>> +#define VOF_MEM_READ(pa, buf, size) \
>>>>>> +    address_space_read_full(&address_space_memory, \
>>>>>> +    (pa), MEMTXATTRS_UNSPECIFIED, (buf), (size))
>>>>>> +#define VOF_MEM_WRITE(pa, buf, size) \
>>>>>> +    address_space_write(&address_space_memory, \
>>>>>> +    (pa), MEMTXATTRS_UNSPECIFIED, (buf), (size))
>>>>>> +
>>>>>> +static int readstr(hwaddr pa, char *buf, int size)
>>>>>> +{
>>>>>> +    if (VOF_MEM_READ(pa, buf, size) != MEMTX_OK) {
>>>>>> +        return -1;
>>>>>> +    }
>>>>>> +    if (strnlen(buf, size) == size) {
>>>>>> +        buf[size - 1] = '\0';
>>>>>> +        trace_vof_error_str_truncated(buf, size);
>>>>>> +        return -1;
>>>>>> +    }
>>>>>> +    return 0;
>>>>>> +}
>>>>>> +
>>>>>> +static bool cmpservice(const char *s, unsigned nargs, unsigned nret,
>>>>>> +                       const char *s1, unsigned nargscheck,
>>>>>> unsigned nretcheck)
>>>>>> +{
>>>>>> +    if (strcmp(s, s1)) {
>>>>>> +        return false;
>>>>>> +    }
>>>>>> +    if ((nargscheck && (nargs != nargscheck)) ||
>>>>>> +        (nretcheck && (nret != nretcheck))) {
>>>>>> +        trace_vof_error_param(s, nargscheck, nretcheck, nargs, nret);
>>>>>> +        return false;
>>>>>> +    }
>>>>>> +
>>>>>> +    return true;
>>>>>> +}
>>>>>> +
>>>>>> +static void prop_format(char *tval, int tlen, const void *prop, int len)
>>>>>> +{
>>>>>> +    int i;
>>>>>> +    const unsigned char *c;
>>>>>> +    char *t;
>>>>>> +    const char bin[] = "...";
>>>>>> +
>>>>>> +    for (i = 0, c = prop; i < len; ++i, ++c) {
>>>>>> +        if (*c == '\0' && i == len - 1) {
>>>>>> +            strncpy(tval, prop, tlen - 1);
>>>>>> +            return;
>>>>>> +        }
>>>>>> +        if (*c < 0x20 || *c >= 0x80) {
>>>>>> +            break;
>>>>>> +        }
>>>>>> +    }
>>>>>> +
>>>>>> +    for (i = 0, c = prop, t = tval; i < len; ++i, ++c) {
>>>>>> +        if (t >= tval + tlen - sizeof(bin) - 1 - 2 - 1) {
>>>>>> +            strcpy(t, bin);
>>>>>> +            return;
>>>>>> +        }
>>>>>> +        if (i && i % 4 == 0 && i != len - 1) {
>>>>>> +            strcat(t, " ");
>>>>>> +            ++t;
>>>>>> +        }
>>>>>> +        t += sprintf(t, "%02X", *c & 0xFF);
>>>>>> +    }
>>>>>> +}
>>>>>> +
>>>>>> +static int get_path(const void *fdt, int offset, char *buf, int len)
>>>>>> +{
>>>>>> +    int ret;
>>>>>> +
>>>>>> +    ret = fdt_get_path(fdt, offset, buf, len - 1);
>>>>>> +    if (ret < 0) {
>>>>>> +        return ret;
>>>>>> +    }
>>>>>> +
>>>>>> +    buf[len - 1] = '\0';
>>>>>> +
>>>>>> +    return strlen(buf) + 1;
>>>>>> +}
>>>>>> +
>>>>>> +static int phandle_to_path(const void *fdt, uint32_t ph,
>>>>>> char *buf, int len)
>>>>>> +{
>>>>>> +    int ret;
>>>>>> +
>>>>>> +    ret = fdt_node_offset_by_phandle(fdt, ph);
>>>>>> +    if (ret < 0) {
>>>>>> +        return ret;
>>>>>> +    }
>>>>>> +
>>>>>> +    return get_path(fdt, ret, buf, len);
>>>>>> +}
>>>>>> +
>>>>>> +static uint32_t vof_finddevice(const void *fdt, uint32_t nodeaddr)
>>>>>> +{
>>>>>> +    char fullnode[VOF_MAX_PATH];
>>>>>> +    uint32_t ret = -1;
>>>>>> +    int offset;
>>>>>> +
>>>>>> +    if (readstr(nodeaddr, fullnode, sizeof(fullnode))) {
>>>>>> +        return (uint32_t) ret;
>>>>>> +    }
>>>>>> +
>>>>>> +    offset = fdt_path_offset(fdt, fullnode);
>>>>>> +    if (offset >= 0) {
>>>>>> +        ret = fdt_get_phandle(fdt, offset);
>>>>>> +    }
>>>>>> +    trace_vof_finddevice(fullnode, ret);
>>>>>> +    return (uint32_t) ret;
>>>>>> +}
>>>>>
>>>>> The Linux init function that runs on pegasos2 here:
>>>>>
>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/powerpc/kernel/prom_init.c?h=v4.14.234#n2658
>>>>>
>>>>> calls finddevice once with isa@c and next with isa@C (small and
>>>>> capital C) both of which works with the board firmware but with
>>>>> vof the comparison is case sensitive and one of these fails so I
>>>>> can't make it work. I don't know if this is a problem in libfdt
>>>>> or the vof_finddevice above should do something else to get case
>>>>> insensitive comparison.
>>>>
>>>> This fixes the issue with Linux but I'm not sure if there's any
>>>> better solution or would it break anything else.
>>>
>>> The bit after "@" is an address and needs to be case insensitive and
>>> I'll fix this indeed. I'm not so sure about the part before "@", I
>>> cannot imagine what could break if I made search insensitive to case. Hm
>>> :-/
>>
>> Fixing the match in the address part is probably enough as the name sent by
>> guests is probably always lower case
>
> I'm confused, I thought you just said that it looked for both isa@c
> and isa@C, which seems to contradict guests always using lower case.

I mean the part before the @ sign (that is the name part, "isa" above) is 
always lower case. I haven't seen guests trying to query that with other 
than lower case but the part after @ can be different even in the same 
guest code just a few lines apart as in the Linux kernel. So fixing the 
comparison to e.g. do toupper in the address part after @ should work I 
think even if we continue to do case sensitive comparison in the name 
part. Alexey said he'll fix that so there's no problem.

Regards,
BALATON Zoltam
BALATON Zoltan June 4, 2021, 1:50 p.m. UTC | #50
On Fri, 4 Jun 2021, David Gibson wrote:
> On Sun, May 30, 2021 at 07:33:01PM +0200, BALATON Zoltan wrote:
>> Hello,
>>
>> Two more problems I've found while testing with pegasos2 but I'm not sure
>> how to fix them:
>>
>> On Thu, 20 May 2021, Alexey Kardashevskiy wrote:
>>> diff --git a/hw/ppc/vof.c b/hw/ppc/vof.c
>>> new file mode 100644
>>> index 000000000000..a283b7d251a7
>>> --- /dev/null
>>> +++ b/hw/ppc/vof.c
>>> @@ -0,0 +1,1021 @@
>>> +/*
>>> + * QEMU PowerPC Virtual Open Firmware.
>>> + *
>>> + * This implements client interface from OpenFirmware IEEE1275 on the QEMU
>>> + * side to leave only a very basic firmware in the VM.
>>> + *
>>> + * Copyright (c) 2021 IBM Corporation.
>>> + *
>>> + * SPDX-License-Identifier: GPL-2.0-or-later
>>> + */
>>> +
>>> +#include "qemu/osdep.h"
>>> +#include "qemu-common.h"
>>> +#include "qemu/timer.h"
>>> +#include "qemu/range.h"
>>> +#include "qemu/units.h"
>>> +#include "qapi/error.h"
>>> +#include <sys/ioctl.h>
>>> +#include "exec/ram_addr.h"
>>> +#include "exec/address-spaces.h"
>>> +#include "hw/ppc/vof.h"
>>> +#include "hw/ppc/fdt.h"
>>> +#include "sysemu/runstate.h"
>>> +#include "qom/qom-qobject.h"
>>> +#include "trace.h"
>>> +
>>> +#include <libfdt.h>
>>> +
>>> +/*
>>> + * OF 1275 "nextprop" description suggests is it 32 bytes max but
>>> + * LoPAPR defines "ibm,query-interrupt-source-number" which is 33 chars long.
>>> + */
>>> +#define OF_PROPNAME_LEN_MAX 64
>>> +
>>> +#define VOF_MAX_PATH        256
>>> +#define VOF_MAX_SETPROPLEN  2048
>>> +#define VOF_MAX_METHODLEN   256
>>> +#define VOF_MAX_FORTHCODE   256
>>> +#define VOF_VTY_BUF_SIZE    256
>>> +
>>> +typedef struct {
>>> +    uint64_t start;
>>> +    uint64_t size;
>>> +} OfClaimed;
>>> +
>>> +typedef struct {
>>> +    char *path; /* the path used to open the instance */
>>> +    uint32_t phandle;
>>> +} OfInstance;
>>> +
>>> +#define VOF_MEM_READ(pa, buf, size) \
>>> +    address_space_read_full(&address_space_memory, \
>>> +    (pa), MEMTXATTRS_UNSPECIFIED, (buf), (size))
>>> +#define VOF_MEM_WRITE(pa, buf, size) \
>>> +    address_space_write(&address_space_memory, \
>>> +    (pa), MEMTXATTRS_UNSPECIFIED, (buf), (size))
>>> +
>>> +static int readstr(hwaddr pa, char *buf, int size)
>>> +{
>>> +    if (VOF_MEM_READ(pa, buf, size) != MEMTX_OK) {
>>> +        return -1;
>>> +    }
>>> +    if (strnlen(buf, size) == size) {
>>> +        buf[size - 1] = '\0';
>>> +        trace_vof_error_str_truncated(buf, size);
>>> +        return -1;
>>> +    }
>>> +    return 0;
>>> +}
>>> +
>>> +static bool cmpservice(const char *s, unsigned nargs, unsigned nret,
>>> +                       const char *s1, unsigned nargscheck, unsigned nretcheck)
>>> +{
>>> +    if (strcmp(s, s1)) {
>>> +        return false;
>>> +    }
>>> +    if ((nargscheck && (nargs != nargscheck)) ||
>>> +        (nretcheck && (nret != nretcheck))) {
>>> +        trace_vof_error_param(s, nargscheck, nretcheck, nargs, nret);
>>> +        return false;
>>> +    }
>>> +
>>> +    return true;
>>> +}
>>> +
>>> +static void prop_format(char *tval, int tlen, const void *prop, int len)
>>> +{
>>> +    int i;
>>> +    const unsigned char *c;
>>> +    char *t;
>>> +    const char bin[] = "...";
>>> +
>>> +    for (i = 0, c = prop; i < len; ++i, ++c) {
>>> +        if (*c == '\0' && i == len - 1) {
>>> +            strncpy(tval, prop, tlen - 1);
>>> +            return;
>>> +        }
>>> +        if (*c < 0x20 || *c >= 0x80) {
>>> +            break;
>>> +        }
>>> +    }
>>> +
>>> +    for (i = 0, c = prop, t = tval; i < len; ++i, ++c) {
>>> +        if (t >= tval + tlen - sizeof(bin) - 1 - 2 - 1) {
>>> +            strcpy(t, bin);
>>> +            return;
>>> +        }
>>> +        if (i && i % 4 == 0 && i != len - 1) {
>>> +            strcat(t, " ");
>>> +            ++t;
>>> +        }
>>> +        t += sprintf(t, "%02X", *c & 0xFF);
>>> +    }
>>> +}
>>> +
>>> +static int get_path(const void *fdt, int offset, char *buf, int len)
>>> +{
>>> +    int ret;
>>> +
>>> +    ret = fdt_get_path(fdt, offset, buf, len - 1);
>>> +    if (ret < 0) {
>>> +        return ret;
>>> +    }
>>> +
>>> +    buf[len - 1] = '\0';
>>> +
>>> +    return strlen(buf) + 1;
>>> +}
>>> +
>>> +static int phandle_to_path(const void *fdt, uint32_t ph, char *buf, int len)
>>> +{
>>> +    int ret;
>>> +
>>> +    ret = fdt_node_offset_by_phandle(fdt, ph);
>>> +    if (ret < 0) {
>>> +        return ret;
>>> +    }
>>> +
>>> +    return get_path(fdt, ret, buf, len);
>>> +}
>>> +
>>> +static uint32_t vof_finddevice(const void *fdt, uint32_t nodeaddr)
>>> +{
>>> +    char fullnode[VOF_MAX_PATH];
>>> +    uint32_t ret = -1;
>>> +    int offset;
>>> +
>>> +    if (readstr(nodeaddr, fullnode, sizeof(fullnode))) {
>>> +        return (uint32_t) ret;
>>> +    }
>>> +
>>> +    offset = fdt_path_offset(fdt, fullnode);
>>> +    if (offset >= 0) {
>>> +        ret = fdt_get_phandle(fdt, offset);
>>> +    }
>>> +    trace_vof_finddevice(fullnode, ret);
>>> +    return (uint32_t) ret;
>>> +}
>>
>> The Linux init function that runs on pegasos2 here:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/powerpc/kernel/prom_init.c?h=v4.14.234#n2658
>>
>> calls finddevice once with isa@c and next with isa@C (small and capital C)
>> both of which works with the board firmware but with vof the comparison is
>> case sensitive and one of these fails so I can't make it work. I don't know
>> if this is a problem in libfdt or the vof_finddevice above should do
>> something else to get case insensitive comparison.
>
> This is kind of a subtle incompatibility between the traditional OF
> world and the flat tree world.  In traditional OF, the unit address
> (bit after the @) doesn't exist as a string.  Instead when you do the
> finddevice it will parse that address and compare it against the 'reg'
> properties for each of the relevant nodes.  Since that's an integer
> comparison, case doesn't enter into it.
>
> But, how to parse (and write) addresses depends on the bus, so the
> firmware has to understand each bus type and act accordingly.  That
> doesn't really work in the world of minimal firmwares dor the flat
> tree.  So instead, we just incorporate a pre-formatted unit address in
> the flat tree directly.  Most of the time that works fine, but there
> are some edge cases like the one you've hit.

OK, thanks for the clarification, as said in previous message I think 
doing case insesitive comparison just in the address part should work then 
we don't have to implement reg parsing in VOF.

>>> +static const void *getprop(const void *fdt, int nodeoff, const char *propname,
>>> +                           int *proplen, bool *write0)
>>> +{
>>> +    const char *unit, *prop;
>>> +
>>> +    /*
>>> +     * The "name" property is not actually stored as a property in the FDT,
>>> +     * we emulate it by returning a pointer to the node's name and adjust
>>> +     * proplen to include only the name but not the unit.
>>> +     */
>>> +    if (strcmp(propname, "name") == 0) {
>>> +        prop = fdt_get_name(fdt, nodeoff, proplen);
>>> +        if (!prop) {
>>> +            *proplen = 0;
>>> +            return NULL;
>>> +        }
>>> +
>>> +        unit = memchr(prop, '@', *proplen);
>>> +        if (unit) {
>>> +            *proplen = unit - prop;
>>> +        }
>>> +        *proplen += 1;
>>> +
>>> +        /*
>>> +         * Since it might be cut at "@" and there will be no trailing zero
>>> +         * in the prop buffer, tell the caller to write zero at the end.
>>> +         */
>>> +        if (write0) {
>>> +            *write0 = true;
>>> +        }
>>> +        return prop;
>>> +    }
>>> +
>>> +    if (write0) {
>>> +        *write0 = false;
>>> +    }
>>> +    return fdt_getprop(fdt, nodeoff, propname, proplen);
>>> +}
>>
>> MorphOS checks the name property of the root node ("/") to decide what
>> platform it runs on so we may need to be able to set this property on /
>> where it should return "bplan,Pegasos2", therefore the above maybe should do
>> getprop first and only generate name property if it's not set (or at least
>> check if we're on the root node and allow setting name property there). (On
>> Macs the root node is named "device-tree" and this was before found to be
>> needed for MorphOS.)
>
> Ah.  Hrm.  Have to think about what to do about that.

This is easy to fix, this seems to allow setting a name property or return 
a default:

>diff --git a/hw/ppc/vof.c b/hw/ppc/vof.c
index b47bbd509d..746842593e 100644
--- a/hw/ppc/vof.c
+++ b/hw/ppc/vof.c
@@ -163,14 +163,14 @@ static uint32_t vof_finddevice(const void *fdt, 
uint32_t nodeaddr)
  static const void *getprop(const void *fdt, int nodeoff, const char *propname,
                             int *proplen, bool *write0)
  {
-    const char *unit, *prop;
+    const char *unit, *prop = fdt_getprop(fdt, nodeoff, propname, proplen);

      /*
       * The "name" property is not actually stored as a property in the FDT,
       * we emulate it by returning a pointer to the node's name and adjust
       * proplen to include only the name but not the unit.
       */
-    if (strcmp(propname, "name") == 0) {
+    if (!prop && strcmp(propname, "name") == 0) {
          prop = fdt_get_name(fdt, nodeoff, proplen);
          if (!prop) {
              *proplen = 0;
@@ -196,7 +196,7 @@ static const void *getprop(const void *fdt, int nodeoff, const char *propname,
      if (write0) {
          *write0 = false;
      }
-    return fdt_getprop(fdt, nodeoff, propname, proplen);
+    return prop;
  }

This allows adding a name property to "/" different from the default but 
this does not yet fix MorphOS booting with VOF on pegasos2. I think it 
tries to query name on / and check if it's called "device-tree" in which 
case it assumes Mac hardware otherwise it goes with pegasos2 so even if we 
return nothing for name it would not matter in this case as we don't use 
VOF on Mac. If we wanted that then this would become a problem so it could 
be fixed now in advance just in case other guests may need it.

>> Other than the above two problems, I've found that getting the device tree
>> from vof returns it in reverse order compared to the board firmware if I add
>> it the expected order. This may or may not be a problem but to avoid it I
>> can build the tree in reverse order then it comes out right so unless
>> there's an easy fix this should not cause a problem but may worth a comment
>> somewhere.
>
> The order of things in the device tree *should* never matter.  If it
> does, that's definitely a client bug... but of course that doesn't
> necessarily mean we won't have to work around it in practice.

I don't know if it matters or not but having the device tree in the same 
order as the firmware ROM helps with comparing it for debugging but I've 
found I can solve this by building the tree in reverse order so no changes 
to VOF is needed for this, just thought adding a comment somewhere may 
clarify it but it's not really a problem.

I still don't know what's MorphOS is missing, I've tried adding almost all 
misssing properties, checked what hardware is init by the firmware and 
tried to do the same in board reset code and even after that MorphOS still 
takes a different route with VOF and crashes but boots with the board 
firmware. I'm now thinking it may be either different memory organisation 
or the missing name properties that are not returned by nextprop in VOF so 
they are only appearing when explicitely queried whereas with the board 
firmware they are present as properties. With the above patch I could 
explicitely set it on nodes and test if that makes a difference.

I got to this because adding more missing props or init more devices did 
not make a difference so I'm guessing it may be something else then and 
the only difference I can see compared to board firmware are the different 
memory ranges in claimed (VOF puts itself to 0 for example); and the 
missing name and additional phandle props in the device tree. MorphOS 
copies the whole device tree on startup then later it uses this copy of 
the device tree after shutting down OF with quiesce. I can imagine it may 
use some name props like that on the cpu node without checking assuming 
it's always there and if we're missing that it may cause a NULL 
dereference. I have no better idea what else could be missing so I'll test 
this next. If it helps I can try to come up with a patch to VOF to return 
these name props or allow setting them as above.

Regards,
BALATON Zoltan
BALATON Zoltan June 4, 2021, 1:59 p.m. UTC | #51
On Fri, 4 Jun 2021, David Gibson wrote:
> On Wed, Jun 02, 2021 at 02:29:29PM +0200, BALATON Zoltan wrote:
>> On Wed, 2 Jun 2021, David Gibson wrote:
>>> On Thu, May 27, 2021 at 02:42:39PM +0200, BALATON Zoltan wrote:
>>>> On Thu, 27 May 2021, David Gibson wrote:
>>>>> On Tue, May 25, 2021 at 12:08:45PM +0200, BALATON Zoltan wrote:
>>>>>> On Tue, 25 May 2021, David Gibson wrote:
>>>>>>> On Mon, May 24, 2021 at 12:55:07PM +0200, BALATON Zoltan wrote:
>>>>>>>> On Mon, 24 May 2021, David Gibson wrote:
>>>>>>>>> On Sun, May 23, 2021 at 07:09:26PM +0200, BALATON Zoltan wrote:
>>>>>>>>>> On Sun, 23 May 2021, BALATON Zoltan wrote:
>>>>>>>>>>> On Sun, 23 May 2021, Alexey Kardashevskiy wrote:
>>>>>>>>>>>> One thing to note about PCI is that normally I think the client
>>>>>>>>>>>> expects the firmware to do PCI probing and SLOF does it. But VOF
>>>>>>>>>>>> does not and Linux scans PCI bus(es) itself. Might be a problem for
>>>>>>>>>>>> you kernel.
>>>>>>>>>>>
>>>>>>>>>>> I'm not sure what info does MorphOS get from the device tree and what it
>>>>>>>>>>> probes itself but I think it may at least need device ids and info about
>>>>>>>>>>> the PCI bus to be able to access the config regs, after that it should
>>>>>>>>>>> set the devices up hopefully. I could add these from the board code to
>>>>>>>>>>> device tree so VOF does not need to do anything about it. However I'm
>>>>>>>>>>> not getting to that point yet because it crashes on something that it's
>>>>>>>>>>> missing and couldn't yet find out what is that.
>>>>>>>>>>>
>>>>>>>>>>> I'd like to get Linux working now as that would be enough to test this
>>>>>>>>>>> and then if for MorphOS we still need a ROM it's not a problem if at
>>>>>>>>>>> least we can boot Linux without the original firmware. But I can't make
>>>>>>>>>>> Linux open a serial console and I don't know what it needs for that. Do
>>>>>>>>>>> you happen to know? I've looked at the sources in Linux/arch/powerpc but
>>>>>>>>>>> not sure how it would find and open a serial port on pegasos2. It seems
>>>>>>>>>>> to work with the board firmware and now I can get it to boot with VOF
>>>>>>>>>>> but then it does not open serial so it probably needs something in the
>>>>>>>>>>> device tree or expects the firmware to set something up that we should
>>>>>>>>>>> add in pegasos2.c when using VOF.
>>>>>>>>>>
>>>>>>>>>> I've now found that Linux uses rtas methods read-pci-config and
>>>>>>>>>> write-pci-config for PCI access on pegasos2 so this means that we'll
>>>>>>>>>> probably need rtas too (I hoped we could get away without it if it were only
>>>>>>>>>> used for shutdown/reboot or so but seems Linux needs it for PCI as well and
>>>>>>>>>> does not scan the bus and won't find some devices without it).
>>>>>>>>>
>>>>>>>>> Yes, definitely sounds like you'll need an RTAS implementation.
>>>>>>>>
>>>>>>>> I plan to fix that after managed to get serial working as that seems to not
>>>>>>>> need it. If I delete the rtas-size property from /rtas on the original
>>>>>>>> firmware that makes Linux skip instantiating rtas, but I still get serial
>>>>>>>> output just not accessing PCI devices. So I think it should work and keeps
>>>>>>>> things simpler at first. Then I'll try rtas later.
>>>>>>>>
>>>>>>>>>> While VOF can do rtas, this causes a problem with the hypercall method using
>>>>>>>>>> sc 1 that goes through vhyp but trips the assert in ppc_store_sdr1() so
>>>>>>>>>> cannot work after guest is past quiesce.
>>>>>>>>>
>>>>>>>>>> So the question is why is that
>>>>>>>>>> assert there
>>>>>>>>>
>>>>>>>>> Ah.. right.  So, vhyp was designed for the PAPR use case, where we
>>>>>>>>> want to model the CPU when it's in supervisor and user mode, but not
>>>>>>>>> when it's in hypervisor mode.  We want qemu to mimic the behaviour of
>>>>>>>>> the hypervisor, rather than attempting to actually execute hypervisor
>>>>>>>>> code in the virtual CPU.
>>>>>>>>>
>>>>>>>>> On systems that have a hypervisor mode, SDR1 is hypervisor privileged,
>>>>>>>>> so it makes no sense for the guest to attempt to set it.  That should
>>>>>>>>> be caught by the general SPR code and turned into a 0x700, hence the
>>>>>>>>> assert() if we somehow reach ppc_store_sdr1().
>>>>>>>>>
>>>>>>>>> So, we are seeing a problem here because you want the 'sc 1'
>>>>>>>>> interception of vhyp, but not the rest of the stuff that goes with it.
>>>>>>>>>
>>>>>>>>>> and would using sc 1 for hypercalls on pegasos2 cause other
>>>>>>>>>> problems later even if the assert could be removed?
>>>>>>>>>
>>>>>>>>> At least in the short term, I think you probably can remove the
>>>>>>>>> assert.  In your case the 'sc 1' calls aren't truly to a hypervisor,
>>>>>>>>> but a special case escape to qemu for the firmware emulation.  I think
>>>>>>>>> it's unlikely to cause problems later, because nothing on a 32-bit
>>>>>>>>> system should be attempting an 'sc 1'.  The only thing I can think of
>>>>>>>>> that would fail is some test case which explicitly verified that 'sc
>>>>>>>>> 1' triggered a 0x700 (SIGILL from userspace).
>>>>>>>>
>>>>>>>> OK so the assert should check if the CPU has an HV bit. I think there was a
>>>>>>>> #detine for that somewhere that I can add to the assert then I can try that.
>>>>>>>> What I wasn't sure about is that sc 1 would conflict with the guest's usage
>>>>>>>> of normal sc calls or are these going through different paths and only sc 1
>>>>>>>> will trigger vhyp callback not affecting notmal sc calls?
>>>>>>>
>>>>>>> The vhyp shouldn't affect normal system calls, 'sc 1' is specifically
>>>>>>> for hypercalls, as opposed to normal 'sc' (a.k.a. 'sc 0'), and the
>>>>>>> vhyp only intercepts the hypercall version (after all Linux on PAPR
>>>>>>> certainly uses its own system calls, and hypercalls are active for the
>>>>>>> lifetime of the guest there).
>>>>>>>
>>>>>>>> (Or if this causes
>>>>>>>> an otherwise unnecessary VM exit on KVM even when it works then maybe
>>>>>>>> looking for a different way in the future might be needed.
>>>>>>>
>>>>>>> What you're doing here won't work with KVM as it stands.  There are
>>>>>>> basically two paths into the vhyp hypercall path: 1) from TCG, if we
>>>>>>> interpret an 'sc 1' instruction we enter vhyp, 2) from KVM, if we get
>>>>>>> a KVM_EXIT_PAPR_HCALL KVM exit then we also go to the vhyp path.
>>>>>>>
>>>>>>> The second path is specific to the PAPR (ppc64) implementation of KVM,
>>>>>>> and will not work for a non-PAPR platform without substantial
>>>>>>> modification of the KVM code.
>>>>>>
>>>>>> OK so then at that point when we try KVM we'll need to look at alternative
>>>>>> ways, I think MOL OSI worked with KVM at least in MOL but will probably make
>>>>>> all syscalls exit KVM but since we'll probably need to use KVM PR it will
>>>>>> exit anyway. For now I keep this vhyp as it does not run with KVM for other
>>>>>> reasons yet so that's another area to clean up so as a proof of concept
>>>>>> first version of using VOF vhyp will do.
>>>>>
>>>>> Eh, since you'll need to modify KVM anyway, it probably makes just as
>>>>> much sense to modify it to catch the 'sc 1' as MoL's magic thingy.
>>>>
>>>> I'm not sure how KVM works for this case so I also don't know why and what
>>>> would need to be modified. I think we'll only have KVM PR working as newer
>>>> POWER CPUs having HV (besides being rare among potential users) are probably
>>>> too different to run the OSes that expect at most a G4 on pegasos2 so likely
>>>> it won't work with KVM HV.
>>>
>>> Oh, it definitely won't work with KVM HV.
>>>
>>>> If we have KVM PR doesn't sc already trap so we
>>>> could add MOL OSI without further modification to KVM itself only needing
>>>> change in QEMU?
>>>
>>> Uh... I guess so?
>>>
>>>> I also hope that MOL OSI could be useful for porting some
>>>> paravirt drivers from MOL for running Mac OS X on Mac emulation but I don't
>>>> know about that for sure so I'm open to any other solution too.
>>>
>>> Maybe.  I never know much about MOL to begin with, and anything I did
>>> know was a decade or more ago so I've probably forgotten.
>>
>> That may still be more than what I know about it since I never had any
>> knowledge about PPC KVM and don't have any PPC hardware to test with so I'm
>> mostly guessing. (I could test with KVM emulated in QEMU and I did set up an
>> environment for that but that's a bit slow and inconvenient so I'd leave KVM
>> support to those interested and have more knowledge and hardware for it.)
>
> Sounds like a problem for someone else another time, then.
>
>>>> For now I'm
>>>> going with vhyp which is enough fot testing with TCG and if somebody wants
>>>> KVM they could use he original firmware for now so this could be improved in
>>>> a later version unless a simple solution is found before the freeze for 6.1.
>>>> If we're in KVM PR what happens for sc 1 could that be used too so maybe
>>>> what we have now could work?
>>>
>>> Note that if you do go down the MOL path it wouldn't be that complex
>>> to make a "vMOL" interface so you can use the same mechanism for KVM
>>> and TCG.
>>
>> Not sure what you mean by VMOL. Is it modifying MOL to use sc 1 like VOF
>> instead of its OSI way for hypercalls?
>
> No, I mean on the qemu side adding an optional hook which will
> intercept sc 0 instructions with the MOL magic register values and
> redirect them to a machine registered callback, rather than emulating
> the CPU's behaviour of jumping to the system call vector in guest
> space.
>
> Basically an equivalent of vhyp, but for MOL magic syscalls, instead
> of hypercalls.

OK, that's basically what BenH's OSI patch I've linked to before did I 
think, it may just need updating for changes in target/ppc since that 
patch was created. However that would also mean we'd need another version 
of VOF that uses this instead of sc 1 then so unless we need that I'd keep 
a single VOF that works for both spapr and pegasos2.

>> That would lose the advantage of
>> being able to reuse MOL guest drivers without modification (which might be
>> useful for running OS X guest on Mac emulation) so if we can't use vhyp then
>> maybe using OSI would be the next choice for this reason but for now vhyp
>> seems to be working for what I could test so unless somebody here sees a
>> problem with it and has a better idea I'm going with vhyp for now just
>> because that's what VOF uses and I don't want to modify VOF to reuse it as
>> it is so I don't need to maintain a separate version and also get any
>> enhancements without further need to sync with spapr VOF.
>>
>> I've found this document about possible hypercall interfaces on KVM (see
>> Hypercall ABIs at the end):
>>
>> https://www.kernel.org/doc/html/latest/virt/kvm/ppc-pv.html
>>
>> Having both ePAPR (1.) and PAPR (2.) hypercalls is a bit confusing. Does
>> vhyp correspond to 2. PAPR?
>
> Yes.

What's ePAPR then and how is it different from PAPR? I mean the acronym 
not the hypercall method, the latter is explained in that doc but what 
ePAPR stands for and why is that method called like that is not clear to 
me.

>> The ePAPR (1.) seems to be preferred by KVM and
>> MOL OSI supported for compatibility.
>
> That document looks pretty out of date.  Most of it is only discussing
> KVM PR, which is now barely maintained.  KVM HV only works with PAPR
> hypercalls.

The links says it's latest kernel docs, so maybe an update need to be sent 
to KVM?

>> So if we need something else instead of
>> 2. PAPR hypercalls there seems to be two options: ePAPR and MOL OSI which
>> should work with KVM but then I'm not sure how to handle those on TCG.
>>
>>>>>> [...]
>>>>>>>>>> I've tested that the missing rtas is not the reason for getting no output
>>>>>>>>>> via serial though, as even when disabling rtas on pegasos2.rom it boots and
>>>>>>>>>> I still get serial output just some PCI devices are not detected (such as
>>>>>>>>>> USB, the video card and the not emulated ethernet port but these are not
>>>>>>>>>> fatal so it might even work as a first try without rtas, just to boot a
>>>>>>>>>> Linux kernel for testing it would be enough if I can fix the serial output).
>>>>>>>>>> I still don't know why it's not finding serial but I think it may be some
>>>>>>>>>> missing or wrong info in the device tree I generat. I'll try to focus on
>>>>>>>>>> this for now and leave the above rtas question for later.
>>>>>>>>>
>>>>>>>>> Oh.. another thought on that.  You have an ISA serial port on Pegasos,
>>>>>>>>> I believe.  I wonder if the PCI->ISA bridge needs some configuration /
>>>>>>>>> initialization that the firmware is expected to do.  If so you'll need
>>>>>>>>> to mimic that setup in qemu for the VOF case.
>>>>>>>>
>>>>>>>> That's what I begin to think because I've added everything to the device
>>>>>>>> tree that I thought could be needed and I still don't get it working so it
>>>>>>>> may need some config from the firmware. But how do I access device registers
>>>>>>>> from board code? I've tried adding a machine reset method and write to
>>>>>>>> memory mapped device registers but all my attempts failed. I've tried
>>>>>>>> cpu_stl_le_data and even memory_region_dispatch_write but these did not get
>>>>>>>> to the device. What's the way to access guest mmio regs from QEMU?
>>>>>>>
>>>>>>> That's odd, cpu_stl() and memory_region_dispatch_write() should work
>>>>>>> from board code (after the relevant memory regions are configured, of
>>>>>>> course).  As an ISA serial port, it's probably accessed through IO
>>>>>>> space, not memory space though, so you'd need &address_space_io.  And
>>>>>>> if there is some bridge configuration then it's the bridge control
>>>>>>> registers you need to look at not the serial registers - you'd have to
>>>>>>> look at the bridge documentation for that.  Or, I guess the bridge
>>>>>>> implementation in qemu, which you wrote part of.
>>>>>>
>>>>>> I've found at last that stl_le_phys() works. There are so many of these that
>>>>>> I never know when to use which.
>>>>>>
>>>>>> I think the address_space_rw calls in vof_client_call() in vof.c could also
>>>>>> use these for somewhat shorter code. I've ended up with
>>>>>> stl_le_phys(CPU(cpu)->as, addr, val) in my machine reset methodbut I don't
>>>>>> even need that now as it works without additional setup. Also VOF's memory
>>>>>> access is basically the same as the already existing rtas_st() and co. so
>>>>>> maybe that could be reused to make code smaller?
>>>>>
>>>>> rtas_ld() and rtas_st() should only be used for reading/writing RTAS
>>>>> parameters to and from memory.  Accessing IO shouldn't be done with
>>>>> those.
>>>>>
>>>>> For IO you probably want the cpu_st*() variants in most cases, since
>>>>> you're trying to emulate an IO access from the virtual cpu.
>>>>
>>>> I think I've tried that but what worked to access mmio device registers are
>>>> stl_le_phys and similar that are wrappers around address_space_stl_*. But I
>>>> did not mean that for rtas_ld/_st but the part when vof accessing the
>>>> parameters passed by its hypercall which is memory access:
>>>>
>>>> https://github.com/patchew-project/qemu/blob/patchew/20210520090557.435689-1-aik%40ozlabs.ru/hw/ppc/vof.c
>>>>
>>>> line 893, and vof_client_call before that is very similar to what h_rtas
>>>> does here:
>>>>
>>>> https://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/spapr_hcall.c;h=f25014afda408002ee1ec1027a0dd7a6025eca61;hb=HEAD#l639
>>>>
>>>> and I also need to do the same for rtas in pegasos2 for which I'm just using
>>>> ldl_be_phys for now but I wonder if we really need 3 ways to do the same or
>>>> the rtas_ld/_st could be made more generic and reused here?
>>>
>>> For your rtas implementation you could definitely re-use them.  For
>>> the client call I'm a bit less confident, but if the in-guest-memory
>>> structures are really the same, then it would make sense.
>>
>> The memory structure seems very similar to me, the only difference is
>> calling the first field service in VOF instead of token in RTAS. Both are
>> just an array of big endian unit32_t with token, nargs, nret at the front
>> followed by args and rets. Since these rtas_ld/st are defined in spapr.h I
>> did not bother to split them off, so for pegasos2 rtas I'm just using the
>> ldl_be_* functions directly for which these are a shorthand for. If these
>> were split off for sharing between spapr rtas and VOF I may be able to reuse
>> them as well but it's not that important so just mentioned it as a possible
>> later clean up.
>
> Ok, sounds reasonable to re-use them then, though maybe add an aliased
> name for clarity ofci_{ld,st}(), maybe?  (for "Open Firmware Client
> Interface")

I'll wait for what Alexey decides to do in the next VOF patch version and 
if I can reuse that (I could if these were defined in vof.h). I don't want 
to come up with yet another abstraction to ldl_be_* which does not seem to 
make it more clear than using the actual functions for guest memory access 
which is what we're doing while getting the hypercall args so I think 
either using ldl_be_* directly or reusing already existing rfas_ls/_st 
would make sense but adding similar funcs with another name just makes it 
more confusing.

Regards,
BALATON Zoltan
BALATON Zoltan June 4, 2021, 2:34 p.m. UTC | #52
On Fri, 4 Jun 2021, BALATON Zoltan wrote:
> On Fri, 4 Jun 2021, David Gibson wrote:
>> On Sun, May 30, 2021 at 07:33:01PM +0200, BALATON Zoltan wrote:
>>> Hello,
>>> 
>>> Two more problems I've found while testing with pegasos2 but I'm not sure
>>> how to fix them:
>>> 
>>> On Thu, 20 May 2021, Alexey Kardashevskiy wrote:
>>>> diff --git a/hw/ppc/vof.c b/hw/ppc/vof.c
>>>> new file mode 100644
>>>> index 000000000000..a283b7d251a7
>>>> --- /dev/null
>>>> +++ b/hw/ppc/vof.c
>>>> @@ -0,0 +1,1021 @@
>>>> +/*
>>>> + * QEMU PowerPC Virtual Open Firmware.
>>>> + *
>>>> + * This implements client interface from OpenFirmware IEEE1275 on the 
>>>> QEMU
>>>> + * side to leave only a very basic firmware in the VM.
>>>> + *
>>>> + * Copyright (c) 2021 IBM Corporation.
>>>> + *
>>>> + * SPDX-License-Identifier: GPL-2.0-or-later
>>>> + */
>>>> +
>>>> +#include "qemu/osdep.h"
>>>> +#include "qemu-common.h"
>>>> +#include "qemu/timer.h"
>>>> +#include "qemu/range.h"
>>>> +#include "qemu/units.h"
>>>> +#include "qapi/error.h"
>>>> +#include <sys/ioctl.h>
>>>> +#include "exec/ram_addr.h"
>>>> +#include "exec/address-spaces.h"
>>>> +#include "hw/ppc/vof.h"
>>>> +#include "hw/ppc/fdt.h"
>>>> +#include "sysemu/runstate.h"
>>>> +#include "qom/qom-qobject.h"
>>>> +#include "trace.h"
>>>> +
>>>> +#include <libfdt.h>
>>>> +
>>>> +/*
>>>> + * OF 1275 "nextprop" description suggests is it 32 bytes max but
>>>> + * LoPAPR defines "ibm,query-interrupt-source-number" which is 33 chars 
>>>> long.
>>>> + */
>>>> +#define OF_PROPNAME_LEN_MAX 64
>>>> +
>>>> +#define VOF_MAX_PATH        256
>>>> +#define VOF_MAX_SETPROPLEN  2048
>>>> +#define VOF_MAX_METHODLEN   256
>>>> +#define VOF_MAX_FORTHCODE   256
>>>> +#define VOF_VTY_BUF_SIZE    256
>>>> +
>>>> +typedef struct {
>>>> +    uint64_t start;
>>>> +    uint64_t size;
>>>> +} OfClaimed;
>>>> +
>>>> +typedef struct {
>>>> +    char *path; /* the path used to open the instance */
>>>> +    uint32_t phandle;
>>>> +} OfInstance;
>>>> +
>>>> +#define VOF_MEM_READ(pa, buf, size) \
>>>> +    address_space_read_full(&address_space_memory, \
>>>> +    (pa), MEMTXATTRS_UNSPECIFIED, (buf), (size))
>>>> +#define VOF_MEM_WRITE(pa, buf, size) \
>>>> +    address_space_write(&address_space_memory, \
>>>> +    (pa), MEMTXATTRS_UNSPECIFIED, (buf), (size))
>>>> +
>>>> +static int readstr(hwaddr pa, char *buf, int size)
>>>> +{
>>>> +    if (VOF_MEM_READ(pa, buf, size) != MEMTX_OK) {
>>>> +        return -1;
>>>> +    }
>>>> +    if (strnlen(buf, size) == size) {
>>>> +        buf[size - 1] = '\0';
>>>> +        trace_vof_error_str_truncated(buf, size);
>>>> +        return -1;
>>>> +    }
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static bool cmpservice(const char *s, unsigned nargs, unsigned nret,
>>>> +                       const char *s1, unsigned nargscheck, unsigned 
>>>> nretcheck)
>>>> +{
>>>> +    if (strcmp(s, s1)) {
>>>> +        return false;
>>>> +    }
>>>> +    if ((nargscheck && (nargs != nargscheck)) ||
>>>> +        (nretcheck && (nret != nretcheck))) {
>>>> +        trace_vof_error_param(s, nargscheck, nretcheck, nargs, nret);
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    return true;
>>>> +}
>>>> +
>>>> +static void prop_format(char *tval, int tlen, const void *prop, int len)
>>>> +{
>>>> +    int i;
>>>> +    const unsigned char *c;
>>>> +    char *t;
>>>> +    const char bin[] = "...";
>>>> +
>>>> +    for (i = 0, c = prop; i < len; ++i, ++c) {
>>>> +        if (*c == '\0' && i == len - 1) {
>>>> +            strncpy(tval, prop, tlen - 1);
>>>> +            return;
>>>> +        }
>>>> +        if (*c < 0x20 || *c >= 0x80) {
>>>> +            break;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    for (i = 0, c = prop, t = tval; i < len; ++i, ++c) {
>>>> +        if (t >= tval + tlen - sizeof(bin) - 1 - 2 - 1) {
>>>> +            strcpy(t, bin);
>>>> +            return;
>>>> +        }
>>>> +        if (i && i % 4 == 0 && i != len - 1) {
>>>> +            strcat(t, " ");
>>>> +            ++t;
>>>> +        }
>>>> +        t += sprintf(t, "%02X", *c & 0xFF);
>>>> +    }
>>>> +}
>>>> +
>>>> +static int get_path(const void *fdt, int offset, char *buf, int len)
>>>> +{
>>>> +    int ret;
>>>> +
>>>> +    ret = fdt_get_path(fdt, offset, buf, len - 1);
>>>> +    if (ret < 0) {
>>>> +        return ret;
>>>> +    }
>>>> +
>>>> +    buf[len - 1] = '\0';
>>>> +
>>>> +    return strlen(buf) + 1;
>>>> +}
>>>> +
>>>> +static int phandle_to_path(const void *fdt, uint32_t ph, char *buf, int 
>>>> len)
>>>> +{
>>>> +    int ret;
>>>> +
>>>> +    ret = fdt_node_offset_by_phandle(fdt, ph);
>>>> +    if (ret < 0) {
>>>> +        return ret;
>>>> +    }
>>>> +
>>>> +    return get_path(fdt, ret, buf, len);
>>>> +}
>>>> +
>>>> +static uint32_t vof_finddevice(const void *fdt, uint32_t nodeaddr)
>>>> +{
>>>> +    char fullnode[VOF_MAX_PATH];
>>>> +    uint32_t ret = -1;
>>>> +    int offset;
>>>> +
>>>> +    if (readstr(nodeaddr, fullnode, sizeof(fullnode))) {
>>>> +        return (uint32_t) ret;
>>>> +    }
>>>> +
>>>> +    offset = fdt_path_offset(fdt, fullnode);
>>>> +    if (offset >= 0) {
>>>> +        ret = fdt_get_phandle(fdt, offset);
>>>> +    }
>>>> +    trace_vof_finddevice(fullnode, ret);
>>>> +    return (uint32_t) ret;
>>>> +}
>>> 
>>> The Linux init function that runs on pegasos2 here:
>>> 
>>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/powerpc/kernel/prom_init.c?h=v4.14.234#n2658
>>> 
>>> calls finddevice once with isa@c and next with isa@C (small and capital C)
>>> both of which works with the board firmware but with vof the comparison is
>>> case sensitive and one of these fails so I can't make it work. I don't 
>>> know
>>> if this is a problem in libfdt or the vof_finddevice above should do
>>> something else to get case insensitive comparison.
>> 
>> This is kind of a subtle incompatibility between the traditional OF
>> world and the flat tree world.  In traditional OF, the unit address
>> (bit after the @) doesn't exist as a string.  Instead when you do the
>> finddevice it will parse that address and compare it against the 'reg'
>> properties for each of the relevant nodes.  Since that's an integer
>> comparison, case doesn't enter into it.
>> 
>> But, how to parse (and write) addresses depends on the bus, so the
>> firmware has to understand each bus type and act accordingly.  That
>> doesn't really work in the world of minimal firmwares dor the flat
>> tree.  So instead, we just incorporate a pre-formatted unit address in
>> the flat tree directly.  Most of the time that works fine, but there
>> are some edge cases like the one you've hit.
>
> OK, thanks for the clarification, as said in previous message I think doing 
> case insesitive comparison just in the address part should work then we don't 
> have to implement reg parsing in VOF.
>
>>>> +static const void *getprop(const void *fdt, int nodeoff, const char 
>>>> *propname,
>>>> +                           int *proplen, bool *write0)
>>>> +{
>>>> +    const char *unit, *prop;
>>>> +
>>>> +    /*
>>>> +     * The "name" property is not actually stored as a property in the 
>>>> FDT,
>>>> +     * we emulate it by returning a pointer to the node's name and 
>>>> adjust
>>>> +     * proplen to include only the name but not the unit.
>>>> +     */
>>>> +    if (strcmp(propname, "name") == 0) {
>>>> +        prop = fdt_get_name(fdt, nodeoff, proplen);
>>>> +        if (!prop) {
>>>> +            *proplen = 0;
>>>> +            return NULL;
>>>> +        }
>>>> +
>>>> +        unit = memchr(prop, '@', *proplen);
>>>> +        if (unit) {
>>>> +            *proplen = unit - prop;
>>>> +        }
>>>> +        *proplen += 1;
>>>> +
>>>> +        /*
>>>> +         * Since it might be cut at "@" and there will be no trailing 
>>>> zero
>>>> +         * in the prop buffer, tell the caller to write zero at the end.
>>>> +         */
>>>> +        if (write0) {
>>>> +            *write0 = true;
>>>> +        }
>>>> +        return prop;
>>>> +    }
>>>> +
>>>> +    if (write0) {
>>>> +        *write0 = false;
>>>> +    }
>>>> +    return fdt_getprop(fdt, nodeoff, propname, proplen);
>>>> +}
>>> 
>>> MorphOS checks the name property of the root node ("/") to decide what
>>> platform it runs on so we may need to be able to set this property on /
>>> where it should return "bplan,Pegasos2", therefore the above maybe should 
>>> do
>>> getprop first and only generate name property if it's not set (or at least
>>> check if we're on the root node and allow setting name property there). 
>>> (On
>>> Macs the root node is named "device-tree" and this was before found to be
>>> needed for MorphOS.)
>> 
>> Ah.  Hrm.  Have to think about what to do about that.
>
> This is easy to fix, this seems to allow setting a name property or return a 
> default:
>
>> diff --git a/hw/ppc/vof.c b/hw/ppc/vof.c
> index b47bbd509d..746842593e 100644
> --- a/hw/ppc/vof.c
> +++ b/hw/ppc/vof.c
> @@ -163,14 +163,14 @@ static uint32_t vof_finddevice(const void *fdt, 
> uint32_t nodeaddr)
> static const void *getprop(const void *fdt, int nodeoff, const char 
> *propname,
>                            int *proplen, bool *write0)
> {
> -    const char *unit, *prop;
> +    const char *unit, *prop = fdt_getprop(fdt, nodeoff, propname, proplen);
>
>     /*
>      * The "name" property is not actually stored as a property in the FDT,
>      * we emulate it by returning a pointer to the node's name and adjust
>      * proplen to include only the name but not the unit.
>      */
> -    if (strcmp(propname, "name") == 0) {
> +    if (!prop && strcmp(propname, "name") == 0) {
>         prop = fdt_get_name(fdt, nodeoff, proplen);
>         if (!prop) {
>             *proplen = 0;
> @@ -196,7 +196,7 @@ static const void *getprop(const void *fdt, int nodeoff, 
> const char *propname,
>     if (write0) {
>         *write0 = false;
>     }
> -    return fdt_getprop(fdt, nodeoff, propname, proplen);
> +    return prop;
> }
>
> This allows adding a name property to "/" different from the default but this 
> does not yet fix MorphOS booting with VOF on pegasos2. I think it tries to 
> query name on / and check if it's called "device-tree" in which case it 
> assumes Mac hardware otherwise it goes with pegasos2 so even if we return 
> nothing for name it would not matter in this case as we don't use VOF on Mac. 
> If we wanted that then this would become a problem so it could be fixed now 
> in advance just in case other guests may need it.
>
>>> Other than the above two problems, I've found that getting the device tree
>>> from vof returns it in reverse order compared to the board firmware if I 
>>> add
>>> it the expected order. This may or may not be a problem but to avoid it I
>>> can build the tree in reverse order then it comes out right so unless
>>> there's an easy fix this should not cause a problem but may worth a 
>>> comment
>>> somewhere.
>> 
>> The order of things in the device tree *should* never matter.  If it
>> does, that's definitely a client bug... but of course that doesn't
>> necessarily mean we won't have to work around it in practice.
>
> I don't know if it matters or not but having the device tree in the same 
> order as the firmware ROM helps with comparing it for debugging but I've 
> found I can solve this by building the tree in reverse order so no changes to 
> VOF is needed for this, just thought adding a comment somewhere may clarify 
> it but it's not really a problem.
>
> I still don't know what's MorphOS is missing, I've tried adding almost all 
> misssing properties, checked what hardware is init by the firmware and tried 
> to do the same in board reset code and even after that MorphOS still takes a 
> different route with VOF and crashes but boots with the board firmware. I'm 
> now thinking it may be either different memory organisation or the missing 
> name properties that are not returned by nextprop in VOF so they are only 
> appearing when explicitely queried whereas with the board firmware they are 
> present as properties. With the above patch I could explicitely set it on 
> nodes and test if that makes a difference.
>
> I got to this because adding more missing props or init more devices did not 
> make a difference so I'm guessing it may be something else then and the only 
> difference I can see compared to board firmware are the different memory 
> ranges in claimed (VOF puts itself to 0 for example); and the missing name 
> and additional phandle props in the device tree. MorphOS copies the whole 
> device tree on startup then later it uses this copy of the device tree after 
> shutting down OF with quiesce. I can imagine it may use some name props like 
> that on the cpu node without checking assuming it's always there and if we're 
> missing that it may cause a NULL dereference. I have no better idea what else 
> could be missing so I'll test this next. If it helps I can try to come up 
> with a patch to VOF to return these name props or allow setting them as 
> above.

Looks like it's the missing name props after all. Adding it to /memory and 
cpu makes it go further but probably needs more as it then does not find 
the boot device. Comparing with the device tree created by board firmware 
not all nodes seem to have a name property so maybe the board firmware 
also adds this to some nodes explicitely overriding the default so we 
should do the same in VOF for which the above patch is enough. Feel free 
to squash it into the next vof patch version or I can submit it afterwards 
as a separate patch whichever you prefer. Then I'll need to find out what 
other name props I need to set in board code for MorphOS. Linux does not 
seem to need any of the name props and boots without them. What I have now 
is good enough for Linux but if I can also fix MorphOS that would make it 
simpler to use because then one does not need the non-distributable 
firmware ROM which is the point of trying to use VOF here.

Regards,
BALATON Zoltan
BALATON Zoltan June 6, 2021, 10:21 p.m. UTC | #53
On Fri, 4 Jun 2021, David Gibson wrote:
> On Wed, Jun 02, 2021 at 02:29:29PM +0200, BALATON Zoltan wrote:
>> On Wed, 2 Jun 2021, David Gibson wrote:
>>> On Thu, May 27, 2021 at 02:42:39PM +0200, BALATON Zoltan wrote:
>>>> On Thu, 27 May 2021, David Gibson wrote:
>>>>> On Tue, May 25, 2021 at 12:08:45PM +0200, BALATON Zoltan wrote:
>>>>>> On Tue, 25 May 2021, David Gibson wrote:
>>>>>>> On Mon, May 24, 2021 at 12:55:07PM +0200, BALATON Zoltan wrote:
>>>>>>>> On Mon, 24 May 2021, David Gibson wrote:
>>>>>>>>> On Sun, May 23, 2021 at 07:09:26PM +0200, BALATON Zoltan wrote:
>>>>>>>>>> On Sun, 23 May 2021, BALATON Zoltan wrote:
>>>>>>>>>> and would using sc 1 for hypercalls on pegasos2 cause other
>>>>>>>>>> problems later even if the assert could be removed?
>>>>>>>>>
>>>>>>>>> At least in the short term, I think you probably can remove the
>>>>>>>>> assert.  In your case the 'sc 1' calls aren't truly to a hypervisor,
>>>>>>>>> but a special case escape to qemu for the firmware emulation.  I think
>>>>>>>>> it's unlikely to cause problems later, because nothing on a 32-bit
>>>>>>>>> system should be attempting an 'sc 1'.  The only thing I can think of
>>>>>>>>> that would fail is some test case which explicitly verified that 'sc
>>>>>>>>> 1' triggered a 0x700 (SIGILL from userspace).
>>>>>>>>
>>>>>>>> OK so the assert should check if the CPU has an HV bit. I think there was a
>>>>>>>> #detine for that somewhere that I can add to the assert then I can try that.
>>>>>>>> What I wasn't sure about is that sc 1 would conflict with the guest's usage
>>>>>>>> of normal sc calls or are these going through different paths and only sc 1
>>>>>>>> will trigger vhyp callback not affecting notmal sc calls?
>>>>>>>
>>>>>>> The vhyp shouldn't affect normal system calls, 'sc 1' is specifically
>>>>>>> for hypercalls, as opposed to normal 'sc' (a.k.a. 'sc 0'), and the
>>>>>>> vhyp only intercepts the hypercall version (after all Linux on PAPR
>>>>>>> certainly uses its own system calls, and hypercalls are active for the
>>>>>>> lifetime of the guest there).
>>>>>>>
>>>>>>>> (Or if this causes
>>>>>>>> an otherwise unnecessary VM exit on KVM even when it works then maybe
>>>>>>>> looking for a different way in the future might be needed.
>>>>>>>
>>>>>>> What you're doing here won't work with KVM as it stands.  There are
>>>>>>> basically two paths into the vhyp hypercall path: 1) from TCG, if we
>>>>>>> interpret an 'sc 1' instruction we enter vhyp, 2) from KVM, if we get
>>>>>>> a KVM_EXIT_PAPR_HCALL KVM exit then we also go to the vhyp path.
>>>>>>>
>>>>>>> The second path is specific to the PAPR (ppc64) implementation of KVM,
>>>>>>> and will not work for a non-PAPR platform without substantial
>>>>>>> modification of the KVM code.
>>>>>>
>>>>>> OK so then at that point when we try KVM we'll need to look at alternative
>>>>>> ways, I think MOL OSI worked with KVM at least in MOL but will probably make
>>>>>> all syscalls exit KVM but since we'll probably need to use KVM PR it will
>>>>>> exit anyway. For now I keep this vhyp as it does not run with KVM for other
>>>>>> reasons yet so that's another area to clean up so as a proof of concept
>>>>>> first version of using VOF vhyp will do.
>>>>>
>>>>> Eh, since you'll need to modify KVM anyway, it probably makes just as
>>>>> much sense to modify it to catch the 'sc 1' as MoL's magic thingy.
>>>>
>>>> I'm not sure how KVM works for this case so I also don't know why and what
>>>> would need to be modified. I think we'll only have KVM PR working as newer
>>>> POWER CPUs having HV (besides being rare among potential users) are probably
>>>> too different to run the OSes that expect at most a G4 on pegasos2 so likely
>>>> it won't work with KVM HV.
>>>
>>> Oh, it definitely won't work with KVM HV.
>>>
>>>> If we have KVM PR doesn't sc already trap so we
>>>> could add MOL OSI without further modification to KVM itself only needing
>>>> change in QEMU?
>>>
>>> Uh... I guess so?
>>>
>>>> I also hope that MOL OSI could be useful for porting some
>>>> paravirt drivers from MOL for running Mac OS X on Mac emulation but I don't
>>>> know about that for sure so I'm open to any other solution too.
>>>
>>> Maybe.  I never know much about MOL to begin with, and anything I did
>>> know was a decade or more ago so I've probably forgotten.
>>
>> That may still be more than what I know about it since I never had any
>> knowledge about PPC KVM and don't have any PPC hardware to test with so I'm
>> mostly guessing. (I could test with KVM emulated in QEMU and I did set up an
>> environment for that but that's a bit slow and inconvenient so I'd leave KVM
>> support to those interested and have more knowledge and hardware for it.)
>
> Sounds like a problem for someone else another time, then.

So now that it works on TCG with vhyp I tried what it would do on KVM PR 
with the sc 1 but I could only test that on QEMU itself running in a Linux 
guest. First I've hit missing this callback:

https://git.qemu.org/?p=qemu.git;a=blob;f=target/ppc/kvm.c;h=104a308abb5700b2fe075397271f314d7f607543;hb=HEAD#l856

that I can fix by providing a callback in pegasos2.c that does what the 
else clause would do returning POWERPC_CPU(current_cpu)->env.spr[SPR_SDR1] 
(I guess that's the correct thing to do if it works without vhyp).

After getting past this, the host QEMU crashed on the first sc 1 call with 
this error:

qemu: fatal: Trying to deliver HV exception (MSR) 8 with no HV support

NIP 0000000000000148   LR 0000000000000590 CTR 0000000000000000 XER 0000000000000000 CPU#0
MSR 000000000000d032 HID0 0000000060000000  HF 00004012 iidx 0 didx 0
TB 00000203 876006644638 DECR 422427
GPR00 0000000000000680 000000000000fe90 0000000000008e00 000000000000f005
GPR04 000000000000fe9c 0000000000000001 0000000000000e78 0000000000000000
GPR08 000000000000fe98 000000000000fe9c 0000000000000001 0000000000000000
GPR12 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR16 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR24 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR28 0000000000000000 0000000000000000 0000000000008e9c 000000000000fe90
CR 20000000  [ E  -  -  -  -  -  -  -  ]             RES ffffffffffffffff
FPR00 bff0000000000000 0000000000000000 0000000000000000 0000000000000000
FPR04 0000000000000000 0000000000000000 0000000000000000 0000000000000000
FPR08 0000000000000000 0000000000000000 0000000000000000 0000000000000000
FPR12 3ff553f7ced91687 0000000000000000 0000000000000000 0000000000000000
FPR16 0000000000000000 0000000000000000 0000000000000000 0000000000000000
FPR20 0000000000000000 0000000000000000 0000000000000000 0000000000000000
FPR24 0000000000000000 0000000000000000 0000000000000000 0000000000000000
FPR28 0000000000000000 0000000000000000 0000000000000000 0000000000000000
FPSCR 0000000082004000
  SRR0 00000000000001d4  SRR1 300000000000d032    PVR 00000000003c0301 VRSAVE 00000000ffffffff
SPRG0 000000003fe00000 SPRG1 c00000000ff60000  SPRG2 c00000000ff60000  SPRG3 0000000000000000
SPRG4 0000000000000000 SPRG5 0000000000000000  SPRG6 0000000000000000  SPRG7 0000000000000000
  SDR1 000000003f000006   DAR f00000000090abf0  DSISR 0000000042000000
Aborted (core dumped)

(vof.bin looks like this:

  100:   3c 40 00 00     lis     r2,0
  104:   60 42 8e 00     ori     r2,r2,36352
  108:   48 00 00 cc     b       0x1d4
  10c:   3c 40 00 00     lis     r2,0
  110:   60 42 8e 00     ori     r2,r2,36352
  114:   94 21 ff 90     stwu    r1,-112(r1)
  118:   93 e1 00 68     stw     r31,104(r1)
  11c:   7f e8 02 a6     mflr    r31
  120:   48 00 02 8d     bl      0x3ac
  124:   60 00 00 00     nop
  128:   7f e8 03 a6     mtlr    r31
  12c:   83 e1 00 68     lwz     r31,104(r1)
  130:   38 21 00 70     addi    r1,r1,112
  134:   4e 80 00 20     blr
  138:   7c 64 1b 78     mr      r4,r3
  13c:   3c 60 00 00     lis     r3,0
  140:   60 63 f0 05     ori     r3,r3,61445
  144:   44 00 00 22     sc      1
  148:   4e 80 00 20     blr

so I think it's the sc 1 at 0x144) The error is coming from here:

https://git.qemu.org/?p=qemu.git;a=blob;f=target/ppc/excp_helper.c;h=fd147e2a37662456d30f7ab74b23bfb036260ced;hb=HEAD#l830

What does this mean? What would a real CPU do with this and where it could 
be catched to use as hypercall method on CPUs without HV or what else 
should we do if we wanted this to work with KVM PR too in the future?

Regards,
BALATON Zoltan

>>>> For now I'm
>>>> going with vhyp which is enough fot testing with TCG and if somebody wants
>>>> KVM they could use he original firmware for now so this could be improved in
>>>> a later version unless a simple solution is found before the freeze for 6.1.
>>>> If we're in KVM PR what happens for sc 1 could that be used too so maybe
>>>> what we have now could work?
>>>
>>> Note that if you do go down the MOL path it wouldn't be that complex
>>> to make a "vMOL" interface so you can use the same mechanism for KVM
>>> and TCG.
>>
>> Not sure what you mean by VMOL. Is it modifying MOL to use sc 1 like VOF
>> instead of its OSI way for hypercalls?
>
> No, I mean on the qemu side adding an optional hook which will
> intercept sc 0 instructions with the MOL magic register values and
> redirect them to a machine registered callback, rather than emulating
> the CPU's behaviour of jumping to the system call vector in guest
> space.
>
> Basically an equivalent of vhyp, but for MOL magic syscalls, instead
> of hypercalls.
>
>> That would lose the advantage of
>> being able to reuse MOL guest drivers without modification (which might be
>> useful for running OS X guest on Mac emulation) so if we can't use vhyp then
>> maybe using OSI would be the next choice for this reason but for now vhyp
>> seems to be working for what I could test so unless somebody here sees a
>> problem with it and has a better idea I'm going with vhyp for now just
>> because that's what VOF uses and I don't want to modify VOF to reuse it as
>> it is so I don't need to maintain a separate version and also get any
>> enhancements without further need to sync with spapr VOF.
>>
>> I've found this document about possible hypercall interfaces on KVM (see
>> Hypercall ABIs at the end):
>>
>> https://www.kernel.org/doc/html/latest/virt/kvm/ppc-pv.html
>>
>> Having both ePAPR (1.) and PAPR (2.) hypercalls is a bit confusing. Does
>> vhyp correspond to 2. PAPR?
>
> Yes.
>
>> The ePAPR (1.) seems to be preferred by KVM and
>> MOL OSI supported for compatibility.
>
> That document looks pretty out of date.  Most of it is only discussing
> KVM PR, which is now barely maintained.  KVM HV only works with PAPR
> hypercalls.
>
>> So if we need something else instead of
>> 2. PAPR hypercalls there seems to be two options: ePAPR and MOL OSI which
>> should work with KVM but then I'm not sure how to handle those on TCG.
David Gibson June 7, 2021, 3:02 a.m. UTC | #54
On Fri, Jun 04, 2021 at 03:27:12PM +0200, BALATON Zoltan wrote:
> On Fri, 4 Jun 2021, David Gibson wrote:
> > On Tue, Jun 01, 2021 at 04:12:44PM +0200, BALATON Zoltan wrote:
> > > On Tue, 1 Jun 2021, Alexey Kardashevskiy wrote:
> > > > On 31/05/2021 23:07, BALATON Zoltan wrote:
> > > > > On Sun, 30 May 2021, BALATON Zoltan wrote:
> > > > > > On Thu, 20 May 2021, Alexey Kardashevskiy wrote:
> > > > > > > diff --git a/hw/ppc/vof.c b/hw/ppc/vof.c
> > > > > > > new file mode 100644
> > > > > > > index 000000000000..a283b7d251a7
> > > > > > > --- /dev/null
> > > > > > > +++ b/hw/ppc/vof.c
> > > > > > > @@ -0,0 +1,1021 @@
> > > > > > > +/*
> > > > > > > + * QEMU PowerPC Virtual Open Firmware.
> > > > > > > + *
> > > > > > > + * This implements client interface from OpenFirmware
> > > > > > > IEEE1275 on the QEMU
> > > > > > > + * side to leave only a very basic firmware in the VM.
> > > > > > > + *
> > > > > > > + * Copyright (c) 2021 IBM Corporation.
> > > > > > > + *
> > > > > > > + * SPDX-License-Identifier: GPL-2.0-or-later
> > > > > > > + */
> > > > > > > +
> > > > > > > +#include "qemu/osdep.h"
> > > > > > > +#include "qemu-common.h"
> > > > > > > +#include "qemu/timer.h"
> > > > > > > +#include "qemu/range.h"
> > > > > > > +#include "qemu/units.h"
> > > > > > > +#include "qapi/error.h"
> > > > > > > +#include <sys/ioctl.h>
> > > > > > > +#include "exec/ram_addr.h"
> > > > > > > +#include "exec/address-spaces.h"
> > > > > > > +#include "hw/ppc/vof.h"
> > > > > > > +#include "hw/ppc/fdt.h"
> > > > > > > +#include "sysemu/runstate.h"
> > > > > > > +#include "qom/qom-qobject.h"
> > > > > > > +#include "trace.h"
> > > > > > > +
> > > > > > > +#include <libfdt.h>
> > > > > > > +
> > > > > > > +/*
> > > > > > > + * OF 1275 "nextprop" description suggests is it 32 bytes max but
> > > > > > > + * LoPAPR defines "ibm,query-interrupt-source-number" which
> > > > > > > is 33 chars long.
> > > > > > > + */
> > > > > > > +#define OF_PROPNAME_LEN_MAX 64
> > > > > > > +
> > > > > > > +#define VOF_MAX_PATH        256
> > > > > > > +#define VOF_MAX_SETPROPLEN  2048
> > > > > > > +#define VOF_MAX_METHODLEN   256
> > > > > > > +#define VOF_MAX_FORTHCODE   256
> > > > > > > +#define VOF_VTY_BUF_SIZE    256
> > > > > > > +
> > > > > > > +typedef struct {
> > > > > > > +    uint64_t start;
> > > > > > > +    uint64_t size;
> > > > > > > +} OfClaimed;
> > > > > > > +
> > > > > > > +typedef struct {
> > > > > > > +    char *path; /* the path used to open the instance */
> > > > > > > +    uint32_t phandle;
> > > > > > > +} OfInstance;
> > > > > > > +
> > > > > > > +#define VOF_MEM_READ(pa, buf, size) \
> > > > > > > +    address_space_read_full(&address_space_memory, \
> > > > > > > +    (pa), MEMTXATTRS_UNSPECIFIED, (buf), (size))
> > > > > > > +#define VOF_MEM_WRITE(pa, buf, size) \
> > > > > > > +    address_space_write(&address_space_memory, \
> > > > > > > +    (pa), MEMTXATTRS_UNSPECIFIED, (buf), (size))
> > > > > > > +
> > > > > > > +static int readstr(hwaddr pa, char *buf, int size)
> > > > > > > +{
> > > > > > > +    if (VOF_MEM_READ(pa, buf, size) != MEMTX_OK) {
> > > > > > > +        return -1;
> > > > > > > +    }
> > > > > > > +    if (strnlen(buf, size) == size) {
> > > > > > > +        buf[size - 1] = '\0';
> > > > > > > +        trace_vof_error_str_truncated(buf, size);
> > > > > > > +        return -1;
> > > > > > > +    }
> > > > > > > +    return 0;
> > > > > > > +}
> > > > > > > +
> > > > > > > +static bool cmpservice(const char *s, unsigned nargs, unsigned nret,
> > > > > > > +                       const char *s1, unsigned nargscheck,
> > > > > > > unsigned nretcheck)
> > > > > > > +{
> > > > > > > +    if (strcmp(s, s1)) {
> > > > > > > +        return false;
> > > > > > > +    }
> > > > > > > +    if ((nargscheck && (nargs != nargscheck)) ||
> > > > > > > +        (nretcheck && (nret != nretcheck))) {
> > > > > > > +        trace_vof_error_param(s, nargscheck, nretcheck, nargs, nret);
> > > > > > > +        return false;
> > > > > > > +    }
> > > > > > > +
> > > > > > > +    return true;
> > > > > > > +}
> > > > > > > +
> > > > > > > +static void prop_format(char *tval, int tlen, const void *prop, int len)
> > > > > > > +{
> > > > > > > +    int i;
> > > > > > > +    const unsigned char *c;
> > > > > > > +    char *t;
> > > > > > > +    const char bin[] = "...";
> > > > > > > +
> > > > > > > +    for (i = 0, c = prop; i < len; ++i, ++c) {
> > > > > > > +        if (*c == '\0' && i == len - 1) {
> > > > > > > +            strncpy(tval, prop, tlen - 1);
> > > > > > > +            return;
> > > > > > > +        }
> > > > > > > +        if (*c < 0x20 || *c >= 0x80) {
> > > > > > > +            break;
> > > > > > > +        }
> > > > > > > +    }
> > > > > > > +
> > > > > > > +    for (i = 0, c = prop, t = tval; i < len; ++i, ++c) {
> > > > > > > +        if (t >= tval + tlen - sizeof(bin) - 1 - 2 - 1) {
> > > > > > > +            strcpy(t, bin);
> > > > > > > +            return;
> > > > > > > +        }
> > > > > > > +        if (i && i % 4 == 0 && i != len - 1) {
> > > > > > > +            strcat(t, " ");
> > > > > > > +            ++t;
> > > > > > > +        }
> > > > > > > +        t += sprintf(t, "%02X", *c & 0xFF);
> > > > > > > +    }
> > > > > > > +}
> > > > > > > +
> > > > > > > +static int get_path(const void *fdt, int offset, char *buf, int len)
> > > > > > > +{
> > > > > > > +    int ret;
> > > > > > > +
> > > > > > > +    ret = fdt_get_path(fdt, offset, buf, len - 1);
> > > > > > > +    if (ret < 0) {
> > > > > > > +        return ret;
> > > > > > > +    }
> > > > > > > +
> > > > > > > +    buf[len - 1] = '\0';
> > > > > > > +
> > > > > > > +    return strlen(buf) + 1;
> > > > > > > +}
> > > > > > > +
> > > > > > > +static int phandle_to_path(const void *fdt, uint32_t ph,
> > > > > > > char *buf, int len)
> > > > > > > +{
> > > > > > > +    int ret;
> > > > > > > +
> > > > > > > +    ret = fdt_node_offset_by_phandle(fdt, ph);
> > > > > > > +    if (ret < 0) {
> > > > > > > +        return ret;
> > > > > > > +    }
> > > > > > > +
> > > > > > > +    return get_path(fdt, ret, buf, len);
> > > > > > > +}
> > > > > > > +
> > > > > > > +static uint32_t vof_finddevice(const void *fdt, uint32_t nodeaddr)
> > > > > > > +{
> > > > > > > +    char fullnode[VOF_MAX_PATH];
> > > > > > > +    uint32_t ret = -1;
> > > > > > > +    int offset;
> > > > > > > +
> > > > > > > +    if (readstr(nodeaddr, fullnode, sizeof(fullnode))) {
> > > > > > > +        return (uint32_t) ret;
> > > > > > > +    }
> > > > > > > +
> > > > > > > +    offset = fdt_path_offset(fdt, fullnode);
> > > > > > > +    if (offset >= 0) {
> > > > > > > +        ret = fdt_get_phandle(fdt, offset);
> > > > > > > +    }
> > > > > > > +    trace_vof_finddevice(fullnode, ret);
> > > > > > > +    return (uint32_t) ret;
> > > > > > > +}
> > > > > > 
> > > > > > The Linux init function that runs on pegasos2 here:
> > > > > > 
> > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/powerpc/kernel/prom_init.c?h=v4.14.234#n2658
> > > > > > 
> > > > > > calls finddevice once with isa@c and next with isa@C (small and
> > > > > > capital C) both of which works with the board firmware but with
> > > > > > vof the comparison is case sensitive and one of these fails so I
> > > > > > can't make it work. I don't know if this is a problem in libfdt
> > > > > > or the vof_finddevice above should do something else to get case
> > > > > > insensitive comparison.
> > > > > 
> > > > > This fixes the issue with Linux but I'm not sure if there's any
> > > > > better solution or would it break anything else.
> > > > 
> > > > The bit after "@" is an address and needs to be case insensitive and
> > > > I'll fix this indeed. I'm not so sure about the part before "@", I
> > > > cannot imagine what could break if I made search insensitive to case. Hm
> > > > :-/
> > > 
> > > Fixing the match in the address part is probably enough as the name sent by
> > > guests is probably always lower case
> > 
> > I'm confused, I thought you just said that it looked for both isa@c
> > and isa@C, which seems to contradict guests always using lower case.
> 
> I mean the part before the @ sign (that is the name part, "isa" above) is
> always lower case. I haven't seen guests trying to query that with other
> than lower case

Ah, I see.  Yes, I think you can count on that, because I believe even
in traditional OF the part before the @ *is* case-sensitive.  At least
there are certainly conventions about how the vendor is capitalized,
so I assume it is.

> but the part after @ can be different even in the same guest
> code just a few lines apart as in the Linux kernel. So fixing the comparison
> to e.g. do toupper in the address part after @ should work I think even if
> we continue to do case sensitive comparison in the name part. Alexey said
> he'll fix that so there's no problem.

Yeah, that will probably work fine in practice.  It's not technically
correct in all cases, because how you're supposed to do the comparison
depends on the bus type.
David Gibson June 7, 2021, 3:05 a.m. UTC | #55
On Fri, Jun 04, 2021 at 03:50:28PM +0200, BALATON Zoltan wrote:
> On Fri, 4 Jun 2021, David Gibson wrote:
> > On Sun, May 30, 2021 at 07:33:01PM +0200, BALATON Zoltan wrote:
[snip]
> > > MorphOS checks the name property of the root node ("/") to decide what
> > > platform it runs on so we may need to be able to set this property on /
> > > where it should return "bplan,Pegasos2", therefore the above maybe should do
> > > getprop first and only generate name property if it's not set (or at least
> > > check if we're on the root node and allow setting name property there). (On
> > > Macs the root node is named "device-tree" and this was before found to be
> > > needed for MorphOS.)
> > 
> > Ah.  Hrm.  Have to think about what to do about that.
> 
> This is easy to fix, this seems to allow setting a name property or return a
> default:
> 
> > diff --git a/hw/ppc/vof.c b/hw/ppc/vof.c
> index b47bbd509d..746842593e 100644
> --- a/hw/ppc/vof.c
> +++ b/hw/ppc/vof.c
> @@ -163,14 +163,14 @@ static uint32_t vof_finddevice(const void *fdt,
> uint32_t nodeaddr)
>  static const void *getprop(const void *fdt, int nodeoff, const char *propname,
>                             int *proplen, bool *write0)
>  {
> -    const char *unit, *prop;
> +    const char *unit, *prop = fdt_getprop(fdt, nodeoff, propname, proplen);
> 
>      /*
>       * The "name" property is not actually stored as a property in the FDT,
>       * we emulate it by returning a pointer to the node's name and adjust
>       * proplen to include only the name but not the unit.
>       */
> -    if (strcmp(propname, "name") == 0) {
> +    if (!prop && strcmp(propname, "name") == 0) {
>          prop = fdt_get_name(fdt, nodeoff, proplen);
>          if (!prop) {
>              *proplen = 0;
> @@ -196,7 +196,7 @@ static const void *getprop(const void *fdt, int nodeoff, const char *propname,
>      if (write0) {
>          *write0 = false;
>      }
> -    return fdt_getprop(fdt, nodeoff, propname, proplen);
> +    return prop;
>  }

Kind of a hack, but it'll do for now.

> This allows adding a name property to "/" different from the default but
> this does not yet fix MorphOS booting with VOF on pegasos2. I think it tries
> to query name on / and check if it's called "device-tree" in which case it
> assumes Mac hardware otherwise it goes with pegasos2 so even if we return
> nothing for name it would not matter in this case as we don't use VOF on
> Mac. If we wanted that then this would become a problem so it could be fixed
> now in advance just in case other guests may need it.
> 
> > > Other than the above two problems, I've found that getting the device tree
> > > from vof returns it in reverse order compared to the board firmware if I add
> > > it the expected order. This may or may not be a problem but to avoid it I
> > > can build the tree in reverse order then it comes out right so unless
> > > there's an easy fix this should not cause a problem but may worth a comment
> > > somewhere.
> > 
> > The order of things in the device tree *should* never matter.  If it
> > does, that's definitely a client bug... but of course that doesn't
> > necessarily mean we won't have to work around it in practice.
> 
> I don't know if it matters or not but having the device tree in the same
> order as the firmware ROM helps with comparing it for debugging but I've
> found I can solve this by building the tree in reverse order so no changes
> to VOF is needed for this, just thought adding a comment somewhere may
> clarify it but it's not really a problem.
> 
> I still don't know what's MorphOS is missing, I've tried adding almost all
> misssing properties, checked what hardware is init by the firmware and tried
> to do the same in board reset code and even after that MorphOS still takes a
> different route with VOF and crashes but boots with the board firmware. I'm
> now thinking it may be either different memory organisation or the missing
> name properties that are not returned by nextprop in VOF so they are only
> appearing when explicitely queried whereas with the board firmware they are
> present as properties. With the above patch I could explicitely set it on
> nodes and test if that makes a difference.
> 
> I got to this because adding more missing props or init more devices did not
> make a difference so I'm guessing it may be something else then and the only
> difference I can see compared to board firmware are the different memory
> ranges in claimed (VOF puts itself to 0 for example); and the missing name
> and additional phandle props in the device tree. MorphOS copies the whole
> device tree on startup then later it uses this copy of the device tree after
> shutting down OF with quiesce. I can imagine it may use some name props like
> that on the cpu node without checking assuming it's always there and if
> we're missing that it may cause a NULL dereference. I have no better idea
> what else could be missing so I'll test this next. If it helps I can try to
> come up with a patch to VOF to return these name props or allow setting them
> as above.
> 
> Regards,
> BALATON Zoltan
>
David Gibson June 7, 2021, 3:30 a.m. UTC | #56
On Fri, Jun 04, 2021 at 03:59:22PM +0200, BALATON Zoltan wrote:
> On Fri, 4 Jun 2021, David Gibson wrote:
> > On Wed, Jun 02, 2021 at 02:29:29PM +0200, BALATON Zoltan wrote:
> > > On Wed, 2 Jun 2021, David Gibson wrote:
> > > > On Thu, May 27, 2021 at 02:42:39PM +0200, BALATON Zoltan wrote:
> > > > > On Thu, 27 May 2021, David Gibson wrote:
> > > > > > On Tue, May 25, 2021 at 12:08:45PM +0200, BALATON Zoltan wrote:
> > > > > > > On Tue, 25 May 2021, David Gibson wrote:
> > > > > > > > On Mon, May 24, 2021 at 12:55:07PM +0200, BALATON Zoltan wrote:
> > > > > > > > > On Mon, 24 May 2021, David Gibson wrote:
> > > > > > > > > > On Sun, May 23, 2021 at 07:09:26PM +0200, BALATON Zoltan wrote:
> > > > > > > > > > > On Sun, 23 May 2021, BALATON Zoltan wrote:
> > > > > > > > > > > > On Sun, 23 May 2021, Alexey Kardashevskiy wrote:
> > > > > > > > > > > > > One thing to note about PCI is that normally I think the client
> > > > > > > > > > > > > expects the firmware to do PCI probing and SLOF does it. But VOF
> > > > > > > > > > > > > does not and Linux scans PCI bus(es) itself. Might be a problem for
> > > > > > > > > > > > > you kernel.
> > > > > > > > > > > > 
> > > > > > > > > > > > I'm not sure what info does MorphOS get from the device tree and what it
> > > > > > > > > > > > probes itself but I think it may at least need device ids and info about
> > > > > > > > > > > > the PCI bus to be able to access the config regs, after that it should
> > > > > > > > > > > > set the devices up hopefully. I could add these from the board code to
> > > > > > > > > > > > device tree so VOF does not need to do anything about it. However I'm
> > > > > > > > > > > > not getting to that point yet because it crashes on something that it's
> > > > > > > > > > > > missing and couldn't yet find out what is that.
> > > > > > > > > > > > 
> > > > > > > > > > > > I'd like to get Linux working now as that would be enough to test this
> > > > > > > > > > > > and then if for MorphOS we still need a ROM it's not a problem if at
> > > > > > > > > > > > least we can boot Linux without the original firmware. But I can't make
> > > > > > > > > > > > Linux open a serial console and I don't know what it needs for that. Do
> > > > > > > > > > > > you happen to know? I've looked at the sources in Linux/arch/powerpc but
> > > > > > > > > > > > not sure how it would find and open a serial port on pegasos2. It seems
> > > > > > > > > > > > to work with the board firmware and now I can get it to boot with VOF
> > > > > > > > > > > > but then it does not open serial so it probably needs something in the
> > > > > > > > > > > > device tree or expects the firmware to set something up that we should
> > > > > > > > > > > > add in pegasos2.c when using VOF.
> > > > > > > > > > > 
> > > > > > > > > > > I've now found that Linux uses rtas methods read-pci-config and
> > > > > > > > > > > write-pci-config for PCI access on pegasos2 so this means that we'll
> > > > > > > > > > > probably need rtas too (I hoped we could get away without it if it were only
> > > > > > > > > > > used for shutdown/reboot or so but seems Linux needs it for PCI as well and
> > > > > > > > > > > does not scan the bus and won't find some devices without it).
> > > > > > > > > > 
> > > > > > > > > > Yes, definitely sounds like you'll need an RTAS implementation.
> > > > > > > > > 
> > > > > > > > > I plan to fix that after managed to get serial working as that seems to not
> > > > > > > > > need it. If I delete the rtas-size property from /rtas on the original
> > > > > > > > > firmware that makes Linux skip instantiating rtas, but I still get serial
> > > > > > > > > output just not accessing PCI devices. So I think it should work and keeps
> > > > > > > > > things simpler at first. Then I'll try rtas later.
> > > > > > > > > 
> > > > > > > > > > > While VOF can do rtas, this causes a problem with the hypercall method using
> > > > > > > > > > > sc 1 that goes through vhyp but trips the assert in ppc_store_sdr1() so
> > > > > > > > > > > cannot work after guest is past quiesce.
> > > > > > > > > > 
> > > > > > > > > > > So the question is why is that
> > > > > > > > > > > assert there
> > > > > > > > > > 
> > > > > > > > > > Ah.. right.  So, vhyp was designed for the PAPR use case, where we
> > > > > > > > > > want to model the CPU when it's in supervisor and user mode, but not
> > > > > > > > > > when it's in hypervisor mode.  We want qemu to mimic the behaviour of
> > > > > > > > > > the hypervisor, rather than attempting to actually execute hypervisor
> > > > > > > > > > code in the virtual CPU.
> > > > > > > > > > 
> > > > > > > > > > On systems that have a hypervisor mode, SDR1 is hypervisor privileged,
> > > > > > > > > > so it makes no sense for the guest to attempt to set it.  That should
> > > > > > > > > > be caught by the general SPR code and turned into a 0x700, hence the
> > > > > > > > > > assert() if we somehow reach ppc_store_sdr1().
> > > > > > > > > > 
> > > > > > > > > > So, we are seeing a problem here because you want the 'sc 1'
> > > > > > > > > > interception of vhyp, but not the rest of the stuff that goes with it.
> > > > > > > > > > 
> > > > > > > > > > > and would using sc 1 for hypercalls on pegasos2 cause other
> > > > > > > > > > > problems later even if the assert could be removed?
> > > > > > > > > > 
> > > > > > > > > > At least in the short term, I think you probably can remove the
> > > > > > > > > > assert.  In your case the 'sc 1' calls aren't truly to a hypervisor,
> > > > > > > > > > but a special case escape to qemu for the firmware emulation.  I think
> > > > > > > > > > it's unlikely to cause problems later, because nothing on a 32-bit
> > > > > > > > > > system should be attempting an 'sc 1'.  The only thing I can think of
> > > > > > > > > > that would fail is some test case which explicitly verified that 'sc
> > > > > > > > > > 1' triggered a 0x700 (SIGILL from userspace).
> > > > > > > > > 
> > > > > > > > > OK so the assert should check if the CPU has an HV bit. I think there was a
> > > > > > > > > #detine for that somewhere that I can add to the assert then I can try that.
> > > > > > > > > What I wasn't sure about is that sc 1 would conflict with the guest's usage
> > > > > > > > > of normal sc calls or are these going through different paths and only sc 1
> > > > > > > > > will trigger vhyp callback not affecting notmal sc calls?
> > > > > > > > 
> > > > > > > > The vhyp shouldn't affect normal system calls, 'sc 1' is specifically
> > > > > > > > for hypercalls, as opposed to normal 'sc' (a.k.a. 'sc 0'), and the
> > > > > > > > vhyp only intercepts the hypercall version (after all Linux on PAPR
> > > > > > > > certainly uses its own system calls, and hypercalls are active for the
> > > > > > > > lifetime of the guest there).
> > > > > > > > 
> > > > > > > > > (Or if this causes
> > > > > > > > > an otherwise unnecessary VM exit on KVM even when it works then maybe
> > > > > > > > > looking for a different way in the future might be needed.
> > > > > > > > 
> > > > > > > > What you're doing here won't work with KVM as it stands.  There are
> > > > > > > > basically two paths into the vhyp hypercall path: 1) from TCG, if we
> > > > > > > > interpret an 'sc 1' instruction we enter vhyp, 2) from KVM, if we get
> > > > > > > > a KVM_EXIT_PAPR_HCALL KVM exit then we also go to the vhyp path.
> > > > > > > > 
> > > > > > > > The second path is specific to the PAPR (ppc64) implementation of KVM,
> > > > > > > > and will not work for a non-PAPR platform without substantial
> > > > > > > > modification of the KVM code.
> > > > > > > 
> > > > > > > OK so then at that point when we try KVM we'll need to look at alternative
> > > > > > > ways, I think MOL OSI worked with KVM at least in MOL but will probably make
> > > > > > > all syscalls exit KVM but since we'll probably need to use KVM PR it will
> > > > > > > exit anyway. For now I keep this vhyp as it does not run with KVM for other
> > > > > > > reasons yet so that's another area to clean up so as a proof of concept
> > > > > > > first version of using VOF vhyp will do.
> > > > > > 
> > > > > > Eh, since you'll need to modify KVM anyway, it probably makes just as
> > > > > > much sense to modify it to catch the 'sc 1' as MoL's magic thingy.
> > > > > 
> > > > > I'm not sure how KVM works for this case so I also don't know why and what
> > > > > would need to be modified. I think we'll only have KVM PR working as newer
> > > > > POWER CPUs having HV (besides being rare among potential users) are probably
> > > > > too different to run the OSes that expect at most a G4 on pegasos2 so likely
> > > > > it won't work with KVM HV.
> > > > 
> > > > Oh, it definitely won't work with KVM HV.
> > > > 
> > > > > If we have KVM PR doesn't sc already trap so we
> > > > > could add MOL OSI without further modification to KVM itself only needing
> > > > > change in QEMU?
> > > > 
> > > > Uh... I guess so?
> > > > 
> > > > > I also hope that MOL OSI could be useful for porting some
> > > > > paravirt drivers from MOL for running Mac OS X on Mac emulation but I don't
> > > > > know about that for sure so I'm open to any other solution too.
> > > > 
> > > > Maybe.  I never know much about MOL to begin with, and anything I did
> > > > know was a decade or more ago so I've probably forgotten.
> > > 
> > > That may still be more than what I know about it since I never had any
> > > knowledge about PPC KVM and don't have any PPC hardware to test with so I'm
> > > mostly guessing. (I could test with KVM emulated in QEMU and I did set up an
> > > environment for that but that's a bit slow and inconvenient so I'd leave KVM
> > > support to those interested and have more knowledge and hardware for it.)
> > 
> > Sounds like a problem for someone else another time, then.
> > 
> > > > > For now I'm
> > > > > going with vhyp which is enough fot testing with TCG and if somebody wants
> > > > > KVM they could use he original firmware for now so this could be improved in
> > > > > a later version unless a simple solution is found before the freeze for 6.1.
> > > > > If we're in KVM PR what happens for sc 1 could that be used too so maybe
> > > > > what we have now could work?
> > > > 
> > > > Note that if you do go down the MOL path it wouldn't be that complex
> > > > to make a "vMOL" interface so you can use the same mechanism for KVM
> > > > and TCG.
> > > 
> > > Not sure what you mean by VMOL. Is it modifying MOL to use sc 1 like VOF
> > > instead of its OSI way for hypercalls?
> > 
> > No, I mean on the qemu side adding an optional hook which will
> > intercept sc 0 instructions with the MOL magic register values and
> > redirect them to a machine registered callback, rather than emulating
> > the CPU's behaviour of jumping to the system call vector in guest
> > space.
> > 
> > Basically an equivalent of vhyp, but for MOL magic syscalls, instead
> > of hypercalls.
> 
> OK, that's basically what BenH's OSI patch I've linked to before did I
> think,

Ok, but probably cleaned up to more modern qemu approaches.

> it may just need updating for changes in target/ppc since that patch
> was created. However that would also mean we'd need another version of VOF
> that uses this instead of sc 1 then so unless we need that I'd keep a single
> VOF that works for both spapr and pegasos2.

Yeah, fair enough.

> > > That would lose the advantage of
> > > being able to reuse MOL guest drivers without modification (which might be
> > > useful for running OS X guest on Mac emulation) so if we can't use vhyp then
> > > maybe using OSI would be the next choice for this reason but for now vhyp
> > > seems to be working for what I could test so unless somebody here sees a
> > > problem with it and has a better idea I'm going with vhyp for now just
> > > because that's what VOF uses and I don't want to modify VOF to reuse it as
> > > it is so I don't need to maintain a separate version and also get any
> > > enhancements without further need to sync with spapr VOF.
> > > 
> > > I've found this document about possible hypercall interfaces on KVM (see
> > > Hypercall ABIs at the end):
> > > 
> > > https://www.kernel.org/doc/html/latest/virt/kvm/ppc-pv.html
> > > 
> > > Having both ePAPR (1.) and PAPR (2.) hypercalls is a bit confusing. Does
> > > vhyp correspond to 2. PAPR?
> > 
> > Yes.
> 
> What's ePAPR then and how is it different from PAPR? I mean the acronym not
> the hypercall method, the latter is explained in that doc but what ePAPR
> stands for and why is that method called like that is not clear to me.

Ok, history lesson time.

For a long time PAPR has been the document that described the OS
environment for IBM POWER based server hardware.  Before it was called
PAPR (POWER Architecture Platform Requirements) it was called the
"RPA" (Requirements for the POWER Architecture, I think?).  You might
see the old name in a few places.

Requiring a full Open Firmware and a bunch of other fairly heavyweight
stuff, PAPR really wasn't suitable for embedded ppc chips and boards.
The situation with those used to be a complete mess with basically
every board variant having it's own different firmware with its own
different way of presenting some fragments of vital data to the OS.

ePAPR - Embedded Power Architecture Platform Requirements - was
created as a standard to try to unify how this stuff was handled on
embedded ppc chips.  I was one of the authors on early versions of
it.  It's mostly based around giving the OS a flattened device tree,
with some deliberately minimal requirements on firmware initialization
and entry state.  Here's a link to one of those early versions:

http://elinux.org/images/c/cf/Power_ePAPR_APPROVED_v1.1.pdf

I thought there were later versions, but I couldn't seem to find any.
It's possible the process of refining later versions just petered out
as the embedded ppc world mostly died and the flattened device tree
development mostly moved to ARM.

Since some of the embedded chips from Freescale had hypervisor
capabilities, a hypercall model was added to ePAPR - but that wasn't
something I was greatly involved in, so I don't know much about it.

ePAPR is the reason that the original PAPR is sometimes referred to as
"sPAPR" to disambiguate.

> > > The ePAPR (1.) seems to be preferred by KVM and
> > > MOL OSI supported for compatibility.
> > 
> > That document looks pretty out of date.  Most of it is only discussing
> > KVM PR, which is now barely maintained.  KVM HV only works with PAPR
> > hypercalls.
> 
> The links says it's latest kernel docs, so maybe an update need to be sent
> to KVM?

I guess, but the chances of me finding time to do it are approximately
zero.

> > > So if we need something else instead of
> > > 2. PAPR hypercalls there seems to be two options: ePAPR and MOL OSI which
> > > should work with KVM but then I'm not sure how to handle those on TCG.
> > > 
> > > > > > > [...]
> > > > > > > > > > > I've tested that the missing rtas is not the reason for getting no output
> > > > > > > > > > > via serial though, as even when disabling rtas on pegasos2.rom it boots and
> > > > > > > > > > > I still get serial output just some PCI devices are not detected (such as
> > > > > > > > > > > USB, the video card and the not emulated ethernet port but these are not
> > > > > > > > > > > fatal so it might even work as a first try without rtas, just to boot a
> > > > > > > > > > > Linux kernel for testing it would be enough if I can fix the serial output).
> > > > > > > > > > > I still don't know why it's not finding serial but I think it may be some
> > > > > > > > > > > missing or wrong info in the device tree I generat. I'll try to focus on
> > > > > > > > > > > this for now and leave the above rtas question for later.
> > > > > > > > > > 
> > > > > > > > > > Oh.. another thought on that.  You have an ISA serial port on Pegasos,
> > > > > > > > > > I believe.  I wonder if the PCI->ISA bridge needs some configuration /
> > > > > > > > > > initialization that the firmware is expected to do.  If so you'll need
> > > > > > > > > > to mimic that setup in qemu for the VOF case.
> > > > > > > > > 
> > > > > > > > > That's what I begin to think because I've added everything to the device
> > > > > > > > > tree that I thought could be needed and I still don't get it working so it
> > > > > > > > > may need some config from the firmware. But how do I access device registers
> > > > > > > > > from board code? I've tried adding a machine reset method and write to
> > > > > > > > > memory mapped device registers but all my attempts failed. I've tried
> > > > > > > > > cpu_stl_le_data and even memory_region_dispatch_write but these did not get
> > > > > > > > > to the device. What's the way to access guest mmio regs from QEMU?
> > > > > > > > 
> > > > > > > > That's odd, cpu_stl() and memory_region_dispatch_write() should work
> > > > > > > > from board code (after the relevant memory regions are configured, of
> > > > > > > > course).  As an ISA serial port, it's probably accessed through IO
> > > > > > > > space, not memory space though, so you'd need &address_space_io.  And
> > > > > > > > if there is some bridge configuration then it's the bridge control
> > > > > > > > registers you need to look at not the serial registers - you'd have to
> > > > > > > > look at the bridge documentation for that.  Or, I guess the bridge
> > > > > > > > implementation in qemu, which you wrote part of.
> > > > > > > 
> > > > > > > I've found at last that stl_le_phys() works. There are so many of these that
> > > > > > > I never know when to use which.
> > > > > > > 
> > > > > > > I think the address_space_rw calls in vof_client_call() in vof.c could also
> > > > > > > use these for somewhat shorter code. I've ended up with
> > > > > > > stl_le_phys(CPU(cpu)->as, addr, val) in my machine reset methodbut I don't
> > > > > > > even need that now as it works without additional setup. Also VOF's memory
> > > > > > > access is basically the same as the already existing rtas_st() and co. so
> > > > > > > maybe that could be reused to make code smaller?
> > > > > > 
> > > > > > rtas_ld() and rtas_st() should only be used for reading/writing RTAS
> > > > > > parameters to and from memory.  Accessing IO shouldn't be done with
> > > > > > those.
> > > > > > 
> > > > > > For IO you probably want the cpu_st*() variants in most cases, since
> > > > > > you're trying to emulate an IO access from the virtual cpu.
> > > > > 
> > > > > I think I've tried that but what worked to access mmio device registers are
> > > > > stl_le_phys and similar that are wrappers around address_space_stl_*. But I
> > > > > did not mean that for rtas_ld/_st but the part when vof accessing the
> > > > > parameters passed by its hypercall which is memory access:
> > > > > 
> > > > > https://github.com/patchew-project/qemu/blob/patchew/20210520090557.435689-1-aik%40ozlabs.ru/hw/ppc/vof.c
> > > > > 
> > > > > line 893, and vof_client_call before that is very similar to what h_rtas
> > > > > does here:
> > > > > 
> > > > > https://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/spapr_hcall.c;h=f25014afda408002ee1ec1027a0dd7a6025eca61;hb=HEAD#l639
> > > > > 
> > > > > and I also need to do the same for rtas in pegasos2 for which I'm just using
> > > > > ldl_be_phys for now but I wonder if we really need 3 ways to do the same or
> > > > > the rtas_ld/_st could be made more generic and reused here?
> > > > 
> > > > For your rtas implementation you could definitely re-use them.  For
> > > > the client call I'm a bit less confident, but if the in-guest-memory
> > > > structures are really the same, then it would make sense.
> > > 
> > > The memory structure seems very similar to me, the only difference is
> > > calling the first field service in VOF instead of token in RTAS. Both are
> > > just an array of big endian unit32_t with token, nargs, nret at the front
> > > followed by args and rets. Since these rtas_ld/st are defined in spapr.h I
> > > did not bother to split them off, so for pegasos2 rtas I'm just using the
> > > ldl_be_* functions directly for which these are a shorthand for. If these
> > > were split off for sharing between spapr rtas and VOF I may be able to reuse
> > > them as well but it's not that important so just mentioned it as a possible
> > > later clean up.
> > 
> > Ok, sounds reasonable to re-use them then, though maybe add an aliased
> > name for clarity ofci_{ld,st}(), maybe?  (for "Open Firmware Client
> > Interface")
> 
> I'll wait for what Alexey decides to do in the next VOF patch version and if
> I can reuse that (I could if these were defined in vof.h). I don't want to
> come up with yet another abstraction to ldl_be_* which does not seem to make
> it more clear than using the actual functions for guest memory access which
> is what we're doing while getting the hypercall args so I think either using
> ldl_be_* directly or reusing already existing rfas_ls/_st would make sense
> but adding similar funcs with another name just makes it more confusing.

Well, the point of the rtas_ld() functions isn't o be a different way
of accessing memory.  It's just a convenience wrapper that takes an
RTAS args array and an argument index and does the right thing to
retrieve it for you.

So, if your RTAS function implementation when you want to get argument
0, you just go rtas_ld(args, 0) - more readable than having a bunch of
offset calculations and a long winded call to the BE memory access
function.  You can look at the examples in hw/ppc/sppar_rtas.c to see
how its used.

Actually, looking again at how it works, you should probably only use
rtas_ld() if your general dispatch code has pre-parsed the args
structure into separate args and rets arrays, again as we do in
spapr_rtas.c
David Gibson June 7, 2021, 3:37 a.m. UTC | #57
On Mon, Jun 07, 2021 at 12:21:21AM +0200, BALATON Zoltan wrote:
> On Fri, 4 Jun 2021, David Gibson wrote:
> > On Wed, Jun 02, 2021 at 02:29:29PM +0200, BALATON Zoltan wrote:
> > > On Wed, 2 Jun 2021, David Gibson wrote:
> > > > On Thu, May 27, 2021 at 02:42:39PM +0200, BALATON Zoltan wrote:
> > > > > On Thu, 27 May 2021, David Gibson wrote:
> > > > > > On Tue, May 25, 2021 at 12:08:45PM +0200, BALATON Zoltan wrote:
> > > > > > > On Tue, 25 May 2021, David Gibson wrote:
> > > > > > > > On Mon, May 24, 2021 at 12:55:07PM +0200, BALATON Zoltan wrote:
> > > > > > > > > On Mon, 24 May 2021, David Gibson wrote:
> > > > > > > > > > On Sun, May 23, 2021 at 07:09:26PM +0200, BALATON Zoltan wrote:
> > > > > > > > > > > On Sun, 23 May 2021, BALATON Zoltan wrote:
> > > > > > > > > > > and would using sc 1 for hypercalls on pegasos2 cause other
> > > > > > > > > > > problems later even if the assert could be removed?
> > > > > > > > > > 
> > > > > > > > > > At least in the short term, I think you probably can remove the
> > > > > > > > > > assert.  In your case the 'sc 1' calls aren't truly to a hypervisor,
> > > > > > > > > > but a special case escape to qemu for the firmware emulation.  I think
> > > > > > > > > > it's unlikely to cause problems later, because nothing on a 32-bit
> > > > > > > > > > system should be attempting an 'sc 1'.  The only thing I can think of
> > > > > > > > > > that would fail is some test case which explicitly verified that 'sc
> > > > > > > > > > 1' triggered a 0x700 (SIGILL from userspace).
> > > > > > > > > 
> > > > > > > > > OK so the assert should check if the CPU has an HV bit. I think there was a
> > > > > > > > > #detine for that somewhere that I can add to the assert then I can try that.
> > > > > > > > > What I wasn't sure about is that sc 1 would conflict with the guest's usage
> > > > > > > > > of normal sc calls or are these going through different paths and only sc 1
> > > > > > > > > will trigger vhyp callback not affecting notmal sc calls?
> > > > > > > > 
> > > > > > > > The vhyp shouldn't affect normal system calls, 'sc 1' is specifically
> > > > > > > > for hypercalls, as opposed to normal 'sc' (a.k.a. 'sc 0'), and the
> > > > > > > > vhyp only intercepts the hypercall version (after all Linux on PAPR
> > > > > > > > certainly uses its own system calls, and hypercalls are active for the
> > > > > > > > lifetime of the guest there).
> > > > > > > > 
> > > > > > > > > (Or if this causes
> > > > > > > > > an otherwise unnecessary VM exit on KVM even when it works then maybe
> > > > > > > > > looking for a different way in the future might be needed.
> > > > > > > > 
> > > > > > > > What you're doing here won't work with KVM as it stands.  There are
> > > > > > > > basically two paths into the vhyp hypercall path: 1) from TCG, if we
> > > > > > > > interpret an 'sc 1' instruction we enter vhyp, 2) from KVM, if we get
> > > > > > > > a KVM_EXIT_PAPR_HCALL KVM exit then we also go to the vhyp path.
> > > > > > > > 
> > > > > > > > The second path is specific to the PAPR (ppc64) implementation of KVM,
> > > > > > > > and will not work for a non-PAPR platform without substantial
> > > > > > > > modification of the KVM code.
> > > > > > > 
> > > > > > > OK so then at that point when we try KVM we'll need to look at alternative
> > > > > > > ways, I think MOL OSI worked with KVM at least in MOL but will probably make
> > > > > > > all syscalls exit KVM but since we'll probably need to use KVM PR it will
> > > > > > > exit anyway. For now I keep this vhyp as it does not run with KVM for other
> > > > > > > reasons yet so that's another area to clean up so as a proof of concept
> > > > > > > first version of using VOF vhyp will do.
> > > > > > 
> > > > > > Eh, since you'll need to modify KVM anyway, it probably makes just as
> > > > > > much sense to modify it to catch the 'sc 1' as MoL's magic thingy.
> > > > > 
> > > > > I'm not sure how KVM works for this case so I also don't know why and what
> > > > > would need to be modified. I think we'll only have KVM PR working as newer
> > > > > POWER CPUs having HV (besides being rare among potential users) are probably
> > > > > too different to run the OSes that expect at most a G4 on pegasos2 so likely
> > > > > it won't work with KVM HV.
> > > > 
> > > > Oh, it definitely won't work with KVM HV.
> > > > 
> > > > > If we have KVM PR doesn't sc already trap so we
> > > > > could add MOL OSI without further modification to KVM itself only needing
> > > > > change in QEMU?
> > > > 
> > > > Uh... I guess so?
> > > > 
> > > > > I also hope that MOL OSI could be useful for porting some
> > > > > paravirt drivers from MOL for running Mac OS X on Mac emulation but I don't
> > > > > know about that for sure so I'm open to any other solution too.
> > > > 
> > > > Maybe.  I never know much about MOL to begin with, and anything I did
> > > > know was a decade or more ago so I've probably forgotten.
> > > 
> > > That may still be more than what I know about it since I never had any
> > > knowledge about PPC KVM and don't have any PPC hardware to test with so I'm
> > > mostly guessing. (I could test with KVM emulated in QEMU and I did set up an
> > > environment for that but that's a bit slow and inconvenient so I'd leave KVM
> > > support to those interested and have more knowledge and hardware for it.)
> > 
> > Sounds like a problem for someone else another time, then.
> 
> So now that it works on TCG with vhyp I tried what it would do on KVM PR
> with the sc 1 but I could only test that on QEMU itself running in a Linux
> guest. First I've hit missing this callback:
> 
> https://git.qemu.org/?p=qemu.git;a=blob;f=target/ppc/kvm.c;h=104a308abb5700b2fe075397271f314d7f607543;hb=HEAD#l856
> 
> that I can fix by providing a callback in pegasos2.c that does what the else
> clause would do returning POWERPC_CPU(current_cpu)->env.spr[SPR_SDR1] (I
> guess that's the correct thing to do if it works without vhyp).

For your case, yes that's right.  Again vhyp is designed for the case
where the (hash) MMU is owned by the hypervisor.  But due to a gross
hack the way we communicate the userspace address of the hash table to
KVM PR is via the SDR1 register, which is why we need that hook.

> After getting past this, the host QEMU crashed on the first sc 1 call with
> this error:
> 
> qemu: fatal: Trying to deliver HV exception (MSR) 8 with no HV support

> NIP 0000000000000148   LR 0000000000000590 CTR 0000000000000000 XER 0000000000000000 CPU#0
> MSR 000000000000d032 HID0 0000000060000000  HF 00004012 iidx 0 didx 0
> TB 00000203 876006644638 DECR 422427
> GPR00 0000000000000680 000000000000fe90 0000000000008e00 000000000000f005
> GPR04 000000000000fe9c 0000000000000001 0000000000000e78 0000000000000000
> GPR08 000000000000fe98 000000000000fe9c 0000000000000001 0000000000000000
> GPR12 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> GPR16 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> GPR20 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> GPR24 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> GPR28 0000000000000000 0000000000000000 0000000000008e9c 000000000000fe90
> CR 20000000  [ E  -  -  -  -  -  -  -  ]             RES ffffffffffffffff
> FPR00 bff0000000000000 0000000000000000 0000000000000000 0000000000000000
> FPR04 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> FPR08 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> FPR12 3ff553f7ced91687 0000000000000000 0000000000000000 0000000000000000
> FPR16 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> FPR20 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> FPR24 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> FPR28 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> FPSCR 0000000082004000
>  SRR0 00000000000001d4  SRR1 300000000000d032    PVR 00000000003c0301 VRSAVE 00000000ffffffff
> SPRG0 000000003fe00000 SPRG1 c00000000ff60000  SPRG2 c00000000ff60000  SPRG3 0000000000000000
> SPRG4 0000000000000000 SPRG5 0000000000000000  SPRG6 0000000000000000  SPRG7 0000000000000000
>  SDR1 000000003f000006   DAR f00000000090abf0  DSISR 0000000042000000
> Aborted (core dumped)
> 
> (vof.bin looks like this:
> 
>  100:   3c 40 00 00     lis     r2,0
>  104:   60 42 8e 00     ori     r2,r2,36352
>  108:   48 00 00 cc     b       0x1d4
>  10c:   3c 40 00 00     lis     r2,0
>  110:   60 42 8e 00     ori     r2,r2,36352
>  114:   94 21 ff 90     stwu    r1,-112(r1)
>  118:   93 e1 00 68     stw     r31,104(r1)
>  11c:   7f e8 02 a6     mflr    r31
>  120:   48 00 02 8d     bl      0x3ac
>  124:   60 00 00 00     nop
>  128:   7f e8 03 a6     mtlr    r31
>  12c:   83 e1 00 68     lwz     r31,104(r1)
>  130:   38 21 00 70     addi    r1,r1,112
>  134:   4e 80 00 20     blr
>  138:   7c 64 1b 78     mr      r4,r3
>  13c:   3c 60 00 00     lis     r3,0
>  140:   60 63 f0 05     ori     r3,r3,61445
>  144:   44 00 00 22     sc      1
>  148:   4e 80 00 20     blr
> 
> so I think it's the sc 1 at 0x144) The error is coming from here:
> 
> https://git.qemu.org/?p=qemu.git;a=blob;f=target/ppc/excp_helper.c;h=fd147e2a37662456d30f7ab74b23bfb036260ced;hb=HEAD#l830
> 
> What does this mean? What would a real CPU do with this and where it could
> be catched to use as hypercall method on CPUs without HV or what else should
> we do if we wanted this to work with KVM PR too in the future?

The interesting bit is actually how we're getting to that part of
powerpc_excp.  I guess we must be getting a KVM exit for that 'sc 1',
but I don't know what type.  If we can figure out that would be where
we'd need to intercept it and send it to the vhyp handler instead of
actually trying to enter the hypercall vector on the emulated CPU,
which doesn't have one.
BALATON Zoltan June 7, 2021, 10:20 p.m. UTC | #58
On Mon, 7 Jun 2021, David Gibson wrote:
> On Mon, Jun 07, 2021 at 12:21:21AM +0200, BALATON Zoltan wrote:
>> On Fri, 4 Jun 2021, David Gibson wrote:
>>> On Wed, Jun 02, 2021 at 02:29:29PM +0200, BALATON Zoltan wrote:
>>>> On Wed, 2 Jun 2021, David Gibson wrote:
>>>>> On Thu, May 27, 2021 at 02:42:39PM +0200, BALATON Zoltan wrote:
>>>>>> On Thu, 27 May 2021, David Gibson wrote:
>>>>>>> On Tue, May 25, 2021 at 12:08:45PM +0200, BALATON Zoltan wrote:
>>>>>>>> On Tue, 25 May 2021, David Gibson wrote:
>>>>>>>>> On Mon, May 24, 2021 at 12:55:07PM +0200, BALATON Zoltan wrote:
>>>>>>>>>> On Mon, 24 May 2021, David Gibson wrote:
>>>>>>>>>>> On Sun, May 23, 2021 at 07:09:26PM +0200, BALATON Zoltan wrote:
>>>>>>>>>>>> On Sun, 23 May 2021, BALATON Zoltan wrote:
>>>>>>>>>>>> and would using sc 1 for hypercalls on pegasos2 cause other
>>>>>>>>>>>> problems later even if the assert could be removed?
>>>>>>>>>>>
>>>>>>>>>>> At least in the short term, I think you probably can remove the
>>>>>>>>>>> assert.  In your case the 'sc 1' calls aren't truly to a hypervisor,
>>>>>>>>>>> but a special case escape to qemu for the firmware emulation.  I think
>>>>>>>>>>> it's unlikely to cause problems later, because nothing on a 32-bit
>>>>>>>>>>> system should be attempting an 'sc 1'.  The only thing I can think of
>>>>>>>>>>> that would fail is some test case which explicitly verified that 'sc
>>>>>>>>>>> 1' triggered a 0x700 (SIGILL from userspace).
>>>>>>>>>>
>>>>>>>>>> OK so the assert should check if the CPU has an HV bit. I think there was a
>>>>>>>>>> #detine for that somewhere that I can add to the assert then I can try that.
>>>>>>>>>> What I wasn't sure about is that sc 1 would conflict with the guest's usage
>>>>>>>>>> of normal sc calls or are these going through different paths and only sc 1
>>>>>>>>>> will trigger vhyp callback not affecting notmal sc calls?
>>>>>>>>>
>>>>>>>>> The vhyp shouldn't affect normal system calls, 'sc 1' is specifically
>>>>>>>>> for hypercalls, as opposed to normal 'sc' (a.k.a. 'sc 0'), and the
>>>>>>>>> vhyp only intercepts the hypercall version (after all Linux on PAPR
>>>>>>>>> certainly uses its own system calls, and hypercalls are active for the
>>>>>>>>> lifetime of the guest there).
>>>>>>>>>
>>>>>>>>>> (Or if this causes
>>>>>>>>>> an otherwise unnecessary VM exit on KVM even when it works then maybe
>>>>>>>>>> looking for a different way in the future might be needed.
>>>>>>>>>
>>>>>>>>> What you're doing here won't work with KVM as it stands.  There are
>>>>>>>>> basically two paths into the vhyp hypercall path: 1) from TCG, if we
>>>>>>>>> interpret an 'sc 1' instruction we enter vhyp, 2) from KVM, if we get
>>>>>>>>> a KVM_EXIT_PAPR_HCALL KVM exit then we also go to the vhyp path.
>>>>>>>>>
>>>>>>>>> The second path is specific to the PAPR (ppc64) implementation of KVM,
>>>>>>>>> and will not work for a non-PAPR platform without substantial
>>>>>>>>> modification of the KVM code.
>>>>>>>>
>>>>>>>> OK so then at that point when we try KVM we'll need to look at alternative
>>>>>>>> ways, I think MOL OSI worked with KVM at least in MOL but will probably make
>>>>>>>> all syscalls exit KVM but since we'll probably need to use KVM PR it will
>>>>>>>> exit anyway. For now I keep this vhyp as it does not run with KVM for other
>>>>>>>> reasons yet so that's another area to clean up so as a proof of concept
>>>>>>>> first version of using VOF vhyp will do.
>>>>>>>
>>>>>>> Eh, since you'll need to modify KVM anyway, it probably makes just as
>>>>>>> much sense to modify it to catch the 'sc 1' as MoL's magic thingy.
>>>>>>
>>>>>> I'm not sure how KVM works for this case so I also don't know why and what
>>>>>> would need to be modified. I think we'll only have KVM PR working as newer
>>>>>> POWER CPUs having HV (besides being rare among potential users) are probably
>>>>>> too different to run the OSes that expect at most a G4 on pegasos2 so likely
>>>>>> it won't work with KVM HV.
>>>>>
>>>>> Oh, it definitely won't work with KVM HV.
>>>>>
>>>>>> If we have KVM PR doesn't sc already trap so we
>>>>>> could add MOL OSI without further modification to KVM itself only needing
>>>>>> change in QEMU?
>>>>>
>>>>> Uh... I guess so?
>>>>>
>>>>>> I also hope that MOL OSI could be useful for porting some
>>>>>> paravirt drivers from MOL for running Mac OS X on Mac emulation but I don't
>>>>>> know about that for sure so I'm open to any other solution too.
>>>>>
>>>>> Maybe.  I never know much about MOL to begin with, and anything I did
>>>>> know was a decade or more ago so I've probably forgotten.
>>>>
>>>> That may still be more than what I know about it since I never had any
>>>> knowledge about PPC KVM and don't have any PPC hardware to test with so I'm
>>>> mostly guessing. (I could test with KVM emulated in QEMU and I did set up an
>>>> environment for that but that's a bit slow and inconvenient so I'd leave KVM
>>>> support to those interested and have more knowledge and hardware for it.)
>>>
>>> Sounds like a problem for someone else another time, then.
>>
>> So now that it works on TCG with vhyp I tried what it would do on KVM PR
>> with the sc 1 but I could only test that on QEMU itself running in a Linux
>> guest. First I've hit missing this callback:
>>
>> https://git.qemu.org/?p=qemu.git;a=blob;f=target/ppc/kvm.c;h=104a308abb5700b2fe075397271f314d7f607543;hb=HEAD#l856
>>
>> that I can fix by providing a callback in pegasos2.c that does what the else
>> clause would do returning POWERPC_CPU(current_cpu)->env.spr[SPR_SDR1] (I
>> guess that's the correct thing to do if it works without vhyp).
>
> For your case, yes that's right.  Again vhyp is designed for the case
> where the (hash) MMU is owned by the hypervisor.  But due to a gross
> hack the way we communicate the userspace address of the hash table to
> KVM PR is via the SDR1 register, which is why we need that hook.
>
>> After getting past this, the host QEMU crashed on the first sc 1 call with
>> this error:
>>
>> qemu: fatal: Trying to deliver HV exception (MSR) 8 with no HV support
>
>> NIP 0000000000000148   LR 0000000000000590 CTR 0000000000000000 XER 0000000000000000 CPU#0
>> MSR 000000000000d032 HID0 0000000060000000  HF 00004012 iidx 0 didx 0
>> TB 00000203 876006644638 DECR 422427
>> GPR00 0000000000000680 000000000000fe90 0000000000008e00 000000000000f005
>> GPR04 000000000000fe9c 0000000000000001 0000000000000e78 0000000000000000
>> GPR08 000000000000fe98 000000000000fe9c 0000000000000001 0000000000000000
>> GPR12 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> GPR16 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> GPR20 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> GPR24 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> GPR28 0000000000000000 0000000000000000 0000000000008e9c 000000000000fe90
>> CR 20000000  [ E  -  -  -  -  -  -  -  ]             RES ffffffffffffffff
>> FPR00 bff0000000000000 0000000000000000 0000000000000000 0000000000000000
>> FPR04 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> FPR08 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> FPR12 3ff553f7ced91687 0000000000000000 0000000000000000 0000000000000000
>> FPR16 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> FPR20 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> FPR24 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> FPR28 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> FPSCR 0000000082004000
>>  SRR0 00000000000001d4  SRR1 300000000000d032    PVR 00000000003c0301 VRSAVE 00000000ffffffff
>> SPRG0 000000003fe00000 SPRG1 c00000000ff60000  SPRG2 c00000000ff60000  SPRG3 0000000000000000
>> SPRG4 0000000000000000 SPRG5 0000000000000000  SPRG6 0000000000000000  SPRG7 0000000000000000
>>  SDR1 000000003f000006   DAR f00000000090abf0  DSISR 0000000042000000
>> Aborted (core dumped)
>>
>> (vof.bin looks like this:
>>
>>  100:   3c 40 00 00     lis     r2,0
>>  104:   60 42 8e 00     ori     r2,r2,36352
>>  108:   48 00 00 cc     b       0x1d4
>>  10c:   3c 40 00 00     lis     r2,0
>>  110:   60 42 8e 00     ori     r2,r2,36352
>>  114:   94 21 ff 90     stwu    r1,-112(r1)
>>  118:   93 e1 00 68     stw     r31,104(r1)
>>  11c:   7f e8 02 a6     mflr    r31
>>  120:   48 00 02 8d     bl      0x3ac
>>  124:   60 00 00 00     nop
>>  128:   7f e8 03 a6     mtlr    r31
>>  12c:   83 e1 00 68     lwz     r31,104(r1)
>>  130:   38 21 00 70     addi    r1,r1,112
>>  134:   4e 80 00 20     blr
>>  138:   7c 64 1b 78     mr      r4,r3
>>  13c:   3c 60 00 00     lis     r3,0
>>  140:   60 63 f0 05     ori     r3,r3,61445
>>  144:   44 00 00 22     sc      1
>>  148:   4e 80 00 20     blr
>>
>> so I think it's the sc 1 at 0x144) The error is coming from here:
>>
>> https://git.qemu.org/?p=qemu.git;a=blob;f=target/ppc/excp_helper.c;h=fd147e2a37662456d30f7ab74b23bfb036260ced;hb=HEAD#l830
>>
>> What does this mean? What would a real CPU do with this and where it could
>> be catched to use as hypercall method on CPUs without HV or what else should
>> we do if we wanted this to work with KVM PR too in the future?
>
> The interesting bit is actually how we're getting to that part of
> powerpc_excp.  I guess we must be getting a KVM exit for that 'sc 1',
> but I don't know what type.  If we can figure out that would be where
> we'd need to intercept it and send it to the vhyp handler instead of
> actually trying to enter the hypercall vector on the emulated CPU,
> which doesn't have one.

Well, this is emulated KVM PR running in a TCG guest because as I 
mentioned before I don't have real hardware to test KVM on. So I've booted 
Linux on qemu-system-ppc64 -M mac99,via=pmu (that's using 970 (G5) CPU) 
and run qemu-system-ppc with KVM PR within it. So it's ultimately coming 
from somewhere in TCG:

#0  0x00007f5d1c0f09ba in raise () at /lib64/libc.so.6
#1  0x00007f5d1c0d9524 in abort () at /lib64/libc.so.6
#2  0x00005557807f4776 in cpu_abort (cpu=cpu@entry=0x5557826bff30, fmt=fmt@entry=0x555780a8ad88 "Trying to deliver HV exception (MSR) %d with no HV support\n") at ../cpu.c:376
#3  0x00005557806a09c6 in powerpc_excp (cpu=0x5557826bff30, excp_model=13, excp=<optimized out>) at ../target/ppc/excp_helper.c:833
#4  0x000055578073bf43 in cpu_handle_exception (ret=<synthetic pointer>, cpu=0x555782669640) at ../accel/tcg/cpu-exec.c:524
#5  0x000055578073bf43 in cpu_exec (cpu=cpu@entry=0x5557826bff30) at ../accel/tcg/cpu-exec.c:778
#6  0x0000555780750d82 in tcg_cpus_exec (cpu=cpu@entry=0x5557826bff30) at ../accel/tcg/tcg-accel-ops.c:67
#7  0x0000555780755103 in mttcg_cpu_thread_fn (arg=arg@entry=0x5557826bff30) at ../accel/tcg/tcg-accel-ops-mttcg.c:70
#8  0x0000555780974eea in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:521
#9  0x00007f5d1c28704c in start_thread () at /lib64/libpthread.so.0
#10 0x00007f5d1c1b82cf in clone () at /lib64/libc.so.6

This is a backtrace on the host because to outer TCG qemu is getting this 
abort when the guest qemu in it runs the sc 1 via KVM PR. So it's trapped 
on the host not in the guest but I don't know what a real CPU would do in 
this case and if that's emulated correctly in nested KVM (apparently not 
as it's crashing). This makes it a bit hard to test as I can also run into 
KVM emulation bugs in the host QEMU as well as problems with KVM PR that 
would also happen on real hardware but it might not be obvious which I've 
hit.


Another unrelated problem I've found with KVM PR is when trying to run it 
with the board firmware (that's not crashing as that's not using sc 1) it 
initially starts but gets stuck soon after starting. When enabling kvm 
traces in QEMU I see endless kvm exits:

kvm_run_exit cpu_index 0, reason 6
kvm_run_exit cpu_index 0, reason 6
kvm_run_exit cpu_index 0, reason 6
kvm_run_exit cpu_index 0, reason 6

but NIP is not advancing in info registers:

(qemu) info registers
kvm_failed_spr_get Warning: Unable to retrieve SPR 1013 from KVM: Invalid argument
NIP fff05958   LR fff05524 CTR 00000000 XER 20000000 CPU#0
MSR 00000030 HID0 00000000  HF 6c000002 iidx 3 didx 3
TB 00000000 00000000 DECR 0
GPR00 0000000000000000 0000000000000000 0000000000000000 0000000000000081
GPR04 00000000fe000d00 0000000000000069 00000000fff042a0 00000000fff054cc
GPR08 0000000000f5e7f8 00000000ffffff00 00000000fff04274 0000000000000000
GPR12 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR16 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR24 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR28 0000000000000000 0000000000000000 0000000000000002 0000000000000000
CR 40000000  [ G  -  -  -  -  -  -  -  ]             RES ffffffff
FPR00 0000000000000000 0000000000000000 0000000000000000 0000000000000000
FPR04 0000000000000000 0000000000000000 0000000000000000 0000000000000000
FPR08 0000000000000000 0000000000000000 0000000000000000 0000000000000000
FPR12 0000000000000000 0000000000000000 0000000000000000 0000000000000000
FPR16 0000000000000000 0000000000000000 0000000000000000 0000000000000000
FPR20 0000000000000000 0000000000000000 0000000000000000 0000000000000000
FPR24 0000000000000000 0000000000000000 0000000000000000 0000000000000000
FPR28 0000000000000000 0000000000000000 0000000000000000 0000000000000000
FPSCR 00000000
  SRR0 00000000  SRR1 00000000    PVR 000c0209 VRSAVE 00000000
SPRG0 00000000 SPRG1 00000000  SPRG2 00000000  SPRG3 00000000
SPRG4 00000000 SPRG5 00000000  SPRG6 00000000  SPRG7 00000000
  SDR1 00000000   DAR 00000000  DSISR 00000000
(qemu) kvm_failed_spr_set Warning: Unable to set SPR 1013 to KVM: Invalid argument

It stays on fff05958, the instruction it seems to be stuck on is doing 
some io:

0xfff05938:  7c0006ac  eieio
0xfff0593c:  98640001  stb      r3, 1(r4)
0xfff05940:  7c0006ac  eieio
0xfff05944:  4e800020  blr
0xfff05948:  7c641b78  mr       r4, r3
0xfff0594c:  6484fe00  oris     r4, r4, 0xfe00
0xfff05950:  7c0006ac  eieio
0xfff05954:  88640000  lbz      r3, 0(r4)
0xfff05958:  7c0006ac  eieio
0xfff0595c:  7c0004ac  sync
0xfff05960:  4e800020  blr

but likely it's trying to access a device which is not emulated so 
nothing's there. When I run the same with TCG I get some invalid accesss 
warnings with -d guest_errors enabled around the same point:

Invalid access at addr 0xFE000E43, size 1, region '(null)', reason: rejected
Invalid access at addr 0xE43, size 1, region '(null)', reason: rejected
Invalid access at addr 0xFE000E44, size 1, region '(null)', reason: rejected
Invalid access at addr 0xE44, size 1, region '(null)', reason: rejected
Invalid access at addr 0xFE000E41, size 1, region '(null)', reason: rejected
Invalid access at addr 0xE41, size 1, region '(null)', reason: rejected
Invalid access at addr 0xFE000E42, size 1, region '(null)', reason: rejected
Invalid access at addr 0xE42, size 1, region '(null)', reason: rejected
Invalid access at addr 0xFE000E40, size 1, region '(null)', reason: rejected
Invalid access at addr 0xE40, size 1, region '(null)', reason: rejected

but it's moving on with TCG and since the device that should be here is 
not really needed (it's setting up some clock generators on real hardware 
at this point) it's working anyway. Is this a problem with KVM so do I 
really need to put unimplemented devices to every address the guest may 
access even when that's not needed on TCG or is this a bug somewhere that 
after detecting this error it's not advancing the IP and trying to execute 
the same instruction again? I'm not sure how to debug this further or 
where to look for a bug or fix it.

Regards,
BALATON Zoltan
BALATON Zoltan June 7, 2021, 10:54 p.m. UTC | #59
On Mon, 7 Jun 2021, David Gibson wrote:
> On Fri, Jun 04, 2021 at 03:59:22PM +0200, BALATON Zoltan wrote:
>> On Fri, 4 Jun 2021, David Gibson wrote:
>>> On Wed, Jun 02, 2021 at 02:29:29PM +0200, BALATON Zoltan wrote:
>>>> On Wed, 2 Jun 2021, David Gibson wrote:
>>>>> On Thu, May 27, 2021 at 02:42:39PM +0200, BALATON Zoltan wrote:
>>>>>> On Thu, 27 May 2021, David Gibson wrote:
>>>>>>> On Tue, May 25, 2021 at 12:08:45PM +0200, BALATON Zoltan wrote:
>>>>>>>> On Tue, 25 May 2021, David Gibson wrote:
>>>>>>>>> On Mon, May 24, 2021 at 12:55:07PM +0200, BALATON Zoltan wrote:
>>>>>>>>>> On Mon, 24 May 2021, David Gibson wrote:
>> What's ePAPR then and how is it different from PAPR? I mean the acronym not
>> the hypercall method, the latter is explained in that doc but what ePAPR
>> stands for and why is that method called like that is not clear to me.
>
> Ok, history lesson time.
>
> For a long time PAPR has been the document that described the OS
> environment for IBM POWER based server hardware.  Before it was called
> PAPR (POWER Architecture Platform Requirements) it was called the
> "RPA" (Requirements for the POWER Architecture, I think?).  You might
> see the old name in a few places.
>
> Requiring a full Open Firmware and a bunch of other fairly heavyweight
> stuff, PAPR really wasn't suitable for embedded ppc chips and boards.
> The situation with those used to be a complete mess with basically
> every board variant having it's own different firmware with its own
> different way of presenting some fragments of vital data to the OS.
>
> ePAPR - Embedded Power Architecture Platform Requirements - was
> created as a standard to try to unify how this stuff was handled on
> embedded ppc chips.  I was one of the authors on early versions of
> it.  It's mostly based around giving the OS a flattened device tree,
> with some deliberately minimal requirements on firmware initialization
> and entry state.  Here's a link to one of those early versions:
>
> http://elinux.org/images/c/cf/Power_ePAPR_APPROVED_v1.1.pdf
>
> I thought there were later versions, but I couldn't seem to find any.
> It's possible the process of refining later versions just petered out
> as the embedded ppc world mostly died and the flattened device tree
> development mostly moved to ARM.
>
> Since some of the embedded chips from Freescale had hypervisor
> capabilities, a hypercall model was added to ePAPR - but that wasn't
> something I was greatly involved in, so I don't know much about it.
>
> ePAPR is the reason that the original PAPR is sometimes referred to as
> "sPAPR" to disambiguate.

Ah, thanks that really puts it in context. I've heard about PReP and CHRP 
in connection with the boards I've tried to emulate but don't know much 
about PAPR and server POWER systems.

>>>> The ePAPR (1.) seems to be preferred by KVM and
>>>> MOL OSI supported for compatibility.
>>>
>>> That document looks pretty out of date.  Most of it is only discussing
>>> KVM PR, which is now barely maintained.  KVM HV only works with PAPR
>>> hypercalls.
>>
>> The links says it's latest kernel docs, so maybe an update need to be sent
>> to KVM?
>
> I guess, but the chances of me finding time to do it are approximately
> zero.
>
>>>> So if we need something else instead of
>>>> 2. PAPR hypercalls there seems to be two options: ePAPR and MOL OSI which
>>>> should work with KVM but then I'm not sure how to handle those on TCG.
>>>>
>>>>>>>> [...]
>>>>>>>>>>>> I've tested that the missing rtas is not the reason for getting no output
>>>>>>>>>>>> via serial though, as even when disabling rtas on pegasos2.rom it boots and
>>>>>>>>>>>> I still get serial output just some PCI devices are not detected (such as
>>>>>>>>>>>> USB, the video card and the not emulated ethernet port but these are not
>>>>>>>>>>>> fatal so it might even work as a first try without rtas, just to boot a
>>>>>>>>>>>> Linux kernel for testing it would be enough if I can fix the serial output).
>>>>>>>>>>>> I still don't know why it's not finding serial but I think it may be some
>>>>>>>>>>>> missing or wrong info in the device tree I generat. I'll try to focus on
>>>>>>>>>>>> this for now and leave the above rtas question for later.
>>>>>>>>>>>
>>>>>>>>>>> Oh.. another thought on that.  You have an ISA serial port on Pegasos,
>>>>>>>>>>> I believe.  I wonder if the PCI->ISA bridge needs some configuration /
>>>>>>>>>>> initialization that the firmware is expected to do.  If so you'll need
>>>>>>>>>>> to mimic that setup in qemu for the VOF case.
>>>>>>>>>>
>>>>>>>>>> That's what I begin to think because I've added everything to the device
>>>>>>>>>> tree that I thought could be needed and I still don't get it working so it
>>>>>>>>>> may need some config from the firmware. But how do I access device registers
>>>>>>>>>> from board code? I've tried adding a machine reset method and write to
>>>>>>>>>> memory mapped device registers but all my attempts failed. I've tried
>>>>>>>>>> cpu_stl_le_data and even memory_region_dispatch_write but these did not get
>>>>>>>>>> to the device. What's the way to access guest mmio regs from QEMU?
>>>>>>>>>
>>>>>>>>> That's odd, cpu_stl() and memory_region_dispatch_write() should work
>>>>>>>>> from board code (after the relevant memory regions are configured, of
>>>>>>>>> course).  As an ISA serial port, it's probably accessed through IO
>>>>>>>>> space, not memory space though, so you'd need &address_space_io.  And
>>>>>>>>> if there is some bridge configuration then it's the bridge control
>>>>>>>>> registers you need to look at not the serial registers - you'd have to
>>>>>>>>> look at the bridge documentation for that.  Or, I guess the bridge
>>>>>>>>> implementation in qemu, which you wrote part of.
>>>>>>>>
>>>>>>>> I've found at last that stl_le_phys() works. There are so many of these that
>>>>>>>> I never know when to use which.
>>>>>>>>
>>>>>>>> I think the address_space_rw calls in vof_client_call() in vof.c could also
>>>>>>>> use these for somewhat shorter code. I've ended up with
>>>>>>>> stl_le_phys(CPU(cpu)->as, addr, val) in my machine reset methodbut I don't
>>>>>>>> even need that now as it works without additional setup. Also VOF's memory
>>>>>>>> access is basically the same as the already existing rtas_st() and co. so
>>>>>>>> maybe that could be reused to make code smaller?
>>>>>>>
>>>>>>> rtas_ld() and rtas_st() should only be used for reading/writing RTAS
>>>>>>> parameters to and from memory.  Accessing IO shouldn't be done with
>>>>>>> those.
>>>>>>>
>>>>>>> For IO you probably want the cpu_st*() variants in most cases, since
>>>>>>> you're trying to emulate an IO access from the virtual cpu.
>>>>>>
>>>>>> I think I've tried that but what worked to access mmio device registers are
>>>>>> stl_le_phys and similar that are wrappers around address_space_stl_*. But I
>>>>>> did not mean that for rtas_ld/_st but the part when vof accessing the
>>>>>> parameters passed by its hypercall which is memory access:
>>>>>>
>>>>>> https://github.com/patchew-project/qemu/blob/patchew/20210520090557.435689-1-aik%40ozlabs.ru/hw/ppc/vof.c
>>>>>>
>>>>>> line 893, and vof_client_call before that is very similar to what h_rtas
>>>>>> does here:
>>>>>>
>>>>>> https://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/spapr_hcall.c;h=f25014afda408002ee1ec1027a0dd7a6025eca61;hb=HEAD#l639
>>>>>>
>>>>>> and I also need to do the same for rtas in pegasos2 for which I'm just using
>>>>>> ldl_be_phys for now but I wonder if we really need 3 ways to do the same or
>>>>>> the rtas_ld/_st could be made more generic and reused here?
>>>>>
>>>>> For your rtas implementation you could definitely re-use them.  For
>>>>> the client call I'm a bit less confident, but if the in-guest-memory
>>>>> structures are really the same, then it would make sense.
>>>>
>>>> The memory structure seems very similar to me, the only difference is
>>>> calling the first field service in VOF instead of token in RTAS. Both are
>>>> just an array of big endian unit32_t with token, nargs, nret at the front
>>>> followed by args and rets. Since these rtas_ld/st are defined in spapr.h I
>>>> did not bother to split them off, so for pegasos2 rtas I'm just using the
>>>> ldl_be_* functions directly for which these are a shorthand for. If these
>>>> were split off for sharing between spapr rtas and VOF I may be able to reuse
>>>> them as well but it's not that important so just mentioned it as a possible
>>>> later clean up.
>>>
>>> Ok, sounds reasonable to re-use them then, though maybe add an aliased
>>> name for clarity ofci_{ld,st}(), maybe?  (for "Open Firmware Client
>>> Interface")
>>
>> I'll wait for what Alexey decides to do in the next VOF patch version and if
>> I can reuse that (I could if these were defined in vof.h). I don't want to
>> come up with yet another abstraction to ldl_be_* which does not seem to make
>> it more clear than using the actual functions for guest memory access which
>> is what we're doing while getting the hypercall args so I think either using
>> ldl_be_* directly or reusing already existing rfas_ls/_st would make sense
>> but adding similar funcs with another name just makes it more confusing.
>
> Well, the point of the rtas_ld() functions isn't o be a different way
> of accessing memory.  It's just a convenience wrapper that takes an
> RTAS args array and an argument index and does the right thing to
> retrieve it for you.
>
> So, if your RTAS function implementation when you want to get argument
> 0, you just go rtas_ld(args, 0) - more readable than having a bunch of
> offset calculations and a long winded call to the BE memory access
> function.  You can look at the examples in hw/ppc/sppar_rtas.c to see
> how its used.
>
> Actually, looking again at how it works, you should probably only use
> rtas_ld() if your general dispatch code has pre-parsed the args
> structure into separate args and rets arrays, again as we do in
> spapr_rtas.c

The problem with those rtas_* functions is that they are in spapr now so 
to reuse it I'd need to split them off which I did not do because it's not 
too bad without it and modifying spapr would mean another round of review 
which could take long and delay my other patches. So if somebody splits 
these off for reuse (like if Alexey wants to reuse them in VOF) then I may 
use them but otherwise I've just noted these could be reused but don't 
intend to do that now. This could also be done later for both VOF and 
pegasos2 as a clean up so it does not seem to be too important at the 
moment.

Regards,
BALATON Zoltan
Alexey Kardashevskiy June 9, 2021, 5:51 a.m. UTC | #60
On 6/8/21 08:54, BALATON Zoltan wrote:
> On Mon, 7 Jun 2021, David Gibson wrote:
>> On Fri, Jun 04, 2021 at 03:59:22PM +0200, BALATON Zoltan wrote:
>>> On Fri, 4 Jun 2021, David Gibson wrote:
>>>> On Wed, Jun 02, 2021 at 02:29:29PM +0200, BALATON Zoltan wrote:
>>>>> On Wed, 2 Jun 2021, David Gibson wrote:
>>>>>> On Thu, May 27, 2021 at 02:42:39PM +0200, BALATON Zoltan wrote:
>>>>>>> On Thu, 27 May 2021, David Gibson wrote:
>>>>>>>> On Tue, May 25, 2021 at 12:08:45PM +0200, BALATON Zoltan wrote:
>>>>>>>>> On Tue, 25 May 2021, David Gibson wrote:
>>>>>>>>>> On Mon, May 24, 2021 at 12:55:07PM +0200, BALATON Zoltan wrote:
>>>>>>>>>>> On Mon, 24 May 2021, David Gibson wrote:
>>> What's ePAPR then and how is it different from PAPR? I mean the 
>>> acronym not
>>> the hypercall method, the latter is explained in that doc but what ePAPR
>>> stands for and why is that method called like that is not clear to me.
>>
>> Ok, history lesson time.
>>
>> For a long time PAPR has been the document that described the OS
>> environment for IBM POWER based server hardware.  Before it was called
>> PAPR (POWER Architecture Platform Requirements) it was called the
>> "RPA" (Requirements for the POWER Architecture, I think?).  You might
>> see the old name in a few places.
>>
>> Requiring a full Open Firmware and a bunch of other fairly heavyweight
>> stuff, PAPR really wasn't suitable for embedded ppc chips and boards.
>> The situation with those used to be a complete mess with basically
>> every board variant having it's own different firmware with its own
>> different way of presenting some fragments of vital data to the OS.
>>
>> ePAPR - Embedded Power Architecture Platform Requirements - was
>> created as a standard to try to unify how this stuff was handled on
>> embedded ppc chips.  I was one of the authors on early versions of
>> it.  It's mostly based around giving the OS a flattened device tree,
>> with some deliberately minimal requirements on firmware initialization
>> and entry state.  Here's a link to one of those early versions:
>>
>> http://elinux.org/images/c/cf/Power_ePAPR_APPROVED_v1.1.pdf
>>
>> I thought there were later versions, but I couldn't seem to find any.
>> It's possible the process of refining later versions just petered out
>> as the embedded ppc world mostly died and the flattened device tree
>> development mostly moved to ARM.
>>
>> Since some of the embedded chips from Freescale had hypervisor
>> capabilities, a hypercall model was added to ePAPR - but that wasn't
>> something I was greatly involved in, so I don't know much about it.
>>
>> ePAPR is the reason that the original PAPR is sometimes referred to as
>> "sPAPR" to disambiguate.
> 
> Ah, thanks that really puts it in context. I've heard about PReP and 
> CHRP in connection with the boards I've tried to emulate but don't know 
> much about PAPR and server POWER systems.
> 
>>>>> The ePAPR (1.) seems to be preferred by KVM and
>>>>> MOL OSI supported for compatibility.
>>>>
>>>> That document looks pretty out of date.  Most of it is only discussing
>>>> KVM PR, which is now barely maintained.  KVM HV only works with PAPR
>>>> hypercalls.
>>>
>>> The links says it's latest kernel docs, so maybe an update need to be 
>>> sent
>>> to KVM?
>>
>> I guess, but the chances of me finding time to do it are approximately
>> zero.
>>
>>>>> So if we need something else instead of
>>>>> 2. PAPR hypercalls there seems to be two options: ePAPR and MOL OSI 
>>>>> which
>>>>> should work with KVM but then I'm not sure how to handle those on TCG.
>>>>>
>>>>>>>>> [...]
>>>>>>>>>>>>> I've tested that the missing rtas is not the reason for 
>>>>>>>>>>>>> getting no output
>>>>>>>>>>>>> via serial though, as even when disabling rtas on 
>>>>>>>>>>>>> pegasos2.rom it boots and
>>>>>>>>>>>>> I still get serial output just some PCI devices are not 
>>>>>>>>>>>>> detected (such as
>>>>>>>>>>>>> USB, the video card and the not emulated ethernet port but 
>>>>>>>>>>>>> these are not
>>>>>>>>>>>>> fatal so it might even work as a first try without rtas, 
>>>>>>>>>>>>> just to boot a
>>>>>>>>>>>>> Linux kernel for testing it would be enough if I can fix 
>>>>>>>>>>>>> the serial output).
>>>>>>>>>>>>> I still don't know why it's not finding serial but I think 
>>>>>>>>>>>>> it may be some
>>>>>>>>>>>>> missing or wrong info in the device tree I generat. I'll 
>>>>>>>>>>>>> try to focus on
>>>>>>>>>>>>> this for now and leave the above rtas question for later.
>>>>>>>>>>>>
>>>>>>>>>>>> Oh.. another thought on that.  You have an ISA serial port 
>>>>>>>>>>>> on Pegasos,
>>>>>>>>>>>> I believe.  I wonder if the PCI->ISA bridge needs some 
>>>>>>>>>>>> configuration /
>>>>>>>>>>>> initialization that the firmware is expected to do.  If so 
>>>>>>>>>>>> you'll need
>>>>>>>>>>>> to mimic that setup in qemu for the VOF case.
>>>>>>>>>>>
>>>>>>>>>>> That's what I begin to think because I've added everything to 
>>>>>>>>>>> the device
>>>>>>>>>>> tree that I thought could be needed and I still don't get it 
>>>>>>>>>>> working so it
>>>>>>>>>>> may need some config from the firmware. But how do I access 
>>>>>>>>>>> device registers
>>>>>>>>>>> from board code? I've tried adding a machine reset method and 
>>>>>>>>>>> write to
>>>>>>>>>>> memory mapped device registers but all my attempts failed. 
>>>>>>>>>>> I've tried
>>>>>>>>>>> cpu_stl_le_data and even memory_region_dispatch_write but 
>>>>>>>>>>> these did not get
>>>>>>>>>>> to the device. What's the way to access guest mmio regs from 
>>>>>>>>>>> QEMU?
>>>>>>>>>>
>>>>>>>>>> That's odd, cpu_stl() and memory_region_dispatch_write() 
>>>>>>>>>> should work
>>>>>>>>>> from board code (after the relevant memory regions are 
>>>>>>>>>> configured, of
>>>>>>>>>> course).  As an ISA serial port, it's probably accessed 
>>>>>>>>>> through IO
>>>>>>>>>> space, not memory space though, so you'd need 
>>>>>>>>>> &address_space_io.  And
>>>>>>>>>> if there is some bridge configuration then it's the bridge 
>>>>>>>>>> control
>>>>>>>>>> registers you need to look at not the serial registers - you'd 
>>>>>>>>>> have to
>>>>>>>>>> look at the bridge documentation for that.  Or, I guess the 
>>>>>>>>>> bridge
>>>>>>>>>> implementation in qemu, which you wrote part of.
>>>>>>>>>
>>>>>>>>> I've found at last that stl_le_phys() works. There are so many 
>>>>>>>>> of these that
>>>>>>>>> I never know when to use which.
>>>>>>>>>
>>>>>>>>> I think the address_space_rw calls in vof_client_call() in 
>>>>>>>>> vof.c could also
>>>>>>>>> use these for somewhat shorter code. I've ended up with
>>>>>>>>> stl_le_phys(CPU(cpu)->as, addr, val) in my machine reset 
>>>>>>>>> methodbut I don't
>>>>>>>>> even need that now as it works without additional setup. Also 
>>>>>>>>> VOF's memory
>>>>>>>>> access is basically the same as the already existing rtas_st() 
>>>>>>>>> and co. so
>>>>>>>>> maybe that could be reused to make code smaller?
>>>>>>>>
>>>>>>>> rtas_ld() and rtas_st() should only be used for reading/writing 
>>>>>>>> RTAS
>>>>>>>> parameters to and from memory.  Accessing IO shouldn't be done with
>>>>>>>> those.
>>>>>>>>
>>>>>>>> For IO you probably want the cpu_st*() variants in most cases, 
>>>>>>>> since
>>>>>>>> you're trying to emulate an IO access from the virtual cpu.
>>>>>>>
>>>>>>> I think I've tried that but what worked to access mmio device 
>>>>>>> registers are
>>>>>>> stl_le_phys and similar that are wrappers around 
>>>>>>> address_space_stl_*. But I
>>>>>>> did not mean that for rtas_ld/_st but the part when vof accessing 
>>>>>>> the
>>>>>>> parameters passed by its hypercall which is memory access:
>>>>>>>
>>>>>>> https://github.com/patchew-project/qemu/blob/patchew/20210520090557.435689-1-aik%40ozlabs.ru/hw/ppc/vof.c 
>>>>>>>
>>>>>>>
>>>>>>> line 893, and vof_client_call before that is very similar to what 
>>>>>>> h_rtas
>>>>>>> does here:
>>>>>>>
>>>>>>> https://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/spapr_hcall.c;h=f25014afda408002ee1ec1027a0dd7a6025eca61;hb=HEAD#l639 
>>>>>>>
>>>>>>>
>>>>>>> and I also need to do the same for rtas in pegasos2 for which I'm 
>>>>>>> just using
>>>>>>> ldl_be_phys for now but I wonder if we really need 3 ways to do 
>>>>>>> the same or
>>>>>>> the rtas_ld/_st could be made more generic and reused here?
>>>>>>
>>>>>> For your rtas implementation you could definitely re-use them.  For
>>>>>> the client call I'm a bit less confident, but if the in-guest-memory
>>>>>> structures are really the same, then it would make sense.
>>>>>
>>>>> The memory structure seems very similar to me, the only difference is
>>>>> calling the first field service in VOF instead of token in RTAS. 
>>>>> Both are
>>>>> just an array of big endian unit32_t with token, nargs, nret at the 
>>>>> front
>>>>> followed by args and rets. Since these rtas_ld/st are defined in 
>>>>> spapr.h I
>>>>> did not bother to split them off, so for pegasos2 rtas I'm just 
>>>>> using the
>>>>> ldl_be_* functions directly for which these are a shorthand for. If 
>>>>> these
>>>>> were split off for sharing between spapr rtas and VOF I may be able 
>>>>> to reuse
>>>>> them as well but it's not that important so just mentioned it as a 
>>>>> possible
>>>>> later clean up.
>>>>
>>>> Ok, sounds reasonable to re-use them then, though maybe add an aliased
>>>> name for clarity ofci_{ld,st}(), maybe?  (for "Open Firmware Client
>>>> Interface")
>>>
>>> I'll wait for what Alexey decides to do in the next VOF patch version 
>>> and if
>>> I can reuse that (I could if these were defined in vof.h). I don't 
>>> want to
>>> come up with yet another abstraction to ldl_be_* which does not seem 
>>> to make
>>> it more clear than using the actual functions for guest memory access 
>>> which
>>> is what we're doing while getting the hypercall args so I think 
>>> either using
>>> ldl_be_* directly or reusing already existing rfas_ls/_st would make 
>>> sense
>>> but adding similar funcs with another name just makes it more confusing.
>>
>> Well, the point of the rtas_ld() functions isn't o be a different way
>> of accessing memory.  It's just a convenience wrapper that takes an
>> RTAS args array and an argument index and does the right thing to
>> retrieve it for you.
>>
>> So, if your RTAS function implementation when you want to get argument
>> 0, you just go rtas_ld(args, 0) - more readable than having a bunch of
>> offset calculations and a long winded call to the BE memory access
>> function.  You can look at the examples in hw/ppc/sppar_rtas.c to see
>> how its used.
>>
>> Actually, looking again at how it works, you should probably only use
>> rtas_ld() if your general dispatch code has pre-parsed the args
>> structure into separate args and rets arrays, again as we do in
>> spapr_rtas.c
> 
> The problem with those rtas_* functions is that they are in spapr now so 
> to reuse it I'd need to split them off which I did not do because it's 
> not too bad without it and modifying spapr would mean another round of 
> review which could take long and delay my other patches. So if somebody 
> splits these off for reuse (like if Alexey wants to reuse them in VOF) 
> then I may use them but otherwise I've just noted these could be reused 
> but don't intend to do that now. This could also be done later for both 
> VOF and pegasos2 as a clean up so it does not seem to be too important 
> at the moment.

I added VOF_MEM_READ/VOF_MEM_WRITE as (unlike others) they can return an 
error code. I am not quite sure why we did not bother then when added 
rtas_ld/st (were we just learning then?) but we do care now.

I am moving those to vof.h.

Here is v21:
https://github.com/aik/qemu/commits/killslof-cli-v21

changes:
v21:
* s/ld/ldz/ in entry.S
* moved CONFIG_VOF from default-configs/devices/ppc64-softmmu.mak to Kconfig
* made CONFIG_VOF optional
* s/l.lds/vof.lds/
* force 32 BE in spapr_machine_reset() instead of the firmware
* added checks for non-null methods of VofMachineIfClass
* moved OF_STACK_SIZE to vof.h, renamed to VOF_..., added a better comment
* added  path_offset wrapper for handling mixed case for addresses after 
"@" in node names
* changed getprop() to check for actual "name" property in the fdt
* moved VOF_MEM_READ/VOF_MEM_WRITE to vof.h for sharing as (unlike similar
rtas_ld/ldl_be_*) they return error codes
* VOF_MEM_READ uses now address_space_read (it was address_space_read_full
before, not sure why)



I'll post it .... may be on friday unless you find something else :)
Alexey Kardashevskiy June 9, 2021, 6:13 a.m. UTC | #61
On 6/7/21 13:05, David Gibson wrote:
> On Fri, Jun 04, 2021 at 03:50:28PM +0200, BALATON Zoltan wrote:
>> On Fri, 4 Jun 2021, David Gibson wrote:
>>> On Sun, May 30, 2021 at 07:33:01PM +0200, BALATON Zoltan wrote:
> [snip]
>>>> MorphOS checks the name property of the root node ("/") to decide what
>>>> platform it runs on so we may need to be able to set this property on /
>>>> where it should return "bplan,Pegasos2", therefore the above maybe should do
>>>> getprop first and only generate name property if it's not set (or at least
>>>> check if we're on the root node and allow setting name property there). (On
>>>> Macs the root node is named "device-tree" and this was before found to be
>>>> needed for MorphOS.)
>>>
>>> Ah.  Hrm.  Have to think about what to do about that.
>>
>> This is easy to fix, this seems to allow setting a name property or return a
>> default:
>>
>>> diff --git a/hw/ppc/vof.c b/hw/ppc/vof.c
>> index b47bbd509d..746842593e 100644
>> --- a/hw/ppc/vof.c
>> +++ b/hw/ppc/vof.c
>> @@ -163,14 +163,14 @@ static uint32_t vof_finddevice(const void *fdt,
>> uint32_t nodeaddr)
>>   static const void *getprop(const void *fdt, int nodeoff, const char *propname,
>>                              int *proplen, bool *write0)
>>   {
>> -    const char *unit, *prop;
>> +    const char *unit, *prop = fdt_getprop(fdt, nodeoff, propname, proplen);
>>
>>       /*
>>        * The "name" property is not actually stored as a property in the FDT,
>>        * we emulate it by returning a pointer to the node's name and adjust
>>        * proplen to include only the name but not the unit.
>>        */
>> -    if (strcmp(propname, "name") == 0) {
>> +    if (!prop && strcmp(propname, "name") == 0) {
>>           prop = fdt_get_name(fdt, nodeoff, proplen);
>>           if (!prop) {
>>               *proplen = 0;
>> @@ -196,7 +196,7 @@ static const void *getprop(const void *fdt, int nodeoff, const char *propname,
>>       if (write0) {
>>           *write0 = false;
>>       }
>> -    return fdt_getprop(fdt, nodeoff, propname, proplen);
>> +    return prop;
>>   }
> 
> Kind of a hack, but it'll do for now.
> 

oops missed this subthread but I ended up doing it anyway, just tiny bit 
different.
BALATON Zoltan June 9, 2021, 10:19 a.m. UTC | #62
On Wed, 9 Jun 2021, Alexey Kardashevskiy wrote:
> On 6/8/21 08:54, BALATON Zoltan wrote:
>> On Mon, 7 Jun 2021, David Gibson wrote:
>>> On Fri, Jun 04, 2021 at 03:59:22PM +0200, BALATON Zoltan wrote:
>>>> On Fri, 4 Jun 2021, David Gibson wrote:
>>>>> On Wed, Jun 02, 2021 at 02:29:29PM +0200, BALATON Zoltan wrote:
>>>>>> On Wed, 2 Jun 2021, David Gibson wrote:
>>>>>>> On Thu, May 27, 2021 at 02:42:39PM +0200, BALATON Zoltan wrote:
>>>>>>>> On Thu, 27 May 2021, David Gibson wrote:
>>>>>>>>> On Tue, May 25, 2021 at 12:08:45PM +0200, BALATON Zoltan wrote:
>>>>>>>>>> On Tue, 25 May 2021, David Gibson wrote:
>>>>>>>>>>> On Mon, May 24, 2021 at 12:55:07PM +0200, BALATON Zoltan wrote:
>>>>>>>>>>>> On Mon, 24 May 2021, David Gibson wrote:
>>>> What's ePAPR then and how is it different from PAPR? I mean the acronym 
>>>> not
>>>> the hypercall method, the latter is explained in that doc but what ePAPR
>>>> stands for and why is that method called like that is not clear to me.
>>> 
>>> Ok, history lesson time.
>>> 
>>> For a long time PAPR has been the document that described the OS
>>> environment for IBM POWER based server hardware.  Before it was called
>>> PAPR (POWER Architecture Platform Requirements) it was called the
>>> "RPA" (Requirements for the POWER Architecture, I think?).  You might
>>> see the old name in a few places.
>>> 
>>> Requiring a full Open Firmware and a bunch of other fairly heavyweight
>>> stuff, PAPR really wasn't suitable for embedded ppc chips and boards.
>>> The situation with those used to be a complete mess with basically
>>> every board variant having it's own different firmware with its own
>>> different way of presenting some fragments of vital data to the OS.
>>> 
>>> ePAPR - Embedded Power Architecture Platform Requirements - was
>>> created as a standard to try to unify how this stuff was handled on
>>> embedded ppc chips.  I was one of the authors on early versions of
>>> it.  It's mostly based around giving the OS a flattened device tree,
>>> with some deliberately minimal requirements on firmware initialization
>>> and entry state.  Here's a link to one of those early versions:
>>> 
>>> http://elinux.org/images/c/cf/Power_ePAPR_APPROVED_v1.1.pdf
>>> 
>>> I thought there were later versions, but I couldn't seem to find any.
>>> It's possible the process of refining later versions just petered out
>>> as the embedded ppc world mostly died and the flattened device tree
>>> development mostly moved to ARM.
>>> 
>>> Since some of the embedded chips from Freescale had hypervisor
>>> capabilities, a hypercall model was added to ePAPR - but that wasn't
>>> something I was greatly involved in, so I don't know much about it.
>>> 
>>> ePAPR is the reason that the original PAPR is sometimes referred to as
>>> "sPAPR" to disambiguate.
>> 
>> Ah, thanks that really puts it in context. I've heard about PReP and CHRP 
>> in connection with the boards I've tried to emulate but don't know much 
>> about PAPR and server POWER systems.
>> 
>>>>>> The ePAPR (1.) seems to be preferred by KVM and
>>>>>> MOL OSI supported for compatibility.
>>>>> 
>>>>> That document looks pretty out of date.  Most of it is only discussing
>>>>> KVM PR, which is now barely maintained.  KVM HV only works with PAPR
>>>>> hypercalls.
>>>> 
>>>> The links says it's latest kernel docs, so maybe an update need to be 
>>>> sent
>>>> to KVM?
>>> 
>>> I guess, but the chances of me finding time to do it are approximately
>>> zero.
>>> 
>>>>>> So if we need something else instead of
>>>>>> 2. PAPR hypercalls there seems to be two options: ePAPR and MOL OSI 
>>>>>> which
>>>>>> should work with KVM but then I'm not sure how to handle those on TCG.
>>>>>>