diff mbox series

hw/riscv: split RAM into low and high memory

Message ID 20230731015317.1026996-1-fei2.wu@intel.com (mailing list archive)
State New, archived
Headers show
Series hw/riscv: split RAM into low and high memory | expand

Commit Message

Wu, Fei July 31, 2023, 1:53 a.m. UTC
The riscv virt platform's memory starts at 0x80000000 and
straddles the 4GiB boundary. Curiously enough, this memory
layout prevents launching a VM with a bit more than 2000MiB
of RAM together with PCIe pass-thru on an x86 host, due to
identity mapping requirements for the MSI doorbell on x86:
the APIC/IOAPIC live right below 4GiB.

So just split the RAM range into two portions:
- 1 GiB range from 0x80000000 to 0xc0000000.
- The remainder at 0x100000000

...leaving a hole between the ranges.
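As a standalone restatement, the split performed in the virt_machine_init() hunk below amounts to:

```c
#include <assert.h>
#include <stdint.h>

#define GiB (1024ULL * 1024 * 1024)
#define LOW_MEM (1 * GiB)   /* low DRAM bank at 0x80000000 */

/* Up to 1 GiB of guest RAM stays in the low bank; any remainder
 * is placed in the high bank at 0x100000000. */
static void split_ram(uint64_t ram_size, uint64_t *low, uint64_t *high)
{
    if (ram_size > LOW_MEM) {
        *low = LOW_MEM;
        *high = ram_size - LOW_MEM;
    } else {
        *low = ram_size;
        *high = 0;
    }
}
```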

Signed-off-by: Andrei Warkentin <andrei.warkentin@intel.com>
Signed-off-by: Fei Wu <fei2.wu@intel.com>
---
 hw/riscv/virt.c         | 74 ++++++++++++++++++++++++++++++++++++-----
 include/hw/riscv/virt.h |  1 +
 2 files changed, 66 insertions(+), 9 deletions(-)

Comments

Daniel Henrique Barboza July 31, 2023, 10:46 p.m. UTC | #1
On 7/30/23 22:53, Fei Wu wrote:
> riscv virt platform's memory started at 0x80000000 and
> straddled the 4GiB boundary. Curiously enough, this choice
> of a memory layout will prevent from launching a VM with
> a bit more than 2000MiB and PCIe pass-thru on an x86 host, due
> to identity mapping requirements for the MSI doorbell on x86,
> and these (APIC/IOAPIC) live right below 4GiB.
> 
> So just split the RAM range into two portions:
> - 1 GiB range from 0x80000000 to 0xc0000000.
> - The remainder at 0x100000000
> 
> ...leaving a hole between the ranges.

I am afraid this breaks some existing distro setups, like Ubuntu. After this patch
the following emulation setup stops working:

~/work/qemu/build/qemu-system-riscv64 \
	-machine virt -nographic -m 8G -smp 8 \
	-kernel ./uboot-ubuntu/usr/lib/u-boot/qemu-riscv64_smode/uboot.elf \
	-drive file=snapshot.img,format=qcow2,if=virtio \
	-netdev bridge,id=bridge1,br=virbr0 -device virtio-net-pci,netdev=bridge1


This is basically a guest created via the official Canonical tutorial:

https://wiki.ubuntu.com/RISC-V/QEMU

The error being thrown:

=================

Boot HART ID              : 4
Boot HART Domain          : root
Boot HART Priv Version    : v1.12
Boot HART Base ISA        : rv64imafdch
Boot HART ISA Extensions  : time,sstc
Boot HART PMP Count       : 16
Boot HART PMP Granularity : 4
Boot HART PMP Address Bits: 54
Boot HART MHPM Count      : 16
Boot HART MIDELEG         : 0x0000000000001666
Boot HART MEDELEG         : 0x0000000000f0b509


U-Boot 2022.07+dfsg-1ubuntu4.2 (Nov 24 2022 - 18:47:41 +0000)

CPU:   rv64imafdch_zicbom_zicboz_zicsr_zifencei_zihintpause_zawrs_zfa_zca_zcd_zba_zbb_zbc_zbs_sstc_svadu
Model: riscv-virtio,qemu
DRAM:  Unhandled exception: Store/AMO access fault
EPC: 00000000802018b8 RA: 00000000802126a0 TVAL: 00000000ff733f90

Code: b823 06b2 bc23 06b2 b023 08b2 b423 08b2 (b823 08b2)


resetting ...
System reset not supported on this platform
### ERROR ### Please RESET the board ###
QEMU 8.0.90 monitor - type 'help' for more infor
=================


Based on the change made I can take an educated guess at what is going wrong.
We have another board with a memory topology similar to the one you're creating
here, the Microchip Polarfire (microchip_pfsoc.c). We were having some problems with this
board while trying to consolidate the boot process of all boards in
hw/riscv/boot.c because of its non-contiguous RAM banks. The full story can be
read in the commit message of 4b402886ac89 ("hw/riscv: change riscv_compute_fdt_addr()
semantics") but the short version can be seen in riscv_compute_fdt_addr()
from boot.c:

- if ram_start is less than 3072 MiB, the FDT will be put at the lowest value
between 3072 MiB and the end of that RAM bank;

- if ram_start is higher than 3072 MiB, the FDT will be put at the end of the
RAM bank.
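The rule above can be sketched as follows (a simplified model only; the real riscv_compute_fdt_addr() in hw/riscv/boot.c includes alignment and sanity checks omitted here):

```c
#include <assert.h>
#include <stdint.h>

#define MiB (1024ULL * 1024)

/* Simplified model of the FDT placement rule: below the 3072 MiB
 * threshold the FDT goes at min(threshold, end of bank) - fdt_size,
 * otherwise at the end of the bank. */
static uint64_t fdt_addr_rule(uint64_t ram_start, uint64_t ram_size,
                              uint64_t fdt_size)
{
    uint64_t ram_end = ram_start + ram_size;
    uint64_t threshold = 3072 * MiB;

    if (ram_start < threshold) {
        uint64_t limit = ram_end < threshold ? ram_end : threshold;
        return limit - fdt_size;
    }
    return ram_end - fdt_size;
}
```

With a 1 GiB low bank at 0x80000000, the bank ends exactly at 3072 MiB, so the FDT lands at the very end of low memory, right at the edge of the gap.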

So, after this patch, since riscv_compute_fdt_addr() is now being used with the
smaller low RAM bank, the FDT is being put at LOW_MEM - fdt_size for any setup that
has more than 1 GiB of RAM, and this breaks assumptions made by U-Boot, Ubuntu and
possibly others that are trying to retrieve the FDT from the gap that you created
between low and high memory in this patch.

In fact, this same Ubuntu guest I mentioned above will boot if I give it only 1 GiB
of RAM (-m 1G). If I try with -m 1.1G I reproduce this error. This validates
the guess I'm making here: Ubuntu is trying to fetch something (probably the FDT) from
the gap between the memory areas.

This change on top of this patch doesn't work either:

$ git diff
diff --git a/hw/riscv/virt.c b/hw/riscv/virt.c
index 8fbdc7220c..dfff48d849 100644
--- a/hw/riscv/virt.c
+++ b/hw/riscv/virt.c
@@ -1335,9 +1335,16 @@ static void virt_machine_done(Notifier *notifier, void *data)
                                           kernel_start_addr, true, NULL);
      }
  
-    fdt_load_addr = riscv_compute_fdt_addr(memmap[VIRT_DRAM].base,
+    if (machine->ram_size < memmap[VIRT_DRAM].size) {
+        fdt_load_addr = riscv_compute_fdt_addr(memmap[VIRT_DRAM].base,
                                             memmap[VIRT_DRAM].size,
                                             machine);
+    } else {
+        fdt_load_addr = riscv_compute_fdt_addr(memmap[VIRT_DRAM_HIGH].base,
+                                           memmap[VIRT_DRAM_HIGH].size,
+                                           machine);
+    }
+
    

This would put the FDT at the end of the high RAM bank for guests with more than
1 GiB of RAM. In fact this change makes the situation even worse, breaking setups
that were working before this patch.

There's a chance that reducing the gap between the RAM banks would make Ubuntu work
reliably again, but now I'm getting cold feet about this change.


I think we'll need some kernel/OpenSBI folks to weigh in here to see if there's a
failsafe memory setup that won't break distros out there and still allows your
passthrough to work.



Thanks,

Daniel

> 
> Signed-off-by: Andrei Warkentin <andrei.warkentin@intel.com>
> Signed-off-by: Fei Wu <fei2.wu@intel.com>
> ---
>   hw/riscv/virt.c         | 74 ++++++++++++++++++++++++++++++++++++-----
>   include/hw/riscv/virt.h |  1 +
>   2 files changed, 66 insertions(+), 9 deletions(-)
> 
> diff --git a/hw/riscv/virt.c b/hw/riscv/virt.c
> index d90286dc46..8fbdc7220c 100644
> --- a/hw/riscv/virt.c
> +++ b/hw/riscv/virt.c
> @@ -75,7 +75,9 @@
>   #error "Can't accomodate all IMSIC groups in address space"
>   #endif
>   
> -static const MemMapEntry virt_memmap[] = {
> +#define LOW_MEM (1 * GiB)
> +
> +static MemMapEntry virt_memmap[] = {
>       [VIRT_DEBUG] =        {        0x0,         0x100 },
>       [VIRT_MROM] =         {     0x1000,        0xf000 },
>       [VIRT_TEST] =         {   0x100000,        0x1000 },
> @@ -96,6 +98,7 @@ static const MemMapEntry virt_memmap[] = {
>       [VIRT_PCIE_ECAM] =    { 0x30000000,    0x10000000 },
>       [VIRT_PCIE_MMIO] =    { 0x40000000,    0x40000000 },
>       [VIRT_DRAM] =         { 0x80000000,           0x0 },
> +    [VIRT_DRAM_HIGH] =    { 0x100000000,          0x0 },
>   };
>   
>   /* PCIe high mmio is fixed for RV32 */
> @@ -295,15 +298,12 @@ static void create_fdt_socket_cpus(RISCVVirtState *s, int socket,
>       }
>   }
>   
> -static void create_fdt_socket_memory(RISCVVirtState *s,
> -                                     const MemMapEntry *memmap, int socket)
> +static void create_fdt_socket_mem_range(RISCVVirtState *s, uint64_t addr,
> +                                        uint64_t size, int socket)
>   {
>       char *mem_name;
> -    uint64_t addr, size;
>       MachineState *ms = MACHINE(s);
>   
> -    addr = memmap[VIRT_DRAM].base + riscv_socket_mem_offset(ms, socket);
> -    size = riscv_socket_mem_size(ms, socket);
>       mem_name = g_strdup_printf("/memory@%lx", (long)addr);
>       qemu_fdt_add_subnode(ms->fdt, mem_name);
>       qemu_fdt_setprop_cells(ms->fdt, mem_name, "reg",
> @@ -313,6 +313,34 @@ static void create_fdt_socket_memory(RISCVVirtState *s,
>       g_free(mem_name);
>   }
>   
> +static void create_fdt_socket_memory(RISCVVirtState *s,
> +                                     const MemMapEntry *memmap, int socket)
> +{
> +    uint64_t addr, size;
> +    MachineState *mc = MACHINE(s);
> +    uint64_t sock_offset = riscv_socket_mem_offset(mc, socket);
> +    uint64_t sock_size = riscv_socket_mem_size(mc, socket);
> +
> +    if (sock_offset < memmap[VIRT_DRAM].size) {
> +        uint64_t low_mem_end = memmap[VIRT_DRAM].base + memmap[VIRT_DRAM].size;
> +
> +        addr = memmap[VIRT_DRAM].base + sock_offset;
> +        size = MIN(low_mem_end - addr, sock_size);
> +        create_fdt_socket_mem_range(s, addr, size, socket);
> +
> +        size = sock_size - size;
> +        if (size > 0) {
> +            create_fdt_socket_mem_range(s, memmap[VIRT_DRAM_HIGH].base,
> +                                        size, socket);
> +        }
> +    } else {
> +        addr = memmap[VIRT_DRAM_HIGH].base +
> +               sock_offset - memmap[VIRT_DRAM].size;
> +
> +        create_fdt_socket_mem_range(s, addr, sock_size, socket);
> +    }
> +}
> +
>   static void create_fdt_socket_clint(RISCVVirtState *s,
>                                       const MemMapEntry *memmap, int socket,
>                                       uint32_t *intc_phandles)
> @@ -1334,10 +1362,12 @@ static void virt_machine_done(Notifier *notifier, void *data)
>   
>   static void virt_machine_init(MachineState *machine)
>   {
> -    const MemMapEntry *memmap = virt_memmap;
> +    MemMapEntry *memmap = virt_memmap;
>       RISCVVirtState *s = RISCV_VIRT_MACHINE(machine);
>       MemoryRegion *system_memory = get_system_memory();
>       MemoryRegion *mask_rom = g_new(MemoryRegion, 1);
> +    MemoryRegion *ram_below_4g, *ram_above_4g;
> +    uint64_t ram_size_low, ram_size_high;
>       char *soc_name;
>       DeviceState *mmio_irqchip, *virtio_irqchip, *pcie_irqchip;
>       int i, base_hartid, hart_count;
> @@ -1448,6 +1478,17 @@ static void virt_machine_init(MachineState *machine)
>           }
>       }
>   
> +    if (machine->ram_size > LOW_MEM) {
> +        ram_size_high = machine->ram_size - LOW_MEM;
> +        ram_size_low = LOW_MEM;
> +    } else {
> +        ram_size_high = 0;
> +        ram_size_low = machine->ram_size;
> +    }
> +
> +    memmap[VIRT_DRAM].size = ram_size_low;
> +    memmap[VIRT_DRAM_HIGH].size = ram_size_high;
> +
>       if (riscv_is_32bit(&s->soc[0])) {
>   #if HOST_LONG_BITS == 64
>           /* limit RAM size in a 32-bit system */
> @@ -1460,7 +1501,8 @@ static void virt_machine_init(MachineState *machine)
>           virt_high_pcie_memmap.size = VIRT32_HIGH_PCIE_MMIO_SIZE;
>       } else {
>           virt_high_pcie_memmap.size = VIRT64_HIGH_PCIE_MMIO_SIZE;
> -        virt_high_pcie_memmap.base = memmap[VIRT_DRAM].base + machine->ram_size;
> +        virt_high_pcie_memmap.base = memmap[VIRT_DRAM_HIGH].base +
> +                                     memmap[VIRT_DRAM_HIGH].size;
>           virt_high_pcie_memmap.base =
>               ROUND_UP(virt_high_pcie_memmap.base, virt_high_pcie_memmap.size);
>       }
> @@ -1468,8 +1510,22 @@ static void virt_machine_init(MachineState *machine)
>       s->memmap = virt_memmap;
>   
>       /* register system main memory (actual RAM) */
> +    ram_below_4g = g_malloc(sizeof(*ram_below_4g));
> +    memory_region_init_alias(ram_below_4g, NULL, "ram-below-4g", machine->ram,
> +                             0, memmap[VIRT_DRAM].size);
>       memory_region_add_subregion(system_memory, memmap[VIRT_DRAM].base,
> -        machine->ram);
> +                                ram_below_4g);
> +
> +    if (memmap[VIRT_DRAM_HIGH].size) {
> +        ram_above_4g = g_malloc(sizeof(*ram_above_4g));
> +        memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",
> +                                 machine->ram,
> +                                 memmap[VIRT_DRAM].size,
> +                                 memmap[VIRT_DRAM_HIGH].size);
> +        memory_region_add_subregion(system_memory,
> +                                    memmap[VIRT_DRAM_HIGH].base,
> +                                    ram_above_4g);
> +    }
>   
>       /* boot rom */
>       memory_region_init_rom(mask_rom, NULL, "riscv_virt_board.mrom",
> diff --git a/include/hw/riscv/virt.h b/include/hw/riscv/virt.h
> index e5c474b26e..36004eb6ef 100644
> --- a/include/hw/riscv/virt.h
> +++ b/include/hw/riscv/virt.h
> @@ -79,6 +79,7 @@ enum {
>       VIRT_IMSIC_S,
>       VIRT_FLASH,
>       VIRT_DRAM,
> +    VIRT_DRAM_HIGH,
>       VIRT_PCIE_MMIO,
>       VIRT_PCIE_PIO,
>       VIRT_PLATFORM_BUS,
Wu, Fei Aug. 1, 2023, 2:34 a.m. UTC | #2
On 8/1/2023 6:46 AM, Daniel Henrique Barboza wrote:
> [...]
> 
> I think we'll need some kernel/OpenSBI folks to weigh in here to see if
> there's a failsafe memory setup that won't break distros out there and
> still allows your passthrough to work.
> 
Thank you for the review. Yes, we need the whole boot stack to handle this
configuration consistently.

In my testing, U-Boot can read the FDT, but it doesn't handle the
multi-range physical memory situation well; it assumes a single contiguous
range, see fdtdec_setup_mem_size_base() in U-Boot:

int fdtdec_setup_mem_size_base(void)
{
        int ret;
        ofnode mem;
        struct resource res;

        mem = ofnode_path("/memory");
        if (!ofnode_valid(mem)) {
                debug("%s: Missing /memory node\n", __func__);
                return -EINVAL;
        }

        ret = ofnode_read_resource(mem, 0, &res);
        if (ret != 0) {
                debug("%s: Unable to decode first memory bank\n", __func__);
                return -EINVAL;
        }

        gd->ram_size = (phys_size_t)(res.end - res.start + 1);
        gd->ram_base = (unsigned long)res.start;
        debug("- Ram base: %08llX\n", (unsigned long long)gd->ram_base);
        debug("%s: Initial DRAM size %llx\n", __func__,
              (unsigned long long)gd->ram_size);

        return 0;
}

Also, board_get_usable_ram_top() returns 4G-1 for this setup, which is where
the faulting address TVAL: 00000000ff733f90 comes from. It actually does boot
if I swap the order of the two memory nodes in the FDT, but that's certainly
not a generic solution.

There is a similar issue in GRUB's grub_efi_mm_add_regions().
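A toy model of the first-bank-only behaviour described above (hypothetical code, not U-Boot's), contrasted with what a multi-range-aware consumer would need to compute:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct mem_bank { uint64_t base, size; };

/* First-bank-only view: only the first "reg" entry of /memory
 * contributes, as in fdtdec_setup_mem_size_base() quoted above. */
static uint64_t ram_size_first_bank(const struct mem_bank *b, size_t n)
{
    return n ? b[0].size : 0;
}

/* Multi-range-aware view: sum every bank. */
static uint64_t ram_size_all_banks(const struct mem_bank *b, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += b[i].size;
    }
    return total;
}
```

For an 8 GiB guest split as in this patch, the first-bank-only view sees just the 1 GiB low bank, which is consistent with the firmware misbehaving once RAM exceeds LOW_MEM.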

Thanks,
Fei.


Wu, Fei Aug. 3, 2023, 12:45 a.m. UTC | #3
On 8/1/2023 6:46 AM, Daniel Henrique Barboza wrote:
> [...]
> 
> I think we'll need some kernel/OpenSBI folks to weigh in here to see if
> there's a failsafe memory setup that won't break distros out there and
> still allows your passthrough to work.
> 
Hi Daniel,

Do you know who we should talk to? I think it's not uncommon to have
separate multi-range memory; the stack should support this configuration.

Thanks,
Fei.
> 
> 
> Thanks,
> 
> Daniel
> 
>>
>> Signed-off-by: Andrei Warkentin <andrei.warkentin@intel.com>
>> Signed-off-by: Fei Wu <fei2.wu@intel.com>
>> ---
>>   hw/riscv/virt.c         | 74 ++++++++++++++++++++++++++++++++++++-----
>>   include/hw/riscv/virt.h |  1 +
>>   2 files changed, 66 insertions(+), 9 deletions(-)
>>
>> diff --git a/hw/riscv/virt.c b/hw/riscv/virt.c
>> index d90286dc46..8fbdc7220c 100644
>> --- a/hw/riscv/virt.c
>> +++ b/hw/riscv/virt.c
>> @@ -75,7 +75,9 @@
>>   #error "Can't accomodate all IMSIC groups in address space"
>>   #endif
>>   -static const MemMapEntry virt_memmap[] = {
>> +#define LOW_MEM (1 * GiB)
>> +
>> +static MemMapEntry virt_memmap[] = {
>>       [VIRT_DEBUG] =        {        0x0,         0x100 },
>>       [VIRT_MROM] =         {     0x1000,        0xf000 },
>>       [VIRT_TEST] =         {   0x100000,        0x1000 },
>> @@ -96,6 +98,7 @@ static const MemMapEntry virt_memmap[] = {
>>       [VIRT_PCIE_ECAM] =    { 0x30000000,    0x10000000 },
>>       [VIRT_PCIE_MMIO] =    { 0x40000000,    0x40000000 },
>>       [VIRT_DRAM] =         { 0x80000000,           0x0 },
>> +    [VIRT_DRAM_HIGH] =    { 0x100000000,          0x0 },
>>   };
>>     /* PCIe high mmio is fixed for RV32 */
>> @@ -295,15 +298,12 @@ static void create_fdt_socket_cpus(RISCVVirtState *s, int socket,
>>       }
>>   }
>>   -static void create_fdt_socket_memory(RISCVVirtState *s,
>> -                                     const MemMapEntry *memmap, int socket)
>> +static void create_fdt_socket_mem_range(RISCVVirtState *s, uint64_t addr,
>> +                                        uint64_t size, int socket)
>>   {
>>       char *mem_name;
>> -    uint64_t addr, size;
>>       MachineState *ms = MACHINE(s);
>> -    addr = memmap[VIRT_DRAM].base + riscv_socket_mem_offset(ms, socket);
>> -    size = riscv_socket_mem_size(ms, socket);
>>       mem_name = g_strdup_printf("/memory@%lx", (long)addr);
>>       qemu_fdt_add_subnode(ms->fdt, mem_name);
>>       qemu_fdt_setprop_cells(ms->fdt, mem_name, "reg",
>> @@ -313,6 +313,34 @@ static void create_fdt_socket_memory(RISCVVirtState *s,
>>       g_free(mem_name);
>>   }
>>   +static void create_fdt_socket_memory(RISCVVirtState *s,
>> +                                     const MemMapEntry *memmap, int socket)
>> +{
>> +    uint64_t addr, size;
>> +    MachineState *mc = MACHINE(s);
>> +    uint64_t sock_offset = riscv_socket_mem_offset(mc, socket);
>> +    uint64_t sock_size = riscv_socket_mem_size(mc, socket);
>> +
>> +    if (sock_offset < memmap[VIRT_DRAM].size) {
>> +        uint64_t low_mem_end = memmap[VIRT_DRAM].base + memmap[VIRT_DRAM].size;
>> +
>> +        addr = memmap[VIRT_DRAM].base + sock_offset;
>> +        size = MIN(low_mem_end - addr, sock_size);
>> +        create_fdt_socket_mem_range(s, addr, size, socket);
>> +
>> +        size = sock_size - size;
>> +        if (size > 0) {
>> +            create_fdt_socket_mem_range(s, memmap[VIRT_DRAM_HIGH].base,
>> +                                        size, socket);
>> +        }
>> +    } else {
>> +        addr = memmap[VIRT_DRAM_HIGH].base +
>> +               sock_offset - memmap[VIRT_DRAM].size;
>> +
>> +        create_fdt_socket_mem_range(s, addr, sock_size, socket);
>> +    }
>> +}
>> +
>>   static void create_fdt_socket_clint(RISCVVirtState *s,
>>                                       const MemMapEntry *memmap, int socket,
>>                                       uint32_t *intc_phandles)
>> @@ -1334,10 +1362,12 @@ static void virt_machine_done(Notifier *notifier, void *data)
>>     static void virt_machine_init(MachineState *machine)
>>   {
>> -    const MemMapEntry *memmap = virt_memmap;
>> +    MemMapEntry *memmap = virt_memmap;
>>       RISCVVirtState *s = RISCV_VIRT_MACHINE(machine);
>>       MemoryRegion *system_memory = get_system_memory();
>>       MemoryRegion *mask_rom = g_new(MemoryRegion, 1);
>> +    MemoryRegion *ram_below_4g, *ram_above_4g;
>> +    uint64_t ram_size_low, ram_size_high;
>>       char *soc_name;
>>       DeviceState *mmio_irqchip, *virtio_irqchip, *pcie_irqchip;
>>       int i, base_hartid, hart_count;
>> @@ -1448,6 +1478,17 @@ static void virt_machine_init(MachineState *machine)
>>           }
>>       }
>>   +    if (machine->ram_size > LOW_MEM) {
>> +        ram_size_high = machine->ram_size - LOW_MEM;
>> +        ram_size_low = LOW_MEM;
>> +    } else {
>> +        ram_size_high = 0;
>> +        ram_size_low = machine->ram_size;
>> +    }
>> +
>> +    memmap[VIRT_DRAM].size = ram_size_low;
>> +    memmap[VIRT_DRAM_HIGH].size = ram_size_high;
>> +
>>       if (riscv_is_32bit(&s->soc[0])) {
>>   #if HOST_LONG_BITS == 64
>>           /* limit RAM size in a 32-bit system */
>> @@ -1460,7 +1501,8 @@ static void virt_machine_init(MachineState *machine)
>>           virt_high_pcie_memmap.size = VIRT32_HIGH_PCIE_MMIO_SIZE;
>>       } else {
>>           virt_high_pcie_memmap.size = VIRT64_HIGH_PCIE_MMIO_SIZE;
>> -        virt_high_pcie_memmap.base = memmap[VIRT_DRAM].base +
>> machine->ram_size;
>> +        virt_high_pcie_memmap.base = memmap[VIRT_DRAM_HIGH].base +
>> +                                     memmap[VIRT_DRAM_HIGH].size;
>>           virt_high_pcie_memmap.base =
>>               ROUND_UP(virt_high_pcie_memmap.base, virt_high_pcie_memmap.size);
>>       }
>> @@ -1468,8 +1510,22 @@ static void virt_machine_init(MachineState *machine)
>>       s->memmap = virt_memmap;
>>         /* register system main memory (actual RAM) */
>> +    ram_below_4g = g_malloc(sizeof(*ram_below_4g));
>> +    memory_region_init_alias(ram_below_4g, NULL, "ram-below-4g", machine->ram,
>> +                             0, memmap[VIRT_DRAM].size);
>>       memory_region_add_subregion(system_memory, memmap[VIRT_DRAM].base,
>> -        machine->ram);
>> +                                ram_below_4g);
>> +
>> +    if (memmap[VIRT_DRAM_HIGH].size) {
>> +        ram_above_4g = g_malloc(sizeof(*ram_above_4g));
>> +        memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",
>> +                                 machine->ram,
>> +                                 memmap[VIRT_DRAM].size,
>> +                                 memmap[VIRT_DRAM_HIGH].size);
>> +        memory_region_add_subregion(system_memory,
>> +                                    memmap[VIRT_DRAM_HIGH].base,
>> +                                    ram_above_4g);
>> +    }
>>         /* boot rom */
>>       memory_region_init_rom(mask_rom, NULL, "riscv_virt_board.mrom",
>> diff --git a/include/hw/riscv/virt.h b/include/hw/riscv/virt.h
>> index e5c474b26e..36004eb6ef 100644
>> --- a/include/hw/riscv/virt.h
>> +++ b/include/hw/riscv/virt.h
>> @@ -79,6 +79,7 @@ enum {
>>       VIRT_IMSIC_S,
>>       VIRT_FLASH,
>>       VIRT_DRAM,
>> +    VIRT_DRAM_HIGH,
>>       VIRT_PCIE_MMIO,
>>       VIRT_PCIE_PIO,
>>       VIRT_PLATFORM_BUS,
Andrew Jones Aug. 3, 2023, 3:07 p.m. UTC | #4
On Mon, Jul 31, 2023 at 09:53:17AM +0800, Fei Wu wrote:
> riscv virt platform's memory started at 0x80000000 and
> straddled the 4GiB boundary. Curiously enough, this choice
> of a memory layout will prevent from launching a VM with
> a bit more than 2000MiB and PCIe pass-thru on an x86 host, due
> to identity mapping requirements for the MSI doorbell on x86,
> and these (APIC/IOAPIC) live right below 4GiB.
> 
> So just split the RAM range into two portions:
> - 1 GiB range from 0x80000000 to 0xc0000000.
> - The remainder at 0x100000000
> 
> ...leaving a hole between the ranges.

Can you elaborate on the use case? Maybe provide details of the host
system and the QEMU command line? I'm wondering why we didn't have
any problems with the arm virt machine type. Has nobody tried this
use case with that? Is the use case something valid for riscv, but
not arm?

Thanks,
drew
Wu, Fei Aug. 4, 2023, 9:15 a.m. UTC | #5
On 8/3/2023 11:07 PM, Andrew Jones wrote:
> On Mon, Jul 31, 2023 at 09:53:17AM +0800, Fei Wu wrote:
>> riscv virt platform's memory started at 0x80000000 and
>> straddled the 4GiB boundary. Curiously enough, this choice
>> of a memory layout will prevent from launching a VM with
>> a bit more than 2000MiB and PCIe pass-thru on an x86 host, due
>> to identity mapping requirements for the MSI doorbell on x86,
>> and these (APIC/IOAPIC) live right below 4GiB.
>>
>> So just split the RAM range into two portions:
>> - 1 GiB range from 0x80000000 to 0xc0000000.
>> - The remainder at 0x100000000
>>
>> ...leaving a hole between the ranges.
> 
> Can you elaborate on the use case? Maybe provide details of the host
> system and the QEMU command line? I'm wondering why we didn't have
> any problems with the arm virt machine type. Has nobody tried this
> use case with that? Is the use case something valid for riscv, but
> not arm?
> 
Firstly we have to enable pcie passthru on host, find the device groups,
e.g. the vga card, and add their pci ids to host kernel cmdline:
	vfio-pci.ids=10de:0f02,10de:0e08

then start vm through qemu as follows:
$Q -machine virt -m 4G -smp 4 -nographic \
  -bios /usr/lib/riscv64-linux-gnu/opensbi/generic/fw_jump.elf \
  -kernel ./vmlinuz -initrd initrd.img -append "root=/dev/vda1 rw" \
  -drive file=ubuntu-22.04.1-preinstalled-server-riscv64+unmatched.img,if=virtio,format=raw \
  -device vfio-pci,host=01:00.0 -device vfio-pci,host=01:00.1 \
  -netdev user,id=vnet,hostfwd=:127.0.0.1:2223-:22 -device virtio-net-pci,netdev=vnet

Without this patch, qemu exits immediately instead of boots up.

Just tried pcie passthru on arm, it cannot handle 4G memory either.
$Q -m 4G -smp 4 -cpu max -M virt -nographic \
  -pflash /usr/share/AAVMF/AAVMF_CODE.fd -pflash flash1.img \
  -drive if=none,file=ubuntu-22.04-server-cloudimg-arm64.img,id=hd0 \
  -device virtio-blk-device,drive=hd0 \
  -device vfio-pci,host=01:00.0 -device vfio-pci,host=01:00.1

qemu-system-aarch64: -device vfio-pci,host=01:00.0: VFIO_MAP_DMA failed: Invalid argument
qemu-system-aarch64: -device vfio-pci,host=01:00.0: vfio 0000:01:00.0: failed to setup container for group 11: memory listener initialization failed: Region mach-virt.ram: vfio_dma_map(0x55de3c2a97f0, 0x40000000, 0x100000000, 0x7f8fcbe00000) = -22 (Invalid argument)

Thanks,
Fei.

> Thanks,
> drew
Alistair Francis Sept. 7, 2023, 3:17 a.m. UTC | #6
On Thu, Aug 3, 2023 at 10:47 AM Wu, Fei <fei2.wu@intel.com> wrote:
>
> On 8/1/2023 6:46 AM, Daniel Henrique Barboza wrote:
> >
> >
> > On 7/30/23 22:53, Fei Wu wrote:
> >> riscv virt platform's memory started at 0x80000000 and
> >> straddled the 4GiB boundary. Curiously enough, this choice
> >> of a memory layout will prevent from launching a VM with
> >> a bit more than 2000MiB and PCIe pass-thru on an x86 host, due
> >> to identity mapping requirements for the MSI doorbell on x86,
> >> and these (APIC/IOAPIC) live right below 4GiB.
> >>
> >> So just split the RAM range into two portions:
> >> - 1 GiB range from 0x80000000 to 0xc0000000.
> >> - The remainder at 0x100000000
> >>
> >> ...leaving a hole between the ranges.
> >
> > I am afraid this breaks some existing distro setups, like Ubuntu. After
> > this patch this emulation stopped working:
> >
> > ~/work/qemu/build/qemu-system-riscv64 \
> >     -machine virt -nographic -m 8G -smp 8 \
> >     -kernel ./uboot-ubuntu/usr/lib/u-boot/qemu-riscv64_smode/uboot.elf \
> >     -drive file=snapshot.img,format=qcow2,if=virtio \
> >     -netdev bridge,id=bridge1,br=virbr0 -device virtio-net-pci,netdev=bridge1
> >
> >
> > This is basically a guest created via the official Canonical tutorial:
> >
> > https://wiki.ubuntu.com/RISC-V/QEMU
> >
> > The error being thrown:
> >
> > =================
> >
> > Boot HART ID              : 4
> > Boot HART Domain          : root
> > Boot HART Priv Version    : v1.12
> > Boot HART Base ISA        : rv64imafdch
> > Boot HART ISA Extensions  : time,sstc
> > Boot HART PMP Count       : 16
> > Boot HART PMP Granularity : 4
> > Boot HART PMP Address Bits: 54
> > Boot HART MHPM Count      : 16
> > Boot HART MIDELEG         : 0x0000000000001666
> > Boot HART MEDELEG         : 0x0000000000f0b509
> >
> >
> > U-Boot 2022.07+dfsg-1ubuntu4.2 (Nov 24 2022 - 18:47:41 +0000)
> >
> > CPU:
> > rv64imafdch_zicbom_zicboz_zicsr_zifencei_zihintpause_zawrs_zfa_zca_zcd_zba_zbb_zbc_zbs_sstc_svadu
> > Model: riscv-virtio,qemu
> > DRAM:  Unhandled exception: Store/AMO access fault
> > EPC: 00000000802018b8 RA: 00000000802126a0 TVAL: 00000000ff733f90
> >
> > Code: b823 06b2 bc23 06b2 b023 08b2 b423 08b2 (b823 08b2)
> >
> >
> > resetting ...
> > System reset not supported on this platform
> > ### ERROR ### Please RESET the board ###
> > QEMU 8.0.90 monitor - type 'help' for more infor
> > =================
> >
> >
> > Based on the change made I can make an educated guess on what is going
> > wrong. We have another board with a similar memory topology you're making
> > here, the Microchip Polarfire (microchip_pfsoc.c). We were having some
> > problems with this board while trying to consolidate the boot process
> > between all boards in hw/riscv/boot.c because of its non-continuous RAM
> > bank. The full story can be read in the commit message of 4b402886ac89
> > ("hw/riscv: change riscv_compute_fdt_addr() semantics") but the short
> > version can be seen in riscv_compute_fdt_addr() from boot.c:
> >
> > - if ram_start is less than 3072 MiB, the FDT will be put at the lowest
> > value between 3072 MiB and the end of that RAM bank;
> >
> > - if ram_start is higher than 3072 MiB the FDT will be put at the end of
> > the RAM bank.
> >
> > So, after this patch, since riscv_compute_fdt_addr() is being used with
> > the now lower RAM bank, the fdt is being put in LOW_MEM - fdt_size for
> > any setup that has more than 1Gb RAM, and this breaks assumptions made by
> > uboot and Ubuntu and possibly others that are trying to retrieve the FDT
> > from the gap that you created between low and hi mem in this patch.
> >
> > In fact, this same Ubuntu guest I mentioned above will boot if I put only
> > 1 Gb of RAM (-m 1Gb). If I try with -m 1.1Gb I reproduce this error. This
> > can be a validation of the guess I'm making here: Ubuntu is trying to
> > fetch stuff (probably the fdt) from the gap between the memory areas.
> >
> > This change on top of this patch doesn't work either:
> >
> > $ git diff
> > diff --git a/hw/riscv/virt.c b/hw/riscv/virt.c
> > index 8fbdc7220c..dfff48d849 100644
> > --- a/hw/riscv/virt.c
> > +++ b/hw/riscv/virt.c
> > @@ -1335,9 +1335,16 @@ static void virt_machine_done(Notifier *notifier, void *data)
> >                                           kernel_start_addr, true, NULL);
> >      }
> >
> > -    fdt_load_addr = riscv_compute_fdt_addr(memmap[VIRT_DRAM].base,
> > +    if (machine->ram_size < memmap[VIRT_DRAM].size) {
> > +        fdt_load_addr = riscv_compute_fdt_addr(memmap[VIRT_DRAM].base,
> >                                             memmap[VIRT_DRAM].size,
> >                                             machine);
> > +    } else {
> > +        fdt_load_addr = riscv_compute_fdt_addr(memmap[VIRT_DRAM_HIGH].base,
> > +                                           memmap[VIRT_DRAM_HIGH].size,
> > +                                           machine);
> > +    }
> > +
> >
> > This would put the fdt at the end of the HI RAM for guests with more than
> > 1Gb of RAM. This change in fact makes the situation even worse, breaking
> > setups that were working before with this patch.
> >
> > There's a chance that reducing the gap between the RAM banks can make
> > Ubuntu work reliably again, but now I'm getting cold feet about this change.
> >
> >
> > I think we'll need some kernel/OpenSBI folks to weigh in here to see if
> > there's a failsafe memory setup that won't break distros out there and
> > allow your passthrough to work.
> >
> Hi Daniel,
>
> Do you know who we should talk to? I think it's not uncommon to have
> separate multi-range memory; the stack should support this configuration.

@Palmer Dabbelt @Anup Patel any ideas?

Alistair
Philippe Mathieu-Daudé Sept. 7, 2023, 7:16 a.m. UTC | #7
Widening Cc to ARM/VFIO.

On 4/8/23 11:15, Wu, Fei wrote:
> On 8/3/2023 11:07 PM, Andrew Jones wrote:
>> On Mon, Jul 31, 2023 at 09:53:17AM +0800, Fei Wu wrote:
>>> riscv virt platform's memory started at 0x80000000 and
>>> straddled the 4GiB boundary. Curiously enough, this choice
>>> of a memory layout will prevent from launching a VM with
>>> a bit more than 2000MiB and PCIe pass-thru on an x86 host, due
>>> to identity mapping requirements for the MSI doorbell on x86,
>>> and these (APIC/IOAPIC) live right below 4GiB.
>>>
>>> So just split the RAM range into two portions:
>>> - 1 GiB range from 0x80000000 to 0xc0000000.
>>> - The remainder at 0x100000000
>>>
>>> ...leaving a hole between the ranges.
>>
>> Can you elaborate on the use case? Maybe provide details of the host
>> system and the QEMU command line? I'm wondering why we didn't have
>> any problems with the arm virt machine type. Has nobody tried this
>> use case with that? Is the use case something valid for riscv, but
>> not arm?
>>
> Firstly we have to enable pcie passthru on host, find the device groups,
> e.g. the vga card, and add their pci ids to host kernel cmdline:
> 	vfio-pci.ids=10de:0f02,10de:0e08
> 
> then start vm through qemu as follows:
> $Q -machine virt -m 4G -smp 4 -nographic \
>    -bios /usr/lib/riscv64-linux-gnu/opensbi/generic/fw_jump.elf \
>    -kernel ./vmlinuz -initrd initrd.img -append "root=/dev/vda1 rw" \
>    -drive
> file=ubuntu-22.04.1-preinstalled-server-riscv64+unmatched.img,if=virtio,format=raw
> \
>    -device vfio-pci,host=01:00.0 -device vfio-pci,host=01:00.1 \
>    -netdev user,id=vnet,hostfwd=:127.0.0.1:2223-:22 -device
> virtio-net-pci,netdev=vnet
> 
> Without this patch, qemu exits immediately instead of boots up.
> 
> Just tried pcie passthru on arm, it cannot handle 4G memory either.
> $Q -m 4G -smp 4 -cpu max -M virt -nographic \
>    -pflash /usr/share/AAVMF/AAVMF_CODE.fd -pflash flash1.img \
>    -drive if=none,file=ubuntu-22.04-server-cloudimg-arm64.img,id=hd0 \
>    -device virtio-blk-device,drive=hd0 \
>    -device vfio-pci,host=01:00.0 -device vfio-pci,host=01:00.1
> 
> qemu-system-aarch64: -device vfio-pci,host=01:00.0: VFIO_MAP_DMA failed:
> Invalid argument
> qemu-system-aarch64: -device vfio-pci,host=01:00.0: vfio 0000:01:00.0:
> failed to setup container for group 11: memory listener initialization
> failed: Region mach-virt.ram: vfio_dma_map(0x55de3c2a97f0, 0x40000000,
> 0x100000000, 0x7f8fcbe00000) = -22 (Invalid argument)
> 
> Thanks,
> Fei.
> 
>> Thanks,
>> drew
> 
>
Eric Auger Sept. 7, 2023, 9:10 a.m. UTC | #8
Hi,

On 9/7/23 09:16, Philippe Mathieu-Daudé wrote:
> Widening Cc to ARM/VFIO.
>
> On 4/8/23 11:15, Wu, Fei wrote:
>> On 8/3/2023 11:07 PM, Andrew Jones wrote:
>>> On Mon, Jul 31, 2023 at 09:53:17AM +0800, Fei Wu wrote:
>>>> riscv virt platform's memory started at 0x80000000 and
>>>> straddled the 4GiB boundary. Curiously enough, this choice
>>>> of a memory layout will prevent from launching a VM with
>>>> a bit more than 2000MiB and PCIe pass-thru on an x86 host, due
>>>> to identity mapping requirements for the MSI doorbell on x86,
>>>> and these (APIC/IOAPIC) live right below 4GiB.
>>>>
>>>> So just split the RAM range into two portions:
>>>> - 1 GiB range from 0x80000000 to 0xc0000000.
>>>> - The remainder at 0x100000000
>>>>
>>>> ...leaving a hole between the ranges.
>>>
>>> Can you elaborate on the use case? Maybe provide details of the host
>>> system and the QEMU command line? I'm wondering why we didn't have
>>> any problems with the arm virt machine type. Has nobody tried this
>>> use case with that? Is the use case something valid for riscv, but
>>> not arm?
>>>
>> Firstly we have to enable pcie passthru on host, find the device groups,
>> e.g. the vga card, and add their pci ids to host kernel cmdline:
>>     vfio-pci.ids=10de:0f02,10de:0e08
>>
>> then start vm through qemu as follows:
>> $Q -machine virt -m 4G -smp 4 -nographic \
>>    -bios /usr/lib/riscv64-linux-gnu/opensbi/generic/fw_jump.elf \
>>    -kernel ./vmlinuz -initrd initrd.img -append "root=/dev/vda1 rw" \
>>    -drive
>> file=ubuntu-22.04.1-preinstalled-server-riscv64+unmatched.img,if=virtio,format=raw
>>
>> \
>>    -device vfio-pci,host=01:00.0 -device vfio-pci,host=01:00.1 \
>>    -netdev user,id=vnet,hostfwd=:127.0.0.1:2223-:22 -device
>> virtio-net-pci,netdev=vnet
>>
>> Without this patch, qemu exits immediately instead of boots up.
>>
>> Just tried pcie passthru on arm, it cannot handle 4G memory either.
>> $Q -m 4G -smp 4 -cpu max -M virt -nographic \
>>    -pflash /usr/share/AAVMF/AAVMF_CODE.fd -pflash flash1.img \
>>    -drive if=none,file=ubuntu-22.04-server-cloudimg-arm64.img,id=hd0 \
>>    -device virtio-blk-device,drive=hd0 \
>>    -device vfio-pci,host=01:00.0 -device vfio-pci,host=01:00.1
>>
>> qemu-system-aarch64: -device vfio-pci,host=01:00.0: VFIO_MAP_DMA failed:
>> Invalid argument
>> qemu-system-aarch64: -device vfio-pci,host=01:00.0: vfio 0000:01:00.0:
>> failed to setup container for group 11: memory listener initialization
>> failed: Region mach-virt.ram: vfio_dma_map(0x55de3c2a97f0, 0x40000000,
>> 0x100000000, 0x7f8fcbe00000) = -22 (Invalid argument)

The collision between the x86 host MSI reserved region [0xfee00000,
0xfeefffff] and the ARM guest RAM starting at 1 GiB has also always
existed. But now this collision is properly detected instead of being
silently ignored. People have not really complained about this so far.
Since the existing guest RAM layout can't be changed, I am afraid we
can't do much.

Eric
>>
>> Thanks,
>> Fei.
>>
>>> Thanks,
>>> drew
>>
>>
>
Wu, Fei Sept. 7, 2023, 10:04 a.m. UTC | #9
On 9/7/2023 5:10 PM, Eric Auger wrote:
> Hi,
> 
> On 9/7/23 09:16, Philippe Mathieu-Daudé wrote:
>> Widening Cc to ARM/VFIO.
>>
>> On 4/8/23 11:15, Wu, Fei wrote:
>>> On 8/3/2023 11:07 PM, Andrew Jones wrote:
>>>> On Mon, Jul 31, 2023 at 09:53:17AM +0800, Fei Wu wrote:
>>>>> riscv virt platform's memory started at 0x80000000 and
>>>>> straddled the 4GiB boundary. Curiously enough, this choice
>>>>> of a memory layout will prevent from launching a VM with
>>>>> a bit more than 2000MiB and PCIe pass-thru on an x86 host, due
>>>>> to identity mapping requirements for the MSI doorbell on x86,
>>>>> and these (APIC/IOAPIC) live right below 4GiB.
>>>>>
>>>>> So just split the RAM range into two portions:
>>>>> - 1 GiB range from 0x80000000 to 0xc0000000.
>>>>> - The remainder at 0x100000000
>>>>>
>>>>> ...leaving a hole between the ranges.
>>>>
>>>> Can you elaborate on the use case? Maybe provide details of the host
>>>> system and the QEMU command line? I'm wondering why we didn't have
>>>> any problems with the arm virt machine type. Has nobody tried this
>>>> use case with that? Is the use case something valid for riscv, but
>>>> not arm?
>>>>
>>> Firstly we have to enable pcie passthru on host, find the device groups,
>>> e.g. the vga card, and add their pci ids to host kernel cmdline:
>>>     vfio-pci.ids=10de:0f02,10de:0e08
>>>
>>> then start vm through qemu as follows:
>>> $Q -machine virt -m 4G -smp 4 -nographic \
>>>    -bios /usr/lib/riscv64-linux-gnu/opensbi/generic/fw_jump.elf \
>>>    -kernel ./vmlinuz -initrd initrd.img -append "root=/dev/vda1 rw" \
>>>    -drive
>>> file=ubuntu-22.04.1-preinstalled-server-riscv64+unmatched.img,if=virtio,format=raw
>>>
>>> \
>>>    -device vfio-pci,host=01:00.0 -device vfio-pci,host=01:00.1 \
>>>    -netdev user,id=vnet,hostfwd=:127.0.0.1:2223-:22 -device
>>> virtio-net-pci,netdev=vnet
>>>
>>> Without this patch, qemu exits immediately instead of boots up.
>>>
>>> Just tried pcie passthru on arm, it cannot handle 4G memory either.
>>> $Q -m 4G -smp 4 -cpu max -M virt -nographic \
>>>    -pflash /usr/share/AAVMF/AAVMF_CODE.fd -pflash flash1.img \
>>>    -drive if=none,file=ubuntu-22.04-server-cloudimg-arm64.img,id=hd0 \
>>>    -device virtio-blk-device,drive=hd0 \
>>>    -device vfio-pci,host=01:00.0 -device vfio-pci,host=01:00.1
>>>
>>> qemu-system-aarch64: -device vfio-pci,host=01:00.0: VFIO_MAP_DMA failed:
>>> Invalid argument
>>> qemu-system-aarch64: -device vfio-pci,host=01:00.0: vfio 0000:01:00.0:
>>> failed to setup container for group 11: memory listener initialization
>>> failed: Region mach-virt.ram: vfio_dma_map(0x55de3c2a97f0, 0x40000000,
>>> 0x100000000, 0x7f8fcbe00000) = -22 (Invalid argument)
> 
> The collision between the x86 host MSI reserved region [0xfee00000,
> 0xfeefffff] and the ARM guest RAM starting at 1GB has also always
> existed. But now this collision is properly detected instead of being
> silenced. People have not really complained about this so far. Since the
> existing guest RAM layout couldn't be changed, I am afraid we couldn't
> do much.
> 
Just as this patch does for riscv, couldn't an arm guest on x86 adjust
its guest RAM layout the same way if necessary? This looks like a
nice-to-have feature for arm if there is still a requirement to run arm
guests on x86.

Thanks,
Fei.

> Eric
>>>
>>> Thanks,
>>> Fei.
>>>
>>>> Thanks,
>>>> drew
>>>
>>>
>>
>
Eric Auger Sept. 7, 2023, 1:49 p.m. UTC | #10
Hi,

On 9/7/23 12:04, Wu, Fei wrote:
> On 9/7/2023 5:10 PM, Eric Auger wrote:
>> Hi,
>>
>> On 9/7/23 09:16, Philippe Mathieu-Daudé wrote:
>>> Widening Cc to ARM/VFIO.
>>>
>>> On 4/8/23 11:15, Wu, Fei wrote:
>>>> On 8/3/2023 11:07 PM, Andrew Jones wrote:
>>>>> On Mon, Jul 31, 2023 at 09:53:17AM +0800, Fei Wu wrote:
>>>>>> riscv virt platform's memory started at 0x80000000 and
>>>>>> straddled the 4GiB boundary. Curiously enough, this choice
>>>>>> of a memory layout will prevent from launching a VM with
>>>>>> a bit more than 2000MiB and PCIe pass-thru on an x86 host, due
>>>>>> to identity mapping requirements for the MSI doorbell on x86,
>>>>>> and these (APIC/IOAPIC) live right below 4GiB.
>>>>>>
>>>>>> So just split the RAM range into two portions:
>>>>>> - 1 GiB range from 0x80000000 to 0xc0000000.
>>>>>> - The remainder at 0x100000000
>>>>>>
>>>>>> ...leaving a hole between the ranges.
>>>>> Can you elaborate on the use case? Maybe provide details of the host
>>>>> system and the QEMU command line? I'm wondering why we didn't have
>>>>> any problems with the arm virt machine type. Has nobody tried this
>>>>> use case with that? Is the use case something valid for riscv, but
>>>>> not arm?
>>>>>
>>>> Firstly we have to enable pcie passthru on host, find the device groups,
>>>> e.g. the vga card, and add their pci ids to host kernel cmdline:
>>>>     vfio-pci.ids=10de:0f02,10de:0e08
>>>>
>>>> then start vm through qemu as follows:
>>>> $Q -machine virt -m 4G -smp 4 -nographic \
>>>>    -bios /usr/lib/riscv64-linux-gnu/opensbi/generic/fw_jump.elf \
>>>>    -kernel ./vmlinuz -initrd initrd.img -append "root=/dev/vda1 rw" \
>>>>    -drive
>>>> file=ubuntu-22.04.1-preinstalled-server-riscv64+unmatched.img,if=virtio,format=raw
>>>>
>>>> \
>>>>    -device vfio-pci,host=01:00.0 -device vfio-pci,host=01:00.1 \
>>>>    -netdev user,id=vnet,hostfwd=:127.0.0.1:2223-:22 -device
>>>> virtio-net-pci,netdev=vnet
>>>>
>>>> Without this patch, qemu exits immediately instead of boots up.
>>>>
>>>> Just tried pcie passthru on arm, it cannot handle 4G memory either.
>>>> $Q -m 4G -smp 4 -cpu max -M virt -nographic \
>>>>    -pflash /usr/share/AAVMF/AAVMF_CODE.fd -pflash flash1.img \
>>>>    -drive if=none,file=ubuntu-22.04-server-cloudimg-arm64.img,id=hd0 \
>>>>    -device virtio-blk-device,drive=hd0 \
>>>>    -device vfio-pci,host=01:00.0 -device vfio-pci,host=01:00.1
>>>>
>>>> qemu-system-aarch64: -device vfio-pci,host=01:00.0: VFIO_MAP_DMA failed:
>>>> Invalid argument
>>>> qemu-system-aarch64: -device vfio-pci,host=01:00.0: vfio 0000:01:00.0:
>>>> failed to setup container for group 11: memory listener initialization
>>>> failed: Region mach-virt.ram: vfio_dma_map(0x55de3c2a97f0, 0x40000000,
>>>> 0x100000000, 0x7f8fcbe00000) = -22 (Invalid argument)
>> The collision between the x86 host MSI reserved region [0xfee00000,
>> 0xfeefffff] and the ARM guest RAM starting at 1GB has also always
>> existed. But now this collision is properly detected instead of being
>> silenced. People have not really complained about this so far. Since the
>> existing guest RAM layout couldn't be changed, I am afraid we couldn't
>> do much.
>>
> Just as what this patch does for riscv, arm guest on x86 could do the
> same thing to adjust the guest RAM layout if necessary? This looks like
> a nice-to-have feature for arm if there still is a requirement to run
> arm guest on x86.

Well, Peter was opposed to any change in the legacy RAM layout. At some
point on ARM VIRT we added some extra RAM but never changed the existing
layout. But actually nobody has ever complained about the lack of support
for this use case (a TCG arm guest with a host-assigned device on an x86
host).

Eric
>
> Thanks,
> Fei.
>
>> Eric
>>>> Thanks,
>>>> Fei.
>>>>
>>>>> Thanks,
>>>>> drew
>>>>
Anup Patel Sept. 7, 2023, 3:46 p.m. UTC | #11
On Tue, Aug 1, 2023 at 4:16 AM Daniel Henrique Barboza
<dbarboza@ventanamicro.com> wrote:
>
>
>
> On 7/30/23 22:53, Fei Wu wrote:
> > riscv virt platform's memory started at 0x80000000 and
> > straddled the 4GiB boundary. Curiously enough, this choice
> > of a memory layout will prevent from launching a VM with
> > a bit more than 2000MiB and PCIe pass-thru on an x86 host, due
> > to identity mapping requirements for the MSI doorbell on x86,
> > and these (APIC/IOAPIC) live right below 4GiB.
> >
> > So just split the RAM range into two portions:
> > - 1 GiB range from 0x80000000 to 0xc0000000.
> > - The remainder at 0x100000000
> >
> > ...leaving a hole between the ranges.
>
> I am afraid this breaks some existing distro setups, like Ubuntu. After this patch
> this emulation stopped working:
>
> ~/work/qemu/build/qemu-system-riscv64 \
>         -machine virt -nographic -m 8G -smp 8 \
>         -kernel ./uboot-ubuntu/usr/lib/u-boot/qemu-riscv64_smode/uboot.elf \
>         -drive file=snapshot.img,format=qcow2,if=virtio \
>         -netdev bridge,id=bridge1,br=virbr0 -device virtio-net-pci,netdev=bridge1
>
>
> This is basically a guest created via the official Canonical tutorial:
>
> https://wiki.ubuntu.com/RISC-V/QEMU
>
> The error being thrown:
>
> =================
>
> Boot HART ID              : 4
> Boot HART Domain          : root
> Boot HART Priv Version    : v1.12
> Boot HART Base ISA        : rv64imafdch
> Boot HART ISA Extensions  : time,sstc
> Boot HART PMP Count       : 16
> Boot HART PMP Granularity : 4
> Boot HART PMP Address Bits: 54
> Boot HART MHPM Count      : 16
> Boot HART MIDELEG         : 0x0000000000001666
> Boot HART MEDELEG         : 0x0000000000f0b509
>
>
> U-Boot 2022.07+dfsg-1ubuntu4.2 (Nov 24 2022 - 18:47:41 +0000)
>
> CPU:   rv64imafdch_zicbom_zicboz_zicsr_zifencei_zihintpause_zawrs_zfa_zca_zcd_zba_zbb_zbc_zbs_sstc_svadu
> Model: riscv-virtio,qemu
> DRAM:  Unhandled exception: Store/AMO access fault
> EPC: 00000000802018b8 RA: 00000000802126a0 TVAL: 00000000ff733f90
>
> Code: b823 06b2 bc23 06b2 b023 08b2 b423 08b2 (b823 08b2)
>
>
> resetting ...
> System reset not supported on this platform
> ### ERROR ### Please RESET the board ###
> QEMU 8.0.90 monitor - type 'help' for more infor
> =================

Can you try again after setting CONFIG_NR_DRAM_BANKS=2 in
qemu-riscv64_smode_defconfig and qemu-riscv64_spl_defconfig ?

Regards,
Anup

>
>
> Based on the change made I can make an educated guess on what is going wrong.
> We have another board with a similar memory topology you're making here, the
> Microchip Polarfire (microchip_pfsoc.c). We were having some problems with this
> board while trying to consolidate the boot process between all boards in
> hw/riscv/boot.c because of its non-continuous RAM bank. The full story can be
> read in the commit message of 4b402886ac89 ("hw/riscv: change riscv_compute_fdt_addr()
> semantics") but the short version can be seen in riscv_compute_fdt_addr()
> from boot.c:
>
>   - if ram_start is less than 3072 MiB, the FDT will be put at the lowest value
> between 3072 MiB and the end of that RAM bank;
>
> - if ram_start is higher than 3072 MiB the FDT will be put at the end of the
> RAM bank.
>
> So, after this patch, since riscv_compute_fdt_addr() is being used with the now
> lower RAM bank, the FDT is being put at LOW_MEM - fdt_size for any setup that has
> more than 1 GiB of RAM, and this breaks assumptions made by U-Boot and Ubuntu and possibly
> others that are trying to retrieve the FDT from the gap that you created between
> low and high mem in this patch.
>
> In fact, this same Ubuntu guest I mentioned above will boot if I put in only 1 GiB of RAM
> (-m 1Gb). If I try with -m 1.1Gb I reproduce this error. This supports the guess
> I'm making here: Ubuntu is trying to fetch stuff (probably the FDT) from
> the gap between the memory areas.
>
> This change on top of this patch doesn't work either:
>
> $ git diff
> diff --git a/hw/riscv/virt.c b/hw/riscv/virt.c
> index 8fbdc7220c..dfff48d849 100644
> --- a/hw/riscv/virt.c
> +++ b/hw/riscv/virt.c
> @@ -1335,9 +1335,16 @@ static void virt_machine_done(Notifier *notifier, void *data)
>                                            kernel_start_addr, true, NULL);
>       }
>
> -    fdt_load_addr = riscv_compute_fdt_addr(memmap[VIRT_DRAM].base,
> +    if (machine->ram_size < memmap[VIRT_DRAM].size) {
> +        fdt_load_addr = riscv_compute_fdt_addr(memmap[VIRT_DRAM].base,
>                                              memmap[VIRT_DRAM].size,
>                                              machine);
> +    } else {
> +        fdt_load_addr = riscv_compute_fdt_addr(memmap[VIRT_DRAM_HIGH].base,
> +                                           memmap[VIRT_DRAM_HIGH].size,
> +                                           machine);
> +    }
> +
>
>
> This would put the FDT at the end of the high RAM bank for guests with more than 1 GiB of RAM.
> This change in fact makes the situation even worse, breaking setups that were
> working with this patch before.
>
> There's a chance that reducing the gap between the RAM banks can make Ubuntu work
> reliably again, but now I'm getting a little cold feet about this change.
>
>
> I think we'll need some kernel/OpenSBI folks to weigh in here to see if there's a
> failsafe memory setup that won't break distros out there and still allows your
> passthrough to work.
>
>
>
> Thanks,
>
> Daniel
>
> >
> > Signed-off-by: Andrei Warkentin <andrei.warkentin@intel.com>
> > Signed-off-by: Fei Wu <fei2.wu@intel.com>
> > ---
> >   hw/riscv/virt.c         | 74 ++++++++++++++++++++++++++++++++++++-----
> >   include/hw/riscv/virt.h |  1 +
> >   2 files changed, 66 insertions(+), 9 deletions(-)
> >
> > diff --git a/hw/riscv/virt.c b/hw/riscv/virt.c
> > index d90286dc46..8fbdc7220c 100644
> > --- a/hw/riscv/virt.c
> > +++ b/hw/riscv/virt.c
> > @@ -75,7 +75,9 @@
> >   #error "Can't accomodate all IMSIC groups in address space"
> >   #endif
> >
> > -static const MemMapEntry virt_memmap[] = {
> > +#define LOW_MEM (1 * GiB)
> > +
> > +static MemMapEntry virt_memmap[] = {
> >       [VIRT_DEBUG] =        {        0x0,         0x100 },
> >       [VIRT_MROM] =         {     0x1000,        0xf000 },
> >       [VIRT_TEST] =         {   0x100000,        0x1000 },
> > @@ -96,6 +98,7 @@ static const MemMapEntry virt_memmap[] = {
> >       [VIRT_PCIE_ECAM] =    { 0x30000000,    0x10000000 },
> >       [VIRT_PCIE_MMIO] =    { 0x40000000,    0x40000000 },
> >       [VIRT_DRAM] =         { 0x80000000,           0x0 },
> > +    [VIRT_DRAM_HIGH] =    { 0x100000000,          0x0 },
> >   };
> >
> >   /* PCIe high mmio is fixed for RV32 */
> > @@ -295,15 +298,12 @@ static void create_fdt_socket_cpus(RISCVVirtState *s, int socket,
> >       }
> >   }
> >
> > -static void create_fdt_socket_memory(RISCVVirtState *s,
> > -                                     const MemMapEntry *memmap, int socket)
> > +static void create_fdt_socket_mem_range(RISCVVirtState *s, uint64_t addr,
> > +                                        uint64_t size, int socket)
> >   {
> >       char *mem_name;
> > -    uint64_t addr, size;
> >       MachineState *ms = MACHINE(s);
> >
> > -    addr = memmap[VIRT_DRAM].base + riscv_socket_mem_offset(ms, socket);
> > -    size = riscv_socket_mem_size(ms, socket);
> >       mem_name = g_strdup_printf("/memory@%lx", (long)addr);
> >       qemu_fdt_add_subnode(ms->fdt, mem_name);
> >       qemu_fdt_setprop_cells(ms->fdt, mem_name, "reg",
> > @@ -313,6 +313,34 @@ static void create_fdt_socket_memory(RISCVVirtState *s,
> >       g_free(mem_name);
> >   }
> >
> > +static void create_fdt_socket_memory(RISCVVirtState *s,
> > +                                     const MemMapEntry *memmap, int socket)
> > +{
> > +    uint64_t addr, size;
> > +    MachineState *mc = MACHINE(s);
> > +    uint64_t sock_offset = riscv_socket_mem_offset(mc, socket);
> > +    uint64_t sock_size = riscv_socket_mem_size(mc, socket);
> > +
> > +    if (sock_offset < memmap[VIRT_DRAM].size) {
> > +        uint64_t low_mem_end = memmap[VIRT_DRAM].base + memmap[VIRT_DRAM].size;
> > +
> > +        addr = memmap[VIRT_DRAM].base + sock_offset;
> > +        size = MIN(low_mem_end - addr, sock_size);
> > +        create_fdt_socket_mem_range(s, addr, size, socket);
> > +
> > +        size = sock_size - size;
> > +        if (size > 0) {
> > +            create_fdt_socket_mem_range(s, memmap[VIRT_DRAM_HIGH].base,
> > +                                        size, socket);
> > +        }
> > +    } else {
> > +        addr = memmap[VIRT_DRAM_HIGH].base +
> > +               sock_offset - memmap[VIRT_DRAM].size;
> > +
> > +        create_fdt_socket_mem_range(s, addr, sock_size, socket);
> > +    }
> > +}
> > +
> >   static void create_fdt_socket_clint(RISCVVirtState *s,
> >                                       const MemMapEntry *memmap, int socket,
> >                                       uint32_t *intc_phandles)
> > @@ -1334,10 +1362,12 @@ static void virt_machine_done(Notifier *notifier, void *data)
> >
> >   static void virt_machine_init(MachineState *machine)
> >   {
> > -    const MemMapEntry *memmap = virt_memmap;
> > +    MemMapEntry *memmap = virt_memmap;
> >       RISCVVirtState *s = RISCV_VIRT_MACHINE(machine);
> >       MemoryRegion *system_memory = get_system_memory();
> >       MemoryRegion *mask_rom = g_new(MemoryRegion, 1);
> > +    MemoryRegion *ram_below_4g, *ram_above_4g;
> > +    uint64_t ram_size_low, ram_size_high;
> >       char *soc_name;
> >       DeviceState *mmio_irqchip, *virtio_irqchip, *pcie_irqchip;
> >       int i, base_hartid, hart_count;
> > @@ -1448,6 +1478,17 @@ static void virt_machine_init(MachineState *machine)
> >           }
> >       }
> >
> > +    if (machine->ram_size > LOW_MEM) {
> > +        ram_size_high = machine->ram_size - LOW_MEM;
> > +        ram_size_low = LOW_MEM;
> > +    } else {
> > +        ram_size_high = 0;
> > +        ram_size_low = machine->ram_size;
> > +    }
> > +
> > +    memmap[VIRT_DRAM].size = ram_size_low;
> > +    memmap[VIRT_DRAM_HIGH].size = ram_size_high;
> > +
> >       if (riscv_is_32bit(&s->soc[0])) {
> >   #if HOST_LONG_BITS == 64
> >           /* limit RAM size in a 32-bit system */
> > @@ -1460,7 +1501,8 @@ static void virt_machine_init(MachineState *machine)
> >           virt_high_pcie_memmap.size = VIRT32_HIGH_PCIE_MMIO_SIZE;
> >       } else {
> >           virt_high_pcie_memmap.size = VIRT64_HIGH_PCIE_MMIO_SIZE;
> > -        virt_high_pcie_memmap.base = memmap[VIRT_DRAM].base + machine->ram_size;
> > +        virt_high_pcie_memmap.base = memmap[VIRT_DRAM_HIGH].base +
> > +                                     memmap[VIRT_DRAM_HIGH].size;
> >           virt_high_pcie_memmap.base =
> >               ROUND_UP(virt_high_pcie_memmap.base, virt_high_pcie_memmap.size);
> >       }
> > @@ -1468,8 +1510,22 @@ static void virt_machine_init(MachineState *machine)
> >       s->memmap = virt_memmap;
> >
> >       /* register system main memory (actual RAM) */
> > +    ram_below_4g = g_malloc(sizeof(*ram_below_4g));
> > +    memory_region_init_alias(ram_below_4g, NULL, "ram-below-4g", machine->ram,
> > +                             0, memmap[VIRT_DRAM].size);
> >       memory_region_add_subregion(system_memory, memmap[VIRT_DRAM].base,
> > -        machine->ram);
> > +                                ram_below_4g);
> > +
> > +    if (memmap[VIRT_DRAM_HIGH].size) {
> > +        ram_above_4g = g_malloc(sizeof(*ram_above_4g));
> > +        memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",
> > +                                 machine->ram,
> > +                                 memmap[VIRT_DRAM].size,
> > +                                 memmap[VIRT_DRAM_HIGH].size);
> > +        memory_region_add_subregion(system_memory,
> > +                                    memmap[VIRT_DRAM_HIGH].base,
> > +                                    ram_above_4g);
> > +    }
> >
> >       /* boot rom */
> >       memory_region_init_rom(mask_rom, NULL, "riscv_virt_board.mrom",
> > diff --git a/include/hw/riscv/virt.h b/include/hw/riscv/virt.h
> > index e5c474b26e..36004eb6ef 100644
> > --- a/include/hw/riscv/virt.h
> > +++ b/include/hw/riscv/virt.h
> > @@ -79,6 +79,7 @@ enum {
> >       VIRT_IMSIC_S,
> >       VIRT_FLASH,
> >       VIRT_DRAM,
> > +    VIRT_DRAM_HIGH,
> >       VIRT_PCIE_MMIO,
> >       VIRT_PCIE_PIO,
> >       VIRT_PLATFORM_BUS,
>
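For what it's worth, the bank-splitting arithmetic in the patch's create_fdt_socket_memory() can be exercised in isolation. This is a hypothetical standalone model (the names and the MemRange struct are mine, not QEMU's), using the patch's 1 GiB low bank at 0x80000000 and high bank at 0x100000000:

```c
#include <assert.h>
#include <stdint.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

typedef struct {
    uint64_t addr;
    uint64_t size;
} MemRange;

/* Mirror of the patched per-socket logic: a socket whose RAM starts
 * inside the low bank gets a low range, plus a high range if it spills
 * past the end of the low bank; a socket starting past the low bank
 * lands entirely in the high bank. Returns the number of ranges. */
static int socket_mem_ranges(uint64_t dram_base, uint64_t low_size,
                             uint64_t high_base, uint64_t sock_offset,
                             uint64_t sock_size, MemRange out[2])
{
    int n = 0;

    if (sock_offset < low_size) {
        uint64_t low_end = dram_base + low_size;
        uint64_t addr = dram_base + sock_offset;
        uint64_t size = MIN(low_end - addr, sock_size);

        out[n++] = (MemRange){ addr, size };
        if (sock_size > size) {
            out[n++] = (MemRange){ high_base, sock_size - size };
        }
    } else {
        out[n++] = (MemRange){ high_base + sock_offset - low_size,
                               sock_size };
    }
    return n;
}
```

A 2 GiB socket at offset 0 splits into 1 GiB at 0x80000000 plus 1 GiB at 0x100000000, and a second 1 GiB socket at offset 2 GiB lands whole at 0x140000000, matching the /memory@... nodes the fdt code would emit.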
Anup Patel Sept. 7, 2023, 3:57 p.m. UTC | #12
On Thu, Sep 7, 2023 at 8:48 AM Alistair Francis <alistair23@gmail.com> wrote:
>
> On Thu, Aug 3, 2023 at 10:47 AM Wu, Fei <fei2.wu@intel.com> wrote:
> >
> > On 8/1/2023 6:46 AM, Daniel Henrique Barboza wrote:
> > > [...]
> > >
> > Hi Daniel,
> >
> > Do you know who we should talk to? I think it's not uncommon to have
> > separate multi-range memory; the stack should support this configuration.
>
> @Palmer Dabbelt @Anup Patel any ideas?
>

Sparse memory and multiple memory banks are a reality and are seen
on many platforms.

Based on my experience:
* Linux RISC-V already supports SPARSEMEM and multiple
  RAM banks.
* OpenSBI also handles multiple RAM banks nicely.
* U-Boot requires a Kconfig option to be set; other than
   that, U-Boot handles it fine.

Regards,
Anup
Wu, Fei Sept. 8, 2023, 12:16 a.m. UTC | #13
On 9/7/2023 11:46 PM, Anup Patel wrote:
> On Tue, Aug 1, 2023 at 4:16 AM Daniel Henrique Barboza
> <dbarboza@ventanamicro.com> wrote:
>> [...]
> 
> Can you try again after setting CONFIG_NR_DRAM_BANKS=2 in
> qemu-riscv64_smode_defconfig and qemu-riscv64_spl_defconfig ?
> 
Yes, I made a U-Boot patch to change this setting and also to use
fdtdec_setup_mem_size_base_lowest() instead of fdtdec_setup_mem_size_base()
in dram_init(); the latter change is also necessary. The patch has been posted
to the U-Boot mailing list but has received no reply yet:
    https://lists.denx.de/pipermail/u-boot/2023-September/529729.html

Thanks,
Fei.
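For reference, the defconfig side of that U-Boot change is the one-liner Anup suggested (shown as a sketch; the dram_init() change lives in the posted patch itself):

```
# in qemu-riscv64_smode_defconfig and qemu-riscv64_spl_defconfig
CONFIG_NR_DRAM_BANKS=2
```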

> Regards,
> Anup
> 
>> [...]

Patch

diff --git a/hw/riscv/virt.c b/hw/riscv/virt.c
index d90286dc46..8fbdc7220c 100644
--- a/hw/riscv/virt.c
+++ b/hw/riscv/virt.c
@@ -75,7 +75,9 @@ 
 #error "Can't accomodate all IMSIC groups in address space"
 #endif
 
-static const MemMapEntry virt_memmap[] = {
+#define LOW_MEM (1 * GiB)
+
+static MemMapEntry virt_memmap[] = {
     [VIRT_DEBUG] =        {        0x0,         0x100 },
     [VIRT_MROM] =         {     0x1000,        0xf000 },
     [VIRT_TEST] =         {   0x100000,        0x1000 },
@@ -96,6 +98,7 @@  static const MemMapEntry virt_memmap[] = {
     [VIRT_PCIE_ECAM] =    { 0x30000000,    0x10000000 },
     [VIRT_PCIE_MMIO] =    { 0x40000000,    0x40000000 },
     [VIRT_DRAM] =         { 0x80000000,           0x0 },
+    [VIRT_DRAM_HIGH] =    { 0x100000000,          0x0 },
 };
 
 /* PCIe high mmio is fixed for RV32 */
@@ -295,15 +298,12 @@  static void create_fdt_socket_cpus(RISCVVirtState *s, int socket,
     }
 }
 
-static void create_fdt_socket_memory(RISCVVirtState *s,
-                                     const MemMapEntry *memmap, int socket)
+static void create_fdt_socket_mem_range(RISCVVirtState *s, uint64_t addr,
+                                        uint64_t size, int socket)
 {
     char *mem_name;
-    uint64_t addr, size;
     MachineState *ms = MACHINE(s);
 
-    addr = memmap[VIRT_DRAM].base + riscv_socket_mem_offset(ms, socket);
-    size = riscv_socket_mem_size(ms, socket);
     mem_name = g_strdup_printf("/memory@%lx", (long)addr);
     qemu_fdt_add_subnode(ms->fdt, mem_name);
     qemu_fdt_setprop_cells(ms->fdt, mem_name, "reg",
@@ -313,6 +313,34 @@  static void create_fdt_socket_memory(RISCVVirtState *s,
     g_free(mem_name);
 }
 
+static void create_fdt_socket_memory(RISCVVirtState *s,
+                                     const MemMapEntry *memmap, int socket)
+{
+    uint64_t addr, size;
+    MachineState *mc = MACHINE(s);
+    uint64_t sock_offset = riscv_socket_mem_offset(mc, socket);
+    uint64_t sock_size = riscv_socket_mem_size(mc, socket);
+
+    if (sock_offset < memmap[VIRT_DRAM].size) {
+        uint64_t low_mem_end = memmap[VIRT_DRAM].base + memmap[VIRT_DRAM].size;
+
+        addr = memmap[VIRT_DRAM].base + sock_offset;
+        size = MIN(low_mem_end - addr, sock_size);
+        create_fdt_socket_mem_range(s, addr, size, socket);
+
+        size = sock_size - size;
+        if (size > 0) {
+            create_fdt_socket_mem_range(s, memmap[VIRT_DRAM_HIGH].base,
+                                        size, socket);
+        }
+    } else {
+        addr = memmap[VIRT_DRAM_HIGH].base +
+               sock_offset - memmap[VIRT_DRAM].size;
+
+        create_fdt_socket_mem_range(s, addr, sock_size, socket);
+    }
+}
+
 static void create_fdt_socket_clint(RISCVVirtState *s,
                                     const MemMapEntry *memmap, int socket,
                                     uint32_t *intc_phandles)
@@ -1334,10 +1362,12 @@  static void virt_machine_done(Notifier *notifier, void *data)
 
 static void virt_machine_init(MachineState *machine)
 {
-    const MemMapEntry *memmap = virt_memmap;
+    MemMapEntry *memmap = virt_memmap;
     RISCVVirtState *s = RISCV_VIRT_MACHINE(machine);
     MemoryRegion *system_memory = get_system_memory();
     MemoryRegion *mask_rom = g_new(MemoryRegion, 1);
+    MemoryRegion *ram_below_4g, *ram_above_4g;
+    uint64_t ram_size_low, ram_size_high;
     char *soc_name;
     DeviceState *mmio_irqchip, *virtio_irqchip, *pcie_irqchip;
     int i, base_hartid, hart_count;
@@ -1448,6 +1478,17 @@ static void virt_machine_init(MachineState *machine)
         }
     }
 
+    if (machine->ram_size > LOW_MEM) {
+        ram_size_high = machine->ram_size - LOW_MEM;
+        ram_size_low = LOW_MEM;
+    } else {
+        ram_size_high = 0;
+        ram_size_low = machine->ram_size;
+    }
+
+    memmap[VIRT_DRAM].size = ram_size_low;
+    memmap[VIRT_DRAM_HIGH].size = ram_size_high;
+
     if (riscv_is_32bit(&s->soc[0])) {
 #if HOST_LONG_BITS == 64
         /* limit RAM size in a 32-bit system */
@@ -1460,7 +1501,8 @@ static void virt_machine_init(MachineState *machine)
         virt_high_pcie_memmap.size = VIRT32_HIGH_PCIE_MMIO_SIZE;
     } else {
         virt_high_pcie_memmap.size = VIRT64_HIGH_PCIE_MMIO_SIZE;
-        virt_high_pcie_memmap.base = memmap[VIRT_DRAM].base + machine->ram_size;
+        virt_high_pcie_memmap.base = memmap[VIRT_DRAM_HIGH].base +
+                                     memmap[VIRT_DRAM_HIGH].size;
         virt_high_pcie_memmap.base =
             ROUND_UP(virt_high_pcie_memmap.base, virt_high_pcie_memmap.size);
     }
@@ -1468,8 +1510,22 @@ static void virt_machine_init(MachineState *machine)
     s->memmap = virt_memmap;
 
     /* register system main memory (actual RAM) */
+    ram_below_4g = g_malloc(sizeof(*ram_below_4g));
+    memory_region_init_alias(ram_below_4g, NULL, "ram-below-4g", machine->ram,
+                             0, memmap[VIRT_DRAM].size);
     memory_region_add_subregion(system_memory, memmap[VIRT_DRAM].base,
-        machine->ram);
+                                ram_below_4g);
+
+    if (memmap[VIRT_DRAM_HIGH].size) {
+        ram_above_4g = g_malloc(sizeof(*ram_above_4g));
+        memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",
+                                 machine->ram,
+                                 memmap[VIRT_DRAM].size,
+                                 memmap[VIRT_DRAM_HIGH].size);
+        memory_region_add_subregion(system_memory,
+                                    memmap[VIRT_DRAM_HIGH].base,
+                                    ram_above_4g);
+    }
 
     /* boot rom */
     memory_region_init_rom(mask_rom, NULL, "riscv_virt_board.mrom",
diff --git a/include/hw/riscv/virt.h b/include/hw/riscv/virt.h
index e5c474b26e..36004eb6ef 100644
--- a/include/hw/riscv/virt.h
+++ b/include/hw/riscv/virt.h
@@ -79,6 +79,7 @@ enum {
     VIRT_IMSIC_S,
     VIRT_FLASH,
     VIRT_DRAM,
+    VIRT_DRAM_HIGH,
     VIRT_PCIE_MMIO,
     VIRT_PCIE_PIO,
     VIRT_PLATFORM_BUS,