
[v5,4/5] i386/pc: relocate 4g start to 1T where applicable

Message ID 20220520104532.9816-5-joao.m.martins@oracle.com (mailing list archive)
State New, archived
Series i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU

Commit Message

Joao Martins May 20, 2022, 10:45 a.m. UTC
It is assumed that the whole GPA space is available to be DMA
addressable, within a given address space limit, except for a
tiny region before the 4G. Since Linux v5.4, VFIO validates
whether the selected GPA is indeed valid, i.e. not reserved by
the IOMMU on behalf of some specific devices or platform-defined
restrictions, failing the ioctl(VFIO_DMA_MAP) with -EINVAL
otherwise.

AMD systems with an IOMMU are examples of such platforms and
particularly may only have these ranges as allowed:

	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb[*])

We already account for the 4G hole, but if the guest is big
enough we will fail to allocate a guest with >1010G due to the
~12G hole at the 1Tb boundary, reserved for HyperTransport (HT):
the HT hole starts at GPA 0xfd00000000 (1012G), and once the
32-bit hole below 4G is accounted for, guest RAM runs into it at
roughly 1010G.

[*] there is another reserved region unrelated to HT that exists
at the 256T boundary in Fam 17h according to Errata #1286,
documented also in "Open-Source Register Reference for AMD Family
17h Processors (PUB)"

When creating the region above 4G, take into account that on AMD
platforms the HyperTransport range is reserved and hence it
cannot be used as GPAs either. In those cases, rather than
establishing the start of ram-above-4g at 4G, relocate it instead
to 1Tb. See AMD IOMMU spec, section 2.1.2 "IOMMU Logical
Topology", for more information on the underlying restriction of
IOVAs.

After accounting for the 1Tb hole on AMD hosts, mtree should
look like:

0000000000000000-000000007fffffff (prio 0, i/o):
	 alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff
0000010000000000-000001ff7fffffff (prio 0, i/o):
	alias ram-above-4g @pc.ram 0000000080000000-000000ffffffffff

If the relocation is done, we also add the HT range to the
e820 map as reserved.

Default phys-bits in QEMU is TCG_PHYS_ADDR_BITS (40), which is enough
to address 1Tb (0xff ffff ffff). On AMD platforms, if a
ram-above-4g relocation would be needed but the CPU wasn't configured
with big enough phys-bits, print an error message to the user
and do not relocate the above-4g region.

Suggested-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/pc.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 111 insertions(+)

Comments

Igor Mammedov June 16, 2022, 2:23 p.m. UTC | #1
On Fri, 20 May 2022 11:45:31 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> It is assumed that the whole GPA space is available to be DMA
> addressable, within a given address space limit, expect for a
                                                   ^^^ typo?

> tiny region before the 4G. Since Linux v5.4, VFIO validates
> whether the selected GPA is indeed valid i.e. not reserved by
> IOMMU on behalf of some specific devices or platform-defined
> restrictions, and thus failing the ioctl(VFIO_DMA_MAP) with
>  -EINVAL.
> 
> AMD systems with an IOMMU are examples of such platforms and
> particularly may only have these ranges as allowed:
> 
> 	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
> 	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
> 	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb[*])
> 
> We already account for the 4G hole, albeit if the guest is big
> enough we will fail to allocate a guest with  >1010G due to the
> ~12G hole at the 1Tb boundary, reserved for HyperTransport (HT).
> 
> [*] there is another reserved region unrelated to HT that exists
> in the 256T boundaru in Fam 17h according to Errata #1286,
              ^ ditto

> documeted also in "Open-Source Register Reference for AMD Family
> 17h Processors (PUB)"
> 
> When creating the region above 4G, take into account that on AMD
> platforms the HyperTransport range is reserved and hence it
> cannot be used either as GPAs. On those cases rather than
> establishing the start of ram-above-4g to be 4G, relocate instead
> to 1Tb. See AMD IOMMU spec, section 2.1.2 "IOMMU Logical
> Topology", for more information on the underlying restriction of
> IOVAs.
> 
> After accounting for the 1Tb hole on AMD hosts, mtree should
> look like:
> 
> 0000000000000000-000000007fffffff (prio 0, i/o):
> 	 alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff
> 0000010000000000-000001ff7fffffff (prio 0, i/o):
> 	alias ram-above-4g @pc.ram 0000000080000000-000000ffffffffff
> 
> If the relocation is done, we also add the the reserved HT
> e820 range as reserved.
> 
> Default phys-bits on Qemu is TCG_PHYS_ADDR_BITS (40) which is enough
> to address 1Tb (0xff ffff ffff). On AMD platforms, if a
> ram-above-4g relocation may be desired and the CPU wasn't configured
> with a big enough phys-bits, print an error message to the user
> and do not make the relocation of the above-4g-region if phys-bits
> is too low.
> 
> Suggested-by: Igor Mammedov <imammedo@redhat.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  hw/i386/pc.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 111 insertions(+)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index af52d4ff89ef..652ae8ff9ccf 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -796,6 +796,110 @@ void xen_load_linux(PCMachineState *pcms)
>  #define PC_ROM_ALIGN       0x800
>  #define PC_ROM_SIZE        (PC_ROM_MAX - PC_ROM_MIN_VGA)
>  
> +/*
> + * AMD systems with an IOMMU have an additional hole close to the
> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
> + * on kernel version, VFIO may or may not let you DMA map those ranges.
> + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
> + * with certain memory sizes. It's also wrong to use those IOVA ranges
> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
> + * The ranges reserved for Hyper-Transport are:
> + *
> + * FD_0000_0000h - FF_FFFF_FFFFh
> + *
> + * The ranges represent the following:
> + *
> + * Base Address   Top Address  Use
> + *
> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
> + * FD_F910_0000h FD_F91F_FFFFh System Management
> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
> + *
> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
> + * Table 3: Special Address Controls (GPA) for more information.
> + */
> +#define AMD_HT_START         0xfd00000000UL
> +#define AMD_HT_END           0xffffffffffUL
> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
> +#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
> +
> +static hwaddr x86_max_phys_addr(PCMachineState *pcms,

s/x86_max_phys_addr/pc_max_used_gpa/

> +                                hwaddr above_4g_mem_start,
> +                                uint64_t pci_hole64_size)
> +{
> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> +    MachineState *machine = MACHINE(pcms);
> +    ram_addr_t device_mem_size = 0;
> +    hwaddr base;
> +
> +    if (!x86ms->above_4g_mem_size) {
> +       /*
> +        * 32-bit pci hole goes from
> +        * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
> +        */
> +        return IO_APIC_DEFAULT_ADDRESS - 1;

lack of above_4g_mem doesn't mean absence of device_mem_size or anything else
that's located above it.

> +    }
> +
> +    if (pcmc->has_reserved_memory &&
> +       (machine->ram_size < machine->maxram_size)) {
> +        device_mem_size = machine->maxram_size - machine->ram_size;
> +    }
> +
> +    base = ROUND_UP(above_4g_mem_start + x86ms->above_4g_mem_size +
> +                    pcms->sgx_epc.size, 1 * GiB);
> +
> +    return base + device_mem_size + pci_hole64_size;

it's not guaranteed that the pci64 hole starts right after device_mem,
but you are not the 1st to make this assumption in the code, so maybe instead of
all the above use the existing
   pc_pci_hole64_start() + pci_hole64_size
to guesstimate the max address

> +}
> +
> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> +                                          uint64_t pci_hole64_size)

s/x86_update_above_4g_mem_start/pc_set_amd_above_4g_mem_start/

> +{
> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> +    CPUX86State *env = &X86_CPU(first_cpu)->env;
> +    hwaddr start = x86ms->above_4g_mem_start;
> +    hwaddr maxphysaddr, maxusedaddr;


> +    /*
> +     * The HyperTransport range close to the 1T boundary is unique to AMD
> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
> +     * to above 1T to AMD vCPUs only.
> +     */
> +    if (!IS_AMD_CPU(env)) {
> +        return;
> +    }

move this to caller

> +    /* Bail out if max possible address does not cross HT range */
> +    if (x86_max_phys_addr(pcms, start, pci_hole64_size) < AMD_HT_START) {
> +        return;
> +    }
> +
> +    /*
> +     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
> +     * So make sure phys-bits is required to be appropriately sized in order
> +     * to proceed with the above-4g-region relocation and thus boot.
> +     */
> +    start = AMD_ABOVE_1TB_START;
> +    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
> +    maxusedaddr = x86_max_phys_addr(pcms, start, pci_hole64_size);
> +    if (maxphysaddr < maxusedaddr) {
> +        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
> +                     " phys-bits too low (%u) cannot avoid AMD HT range",
> +                     maxphysaddr, maxusedaddr, X86_CPU(first_cpu)->phys_bits);
> +        exit(EXIT_FAILURE);
> +    }
> +
> +
> +    x86ms->above_4g_mem_start = start;
> +}
> +
>  void pc_memory_init(PCMachineState *pcms,
>                      MemoryRegion *system_memory,
>                      MemoryRegion *rom_memory,
> @@ -817,6 +921,8 @@ void pc_memory_init(PCMachineState *pcms,
>  
>      linux_boot = (machine->kernel_filename != NULL);
>  
> +    x86_update_above_4g_mem_start(pcms, pci_hole64_size);
> +
>      /*
>       * Split single memory region and use aliases to address portions of it,
>       * done for backwards compatibility with older qemus.
> @@ -827,6 +933,11 @@ void pc_memory_init(PCMachineState *pcms,
>                               0, x86ms->below_4g_mem_size);
>      memory_region_add_subregion(system_memory, 0, ram_below_4g);
>      e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
> +
> +    if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START) {
> +        e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
> +    }
probably it is not necessary, but it doesn't hurt

>      if (x86ms->above_4g_mem_size > 0) {
>          ram_above_4g = g_malloc(sizeof(*ram_above_4g));
>          memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",
Joao Martins June 17, 2022, 12:18 p.m. UTC | #2
On 6/16/22 15:23, Igor Mammedov wrote:
> On Fri, 20 May 2022 11:45:31 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> It is assumed that the whole GPA space is available to be DMA
>> addressable, within a given address space limit, expect for a
>                                                    ^^^ typo?
> 
Yes, it should have been 'except'.

>> tiny region before the 4G. Since Linux v5.4, VFIO validates
>> whether the selected GPA is indeed valid i.e. not reserved by
>> IOMMU on behalf of some specific devices or platform-defined
>> restrictions, and thus failing the ioctl(VFIO_DMA_MAP) with
>>  -EINVAL.
>>
>> AMD systems with an IOMMU are examples of such platforms and
>> particularly may only have these ranges as allowed:
>>
>> 	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
>> 	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
>> 	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb[*])
>>
>> We already account for the 4G hole, albeit if the guest is big
>> enough we will fail to allocate a guest with  >1010G due to the
>> ~12G hole at the 1Tb boundary, reserved for HyperTransport (HT).
>>
>> [*] there is another reserved region unrelated to HT that exists
>> in the 256T boundaru in Fam 17h according to Errata #1286,
>               ^ ditto
> 
Fixed.

>> documeted also in "Open-Source Register Reference for AMD Family
>> 17h Processors (PUB)"
>>
>> When creating the region above 4G, take into account that on AMD
>> platforms the HyperTransport range is reserved and hence it
>> cannot be used either as GPAs. On those cases rather than
>> establishing the start of ram-above-4g to be 4G, relocate instead
>> to 1Tb. See AMD IOMMU spec, section 2.1.2 "IOMMU Logical
>> Topology", for more information on the underlying restriction of
>> IOVAs.
>>
>> After accounting for the 1Tb hole on AMD hosts, mtree should
>> look like:
>>
>> 0000000000000000-000000007fffffff (prio 0, i/o):
>> 	 alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff
>> 0000010000000000-000001ff7fffffff (prio 0, i/o):
>> 	alias ram-above-4g @pc.ram 0000000080000000-000000ffffffffff
>>
>> If the relocation is done, we also add the the reserved HT
>> e820 range as reserved.
>>
>> Default phys-bits on Qemu is TCG_PHYS_ADDR_BITS (40) which is enough
>> to address 1Tb (0xff ffff ffff). On AMD platforms, if a
>> ram-above-4g relocation may be desired and the CPU wasn't configured
>> with a big enough phys-bits, print an error message to the user
>> and do not make the relocation of the above-4g-region if phys-bits
>> is too low.
>>
>> Suggested-by: Igor Mammedov <imammedo@redhat.com>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  hw/i386/pc.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 111 insertions(+)
>>
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index af52d4ff89ef..652ae8ff9ccf 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -796,6 +796,110 @@ void xen_load_linux(PCMachineState *pcms)
>>  #define PC_ROM_ALIGN       0x800
>>  #define PC_ROM_SIZE        (PC_ROM_MAX - PC_ROM_MIN_VGA)
>>  
>> +/*
>> + * AMD systems with an IOMMU have an additional hole close to the
>> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
>> + * on kernel version, VFIO may or may not let you DMA map those ranges.
>> + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
>> + * with certain memory sizes. It's also wrong to use those IOVA ranges
>> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
>> + * The ranges reserved for Hyper-Transport are:
>> + *
>> + * FD_0000_0000h - FF_FFFF_FFFFh
>> + *
>> + * The ranges represent the following:
>> + *
>> + * Base Address   Top Address  Use
>> + *
>> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
>> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
>> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
>> + * FD_F910_0000h FD_F91F_FFFFh System Management
>> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
>> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
>> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
>> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
>> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
>> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
>> + *
>> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
>> + * Table 3: Special Address Controls (GPA) for more information.
>> + */
>> +#define AMD_HT_START         0xfd00000000UL
>> +#define AMD_HT_END           0xffffffffffUL
>> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
>> +#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
>> +
>> +static hwaddr x86_max_phys_addr(PCMachineState *pcms,
> 
> s/x86_max_phys_addr/pc_max_used_gpa/
> 
Fixed.

>> +                                hwaddr above_4g_mem_start,
>> +                                uint64_t pci_hole64_size)
>> +{
>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>> +    MachineState *machine = MACHINE(pcms);
>> +    ram_addr_t device_mem_size = 0;
>> +    hwaddr base;
>> +
>> +    if (!x86ms->above_4g_mem_size) {
>> +       /*
>> +        * 32-bit pci hole goes from
>> +        * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
>> +        */
>> +        return IO_APIC_DEFAULT_ADDRESS - 1;
> 
> lack of above_4g_mem, doesn't mean absence of device_mem_size or anything else
> that's located above it.
> 

True. But the intent is to fix 32-bit boundaries as one of the qtests was failing
otherwise. We won't hit the 1T hole, hence a nop. Unless we plan on using
pc_max_used_gpa() for something else other than this.

The alternative would be to just bail out early of pc_set_amd_above_4g_mem_start() if
!above_4g_mem_size. And I guess in that case we can just remove pc_max_used_gpa()
and replace it with:

	max_used_gpa = pc_pci_hole64_start() + pci_hole64_size

Which makes this even simpler. Thoughts?
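
Roughly (untested sketch, just to illustrate; it assumes the early bail-out
above, so none of the device_mem/SGX accounting is needed anymore):

static hwaddr pc_max_used_gpa(uint64_t pci_hole64_size)
{
    /* the 64-bit PCI hole is the last thing laid out above 4G */
    return pc_pci_hole64_start() + pci_hole64_size;
}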

>> +    }
>> +
>> +    if (pcmc->has_reserved_memory &&
>> +       (machine->ram_size < machine->maxram_size)) {
>> +        device_mem_size = machine->maxram_size - machine->ram_size;
>> +    }
>> +
>> +    base = ROUND_UP(above_4g_mem_start + x86ms->above_4g_mem_size +
>> +                    pcms->sgx_epc.size, 1 * GiB);
>> +
>> +    return base + device_mem_size + pci_hole64_size;
> 
> it's not guarantied that pci64 hole starts right away device_mem,
> but you are not 1st doing this assumption in code, maybe instead of
> all above use existing 
>    pc_pci_hole64_start() + pci_hole64_size
> to gestimate max address 
> 
I've switched the block above to that instead.

>> +}
>> +
>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
>> +                                          uint64_t pci_hole64_size)
> 
> s/x86_update_above_4g_mem_start/pc_set_amd_above_4g_mem_start/
> 
Fixed.

>> +{
>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>> +    CPUX86State *env = &X86_CPU(first_cpu)->env;
>> +    hwaddr start = x86ms->above_4g_mem_start;
>> +    hwaddr maxphysaddr, maxusedaddr;
> 
> 
>> +    /*
>> +     * The HyperTransport range close to the 1T boundary is unique to AMD
>> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
>> +     * to above 1T to AMD vCPUs only.
>> +     */
>> +    if (!IS_AMD_CPU(env)) {
>> +        return;
>> +    }
> 
> move this to caller
> 
Done (same for the patch after this one):

-    x86_update_above_4g_mem_start(pcms, pci_hole64_size);
+    /*
+     * The HyperTransport range close to the 1T boundary is unique to AMD
+     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
+     * to above 1T to AMD vCPUs only.
+     */
+    if (IS_AMD_CPU(env)) {
+        pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
+    }


>> +    /* Bail out if max possible address does not cross HT range */
>> +    if (x86_max_phys_addr(pcms, start, pci_hole64_size) < AMD_HT_START) {
>> +        return;
>> +    }
>> +
>> +    /*
>> +     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
>> +     * So make sure phys-bits is required to be appropriately sized in order
>> +     * to proceed with the above-4g-region relocation and thus boot.
>> +     */
>> +    start = AMD_ABOVE_1TB_START;
>> +    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
>> +    maxusedaddr = x86_max_phys_addr(pcms, start, pci_hole64_size);
>> +    if (maxphysaddr < maxusedaddr) {
>> +        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
>> +                     " phys-bits too low (%u) cannot avoid AMD HT range",
>> +                     maxphysaddr, maxusedaddr, X86_CPU(first_cpu)->phys_bits);
>> +        exit(EXIT_FAILURE);
>> +    }
>> +
>> +
>> +    x86ms->above_4g_mem_start = start;
>> +}
>> +
>>  void pc_memory_init(PCMachineState *pcms,
>>                      MemoryRegion *system_memory,
>>                      MemoryRegion *rom_memory,
>> @@ -817,6 +921,8 @@ void pc_memory_init(PCMachineState *pcms,
>>  
>>      linux_boot = (machine->kernel_filename != NULL);
>>  
>> +    x86_update_above_4g_mem_start(pcms, pci_hole64_size);
>> +
>>      /*
>>       * Split single memory region and use aliases to address portions of it,
>>       * done for backwards compatibility with older qemus.
>> @@ -827,6 +933,11 @@ void pc_memory_init(PCMachineState *pcms,
>>                               0, x86ms->below_4g_mem_size);
>>      memory_region_add_subregion(system_memory, 0, ram_below_4g);
>>      e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
>> +
>> +    if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START) {
>> +        e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
>> +    }
> probably it is not necessary, but it doesn't hurt
> 

virtual firmware can make better decisions to avoid reserved ranges.

I was actually thinking that if phys_bits was >= 40 we would
add it anyway.
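
i.e. rather than keying it off the relocation, something like (rough idea
only, not what this patch currently does):

    if (IS_AMD_CPU(env) && X86_CPU(first_cpu)->phys_bits >= 40) {
        e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
    }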

>>      if (x86ms->above_4g_mem_size > 0) {
>>          ram_above_4g = g_malloc(sizeof(*ram_above_4g));
>>          memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",
>
Igor Mammedov June 17, 2022, 12:32 p.m. UTC | #3
On Fri, 17 Jun 2022 13:18:38 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 6/16/22 15:23, Igor Mammedov wrote:
> > On Fri, 20 May 2022 11:45:31 +0100
> > Joao Martins <joao.m.martins@oracle.com> wrote:
> >   
> >> It is assumed that the whole GPA space is available to be DMA
> >> addressable, within a given address space limit, expect for a  
> >                                                    ^^^ typo?
> >   
> Yes, it should have been 'except'.
> 
> >> tiny region before the 4G. Since Linux v5.4, VFIO validates
> >> whether the selected GPA is indeed valid i.e. not reserved by
> >> IOMMU on behalf of some specific devices or platform-defined
> >> restrictions, and thus failing the ioctl(VFIO_DMA_MAP) with
> >>  -EINVAL.
> >>
> >> AMD systems with an IOMMU are examples of such platforms and
> >> particularly may only have these ranges as allowed:
> >>
> >> 	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
> >> 	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
> >> 	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb[*])
> >>
> >> We already account for the 4G hole, albeit if the guest is big
> >> enough we will fail to allocate a guest with  >1010G due to the
> >> ~12G hole at the 1Tb boundary, reserved for HyperTransport (HT).
> >>
> >> [*] there is another reserved region unrelated to HT that exists
> >> in the 256T boundaru in Fam 17h according to Errata #1286,  
> >               ^ ditto
> >   
> Fixed.
> 
> >> documeted also in "Open-Source Register Reference for AMD Family
> >> 17h Processors (PUB)"
> >>
> >> When creating the region above 4G, take into account that on AMD
> >> platforms the HyperTransport range is reserved and hence it
> >> cannot be used either as GPAs. On those cases rather than
> >> establishing the start of ram-above-4g to be 4G, relocate instead
> >> to 1Tb. See AMD IOMMU spec, section 2.1.2 "IOMMU Logical
> >> Topology", for more information on the underlying restriction of
> >> IOVAs.
> >>
> >> After accounting for the 1Tb hole on AMD hosts, mtree should
> >> look like:
> >>
> >> 0000000000000000-000000007fffffff (prio 0, i/o):
> >> 	 alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff
> >> 0000010000000000-000001ff7fffffff (prio 0, i/o):
> >> 	alias ram-above-4g @pc.ram 0000000080000000-000000ffffffffff
> >>
> >> If the relocation is done, we also add the the reserved HT
> >> e820 range as reserved.
> >>
> >> Default phys-bits on Qemu is TCG_PHYS_ADDR_BITS (40) which is enough
> >> to address 1Tb (0xff ffff ffff). On AMD platforms, if a
> >> ram-above-4g relocation may be desired and the CPU wasn't configured
> >> with a big enough phys-bits, print an error message to the user
> >> and do not make the relocation of the above-4g-region if phys-bits
> >> is too low.
> >>
> >> Suggested-by: Igor Mammedov <imammedo@redhat.com>
> >> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> >> ---
> >>  hw/i386/pc.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 111 insertions(+)
> >>
> >> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> >> index af52d4ff89ef..652ae8ff9ccf 100644
> >> --- a/hw/i386/pc.c
> >> +++ b/hw/i386/pc.c
> >> @@ -796,6 +796,110 @@ void xen_load_linux(PCMachineState *pcms)
> >>  #define PC_ROM_ALIGN       0x800
> >>  #define PC_ROM_SIZE        (PC_ROM_MAX - PC_ROM_MIN_VGA)
> >>  
> >> +/*
> >> + * AMD systems with an IOMMU have an additional hole close to the
> >> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
> >> + * on kernel version, VFIO may or may not let you DMA map those ranges.
> >> + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
> >> + * with certain memory sizes. It's also wrong to use those IOVA ranges
> >> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
> >> + * The ranges reserved for Hyper-Transport are:
> >> + *
> >> + * FD_0000_0000h - FF_FFFF_FFFFh
> >> + *
> >> + * The ranges represent the following:
> >> + *
> >> + * Base Address   Top Address  Use
> >> + *
> >> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
> >> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
> >> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
> >> + * FD_F910_0000h FD_F91F_FFFFh System Management
> >> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
> >> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
> >> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
> >> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
> >> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
> >> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
> >> + *
> >> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
> >> + * Table 3: Special Address Controls (GPA) for more information.
> >> + */
> >> +#define AMD_HT_START         0xfd00000000UL
> >> +#define AMD_HT_END           0xffffffffffUL
> >> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
> >> +#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
> >> +
> >> +static hwaddr x86_max_phys_addr(PCMachineState *pcms,  
> > 
> > s/x86_max_phys_addr/pc_max_used_gpa/
> >   
> Fixed.
> 
> >> +                                hwaddr above_4g_mem_start,
> >> +                                uint64_t pci_hole64_size)
> >> +{
> >> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> >> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> >> +    MachineState *machine = MACHINE(pcms);
> >> +    ram_addr_t device_mem_size = 0;
> >> +    hwaddr base;
> >> +
> >> +    if (!x86ms->above_4g_mem_size) {
> >> +       /*
> >> +        * 32-bit pci hole goes from
> >> +        * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
> >> +        */
> >> +        return IO_APIC_DEFAULT_ADDRESS - 1;  
> > 
> > lack of above_4g_mem, doesn't mean absence of device_mem_size or anything else
> > that's located above it.
> >   
> 
> True. But the intent is to fix 32-bit boundaries as one of the qtests was failing
> otherwise. We won't hit the 1T hole, hence a nop.

I don't get the reasoning, can you clarify it pls?

>  Unless we plan on using
> pc_max_used_gpa() for something else other than this.

Even if '!above_4g_mem_size', we can still have a hotpluggable memory region
present and that can hit 1Tb. The same goes for pci64_hole if it's configured
large enough on the CLI.

Looks like the guesstimate we could use is taking pci64_hole_end as the max used GPA

> 
> The alternative would be to just early bail out of pc_set_amd_above_4g_mem_start() if
> !above_4g_mem_size. And I guess in that case we can just remove pc_max_used_gpa()
> and replace with a:
> 
> 	max_used_gpa = pc_pci_hole64_start() + pci_hole64_size
> 
> Which makes this even simpler. thoughts?
> 
> >> +    }
> >> +
> >> +    if (pcmc->has_reserved_memory &&
> >> +       (machine->ram_size < machine->maxram_size)) {
> >> +        device_mem_size = machine->maxram_size - machine->ram_size;
> >> +    }
> >> +
> >> +    base = ROUND_UP(above_4g_mem_start + x86ms->above_4g_mem_size +
> >> +                    pcms->sgx_epc.size, 1 * GiB);
> >> +
> >> +    return base + device_mem_size + pci_hole64_size;  
> > 
> > it's not guarantied that pci64 hole starts right away device_mem,
> > but you are not 1st doing this assumption in code, maybe instead of
> > all above use existing 
> >    pc_pci_hole64_start() + pci_hole64_size
> > to gestimate max address 
> >   
> I've switched the block above to that instead.
> 
> >> +}
> >> +
> >> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> >> +                                          uint64_t pci_hole64_size)  
> > 
> > s/x86_update_above_4g_mem_start/pc_set_amd_above_4g_mem_start/
> >   
> Fixed.
> 
> >> +{
> >> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> >> +    CPUX86State *env = &X86_CPU(first_cpu)->env;
> >> +    hwaddr start = x86ms->above_4g_mem_start;
> >> +    hwaddr maxphysaddr, maxusedaddr;  
> > 
> >   
> >> +    /*
> >> +     * The HyperTransport range close to the 1T boundary is unique to AMD
> >> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
> >> +     * to above 1T to AMD vCPUs only.
> >> +     */
> >> +    if (!IS_AMD_CPU(env)) {
> >> +        return;
> >> +    }  
> > 
> > move this to caller
> >   
> Done (same for the patch after this one):
> 
> -    x86_update_above_4g_mem_start(pcms, pci_hole64_size);
> +    /*
> +     * The HyperTransport range close to the 1T boundary is unique to AMD
> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
> +     * to above 1T to AMD vCPUs only.
> +     */
> +    if (IS_AMD_CPU(env)) {
> +        pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
> +    }
> 
> 
> >> +    /* Bail out if max possible address does not cross HT range */
> >> +    if (x86_max_phys_addr(pcms, start, pci_hole64_size) < AMD_HT_START) {
> >> +        return;
> >> +    }
> >> +
> >> +    /*
> >> +     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
> >> +     * So make sure phys-bits is required to be appropriately sized in order
> >> +     * to proceed with the above-4g-region relocation and thus boot.
> >> +     */
> >> +    start = AMD_ABOVE_1TB_START;
> >> +    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
> >> +    maxusedaddr = x86_max_phys_addr(pcms, start, pci_hole64_size);
> >> +    if (maxphysaddr < maxusedaddr) {
> >> +        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
> >> +                     " phys-bits too low (%u) cannot avoid AMD HT range",
> >> +                     maxphysaddr, maxusedaddr, X86_CPU(first_cpu)->phys_bits);
> >> +        exit(EXIT_FAILURE);
> >> +    }
> >> +
> >> +
> >> +    x86ms->above_4g_mem_start = start;
> >> +}
> >> +
> >>  void pc_memory_init(PCMachineState *pcms,
> >>                      MemoryRegion *system_memory,
> >>                      MemoryRegion *rom_memory,
> >> @@ -817,6 +921,8 @@ void pc_memory_init(PCMachineState *pcms,
> >>  
> >>      linux_boot = (machine->kernel_filename != NULL);
> >>  
> >> +    x86_update_above_4g_mem_start(pcms, pci_hole64_size);
> >> +
> >>      /*
> >>       * Split single memory region and use aliases to address portions of it,
> >>       * done for backwards compatibility with older qemus.
> >> @@ -827,6 +933,11 @@ void pc_memory_init(PCMachineState *pcms,
> >>                               0, x86ms->below_4g_mem_size);
> >>      memory_region_add_subregion(system_memory, 0, ram_below_4g);
> >>      e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
> >> +
> >> +    if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START) {
> >> +        e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
> >> +    }  
> > probably it is not necessary, but it doesn't hurt
> >   
> 
> virtual firmware can make better decisions to avoid reserved ranges.
> 
> I was actually thinking that if phys_bits was >= 40 that we would
> anyways add it.
> 
> >>      if (x86ms->above_4g_mem_size > 0) {
> >>          ram_above_4g = g_malloc(sizeof(*ram_above_4g));
> >>          memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",  
> >   
>
Joao Martins June 17, 2022, 1:33 p.m. UTC | #4
On 6/17/22 13:32, Igor Mammedov wrote:
> On Fri, 17 Jun 2022 13:18:38 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
>> On 6/16/22 15:23, Igor Mammedov wrote:
>>> On Fri, 20 May 2022 11:45:31 +0100
>>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>>> +                                hwaddr above_4g_mem_start,
>>>> +                                uint64_t pci_hole64_size)
>>>> +{
>>>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>>> +    MachineState *machine = MACHINE(pcms);
>>>> +    ram_addr_t device_mem_size = 0;
>>>> +    hwaddr base;
>>>> +
>>>> +    if (!x86ms->above_4g_mem_size) {
>>>> +       /*
>>>> +        * 32-bit pci hole goes from
>>>> +        * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
>>>> +        */
>>>> +        return IO_APIC_DEFAULT_ADDRESS - 1;  
>>>
>>> lack of above_4g_mem, doesn't mean absence of device_mem_size or anything else
>>> that's located above it.
>>>   
>>
>> True. But the intent is to fix 32-bit boundaries as one of the qtests was failing
>> otherwise. We won't hit the 1T hole, hence a nop.
> 
> I don't get the reasoning, can you clarify it pls?
> 

I was trying to say that what led me here was a couple of qtest failures (from v3->v4).

I was doing this before based on pci_hole64. phys-bits=32 was for example one
of the test failures, and pci-hole64 sits above what 32-bit can reference.

>>  Unless we plan on using
>> pc_max_used_gpa() for something else other than this.
> 
> Even if '!above_4g_mem_sizem', we can still have hotpluggable memory region
> present and that can  hit 1Tb. The same goes for pci64_hole if it's configured
> large enough on CLI.
> 
So hotpluggable memory seems to assume it sits above 4g mem.

pci_hole64 likewise as it uses similar computations as hotplug.

Unless I am misunderstanding something here.

> Looks like guesstimate we could use is taking pci64_hole_end as max used GPA
> 
I think this was what I had before (v3[0]) and did not work.

Let me revisit this edge case again.

[0] https://lore.kernel.org/all/20220223184455.9057-5-joao.m.martins@oracle.com/
Joao Martins June 17, 2022, 4:12 p.m. UTC | #5
On 6/17/22 13:18, Joao Martins wrote:
> On 6/16/22 15:23, Igor Mammedov wrote:
>> On Fri, 20 May 2022 11:45:31 +0100
>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>> +    }
>>> +
>>> +    if (pcmc->has_reserved_memory &&
>>> +       (machine->ram_size < machine->maxram_size)) {
>>> +        device_mem_size = machine->maxram_size - machine->ram_size;
>>> +    }
>>> +
>>> +    base = ROUND_UP(above_4g_mem_start + x86ms->above_4g_mem_size +
>>> +                    pcms->sgx_epc.size, 1 * GiB);
>>> +
>>> +    return base + device_mem_size + pci_hole64_size;
>>
>> it's not guarantied that pci64 hole starts right away device_mem,
>> but you are not 1st doing this assumption in code, maybe instead of
>> all above use existing 
>>    pc_pci_hole64_start() + pci_hole64_size
>> to gestimate max address 
>>
> I've switched the block above to that instead.
> 

I had done this, but on a second look (and confirmed with testing) this
will crash, because @device_memory isn't yet initialized at that point. And even without
hotplug, CXL might have had issues.

The problem is largely that pc_pci_hole64_start(), which the above check relies
on, uses info we only populate later on in pc_memory_init(), and I don't think I can
move this down to a later point as I definitely don't want to re-initialize
MRs or anything.

So we might be left with calculating it manually, as I was doing in this patch,
but maybe arranging some form of new helper that shares
logic with pc_pci_hole64_start().

  1114  uint64_t pc_pci_hole64_start(void)
  1115  {
  1116      PCMachineState *pcms = PC_MACHINE(qdev_get_machine());
  1117      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
  1118      MachineState *ms = MACHINE(pcms);
  1119      X86MachineState *x86ms = X86_MACHINE(pcms);
  1120      uint64_t hole64_start = 0;
  1121
  1122      if (pcms->cxl_devices_state.host_mr.addr) {
  1123          hole64_start = pcms->cxl_devices_state.host_mr.addr +
  1124              memory_region_size(&pcms->cxl_devices_state.host_mr);
  1125          if (pcms->cxl_devices_state.fixed_windows) {
  1126              GList *it;
  1127              for (it = pcms->cxl_devices_state.fixed_windows; it; it = it->next) {
  1128                  CXLFixedWindow *fw = it->data;
  1129                  hole64_start = fw->mr.addr + memory_region_size(&fw->mr);
  1130              }
  1131          }
* 1132      } else if (pcmc->has_reserved_memory && ms->device_memory->base) {
  1133          hole64_start = ms->device_memory->base;
  1134          if (!pcmc->broken_reserved_end) {
  1135              hole64_start += memory_region_size(&ms->device_memory->mr);
  1136          }
  1137      } else if (pcms->sgx_epc.size != 0) {
  1138              hole64_start = sgx_epc_above_4g_end(&pcms->sgx_epc);
  1139      } else {
  1140          hole64_start = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
  1141      }
Igor Mammedov June 20, 2022, 2:27 p.m. UTC | #6
On Fri, 17 Jun 2022 14:33:02 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 6/17/22 13:32, Igor Mammedov wrote:
> > On Fri, 17 Jun 2022 13:18:38 +0100
> > Joao Martins <joao.m.martins@oracle.com> wrote:  
> >> On 6/16/22 15:23, Igor Mammedov wrote:  
> >>> On Fri, 20 May 2022 11:45:31 +0100
> >>> Joao Martins <joao.m.martins@oracle.com> wrote:  
> >>>> +                                hwaddr above_4g_mem_start,
> >>>> +                                uint64_t pci_hole64_size)
> >>>> +{
> >>>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> >>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> >>>> +    MachineState *machine = MACHINE(pcms);
> >>>> +    ram_addr_t device_mem_size = 0;
> >>>> +    hwaddr base;
> >>>> +
> >>>> +    if (!x86ms->above_4g_mem_size) {
> >>>> +       /*
> >>>> +        * 32-bit pci hole goes from
> >>>> +        * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
> >>>> +        */
> >>>> +        return IO_APIC_DEFAULT_ADDRESS - 1;    
> >>>
> >>> lack of above_4g_mem, doesn't mean absence of device_mem_size or anything else
> >>> that's located above it.
> >>>     
> >>
> >> True. But the intent is to fix 32-bit boundaries as one of the qtests was failing
> >> otherwise. We won't hit the 1T hole, hence a nop.  
> > 
> > I don't get the reasoning, can you clarify it pls?
> >   
> 
> I was trying to say that what lead me here was a couple of qtests failures (from v3->v4).
> 
> I was doing this before based on pci_hole64. phys-bits=32 was for example one
> of the test failures, and pci-hole64 sits above what 32-bit can reference.

if user sets phys-bits=32, then nothing above 4Gb should work (be usable)
(including above-4g-ram, hotplug region or pci64 hole or sgx or cxl)

and this doesn't look to me like an AMD-specific issue

perhaps do a phys-bits check as a separate patch
that will error out if max_used_gpa is above phys-bits limit
(maybe at machine_done time)
(i.e. defining max_gpa and checking if compatible with configured cpu
are 2 different things)

(it might be possible that tests need to be fixed too to account for it)
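
something along these lines (untested sketch; it reuses the error message
from this patch but drops the AMD/relocation specifics):

static void pc_validate_max_used_gpa(uint64_t pci_hole64_size)
{
    X86CPU *cpu = X86_CPU(first_cpu);
    hwaddr maxphysaddr = ((hwaddr)1 << cpu->phys_bits) - 1;
    hwaddr maxusedaddr = pc_pci_hole64_start() + pci_hole64_size;

    if (maxphysaddr < maxusedaddr) {
        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
                     " phys-bits too low (%u)",
                     maxphysaddr, maxusedaddr, cpu->phys_bits);
        exit(EXIT_FAILURE);
    }
}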

> >>  Unless we plan on using
> >> pc_max_used_gpa() for something else other than this.  
> > 
> > Even if '!above_4g_mem_sizem', we can still have hotpluggable memory region
> > present and that can  hit 1Tb. The same goes for pci64_hole if it's configured
> > large enough on CLI.
> >   
> So hotpluggable memory seems to assume it sits above 4g mem.
> 
> pci_hole64 likewise as it uses similar computations as hotplug.
> 
> Unless I am misunderstanding something here.
> 
> > Looks like guesstimate we could use is taking pci64_hole_end as max used GPA
> >   
> I think this was what I had before (v3[0]) and did not work.

that had been tied to host's phys-bits directly, all in one patch
and duplicating existing pc_pci_hole64_start().
 
> Let me revisit this edge case again.
> 
> [0] https://lore.kernel.org/all/20220223184455.9057-5-joao.m.martins@oracle.com/
>
Joao Martins June 20, 2022, 4:36 p.m. UTC | #7
On 6/20/22 15:27, Igor Mammedov wrote:
> On Fri, 17 Jun 2022 14:33:02 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
>> On 6/17/22 13:32, Igor Mammedov wrote:
>>> On Fri, 17 Jun 2022 13:18:38 +0100
>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>> On 6/16/22 15:23, Igor Mammedov wrote:  
>>>>> On Fri, 20 May 2022 11:45:31 +0100
>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>>>> +                                hwaddr above_4g_mem_start,
>>>>>> +                                uint64_t pci_hole64_size)
>>>>>> +{
>>>>>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>>> +    MachineState *machine = MACHINE(pcms);
>>>>>> +    ram_addr_t device_mem_size = 0;
>>>>>> +    hwaddr base;
>>>>>> +
>>>>>> +    if (!x86ms->above_4g_mem_size) {
>>>>>> +       /*
>>>>>> +        * 32-bit pci hole goes from
>>>>>> +        * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
>>>>>> +        */
>>>>>> +        return IO_APIC_DEFAULT_ADDRESS - 1;    
>>>>>
>>>>> lack of above_4g_mem, doesn't mean absence of device_mem_size or anything else
>>>>> that's located above it.
>>>>>     
>>>>
>>>> True. But the intent is to fix 32-bit boundaries as one of the qtests was failing
>>>> otherwise. We won't hit the 1T hole, hence a nop.  
>>>
>>> I don't get the reasoning, can you clarify it pls?
>>>   
>>
>> I was trying to say that what lead me here was a couple of qtests failures (from v3->v4).
>>
>> I was doing this before based on pci_hole64. phys-bits=32 was for example one
>> of the test failures, and pci-hole64 sits above what 32-bit can reference.
> 
> if user sets phys-bits=32, then nothing above 4Gb should work (be usable)
> (including above-4g-ram, hotplug region or pci64 hole or sgx or cxl)
> 
> and this doesn't look to me as AMD specific issue
> 
> perhaps do a phys-bits check as a separate patch
> that will error out if max_used_gpa is above phys-bits limit
> (maybe at machine_done time)
> (i.e. defining max_gpa and checking if compatible with configured cpu
> are 2 different things)
> 
> (it might be possible that tests need to be fixed too to account for it)
> 

My old notes (from v3) tell me with such a check these tests were exiting early thanks to
that error:

 1/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/qom-test               ERROR           0.07s
  killed by signal 6 SIGABRT
 4/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-hmp               ERROR           0.07s
  killed by signal 6 SIGABRT
 7/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/boot-serial-test       ERROR           0.07s
  killed by signal 6 SIGABRT
44/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-x86-cpuid-compat  ERROR           0.09s
  killed by signal 6 SIGABRT
45/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/numa-test              ERROR           0.17s
  killed by signal 6 SIGABRT

But the real reason these fail is not at all related to CPU phys bits,
but because we just don't handle the case where no pci_hole64 is supposed to exist (which
is what that other check is trying to do). E.g. a VM with -m 1G would
observe the same thing, i.e. the computations after that conditional are all for the pci
hole64, which accounts for SGX/CXL/hotplug etc., which consequently means it's *erroneously*
bigger than phys-bits=32 (by definition). So the error_report is just telling me that
pc_max_used_gpa() is incorrect without the !x86ms->above_4g_mem_size check.

If you're not fond of:

+    if (!x86ms->above_4g_mem_size) {
+       /*
+        * 32-bit pci hole goes from
+        * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
+         */
+        return IO_APIC_DEFAULT_ADDRESS - 1;
+    }

Then what should I use instead of the above?

'IO_APIC_DEFAULT_ADDRESS - 1' is the size of the 32-bit PCI hole, which is
also what is used for i440fx/q35 code. I could move it to a macro (e.g.
PCI_HOST_HOLE32_SIZE) to make it a bit more readable and less hardcoded. Or
perhaps your problem is with !x86ms->above_4g_mem_size and maybe I should check
in addition for hotplug/CXL/etc existence?
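
i.e. something like (hypothetical name, just to make the intent readable):

/* hypothetical, as suggested above; the value i440fx/q35 already use */
#define PCI_HOST_HOLE32_SIZE    (IO_APIC_DEFAULT_ADDRESS - 1)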

>>>>  Unless we plan on using
>>>> pc_max_used_gpa() for something else other than this.  
>>>
>>> Even if '!above_4g_mem_sizem', we can still have hotpluggable memory region
>>> present and that can  hit 1Tb. The same goes for pci64_hole if it's configured
>>> large enough on CLI.
>>>   
>> So hotpluggable memory seems to assume it sits above 4g mem.
>>
>> pci_hole64 likewise as it uses similar computations as hotplug.
>>
>> Unless I am misunderstanding something here.
>>
>>> Looks like guesstimate we could use is taking pci64_hole_end as max used GPA
>>>   
>> I think this was what I had before (v3[0]) and did not work.
> 
> that had been tied to host's phys-bits directly, all in one patch
> and duplicating existing pc_pci_hole64_start().
>  

Duplicating was sort of my bad attempt at pc_max_used_gpa() in this patch.

I was sort of thinking of something like extracting the start + size "tuple" calculations into
functions -- e.g. for hotplug it is pc_get_device_memory_range() and for CXL it would
maybe be pc_get_cxl_range() -- rather than assuming those values are already initialized in
the memory region's @base and its size.

See snippet below. Note I am missing CXL handling, but it gives you the idea.

But it is slightly more complex than what I had in this version :( and would require
anyone doing changes in pc_memory_init() and pc_pci_hole64_start() to make sure they follow
similar logic.

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index fd088093b5d5..016bc65fcb4b 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -885,6 +885,34 @@ static void pc_set_amd_above_4g_mem_start(PCMachineState *pcms,
     x86ms->above_4g_mem_start = start;
 }

+static void pc_get_device_memory_range(PCMachineState *pcms,
+                                       hwaddr *base,
+                                       hwaddr *device_mem_size)
+{
+    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
+    X86MachineState *x86ms = X86_MACHINE(pcms);
+    MachineState *machine = MACHINE(pcms);
+    hwaddr addr, size;
+
+    size = machine->maxram_size - machine->ram_size;
+
+    if (pcms->sgx_epc.size != 0) {
+        addr = sgx_epc_above_4g_end(&pcms->sgx_epc);
+    } else {
+        addr = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
+    }
+
+    if (pcmc->enforce_aligned_dimm) {
+        /* size device region assuming 1G page max alignment per slot */
+        size += (1 * GiB) * machine->ram_slots;
+    }
+
+    if (base)
+        *base = addr;
+    if (device_mem_size)
+        *device_mem_size = size;
+}
+
 void pc_memory_init(PCMachineState *pcms,
                     MemoryRegion *system_memory,
                     MemoryRegion *rom_memory,
@@ -962,7 +990,7 @@ void pc_memory_init(PCMachineState *pcms,
     /* initialize device memory address space */
     if (pcmc->has_reserved_memory &&
         (machine->ram_size < machine->maxram_size)) {
-        ram_addr_t device_mem_size = machine->maxram_size - machine->ram_size;
+        ram_addr_t device_mem_size;

         if (machine->ram_slots > ACPI_MAX_RAM_SLOTS) {
             error_report("unsupported amount of memory slots: %"PRIu64,
@@ -977,20 +1005,7 @@ void pc_memory_init(PCMachineState *pcms,
             exit(EXIT_FAILURE);
         }

-        if (pcms->sgx_epc.size != 0) {
-            machine->device_memory->base = sgx_epc_above_4g_end(&pcms->sgx_epc);
-        } else {
-            machine->device_memory->base =
-                x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
-        }
-
-        machine->device_memory->base =
-            ROUND_UP(machine->device_memory->base, 1 * GiB);
-
-        if (pcmc->enforce_aligned_dimm) {
-            /* size device region assuming 1G page max alignment per slot */
-            device_mem_size += (1 * GiB) * machine->ram_slots;
-        }
+        pc_get_device_memory_range(pcms, &machine->device_memory->base, &device_mem_size);

         if ((machine->device_memory->base + device_mem_size) <
             device_mem_size) {
@@ -1053,6 +1068,27 @@ void pc_memory_init(PCMachineState *pcms,
     pcms->memhp_io_base = ACPI_MEMORY_HOTPLUG_BASE;
 }

+static uint64_t x86ms_pci_hole64_start(PCMachineState *pcms)
+{
+    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
+    X86MachineState *x86ms = X86_MACHINE(pcms);
+    MachineState *machine = MACHINE(pcms);
+    uint64_t hole64_start, size;
+
+    if (pcmc->has_reserved_memory &&
+        (machine->ram_size < machine->maxram_size)) {
+        pc_get_device_memory_range(pcms, &hole64_start, &size);
+        if (!pcmc->broken_reserved_end) {
+            hole64_start += size;
+        }
+    } else if (pcms->sgx_epc.size != 0) {
+        hole64_start = sgx_epc_above_4g_end(&pcms->sgx_epc);
+    } else {
+        hole64_start = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
+    }
+
+    return hole64_start;
+}
 /*
  * The 64bit pci hole starts after "above 4G RAM" and
  * potentially the space reserved for memory hotplug.
@@ -1062,18 +1098,17 @@ uint64_t pc_pci_hole64_start(void)
     PCMachineState *pcms = PC_MACHINE(qdev_get_machine());
     PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
     MachineState *ms = MACHINE(pcms);
-    X86MachineState *x86ms = X86_MACHINE(pcms);
     uint64_t hole64_start = 0;

-    if (pcmc->has_reserved_memory && ms->device_memory->base) {
+    if (pcmc->has_reserved_memory &&
+        ms->device_memory && ms->device_memory->base) {
         hole64_start = ms->device_memory->base;
         if (!pcmc->broken_reserved_end) {
             hole64_start += memory_region_size(&ms->device_memory->mr);
         }
-    } else if (pcms->sgx_epc.size != 0) {
-            hole64_start = sgx_epc_above_4g_end(&pcms->sgx_epc);
     } else {
-        hole64_start = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
+        /* handles unpopulated memory regions */
+        hole64_start = x86ms_pci_hole64_start(pcms);
     }

     return ROUND_UP(hole64_start, 1 * GiB);
Joao Martins June 20, 2022, 6:13 p.m. UTC | #8
On 6/20/22 17:36, Joao Martins wrote:
> On 6/20/22 15:27, Igor Mammedov wrote:
>> On Fri, 17 Jun 2022 14:33:02 +0100
>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>> On 6/17/22 13:32, Igor Mammedov wrote:
>>>> On Fri, 17 Jun 2022 13:18:38 +0100
>>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>>> On 6/16/22 15:23, Igor Mammedov wrote:  
>>>>>> On Fri, 20 May 2022 11:45:31 +0100
>>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>>>>> +                                hwaddr above_4g_mem_start,
>>>>>>> +                                uint64_t pci_hole64_size)
>>>>>>> +{
>>>>>>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>>>> +    MachineState *machine = MACHINE(pcms);
>>>>>>> +    ram_addr_t device_mem_size = 0;
>>>>>>> +    hwaddr base;
>>>>>>> +
>>>>>>> +    if (!x86ms->above_4g_mem_size) {
>>>>>>> +       /*
>>>>>>> +        * 32-bit pci hole goes from
>>>>>>> +        * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
>>>>>>> +        */
>>>>>>> +        return IO_APIC_DEFAULT_ADDRESS - 1;    
>>>>>>
>>>>>> lack of above_4g_mem, doesn't mean absence of device_mem_size or anything else
>>>>>> that's located above it.
>>>>>>     
>>>>>
>>>>> True. But the intent is to fix 32-bit boundaries as one of the qtests was failing
>>>>> otherwise. We won't hit the 1T hole, hence a nop.  
>>>>
>>>> I don't get the reasoning, can you clarify it pls?
>>>>   
>>>
>>> I was trying to say that what lead me here was a couple of qtests failures (from v3->v4).
>>>
>>> I was doing this before based on pci_hole64. phys-bits=32 was for example one
>>> of the test failures, and pci-hole64 sits above what 32-bit can reference.
>>
>> if user sets phys-bits=32, then nothing above 4Gb should work (be usable)
>> (including above-4g-ram, hotplug region or pci64 hole or sgx or cxl)
>>
>> and this doesn't look to me as AMD specific issue
>>
>> perhaps do a phys-bits check as a separate patch
>> that will error out if max_used_gpa is above phys-bits limit
>> (maybe at machine_done time)
>> (i.e. defining max_gpa and checking if compatible with configured cpu
>> are 2 different things)
>>
>> (it might be possible that tests need to be fixed too to account for it)
>>
> 
> My old notes (from v3) tell me with such a check these tests were exiting early thanks to
> that error:
> 
>  1/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/qom-test               ERROR           0.07s
>   killed by signal 6 SIGABRT
>  4/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-hmp               ERROR           0.07s
>   killed by signal 6 SIGABRT
>  7/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/boot-serial-test       ERROR           0.07s
>   killed by signal 6 SIGABRT
> 44/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-x86-cpuid-compat  ERROR           0.09s
>   killed by signal 6 SIGABRT
> 45/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/numa-test              ERROR           0.17s
>   killed by signal 6 SIGABRT
> 
> But the real reason these fail is not at all related to CPU phys bits,
> but because we just don't handle the case where no pci_hole64 is supposed to exist (which
> is what that other check is trying to do) e.g. A VM with -m 1G would
> observe the same thing i.e. the computations after that conditional are all for the pci
> hole64, which acounts for SGX/CXL/hotplug or etc which consequently means it's *errousnly*
> bigger than phys-bits=32 (by definition). So the error_report is just telling me that
> pc_max_used_gpa() is just incorrect without the !x86ms->above_4g_mem_size check.
> 
> If you're not fond of:
> 
> +    if (!x86ms->above_4g_mem_size) {
> +       /*
> +        * 32-bit pci hole goes from
> +        * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
> +         */
> +        return IO_APIC_DEFAULT_ADDRESS - 1;
> +    }
> 
> Then what should I use instead of the above?
> 
> 'IO_APIC_DEFAULT_ADDRESS - 1' is the size of the 32-bit PCI hole, which is
> also what is used for i440fx/q35 code. I could move it to a macro (e.g.
> PCI_HOST_HOLE32_SIZE) to make it a bit readable and less hardcoded. Or
> perhaps your problem is on !x86ms->above_4g_mem_size and maybe I should check
> in addition for hotplug/CXL/etc existence?
> 
>>>>>  Unless we plan on using
>>>>> pc_max_used_gpa() for something else other than this.  
>>>>
>>>> Even if '!above_4g_mem_sizem', we can still have hotpluggable memory region
>>>> present and that can  hit 1Tb. The same goes for pci64_hole if it's configured
>>>> large enough on CLI.
>>>>   
>>> So hotpluggable memory seems to assume it sits above 4g mem.
>>>
>>> pci_hole64 likewise as it uses similar computations as hotplug.
>>>
>>> Unless I am misunderstanding something here.
>>>
>>>> Looks like guesstimate we could use is taking pci64_hole_end as max used GPA
>>>>   
>>> I think this was what I had before (v3[0]) and did not work.
>>
>> that had been tied to host's phys-bits directly, all in one patch
>> and duplicating existing pc_pci_hole64_start().
>>  
> 
> Duplicating was sort of my bad attempt in this patch for pc_max_used_gpa()
> 
> I was sort of thinking to something like extracting calls to start + size "tuple" into
> functions -- e.g. for hotplug it is pc_get_device_memory_range() and for CXL it would be
> maybe pc_get_cxl_range()) -- rather than assuming those values are already initialized on
> the memory-region @base and its size.
> 
> See snippet below. Note I am missing CXL handling, but gives you the idea.
> 
> But it is slightly more complex than what I had in this version :( and would require
> anyone doing changes in pc_memory_init() and pc_pci_hole64_start() to make sure it follows
> the similar logic.
> 

Ignore previous snippet, here's a slightly cleaner version:

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 8eaa32ee2106..1d97c77a5eac 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -803,6 +803,43 @@ void xen_load_linux(PCMachineState *pcms)
 #define PC_ROM_ALIGN       0x800
 #define PC_ROM_SIZE        (PC_ROM_MAX - PC_ROM_MIN_VGA)

+static void pc_get_device_memory_range(PCMachineState *pcms,
+                                       hwaddr *base,
+                                       hwaddr *device_mem_size)
+{
+    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
+    X86MachineState *x86ms = X86_MACHINE(pcms);
+    MachineState *machine = MACHINE(pcms);
+    hwaddr addr, size;
+
+    if (pcmc->has_reserved_memory &&
+        machine->device_memory && machine->device_memory->base) {
+        addr = machine->device_memory->base;
+        size = memory_region_size(&machine->device_memory->mr);
+        goto out;
+    }
+
+    /* uninitialized memory region */
+    size = machine->maxram_size - machine->ram_size;
+
+    if (pcms->sgx_epc.size != 0) {
+        addr = sgx_epc_above_4g_end(&pcms->sgx_epc);
+    } else {
+        addr = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
+    }
+
+    if (pcmc->enforce_aligned_dimm) {
+        /* size device region assuming 1G page max alignment per slot */
+        size += (1 * GiB) * machine->ram_slots;
+    }
+
+out:
+    if (base)
+        *base = addr;
+    if (device_mem_size)
+        *device_mem_size = size;
+}
+
 void pc_memory_init(PCMachineState *pcms,
                     MemoryRegion *system_memory,
                     MemoryRegion *rom_memory,
@@ -864,7 +901,7 @@ void pc_memory_init(PCMachineState *pcms,
     /* initialize device memory address space */
     if (pcmc->has_reserved_memory &&
         (machine->ram_size < machine->maxram_size)) {
-        ram_addr_t device_mem_size = machine->maxram_size - machine->ram_size;
+        ram_addr_t device_mem_size;

         if (machine->ram_slots > ACPI_MAX_RAM_SLOTS) {
             error_report("unsupported amount of memory slots: %"PRIu64,
@@ -879,20 +916,7 @@ void pc_memory_init(PCMachineState *pcms,
             exit(EXIT_FAILURE);
         }

-        if (pcms->sgx_epc.size != 0) {
-            machine->device_memory->base = sgx_epc_above_4g_end(&pcms->sgx_epc);
-        } else {
-            machine->device_memory->base =
-                x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
-        }
-
-        machine->device_memory->base =
-            ROUND_UP(machine->device_memory->base, 1 * GiB);
-
-        if (pcmc->enforce_aligned_dimm) {
-            /* size device region assuming 1G page max alignment per slot */
-            device_mem_size += (1 * GiB) * machine->ram_slots;
-        }
+        pc_get_device_memory_range(pcms, &machine->device_memory->base, &device_mem_size);

         if ((machine->device_memory->base + device_mem_size) <
             device_mem_size) {
@@ -965,12 +989,13 @@ uint64_t pc_pci_hole64_start(void)
     PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
     MachineState *ms = MACHINE(pcms);
     X86MachineState *x86ms = X86_MACHINE(pcms);
-    uint64_t hole64_start = 0;
+    uint64_t hole64_start = 0, size = 0;

-    if (pcmc->has_reserved_memory && ms->device_memory->base) {
-        hole64_start = ms->device_memory->base;
+    if (pcmc->has_reserved_memory &&
+        (ms->ram_size < ms->maxram_size)) {
+        pc_get_device_memory_range(pcms, &hole64_start, &size);
         if (!pcmc->broken_reserved_end) {
-            hole64_start += memory_region_size(&ms->device_memory->mr);
+            hole64_start += size;
         }
     } else if (pcms->sgx_epc.size != 0) {
             hole64_start = sgx_epc_above_4g_end(&pcms->sgx_epc);
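
FWIW, with pc_get_device_memory_range() in place, the max-used-GPA computation in this
patch could also shrink to roughly the shape below (untested sketch, just to show the
idea; it still keeps the !above_4g_mem_size special case we are discussing):

static hwaddr pc_max_used_gpa(PCMachineState *pcms, uint64_t pci_hole64_size)
{
    X86MachineState *x86ms = X86_MACHINE(pcms);
    hwaddr devmem_base = 0, devmem_size = 0;

    if (!x86ms->above_4g_mem_size) {
        /* 32-bit pci hole: end-of-low-ram up to the IO-APIC */
        return IO_APIC_DEFAULT_ADDRESS - 1;
    }

    /* end of hotplug memory region (or of above-4g RAM if there is none) */
    pc_get_device_memory_range(pcms, &devmem_base, &devmem_size);

    return devmem_base + devmem_size + pci_hole64_size;
}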
Igor Mammedov June 28, 2022, 12:38 p.m. UTC | #9
On Mon, 20 Jun 2022 19:13:46 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 6/20/22 17:36, Joao Martins wrote:
> > On 6/20/22 15:27, Igor Mammedov wrote:  
> >> On Fri, 17 Jun 2022 14:33:02 +0100
> >> Joao Martins <joao.m.martins@oracle.com> wrote:  
> >>> On 6/17/22 13:32, Igor Mammedov wrote:  
> >>>> On Fri, 17 Jun 2022 13:18:38 +0100
> >>>> Joao Martins <joao.m.martins@oracle.com> wrote:    
> >>>>> On 6/16/22 15:23, Igor Mammedov wrote:    
> >>>>>> On Fri, 20 May 2022 11:45:31 +0100
> >>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:    
> >>>>>>> +                                hwaddr above_4g_mem_start,
> >>>>>>> +                                uint64_t pci_hole64_size)
> >>>>>>> +{
> >>>>>>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> >>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> >>>>>>> +    MachineState *machine = MACHINE(pcms);
> >>>>>>> +    ram_addr_t device_mem_size = 0;
> >>>>>>> +    hwaddr base;
> >>>>>>> +
> >>>>>>> +    if (!x86ms->above_4g_mem_size) {
> >>>>>>> +       /*
> >>>>>>> +        * 32-bit pci hole goes from
> >>>>>>> +        * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
> >>>>>>> +        */
> >>>>>>> +        return IO_APIC_DEFAULT_ADDRESS - 1;      
> >>>>>>
> >>>>>> lack of above_4g_mem, doesn't mean absence of device_mem_size or anything else
> >>>>>> that's located above it.
> >>>>>>       
> >>>>>
> >>>>> True. But the intent is to fix 32-bit boundaries as one of the qtests was failing
> >>>>> otherwise. We won't hit the 1T hole, hence a nop.    
> >>>>
> >>>> I don't get the reasoning, can you clarify it pls?
> >>>>     
> >>>
> >>> I was trying to say that what led me here was a couple of qtest failures (from v3->v4).
> >>>
> >>> I was doing this before based on pci_hole64. phys-bits=32 was for example one
> >>> of the test failures, and pci-hole64 sits above what 32-bit can reference.  
> >>
> >> if user sets phys-bits=32, then nothing above 4Gb should work (be usable)
> >> (including above-4g-ram, hotplug region or pci64 hole or sgx or cxl)
> >>
> >> and this doesn't look to me as AMD specific issue
> >>
> >> perhaps do a phys-bits check as a separate patch
> >> that will error out if max_used_gpa is above phys-bits limit
> >> (maybe at machine_done time)
> >> (i.e. defining max_gpa and checking if compatible with configured cpu
> >> are 2 different things)
> >>
> >> (it might be possible that tests need to be fixed too to account for it)
> >>  
> > 
> > My old notes (from v3) tell me with such a check these tests were exiting early thanks to
> > that error:
> > 
> >  1/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/qom-test               ERROR           0.07s
> >   killed by signal 6 SIGABRT
> >  4/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-hmp               ERROR           0.07s
> >   killed by signal 6 SIGABRT
> >  7/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/boot-serial-test       ERROR           0.07s
> >   killed by signal 6 SIGABRT
> > 44/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-x86-cpuid-compat  ERROR           0.09s
> >   killed by signal 6 SIGABRT
> > 45/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/numa-test              ERROR           0.17s
> >   killed by signal 6 SIGABRT
> > 
> > But the real reason these fail is not at all related to CPU phys bits,
> > but because we just don't handle the case where no pci_hole64 is supposed to exist (which
> > is what that other check is trying to do) e.g. a VM with -m 1G would
> > observe the same thing i.e. the computations after that conditional are all for the pci
> > hole64, which accounts for SGX/CXL/hotplug etc. and consequently means it's *erroneously*
> > bigger than phys-bits=32 (by definition). So the error_report is just telling me that
> > pc_max_used_gpa() is incorrect without the !x86ms->above_4g_mem_size check.
> > 
> > If you're not fond of:
> > 
> > +    if (!x86ms->above_4g_mem_size) {
> > +       /*
> > +        * 32-bit pci hole goes from
> > +        * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
> > +         */
> > +        return IO_APIC_DEFAULT_ADDRESS - 1;
> > +    }
> > 
> > Then what should I use instead of the above?
> > 
> > 'IO_APIC_DEFAULT_ADDRESS - 1' is the size of the 32-bit PCI hole, which is
> > also what is used for i440fx/q35 code. I could move it to a macro (e.g.
> > PCI_HOST_HOLE32_SIZE) to make it a bit more readable and less hardcoded. Or
> > perhaps your problem is on !x86ms->above_4g_mem_size and maybe I should check
> > in addition for hotplug/CXL/etc existence?
> >   
> >>>>>  Unless we plan on using
> >>>>> pc_max_used_gpa() for something else other than this.    
> >>>>
> >>>> Even if '!above_4g_mem_sizem', we can still have hotpluggable memory region
> >>>> present and that can  hit 1Tb. The same goes for pci64_hole if it's configured
> >>>> large enough on CLI.
> >>>>     
> >>> So hotpluggable memory seems to assume it sits above 4g mem.
> >>>
> >>> pci_hole64 likewise, as it uses computations similar to hotplug's.
> >>>
> >>> Unless I am misunderstanding something here.
> >>>  
> >>>> Looks like guesstimate we could use is taking pci64_hole_end as max used GPA
> >>>>     
> >>> I think this was what I had before (v3[0]) and did not work.  
> >>
> >> that had been tied to host's phys-bits directly, all in one patch
> >> and duplicating existing pc_pci_hole64_start().
> >>    
> > 
> > Duplicating was sort of my bad attempt in this patch for pc_max_used_gpa()
> > 
> > I was sort of thinking of something like extracting the start + size "tuple" computations
> > into helper functions -- e.g. for hotplug it is pc_get_device_memory_range() and for CXL it
> > would maybe be pc_get_cxl_range() -- rather than assuming those values are already
> > initialized in the memory-region @base and its size.
> > 
> > See snippet below. Note I am missing CXL handling, but it gives you the idea.
> > 
> > But it is slightly more complex than what I had in this version :( and would require
> > anyone making changes in pc_memory_init() and pc_pci_hole64_start() to make sure they
> > follow the same logic.
> >   
> 
> Ignore previous snippet, here's a slightly cleaner version:

let's go with this version

> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 8eaa32ee2106..1d97c77a5eac 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -803,6 +803,43 @@ void xen_load_linux(PCMachineState *pcms)
>  #define PC_ROM_ALIGN       0x800
>  #define PC_ROM_SIZE        (PC_ROM_MAX - PC_ROM_MIN_VGA)
> 
> +static void pc_get_device_memory_range(PCMachineState *pcms,
> +                                       hwaddr *base,
> +                                       hwaddr *device_mem_size)
> +{
> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> +    MachineState *machine = MACHINE(pcms);
> +    hwaddr addr, size;
> +
> +    if (pcmc->has_reserved_memory &&
> +        machine->device_memory && machine->device_memory->base) {
> +        addr = machine->device_memory->base;
> +        size = memory_region_size(&machine->device_memory->mr);
> +        goto out;
> +    }
> +
> +    /* uninitialized memory region */
> +    size = machine->maxram_size - machine->ram_size;
> +
> +    if (pcms->sgx_epc.size != 0) {
> +        addr = sgx_epc_above_4g_end(&pcms->sgx_epc);
> +    } else {
> +        addr = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
> +    }
> +
> +    if (pcmc->enforce_aligned_dimm) {
> +        /* size device region assuming 1G page max alignment per slot */
> +        size += (1 * GiB) * machine->ram_slots;
> +    }
> +
> +out:
> +    if (base)
> +        *base = addr;
> +    if (device_mem_size)
> +        *device_mem_size = size;
> +}
> +
>  void pc_memory_init(PCMachineState *pcms,
>                      MemoryRegion *system_memory,
>                      MemoryRegion *rom_memory,
> @@ -864,7 +901,7 @@ void pc_memory_init(PCMachineState *pcms,
>      /* initialize device memory address space */
>      if (pcmc->has_reserved_memory &&
>          (machine->ram_size < machine->maxram_size)) {
> -        ram_addr_t device_mem_size = machine->maxram_size - machine->ram_size;
> +        ram_addr_t device_mem_size;
> 
>          if (machine->ram_slots > ACPI_MAX_RAM_SLOTS) {
>              error_report("unsupported amount of memory slots: %"PRIu64,
> @@ -879,20 +916,7 @@ void pc_memory_init(PCMachineState *pcms,
>              exit(EXIT_FAILURE);
>          }
> 
> -        if (pcms->sgx_epc.size != 0) {
> -            machine->device_memory->base = sgx_epc_above_4g_end(&pcms->sgx_epc);
> -        } else {
> -            machine->device_memory->base =
> -                x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
> -        }
> -
> -        machine->device_memory->base =
> -            ROUND_UP(machine->device_memory->base, 1 * GiB);
> -
> -        if (pcmc->enforce_aligned_dimm) {
> -            /* size device region assuming 1G page max alignment per slot */
> -            device_mem_size += (1 * GiB) * machine->ram_slots;
> -        }
> +        pc_get_device_memory_range(pcms, &machine->device_memory->base, &device_mem_size);
> 
>          if ((machine->device_memory->base + device_mem_size) <
>              device_mem_size) {
> @@ -965,12 +989,13 @@ uint64_t pc_pci_hole64_start(void)
>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>      MachineState *ms = MACHINE(pcms);
>      X86MachineState *x86ms = X86_MACHINE(pcms);
> -    uint64_t hole64_start = 0;
> +    uint64_t hole64_start = 0, size = 0;
> 
> -    if (pcmc->has_reserved_memory && ms->device_memory->base) {
> -        hole64_start = ms->device_memory->base;
> +    if (pcmc->has_reserved_memory &&
> +        (ms->ram_size < ms->maxram_size)) {
> +        pc_get_device_memory_range(pcms, &hole64_start, &size);
>          if (!pcmc->broken_reserved_end) {
> -            hole64_start += memory_region_size(&ms->device_memory->mr);
> +            hole64_start += size;
>          }
>      } else if (pcms->sgx_epc.size != 0) {
>              hole64_start = sgx_epc_above_4g_end(&pcms->sgx_epc);
>
Joao Martins June 28, 2022, 3:27 p.m. UTC | #10
On 6/28/22 13:38, Igor Mammedov wrote:
> On Mon, 20 Jun 2022 19:13:46 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> On 6/20/22 17:36, Joao Martins wrote:
>>> On 6/20/22 15:27, Igor Mammedov wrote:  
>>>> On Fri, 17 Jun 2022 14:33:02 +0100
>>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>>> On 6/17/22 13:32, Igor Mammedov wrote:  
>>>>>> On Fri, 17 Jun 2022 13:18:38 +0100
>>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:    
>>>>>>> On 6/16/22 15:23, Igor Mammedov wrote:    
>>>>>>>> On Fri, 20 May 2022 11:45:31 +0100
>>>>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:    
>>>>>>>>> +                                hwaddr above_4g_mem_start,
>>>>>>>>> +                                uint64_t pci_hole64_size)
>>>>>>>>> +{
>>>>>>>>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>>>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>>>>>> +    MachineState *machine = MACHINE(pcms);
>>>>>>>>> +    ram_addr_t device_mem_size = 0;
>>>>>>>>> +    hwaddr base;
>>>>>>>>> +
>>>>>>>>> +    if (!x86ms->above_4g_mem_size) {
>>>>>>>>> +       /*
>>>>>>>>> +        * 32-bit pci hole goes from
>>>>>>>>> +        * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
>>>>>>>>> +        */
>>>>>>>>> +        return IO_APIC_DEFAULT_ADDRESS - 1;      
>>>>>>>>
>>>>>>>> lack of above_4g_mem, doesn't mean absence of device_mem_size or anything else
>>>>>>>> that's located above it.
>>>>>>>>       
>>>>>>>
>>>>>>> True. But the intent is to fix 32-bit boundaries as one of the qtests was failing
>>>>>>> otherwise. We won't hit the 1T hole, hence a nop.    
>>>>>>
>>>>>> I don't get the reasoning, can you clarify it pls?
>>>>>>     
>>>>>
>>>>> I was trying to say that what led me here was a couple of qtest failures (from v3->v4).
>>>>>
>>>>> I was doing this before based on pci_hole64. phys-bits=32 was for example one
>>>>> of the test failures, and pci-hole64 sits above what 32-bit can reference.  
>>>>
>>>> if user sets phys-bits=32, then nothing above 4Gb should work (be usable)
>>>> (including above-4g-ram, hotplug region or pci64 hole or sgx or cxl)
>>>>
>>>> and this doesn't look to me as AMD specific issue
>>>>
>>>> perhaps do a phys-bits check as a separate patch
>>>> that will error out if max_used_gpa is above phys-bits limit
>>>> (maybe at machine_done time)
>>>> (i.e. defining max_gpa and checking if compatible with configured cpu
>>>> are 2 different things)
>>>>
>>>> (it might be possible that tests need to be fixed too to account for it)
>>>>  
>>>
>>> My old notes (from v3) tell me with such a check these tests were exiting early thanks to
>>> that error:
>>>
>>>  1/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/qom-test               ERROR           0.07s
>>>   killed by signal 6 SIGABRT
>>>  4/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-hmp               ERROR           0.07s
>>>   killed by signal 6 SIGABRT
>>>  7/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/boot-serial-test       ERROR           0.07s
>>>   killed by signal 6 SIGABRT
>>> 44/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-x86-cpuid-compat  ERROR           0.09s
>>>   killed by signal 6 SIGABRT
>>> 45/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/numa-test              ERROR           0.17s
>>>   killed by signal 6 SIGABRT
>>>
>>> But the real reason these fail is not at all related to CPU phys bits,
>>> but because we just don't handle the case where no pci_hole64 is supposed to exist (which
>>> is what that other check is trying to do) e.g. a VM with -m 1G would
>>> observe the same thing i.e. the computations after that conditional are all for the pci
>>> hole64, which accounts for SGX/CXL/hotplug etc. and consequently means it's *erroneously*
>>> bigger than phys-bits=32 (by definition). So the error_report is just telling me that
>>> pc_max_used_gpa() is incorrect without the !x86ms->above_4g_mem_size check.
>>>
>>> If you're not fond of:
>>>
>>> +    if (!x86ms->above_4g_mem_size) {
>>> +       /*
>>> +        * 32-bit pci hole goes from
>>> +        * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
>>> +         */
>>> +        return IO_APIC_DEFAULT_ADDRESS - 1;
>>> +    }
>>>
>>> Then what should I use instead of the above?
>>>
>>> 'IO_APIC_DEFAULT_ADDRESS - 1' is the size of the 32-bit PCI hole, which is
>>> also what is used for i440fx/q35 code. I could move it to a macro (e.g.
>>> PCI_HOST_HOLE32_SIZE) to make it a bit more readable and less hardcoded. Or
>>> perhaps your problem is on !x86ms->above_4g_mem_size and maybe I should check
>>> in addition for hotplug/CXL/etc existence?
>>>   
>>>>>>>  Unless we plan on using
>>>>>>> pc_max_used_gpa() for something else other than this.    
>>>>>>
>>>>>> Even if '!above_4g_mem_sizem', we can still have hotpluggable memory region
>>>>>> present and that can  hit 1Tb. The same goes for pci64_hole if it's configured
>>>>>> large enough on CLI.
>>>>>>     
>>>>> So hotpluggable memory seems to assume it sits above 4g mem.
>>>>>
>>>>> pci_hole64 likewise, as it uses computations similar to hotplug's.
>>>>>
>>>>> Unless I am misunderstanding something here.
>>>>>  
>>>>>> Looks like guesstimate we could use is taking pci64_hole_end as max used GPA
>>>>>>     
>>>>> I think this was what I had before (v3[0]) and did not work.  
>>>>
>>>> that had been tied to host's phys-bits directly, all in one patch
>>>> and duplicating existing pc_pci_hole64_start().
>>>>    
>>>
>>> Duplicating was sort of my bad attempt in this patch for pc_max_used_gpa()
>>>
>>> I was sort of thinking of something like extracting the start + size "tuple" computations
>>> into helper functions -- e.g. for hotplug it is pc_get_device_memory_range() and for CXL it
>>> would maybe be pc_get_cxl_range() -- rather than assuming those values are already
>>> initialized in the memory-region @base and its size.
>>>
>>> See snippet below. Note I am missing CXL handling, but it gives you the idea.
>>>
>>> But it is slightly more complex than what I had in this version :( and would require
>>> anyone making changes in pc_memory_init() and pc_pci_hole64_start() to make sure they
>>> follow the same logic.
>>>   
>>
>> Ignore previous snippet, here's a slightly cleaner version:
> 
> let's go with this version
> 

OK. I have split this into 5 new patches:

578f551a41f0 i386/pc: handle unitialized mr in pc_get_cxl_range_end()
49256313cfd9 i386/pc: factor out cxl range start to helper
4bc1836bd588 i386/pc: factor out cxl range end to helper
e83cc9d3081c i386/pc: factor out device_memory base/size to helper
1ccb5064338e i386/pc: factor out above-4g end to an helper

Will re-test and respin the series.

The core of the series (in this patch) doesn't change and just gets simpler.
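
To give an idea, the "above-4g end" helper is essentially the computation that is
open-coded in pc_memory_init() today, i.e. something along these lines (sketch only;
the name and details may end up slightly different in the respin):

static hwaddr pc_above_4g_end(PCMachineState *pcms)
{
    X86MachineState *x86ms = X86_MACHINE(pcms);

    /* SGX EPC sits right after above-4g RAM when present */
    if (pcms->sgx_epc.size != 0) {
        return sgx_epc_above_4g_end(&pcms->sgx_epc);
    }

    return x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
}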

>>
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index 8eaa32ee2106..1d97c77a5eac 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -803,6 +803,43 @@ void xen_load_linux(PCMachineState *pcms)
>>  #define PC_ROM_ALIGN       0x800
>>  #define PC_ROM_SIZE        (PC_ROM_MAX - PC_ROM_MIN_VGA)
>>
>> +static void pc_get_device_memory_range(PCMachineState *pcms,
>> +                                       hwaddr *base,
>> +                                       hwaddr *device_mem_size)
>> +{
>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>> +    MachineState *machine = MACHINE(pcms);
>> +    hwaddr addr, size;
>> +
>> +    if (pcmc->has_reserved_memory &&
>> +        machine->device_memory && machine->device_memory->base) {
>> +        addr = machine->device_memory->base;
>> +        size = memory_region_size(&machine->device_memory->mr);
>> +        goto out;
>> +    }
>> +
>> +    /* uninitialized memory region */
>> +    size = machine->maxram_size - machine->ram_size;
>> +
>> +    if (pcms->sgx_epc.size != 0) {
>> +        addr = sgx_epc_above_4g_end(&pcms->sgx_epc);
>> +    } else {
>> +        addr = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
>> +    }
>> +
>> +    if (pcmc->enforce_aligned_dimm) {
>> +        /* size device region assuming 1G page max alignment per slot */
>> +        size += (1 * GiB) * machine->ram_slots;
>> +    }
>> +
>> +out:
>> +    if (base)
>> +        *base = addr;
>> +    if (device_mem_size)
>> +        *device_mem_size = size;
>> +}
>> +
>>  void pc_memory_init(PCMachineState *pcms,
>>                      MemoryRegion *system_memory,
>>                      MemoryRegion *rom_memory,
>> @@ -864,7 +901,7 @@ void pc_memory_init(PCMachineState *pcms,
>>      /* initialize device memory address space */
>>      if (pcmc->has_reserved_memory &&
>>          (machine->ram_size < machine->maxram_size)) {
>> -        ram_addr_t device_mem_size = machine->maxram_size - machine->ram_size;
>> +        ram_addr_t device_mem_size;
>>
>>          if (machine->ram_slots > ACPI_MAX_RAM_SLOTS) {
>>              error_report("unsupported amount of memory slots: %"PRIu64,
>> @@ -879,20 +916,7 @@ void pc_memory_init(PCMachineState *pcms,
>>              exit(EXIT_FAILURE);
>>          }
>>
>> -        if (pcms->sgx_epc.size != 0) {
>> -            machine->device_memory->base = sgx_epc_above_4g_end(&pcms->sgx_epc);
>> -        } else {
>> -            machine->device_memory->base =
>> -                x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
>> -        }
>> -
>> -        machine->device_memory->base =
>> -            ROUND_UP(machine->device_memory->base, 1 * GiB);
>> -
>> -        if (pcmc->enforce_aligned_dimm) {
>> -            /* size device region assuming 1G page max alignment per slot */
>> -            device_mem_size += (1 * GiB) * machine->ram_slots;
>> -        }
>> +        pc_get_device_memory_range(pcms, &machine->device_memory->base, &device_mem_size);
>>
>>          if ((machine->device_memory->base + device_mem_size) <
>>              device_mem_size) {
>> @@ -965,12 +989,13 @@ uint64_t pc_pci_hole64_start(void)
>>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>      MachineState *ms = MACHINE(pcms);
>>      X86MachineState *x86ms = X86_MACHINE(pcms);
>> -    uint64_t hole64_start = 0;
>> +    uint64_t hole64_start = 0, size = 0;
>>
>> -    if (pcmc->has_reserved_memory && ms->device_memory->base) {
>> -        hole64_start = ms->device_memory->base;
>> +    if (pcmc->has_reserved_memory &&
>> +        (ms->ram_size < ms->maxram_size)) {
>> +        pc_get_device_memory_range(pcms, &hole64_start, &size);
>>          if (!pcmc->broken_reserved_end) {
>> -            hole64_start += memory_region_size(&ms->device_memory->mr);
>> +            hole64_start += size;
>>          }
>>      } else if (pcms->sgx_epc.size != 0) {
>>              hole64_start = sgx_epc_above_4g_end(&pcms->sgx_epc);
>>
>
diff mbox series

Patch

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index af52d4ff89ef..652ae8ff9ccf 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -796,6 +796,110 @@  void xen_load_linux(PCMachineState *pcms)
 #define PC_ROM_ALIGN       0x800
 #define PC_ROM_SIZE        (PC_ROM_MAX - PC_ROM_MIN_VGA)
 
+/*
+ * AMD systems with an IOMMU have an additional hole close to 1Tb:
+ * a range of special GPAs that cannot be DMA mapped. Depending on
+ * kernel version, VFIO may or may not let you DMA map those ranges.
+ * Starting with Linux v5.4, VFIO validates it, so guests with certain
+ * memory sizes can't be created on AMD machines. Using those IOVA
+ * ranges is also simply wrong, leading to IOMMU INVALID_DEVICE_REQUEST or worse.
+ * The ranges reserved for Hyper-Transport are:
+ *
+ * FD_0000_0000h - FF_FFFF_FFFFh
+ *
+ * The ranges represent the following:
+ *
+ * Base Address   Top Address  Use
+ *
+ * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
+ * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
+ * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
+ * FD_F910_0000h FD_F91F_FFFFh System Management
+ * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
+ * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
+ * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
+ * FD_FE00_0000h FD_FFFF_FFFFh Configuration
+ * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
+ * FE_2000_0000h FF_FFFF_FFFFh Reserved
+ *
+ * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
+ * Table 3: Special Address Controls (GPA) for more information.
+ */
+#define AMD_HT_START         0xfd00000000UL
+#define AMD_HT_END           0xffffffffffUL
+#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
+#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
+
+static hwaddr x86_max_phys_addr(PCMachineState *pcms,
+                                hwaddr above_4g_mem_start,
+                                uint64_t pci_hole64_size)
+{
+    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
+    X86MachineState *x86ms = X86_MACHINE(pcms);
+    MachineState *machine = MACHINE(pcms);
+    ram_addr_t device_mem_size = 0;
+    hwaddr base;
+
+    if (!x86ms->above_4g_mem_size) {
+       /*
+        * 32-bit pci hole goes from
+        * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
+        */
+        return IO_APIC_DEFAULT_ADDRESS - 1;
+    }
+
+    if (pcmc->has_reserved_memory &&
+       (machine->ram_size < machine->maxram_size)) {
+        device_mem_size = machine->maxram_size - machine->ram_size;
+    }
+
+    base = ROUND_UP(above_4g_mem_start + x86ms->above_4g_mem_size +
+                    pcms->sgx_epc.size, 1 * GiB);
+
+    return base + device_mem_size + pci_hole64_size;
+}
+
+static void x86_update_above_4g_mem_start(PCMachineState *pcms,
+                                          uint64_t pci_hole64_size)
+{
+    X86MachineState *x86ms = X86_MACHINE(pcms);
+    CPUX86State *env = &X86_CPU(first_cpu)->env;
+    hwaddr start = x86ms->above_4g_mem_start;
+    hwaddr maxphysaddr, maxusedaddr;
+
+    /*
+     * The HyperTransport range close to the 1T boundary is unique to AMD
+     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
+     * to above 1T to AMD vCPUs only.
+     */
+    if (!IS_AMD_CPU(env)) {
+        return;
+    }
+
+    /* Bail out if max possible address does not cross HT range */
+    if (x86_max_phys_addr(pcms, start, pci_hole64_size) < AMD_HT_START) {
+        return;
+    }
+
+    /*
+     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
+     * So make sure phys-bits is appropriately sized in order to proceed
+     * with the above-4g-region relocation and thus boot.
+     */
+    start = AMD_ABOVE_1TB_START;
+    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
+    maxusedaddr = x86_max_phys_addr(pcms, start, pci_hole64_size);
+    if (maxphysaddr < maxusedaddr) {
+        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
+                     " phys-bits too low (%u) cannot avoid AMD HT range",
+                     maxphysaddr, maxusedaddr, X86_CPU(first_cpu)->phys_bits);
+        exit(EXIT_FAILURE);
+    }
+
+
+    x86ms->above_4g_mem_start = start;
+}
+
 void pc_memory_init(PCMachineState *pcms,
                     MemoryRegion *system_memory,
                     MemoryRegion *rom_memory,
@@ -817,6 +921,8 @@  void pc_memory_init(PCMachineState *pcms,
 
     linux_boot = (machine->kernel_filename != NULL);
 
+    x86_update_above_4g_mem_start(pcms, pci_hole64_size);
+
     /*
      * Split single memory region and use aliases to address portions of it,
      * done for backwards compatibility with older qemus.
@@ -827,6 +933,11 @@  void pc_memory_init(PCMachineState *pcms,
                              0, x86ms->below_4g_mem_size);
     memory_region_add_subregion(system_memory, 0, ram_below_4g);
     e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
+
+    if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START) {
+        e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
+    }
+
     if (x86ms->above_4g_mem_size > 0) {
         ram_above_4g = g_malloc(sizeof(*ram_above_4g));
         memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",