
[v6,09/10] i386/pc: relocate 4g start to 1T where applicable

Message ID 20220701161014.3850-10-joao.m.martins@oracle.com (mailing list archive)
State New, archived
Series i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU

Commit Message

Joao Martins July 1, 2022, 4:10 p.m. UTC
It is assumed that the whole GPA space is available to be DMA
addressable, within a given address space limit, except for a
tiny region before 4G. Since Linux v5.4, VFIO validates whether
the selected GPA is indeed valid, i.e. not reserved by the IOMMU
on behalf of some specific devices or platform-defined
restrictions, and fails the ioctl(VFIO_DMA_MAP) with -EINVAL
otherwise.
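
For illustration only (not part of this patch): roughly the kind of map
request that gets rejected on such hosts. The container fd and the buffer
are assumed to have been set up by the usual VFIO container/group dance:

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Try to DMA map one page at an IOVA inside AMD's reserved HT hole. On
 * kernels with the v5.4+ check, the ioctl fails with errno == EINVAL. */
static int map_one_page_in_ht_hole(int container_fd, void *buf)
{
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (unsigned long)buf,
        .iova  = 0xfd00000000ULL,   /* first byte of the HT range */
        .size  = 4096,
    };

    if (ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map) < 0) {
        perror("VFIO_IOMMU_MAP_DMA");
        return -1;
    }
    return 0;
}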

AMD systems with an IOMMU are examples of such platforms and
particularly may only have these ranges as allowed:

	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb[*])

We already account for the 4G hole, but if the guest is big
enough we will fail to create guests with >1010G of RAM, due to
the ~12G hole at the 1Tb boundary that is reserved for
HyperTransport (HT).
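
Roughly, assuming the usual ~2G low-memory split below 4G, the numbers
work out as follows:

	HT hole start:  0xfd00000000                  = 1012G
	HT hole size:   0x10000000000 - 0xfd00000000  =  ~12G
	ram-above-4g:   [4G, 4G + (ram - lowmem))
	=> crosses the HT hole once ram >= ~1008G + lowmem, i.e. ~1010G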

[*] there is another reserved region, unrelated to HT, at the
256T boundary in Fam 17h according to Errata #1286, also
documented in "Open-Source Register Reference for AMD Family
17h Processors (PUB)"

When creating the region above 4G, take into account that on AMD
platforms the HyperTransport range is reserved and hence cannot
be used as GPAs either. In those cases, rather than establishing
the start of ram-above-4g at 4G, relocate it to 1Tb instead. See
the AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology", for
more information on the underlying restriction on IOVAs.

After accounting for the 1Tb hole on AMD hosts, mtree should
look like:

0000000000000000-000000007fffffff (prio 0, i/o):
	 alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff
0000010000000000-000001ff7fffffff (prio 0, i/o):
	alias ram-above-4g @pc.ram 0000000080000000-000000ffffffffff

If the relocation is done, or the address space already covers
it, we also add the reserved HT range to the e820 as reserved.

The default phys-bits on QEMU is TCG_PHYS_ADDR_BITS (40), which
is only enough to address up to 1Tb (0xff ffff ffff). On AMD
platforms, if the ram-above-4g relocation is needed but the CPU
wasn't configured with big enough phys-bits, print an error
message to the user and do not perform the relocation of the
above-4g region.
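
To make the phys-bits constraint concrete (a standalone illustration, not
part of the patch):

#include <assert.h>
#include <stdint.h>

int main(void)
{
    unsigned phys_bits = 40;                   /* TCG_PHYS_ADDR_BITS default */
    uint64_t maxphysaddr = ((uint64_t)1 << phys_bits) - 1;  /* 0xffffffffff */
    uint64_t above_1tb_start = 0x10000000000ULL;            /* 1Tb */

    /* A 40-bit CPU cannot even reach the relocated ram-above-4g start,
     * let alone the RAM placed after it, hence the error path. */
    assert(maxphysaddr < above_1tb_start);
    return 0;
}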

Suggested-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/pc.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 101 insertions(+)

Comments

Joao Martins July 7, 2022, 3:53 p.m. UTC | #1
On 7/1/22 17:10, Joao Martins wrote:
> +    /*
> +     * The HyperTransport range close to the 1T boundary is unique to AMD
> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
> +     * to above 1T to AMD vCPUs only.
> +     */
> +    if (IS_AMD_CPU(&cpu->env)) {
> +        pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
> +
> +        /*
> +         * Advertise the HT region if address space covers the reserved
> +         * region or if we relocate.
> +         */
> +        if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START ||
> +            cpu->phys_bits >= 40) {
> +            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
> +        }
> +    }
> +

[As part of the discussion with Alex on the previous version, there's this
other case where VMs with less than 1T of memory, but with enough GPUs (say
each taking 40G, to give an example), can have PCI devices placed within the
reserved HT region.]

Changing the fwcfg 'reserved-memory-end' to 1T (provided phys_bits is
correctly configured), without triggering the above-4g relocation, fixes the
case above, since 'reserved-memory-end' is ultimately what the virtual
firmware (SeaBIOS and OVMF) uses for the hole64 start. Though I am at odds
over whether to include this. Meaning, is this the VMM working around a fw
bug[*] even after the e820 is described accurately, or is it the right thing
to do in the VMM?

Part of the reason I haven't done this is that the issue doesn't happen if
the VMM user specifies a pci-hole64-size on q35/pc that's big enough to cover
all VFIO devices (which is ultimately the correct thing to do). Thoughts?

[*] as it should look at *all* reserved ranges including those above ram.
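
If going that route, the change would presumably sit next to where
pc_memory_init() already exposes 'etc/reserved-memory-end' to the firmware.
A rough sketch of the idea (the fw_cfg handle and device_mem_end stand in
for what that code already computes; the clamp is the only new bit and is
not part of this series):

    /*
     * Sketch: never let firmware start the 64-bit hole inside the HT hole,
     * even when RAM/device-memory itself ends well below 1T.
     */
    uint64_t *val = g_malloc(sizeof(*val));
    uint64_t res_mem_end = MAX(device_mem_end, AMD_ABOVE_1TB_START);

    *val = cpu_to_le64(ROUND_UP(res_mem_end, 1 * GiB));
    fw_cfg_add_file(fw_cfg, "etc/reserved-memory-end", val, sizeof(*val));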

>      /*
>       * Split single memory region and use aliases to address portions of it,
>       * done for backwards compatibility with older qemus.
> @@ -938,6 +1038,7 @@ void pc_memory_init(PCMachineState *pcms,
>                               0, x86ms->below_4g_mem_size);
>      memory_region_add_subregion(system_memory, 0, ram_below_4g);
>      e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
> +

Spurious new line here that I will fix on v7.

>      if (x86ms->above_4g_mem_size > 0) {
>          ram_above_4g = g_malloc(sizeof(*ram_above_4g));
>          memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",
Igor Mammedov July 11, 2022, 12:56 p.m. UTC | #2
On Fri,  1 Jul 2022 17:10:13 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> It is assumed that the whole GPA space is available to be DMA
> addressable, within a given address space limit, except for a
> tiny region before the 4G. Since Linux v5.4, VFIO validates
> whether the selected GPA is indeed valid i.e. not reserved by
> IOMMU on behalf of some specific devices or platform-defined
> restrictions, and thus failing the ioctl(VFIO_DMA_MAP) with
>  -EINVAL.
> 
> AMD systems with an IOMMU are examples of such platforms and
> particularly may only have these ranges as allowed:
> 
> 	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
> 	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
> 	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb[*])
> 
> We already account for the 4G hole, albeit if the guest is big
> enough we will fail to allocate a guest with  >1010G due to the
> ~12G hole at the 1Tb boundary, reserved for HyperTransport (HT).
> 
> [*] there is another reserved region unrelated to HT that exists
> in the 256T boundary in Fam 17h according to Errata #1286,
> documented also in "Open-Source Register Reference for AMD Family
> 17h Processors (PUB)"
> 
> When creating the region above 4G, take into account that on AMD
> platforms the HyperTransport range is reserved and hence it
> cannot be used either as GPAs. On those cases rather than
> establishing the start of ram-above-4g to be 4G, relocate instead
> to 1Tb. See AMD IOMMU spec, section 2.1.2 "IOMMU Logical
> Topology", for more information on the underlying restriction of
> IOVAs.
> 
> After accounting for the 1Tb hole on AMD hosts, mtree should
> look like:
> 
> 0000000000000000-000000007fffffff (prio 0, i/o):
> 	 alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff
> 0000010000000000-000001ff7fffffff (prio 0, i/o):
> 	alias ram-above-4g @pc.ram 0000000080000000-000000ffffffffff
> 
> If the relocation is done or the address space covers it, we
> also add the reserved HT e820 range as reserved.
> 
> Default phys-bits on Qemu is TCG_PHYS_ADDR_BITS (40) which is enough
> to address 1Tb (0xff ffff ffff). On AMD platforms, if a
> ram-above-4g relocation may be desired and the CPU wasn't configured
> with a big enough phys-bits, print an error message to the user
> and do not make the relocation of the above-4g-region if phys-bits
> is too low.
> 
> Suggested-by: Igor Mammedov <imammedo@redhat.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  hw/i386/pc.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 101 insertions(+)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index a79fa1b6beeb..07025b510540 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -907,6 +907,87 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
>      return start;
>  }
>  
> +static hwaddr pc_max_used_gpa(PCMachineState *pcms,
> +                                hwaddr above_4g_mem_start,
> +                                uint64_t pci_hole64_size)
> +{
> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> +

> +    if (!x86ms->above_4g_mem_size) {
> +        /*
> +         * 32-bit pci hole goes from
> +         * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
> +          */
> +        return IO_APIC_DEFAULT_ADDRESS - 1;
> +    }
this hunk still bothers me (nothing changed wrt v5 issues around it)
issues recap: (
 1. correctness of it
 2. being limited to AMD only, while it seems pretty generic to me
 3. should be a separate patch
)

> +
> +    return pc_pci_hole64_start() + pci_hole64_size;
> +}
> +
> +/*
> + * AMD systems with an IOMMU have an additional hole close to the
> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
> + * on kernel version, VFIO may or may not let you DMA map those ranges.
> + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
> + * with certain memory sizes. It's also wrong to use those IOVA ranges
> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
> + * The ranges reserved for Hyper-Transport are:
> + *
> + * FD_0000_0000h - FF_FFFF_FFFFh
> + *
> + * The ranges represent the following:
> + *
> + * Base Address   Top Address  Use
> + *
> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
> + * FD_F910_0000h FD_F91F_FFFFh System Management
> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
> + *
> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
> + * Table 3: Special Address Controls (GPA) for more information.
> + */
> +#define AMD_HT_START         0xfd00000000UL
> +#define AMD_HT_END           0xffffffffffUL
> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
> +#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
> +
> +static void pc_set_amd_above_4g_mem_start(PCMachineState *pcms,
> +                                          uint64_t pci_hole64_size)
> +{
> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> +    hwaddr start = x86ms->above_4g_mem_start;
> +    hwaddr maxphysaddr, maxusedaddr;
> +
> +    /* Bail out if max possible address does not cross HT range */
> +    if (pc_max_used_gpa(pcms, start, pci_hole64_size) < AMD_HT_START) {

move it to the caller?

> +        return;
> +    }
> +
> +    /*
> +     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
> +     * So make sure phys-bits is required to be appropriately sized in order
> +     * to proceed with the above-4g-region relocation and thus boot.
> +     */
> +    start = AMD_ABOVE_1TB_START;
> +    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
> +    maxusedaddr = pc_max_used_gpa(pcms, start, pci_hole64_size);
> +    if (maxphysaddr < maxusedaddr) {
> +        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
> +                     " phys-bits too low (%u) cannot avoid AMD HT range",
> +                     maxphysaddr, maxusedaddr, X86_CPU(first_cpu)->phys_bits);
> +        exit(EXIT_FAILURE);
> +    }
> +
> +    x86ms->above_4g_mem_start = start;
> +}
> +
>  void pc_memory_init(PCMachineState *pcms,
>                      MemoryRegion *system_memory,
>                      MemoryRegion *rom_memory,
> @@ -922,12 +1003,31 @@ void pc_memory_init(PCMachineState *pcms,
>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>      X86MachineState *x86ms = X86_MACHINE(pcms);
>      hwaddr cxl_base, cxl_resv_end = 0;
> +    X86CPU *cpu = X86_CPU(first_cpu);
>  
>      assert(machine->ram_size == x86ms->below_4g_mem_size +
>                                  x86ms->above_4g_mem_size);
>  
>      linux_boot = (machine->kernel_filename != NULL);
>  
> +    /*
> +     * The HyperTransport range close to the 1T boundary is unique to AMD
> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
> +     * to above 1T to AMD vCPUs only.
> +     */
> +    if (IS_AMD_CPU(&cpu->env)) {
> +        pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
> +
> +        /*
> +         * Advertise the HT region if address space covers the reserved
> +         * region or if we relocate.
> +         */
> +        if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START ||
> +            cpu->phys_bits >= 40) {
> +            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
> +        }
> +    }
> +
>      /*
>       * Split single memory region and use aliases to address portions of it,
>       * done for backwards compatibility with older qemus.
> @@ -938,6 +1038,7 @@ void pc_memory_init(PCMachineState *pcms,
>                               0, x86ms->below_4g_mem_size);
>      memory_region_add_subregion(system_memory, 0, ram_below_4g);
>      e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
> +

stray newline?

>      if (x86ms->above_4g_mem_size > 0) {
>          ram_above_4g = g_malloc(sizeof(*ram_above_4g));
>          memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",
Joao Martins July 11, 2022, 2:52 p.m. UTC | #3
On 7/11/22 13:56, Igor Mammedov wrote:
> On Fri,  1 Jul 2022 17:10:13 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index a79fa1b6beeb..07025b510540 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -907,6 +907,87 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
>>      return start;
>>  }
>>  
>> +static hwaddr pc_max_used_gpa(PCMachineState *pcms,
>> +                                hwaddr above_4g_mem_start,
>> +                                uint64_t pci_hole64_size)
>> +{
>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>> +
> 
>> +    if (!x86ms->above_4g_mem_size) {
>> +        /*
>> +         * 32-bit pci hole goes from
>> +         * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
>> +          */
>> +        return IO_APIC_DEFAULT_ADDRESS - 1;
>> +    }
> this hunk still bothers me (nothing changed wrt v5 issues around it)
> issues recap: (
>  1. correctness of it
>  2. being limited to AMD only, while it seems pretty generic to me
>  3. should be a separate patch
> )
> 
How about I instead delete this hunk, and only call pc_set_amd_above_4g_mem_start()
when there's @above_4g_mem_size? I.e. in pc_memory_init() I would instead do:

if (IS_AMD_CPU(&cpu->env) && x86ms->above_4g_mem_size) {
    hwaddr start = x86ms->above_4g_mem_start;

    if (pc_max_used_gpa(pcms, start, pci_hole64_size) >= AMD_HT_START) {
        pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
    }
    ...
}

Given that otherwise it is impossible to ever encounter the 1T boundary.

If not ... what other alternative would address your concern?

>> +
>> +    return pc_pci_hole64_start() + pci_hole64_size;
>> +}
>> +
>> +/*
>> + * AMD systems with an IOMMU have an additional hole close to the
>> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
>> + * on kernel version, VFIO may or may not let you DMA map those ranges.
>> + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
>> + * with certain memory sizes. It's also wrong to use those IOVA ranges
>> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
>> + * The ranges reserved for Hyper-Transport are:
>> + *
>> + * FD_0000_0000h - FF_FFFF_FFFFh
>> + *
>> + * The ranges represent the following:
>> + *
>> + * Base Address   Top Address  Use
>> + *
>> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
>> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
>> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
>> + * FD_F910_0000h FD_F91F_FFFFh System Management
>> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
>> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
>> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
>> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
>> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
>> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
>> + *
>> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
>> + * Table 3: Special Address Controls (GPA) for more information.
>> + */
>> +#define AMD_HT_START         0xfd00000000UL
>> +#define AMD_HT_END           0xffffffffffUL
>> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
>> +#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
>> +
>> +static void pc_set_amd_above_4g_mem_start(PCMachineState *pcms,
>> +                                          uint64_t pci_hole64_size)
>> +{
>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>> +    hwaddr start = x86ms->above_4g_mem_start;
>> +    hwaddr maxphysaddr, maxusedaddr;
>> +
>> +    /* Bail out if max possible address does not cross HT range */
>> +    if (pc_max_used_gpa(pcms, start, pci_hole64_size) < AMD_HT_START) {
> 
> move it to the caller?
> 
Will do. I have replaced it with this instead /in the caller/:

    if (pc_max_used_gpa(pcms, start, pci_hole64_size) >= AMD_HT_START) {
        pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
    }

>> +        return;
>> +    }
>> +
>> +    /*
>> +     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
>> +     * So make sure phys-bits is required to be appropriately sized in order
>> +     * to proceed with the above-4g-region relocation and thus boot.
>> +     */
>> +    start = AMD_ABOVE_1TB_START;
>> +    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
>> +    maxusedaddr = pc_max_used_gpa(pcms, start, pci_hole64_size);
>> +    if (maxphysaddr < maxusedaddr) {
>> +        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
>> +                     " phys-bits too low (%u) cannot avoid AMD HT range",
>> +                     maxphysaddr, maxusedaddr, X86_CPU(first_cpu)->phys_bits);
>> +        exit(EXIT_FAILURE);
>> +    }
>> +
>> +    x86ms->above_4g_mem_start = start;
>> +}
>> +
>>  void pc_memory_init(PCMachineState *pcms,
>>                      MemoryRegion *system_memory,
>>                      MemoryRegion *rom_memory,
>> @@ -922,12 +1003,31 @@ void pc_memory_init(PCMachineState *pcms,
>>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>      X86MachineState *x86ms = X86_MACHINE(pcms);
>>      hwaddr cxl_base, cxl_resv_end = 0;
>> +    X86CPU *cpu = X86_CPU(first_cpu);
>>  
>>      assert(machine->ram_size == x86ms->below_4g_mem_size +
>>                                  x86ms->above_4g_mem_size);
>>  
>>      linux_boot = (machine->kernel_filename != NULL);
>>  
>> +    /*
>> +     * The HyperTransport range close to the 1T boundary is unique to AMD
>> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
>> +     * to above 1T to AMD vCPUs only.
>> +     */
>> +    if (IS_AMD_CPU(&cpu->env)) {
>> +        pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
>> +
>> +        /*
>> +         * Advertise the HT region if address space covers the reserved
>> +         * region or if we relocate.
>> +         */
>> +        if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START ||
>> +            cpu->phys_bits >= 40) {
>> +            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
>> +        }
>> +    }
>> +
>>      /*
>>       * Split single memory region and use aliases to address portions of it,
>>       * done for backwards compatibility with older qemus.
>> @@ -938,6 +1038,7 @@ void pc_memory_init(PCMachineState *pcms,
>>                               0, x86ms->below_4g_mem_size);
>>      memory_region_add_subregion(system_memory, 0, ram_below_4g);
>>      e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
>> +
> 
> stray newline?
> 
Yeap. I've already removed it, as per my earlier email on this patch.
Joao Martins July 11, 2022, 3:31 p.m. UTC | #4
On 7/11/22 15:52, Joao Martins wrote:
> On 7/11/22 13:56, Igor Mammedov wrote:
>> On Fri,  1 Jul 2022 17:10:13 +0100
>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>>> index a79fa1b6beeb..07025b510540 100644
>>> --- a/hw/i386/pc.c
>>> +++ b/hw/i386/pc.c
>>> @@ -907,6 +907,87 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
>>>      return start;
>>>  }
>>>  
>>> +static hwaddr pc_max_used_gpa(PCMachineState *pcms,
>>> +                                hwaddr above_4g_mem_start,
>>> +                                uint64_t pci_hole64_size)
>>> +{
>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>> +
>>
>>> +    if (!x86ms->above_4g_mem_size) {
>>> +        /*
>>> +         * 32-bit pci hole goes from
>>> +         * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
>>> +          */
>>> +        return IO_APIC_DEFAULT_ADDRESS - 1;
>>> +    }
>> this hunk still bothers me (nothing changed wrt v5 issues around it)
>> issues recap: (
>>  1. correctness of it
>>  2. being limited to AMD only, while it seems pretty generic to me
>>  3. should be a separate patch
>> )
>>
> How about I instead delete this hunk, and only call pc_set_amd_above_4g_mem_start()
> when there's @above_4g_mem_size ? Like in pc_memory_init() I would instead:
> 
> if (IS_AMD_CPU(&cpu->env) && x86ms->above_4g_mem_size) {
>     hwaddr start = x86ms->above_4g_mem_start;
> 
>     if (pc_max_used_gpa(pcms, start, pci_hole64_size) >= AMD_HT_START) {
>         pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
>     }
>     ...
> }
> 
> Given that otherwise it is impossible to ever encounter the 1T boundary.
> 

And while at it I would also remove any unneeded arguments from pc_max_used_gpa(),
which would turn the function into this:

+static hwaddr pc_max_used_gpa(uint64_t pci_hole64_size)
+{
+    return pc_pci_hole64_start() + pci_hole64_size;
+}

I would nuke the added helper if it wasn't for having 2 call sites in this patch.

> If not ... what other alternative would address your concern?
> 
>>> +
>>> +    return pc_pci_hole64_start() + pci_hole64_size;
>>> +}
>>> +
>>> +/*
>>> + * AMD systems with an IOMMU have an additional hole close to the
>>> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
>>> + * on kernel version, VFIO may or may not let you DMA map those ranges.
>>> + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
>>> + * with certain memory sizes. It's also wrong to use those IOVA ranges
>>> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
>>> + * The ranges reserved for Hyper-Transport are:
>>> + *
>>> + * FD_0000_0000h - FF_FFFF_FFFFh
>>> + *
>>> + * The ranges represent the following:
>>> + *
>>> + * Base Address   Top Address  Use
>>> + *
>>> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
>>> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
>>> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
>>> + * FD_F910_0000h FD_F91F_FFFFh System Management
>>> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
>>> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
>>> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
>>> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
>>> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
>>> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
>>> + *
>>> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
>>> + * Table 3: Special Address Controls (GPA) for more information.
>>> + */
>>> +#define AMD_HT_START         0xfd00000000UL
>>> +#define AMD_HT_END           0xffffffffffUL
>>> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
>>> +#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
>>> +
>>> +static void pc_set_amd_above_4g_mem_start(PCMachineState *pcms,
>>> +                                          uint64_t pci_hole64_size)
>>> +{
>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>> +    hwaddr start = x86ms->above_4g_mem_start;
>>> +    hwaddr maxphysaddr, maxusedaddr;
>>> +
>>> +    /* Bail out if max possible address does not cross HT range */
>>> +    if (pc_max_used_gpa(pcms, start, pci_hole64_size) < AMD_HT_START) {
>>
>> move it to the caller?
>>
> Will do. I have replaced with this instead /in the caller/:
> 
>     if (pc_max_used_gpa(pcms, start, pci_hole64_size) >= AMD_HT_START) {
>         pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
>     }
> 
>>> +        return;
>>> +    }
>>> +
>>> +    /*
>>> +     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
>>> +     * So make sure phys-bits is required to be appropriately sized in order
>>> +     * to proceed with the above-4g-region relocation and thus boot.
>>> +     */
>>> +    start = AMD_ABOVE_1TB_START;
>>> +    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
>>> +    maxusedaddr = pc_max_used_gpa(pcms, start, pci_hole64_size);
>>> +    if (maxphysaddr < maxusedaddr) {
>>> +        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
>>> +                     " phys-bits too low (%u) cannot avoid AMD HT range",
>>> +                     maxphysaddr, maxusedaddr, X86_CPU(first_cpu)->phys_bits);
>>> +        exit(EXIT_FAILURE);
>>> +    }
>>> +
>>> +    x86ms->above_4g_mem_start = start;
>>> +}
>>> +
>>>  void pc_memory_init(PCMachineState *pcms,
>>>                      MemoryRegion *system_memory,
>>>                      MemoryRegion *rom_memory,
>>> @@ -922,12 +1003,31 @@ void pc_memory_init(PCMachineState *pcms,
>>>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>>      X86MachineState *x86ms = X86_MACHINE(pcms);
>>>      hwaddr cxl_base, cxl_resv_end = 0;
>>> +    X86CPU *cpu = X86_CPU(first_cpu);
>>>  
>>>      assert(machine->ram_size == x86ms->below_4g_mem_size +
>>>                                  x86ms->above_4g_mem_size);
>>>  
>>>      linux_boot = (machine->kernel_filename != NULL);
>>>  
>>> +    /*
>>> +     * The HyperTransport range close to the 1T boundary is unique to AMD
>>> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
>>> +     * to above 1T to AMD vCPUs only.
>>> +     */
>>> +    if (IS_AMD_CPU(&cpu->env)) {
>>> +        pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
>>> +
>>> +        /*
>>> +         * Advertise the HT region if address space covers the reserved
>>> +         * region or if we relocate.
>>> +         */
>>> +        if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START ||
>>> +            cpu->phys_bits >= 40) {
>>> +            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
>>> +        }
>>> +    }
>>> +
>>>      /*
>>>       * Split single memory region and use aliases to address portions of it,
>>>       * done for backwards compatibility with older qemus.
>>> @@ -938,6 +1038,7 @@ void pc_memory_init(PCMachineState *pcms,
>>>                               0, x86ms->below_4g_mem_size);
>>>      memory_region_add_subregion(system_memory, 0, ram_below_4g);
>>>      e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
>>> +
>>
>> stray newline?
>>
> Yeap. I've already removed as per my earlier email to this patch.
Joao Martins July 11, 2022, 8:03 p.m. UTC | #5
On 7/11/22 16:31, Joao Martins wrote:
> On 7/11/22 15:52, Joao Martins wrote:
>> On 7/11/22 13:56, Igor Mammedov wrote:
>>> On Fri,  1 Jul 2022 17:10:13 +0100
>>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>>
>>>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>>>> index a79fa1b6beeb..07025b510540 100644
>>>> --- a/hw/i386/pc.c
>>>> +++ b/hw/i386/pc.c
>>>> @@ -907,6 +907,87 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
>>>>      return start;
>>>>  }
>>>>  
>>>> +static hwaddr pc_max_used_gpa(PCMachineState *pcms,
>>>> +                                hwaddr above_4g_mem_start,
>>>> +                                uint64_t pci_hole64_size)
>>>> +{
>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>>> +
>>>
>>>> +    if (!x86ms->above_4g_mem_size) {
>>>> +        /*
>>>> +         * 32-bit pci hole goes from
>>>> +         * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
>>>> +          */
>>>> +        return IO_APIC_DEFAULT_ADDRESS - 1;
>>>> +    }
>>> this hunk still bothers me (nothing changed wrt v5 issues around it)
>>> issues recap: (
>>>  1. correctness of it
>>>  2. being limited to AMD only, while it seems pretty generic to me
>>>  3. should be a separate patch
>>> )
>>>
>> How about I instead delete this hunk, and only call pc_set_amd_above_4g_mem_start()
>> when there's @above_4g_mem_size ? Like in pc_memory_init() I would instead:
>>
>> if (IS_AMD_CPU(&cpu->env) && x86ms->above_4g_mem_size) {
>>     hwaddr start = x86ms->above_4g_mem_start;
>>
>>     if (pc_max_used_gpa(pcms, start, pci_hole64_size) >= AMD_HT_START) {
>>         pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
>>     }
>>     ...
>> }
>>
>> Given that otherwise it is impossible to ever encounter the 1T boundary.
>>
> 
> And while at it I would also remove any unneeded arguments from pc_max_used_gpa(),
> which would turn the function into this:
> 
> +static hwaddr pc_max_used_gpa(uint64_t pci_hole64_size)
> +{
> +    return pc_pci_hole64_start() + pci_hole64_size;
> +}
> 
> I would nuke the added helper if it wasn't for having 2 call sites in this patch.
> 

Full patch diff further below -- after removing pc_max_used_gpa() and making
further cleanups, given this code can be much simpler with this approach.

>> If not ... what other alternative would address your concern?
>>

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index e178bbc4129c..1ded3faeffe0 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -882,6 +882,62 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
     return start;
 }

+/*
+ * AMD systems with an IOMMU have an additional hole close to the
+ * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
+ * on kernel version, VFIO may or may not let you DMA map those ranges.
+ * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
+ * with certain memory sizes. It's also wrong to use those IOVA ranges
+ * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
+ * The ranges reserved for Hyper-Transport are:
+ *
+ * FD_0000_0000h - FF_FFFF_FFFFh
+ *
+ * The ranges represent the following:
+ *
+ * Base Address   Top Address  Use
+ *
+ * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
+ * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
+ * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
+ * FD_F910_0000h FD_F91F_FFFFh System Management
+ * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
+ * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
+ * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
+ * FD_FE00_0000h FD_FFFF_FFFFh Configuration
+ * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
+ * FE_2000_0000h FF_FFFF_FFFFh Reserved
+ *
+ * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
+ * Table 3: Special Address Controls (GPA) for more information.
+ */
+#define AMD_HT_START         0xfd00000000UL
+#define AMD_HT_END           0xffffffffffUL
+#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
+#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
+
+static void pc_set_amd_above_4g_mem_start(PCMachineState *pcms,
+                                          hwaddr maxusedaddr)
+{
+    X86MachineState *x86ms = X86_MACHINE(pcms);
+    hwaddr maxphysaddr;
+
+    /*
+     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
+     * So make sure phys-bits is required to be appropriately sized in order
+     * to proceed with the above-4g-region relocation and thus boot.
+     */
+    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
+    if (maxphysaddr < maxusedaddr) {
+        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
+                     " phys-bits too low (%u) cannot avoid AMD HT range",
+                     maxphysaddr, maxusedaddr, X86_CPU(first_cpu)->phys_bits);
+        exit(EXIT_FAILURE);
+    }
+
+    x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
+}
+
 void pc_memory_init(PCMachineState *pcms,
                     MemoryRegion *system_memory,
                     MemoryRegion *rom_memory,
@@ -897,6 +953,7 @@ void pc_memory_init(PCMachineState *pcms,
     PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
     X86MachineState *x86ms = X86_MACHINE(pcms);
     hwaddr cxl_base, cxl_resv_end = 0;
+    X86CPU *cpu = X86_CPU(first_cpu);

     assert(machine->ram_size == x86ms->below_4g_mem_size +
                                 x86ms->above_4g_mem_size);
@@ -904,6 +961,29 @@ void pc_memory_init(PCMachineState *pcms,
     linux_boot = (machine->kernel_filename != NULL);

     /*
+     * The HyperTransport range close to the 1T boundary is unique to AMD
+     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
+     * to above 1T to AMD vCPUs only.
+     */
+    if (IS_AMD_CPU(&cpu->env) && x86ms->above_4g_mem_size) {
+        hwaddr maxusedaddr = pc_pci_hole64_start() + pci_hole64_size;
+
+        /* Bail out if max possible address does not cross HT range */
+        if (maxusedaddr >= AMD_HT_START) {
+            pc_set_amd_above_4g_mem_start(pcms, maxusedaddr);
+        }
+
+        /*
+         * Advertise the HT region if address space covers the reserved
+         * region or if we relocate.
+         */
+        if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START ||
+            cpu->phys_bits >= 40) {
+            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
+        }
+    }
+
+    /*
      * Split single memory region and use aliases to address portions of it,
      * done for backwards compatibility with older qemus.
      */
Igor Mammedov July 12, 2022, 9:06 a.m. UTC | #6
On Mon, 11 Jul 2022 21:03:28 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 7/11/22 16:31, Joao Martins wrote:
> > On 7/11/22 15:52, Joao Martins wrote:  
> >> On 7/11/22 13:56, Igor Mammedov wrote:  
> >>> On Fri,  1 Jul 2022 17:10:13 +0100
> >>> Joao Martins <joao.m.martins@oracle.com> wrote:
> >>>  
> >>>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> >>>> index a79fa1b6beeb..07025b510540 100644
> >>>> --- a/hw/i386/pc.c
> >>>> +++ b/hw/i386/pc.c
> >>>> @@ -907,6 +907,87 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
> >>>>      return start;
> >>>>  }
> >>>>  
> >>>> +static hwaddr pc_max_used_gpa(PCMachineState *pcms,
> >>>> +                                hwaddr above_4g_mem_start,
> >>>> +                                uint64_t pci_hole64_size)
> >>>> +{
> >>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> >>>> +  
> >>>  
> >>>> +    if (!x86ms->above_4g_mem_size) {
> >>>> +        /*
> >>>> +         * 32-bit pci hole goes from
> >>>> +         * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
> >>>> +          */
> >>>> +        return IO_APIC_DEFAULT_ADDRESS - 1;
> >>>> +    }  
> >>> this hunk still bothers me (nothing changed wrt v5 issues around it)
> >>> issues recap: (
> >>>  1. correctness of it
> >>>  2. being limited to AMD only, while it seems pretty generic to me
> >>>  3. should be a separate patch
> >>> )
> >>>  
> >> How about I instead delete this hunk, and only call pc_set_amd_above_4g_mem_start()
> >> when there's @above_4g_mem_size ? Like in pc_memory_init() I would instead:
> >>
> >> if (IS_AMD_CPU(&cpu->env) && x86ms->above_4g_mem_size) {
> >>     hwaddr start = x86ms->above_4g_mem_start;
> >>
> >>     if (pc_max_used_gpa(pcms, start, pci_hole64_size) >= AMD_HT_START) {
> >>         pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
> >>     }
> >>     ...
> >> }
> >>
> >> Given that otherwise it is impossible to ever encounter the 1T boundary.
> >>  
> > 
> > And while at it I would also remove any unneeded arguments from pc_max_used_gpa(),
> > which would turn the function into this:
> > 
> > +static hwaddr pc_max_used_gpa(uint64_t pci_hole64_size)
> > +{
> > +    return pc_pci_hole64_start() + pci_hole64_size;
> > +}
> > 
> > I would nuke the added helper if it wasn't for having 2 call sites in this patch.
> >   
> 
> Full patch diff further below -- after removing pc_max_used_gpa() and made further
> cleanups given this code can be much simpler after using this approach.
> 
> >> If not ... what other alternative would address your concern?
> >>  
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index e178bbc4129c..1ded3faeffe0 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -882,6 +882,62 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
>      return start;
>  }
> 
> +/*
> + * AMD systems with an IOMMU have an additional hole close to the
> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
> + * on kernel version, VFIO may or may not let you DMA map those ranges.
> + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
> + * with certain memory sizes. It's also wrong to use those IOVA ranges
> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
> + * The ranges reserved for Hyper-Transport are:
> + *
> + * FD_0000_0000h - FF_FFFF_FFFFh
> + *
> + * The ranges represent the following:
> + *
> + * Base Address   Top Address  Use
> + *
> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
> + * FD_F910_0000h FD_F91F_FFFFh System Management
> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
> + *
> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
> + * Table 3: Special Address Controls (GPA) for more information.
> + */
> +#define AMD_HT_START         0xfd00000000UL
> +#define AMD_HT_END           0xffffffffffUL
> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
> +#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
> +
> +static void pc_set_amd_above_4g_mem_start(PCMachineState *pcms,
> +                                          hwaddr maxusedaddr)
> +{
> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> +    hwaddr maxphysaddr;
> +
> +    /*
> +     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
> +     * So make sure phys-bits is required to be appropriately sized in order
> +     * to proceed with the above-4g-region relocation and thus boot.
> +     */
> +    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
> +    if (maxphysaddr < maxusedaddr) {
> +        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
> +                     " phys-bits too low (%u) cannot avoid AMD HT range",
> +                     maxphysaddr, maxusedaddr, X86_CPU(first_cpu)->phys_bits);
> +        exit(EXIT_FAILURE);
> +    }
> +
> +    x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
> +}
> +
>  void pc_memory_init(PCMachineState *pcms,
>                      MemoryRegion *system_memory,
>                      MemoryRegion *rom_memory,
> @@ -897,6 +953,7 @@ void pc_memory_init(PCMachineState *pcms,
>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>      X86MachineState *x86ms = X86_MACHINE(pcms);
>      hwaddr cxl_base, cxl_resv_end = 0;
> +    X86CPU *cpu = X86_CPU(first_cpu);
> 
>      assert(machine->ram_size == x86ms->below_4g_mem_size +
>                                  x86ms->above_4g_mem_size);
> @@ -904,6 +961,29 @@ void pc_memory_init(PCMachineState *pcms,
>      linux_boot = (machine->kernel_filename != NULL);
> 
>      /*
> +     * The HyperTransport range close to the 1T boundary is unique to AMD
> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
> +     * to above 1T to AMD vCPUs only.
> +     */
> +    if (IS_AMD_CPU(&cpu->env) && x86ms->above_4g_mem_size) {

it has the same issue as pc_max_used_gpa(), i.e.
  x86ms->above_4g_mem_size != 0
doesn't mean that there isn't any memory above 4Gb, nor that there isn't
any MMIO (sgx/cxl/pci64hole); that was the reason we were considering
max_used_gpa

I'd prefer to keep the pc_max_used_gpa() idea, but make it work for the above
cases and be more generic (i.e. not tied to AMD only), since
'pc_max_used_gpa() < physbits' applies equally to AMD and Intel (and to trip
it, one just has to configure a small enough physbits or large enough
hotpluggable RAM/CXL/PCI64HOLE)
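
Something along these lines (just a sketch, reusing names this patch already
introduces, and placed in pc_memory_init() before any AMD-specific handling)
would cover both vendors:

    hwaddr maxusedaddr = pc_max_used_gpa(pcms, pci_hole64_size);
    hwaddr maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;

    if (maxphysaddr < maxusedaddr) {
        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
                     " phys-bits too low (%u)",
                     maxphysaddr, maxusedaddr, X86_CPU(first_cpu)->phys_bits);
        exit(EXIT_FAILURE);
    }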



> +        hwaddr maxusedaddr = pc_pci_hole64_start() + pci_hole64_size;
> +
> +        /* Bail out if max possible address does not cross HT range */
> +        if (maxusedaddr >= AMD_HT_START) {
> +            pc_set_amd_above_4g_mem_start(pcms, maxusedaddr);
> +        }
> +
> +        /*
> +         * Advertise the HT region if address space covers the reserved
> +         * region or if we relocate.
> +         */
> +        if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START ||
> +            cpu->phys_bits >= 40) {
> +            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
> +        }
> +    }
> +
> +    /*
>       * Split single memory region and use aliases to address portions of it,
>       * done for backwards compatibility with older qemus.
>       */
>
Joao Martins July 12, 2022, 10:01 a.m. UTC | #7
On 7/12/22 10:06, Igor Mammedov wrote:
> On Mon, 11 Jul 2022 21:03:28 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> On 7/11/22 16:31, Joao Martins wrote:
>>> On 7/11/22 15:52, Joao Martins wrote:  
>>>> On 7/11/22 13:56, Igor Mammedov wrote:  
>>>>> On Fri,  1 Jul 2022 17:10:13 +0100
>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>>  
>>>>>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>>>>>> index a79fa1b6beeb..07025b510540 100644
>>>>>> --- a/hw/i386/pc.c
>>>>>> +++ b/hw/i386/pc.c
>>>>>> @@ -907,6 +907,87 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
>>>>>>      return start;
>>>>>>  }
>>>>>>  
>>>>>> +static hwaddr pc_max_used_gpa(PCMachineState *pcms,
>>>>>> +                                hwaddr above_4g_mem_start,
>>>>>> +                                uint64_t pci_hole64_size)
>>>>>> +{
>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>>> +  
>>>>>  
>>>>>> +    if (!x86ms->above_4g_mem_size) {
>>>>>> +        /*
>>>>>> +         * 32-bit pci hole goes from
>>>>>> +         * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
>>>>>> +          */
>>>>>> +        return IO_APIC_DEFAULT_ADDRESS - 1;
>>>>>> +    }  
>>>>> this hunk still bothers me (nothing changed wrt v5 issues around it)
>>>>> issues recap: (
>>>>>  1. correctness of it
>>>>>  2. being limited to AMD only, while it seems pretty generic to me
>>>>>  3. should be a separate patch
>>>>> )
>>>>>  
>>>> How about I instead delete this hunk, and only call pc_set_amd_above_4g_mem_start()
>>>> when there's @above_4g_mem_size ? Like in pc_memory_init() I would instead:
>>>>
>>>> if (IS_AMD_CPU(&cpu->env) && x86ms->above_4g_mem_size) {
>>>>     hwaddr start = x86ms->above_4g_mem_start;
>>>>
>>>>     if (pc_max_used_gpa(pcms, start, pci_hole64_size) >= AMD_HT_START) {
>>>>         pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
>>>>     }
>>>>     ...
>>>> }
>>>>
>>>> Given that otherwise it is impossible to ever encounter the 1T boundary.
>>>>  
>>>
>>> And while at it I would also remove any unneeded arguments from pc_max_used_gpa(),
>>> which would turn the function into this:
>>>
>>> +static hwaddr pc_max_used_gpa(uint64_t pci_hole64_size)
>>> +{
>>> +    return pc_pci_hole64_start() + pci_hole64_size;
>>> +}
>>>
>>> I would nuke the added helper if it wasn't for having 2 call sites in this patch.
>>>   
>>
>> Full patch diff further below -- after removing pc_max_used_gpa() and made further
>> cleanups given this code can be much simpler after using this approach.
>>
>>>> If not ... what other alternative would address your concern?
>>>>  
>>
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index e178bbc4129c..1ded3faeffe0 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -882,6 +882,62 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
>>      return start;
>>  }
>>
>> +/*
>> + * AMD systems with an IOMMU have an additional hole close to the
>> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
>> + * on kernel version, VFIO may or may not let you DMA map those ranges.
>> + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
>> + * with certain memory sizes. It's also wrong to use those IOVA ranges
>> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
>> + * The ranges reserved for Hyper-Transport are:
>> + *
>> + * FD_0000_0000h - FF_FFFF_FFFFh
>> + *
>> + * The ranges represent the following:
>> + *
>> + * Base Address   Top Address  Use
>> + *
>> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
>> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
>> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
>> + * FD_F910_0000h FD_F91F_FFFFh System Management
>> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
>> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
>> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
>> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
>> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
>> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
>> + *
>> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
>> + * Table 3: Special Address Controls (GPA) for more information.
>> + */
>> +#define AMD_HT_START         0xfd00000000UL
>> +#define AMD_HT_END           0xffffffffffUL
>> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
>> +#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
>> +
>> +static void pc_set_amd_above_4g_mem_start(PCMachineState *pcms,
>> +                                          hwaddr maxusedaddr)
>> +{
>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>> +    hwaddr maxphysaddr;
>> +
>> +    /*
>> +     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
>> +     * So make sure phys-bits is required to be appropriately sized in order
>> +     * to proceed with the above-4g-region relocation and thus boot.
>> +     */
>> +    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
>> +    if (maxphysaddr < maxusedaddr) {
>> +        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
>> +                     " phys-bits too low (%u) cannot avoid AMD HT range",
>> +                     maxphysaddr, maxusedaddr, X86_CPU(first_cpu)->phys_bits);
>> +        exit(EXIT_FAILURE);
>> +    }
>> +
>> +    x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
>> +}
>> +
>>  void pc_memory_init(PCMachineState *pcms,
>>                      MemoryRegion *system_memory,
>>                      MemoryRegion *rom_memory,
>> @@ -897,6 +953,7 @@ void pc_memory_init(PCMachineState *pcms,
>>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>      X86MachineState *x86ms = X86_MACHINE(pcms);
>>      hwaddr cxl_base, cxl_resv_end = 0;
>> +    X86CPU *cpu = X86_CPU(first_cpu);
>>
>>      assert(machine->ram_size == x86ms->below_4g_mem_size +
>>                                  x86ms->above_4g_mem_size);
>> @@ -904,6 +961,29 @@ void pc_memory_init(PCMachineState *pcms,
>>      linux_boot = (machine->kernel_filename != NULL);
>>
>>      /*
>> +     * The HyperTransport range close to the 1T boundary is unique to AMD
>> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
>> +     * to above 1T to AMD vCPUs only.
>> +     */
>> +    if (IS_AMD_CPU(&cpu->env) && x86ms->above_4g_mem_size) {
> 
> it has the same issue as pc_max_used_gpa(), i.e.
>   x86ms->above_4g_mem_size != 0
> doesn't mean that there isn't any memory above 4Gb, nor that there isn't
> any MMIO (sgx/cxl/pci64hole); that was the reason we were considering
> max_used_gpa
> 
Argh yes, you are right. I see it now.

> I'd prefer to keep the pc_max_used_gpa() idea, but make it work for the above
> cases and be more generic (i.e. not tied to AMD only), since
> 'pc_max_used_gpa() < physbits'

Are you also indirectly suggesting here that the check inside
pc_set_amd_above_4g_mem_start() should be moved into pc_memory_init(),
given that it's orthogonal to this issue? ISTR that you suggested this
at some point. If so, then there's probably very little reason to keep
pc_set_amd_above_4g_mem_start() around.

> applies equally
> to AMD and Intel (and to trip it, one just has to configure a small enough
> physbits or large enough hotpluggable RAM/CXL/PCI64HOLE)
> 
I can reproduce the issue you're thinking of with basic memory hotplug. Let me
see what I can come up with in pc_max_used_gpa() to cover this one. I'll
respond here with a proposal.

I would really love to have v7.1.0 with this issue fixed but I am not very
confident it is going to make it :(

Meanwhile, let me know if you have thoughts on this one:

https://lore.kernel.org/qemu-devel/1b2fa957-74f6-b5a9-3fc1-65c5d68300ce@oracle.com/

I am going to assume that, if there are no comments on the above, I'll keep
things as is.

And also, whether I can retain your ack with Bernhard's suggestion here:

https://lore.kernel.org/qemu-devel/0eefb382-4ac6-4335-ca61-035babb95a88@oracle.com/

>> +        hwaddr maxusedaddr = pc_pci_hole64_start() + pci_hole64_size;
>> +
>> +        /* Bail out if max possible address does not cross HT range */
>> +        if (maxusedaddr >= AMD_HT_START) {
>> +            pc_set_amd_above_4g_mem_start(pcms, maxusedaddr);
>> +        }
>> +
>> +        /*
>> +         * Advertise the HT region if address space covers the reserved
>> +         * region or if we relocate.
>> +         */
>> +        if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START ||
>> +            cpu->phys_bits >= 40) {
>> +            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
>> +        }
>> +    }
>> +
>> +    /*
>>       * Split single memory region and use aliases to address portions of it,
>>       * done for backwards compatibility with older qemus.
>>       */
>>
>
Joao Martins July 12, 2022, 10:21 a.m. UTC | #8
On 7/12/22 11:01, Joao Martins wrote:
> On 7/12/22 10:06, Igor Mammedov wrote:
>> On Mon, 11 Jul 2022 21:03:28 +0100
>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>> On 7/11/22 16:31, Joao Martins wrote:
>>>> On 7/11/22 15:52, Joao Martins wrote:  
>>>>> On 7/11/22 13:56, Igor Mammedov wrote:  
>>>>>> On Fri,  1 Jul 2022 17:10:13 +0100
>>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>> @@ -904,6 +961,29 @@ void pc_memory_init(PCMachineState *pcms,
>>>      linux_boot = (machine->kernel_filename != NULL);
>>>
>>>      /*
>>> +     * The HyperTransport range close to the 1T boundary is unique to AMD
>>> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
>>> +     * to above 1T to AMD vCPUs only.
>>> +     */
>>> +    if (IS_AMD_CPU(&cpu->env) && x86ms->above_4g_mem_size) {
>>
>> it has the same issue as pc_max_used_gpa(), i.e.
>>   x86ms->above_4g_mem_size != 0
>> doesn't mean that there isn't any memory above 4Gb, nor that there isn't
>> any MMIO (sgx/cxl/pci64hole); that was the reason we were considering
>> max_used_gpa
>>
> Argh yes, you are right. I see it now.
> 
>> I'd prefer to keep the pc_max_used_gpa() idea, but make it work for the above
>> cases and be more generic (i.e. not tied to AMD only), since
>> 'pc_max_used_gpa() < physbits'
> 
> Are you also indirectly suggesting here that the check inside
> pc_set_amd_above_4g_mem_start() should be moved into pc_memory_init()
> given that it's orthogonal to this issue. ISTR that you suggested this
> at some point. If so, then there's probably very little reason to keep
> pc_set_amd_above_4g_mem_start() around.
> 

Hold on, I take that back, as the check is AMD-specific. And I just noticed a
mistake in v6 (other versions didn't have it), specifically in these phys-bits
boundary checks. Given how pc_pci_hole64_start() uses x86ms::above_4g_mem_start,
the point of the pc_max_used_gpa() < physbits check inside
pc_set_amd_above_4g_mem_start() was to test the boundaries against AMD_HT_START,
not against the typical 4GiB. And reusing pc_pci_hole64_start() introduced that
problem.

So one option is that I'll have to temporarily set x86ms::above_4g_mem_start
inside pc_max_used_gpa() based on the passed @above_4g_start value.
Joao Martins July 12, 2022, 11:35 a.m. UTC | #9
On 7/12/22 11:01, Joao Martins wrote:
> On 7/12/22 10:06, Igor Mammedov wrote:
>> On Mon, 11 Jul 2022 21:03:28 +0100
>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>> On 7/11/22 16:31, Joao Martins wrote:
>>>> On 7/11/22 15:52, Joao Martins wrote:  
>>>>> On 7/11/22 13:56, Igor Mammedov wrote:  
>>>>>> On Fri,  1 Jul 2022 17:10:13 +0100
>>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>>  void pc_memory_init(PCMachineState *pcms,
>>>                      MemoryRegion *system_memory,
>>>                      MemoryRegion *rom_memory,
>>> @@ -897,6 +953,7 @@ void pc_memory_init(PCMachineState *pcms,
>>>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>>      X86MachineState *x86ms = X86_MACHINE(pcms);
>>>      hwaddr cxl_base, cxl_resv_end = 0;
>>> +    X86CPU *cpu = X86_CPU(first_cpu);
>>>
>>>      assert(machine->ram_size == x86ms->below_4g_mem_size +
>>>                                  x86ms->above_4g_mem_size);
>>> @@ -904,6 +961,29 @@ void pc_memory_init(PCMachineState *pcms,
>>>      linux_boot = (machine->kernel_filename != NULL);
>>>
>>>      /*
>>> +     * The HyperTransport range close to the 1T boundary is unique to AMD
>>> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
>>> +     * to above 1T to AMD vCPUs only.
>>> +     */
>>> +    if (IS_AMD_CPU(&cpu->env) && x86ms->above_4g_mem_size) {
>>
>> it has the same issue as pc_max_used_gpa(), i.e.
>>   x86ms->above_4g_mem_size != 0
>> doesn't mean that there isn't any memory above 4Gb, nor that there isn't
>> any MMIO (sgx/cxl/pci64hole); that was the reason we were considering
>> max_used_gpa
>> I'd prefer to keep the pc_max_used_gpa() idea, but make it work for the above
>> cases and be more generic (i.e. not tied to AMD only), since
>> 'pc_max_used_gpa() < physbits' applies equally
>> to AMD and Intel (and to trip it, one just has to configure a small enough
>> physbits or large enough hotpluggable RAM/CXL/PCI64HOLE)
>>
> I can reproduce the issue you're thinking with basic memory hotplug. 

I was misled by a bug that only existed in v6, which I have fixed now.
So any bug possibility with hotplug, SGX, CXL, or pcihole64 is simply covered with:

	pc_pci_hole64_start() + pci_hole64_size;

which is what pc_max_used_gpa() does. This works fine /without/ the
above_4g_mem_size != 0 check, even when there is no above_4g_mem_size
(e.g. mem=2G,maxmem=1024G).

And as a reminder: SGX, hotplug, CXL and pci-hole64 *require* memory above 4G[*]. And part
of the point of us moving to pc_pci_hole64_start() was to make these all work in a generic
way.

So I've removed the x86ms->above_4g_mem_size != 0 check. Current patch diff pasted at the end.

[*] As reiterated here:

> Let me see
> what I can come up in pc_max_used_gpa() to cover this one. I'll respond here with a proposal.
> 

I was over-complicating things here. It turns out nothing else is needed
aside from the context of the 1T hole.

This is because I only need to check address space limits (as a consequence of
pc_set_amd_above_4g_mem_start()) when pc_max_used_gpa() surpasses HT_START,
which fundamentally requires a value close to 1T, well beyond what 32-bit can
cover. So on 32-bit guests this is never true, and thus things don't change
behaviour from the current default for those guests. And thus I won't break
qtests, and things fail correctly in the right places.

Now I should say that pc_max_used_gpa() is still not returning an accurate max
used GPA value for 32-bit guests, given that I essentially return the pci
hole64 end. Do you still want that addressed for correctness, even if it
doesn't matter much for the 64-bit 1T case?

If so, our only option seems to be to check phys_bits <= 32 and return the max
CPU boundary there? Unless you have something entirely different in mind?
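
If that's wanted, one possible shape for it (sketch only, based on the helper
as simplified earlier in this thread):

static hwaddr pc_max_used_gpa(PCMachineState *pcms, uint64_t pci_hole64_size)
{
    X86CPU *cpu = X86_CPU(first_cpu);

    /* 32-bit systems don't have hole64, so the CPU limit is the max GPA */
    if (cpu->phys_bits <= 32) {
        return ((hwaddr)1 << cpu->phys_bits) - 1;
    }

    return pc_pci_hole64_start() + pci_hole64_size;
}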

> I would really love to have v7.1.0 with this issue fixed but I am not very
> confident it is going to make it :(
> 
> Meanwhile, let me know if you have thoughts on this one:
> 
> https://lore.kernel.org/qemu-devel/1b2fa957-74f6-b5a9-3fc1-65c5d68300ce@oracle.com/
> 
> I am going to assume that if no comments on the above that I'll keep things as is.
> 
> And also, whether I can retain your ack with Bernhard's suggestion here:
> 
> https://lore.kernel.org/qemu-devel/0eefb382-4ac6-4335-ca61-035babb95a88@oracle.com/
> 


diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 668e15c8f2a6..45433cc53b5b 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -881,6 +881,67 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
     return start;
 }

+static hwaddr pc_max_used_gpa(PCMachineState *pcms, uint64_t pci_hole64_size)
+{
+    return pc_pci_hole64_start() + pci_hole64_size;
+}
+
+/*
+ * AMD systems with an IOMMU have an additional hole close to the
+ * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
+ * on kernel version, VFIO may or may not let you DMA map those ranges.
+ * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
+ * with certain memory sizes. It's also wrong to use those IOVA ranges
+ * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
+ * The ranges reserved for Hyper-Transport are:
+ *
+ * FD_0000_0000h - FF_FFFF_FFFFh
+ *
+ * The ranges represent the following:
+ *
+ * Base Address   Top Address  Use
+ *
+ * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
+ * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
+ * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
+ * FD_F910_0000h FD_F91F_FFFFh System Management
+ * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
+ * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
+ * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
+ * FD_FE00_0000h FD_FFFF_FFFFh Configuration
+ * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
+ * FE_2000_0000h FF_FFFF_FFFFh Reserved
+ *
+ * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
+ * Table 3: Special Address Controls (GPA) for more information.
+ */
+#define AMD_HT_START         0xfd00000000UL
+#define AMD_HT_END           0xffffffffffUL
+#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
+#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
+
+static void pc_set_amd_above_4g_mem_start(PCMachineState *pcms,
+                                          uint64_t pci_hole64_size)
+{
+    X86MachineState *x86ms = X86_MACHINE(pcms);
+    hwaddr maxphysaddr, maxusedaddr;
+
+    /*
+     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
+     * So make sure phys-bits is required to be appropriately sized in order
+     * to proceed with the above-4g-region relocation and thus boot.
+     */
+    x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
+    maxusedaddr = pc_max_used_gpa(pcms, pci_hole64_size);
+    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
+    if (maxphysaddr < maxusedaddr) {
+        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
+                     " phys-bits too low (%u) cannot avoid AMD HT range",
+                     maxphysaddr, maxusedaddr, X86_CPU(first_cpu)->phys_bits);
+        exit(EXIT_FAILURE);
+    }
+}
+
 void pc_memory_init(PCMachineState *pcms,
                     MemoryRegion *system_memory,
                     MemoryRegion *rom_memory,
@@ -896,6 +957,7 @@ void pc_memory_init(PCMachineState *pcms,
     PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
     X86MachineState *x86ms = X86_MACHINE(pcms);
     hwaddr cxl_base, cxl_resv_end = 0;
+    X86CPU *cpu = X86_CPU(first_cpu);

     assert(machine->ram_size == x86ms->below_4g_mem_size +
                                 x86ms->above_4g_mem_size);
@@ -903,6 +965,27 @@ void pc_memory_init(PCMachineState *pcms,
     linux_boot = (machine->kernel_filename != NULL);

     /*
+     * The HyperTransport range close to the 1T boundary is unique to AMD
+     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
+     * to above 1T to AMD vCPUs only.
+     */
+    if (IS_AMD_CPU(&cpu->env)) {
+        /* Bail out if max possible address does not cross HT range */
+        if (pc_max_used_gpa(pcms, pci_hole64_size) >= AMD_HT_START) {
+            pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
+        }
+
+        /*
+         * Advertise the HT region if address space covers the reserved
+         * region or if we relocate.
+         */
+        if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START ||
+            cpu->phys_bits >= 40) {
+            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
+        }
+    }
+
+    /*
      * Split single memory region and use aliases to address portions of it,
      * done for backwards compatibility with older qemus.
      */
Igor Mammedov July 14, 2022, 9:28 a.m. UTC | #10
On Tue, 12 Jul 2022 12:35:49 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 7/12/22 11:01, Joao Martins wrote:
> > On 7/12/22 10:06, Igor Mammedov wrote:  
> >> On Mon, 11 Jul 2022 21:03:28 +0100
> >> Joao Martins <joao.m.martins@oracle.com> wrote:  
> >>> On 7/11/22 16:31, Joao Martins wrote:  
> >>>> On 7/11/22 15:52, Joao Martins wrote:    
> >>>>> On 7/11/22 13:56, Igor Mammedov wrote:    
> >>>>>> On Fri,  1 Jul 2022 17:10:13 +0100
> >>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
> >>>  void pc_memory_init(PCMachineState *pcms,
> >>>                      MemoryRegion *system_memory,
> >>>                      MemoryRegion *rom_memory,
> >>> @@ -897,6 +953,7 @@ void pc_memory_init(PCMachineState *pcms,
> >>>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> >>>      X86MachineState *x86ms = X86_MACHINE(pcms);
> >>>      hwaddr cxl_base, cxl_resv_end = 0;
> >>> +    X86CPU *cpu = X86_CPU(first_cpu);
> >>>
> >>>      assert(machine->ram_size == x86ms->below_4g_mem_size +
> >>>                                  x86ms->above_4g_mem_size);
> >>> @@ -904,6 +961,29 @@ void pc_memory_init(PCMachineState *pcms,
> >>>      linux_boot = (machine->kernel_filename != NULL);
> >>>
> >>>      /*
> >>> +     * The HyperTransport range close to the 1T boundary is unique to AMD
> >>> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
> >>> +     * to above 1T to AMD vCPUs only.
> >>> +     */
> >>> +    if (IS_AMD_CPU(&cpu->env) && x86ms->above_4g_mem_size) {  
> >>
> >> it has the same issue as pc_max_used_gpa(), i.e.
> >>   x86ms->above_4g_mem_size != 0
> >> doesn't mean that there isn't any memory above 4Gb nor that there aren't
> >> any MMIO (sgx/cxl/pci64hole), that's was the reason we were are considering
> >> max_used_gpa
> >> I'd prefer to keep pc_max_used_gpa(),
> >> idea but make it work for above cases and be more generic (i.e. not to be
> >> tied to AMD only) since 'pc_max_used_gpa() < physbits'
> >> applies to equally
> >> to AMD and Intel (and to trip it, one just have to configure small enough
> >> physbits or large enough hotpluggable RAM/CXL/PCI64HOLE)
> >>  
> > I can reproduce the issue you're thinking with basic memory hotplug.   
> 
> I was mislead by a bug that only existed in v6. Which I fixed now.
> So any bug possibility with hotplug, SGX and CXL, or pcihole64 is simply covered with:
> 
> 	pc_pci_hole64_start() + pci_hole64_size;
> 
> which is what pc_max_used_gpa() does. This works fine /without/ above_4g_mem_size != 0
> check even without above_4g_mem_size (e.g. mem=2G,maxmem=1024G).
> 
> And as a reminder: SGX, hotplug, CXL and pci-hole64 *require* memory above 4G[*]. And part
> of the point of us moving to pc_pci_hole64_start() was to make these all work in a generic
> way.
> 
> So I've removed the x86ms->above_4g_mem_size != 0 check. Current patch diff pasted at the end.
> 
> [*] As reiterated here:
> 
> > Let me see
> > what I can come up in pc_max_used_gpa() to cover this one. I'll respond here with a proposal.
> >   
> 
> I was over-complicating things here. It turns out nothing else is needed aside in the
> context of 1T hole.
> 
> This is because I only need to check address space limits (as consequence of
> pc_set_amd_above_4g_mem_start()) when pc_max_used_gpa() surprasses HT_START. Which
> requires fundamentally a value closer to 1T well beyond what 32-bit can cover. So on
> 32-bit guests this is never true and thus it things don't change behaviour from current
> default for these guests. And thus I won't break qtests and things fail correctly in the
> right places.
> 
> Now I should say that pc_max_used_gpa() is still not returning the accurate 32-bit guest
> max used GPA value, given that I return pci hole64 end (essentially). Do you still that
> addressed out of correctness even if it doesn't matter much for the 64-bit 1T case?
> 
> If so, our only option seems to be to check phys_bits <= 32 and return max CPU
> boundary there? Unless you have something enterily different in mind?
> 
> > I would really love to have v7.1.0 with this issue fixed but I am not very
> > confident it is going to make it :(
> > 
> > Meanwhile, let me know if you have thoughts on this one:
> > 
> > https://lore.kernel.org/qemu-devel/1b2fa957-74f6-b5a9-3fc1-65c5d68300ce@oracle.com/
> > 
> > I am going to assume that if no comments on the above that I'll keep things as is.
> > 
> > And also, whether I can retain your ack with Bernhard's suggestion here:
> > 
> > https://lore.kernel.org/qemu-devel/0eefb382-4ac6-4335-ca61-035babb95a88@oracle.com/
> >   
> 
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 668e15c8f2a6..45433cc53b5b 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -881,6 +881,67 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
>      return start;
>  }
> 
> +static hwaddr pc_max_used_gpa(PCMachineState *pcms, uint64_t pci_hole64_size)
> +{
> +    return pc_pci_hole64_start() + pci_hole64_size;
> +}
> +
> +/*
> + * AMD systems with an IOMMU have an additional hole close to the
> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
> + * on kernel version, VFIO may or may not let you DMA map those ranges.
> + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
> + * with certain memory sizes. It's also wrong to use those IOVA ranges
> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
> + * The ranges reserved for Hyper-Transport are:
> + *
> + * FD_0000_0000h - FF_FFFF_FFFFh
> + *
> + * The ranges represent the following:
> + *
> + * Base Address   Top Address  Use
> + *
> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
> + * FD_F910_0000h FD_F91F_FFFFh System Management
> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
> + *
> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
> + * Table 3: Special Address Controls (GPA) for more information.
> + */
> +#define AMD_HT_START         0xfd00000000UL
> +#define AMD_HT_END           0xffffffffffUL
> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
> +#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
> +
> +static void pc_set_amd_above_4g_mem_start(PCMachineState *pcms,
> +                                          uint64_t pci_hole64_size)
> +{
> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> +    hwaddr maxphysaddr, maxusedaddr;
> +
> +    /*
> +     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
> +     * So make sure phys-bits is required to be appropriately sized in order
> +     * to proceed with the above-4g-region relocation and thus boot.
> +     */
> +    x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
> +    maxusedaddr = pc_max_used_gpa(pcms, pci_hole64_size);
> +    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
> +    if (maxphysaddr < maxusedaddr) {
> +        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
> +                     " phys-bits too low (%u) cannot avoid AMD HT range",
> +                     maxphysaddr, maxusedaddr, X86_CPU(first_cpu)->phys_bits);
> +        exit(EXIT_FAILURE);
> +    }
> +}
> +
>  void pc_memory_init(PCMachineState *pcms,
>                      MemoryRegion *system_memory,
>                      MemoryRegion *rom_memory,
> @@ -896,6 +957,7 @@ void pc_memory_init(PCMachineState *pcms,
>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>      X86MachineState *x86ms = X86_MACHINE(pcms);
>      hwaddr cxl_base, cxl_resv_end = 0;
> +    X86CPU *cpu = X86_CPU(first_cpu);
> 
>      assert(machine->ram_size == x86ms->below_4g_mem_size +
>                                  x86ms->above_4g_mem_size);
> @@ -903,6 +965,27 @@ void pc_memory_init(PCMachineState *pcms,
>      linux_boot = (machine->kernel_filename != NULL);
> 
>      /*
> +     * The HyperTransport range close to the 1T boundary is unique to AMD
> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
> +     * to above 1T to AMD vCPUs only.
> +     */
> +    if (IS_AMD_CPU(&cpu->env)) {
> +        /* Bail out if max possible address does not cross HT range */
> +        if (pc_max_used_gpa(pcms, pci_hole64_size) >= AMD_HT_START) {
> +            pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);

I'd replace the call with 
   x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;

> +        }
> +
> +        /*
> +         * Advertise the HT region if address space covers the reserved
> +         * region or if we relocate.
> +         */
> +        if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START ||
> +            cpu->phys_bits >= 40) {
> +            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
> +        }
> +    }

and then, here, check that pc_max_used_gpa() fits into phys_bits,
which should cover the AMD case and the case where pci64_hole goes beyond the
supported address range even without the 1TB hole
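
i.e. something along these lines (just a sketch, reusing the check that already sits
inside pc_set_amd_above_4g_mem_start() above):

    maxusedaddr = pc_max_used_gpa(pcms, pci_hole64_size);
    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
    if (maxphysaddr < maxusedaddr) {
        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
                     " phys-bits too low (%u)",
                     maxphysaddr, maxusedaddr, X86_CPU(first_cpu)->phys_bits);
        exit(EXIT_FAILURE);
    }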

> +
> +    /*
>       * Split single memory region and use aliases to address portions of it,
>       * done for backwards compatibility with older qemus.
>       */
>
Igor Mammedov July 14, 2022, 9:30 a.m. UTC | #11
On Tue, 12 Jul 2022 11:01:18 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 7/12/22 10:06, Igor Mammedov wrote:
> > On Mon, 11 Jul 2022 21:03:28 +0100
> > Joao Martins <joao.m.martins@oracle.com> wrote:
> >   
> >> On 7/11/22 16:31, Joao Martins wrote:  
> >>> On 7/11/22 15:52, Joao Martins wrote:    
> >>>> On 7/11/22 13:56, Igor Mammedov wrote:    
> >>>>> On Fri,  1 Jul 2022 17:10:13 +0100
> >>>>> Joao Martins <joao.m.martins@oracle.com> wrote:
[...]
> I would really love to have v7.1.0 with this issue fixed but I am not very
> confident it is going to make it :(

it can still make it into the current release

> 
> Meanwhile, let me know if you have thoughts on this one:
> 
> https://lore.kernel.org/qemu-devel/1b2fa957-74f6-b5a9-3fc1-65c5d68300ce@oracle.com/
> 
> I am going to assume that if no comments on the above that I'll keep things as is.
> 
> And also, whether I can retain your ack with Bernhard's suggestion here:
> 
> https://lore.kernel.org/qemu-devel/0eefb382-4ac6-4335-ca61-035babb95a88@oracle.com/
> 
> >> +        hwaddr maxusedaddr = pc_pci_hole64_start() + pci_hole64_size;
> >> +
> >> +        /* Bail out if max possible address does not cross HT range */
> >> +        if (maxusedaddr >= AMD_HT_START) {
> >> +            pc_set_amd_above_4g_mem_start(pcms, maxusedaddr);
> >> +        }
> >> +
> >> +        /*
> >> +         * Advertise the HT region if address space covers the reserved
> >> +         * region or if we relocate.
> >> +         */
> >> +        if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START ||
> >> +            cpu->phys_bits >= 40) {
> >> +            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
> >> +        }
> >> +    }
> >> +
> >> +    /*
> >>       * Split single memory region and use aliases to address portions of it,
> >>       * done for backwards compatibility with older qemus.
> >>       */
> >>  
> >   
>
Joao Martins July 14, 2022, 9:54 a.m. UTC | #12
On 7/14/22 10:28, Igor Mammedov wrote:
> On Tue, 12 Jul 2022 12:35:49 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> On 7/12/22 11:01, Joao Martins wrote:
>>> On 7/12/22 10:06, Igor Mammedov wrote:  
>>>> On Mon, 11 Jul 2022 21:03:28 +0100
>>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>>> On 7/11/22 16:31, Joao Martins wrote:  
>>>>>> On 7/11/22 15:52, Joao Martins wrote:    
>>>>>>> On 7/11/22 13:56, Igor Mammedov wrote:    
>>>>>>>> On Fri,  1 Jul 2022 17:10:13 +0100
>>>>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>>>  void pc_memory_init(PCMachineState *pcms,
>>>>>                      MemoryRegion *system_memory,
>>>>>                      MemoryRegion *rom_memory,
>>>>> @@ -897,6 +953,7 @@ void pc_memory_init(PCMachineState *pcms,
>>>>>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>>>>      X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>>      hwaddr cxl_base, cxl_resv_end = 0;
>>>>> +    X86CPU *cpu = X86_CPU(first_cpu);
>>>>>
>>>>>      assert(machine->ram_size == x86ms->below_4g_mem_size +
>>>>>                                  x86ms->above_4g_mem_size);
>>>>> @@ -904,6 +961,29 @@ void pc_memory_init(PCMachineState *pcms,
>>>>>      linux_boot = (machine->kernel_filename != NULL);
>>>>>
>>>>>      /*
>>>>> +     * The HyperTransport range close to the 1T boundary is unique to AMD
>>>>> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
>>>>> +     * to above 1T to AMD vCPUs only.
>>>>> +     */
>>>>> +    if (IS_AMD_CPU(&cpu->env) && x86ms->above_4g_mem_size) {  
>>>>
>>>> it has the same issue as pc_max_used_gpa(), i.e.
>>>>   x86ms->above_4g_mem_size != 0
>>>> doesn't mean that there isn't any memory above 4Gb nor that there aren't
>>>> any MMIO (sgx/cxl/pci64hole), that's was the reason we were are considering
>>>> max_used_gpa
>>>> I'd prefer to keep pc_max_used_gpa(),
>>>> idea but make it work for above cases and be more generic (i.e. not to be
>>>> tied to AMD only) since 'pc_max_used_gpa() < physbits'
>>>> applies to equally
>>>> to AMD and Intel (and to trip it, one just have to configure small enough
>>>> physbits or large enough hotpluggable RAM/CXL/PCI64HOLE)
>>>>  
>>> I can reproduce the issue you're thinking with basic memory hotplug.   
>>
>> I was mislead by a bug that only existed in v6. Which I fixed now.
>> So any bug possibility with hotplug, SGX and CXL, or pcihole64 is simply covered with:
>>
>> 	pc_pci_hole64_start() + pci_hole64_size;
>>
>> which is what pc_max_used_gpa() does. This works fine /without/ above_4g_mem_size != 0
>> check even without above_4g_mem_size (e.g. mem=2G,maxmem=1024G).
>>
>> And as a reminder: SGX, hotplug, CXL and pci-hole64 *require* memory above 4G[*]. And part
>> of the point of us moving to pc_pci_hole64_start() was to make these all work in a generic
>> way.
>>
>> So I've removed the x86ms->above_4g_mem_size != 0 check. Current patch diff pasted at the end.
>>
>> [*] As reiterated here:
>>
>>> Let me see
>>> what I can come up in pc_max_used_gpa() to cover this one. I'll respond here with a proposal.
>>>   
>>
>> I was over-complicating things here. It turns out nothing else is needed aside in the
>> context of 1T hole.
>>
>> This is because I only need to check address space limits (as consequence of
>> pc_set_amd_above_4g_mem_start()) when pc_max_used_gpa() surprasses HT_START. Which
>> requires fundamentally a value closer to 1T well beyond what 32-bit can cover. So on
>> 32-bit guests this is never true and thus it things don't change behaviour from current
>> default for these guests. And thus I won't break qtests and things fail correctly in the
>> right places.
>>
>> Now I should say that pc_max_used_gpa() is still not returning the accurate 32-bit guest
>> max used GPA value, given that I return pci hole64 end (essentially). Do you still that
>> addressed out of correctness even if it doesn't matter much for the 64-bit 1T case?
>>
>> If so, our only option seems to be to check phys_bits <= 32 and return max CPU
>> boundary there? Unless you have something enterily different in mind?
>>
>>> I would really love to have v7.1.0 with this issue fixed but I am not very
>>> confident it is going to make it :(
>>>
>>> Meanwhile, let me know if you have thoughts on this one:
>>>
>>> https://lore.kernel.org/qemu-devel/1b2fa957-74f6-b5a9-3fc1-65c5d68300ce@oracle.com/
>>>
>>> I am going to assume that if no comments on the above that I'll keep things as is.
>>>
>>> And also, whether I can retain your ack with Bernhard's suggestion here:
>>>
>>> https://lore.kernel.org/qemu-devel/0eefb382-4ac6-4335-ca61-035babb95a88@oracle.com/
>>>   
>>
>>
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index 668e15c8f2a6..45433cc53b5b 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -881,6 +881,67 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
>>      return start;
>>  }
>>
>> +static hwaddr pc_max_used_gpa(PCMachineState *pcms, uint64_t pci_hole64_size)
>> +{
>> +    return pc_pci_hole64_start() + pci_hole64_size;
>> +}
>> +
>> +/*
>> + * AMD systems with an IOMMU have an additional hole close to the
>> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
>> + * on kernel version, VFIO may or may not let you DMA map those ranges.
>> + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
>> + * with certain memory sizes. It's also wrong to use those IOVA ranges
>> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
>> + * The ranges reserved for Hyper-Transport are:
>> + *
>> + * FD_0000_0000h - FF_FFFF_FFFFh
>> + *
>> + * The ranges represent the following:
>> + *
>> + * Base Address   Top Address  Use
>> + *
>> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
>> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
>> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
>> + * FD_F910_0000h FD_F91F_FFFFh System Management
>> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
>> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
>> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
>> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
>> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
>> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
>> + *
>> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
>> + * Table 3: Special Address Controls (GPA) for more information.
>> + */
>> +#define AMD_HT_START         0xfd00000000UL
>> +#define AMD_HT_END           0xffffffffffUL
>> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
>> +#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
>> +
>> +static void pc_set_amd_above_4g_mem_start(PCMachineState *pcms,
>> +                                          uint64_t pci_hole64_size)
>> +{
>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>> +    hwaddr maxphysaddr, maxusedaddr;
>> +
>> +    /*
>> +     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
>> +     * So make sure phys-bits is required to be appropriately sized in order
>> +     * to proceed with the above-4g-region relocation and thus boot.
>> +     */
>> +    x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
>> +    maxusedaddr = pc_max_used_gpa(pcms, pci_hole64_size);
>> +    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
>> +    if (maxphysaddr < maxusedaddr) {
>> +        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
>> +                     " phys-bits too low (%u) cannot avoid AMD HT range",
>> +                     maxphysaddr, maxusedaddr, X86_CPU(first_cpu)->phys_bits);
>> +        exit(EXIT_FAILURE);
>> +    }
>> +}
>> +
>>  void pc_memory_init(PCMachineState *pcms,
>>                      MemoryRegion *system_memory,
>>                      MemoryRegion *rom_memory,
>> @@ -896,6 +957,7 @@ void pc_memory_init(PCMachineState *pcms,
>>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>      X86MachineState *x86ms = X86_MACHINE(pcms);
>>      hwaddr cxl_base, cxl_resv_end = 0;
>> +    X86CPU *cpu = X86_CPU(first_cpu);
>>
>>      assert(machine->ram_size == x86ms->below_4g_mem_size +
>>                                  x86ms->above_4g_mem_size);
>> @@ -903,6 +965,27 @@ void pc_memory_init(PCMachineState *pcms,
>>      linux_boot = (machine->kernel_filename != NULL);
>>
>>      /*
>> +     * The HyperTransport range close to the 1T boundary is unique to AMD
>> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
>> +     * to above 1T to AMD vCPUs only.
>> +     */
>> +    if (IS_AMD_CPU(&cpu->env)) {
>> +        /* Bail out if max possible address does not cross HT range */
>> +        if (pc_max_used_gpa(pcms, pci_hole64_size) >= AMD_HT_START) {
>> +            pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
> 
> I'd replace call with 
>    x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
> 
See below.

>> +        }
>> +
>> +        /*
>> +         * Advertise the HT region if address space covers the reserved
>> +         * region or if we relocate.
>> +         */
>> +        if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START ||
>> +            cpu->phys_bits >= 40) {
>> +            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
>> +        }
>> +    }
> 
> and then here check that pc_max_used_gpa() fits into phys_bits
> which should cover AMD case and case where pci64_hole goes beyond 
> supported address range even without 1TB hole
> 

When you say 'here', do you mean outside IS_AMD_CPU()?

If we put it outside (and thus make it generic) where it was ... it will break qtests,
as pc_max_used_gpa() does not handle the 32-bit case, as mentioned earlier.
Hence why it is inside pc_set_amd_above_4g_mem_start(), or in other words
inside the scope of:

	if (pc_max_used_gpa(pcms, pci_hole64_size) >= AMD_HT_START)

Which means I will for sure have a pci_hole64.
Making it generic /outside/ this conditional requires addressing this
earlier comment I made:

 our only option seems to be to check phys_bits <= 32 and return max CPU
 boundary there?
Joao Martins July 14, 2022, 10:47 a.m. UTC | #13
On 7/14/22 10:54, Joao Martins wrote:
> On 7/14/22 10:28, Igor Mammedov wrote:
>> On Tue, 12 Jul 2022 12:35:49 +0100
>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>> On 7/12/22 11:01, Joao Martins wrote:
>>>> On 7/12/22 10:06, Igor Mammedov wrote:  
>>>>> On Mon, 11 Jul 2022 21:03:28 +0100
>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>>>> On 7/11/22 16:31, Joao Martins wrote:  
>>>>>>> On 7/11/22 15:52, Joao Martins wrote:    
>>>>>>>> On 7/11/22 13:56, Igor Mammedov wrote:    
>>>>>>>>> On Fri,  1 Jul 2022 17:10:13 +0100
>>>>>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>>>>  void pc_memory_init(PCMachineState *pcms,
>>>>>>                      MemoryRegion *system_memory,
>>>>>>                      MemoryRegion *rom_memory,
>>>>>> @@ -897,6 +953,7 @@ void pc_memory_init(PCMachineState *pcms,
>>>>>>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>>>>>      X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>>>      hwaddr cxl_base, cxl_resv_end = 0;
>>>>>> +    X86CPU *cpu = X86_CPU(first_cpu);
>>>>>>
>>>>>>      assert(machine->ram_size == x86ms->below_4g_mem_size +
>>>>>>                                  x86ms->above_4g_mem_size);
>>>>>> @@ -904,6 +961,29 @@ void pc_memory_init(PCMachineState *pcms,
>>>>>>      linux_boot = (machine->kernel_filename != NULL);
>>>>>>
>>>>>>      /*
>>>>>> +     * The HyperTransport range close to the 1T boundary is unique to AMD
>>>>>> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
>>>>>> +     * to above 1T to AMD vCPUs only.
>>>>>> +     */
>>>>>> +    if (IS_AMD_CPU(&cpu->env) && x86ms->above_4g_mem_size) {  
>>>>>
>>>>> it has the same issue as pc_max_used_gpa(), i.e.
>>>>>   x86ms->above_4g_mem_size != 0
>>>>> doesn't mean that there isn't any memory above 4Gb nor that there aren't
>>>>> any MMIO (sgx/cxl/pci64hole), that's was the reason we were are considering
>>>>> max_used_gpa
>>>>> I'd prefer to keep pc_max_used_gpa(),
>>>>> idea but make it work for above cases and be more generic (i.e. not to be
>>>>> tied to AMD only) since 'pc_max_used_gpa() < physbits'
>>>>> applies to equally
>>>>> to AMD and Intel (and to trip it, one just have to configure small enough
>>>>> physbits or large enough hotpluggable RAM/CXL/PCI64HOLE)
>>>>>  
>>>> I can reproduce the issue you're thinking with basic memory hotplug.   
>>>
>>> I was mislead by a bug that only existed in v6. Which I fixed now.
>>> So any bug possibility with hotplug, SGX and CXL, or pcihole64 is simply covered with:
>>>
>>> 	pc_pci_hole64_start() + pci_hole64_size;
>>>
>>> which is what pc_max_used_gpa() does. This works fine /without/ above_4g_mem_size != 0
>>> check even without above_4g_mem_size (e.g. mem=2G,maxmem=1024G).
>>>
>>> And as a reminder: SGX, hotplug, CXL and pci-hole64 *require* memory above 4G[*]. And part
>>> of the point of us moving to pc_pci_hole64_start() was to make these all work in a generic
>>> way.
>>>
>>> So I've removed the x86ms->above_4g_mem_size != 0 check. Current patch diff pasted at the end.
>>>
>>> [*] As reiterated here:
>>>
>>>> Let me see
>>>> what I can come up in pc_max_used_gpa() to cover this one. I'll respond here with a proposal.
>>>>   
>>>
>>> I was over-complicating things here. It turns out nothing else is needed aside in the
>>> context of 1T hole.
>>>
>>> This is because I only need to check address space limits (as consequence of
>>> pc_set_amd_above_4g_mem_start()) when pc_max_used_gpa() surprasses HT_START. Which
>>> requires fundamentally a value closer to 1T well beyond what 32-bit can cover. So on
>>> 32-bit guests this is never true and thus it things don't change behaviour from current
>>> default for these guests. And thus I won't break qtests and things fail correctly in the
>>> right places.
>>>
>>> Now I should say that pc_max_used_gpa() is still not returning the accurate 32-bit guest
>>> max used GPA value, given that I return pci hole64 end (essentially). Do you still that
>>> addressed out of correctness even if it doesn't matter much for the 64-bit 1T case?
>>>
>>> If so, our only option seems to be to check phys_bits <= 32 and return max CPU
>>> boundary there? Unless you have something enterily different in mind?
>>>
>>>> I would really love to have v7.1.0 with this issue fixed but I am not very
>>>> confident it is going to make it :(
>>>>
>>>> Meanwhile, let me know if you have thoughts on this one:
>>>>
>>>> https://lore.kernel.org/qemu-devel/1b2fa957-74f6-b5a9-3fc1-65c5d68300ce@oracle.com/
>>>>
>>>> I am going to assume that if no comments on the above that I'll keep things as is.
>>>>
>>>> And also, whether I can retain your ack with Bernhard's suggestion here:
>>>>
>>>> https://lore.kernel.org/qemu-devel/0eefb382-4ac6-4335-ca61-035babb95a88@oracle.com/
>>>>   
>>>
>>>

[...]

>>>      /*
>>> +     * The HyperTransport range close to the 1T boundary is unique to AMD
>>> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
>>> +     * to above 1T to AMD vCPUs only.
>>> +     */
>>> +    if (IS_AMD_CPU(&cpu->env)) {
>>> +        /* Bail out if max possible address does not cross HT range */
>>> +        if (pc_max_used_gpa(pcms, pci_hole64_size) >= AMD_HT_START) {
>>> +            pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
>>
>> I'd replace call with 
>>    x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
>>
> See below.
> 
>>> +        }
>>> +
>>> +        /*
>>> +         * Advertise the HT region if address space covers the reserved
>>> +         * region or if we relocate.
>>> +         */
>>> +        if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START ||
>>> +            cpu->phys_bits >= 40) {
>>> +            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
>>> +        }
>>> +    }
>>
>> and then here check that pc_max_used_gpa() fits into phys_bits
>> which should cover AMD case and case where pci64_hole goes beyond 
>> supported address range even without 1TB hole
>>
> 
> When you say 'here' you mean outside IS_AMD_CPU() ?
> 
> If we put outside (and thus generic) where it was ... it will break qtests
> as pc_max_used_gpa() does not handle 32-bit case, as mentioned earlier.
> Hence why it is inside pc_set_amd_above_4g_mem_start(), or in other words
> inside the scope of:
> 
> 	if (pc_max_used_gpa(pcms, pci_hole64_size) >= AMD_HT_START)
> 
> Which means I will for sure have a pci_hole64.
> Making it generic to /outside/ this conditional requires addressing this
> earlier comment I made:
> 
>  our only option seems to be to check phys_bits <= 32 and return max CPU
>  boundary there?
> 

Here's how this patch looks after your comments and the issue I mentioned above.
The added part is inside pc_max_used_gpa().

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 668e15c8f2a6..2d85c66502d5 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -881,6 +881,51 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
     return start;
 }

+static hwaddr pc_max_used_gpa(PCMachineState *pcms, uint64_t pci_hole64_size)
+{
+    X86CPU *cpu = X86_CPU(first_cpu);
+
+    if (cpu->phys_bits <= 32) {
+        return (1ULL << cpu->phys_bits) - 1ULL;
+    }
+
+    return pc_pci_hole64_start() + pci_hole64_size;
+}
+
+/*
+ * AMD systems with an IOMMU have an additional hole close to the
+ * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
+ * on kernel version, VFIO may or may not let you DMA map those ranges.
+ * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
+ * with certain memory sizes. It's also wrong to use those IOVA ranges
+ * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
+ * The ranges reserved for Hyper-Transport are:
+ *
+ * FD_0000_0000h - FF_FFFF_FFFFh
+ *
+ * The ranges represent the following:
+ *
+ * Base Address   Top Address  Use
+ *
+ * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
+ * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
+ * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
+ * FD_F910_0000h FD_F91F_FFFFh System Management
+ * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
+ * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
+ * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
+ * FD_FE00_0000h FD_FFFF_FFFFh Configuration
+ * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
+ * FE_2000_0000h FF_FFFF_FFFFh Reserved
+ *
+ * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
+ * Table 3: Special Address Controls (GPA) for more information.
+ */
+#define AMD_HT_START         0xfd00000000UL
+#define AMD_HT_END           0xffffffffffUL
+#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
+#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
+
 void pc_memory_init(PCMachineState *pcms,
                     MemoryRegion *system_memory,
                     MemoryRegion *rom_memory,
@@ -895,7 +940,9 @@ void pc_memory_init(PCMachineState *pcms,
     MachineClass *mc = MACHINE_GET_CLASS(machine);
     PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
     X86MachineState *x86ms = X86_MACHINE(pcms);
+    hwaddr maxphysaddr, maxusedaddr;
     hwaddr cxl_base, cxl_resv_end = 0;
+    X86CPU *cpu = X86_CPU(first_cpu);

     assert(machine->ram_size == x86ms->below_4g_mem_size +
                                 x86ms->above_4g_mem_size);
@@ -903,6 +950,40 @@ void pc_memory_init(PCMachineState *pcms,
     linux_boot = (machine->kernel_filename != NULL);

     /*
+     * The HyperTransport range close to the 1T boundary is unique to AMD
+     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
+     * to above 1T to AMD vCPUs only.
+     */
+    if (IS_AMD_CPU(&cpu->env)) {
+        /* Bail out if max possible address does not cross HT range */
+        if (pc_max_used_gpa(pcms, pci_hole64_size) >= AMD_HT_START) {
+            x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
+        }
+
+        /*
+         * Advertise the HT region if address space covers the reserved
+         * region or if we relocate.
+         */
+        if (cpu->phys_bits >= 40) {
+            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
+        }
+    }
+
+    /*
+     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
+     * So make sure phys-bits is required to be appropriately sized in order
+     * to proceed with the above-4g-region relocation and thus boot.
+     */
+    maxusedaddr = pc_max_used_gpa(pcms, pci_hole64_size);
+    maxphysaddr = ((hwaddr)1 << cpu->phys_bits) - 1;
+    if (maxphysaddr < maxusedaddr) {
+        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
+                     " phys-bits too low (%u)",
+                     maxphysaddr, maxusedaddr, cpu->phys_bits);
+        exit(EXIT_FAILURE);
+    }
+
+    /*
      * Split single memory region and use aliases to address portions of it,
      * done for backwards compatibility with older qemus.
      */
Igor Mammedov July 14, 2022, 11:50 a.m. UTC | #14
On Thu, 14 Jul 2022 11:47:19 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 7/14/22 10:54, Joao Martins wrote:
> > On 7/14/22 10:28, Igor Mammedov wrote:  
> >> On Tue, 12 Jul 2022 12:35:49 +0100
> >> Joao Martins <joao.m.martins@oracle.com> wrote:  
> >>> On 7/12/22 11:01, Joao Martins wrote:  
> >>>> On 7/12/22 10:06, Igor Mammedov wrote:    
> >>>>> On Mon, 11 Jul 2022 21:03:28 +0100
> >>>>> Joao Martins <joao.m.martins@oracle.com> wrote:    
> >>>>>> On 7/11/22 16:31, Joao Martins wrote:    
> >>>>>>> On 7/11/22 15:52, Joao Martins wrote:      
> >>>>>>>> On 7/11/22 13:56, Igor Mammedov wrote:      
> >>>>>>>>> On Fri,  1 Jul 2022 17:10:13 +0100
> >>>>>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:    
> >>>>>>  void pc_memory_init(PCMachineState *pcms,
> >>>>>>                      MemoryRegion *system_memory,
> >>>>>>                      MemoryRegion *rom_memory,
> >>>>>> @@ -897,6 +953,7 @@ void pc_memory_init(PCMachineState *pcms,
> >>>>>>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> >>>>>>      X86MachineState *x86ms = X86_MACHINE(pcms);
> >>>>>>      hwaddr cxl_base, cxl_resv_end = 0;
> >>>>>> +    X86CPU *cpu = X86_CPU(first_cpu);
> >>>>>>
> >>>>>>      assert(machine->ram_size == x86ms->below_4g_mem_size +
> >>>>>>                                  x86ms->above_4g_mem_size);
> >>>>>> @@ -904,6 +961,29 @@ void pc_memory_init(PCMachineState *pcms,
> >>>>>>      linux_boot = (machine->kernel_filename != NULL);
> >>>>>>
> >>>>>>      /*
> >>>>>> +     * The HyperTransport range close to the 1T boundary is unique to AMD
> >>>>>> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
> >>>>>> +     * to above 1T to AMD vCPUs only.
> >>>>>> +     */
> >>>>>> +    if (IS_AMD_CPU(&cpu->env) && x86ms->above_4g_mem_size) {    
> >>>>>
> >>>>> it has the same issue as pc_max_used_gpa(), i.e.
> >>>>>   x86ms->above_4g_mem_size != 0
> >>>>> doesn't mean that there isn't any memory above 4Gb nor that there aren't
> >>>>> any MMIO (sgx/cxl/pci64hole), that's was the reason we were are considering
> >>>>> max_used_gpa
> >>>>> I'd prefer to keep pc_max_used_gpa(),
> >>>>> idea but make it work for above cases and be more generic (i.e. not to be
> >>>>> tied to AMD only) since 'pc_max_used_gpa() < physbits'
> >>>>> applies to equally
> >>>>> to AMD and Intel (and to trip it, one just have to configure small enough
> >>>>> physbits or large enough hotpluggable RAM/CXL/PCI64HOLE)
> >>>>>    
> >>>> I can reproduce the issue you're thinking with basic memory hotplug.     
> >>>
> >>> I was mislead by a bug that only existed in v6. Which I fixed now.
> >>> So any bug possibility with hotplug, SGX and CXL, or pcihole64 is simply covered with:
> >>>
> >>> 	pc_pci_hole64_start() + pci_hole64_size;
> >>>
> >>> which is what pc_max_used_gpa() does. This works fine /without/ above_4g_mem_size != 0
> >>> check even without above_4g_mem_size (e.g. mem=2G,maxmem=1024G).
> >>>
> >>> And as a reminder: SGX, hotplug, CXL and pci-hole64 *require* memory above 4G[*]. And part
> >>> of the point of us moving to pc_pci_hole64_start() was to make these all work in a generic
> >>> way.
> >>>
> >>> So I've removed the x86ms->above_4g_mem_size != 0 check. Current patch diff pasted at the end.
> >>>
> >>> [*] As reiterated here:
> >>>  
> >>>> Let me see
> >>>> what I can come up in pc_max_used_gpa() to cover this one. I'll respond here with a proposal.
> >>>>     
> >>>
> >>> I was over-complicating things here. It turns out nothing else is needed aside in the
> >>> context of 1T hole.
> >>>
> >>> This is because I only need to check address space limits (as consequence of
> >>> pc_set_amd_above_4g_mem_start()) when pc_max_used_gpa() surprasses HT_START. Which
> >>> requires fundamentally a value closer to 1T well beyond what 32-bit can cover. So on
> >>> 32-bit guests this is never true and thus it things don't change behaviour from current
> >>> default for these guests. And thus I won't break qtests and things fail correctly in the
> >>> right places.
> >>>
> >>> Now I should say that pc_max_used_gpa() is still not returning the accurate 32-bit guest
> >>> max used GPA value, given that I return pci hole64 end (essentially). Do you still that
> >>> addressed out of correctness even if it doesn't matter much for the 64-bit 1T case?
> >>>
> >>> If so, our only option seems to be to check phys_bits <= 32 and return max CPU
> >>> boundary there? Unless you have something enterily different in mind?
> >>>  
> >>>> I would really love to have v7.1.0 with this issue fixed but I am not very
> >>>> confident it is going to make it :(
> >>>>
> >>>> Meanwhile, let me know if you have thoughts on this one:
> >>>>
> >>>> https://lore.kernel.org/qemu-devel/1b2fa957-74f6-b5a9-3fc1-65c5d68300ce@oracle.com/
> >>>>
> >>>> I am going to assume that if no comments on the above that I'll keep things as is.
> >>>>
> >>>> And also, whether I can retain your ack with Bernhard's suggestion here:
> >>>>
> >>>> https://lore.kernel.org/qemu-devel/0eefb382-4ac6-4335-ca61-035babb95a88@oracle.com/
> >>>>     
> >>>
> >>>  
> 
> [...]
> 
> >>>      /*
> >>> +     * The HyperTransport range close to the 1T boundary is unique to AMD
> >>> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
> >>> +     * to above 1T to AMD vCPUs only.
> >>> +     */
> >>> +    if (IS_AMD_CPU(&cpu->env)) {
> >>> +        /* Bail out if max possible address does not cross HT range */
> >>> +        if (pc_max_used_gpa(pcms, pci_hole64_size) >= AMD_HT_START) {
> >>> +            pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);  
> >>
> >> I'd replace call with 
> >>    x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
> >>  
> > See below.
> >   
> >>> +        }
> >>> +
> >>> +        /*
> >>> +         * Advertise the HT region if address space covers the reserved
> >>> +         * region or if we relocate.
> >>> +         */
> >>> +        if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START ||
> >>> +            cpu->phys_bits >= 40) {
> >>> +            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
> >>> +        }
> >>> +    }  
> >>
> >> and then here check that pc_max_used_gpa() fits into phys_bits
> >> which should cover AMD case and case where pci64_hole goes beyond 
> >> supported address range even without 1TB hole
> >>  
> > 
> > When you say 'here' you mean outside IS_AMD_CPU() ?
> > 
> > If we put outside (and thus generic) where it was ... it will break qtests
> > as pc_max_used_gpa() does not handle 32-bit case, as mentioned earlier.
> > Hence why it is inside pc_set_amd_above_4g_mem_start(), or in other words
> > inside the scope of:
> > 
> > 	if (pc_max_used_gpa(pcms, pci_hole64_size) >= AMD_HT_START)
> > 
> > Which means I will for sure have a pci_hole64.
> > Making it generic to /outside/ this conditional requires addressing this
> > earlier comment I made:
> > 
> >  our only option seems to be to check phys_bits <= 32 and return max CPU
> >  boundary there?
> >   
> 
> Here's how this patch looks like, after your comments and the above issue
> I am talking. The added part is inside pc_max_used_gpa().
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 668e15c8f2a6..2d85c66502d5 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -881,6 +881,51 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
>      return start;
>  }
> 
> +static hwaddr pc_max_used_gpa(PCMachineState *pcms, uint64_t pci_hole64_size)
> +{
> +    X86CPU *cpu = X86_CPU(first_cpu);
> +
> +    if (cpu->phys_bits <= 32) {

> +        return (1ULL << cpu->phys_bits) - 1ULL;
Add a comment here as to why this value is returned

> +    }
> +
> +    return pc_pci_hole64_start() + pci_hole64_size;
> +}
> +
> +/*
> + * AMD systems with an IOMMU have an additional hole close to the
> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
> + * on kernel version, VFIO may or may not let you DMA map those ranges.
> + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
> + * with certain memory sizes. It's also wrong to use those IOVA ranges
> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
> + * The ranges reserved for Hyper-Transport are:
> + *
> + * FD_0000_0000h - FF_FFFF_FFFFh
> + *
> + * The ranges represent the following:
> + *
> + * Base Address   Top Address  Use
> + *
> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
> + * FD_F910_0000h FD_F91F_FFFFh System Management
> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
> + *
> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
> + * Table 3: Special Address Controls (GPA) for more information.
> + */
> +#define AMD_HT_START         0xfd00000000UL
> +#define AMD_HT_END           0xffffffffffUL
> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
> +#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
> +
>  void pc_memory_init(PCMachineState *pcms,
>                      MemoryRegion *system_memory,
>                      MemoryRegion *rom_memory,
> @@ -895,7 +940,9 @@ void pc_memory_init(PCMachineState *pcms,
>      MachineClass *mc = MACHINE_GET_CLASS(machine);
>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>      X86MachineState *x86ms = X86_MACHINE(pcms);
> +    hwaddr maxphysaddr, maxusedaddr;
>      hwaddr cxl_base, cxl_resv_end = 0;
> +    X86CPU *cpu = X86_CPU(first_cpu);
> 
>      assert(machine->ram_size == x86ms->below_4g_mem_size +
>                                  x86ms->above_4g_mem_size);
> @@ -903,6 +950,40 @@ void pc_memory_init(PCMachineState *pcms,
>      linux_boot = (machine->kernel_filename != NULL);
> 
>      /*
> +     * The HyperTransport range close to the 1T boundary is unique to AMD
> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
> +     * to above 1T to AMD vCPUs only.
> +     */
> +    if (IS_AMD_CPU(&cpu->env)) {
> +        /* Bail out if max possible address does not cross HT range */
> +        if (pc_max_used_gpa(pcms, pci_hole64_size) >= AMD_HT_START) {
> +            x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
> +        }
> +
> +        /*
> +         * Advertise the HT region if address space covers the reserved
> +         * region or if we relocate.
> +         */
> +        if (cpu->phys_bits >= 40) {
> +            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
> +        }
> +    }
> +
> +    /*
> +     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
> +     * So make sure phys-bits is required to be appropriately sized in order
> +     * to proceed with the above-4g-region relocation and thus boot.
> +     */
> +    maxusedaddr = pc_max_used_gpa(pcms, pci_hole64_size);
> +    maxphysaddr = ((hwaddr)1 << cpu->phys_bits) - 1;
> +    if (maxphysaddr < maxusedaddr) {
> +        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
> +                     " phys-bits too low (%u)",
> +                     maxphysaddr, maxusedaddr, cpu->phys_bits);
> +        exit(EXIT_FAILURE);
> +    }
> +

it looks fine to me

> +    /*
>       * Split single memory region and use aliases to address portions of it,
>       * done for backwards compatibility with older qemus.
>       */
>
Joao Martins July 14, 2022, 3:39 p.m. UTC | #15
On 7/14/22 12:50, Igor Mammedov wrote:
> On Thu, 14 Jul 2022 11:47:19 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> On 7/14/22 10:54, Joao Martins wrote:
>>> On 7/14/22 10:28, Igor Mammedov wrote:  
>>>> On Tue, 12 Jul 2022 12:35:49 +0100
>>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>>> On 7/12/22 11:01, Joao Martins wrote:  
>>>>>> On 7/12/22 10:06, Igor Mammedov wrote:    
>>>>>>> On Mon, 11 Jul 2022 21:03:28 +0100
>>>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:    
>>>>>>>> On 7/11/22 16:31, Joao Martins wrote:    
>>>>>>>>> On 7/11/22 15:52, Joao Martins wrote:      
>>>>>>>>>> On 7/11/22 13:56, Igor Mammedov wrote:      
>>>>>>>>>>> On Fri,  1 Jul 2022 17:10:13 +0100
>>>>>>>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:    
>>>>>>>>  void pc_memory_init(PCMachineState *pcms,
>>>>>>>>                      MemoryRegion *system_memory,
>>>>>>>>                      MemoryRegion *rom_memory,
>>>>>>>> @@ -897,6 +953,7 @@ void pc_memory_init(PCMachineState *pcms,
>>>>>>>>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>>>>>>>      X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>>>>>      hwaddr cxl_base, cxl_resv_end = 0;
>>>>>>>> +    X86CPU *cpu = X86_CPU(first_cpu);
>>>>>>>>
>>>>>>>>      assert(machine->ram_size == x86ms->below_4g_mem_size +
>>>>>>>>                                  x86ms->above_4g_mem_size);
>>>>>>>> @@ -904,6 +961,29 @@ void pc_memory_init(PCMachineState *pcms,
>>>>>>>>      linux_boot = (machine->kernel_filename != NULL);
>>>>>>>>
>>>>>>>>      /*
>>>>>>>> +     * The HyperTransport range close to the 1T boundary is unique to AMD
>>>>>>>> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
>>>>>>>> +     * to above 1T to AMD vCPUs only.
>>>>>>>> +     */
>>>>>>>> +    if (IS_AMD_CPU(&cpu->env) && x86ms->above_4g_mem_size) {    
>>>>>>>
>>>>>>> it has the same issue as pc_max_used_gpa(), i.e.
>>>>>>>   x86ms->above_4g_mem_size != 0
>>>>>>> doesn't mean that there isn't any memory above 4Gb nor that there aren't
>>>>>>> any MMIO (sgx/cxl/pci64hole), that's was the reason we were are considering
>>>>>>> max_used_gpa
>>>>>>> I'd prefer to keep pc_max_used_gpa(),
>>>>>>> idea but make it work for above cases and be more generic (i.e. not to be
>>>>>>> tied to AMD only) since 'pc_max_used_gpa() < physbits'
>>>>>>> applies to equally
>>>>>>> to AMD and Intel (and to trip it, one just have to configure small enough
>>>>>>> physbits or large enough hotpluggable RAM/CXL/PCI64HOLE)
>>>>>>>    
>>>>>> I can reproduce the issue you're thinking with basic memory hotplug.     
>>>>>
>>>>> I was mislead by a bug that only existed in v6. Which I fixed now.
>>>>> So any bug possibility with hotplug, SGX and CXL, or pcihole64 is simply covered with:
>>>>>
>>>>> 	pc_pci_hole64_start() + pci_hole64_size;
>>>>>
>>>>> which is what pc_max_used_gpa() does. This works fine /without/ above_4g_mem_size != 0
>>>>> check even without above_4g_mem_size (e.g. mem=2G,maxmem=1024G).
>>>>>
>>>>> And as a reminder: SGX, hotplug, CXL and pci-hole64 *require* memory above 4G[*]. And part
>>>>> of the point of us moving to pc_pci_hole64_start() was to make these all work in a generic
>>>>> way.
>>>>>
>>>>> So I've removed the x86ms->above_4g_mem_size != 0 check. Current patch diff pasted at the end.
>>>>>
>>>>> [*] As reiterated here:
>>>>>  
>>>>>> Let me see
>>>>>> what I can come up in pc_max_used_gpa() to cover this one. I'll respond here with a proposal.
>>>>>>     
>>>>>
>>>>> I was over-complicating things here. It turns out nothing else is needed aside in the
>>>>> context of 1T hole.
>>>>>
>>>>> This is because I only need to check address space limits (as consequence of
>>>>> pc_set_amd_above_4g_mem_start()) when pc_max_used_gpa() surprasses HT_START. Which
>>>>> requires fundamentally a value closer to 1T well beyond what 32-bit can cover. So on
>>>>> 32-bit guests this is never true and thus it things don't change behaviour from current
>>>>> default for these guests. And thus I won't break qtests and things fail correctly in the
>>>>> right places.
>>>>>
>>>>> Now I should say that pc_max_used_gpa() is still not returning the accurate 32-bit guest
>>>>> max used GPA value, given that I return pci hole64 end (essentially). Do you still that
>>>>> addressed out of correctness even if it doesn't matter much for the 64-bit 1T case?
>>>>>
>>>>> If so, our only option seems to be to check phys_bits <= 32 and return max CPU
>>>>> boundary there? Unless you have something enterily different in mind?
>>>>>  
>>>>>> I would really love to have v7.1.0 with this issue fixed but I am not very
>>>>>> confident it is going to make it :(
>>>>>>
>>>>>> Meanwhile, let me know if you have thoughts on this one:
>>>>>>
>>>>>> https://lore.kernel.org/qemu-devel/1b2fa957-74f6-b5a9-3fc1-65c5d68300ce@oracle.com/
>>>>>>
>>>>>> I am going to assume that if no comments on the above that I'll keep things as is.
>>>>>>
>>>>>> And also, whether I can retain your ack with Bernhard's suggestion here:
>>>>>>
>>>>>> https://lore.kernel.org/qemu-devel/0eefb382-4ac6-4335-ca61-035babb95a88@oracle.com/
>>>>>>     
>>>>>
>>>>>  
>>
>> [...]
>>
>>>>>      /*
>>>>> +     * The HyperTransport range close to the 1T boundary is unique to AMD
>>>>> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
>>>>> +     * to above 1T to AMD vCPUs only.
>>>>> +     */
>>>>> +    if (IS_AMD_CPU(&cpu->env)) {
>>>>> +        /* Bail out if max possible address does not cross HT range */
>>>>> +        if (pc_max_used_gpa(pcms, pci_hole64_size) >= AMD_HT_START) {
>>>>> +            pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);  
>>>>
>>>> I'd replace call with 
>>>>    x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
>>>>  
>>> See below.
>>>   
>>>>> +        }
>>>>> +
>>>>> +        /*
>>>>> +         * Advertise the HT region if address space covers the reserved
>>>>> +         * region or if we relocate.
>>>>> +         */
>>>>> +        if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START ||
>>>>> +            cpu->phys_bits >= 40) {
>>>>> +            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
>>>>> +        }
>>>>> +    }  
>>>>
>>>> and then here check that pc_max_used_gpa() fits into phys_bits
>>>> which should cover AMD case and case where pci64_hole goes beyond 
>>>> supported address range even without 1TB hole
>>>>  
>>>
>>> When you say 'here' you mean outside IS_AMD_CPU() ?
>>>
>>> If we put outside (and thus generic) where it was ... it will break qtests
>>> as pc_max_used_gpa() does not handle 32-bit case, as mentioned earlier.
>>> Hence why it is inside pc_set_amd_above_4g_mem_start(), or in other words
>>> inside the scope of:
>>>
>>> 	if (pc_max_used_gpa(pcms, pci_hole64_size) >= AMD_HT_START)
>>>
>>> Which means I will for sure have a pci_hole64.
>>> Making it generic to /outside/ this conditional requires addressing this
>>> earlier comment I made:
>>>
>>>  our only option seems to be to check phys_bits <= 32 and return max CPU
>>>  boundary there?
>>>   
>>
>> Here's how this patch looks like, after your comments and the above issue
>> I am talking. The added part is inside pc_max_used_gpa().
>>
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index 668e15c8f2a6..2d85c66502d5 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -881,6 +881,51 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
>>      return start;
>>  }
>>
>> +static hwaddr pc_max_used_gpa(PCMachineState *pcms, uint64_t pci_hole64_size)
>> +{
>> +    X86CPU *cpu = X86_CPU(first_cpu);
>> +
>> +    if (cpu->phys_bits <= 32) {
> 
>> +        return (1ULL << cpu->phys_bits) - 1ULL;
> Add a comment here as to why this value is returned
> 

I have added this so far:

+    /* 32-bit systems don't have hole64 thus return max phys address */

>> +    }
>> +
>> +    return pc_pci_hole64_start() + pci_hole64_size;
>> +}
>> +

And I also added a - 1 in the calculation above, as it was off by one.
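
Put together, the helper should end up looking roughly like this (a sketch with those
two changes folded in; the exact v7 hunk isn't pasted here):

    static hwaddr pc_max_used_gpa(PCMachineState *pcms, uint64_t pci_hole64_size)
    {
        X86CPU *cpu = X86_CPU(first_cpu);

        /* 32-bit systems don't have hole64 thus return max phys address */
        if (cpu->phys_bits <= 32) {
            return (1ULL << cpu->phys_bits) - 1ULL;
        }

        /* -1 so this is the last usable GPA, not one past it */
        return pc_pci_hole64_start() + pci_hole64_size - 1;
    }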

>> +/*
>> + * AMD systems with an IOMMU have an additional hole close to the
>> + * 1Tb boundary, a range of special GPAs that cannot be DMA mapped.
>> + * Depending on kernel version, VFIO may or may not let you DMA map
>> + * those ranges. Starting with Linux v5.4 we validate it, and can't
>> + * create guests on AMD machines with certain memory sizes. Using
>> + * those IOVA ranges is also wrong, as it may lead to IOMMU
>> + * INVALID_DEVICE_REQUEST or worse.
>> + * The ranges reserved for HyperTransport are:
>> + *
>> + * FD_0000_0000h - FF_FFFF_FFFFh
>> + *
>> + * The ranges represent the following:
>> + *
>> + * Base Address   Top Address  Use
>> + *
>> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
>> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
>> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
>> + * FD_F910_0000h FD_F91F_FFFFh System Management
>> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
>> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
>> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
>> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
>> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
>> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
>> + *
>> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
>> + * Table 3: Special Address Controls (GPA) for more information.
>> + */
>> +#define AMD_HT_START         0xfd00000000UL
>> +#define AMD_HT_END           0xffffffffffUL
>> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
>> +#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
>> +
>>  void pc_memory_init(PCMachineState *pcms,
>>                      MemoryRegion *system_memory,
>>                      MemoryRegion *rom_memory,
>> @@ -895,7 +940,9 @@ void pc_memory_init(PCMachineState *pcms,
>>      MachineClass *mc = MACHINE_GET_CLASS(machine);
>>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>      X86MachineState *x86ms = X86_MACHINE(pcms);
>> +    hwaddr maxphysaddr, maxusedaddr;
>>      hwaddr cxl_base, cxl_resv_end = 0;
>> +    X86CPU *cpu = X86_CPU(first_cpu);
>>
>>      assert(machine->ram_size == x86ms->below_4g_mem_size +
>>                                  x86ms->above_4g_mem_size);
>> @@ -903,6 +950,40 @@ void pc_memory_init(PCMachineState *pcms,
>>      linux_boot = (machine->kernel_filename != NULL);
>>
>>      /*
>> +     * The HyperTransport range close to the 1T boundary is unique to AMD
>> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
>> +     * to above 1T to AMD vCPUs only.
>> +     */
>> +    if (IS_AMD_CPU(&cpu->env)) {
>> +        /* Bail out if max possible address does not cross HT range */
>> +        if (pc_max_used_gpa(pcms, pci_hole64_size) >= AMD_HT_START) {
>> +            x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
>> +        }
>> +
>> +        /*
>> +         * Advertise the HT region if address space covers the reserved
>> +         * region or if we relocate.
>> +         */
>> +        if (cpu->phys_bits >= 40) {
>> +            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
>> +        }
>> +    }
>> +
>> +    /*
>> +     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
>> +     * So make sure phys-bits is appropriately sized in order to proceed
>> +     * with the above-4g-region relocation and thus boot.
>> +     */
>> +    maxusedaddr = pc_max_used_gpa(pcms, pci_hole64_size);
>> +    maxphysaddr = ((hwaddr)1 << cpu->phys_bits) - 1;
>> +    if (maxphysaddr < maxusedaddr) {
>> +        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
>> +                     " phys-bits too low (%u)",
>> +                     maxphysaddr, maxusedaddr, cpu->phys_bits);
>> +        exit(EXIT_FAILURE);
>> +    }
>> +
> 
> it looks fine to me
> 

Cool, let me respin v7 today/tomorrow.

>> +    /*
>>       * Split single memory region and use aliases to address portions of it,
>>       * done for backwards compatibility with older qemus.
>>       */
>>
>
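
As a usage note (hypothetical invocation; the exact memory-size threshold
depends on the pci-hole64 layout), a large AMD guest configured with too few
phys-bits would now be rejected up front by the check above rather than
booting with a broken memory layout, e.g. something like:

    qemu-system-x86_64 -machine q35 -cpu EPYC,phys-bits=40 -m 1200G ...

should exit early with the "phys-bits too low (40)" error instead of
creating the guest.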
diff mbox series

Patch

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index a79fa1b6beeb..07025b510540 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -907,6 +907,87 @@  static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
     return start;
 }
 
+static hwaddr pc_max_used_gpa(PCMachineState *pcms,
+                                hwaddr above_4g_mem_start,
+                                uint64_t pci_hole64_size)
+{
+    X86MachineState *x86ms = X86_MACHINE(pcms);
+
+    if (!x86ms->above_4g_mem_size) {
+        /*
+         * 32-bit pci hole goes from
+         * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
+         */
+        return IO_APIC_DEFAULT_ADDRESS - 1;
+    }
+
+    return pc_pci_hole64_start() + pci_hole64_size;
+}
+
+/*
+ * AMD systems with an IOMMU have an additional hole close to the
+ * 1Tb boundary, a range of special GPAs that cannot be DMA mapped.
+ * Depending on kernel version, VFIO may or may not let you DMA map
+ * those ranges. Starting with Linux v5.4 we validate it, and can't
+ * create guests on AMD machines with certain memory sizes. Using
+ * those IOVA ranges is also wrong, as it may lead to IOMMU
+ * INVALID_DEVICE_REQUEST or worse.
+ * The ranges reserved for HyperTransport are:
+ *
+ * FD_0000_0000h - FF_FFFF_FFFFh
+ *
+ * The ranges represent the following:
+ *
+ * Base Address   Top Address  Use
+ *
+ * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
+ * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
+ * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
+ * FD_F910_0000h FD_F91F_FFFFh System Management
+ * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
+ * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
+ * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
+ * FD_FE00_0000h FD_FFFF_FFFFh Configuration
+ * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
+ * FE_2000_0000h FF_FFFF_FFFFh Reserved
+ *
+ * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
+ * Table 3: Special Address Controls (GPA) for more information.
+ */
+#define AMD_HT_START         0xfd00000000UL
+#define AMD_HT_END           0xffffffffffUL
+#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
+#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
+
+static void pc_set_amd_above_4g_mem_start(PCMachineState *pcms,
+                                          uint64_t pci_hole64_size)
+{
+    X86MachineState *x86ms = X86_MACHINE(pcms);
+    hwaddr start = x86ms->above_4g_mem_start;
+    hwaddr maxphysaddr, maxusedaddr;
+
+    /* Bail out if max possible address does not cross HT range */
+    if (pc_max_used_gpa(pcms, start, pci_hole64_size) < AMD_HT_START) {
+        return;
+    }
+
+    /*
+     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
+     * So make sure phys-bits is appropriately sized in order to proceed
+     * with the above-4g-region relocation and thus boot.
+     */
+    start = AMD_ABOVE_1TB_START;
+    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
+    maxusedaddr = pc_max_used_gpa(pcms, start, pci_hole64_size);
+    if (maxphysaddr < maxusedaddr) {
+        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
+                     " phys-bits too low (%u) cannot avoid AMD HT range",
+                     maxphysaddr, maxusedaddr, X86_CPU(first_cpu)->phys_bits);
+        exit(EXIT_FAILURE);
+    }
+
+    x86ms->above_4g_mem_start = start;
+}
+
 void pc_memory_init(PCMachineState *pcms,
                     MemoryRegion *system_memory,
                     MemoryRegion *rom_memory,
@@ -922,12 +1003,31 @@  void pc_memory_init(PCMachineState *pcms,
     PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
     X86MachineState *x86ms = X86_MACHINE(pcms);
     hwaddr cxl_base, cxl_resv_end = 0;
+    X86CPU *cpu = X86_CPU(first_cpu);
 
     assert(machine->ram_size == x86ms->below_4g_mem_size +
                                 x86ms->above_4g_mem_size);
 
     linux_boot = (machine->kernel_filename != NULL);
 
+    /*
+     * The HyperTransport range close to the 1T boundary is unique to AMD
+     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
+     * to above 1T to AMD vCPUs only.
+     */
+    if (IS_AMD_CPU(&cpu->env)) {
+        pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
+
+        /*
+         * Advertise the HT region if address space covers the reserved
+         * region or if we relocate.
+         */
+        if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START ||
+            cpu->phys_bits >= 40) {
+            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
+        }
+    }
+
     /*
      * Split single memory region and use aliases to address portions of it,
      * done for backwards compatibility with older qemus.
@@ -938,6 +1038,7 @@  void pc_memory_init(PCMachineState *pcms,
                              0, x86ms->below_4g_mem_size);
     memory_region_add_subregion(system_memory, 0, ram_below_4g);
     e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
+
     if (x86ms->above_4g_mem_size > 0) {
         ram_above_4g = g_malloc(sizeof(*ram_above_4g));
         memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",