
[v2,8/9] vfio: Check compatibility of CPU and IOMMU address space width

Message ID 20250130134346.1754143-9-clg@redhat.com (mailing list archive)
State New
Series vfio: Improve error reporting when MMIO region mapping fails

Commit Message

Cédric Le Goater Jan. 30, 2025, 1:43 p.m. UTC
Print a warning if the IOMMU address space width is smaller than the
physical address width. In this case, PCI peer-to-peer transactions on
BARs are not supported and mapping failures of device MMIO regions are
to be expected.

This can occur with the 39-bit IOMMU address space width as found on
consumer grade processors or when using a vIOMMU device with default
settings.

Signed-off-by: Cédric Le Goater <clg@redhat.com>
---
 hw/vfio/common.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

Comments

Alex Williamson Jan. 30, 2025, 6:58 p.m. UTC | #1
On Thu, 30 Jan 2025 14:43:45 +0100
Cédric Le Goater <clg@redhat.com> wrote:

> Print a warning if IOMMU address space width is smaller than the
> physical address width. In this case, PCI peer-to-peer transactions on
> BARs are not supported and failures of device MMIO regions are to be
> expected.
> 
> This can occur with the 39-bit IOMMU address space width as found on
> consumer grade processors or when using a vIOMMU device with default
> settings.
> 
> Signed-off-by: Cédric Le Goater <clg@redhat.com>
> ---
>  hw/vfio/common.c | 21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 5c9d8657d746ce30af5ae8f9122101e086a61ef5..e5ef93c589b2bed68f790608868ec3c7779d4cb8 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -44,6 +44,8 @@
>  #include "migration/qemu-file.h"
>  #include "system/tpm.h"
>  
> +#include "hw/core/cpu.h"
> +
>  VFIODeviceList vfio_device_list =
>      QLIST_HEAD_INITIALIZER(vfio_device_list);
>  static QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces =
> @@ -1546,12 +1548,28 @@ retry:
>      return info;
>  }
>  
> +static bool vfio_device_check_address_space(VFIODevice *vbasedev, Error **errp)
> +{
> +    uint32_t cpu_aw_bits = cpu_get_phys_bits(first_cpu);
> +    uint32_t iommu_aw_bits = vfio_device_get_aw_bits(vbasedev);
> +
> +    if (cpu_aw_bits && cpu_aw_bits > iommu_aw_bits) {

Should we be testing the 64-bit MMIO window and maximum RAM GPA rather
than the vCPU physical address width?

Maybe we're just stuck with an indirect solution currently.  AIUI,
we're testing the vCPU physical address width because (obviously) all
devices and memory need to be addressable within that address space.
However, as we've explored offline, there are bare metal systems where
the CPU physical address width exceeds the IOMMU address width and this
is not a fundamental problem.  It places restrictions on the maximum
RAM physical address and the location of the 64-bit MMIO space.

RAM tends not to be a problem on these bare metal systems since they
physically cannot hold enough RAM to exceed the IOMMU width and, for
the most part, we map RAM starting from the bottom of the address
space.  So long as the VMM doesn't decide to map RAM at the top of the
physical address space, this doesn't become a problem.

However, we've decided to do exactly that for the 64-bit MMIO window.
It's not that the vCPU width being greater than the IOMMU width is a
fundamental problem, it's that we've chosen a 64-bit MMIO policy that
makes this become a problem and QEMU only has a convenient mechanism to
check the host IOMMU width when a vfio device is present.  Arguably,
when a vIOMMU is present guest firmware should be enlightened to
understand the address width of that vIOMMU when placing the 64-bit
MMIO window.  I'd say the failure to do so is a current firmware bug.

If the vIOMMU address width were honored, we could recommend users set
that to match the host, regardless of the vCPU physical address width.
We also already have a failure condition if the vIOMMU address width
exceeds the vfio-device (ie. indirectly the host) IOMMU width.

Of course without a vIOMMU, given our 64-bit MMIO policy, we don't have
a good solution without specifying the 64-bit window or implementing a
more conservative placement.

Not sure how much of this is immediately solvable and some indication
to the user how they can resolve the issue, such as implemented here, is
better than none, but maybe we can elaborate in a comment that this is
really more of a workaround for the current behavior of firmware
relative to the 64-bit MMIO placement policy.  Thanks,

Alex

> +        error_setg(errp, "Host physical address space (%u) is larger than "
> +                   "the host IOMMU address space (%u).", cpu_aw_bits,
> +                   iommu_aw_bits);
> +        vfio_device_error_append(vbasedev, errp);
> +        return false;
> +    }
> +    return true;
> +}
> +
>  bool vfio_attach_device(char *name, VFIODevice *vbasedev,
>                          AddressSpace *as, Error **errp)
>  {
>      const VFIOIOMMUClass *ops =
>          VFIO_IOMMU_CLASS(object_class_by_name(TYPE_VFIO_IOMMU_LEGACY));
>      HostIOMMUDevice *hiod = NULL;
> +    Error *local_err = NULL;
>  
>      if (vbasedev->iommufd) {
>          ops = VFIO_IOMMU_CLASS(object_class_by_name(TYPE_VFIO_IOMMU_IOMMUFD));
> @@ -1571,6 +1589,9 @@ bool vfio_attach_device(char *name, VFIODevice *vbasedev,
>          return false;
>      }
>  
> +    if (!vfio_device_check_address_space(vbasedev, &local_err)) {
> +        warn_report_err(local_err);
> +    }
>      return true;
>  }
>
Cédric Le Goater Jan. 31, 2025, 12:42 p.m. UTC | #2
+ Gerd for insights regarding vIOMMU support in edk2.

On 1/30/25 19:58, Alex Williamson wrote:
> On Thu, 30 Jan 2025 14:43:45 +0100
> Cédric Le Goater <clg@redhat.com> wrote:
> 
>> Print a warning if IOMMU address space width is smaller than the
>> physical address width. In this case, PCI peer-to-peer transactions on
>> BARs are not supported and failures of device MMIO regions are to be
>> expected.
>>
>> This can occur with the 39-bit IOMMU address space width as found on
>> consumer grade processors or when using a vIOMMU device with default
>> settings.
>>
>> Signed-off-by: Cédric Le Goater <clg@redhat.com>
>> ---
>>   hw/vfio/common.c | 21 +++++++++++++++++++++
>>   1 file changed, 21 insertions(+)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 5c9d8657d746ce30af5ae8f9122101e086a61ef5..e5ef93c589b2bed68f790608868ec3c7779d4cb8 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -44,6 +44,8 @@
>>   #include "migration/qemu-file.h"
>>   #include "system/tpm.h"
>>   
>> +#include "hw/core/cpu.h"
>> +
>>   VFIODeviceList vfio_device_list =
>>       QLIST_HEAD_INITIALIZER(vfio_device_list);
>>   static QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces =
>> @@ -1546,12 +1548,28 @@ retry:
>>       return info;
>>   }
>>   
>> +static bool vfio_device_check_address_space(VFIODevice *vbasedev, Error **errp)
>> +{
>> +    uint32_t cpu_aw_bits = cpu_get_phys_bits(first_cpu);
>> +    uint32_t iommu_aw_bits = vfio_device_get_aw_bits(vbasedev);
>> +
>> +    if (cpu_aw_bits && cpu_aw_bits > iommu_aw_bits) {
> 
> Should we be testing the 64-bit MMIO window and maximum RAM GPA rather
> than the vCPU physical address width?
> 
> Maybe we're just stuck with an indirect solution currently.  AIUI,
> we're testing the vCPU physical address width because (obviously) all
> devices and memory need to be addressable within that address space.
> However, as we've explored offline, there are bare metal systems where
> the CPU physical address width exceeds the IOMMU address width and this
> is not a fundamental problem.  It places restrictions on the maximum
> RAM physical address and the location of the 64-bit MMIO space.
> 
> RAM tends not to be a problem on these bare metal systems since they
> physically cannot hold enough RAM to exceed the IOMMU width and, for
> the most part, we map RAM starting from the bottom of the address
> space.  So long as the VMM doesn't decide to map RAM at the top of the
> physical address space, this doesn't become a problem.
> 
> However, we've decided to do exactly that for the 64-bit MMIO window.
> It's not that the vCPU width being greater than the IOMMU width is a
> fundamental problem, it's that we've chosen a 64-bit MMIO policy that
> makes this become a problem and QEMU only has a convenient mechanism to
> check the host IOMMU width when a vfio device is present.  Arguably,
> when a vIOMMU is present guest firmware should be enlightened to
> understand the address width of that vIOMMU when placing the 64-bit
> MMIO window.  I'd say the failure to do so is a current firmware bug.
> 
> If the vIOMMU address width were honored, we could recommend users set
> that to match the host, regardless of the vCPU physical address width.
> We also already have a failure condition if the vIOMMU address width
> exceeds the vfio-device (ie. indirectly the host) IOMMU width.
> 
> Of course without a vIOMMU, given our 64-bit MMIO policy, we don't have
> a good solution without specifying the 64-bit window or implementing a
> more conservative placement.
> 
> Not sure how much of this is immediately solvable and some indication
> to the user how they can resolve the issue, such as implemented here, is
> better than none, but maybe we can elaborate in a comment that this is
> really more of a workaround for the current behavior of firmware
> relative to the 64-bit MMIO placement policy.  Thanks,

Sure. I will improve the commit log in v3.

Thanks,

C.


> Alex
> 
>> +        error_setg(errp, "Host physical address space (%u) is larger than "
>> +                   "the host IOMMU address space (%u).", cpu_aw_bits,
>> +                   iommu_aw_bits);
>> +        vfio_device_error_append(vbasedev, errp);
>> +        return false;
>> +    }
>> +    return true;
>> +}
>> +
>>   bool vfio_attach_device(char *name, VFIODevice *vbasedev,
>>                           AddressSpace *as, Error **errp)
>>   {
>>       const VFIOIOMMUClass *ops =
>>           VFIO_IOMMU_CLASS(object_class_by_name(TYPE_VFIO_IOMMU_LEGACY));
>>       HostIOMMUDevice *hiod = NULL;
>> +    Error *local_err = NULL;
>>   
>>       if (vbasedev->iommufd) {
>>           ops = VFIO_IOMMU_CLASS(object_class_by_name(TYPE_VFIO_IOMMU_IOMMUFD));
>> @@ -1571,6 +1589,9 @@ bool vfio_attach_device(char *name, VFIODevice *vbasedev,
>>           return false;
>>       }
>>   
>> +    if (!vfio_device_check_address_space(vbasedev, &local_err)) {
>> +        warn_report_err(local_err);
>> +    }
>>       return true;
>>   }
>>   
>
Gerd Hoffmann Jan. 31, 2025, 1:23 p.m. UTC | #3
On Fri, Jan 31, 2025 at 01:42:31PM +0100, Cédric Le Goater wrote:
> + Gerd for insights regarding vIOMMU support in edk2.
> 
> > > +static bool vfio_device_check_address_space(VFIODevice *vbasedev, Error **errp)
> > > +{
> > > +    uint32_t cpu_aw_bits = cpu_get_phys_bits(first_cpu);
> > > +    uint32_t iommu_aw_bits = vfio_device_get_aw_bits(vbasedev);
> > > +
> > > +    if (cpu_aw_bits && cpu_aw_bits > iommu_aw_bits) {
> > 
> > Should we be testing the 64-bit MMIO window and maximum RAM GPA rather
> > than the vCPU physical address width?

Placing the 64-bit  mmio window is done by the guest firmware,
so this isn't something qemu can check before boot.

> > However, we've decided to do exactly that for the 64-bit MMIO window.
> > It's not that the vCPU width being greater than the IOMMU width is a
> > fundamental problem, it's that we've chosen a 64-bit MMIO policy that
> > makes this become a problem and QEMU only has a convenient mechanism to
> > check the host IOMMU width when a vfio device is present.  Arguably,
> > when a vIOMMU is present guest firmware should be enlightened to
> > understand the address width of that vIOMMU when placing the 64-bit
> > MMIO window.  I'd say the failure to do so is a current firmware bug.

edk2 has no iommu driver ...

Is there some simple way to figure what the iommu width is (inside the
guest)?

Note that there is a 'guest-phys-bits' property for x86 CPUs, which is a
hint for the guest what the usable address width is.  It was added
because there are cases where the guest simply can't figure that it is
not possible to use the full physical address space of the cpu.  There
are some non-obvious limitations around 5-level paging.  Intel has some
CPUs with phys-bits > 48 but only 4-level EPT for example.

So one option to handle this is to make sure guest-phys-bits is not
larger than the iommu width.

take care,
  Gerd
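
A minimal sketch of that suggestion, for comparison with the check in this
patch: test the guest-phys-bits value rather than the raw CPU physical
address width. Only vfio_device_get_aw_bits() comes from this series; the
guest_phys_bits field name and the helper below are illustrative
assumptions, and the usual QEMU vfio and x86 CPU headers are assumed.

/*
 * Illustrative sketch only, not part of the series: compare the x86
 * 'guest-phys-bits' value (falling back to phys_bits when unset)
 * against the host IOMMU address space width.
 */
static bool vfio_device_check_guest_phys_bits(VFIODevice *vbasedev,
                                              X86CPU *cpu, Error **errp)
{
    uint32_t guest_aw_bits = cpu->guest_phys_bits ?
                             cpu->guest_phys_bits : cpu->phys_bits;
    uint32_t iommu_aw_bits = vfio_device_get_aw_bits(vbasedev);

    if (guest_aw_bits > iommu_aw_bits) {
        error_setg(errp, "guest-phys-bits (%u) is larger than the host "
                   "IOMMU address space width (%u)",
                   guest_aw_bits, iommu_aw_bits);
        return false;
    }
    return true;
}
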
Cédric Le Goater Jan. 31, 2025, 5:03 p.m. UTC | #4
Hello Gerd,

On 1/31/25 14:23, Gerd Hoffmann wrote:
> On Fri, Jan 31, 2025 at 01:42:31PM +0100, Cédric Le Goater wrote:
>> + Gerd for insights regarding vIOMMU support in edk2.
>>
>>>> +static bool vfio_device_check_address_space(VFIODevice *vbasedev, Error **errp)
>>>> +{
>>>> +    uint32_t cpu_aw_bits = cpu_get_phys_bits(first_cpu);
>>>> +    uint32_t iommu_aw_bits = vfio_device_get_aw_bits(vbasedev);
>>>> +
>>>> +    if (cpu_aw_bits && cpu_aw_bits > iommu_aw_bits) {
>>>
>>> Should we be testing the 64-bit MMIO window and maximum RAM GPA rather
>>> than the vCPU physical address width?
> 
> Placing the 64-bit  mmio window is done by the guest firmware,
> so this isn't something qemu can check before boot.
> 
>>> However, we've decided to do exactly that for the 64-bit MMIO window.
>>> It's not that the vCPU width being greater than the IOMMU width is a
>>> fundamental problem, it's that we've chosen a 64-bit MMIO policy that
>>> makes this become a problem and QEMU only has a convenient mechanism to
>>> check the host IOMMU width when a vfio device is present.  Arguably,
>>> when a vIOMMU is present guest firmware should be enlightened to
>>> understand the address width of that vIOMMU when placing the 64-bit
>>> MMIO window.  I'd say the failure to do so is a current firmware bug.
> 
> edk2 has no iommu driver ...
> 
> Is there some simple way to figure what the iommu width is (inside the
> guest)?

When using VFIO, vfio_device_get_aw_bits() returns the IOMMU
address space width. When a vIOMMU is in use, there are checks that
the host and vIOMMU address spaces are compatible: the vIOMMU width
must not exceed the host's.

> Note that there is a 'guest-phys-bits' property for x86 CPUs, which is a
> hint for the guest what the usable address width is.  It was added
> because there are cases where the guest simply can't figure that it is
> not possible to use the full physical address space of the cpu.  There
> are some non-obvious limitations around 5-level paging.  Intel has some
> CPUs with phys-bits > 48 but only 4-level EPT for example.
> 
> So one option to handle this is to make sure guest-phys-bits is not
> larger than the iommu width.

Yes. This is what I am trying to do.

Patch [1] returns X86_CPU(cs)->phys_bits. I was not sure which *phys*
property to use. If you think this is incorrect and not returning the
right information, I will change the proposal with guest-phys-bits.

Thanks,

C.

[1] https://lore.kernel.org/qemu-devel/20250130134346.1754143-8-clg@redhat.com/
Alex Williamson Jan. 31, 2025, 10:18 p.m. UTC | #5
On Fri, 31 Jan 2025 14:23:58 +0100
Gerd Hoffmann <kraxel@redhat.com> wrote:

> On Fri, Jan 31, 2025 at 01:42:31PM +0100, Cédric Le Goater wrote:
> > + Gerd for insights regarding vIOMMU support in edk2.
> >   
> > > > +static bool vfio_device_check_address_space(VFIODevice *vbasedev, Error **errp)
> > > > +{
> > > > +    uint32_t cpu_aw_bits = cpu_get_phys_bits(first_cpu);
> > > > +    uint32_t iommu_aw_bits = vfio_device_get_aw_bits(vbasedev);
> > > > +
> > > > +    if (cpu_aw_bits && cpu_aw_bits > iommu_aw_bits) {  
> > > 
> > > Should we be testing the 64-bit MMIO window and maximum RAM GPA rather
> > > than the vCPU physical address width?  
> 
> Placing the 64-bit  mmio window is done by the guest firmware,
> so this isn't something qemu can check before boot.
> 
> > > However, we've decided to do exactly that for the 64-bit MMIO window.
> > > It's not that the vCPU width being greater than the IOMMU width is a
> > > fundamental problem, it's that we've chosen a 64-bit MMIO policy that
> > > makes this become a problem and QEMU only has a convenient mechanism to
> > > check the host IOMMU width when a vfio device is present.  Arguably,
> > > when a vIOMMU is present guest firmware should be enlightened to
> > > understand the address width of that vIOMMU when placing the 64-bit
> > > MMIO window.  I'd say the failure to do so is a current firmware bug.  
> 
> edk2 has no iommu driver ...
> 
> Is there some simple way to figure what the iommu width is (inside the
> guest)?

If the guest firmware is exposing a DMAR table (VT-d), there's a host
address width field in that table.  Otherwise there are capability
registers on the DRHD units that could be queried.  AMD-Vi seems to
have similar information in the IVinfo table linked from the IVRS
table.  Maybe an iterative solution is ok, starting with the most
common IOMMU types and assuming others are 64-bits wide until proven
otherwise.

> 
> Note that there is a 'guest-phys-bits' property for x86 CPUs, which is a
> hint for the guest what the usable address width is.  It was added
> because there are cases where the guest simply can't figure that it is
> not possible to use the full physical address space of the cpu.  There
> are some non-obvious limitations around 5-level paging.  Intel has some
> CPUs with phys-bits > 48 but only 4-level EPT for example.
> 
> So one option to handle this is to make sure guest-phys-bits is not
> larger than the iommu width.

Right, as Cédric notes that's sort of what happens here, but my concern
was that we're kind of incorrectly linking the cpu address width to
the platform address width, which isn't specifically the requirement.
It's valid to have CPU physical address width greater than IOMMU address
width, that exists on bare metal, but guest firmware needs to take
IOMMU address width into account in placing the 64-bit MMIO window if
we want PCI BARs to be DMA addressable.  Thanks,

Alex
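
For reference, the DMAR field Alex mentions: the table is a standard 36-byte
ACPI header followed by a one-byte Host Address Width field encoded as width
minus one (per the VT-d specification). A rough guest-side sketch, with the
struct and function names made up for illustration:

#include <stdint.h>

/* Standard 36-byte ACPI table header. */
struct acpi_table_header {
    char     signature[4];
    uint32_t length;
    uint8_t  revision;
    uint8_t  checksum;
    char     oem_id[6];
    char     oem_table_id[8];
    uint32_t oem_revision;
    char     creator_id[4];
    uint32_t creator_revision;
} __attribute__((packed));

/* DMAR table: ACPI header, then Host Address Width encoded as width - 1. */
struct acpi_table_dmar {
    struct acpi_table_header header;
    uint8_t  host_address_width;
    uint8_t  flags;
    uint8_t  reserved[10];
    /* remapping structures (DRHD, RMRR, ...) follow */
} __attribute__((packed));

/* Returns the IOMMU address width in bits, or 0 if the table is too short. */
static unsigned dmar_host_address_width(const void *table)
{
    const struct acpi_table_dmar *dmar = table;

    return dmar->header.length >= sizeof(*dmar) ?
           dmar->host_address_width + 1u : 0;
}
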
Gerd Hoffmann Feb. 6, 2025, 7:54 a.m. UTC | #6
> > Note that there is a 'guest-phys-bits' property for x86 CPUs, which is a
> > hint for the guest what the usable address width is.  It was added
> > because there are cases where the guest simply can't figure that it is
> > not possible to use the full physical address space of the cpu.  There
> > are some non-obvious limitations around 5-level paging.  Intel has some
> > CPUs with phys-bits > 48 but only 4-level EPT for example.
> > 
> > So one option to handle this is to make sure guest-phys-bits is not
> > larger than the iommu width.
> 
> Yes. This is what I am trying to do.
> 
> Patch [1] returns X86_CPU(cs)->phys_bits. I was not sure which *phys*
> property to use. If you think this is incorrect and not returning the
> right information, I will change the proposal with guest-phys-bits.

> [1] https://lore.kernel.org/qemu-devel/20250130134346.1754143-8-clg@redhat.com/

Yes, guest-phys-bits should be used here (and the helpers renamed too
for consistency, to avoid making all this more complex than it already
is).

take care,
  Gerd
Gerd Hoffmann Feb. 6, 2025, 8:22 a.m. UTC | #7
Hi,

> > Is there some simple way to figure what the iommu width is (inside the
> > guest)?
> 
> If the guest firmware is exposing a DMAR table (VT-d), there's a host
> address width field in that table.  Otherwise there are capability
> registers on the DRHD units that could be queried.  AMD-Vi seems to
> have similar information in the IVinfo table linked from the IVRS
> table.  Maybe an iterative solution is ok, starting with the most
> common IOMMU types and assuming others are 64-bits wide until proven
> otherwise.

Hmm.  The guest firmware simply exposes the tables it gets from qemu
(without parsing them).  Also, the ACPI tables are loaded after the
firmware has created the address space layout, because qemu adjusts them
according to the programming done by the firmware (setting PCI bus ranges
according to bridge windows etc).

So checking ACPI tables for that information doesn't look very
attractive.  The firmware would have to load the tables twice (once
early to parse DMAR, once late for the final tables).

Going for the capability registers might be possible.  Can the hardware
be probed safely?  Does AMD have capability registers too?

A third option would be to use another channel to communicate the
information from qemu to the firmware; 'guest-phys-bits' would be one
candidate.

> Right, as Cédric notes that's sort of what happens here, but my concern
> was that we're kind of incorrectly linking the cpu address width to
> the platform address width, which isn't specifically the requirement.

There is a separate 'guest-phys-bits' property for that reason.  The
phys-bits of the CPU are not changed.

take care,
  Gerd
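
On the capability-register option: per the VT-d specification, each
remapping unit's Capability Register sits at offset 0x08 of its MMIO base
and encodes the Maximum Guest Address Width (MGAW) in bits 21:16 as width
minus one. A minimal sketch, assuming the DRHD register base has already
been found (which itself requires the DMAR table, so the ordering problem
described above remains):

#include <stdint.h>

#define VTD_CAP_REG         0x08   /* Capability Register offset */
#define VTD_CAP_MGAW(cap)   ((unsigned)(((cap) >> 16) & 0x3f) + 1)

/* Read the maximum guest address width of one DRHD remapping unit. */
static unsigned vtd_unit_mgaw(const volatile void *drhd_base)
{
    /* Simplified: a real reader would use a proper 64-bit MMIO accessor. */
    uint64_t cap = *(const volatile uint64_t *)
                   ((const volatile uint8_t *)drhd_base + VTD_CAP_REG);

    return VTD_CAP_MGAW(cap);
}
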
Cédric Le Goater Feb. 6, 2025, 5:05 p.m. UTC | #8
On 2/6/25 08:54, Gerd Hoffmann wrote:
>>> Note that there is a 'guest-phys-bits' property for x86 CPUs, which is a
>>> hint for the guest what the usable address width is.  It was added
>>> because there are cases where the guest simply can't figure that it is
>>> not possible to use the full physical address space of the cpu.  There
>>> are some non-obvious limitations around 5-level paging.  Intel has some
>>> CPUs with phys-bits > 48 but only 4-level EPT for example.
>>>
>>> So one option to handle this is to make sure guest-phys-bits is not
>>> larger than the iommu width.
>>
>> Yes. This is what I am trying to do.
>>
>> Patch [1] returns X86_CPU(cs)->phys_bits. I was not sure which *phys*
>> property to use. If you think this is incorrect and not returning the
>> right information, I will change the proposal with guest-phys-bits.
> 
>> [1] https://lore.kernel.org/qemu-devel/20250130134346.1754143-8-clg@redhat.com/
> 
> Yes, guest-phys-bits should be used here (and the helpers renamed too
> for consistency, to avoid making all this more complex than it already
> is).

yep. I will do that in a dedicated series.

Thanks

C.

Patch

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 5c9d8657d746ce30af5ae8f9122101e086a61ef5..e5ef93c589b2bed68f790608868ec3c7779d4cb8 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -44,6 +44,8 @@ 
 #include "migration/qemu-file.h"
 #include "system/tpm.h"
 
+#include "hw/core/cpu.h"
+
 VFIODeviceList vfio_device_list =
     QLIST_HEAD_INITIALIZER(vfio_device_list);
 static QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces =
@@ -1546,12 +1548,28 @@  retry:
     return info;
 }
 
+static bool vfio_device_check_address_space(VFIODevice *vbasedev, Error **errp)
+{
+    uint32_t cpu_aw_bits = cpu_get_phys_bits(first_cpu);
+    uint32_t iommu_aw_bits = vfio_device_get_aw_bits(vbasedev);
+
+    if (cpu_aw_bits && cpu_aw_bits > iommu_aw_bits) {
+        error_setg(errp, "Host physical address space (%u) is larger than "
+                   "the host IOMMU address space (%u).", cpu_aw_bits,
+                   iommu_aw_bits);
+        vfio_device_error_append(vbasedev, errp);
+        return false;
+    }
+    return true;
+}
+
 bool vfio_attach_device(char *name, VFIODevice *vbasedev,
                         AddressSpace *as, Error **errp)
 {
     const VFIOIOMMUClass *ops =
         VFIO_IOMMU_CLASS(object_class_by_name(TYPE_VFIO_IOMMU_LEGACY));
     HostIOMMUDevice *hiod = NULL;
+    Error *local_err = NULL;
 
     if (vbasedev->iommufd) {
         ops = VFIO_IOMMU_CLASS(object_class_by_name(TYPE_VFIO_IOMMU_IOMMUFD));
@@ -1571,6 +1589,9 @@  bool vfio_attach_device(char *name, VFIODevice *vbasedev,
         return false;
     }
 
+    if (!vfio_device_check_address_space(vbasedev, &local_err)) {
+        warn_report_err(local_err);
+    }
     return true;
 }