
[0/3] VIRTIO-IOMMU: Introduce an aw-bits option

Message ID 20240123181753.413961-1-eric.auger@redhat.com (mailing list archive)

Message

Eric Auger Jan. 23, 2024, 6:15 p.m. UTC
In [1] and [2] we attempted to fix a case where a VFIO-PCI device
protected with a virtio-iommu is assigned to an x86 guest. On x86
the physical IOMMU may have an address width (gaw) of 39 or 48 bits
whereas the virtio-iommu exposes a 64b input address space by default.
Hence the guest may try to use the full 64b space and DMA MAP
failures may be encountered. To work around this issue we endeavoured
to pass the usable host IOVA regions (excluding the out-of-range space)
from VFIO to the virtio-iommu device, so that the virtio-iommu driver
can query them during the probe request and let the guest iommu
kernel subsystem carve them out.

However, if there are several devices in the same iommu group,
only the reserved regions of the first one are taken into
account by the guest iommu subsystem. This generally works on
bare metal because devices in a group do not expose different
reserved regions. In our case, however, it may prevent the host
iommu geometry from being taken into account.

So the simplest solution to this problem looks to be to introduce
an input address width option, aw-bits, matching what is done for
the intel-iommu. From now on it defaults to 39 bits with pc_q35 and
to 64b with arm virt, replacing the previous default of 64b. A compat
entry is therefore needed so that pc_q35 machines older than 9.0 keep
the previous behaviour.
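
For illustration, assuming the new property ends up exposed on the
virtio-iommu-pci device the same way aw-bits is on intel-iommu, the
default could be overridden on the command line with something like:

  -device virtio-iommu-pci,aw-bits=48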

Outstanding series [2] remains useful to let reserved regions be
communicated in time, before the probe request.

[1] [PATCH v4 00/12] VIRTIO-IOMMU/VFIO: Don't assume 64b IOVA space
    https://lore.kernel.org/all/20231019134651.842175-1-eric.auger@redhat.com/
    - This is merged -

[2] [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices
    https://lore.kernel.org/all/20240117080414.316890-1-eric.auger@redhat.com/
    - This is pending review on the ML -

This series can be found at:
https://github.com/eauger/qemu/tree/virtio-iommu-aw-bits-v1

Applied on top of [3]
[PATCH v2] virtio-iommu: Use qemu_real_host_page_mask as default page_size_mask
https://lore.kernel.org/all/20240117132039.332273-1-eric.auger@redhat.com/

Eric Auger (3):
  virtio-iommu: Add an option to define the input range width
  virtio-iommu: Trace domain range limits as unsigned int
  hw/pc: Set the default virtio-iommu aw-bits to 39 on pc_q35_9.0
    onwards

 include/hw/virtio/virtio-iommu.h |  1 +
 hw/arm/virt.c                    |  6 ++++++
 hw/i386/pc.c                     | 10 +++++++++-
 hw/virtio/virtio-iommu.c         |  4 +++-
 hw/virtio/trace-events           |  2 +-
 5 files changed, 20 insertions(+), 3 deletions(-)

Comments

Jean-Philippe Brucker Jan. 29, 2024, 12:23 p.m. UTC | #1
Hi Eric,

On Tue, Jan 23, 2024 at 07:15:54PM +0100, Eric Auger wrote:
> In [1] and [2] we attempted to fix a case where a VFIO-PCI device
> protected with a virtio-iommu is assigned to an x86 guest. On x86
> the physical IOMMU may have an address width (gaw) of 39 or 48 bits
> whereas the virtio-iommu exposes a 64b input address space by default.
> Hence the guest may try to use the full 64b space and DMA MAP
> failures may be encountered. To work around this issue we endeavoured
> to pass usable host IOVA regions (excluding the out of range space) from
> VFIO to the virtio-iommu device so that the virtio-iommu driver can
> query those latter during the probe request and let the guest iommu
> kernel subsystem carve them out.
> 
> However if there are several devices in the same iommu group,
> only the reserved regions of the first one are taken into
> account by the iommu subsystem of the guest. This generally
> works on baremetal because devices are not going to
> expose different reserved regions. However in our case, this
> may prevent from taking into account the host iommu geometry.
> 
> So the simplest solution to this problem looks to introduce an
> input address width option, aw-bits, which matches what is
> done on the intel-iommu. By default, from now on it is set
> to 39 bits with pc_q35 and 64b with arm virt.

Doesn't Arm have the same problem?  The TTB0 page tables limit what can be
mapped to 48-bit, or 52-bit when SMMU_IDR5.VAX==1 and granule is 64kB.
A Linux host driver could configure smaller VA sizes:
* SMMUv2 limits the VA to SMMU_IDR2.UBS (upstream bus size) which
  can go as low as 32-bit (I'm assuming we don't care about 32-bit hosts).
* SMMUv3 currently limits the VA to CONFIG_ARM64_VA_BITS, which
  could be as low as 36 bits (but realistically 39, since 36 depends on
  16kB pages and CONFIG_EXPERT).

But 64-bit definitely can't work for VFIO, and I suppose isn't useful for
virtual devices, so maybe 39 is also a reasonable default on Arm.

Thanks,
Jean

> This replaces the
> previous default value of 64b. So we need to introduce a compat
> for pc_q35 machines older than 9.0 to behave similarly.
Eric Auger Jan. 29, 2024, 2:07 p.m. UTC | #2
Hi Jean-Philippe,

On 1/29/24 13:23, Jean-Philippe Brucker wrote:
> Hi Eric,
>
> On Tue, Jan 23, 2024 at 07:15:54PM +0100, Eric Auger wrote:
>> In [1] and [2] we attempted to fix a case where a VFIO-PCI device
>> protected with a virtio-iommu is assigned to an x86 guest. On x86
>> the physical IOMMU may have an address width (gaw) of 39 or 48 bits
>> whereas the virtio-iommu exposes a 64b input address space by default.
>> Hence the guest may try to use the full 64b space and DMA MAP
>> failures may be encountered. To work around this issue we endeavoured
>> to pass usable host IOVA regions (excluding the out of range space) from
>> VFIO to the virtio-iommu device so that the virtio-iommu driver can
>> query those latter during the probe request and let the guest iommu
>> kernel subsystem carve them out.
>>
>> However if there are several devices in the same iommu group,
>> only the reserved regions of the first one are taken into
>> account by the iommu subsystem of the guest. This generally
>> works on baremetal because devices are not going to
>> expose different reserved regions. However in our case, this
>> may prevent from taking into account the host iommu geometry.
>>
>> So the simplest solution to this problem looks to introduce an
>> input address width option, aw-bits, which matches what is
>> done on the intel-iommu. By default, from now on it is set
>> to 39 bits with pc_q35 and 64b with arm virt.
> Doesn't Arm have the same problem?  The TTB0 page tables limit what can be
> mapped to 48-bit, or 52-bit when SMMU_IDR5.VAX==1 and granule is 64kB.
> A Linux host driver could configure smaller VA sizes:
> * SMMUv2 limits the VA to SMMU_IDR2.UBS (upstream bus size) which
>   can go as low as 32-bit (I'm assuming we don't care about 32-bit hosts).
Yes I think we can ignore that use case.
> * SMMUv3 currently limits the VA to CONFIG_ARM64_VA_BITS, which
>   could be as low as 36 bits (but realistically 39, since 36 depends on
>   16kB pages and CONFIG_EXPERT).
Further reading "3.4.1 Input address size and Virtual Address size" ooks
indeed SMMU_IDR5.VAX gives info on the physical SMMU actual
implementation max (which matches intel iommu gaw). I missed that. Now I
am confused about should we limit VAS to 39 to accomodate of the worst
case host SW configuration or shall we use 48 instead? If we set such a
low 39b value, won't it prevent some guests from properly working?
Thanks Eric
>
> But 64-bit definitely can't work for VFIO, and I suppose isn't useful for
> virtual devices, so maybe 39 is also a reasonable default on Arm.
>
> Thanks,
> Jean
>
>> This replaces the
>> previous default value of 64b. So we need to introduce a compat
>> for pc_q35 machines older than 9.0 to behave similarly.
Jean-Philippe Brucker Jan. 29, 2024, 5:42 p.m. UTC | #3
On Mon, Jan 29, 2024 at 03:07:41PM +0100, Eric Auger wrote:
> Hi Jean-Philippe,
> 
> On 1/29/24 13:23, Jean-Philippe Brucker wrote:
> > Hi Eric,
> >
> > On Tue, Jan 23, 2024 at 07:15:54PM +0100, Eric Auger wrote:
> >> In [1] and [2] we attempted to fix a case where a VFIO-PCI device
> >> protected with a virtio-iommu is assigned to an x86 guest. On x86
> >> the physical IOMMU may have an address width (gaw) of 39 or 48 bits
> >> whereas the virtio-iommu exposes a 64b input address space by default.
> >> Hence the guest may try to use the full 64b space and DMA MAP
> >> failures may be encountered. To work around this issue we endeavoured
> >> to pass usable host IOVA regions (excluding the out of range space) from
> >> VFIO to the virtio-iommu device so that the virtio-iommu driver can
> >> query those latter during the probe request and let the guest iommu
> >> kernel subsystem carve them out.
> >>
> >> However if there are several devices in the same iommu group,
> >> only the reserved regions of the first one are taken into
> >> account by the iommu subsystem of the guest. This generally
> >> works on baremetal because devices are not going to
> >> expose different reserved regions. However in our case, this
> >> may prevent from taking into account the host iommu geometry.
> >>
> >> So the simplest solution to this problem looks to introduce an
> >> input address width option, aw-bits, which matches what is
> >> done on the intel-iommu. By default, from now on it is set
> >> to 39 bits with pc_q35 and 64b with arm virt.
> > Doesn't Arm have the same problem?  The TTB0 page tables limit what can be
> > mapped to 48-bit, or 52-bit when SMMU_IDR5.VAX==1 and granule is 64kB.
> > A Linux host driver could configure smaller VA sizes:
> > * SMMUv2 limits the VA to SMMU_IDR2.UBS (upstream bus size) which
> >   can go as low as 32-bit (I'm assuming we don't care about 32-bit hosts).
> Yes I think we can ignore that use case.
> > * SMMUv3 currently limits the VA to CONFIG_ARM64_VA_BITS, which
> >   could be as low as 36 bits (but realistically 39, since 36 depends on
> >   16kB pages and CONFIG_EXPERT).
> Further reading "3.4.1 Input address size and Virtual Address size", it
> looks indeed like SMMU_IDR5.VAX gives the actual implementation max of
> the physical SMMU (which matches the intel-iommu gaw). I missed that.
> Now I am confused about whether we should limit the VAS to 39 to
> accommodate the worst case host SW configuration, or whether we should
> use 48 instead?

I don't know what's best either. 48 should be fine if hosts normally
enable VA_BITS_48 (I see debian has it [1], not sure how to find the
others).

[1] https://salsa.debian.org/kernel-team/linux/-/blob/master/debian/config/arm64/config?ref_type=heads#L18

> If we set such a low 39b value, won't it prevent some guests from
> properly working?

It's not that low, since it gives each endpoint a private 512GB address
space, but yes there might be special cases that reach the limit. Maybe
assign a multi-queue NIC to a 256-vCPU guest, and if you want per-vCPU DMA
pools, then with a 39-bit address space you only get 2GB per vCPU. With
48-bit you get 1TB which should be plenty.
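
(For reference, the arithmetic: a 39-bit space is 2^39 B = 512 GiB, so
512 GiB / 256 vCPUs = 2 GiB per vCPU; a 48-bit space is 2^48 B = 256 TiB,
i.e. 1 TiB per vCPU.)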

52-bit private IOVA space doesn't seem useful, I doubt we'll ever need to
support that on the MAP/UNMAP interface.

So I guess 48-bit can be the default, and users with special setups can
override aw-bits.

Thanks,
Jean
Eric Auger Jan. 29, 2024, 5:56 p.m. UTC | #4
Hi Jean,

On 1/29/24 18:42, Jean-Philippe Brucker wrote:
> On Mon, Jan 29, 2024 at 03:07:41PM +0100, Eric Auger wrote:
>> Hi Jean-Philippe,
>>
>> On 1/29/24 13:23, Jean-Philippe Brucker wrote:
>>> Hi Eric,
>>>
>>> On Tue, Jan 23, 2024 at 07:15:54PM +0100, Eric Auger wrote:
>>>> In [1] and [2] we attempted to fix a case where a VFIO-PCI device
>>>> protected with a virtio-iommu is assigned to an x86 guest. On x86
>>>> the physical IOMMU may have an address width (gaw) of 39 or 48 bits
>>>> whereas the virtio-iommu exposes a 64b input address space by default.
>>>> Hence the guest may try to use the full 64b space and DMA MAP
>>>> failures may be encountered. To work around this issue we endeavoured
>>>> to pass usable host IOVA regions (excluding the out of range space) from
>>>> VFIO to the virtio-iommu device so that the virtio-iommu driver can
>>>> query those latter during the probe request and let the guest iommu
>>>> kernel subsystem carve them out.
>>>>
>>>> However if there are several devices in the same iommu group,
>>>> only the reserved regions of the first one are taken into
>>>> account by the iommu subsystem of the guest. This generally
>>>> works on baremetal because devices are not going to
>>>> expose different reserved regions. However in our case, this
>>>> may prevent from taking into account the host iommu geometry.
>>>>
>>>> So the simplest solution to this problem looks to introduce an
>>>> input address width option, aw-bits, which matches what is
>>>> done on the intel-iommu. By default, from now on it is set
>>>> to 39 bits with pc_q35 and 64b with arm virt.
>>> Doesn't Arm have the same problem?  The TTB0 page tables limit what can be
>>> mapped to 48-bit, or 52-bit when SMMU_IDR5.VAX==1 and granule is 64kB.
>>> A Linux host driver could configure smaller VA sizes:
>>> * SMMUv2 limits the VA to SMMU_IDR2.UBS (upstream bus size) which
>>>   can go as low as 32-bit (I'm assuming we don't care about 32-bit hosts).
>> Yes I think we can ignore that use case.
>>> * SMMUv3 currently limits the VA to CONFIG_ARM64_VA_BITS, which
>>>   could be as low as 36 bits (but realistically 39, since 36 depends on
>>>   16kB pages and CONFIG_EXPERT).
>> Further reading "3.4.1 Input address size and Virtual Address size", it
>> looks indeed like SMMU_IDR5.VAX gives the actual implementation max of
>> the physical SMMU (which matches the intel-iommu gaw). I missed that.
>> Now I am confused about whether we should limit the VAS to 39 to
>> accommodate the worst case host SW configuration, or whether we should
>> use 48 instead?
> I don't know what's best either. 48 should be fine if hosts normally
> enable VA_BITS_48 (I see debian has it [1], not sure how to find the
> others).
>
> [1] https://salsa.debian.org/kernel-team/linux/-/blob/master/debian/config/arm64/config?ref_type=heads#L18
>
>> If we set such a low 39b value, won't it prevent some guests from
>> properly working?
> It's not that low, since it gives each endpoint a private 512GB address
> space, but yes there might be special cases that reach the limit. Maybe
> assign a multi-queue NIC to a 256-vCPU guest, and if you want per-vCPU DMA
> pools, then with a 39-bit address space you only get 2GB per vCPU. With
> 48-bit you get 1TB which should be plenty.
>
> 52-bit private IOVA space doesn't seem useful, I doubt we'll ever need to
> support that on the MAP/UNMAP interface.
>
> So I guess 48-bit can be the default, and users with special setups can
> override aw-bits.

Yes, that looks safe on my side too. I will respin with a 48b default then.

Thank you!

Eric
>
> Thanks,
> Jean
>