[v6,0/7] vfio/type1: Add support for valid iova list management

Message ID 20180524122032.724676b9@w520.home (mailing list archive)
State New, archived

Commit Message

Alex Williamson May 24, 2018, 6:20 p.m. UTC
[Cc +Joerg: AMD-Vi observation towards the end]

On Wed, 18 Apr 2018 12:40:38 +0100
Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> wrote:

> This series introduces an iova list associated with a vfio
> iommu. The list is kept up to date, taking into account iommu
> apertures and reserved regions. This series also adds checks for any
> conflict with existing dma mappings whenever a new device group is
> attached to the domain.
> 
> User-space can retrieve valid iova ranges using VFIO_IOMMU_GET_INFO
> ioctl capability chains. Any dma map request outside the valid iova
> range will be rejected.

Hi Shameer,

I ran into two minor issues in testing this series, both related to
mdev usage of type1.  First, in patch 5/7 when we try to validate a dma
map request:

> +static bool vfio_iommu_iova_dma_valid(struct vfio_iommu *iommu,
> +				dma_addr_t start, dma_addr_t end)
> +{
> +	struct list_head *iova = &iommu->iova_list;
> +	struct vfio_iova *node;
> +
> +	list_for_each_entry(node, iova, list) {
> +		if ((start >= node->start) && (end <= node->end))
> +			return true;
> +	}
> +
> +	return false;
> +}

A container with only an mdev device will have an empty list because it
has no backing iommu to set ranges or reserved regions, so any dma map
will fail.  I think this is resolved as follows:

--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1100,7 +1100,7 @@ static bool vfio_iommu_iova_dma_valid(struct vfio_iommu *iommu,
                        return true;
        }
 
-       return false;
+       return list_empty(&iommu->iova_list);
 }

ie. return false only if there was anything to test against.

The second issue is similar, patch 6/7 adds to VFIO_IOMMU_GET_INFO:

+		ret = vfio_iommu_iova_build_caps(iommu, &caps);
+		if (ret)
+			return ret;

And build_caps has:

+	list_for_each_entry(iova, &iommu->iova_list, list)
+		iovas++;
+
+	if (!iovas) {
+		ret = -EINVAL;

Therefore if the iova list is empty, as for mdevs, the user can no
longer even call VFIO_IOMMU_GET_INFO on the container, which is a
regression.  Again, I think the fix is simple:

@@ -2090,7 +2090,7 @@ static int vfio_iommu_iova_build_caps(struct vfio_iommu *iommu,
                iovas++;
 
        if (!iovas) {
-               ret = -EINVAL;
+               ret = 0;
                goto out_unlock;
        }
 
ie. build_caps needs to handle lack of an iova_list as a non-error.

Also, I wrote a small unit test to validate the iova list for my
systems[1].  With the above changes, my Intel test system gives expected
results:

# ./vfio-type1-iova-list /sys/bus/mdev/devices/c08db5ed-05d3-4b39-b150-438a18bc698f /sys/bus/pci/devices/0000:00:1b.0
---- Adding device: c08db5ed-05d3-4b39-b150-438a18bc698f ----
Initial info struct size: 0x18
No caps
---- Adding device: 0000:00:1b.0 ----
Initial info struct size: 0x18
Requested info struct size: 0x48
New info struct size: 0x48
argsz: 0x48, flags: 0x3, cap_offset: 0x18
	00: 4800 0000 0300 0000 00f0 ffff ffff ffff
	10: 1800 0000 0000 0000 0100 0100 0000 0000
	20: 0200 0000 0000 0000 0000 0000 0000 0000
	30: ffff dffe 0000 0000 0000 f0fe 0000 0000
	40: ffff ffff ffff ff01 
[cap id: 1, version: 1, next: 0x0]
Found type1 iova range version: 1
	00: 0000000000000000 - 00000000fedfffff
	01: 00000000fef00000 - 01ffffffffffffff

Adding an mdev device to the container results in no iova list; adding
the physical device updates it to the expected set with the MSI range
excluded.

I was a little surprised by an AMD system:

# ./vfio-type1-iova-list /sys/bus/pci/devices/0000:01:00.0
---- Adding device: 0000:01:00.0 ----
Initial info struct size: 0x18
Requested info struct size: 0x58
New info struct size: 0x58
argsz: 0x58, flags: 0x3, cap_offset: 0x18
	00: 5800 0000 0300 0000 00f0 ffff 7fff ffff
	10: 1800 0000 0000 0000 0100 0100 0000 0000
	20: 0300 0000 0000 0000 0000 0000 0000 0000
	30: ffff dffe 0000 0000 0000 f0fe 0000 0000
	40: ffff ffff fc00 0000 0000 0000 0001 0000
	50: ffff ffff ffff ffff 
[cap id: 1, version: 1, next: 0x0]
Found type1 iova range version: 1
	00: 0000000000000000 - 00000000fedfffff
	01: 00000000fef00000 - 000000fcffffffff
	02: 0000010000000000 - ffffffffffffffff

The additional excluded range is the HyperTransport area.  For 99.9+%
of use cases that isn't going to be a problem, but it imposes a rather
surprising limit on the size of VM we can use on AMD: the usable space
below the hole ends at 0xfcffffffff, so just under 1TB.  I fully
expect we'll see a bug report from that and we should be thinking about
how to address it.  Otherwise, if the changes above look good to you
I'll incorporate them into their respective patches and push to my next
branch.  Thanks,

Alex

[1] https://gist.github.com/awilliam/25e39252ab28101a2d89ace1ff765fdd
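
The core of such a test is a two-pass VFIO_IOMMU_GET_INFO call followed
by a walk of the capability chain.  Below is a minimal sketch along
those lines; it is not the code from [1], the container setup is
elided, and the struct and macro names follow the uapi proposed in
patch 6/7, so treat them as assumptions rather than final header
contents:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Print the valid iova ranges for an already-configured container fd. */
static void print_iova_ranges(int container)
{
	struct vfio_iommu_type1_info *info;
	struct vfio_info_cap_header *hdr;
	struct vfio_iommu_type1_info_cap_iova_range *cap;
	uint32_t argsz = sizeof(*info), i;

	/* First pass: the kernel reports how large argsz needs to be. */
	info = calloc(1, argsz);
	info->argsz = argsz;
	ioctl(container, VFIO_IOMMU_GET_INFO, info);

	if (info->argsz > argsz) {
		/* Second pass with room for the capability chain. */
		argsz = info->argsz;
		info = realloc(info, argsz);
		memset(info, 0, argsz);
		info->argsz = argsz;
		ioctl(container, VFIO_IOMMU_GET_INFO, info);
	}

	/* An mdev-only container has no iova list, hence no caps. */
	if (!(info->flags & VFIO_IOMMU_INFO_CAPS) || !info->cap_offset) {
		printf("No caps\n");
		free(info);
		return;
	}

	/* Walk the chain; a next offset of 0 terminates it. */
	for (hdr = (void *)((char *)info + info->cap_offset); ;
	     hdr = (void *)((char *)info + hdr->next)) {
		if (hdr->id == VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE) {
			cap = (void *)hdr;
			for (i = 0; i < cap->nr_iovas; i++)
				printf("\t%02u: %016llx - %016llx\n", i,
				       (unsigned long long)cap->iova_ranges[i].start,
				       (unsigned long long)cap->iova_ranges[i].end);
		}
		if (!hdr->next)
			break;
	}
	free(info);
}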

Comments

Shameerali Kolothum Thodi May 25, 2018, 8:45 a.m. UTC | #1
Hi Alex,

> -----Original Message-----
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Thursday, May 24, 2018 7:21 PM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: eric.auger@redhat.com; pmorel@linux.vnet.ibm.com;
> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; iommu@lists.linux-
> foundation.org; Linuxarm <linuxarm@huawei.com>; John Garry
> <john.garry@huawei.com>; xuwei (O) <xuwei5@huawei.com>; Joerg Roedel
> <joro@8bytes.org>
> Subject: Re: [PATCH v6 0/7] vfio/type1: Add support for valid iova list
> management
> 
> [Cc +Joerg: AMD-Vi observation towards the end]
> 
> On Wed, 18 Apr 2018 12:40:38 +0100
> Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> wrote:
> 
> > This series introduces an iova list associated with a vfio
> > iommu. The list is kept up to date, taking into account iommu
> > apertures and reserved regions. This series also adds checks for any
> > conflict with existing dma mappings whenever a new device group is
> > attached to the domain.
> >
> > User-space can retrieve valid iova ranges using VFIO_IOMMU_GET_INFO
> > ioctl capability chains. Any dma map request outside the valid iova
> > range will be rejected.
> 
> Hi Shameer,
> 
> I ran into two minor issues in testing this series, both related to
> mdev usage of type1.  First, in patch 5/7 when we try to validate a dma
> map request:

I must admit I haven't looked into the mdev use case at all; my impression
was that it would be the same as the others. Thanks for doing these tests.

> > +static bool vfio_iommu_iova_dma_valid(struct vfio_iommu *iommu,
> > +				dma_addr_t start, dma_addr_t end)
> > +{
> > +	struct list_head *iova = &iommu->iova_list;
> > +	struct vfio_iova *node;
> > +
> > +	list_for_each_entry(node, iova, list) {
> > +		if ((start >= node->start) && (end <= node->end))
> > +			return true;
> > +	}
> > +
> > +	return false;
> > +}
> 
> A container with only an mdev device will have an empty list because it
> has no backing iommu to set ranges or reserved regions, so any dma map
> will fail.  I think this is resolved as follows:
> 
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -1100,7 +1100,7 @@ static bool vfio_iommu_iova_dma_valid(struct
> vfio_iommu *iommu,
>                         return true;
>         }
> 
> -       return false;
> +       return list_empty(&iommu->iova_list);
>  }

Ok.

> ie. return false only if there was anything to test against.
> 
> The second issue is similar, patch 6/7 adds to VFIO_IOMMU_GET_INFO:
> 
> +		ret = vfio_iommu_iova_build_caps(iommu, &caps);
> +		if (ret)
> +			return ret;
> 
> And build_caps has:
> 
> +	list_for_each_entry(iova, &iommu->iova_list, list)
> +		iovas++;
> +
> +	if (!iovas) {
> +		ret = -EINVAL;
> 
> Therefore if the iova list is empty, as for mdevs, the user can no
> longer even call VFIO_IOMMU_GET_INFO on the container, which is a
> regression.  Again, I think the fix is simple:
> 
> @@ -2090,7 +2090,7 @@ static int vfio_iommu_iova_build_caps(struct
> vfio_iommu *iommu,
>                 iovas++;
> 
>         if (!iovas) {
> -               ret = -EINVAL;
> +               ret = 0;
>                 goto out_unlock;
>         }
> 
> ie. build_caps needs to handle lack of an iova_list as a non-error.

Ok.

> Also, I wrote a small unit test to validate the iova list for my
> systems[1].  With the above changes, my Intel test system gives expected
> results:
> 
> # ./vfio-type1-iova-list /sys/bus/mdev/devices/c08db5ed-05d3-4b39-b150-
> 438a18bc698f /sys/bus/pci/devices/0000:00:1b.0
> ---- Adding device: c08db5ed-05d3-4b39-b150-438a18bc698f ----
> Initial info struct size: 0x18
> No caps
> ---- Adding device: 0000:00:1b.0 ----
> Initial info struct size: 0x18
> Requested info struct size: 0x48
> New info struct size: 0x48
> argsz: 0x48, flags: 0x3, cap_offset: 0x18
> 	00: 4800 0000 0300 0000 00f0 ffff ffff ffff
> 	10: 1800 0000 0000 0000 0100 0100 0000 0000
> 	20: 0200 0000 0000 0000 0000 0000 0000 0000
> 	30: ffff dffe 0000 0000 0000 f0fe 0000 0000
> 	40: ffff ffff ffff ff01
> [cap id: 1, version: 1, next: 0x0]
> Found type1 iova range version: 1
> 	00: 0000000000000000 - 00000000fedfffff
> 	01: 00000000fef00000 - 01ffffffffffffff
> 
> Adding an mdev device to the container results in no iova list; adding
> the physical device updates it to the expected set with the MSI range
> excluded.
> 
> I was a little surprised by an AMD system:
> 
> # ./vfio-type1-iova-list /sys/bus/pci/devices/0000:01:00.0
> ---- Adding device: 0000:01:00.0 ----
> Initial info struct size: 0x18
> Requested info struct size: 0x58
> New info struct size: 0x58
> argsz: 0x58, flags: 0x3, cap_offset: 0x18
> 	00: 5800 0000 0300 0000 00f0 ffff 7fff ffff
> 	10: 1800 0000 0000 0000 0100 0100 0000 0000
> 	20: 0300 0000 0000 0000 0000 0000 0000 0000
> 	30: ffff dffe 0000 0000 0000 f0fe 0000 0000
> 	40: ffff ffff fc00 0000 0000 0000 0001 0000
> 	50: ffff ffff ffff ffff
> [cap id: 1, version: 1, next: 0x0]
> Found type1 iova range version: 1
> 	00: 0000000000000000 - 00000000fedfffff
> 	01: 00000000fef00000 - 000000fcffffffff
> 	02: 0000010000000000 - ffffffffffffffff
> 
> The additional excluded range is the HyperTransport area.  For 99.9+%
> of use cases that isn't going to be a problem, but it imposes a rather
> surprising limit on the size of VM we can use on AMD: the usable space
> below the hole ends at 0xfcffffffff, so just under 1TB.  I fully
> expect we'll see a bug report from that and we should be thinking about
> how to address it.  Otherwise, if the changes above look good to you
> I'll incorporate them into their respective patches and push to my next
> branch.  Thanks,

Yes, the above changes related to the list-empty cases look fine to me.

Much appreciated,
Shameer
 
> Alex
> 
> [1] https://gist.github.com/awilliam/25e39252ab28101a2d89ace1ff765fdd
Alex Williamson May 25, 2018, 8:54 p.m. UTC | #2
On Fri, 25 May 2018 08:45:36 +0000
Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote:

> Yes, the above changes related to the list-empty cases look fine to me.

Thanks Shameer, applied to my next branch with the discussed fixes for
v4.18 (modulo patch 4/7 which Joerg already pushed for v4.17).  Thanks,

Alex
Alex Williamson June 5, 2018, 5:03 p.m. UTC | #3
[Cc +dwmw2]

On Fri, 25 May 2018 14:54:08 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Fri, 25 May 2018 08:45:36 +0000
> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote:
> 
> > > Yes, the above changes related to the list-empty cases look fine to me.
> 
> Thanks Shameer, applied to my next branch with the discussed fixes for
> v4.18 (modulo patch 4/7 which Joerg already pushed for v4.17).  Thanks,

Hi Shameer,

We're still hitting issues with this.  VT-d marks reserved regions for
any RMRR ranges, which are IOVA ranges that the BIOS requests to be
identity mapped for a device.  These are indicated by these sorts of
log entries on boot:

 DMAR: Setting RMRR:
 DMAR: Setting identity map for device 0000:00:02.0 [0xbf800000 - 0xcf9fffff]
 DMAR: Setting identity map for device 0000:00:1a.0 [0xbe8d1000 - 0xbe8dffff]
 DMAR: Setting identity map for device 0000:00:1d.0 [0xbe8d1000 - 0xbe8dffff]

So while for an unaffected device, I see very usable ranges for QEMU,
such as:

	00: 0000000000000000 - 00000000fedfffff
	01: 00000000fef00000 - 01ffffffffffffff

If I try to assign the previously assignable 00:02.0 IGD graphics
device, I get:

	00: 0000000000000000 - 00000000bf7fffff
	01: 00000000cfa00000 - 00000000fedfffff
	02: 00000000fef00000 - 01ffffffffffffff

And we get a fault when QEMU tries the following mapping:

vfio_dma_map(0x55f790421a20, 0x100000, 0xbff00000, 0x7ff163f00000)

A map of size 0xbff00000 at iova 0x100000 clearly extends into the gap
starting at bf800000.  VT-d is rather split-brained about RMRRs:
typically we'd exclude devices that rely on RMRRs from assignment
entirely, and these reserved ranges would be a welcome mechanism to
avoid conflicts with those ranges.  But for RMRR ranges covering IGD
and USB devices we've decided that we don't need to honor the RMRR
(see device_is_rmrr_locked()), yet the range is still listed as
reserved and bites us here.
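
To make the failure mode concrete, here is a minimal sketch of the
containment check userspace could run against the reported ranges
before mapping; the kernel's vfio_iommu_iova_dma_valid() applies the
same logic, and the struct here is a hypothetical userspace mirror of
the ranges from the capability above:

#include <stdbool.h>
#include <stdint.h>

struct iova_range {
	uint64_t start;
	uint64_t end;	/* inclusive, as reported by the capability */
};

/* A map request is valid only if it fits entirely inside one range. */
static bool iova_map_valid(const struct iova_range *ranges, unsigned int nr,
			   uint64_t iova, uint64_t size)
{
	uint64_t last = iova + size - 1;
	unsigned int i;

	for (i = 0; i < nr; i++) {
		if (iova >= ranges[i].start && last <= ranges[i].end)
			return true;
	}
	/* Match the fixed kernel check: an empty list imposes no limit. */
	return nr == 0;
}

Plugging in the faulting request, 0x100000 + 0xbff00000 - 1 = 0xbfffffff,
which overruns the first range's end of 0xbf7fffff, so the map is
refused once the RMRR range is reserved.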

Unless we can get VT-d to exclude RMRRs from reserved regions where
we've already seen fit to allow the devices to participate in the IOMMU
API, this would introduce a functional regression for assignment of
these devices.  David, should we skip GFX and USB devices with RMRRs in
intel_iommu_get_resv_regions() just as we do in
device_is_rmrr_locked()?  Thanks,

Alex
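
For illustration, a hypothetical sketch of what that suggestion could
look like, abbreviated from the 4.17-era intel_iommu_get_resv_regions()
and reusing the IS_GFX_DEVICE()/IS_USB_DEVICE() tests behind
device_is_rmrr_locked(); this is one possible shape only, not
necessarily the fix that was ultimately merged:

static void intel_iommu_get_resv_regions(struct device *device,
					 struct list_head *head)
{
	struct dmar_rmrr_unit *rmrr;
	struct device *i_dev;
	int i;

	rcu_read_lock();
	for_each_rmrr_units(rmrr) {
		for_each_active_dev_scope(rmrr->devices, rmrr->devices_cnt,
					  i, i_dev) {
			if (i_dev != device)
				continue;

			/*
			 * Mirror device_is_rmrr_locked(): IGD and USB
			 * RMRRs are not honored for assignment, so do
			 * not report them as reserved regions either.
			 */
			if (dev_is_pci(device) &&
			    (IS_GFX_DEVICE(to_pci_dev(device)) ||
			     IS_USB_DEVICE(to_pci_dev(device))))
				continue;

			list_add_tail(&rmrr->resv->list, head);
		}
	}
	rcu_read_unlock();

	/* ... IOAPIC/MSI reserved region handling unchanged ... */
}

This would keep the iova list enforceable without regressing IGD/USB
assignment, at the cost of never telling userspace about ranges the
BIOS claimed for those devices.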
Shameerali Kolothum Thodi June 6, 2018, 6:52 a.m. UTC | #4
> -----Original Message-----
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: 05 June 2018 18:03
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: eric.auger@redhat.com; pmorel@linux.vnet.ibm.com;
> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; iommu@lists.linux-
> foundation.org; Linuxarm <linuxarm@huawei.com>; John Garry
> <john.garry@huawei.com>; xuwei (O) <xuwei5@huawei.com>; Joerg Roedel
> <joro@8bytes.org>; David Woodhouse <dwmw2@infradead.org>
> Subject: Re: [PATCH v6 0/7] vfio/type1: Add support for valid iova list
> management
> 
> [Cc +dwmw2]
> 
> On Fri, 25 May 2018 14:54:08 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> > On Fri, 25 May 2018 08:45:36 +0000
> > Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> wrote:
> >
> > > Yes, the above changes related to the list-empty cases look fine to me.
> >
> > Thanks Shameer, applied to my next branch with the discussed fixes for
> > v4.18 (modulo patch 4/7 which Joerg already pushed for v4.17).  Thanks,
> 
> Hi Shameer,
> 
> We're still hitting issues with this.  VT-d marks reserved regions for
> any RMRR ranges, which are IOVA ranges that the BIOS requests to be
> identity mapped for a device.  These are indicated by these sorts of
> log entries on boot:
> 
>  DMAR: Setting RMRR:
>  DMAR: Setting identity map for device 0000:00:02.0 [0xbf800000 - 0xcf9fffff]
>  DMAR: Setting identity map for device 0000:00:1a.0 [0xbe8d1000 - 0xbe8dffff]
>  DMAR: Setting identity map for device 0000:00:1d.0 [0xbe8d1000 - 0xbe8dffff]
> 
> So while for an unaffected device, I see very usable ranges for QEMU,
> such as:
> 
> 	00: 0000000000000000 - 00000000fedfffff
> 	01: 00000000fef00000 - 01ffffffffffffff
> 
> If I try to assign the previously assignable 00:02.0 IGD graphics
> device, I get:
> 
> 	00: 0000000000000000 - 00000000bf7fffff
> 	01: 00000000cfa00000 - 00000000fedfffff
> 	02: 00000000fef00000 - 01ffffffffffffff
> 
> And we get a fault when QEMU tries the following mapping:
> 
> vfio_dma_map(0x55f790421a20, 0x100000, 0xbff00000, 0x7ff163f00000)
> 
> A map of size 0xbff00000 at iova 0x100000 clearly extends into the gap
> starting at bf800000.  VT-d is rather split-brained about RMRRs:
> typically we'd exclude devices that rely on RMRRs from assignment
> entirely, and these reserved ranges would be a welcome mechanism to
> avoid conflicts with those ranges.  But for RMRR ranges covering IGD
> and USB devices we've decided that we don't need to honor the RMRR
> (see device_is_rmrr_locked()), yet the range is still listed as
> reserved and bites us here.

Ah.. :(. This looks similar to the PCI window range reporting issue
faced in the ARM world. I see the RFC you have sent out to ignore these
"known" RMRRs; hopefully we will be able to resolve this soon.

Thanks,
Shameer

> Unless we can get VT-d to exclude RMRRs from reserved regions where
> we've already seen fit to allow the devices to participate in the IOMMU
> API, this would introduce a functional regression for assignment of
> these devices.  David, should we skip GFX and USB devices with RMRRs in
> intel_iommu_get_resv_regions() just as we do in
> device_is_rmrr_locked()?  Thanks,
> 
> Alex
Alex Williamson June 7, 2018, 3:05 p.m. UTC | #5
On Wed, 6 Jun 2018 06:52:08 +0000
Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote:

> > -----Original Message-----
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: 05 June 2018 18:03
> > To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> > Cc: eric.auger@redhat.com; pmorel@linux.vnet.ibm.com;
> > kvm@vger.kernel.org; linux-kernel@vger.kernel.org; iommu@lists.linux-
> > foundation.org; Linuxarm <linuxarm@huawei.com>; John Garry
> > <john.garry@huawei.com>; xuwei (O) <xuwei5@huawei.com>; Joerg Roedel
> > <joro@8bytes.org>; David Woodhouse <dwmw2@infradead.org>
> > Subject: Re: [PATCH v6 0/7] vfio/type1: Add support for valid iova list
> > management
> > 
> > [Cc +dwmw2]
> > 
> > On Fri, 25 May 2018 14:54:08 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> >   
> > > On Fri, 25 May 2018 08:45:36 +0000
> > > Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>  
> > wrote:  
> > >  
> > > > Yes, the above changes related to the list-empty cases look fine to me.
> > >
> > > Thanks Shameer, applied to my next branch with the discussed fixes for
> > > v4.18 (modulo patch 4/7 which Joerg already pushed for v4.17).  Thanks,  
> > 
> > Hi Shameer,
> > 
> > We're still hitting issues with this.  VT-d marks reserved regions for
> > any RMRR ranges, which are IOVA ranges that the BIOS requests to be
> > identity mapped for a device.  These are indicated by these sorts of
> > log entries on boot:
> > 
> >  DMAR: Setting RMRR:
> >  DMAR: Setting identity map for device 0000:00:02.0 [0xbf800000 - 0xcf9fffff]
> >  DMAR: Setting identity map for device 0000:00:1a.0 [0xbe8d1000 - 0xbe8dffff]
> >  DMAR: Setting identity map for device 0000:00:1d.0 [0xbe8d1000 - 0xbe8dffff]
> > 
> > So while for an unaffected device, I see very usable ranges for QEMU,
> > such as:
> > 
> > 	00: 0000000000000000 - 00000000fedfffff
> > 	01: 00000000fef00000 - 01ffffffffffffff
> > 
> > If I try to assign the previously assignable 00:02.0 IGD graphics
> > device, I get:
> > 
> > 	00: 0000000000000000 - 00000000bf7fffff
> > 	01: 00000000cfa00000 - 00000000fedfffff
> > 	02: 00000000fef00000 - 01ffffffffffffff
> > 
> > And we get a fault when QEMU tries the following mapping:
> > 
> > vfio_dma_map(0x55f790421a20, 0x100000, 0xbff00000, 0x7ff163f00000)
> > 
> > A map of size 0xbff00000 at iova 0x100000 clearly extends into the gap
> > starting at bf800000.  VT-d is rather split-brained about RMRRs:
> > typically we'd exclude devices that rely on RMRRs from assignment
> > entirely, and these reserved ranges would be a welcome mechanism to
> > avoid conflicts with those ranges.  But for RMRR ranges covering IGD
> > and USB devices we've decided that we don't need to honor the RMRR
> > (see device_is_rmrr_locked()), yet the range is still listed as
> > reserved and bites us here.
> 
> Ah.. :(. This looks similar to the PCI window range reporting issue
> faced in the ARM world. I see the RFC you have sent out to ignore these
> "known" RMRRs; hopefully we will be able to resolve this soon.

In the meantime, it seems that I need to drop the iova list from my
branch for v4.18.  I don't think it makes much sense to expose the list
to userspace if we cannot also enforce it, and exposing it without
enforcement only to enforce it later seems to present API issues.
Sorry I didn't catch this earlier.  Thanks,

Alex