diff mbox series

[v3,07/17] iommufd: Add IOMMU_RESV_IOVA_RANGES

Message ID 20230724110406.107212-8-yi.l.liu@intel.com (mailing list archive)
State New
Headers show
Series iommufd: Add nesting infrastructure | expand

Commit Message

Yi Liu July 24, 2023, 11:03 a.m. UTC
This reports device's reserved IOVA regions to userspace. This is needed
in the nested translation as userspace owns stage-1 HWPT, and userspace
needs to exclude the reserved IOVA regions in the stage-1 HWPT hence exclude
them in the device's DMA address space.

This can also be used to figure out allowed IOVAs of an IOAS.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/main.c | 54 ++++++++++++++++++++++++++++++++++++
 include/uapi/linux/iommufd.h | 46 ++++++++++++++++++++++++++++++
 2 files changed, 100 insertions(+)

Comments

Tian, Kevin July 28, 2023, 10:07 a.m. UTC | #1
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Monday, July 24, 2023 7:04 PM
> 
> This reports device's reserved IOVA regions to userspace. This is needed
> in the nested translation as userspace owns stage-1 HWPT, and userspace
> needs to exclude the reserved IOVA regions in the stage-1 HWPT hence
> exclude
> them in the device's DMA address space.
> 
> This can also be used to figure out allowed IOVAs of an IOAS.

We may need a special type to mark SW_MSI since it requires identity
mapping in stage-1 instead of being reserved.
Jason Gunthorpe July 28, 2023, 5:16 p.m. UTC | #2
On Fri, Jul 28, 2023 at 10:07:58AM +0000, Tian, Kevin wrote:
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Monday, July 24, 2023 7:04 PM
> > 
> > This reports device's reserved IOVA regions to userspace. This is needed
> > in the nested translation as userspace owns stage-1 HWPT, and userspace
> > needs to exclude the reserved IOVA regions in the stage-1 HWPT hence
> > exclude
> > them in the device's DMA address space.
> > 
> > This can also be used to figure out allowed IOVAs of an IOAS.
> 
> We may need a special type to mark SW_MSI since it requires identity
> mapping in stage-1 instead of being reserved.

Only the kernel can do this, so there is no action for user space to
take beyond knowing that is is not mappable IOVA.

The merit for "SW_MSI" may be to inform the rest of the system about
the IOVA of the ITS page, but with the current situation that isn't
required since only the kernel needs that information.

I think the long term way forward is to somehow arrange for the SW_MSI
to not become mapped when creating the parent HWPT and instead cause
the ITS page to be mapped through some explicit IOCTL.

Jason
Jason Gunthorpe July 28, 2023, 5:44 p.m. UTC | #3
On Mon, Jul 24, 2023 at 04:03:56AM -0700, Yi Liu wrote:

> +/**
> + * struct iommu_resv_iova_ranges - ioctl(IOMMU_RESV_IOVA_RANGES)
> + * @size: sizeof(struct iommu_resv_iova_ranges)
> + * @dev_id: device to read resv iova ranges for
> + * @num_iovas: Input/Output total number of resv ranges for the device
> + * @__reserved: Must be 0
> + * @resv_iovas: Pointer to the output array of struct iommu_resv_iova_range
> + *
> + * Query a device for ranges of reserved IOVAs. num_iovas will be set to the
> + * total number of iovas and the resv_iovas[] will be filled in as space
> + * permits.
> + *
> + * On input num_iovas is the length of the resv_iovas array. On output it is
> + * the total number of iovas filled in. The ioctl will return -EMSGSIZE and
> + * set num_iovas to the required value if num_iovas is too small. In this
> + * case the caller should allocate a larger output array and re-issue the
> + * ioctl.
> + *
> + * Under nested translation, userspace should query the reserved IOVAs for a
> + * given device, and report it to the stage-1 I/O page table owner to exclude
> + * the reserved IOVAs. The reserved IOVAs can also be used to figure out the
> + * allowed IOVA ranges for the IOAS that the device is attached to. For detail
> + * see ioctl IOMMU_IOAS_IOVA_RANGES.

I'm not sure I like this, the other APIs here work with the *allowed*
IOVAs, which is the inverse of this one which works with the
*disallowed* IOVAs.

It means we can't take the output of this API and feed it into
IOMMUFD_CMD_IOAS_ALLOW_IOVAS.. Though I suppose qemu isn't going to do
this anyhow.

On the other hand, it is kind of hard to intersect an allowed list of
multiple idevs into a single master list.

As it is, userspace will have to aggregate the list, sort it, merge
adjacent overlapping reserved ranges then invert the list to get an
allowed list. This is not entirely simple..

Did you already write an algorithm to do this in qemu someplace?

Anyhow, this should be split out from this series. It seems simple
enough to merge it now if someone can confirm what qemu needs.

Jason
Tian, Kevin July 31, 2023, 6:14 a.m. UTC | #4
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, July 29, 2023 1:17 AM
> 
> On Fri, Jul 28, 2023 at 10:07:58AM +0000, Tian, Kevin wrote:
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Monday, July 24, 2023 7:04 PM
> > >
> > > This reports device's reserved IOVA regions to userspace. This is needed
> > > in the nested translation as userspace owns stage-1 HWPT, and
> userspace
> > > needs to exclude the reserved IOVA regions in the stage-1 HWPT hence
> > > exclude
> > > them in the device's DMA address space.
> > >
> > > This can also be used to figure out allowed IOVAs of an IOAS.
> >
> > We may need a special type to mark SW_MSI since it requires identity
> > mapping in stage-1 instead of being reserved.
> 
> Only the kernel can do this, so there is no action for user space to
> take beyond knowing that is is not mappable IOVA.
> 
> The merit for "SW_MSI" may be to inform the rest of the system about
> the IOVA of the ITS page, but with the current situation that isn't
> required since only the kernel needs that information.

IIUC guest kernel needs to know the "SW_MSI" region and then setup an
1:1 mapping for it in S1.

So Qemu needs to know and pass this information to the guest?

> 
> I think the long term way forward is to somehow arrange for the SW_MSI
> to not become mapped when creating the parent HWPT and instead cause
> the ITS page to be mapped through some explicit IOCTL.
> 

yes this is a cleaner approach. Qemu selects the intermediate address of
vITS page and maps it to physical ITS page in S2. Then the guest kernel
just pick whatever "SW_MSI" address in S1 to vITS as it does today on 
bare metal.
Tian, Kevin July 31, 2023, 6:21 a.m. UTC | #5
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, July 29, 2023 1:45 AM
> 
> On Mon, Jul 24, 2023 at 04:03:56AM -0700, Yi Liu wrote:
> 
> > +/**
> > + * struct iommu_resv_iova_ranges - ioctl(IOMMU_RESV_IOVA_RANGES)
> > + * @size: sizeof(struct iommu_resv_iova_ranges)
> > + * @dev_id: device to read resv iova ranges for
> > + * @num_iovas: Input/Output total number of resv ranges for the device
> > + * @__reserved: Must be 0
> > + * @resv_iovas: Pointer to the output array of struct
> iommu_resv_iova_range
> > + *
> > + * Query a device for ranges of reserved IOVAs. num_iovas will be set to
> the
> > + * total number of iovas and the resv_iovas[] will be filled in as space
> > + * permits.
> > + *
> > + * On input num_iovas is the length of the resv_iovas array. On output it is
> > + * the total number of iovas filled in. The ioctl will return -EMSGSIZE and
> > + * set num_iovas to the required value if num_iovas is too small. In this
> > + * case the caller should allocate a larger output array and re-issue the
> > + * ioctl.
> > + *
> > + * Under nested translation, userspace should query the reserved IOVAs
> for a
> > + * given device, and report it to the stage-1 I/O page table owner to
> exclude
> > + * the reserved IOVAs. The reserved IOVAs can also be used to figure out
> the
> > + * allowed IOVA ranges for the IOAS that the device is attached to. For
> detail
> > + * see ioctl IOMMU_IOAS_IOVA_RANGES.
> 
> I'm not sure I like this, the other APIs here work with the *allowed*
> IOVAs, which is the inverse of this one which works with the
> *disallowed* IOVAs.
> 
> It means we can't take the output of this API and feed it into
> IOMMUFD_CMD_IOAS_ALLOW_IOVAS.. Though I suppose qemu isn't going
> to do
> this anyhow.
> 
> On the other hand, it is kind of hard to intersect an allowed list of
> multiple idevs into a single master list.
> 
> As it is, userspace will have to aggregate the list, sort it, merge
> adjacent overlapping reserved ranges then invert the list to get an
> allowed list. This is not entirely simple..
> 
> Did you already write an algorithm to do this in qemu someplace?

Qemu is optional to aggregate it for S2 given IOMMU_IOAS_IOVA_RANGES
is still being used. If the only purpose of using this new cmd is to report
per-device reserved ranges to the guest then aggregation is not required.

Arguably IOMMU_IOAS_IOVA_RANGES becomes redundant with this
new cmd. But it's already there and as you said it's actually more
convenient to be used if the user doesn't care about per-device
reserved ranges...

> 
> Anyhow, this should be split out from this series. It seems simple
> enough to merge it now if someone can confirm what qemu needs.
> 
> Jason
Yi Liu July 31, 2023, 9:53 a.m. UTC | #6
> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Monday, July 31, 2023 2:22 PM
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Saturday, July 29, 2023 1:45 AM
> >
> > On Mon, Jul 24, 2023 at 04:03:56AM -0700, Yi Liu wrote:
> >
> > > +/**
> > > + * struct iommu_resv_iova_ranges - ioctl(IOMMU_RESV_IOVA_RANGES)
> > > + * @size: sizeof(struct iommu_resv_iova_ranges)
> > > + * @dev_id: device to read resv iova ranges for
> > > + * @num_iovas: Input/Output total number of resv ranges for the device
> > > + * @__reserved: Must be 0
> > > + * @resv_iovas: Pointer to the output array of struct
> > iommu_resv_iova_range
> > > + *
> > > + * Query a device for ranges of reserved IOVAs. num_iovas will be set to
> > the
> > > + * total number of iovas and the resv_iovas[] will be filled in as space
> > > + * permits.
> > > + *
> > > + * On input num_iovas is the length of the resv_iovas array. On output it is
> > > + * the total number of iovas filled in. The ioctl will return -EMSGSIZE and
> > > + * set num_iovas to the required value if num_iovas is too small. In this
> > > + * case the caller should allocate a larger output array and re-issue the
> > > + * ioctl.
> > > + *
> > > + * Under nested translation, userspace should query the reserved IOVAs
> > for a
> > > + * given device, and report it to the stage-1 I/O page table owner to
> > exclude
> > > + * the reserved IOVAs. The reserved IOVAs can also be used to figure out
> > the
> > > + * allowed IOVA ranges for the IOAS that the device is attached to. For
> > detail
> > > + * see ioctl IOMMU_IOAS_IOVA_RANGES.
> >
> > I'm not sure I like this, the other APIs here work with the *allowed*
> > IOVAs, which is the inverse of this one which works with the
> > *disallowed* IOVAs.
> >
> > It means we can't take the output of this API and feed it into
> > IOMMUFD_CMD_IOAS_ALLOW_IOVAS.. Though I suppose qemu isn't going
> > to do
> > this anyhow.
> >
> > On the other hand, it is kind of hard to intersect an allowed list of
> > multiple idevs into a single master list.
> >
> > As it is, userspace will have to aggregate the list, sort it, merge
> > adjacent overlapping reserved ranges then invert the list to get an
> > allowed list. This is not entirely simple..
> >
> > Did you already write an algorithm to do this in qemu someplace?
> 
> Qemu is optional to aggregate it for S2 given IOMMU_IOAS_IOVA_RANGES
> is still being used. If the only purpose of using this new cmd is to report
> per-device reserved ranges to the guest then aggregation is not required.
> 
> Arguably IOMMU_IOAS_IOVA_RANGES becomes redundant with this
> new cmd. But it's already there and as you said it's actually more
> convenient to be used if the user doesn't care about per-device
> reserved ranges...

Yes. it's not simple as userspace needs to do a lot  ofwork to get the allowed
iovas if multiple devices are attached to an IOAS.

> > Anyhow, this should be split out from this series. It seems simple
> > enough to merge it now if someone can confirm what qemu needs.

Ok, so the reason is this new ioctl can be used to figure out allowd iovas.
Right?

Regards,
Yi Liu
Jason Gunthorpe July 31, 2023, 1:12 p.m. UTC | #7
On Mon, Jul 31, 2023 at 06:14:50AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Saturday, July 29, 2023 1:17 AM
> > 
> > On Fri, Jul 28, 2023 at 10:07:58AM +0000, Tian, Kevin wrote:
> > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > Sent: Monday, July 24, 2023 7:04 PM
> > > >
> > > > This reports device's reserved IOVA regions to userspace. This is needed
> > > > in the nested translation as userspace owns stage-1 HWPT, and
> > userspace
> > > > needs to exclude the reserved IOVA regions in the stage-1 HWPT hence
> > > > exclude
> > > > them in the device's DMA address space.
> > > >
> > > > This can also be used to figure out allowed IOVAs of an IOAS.
> > >
> > > We may need a special type to mark SW_MSI since it requires identity
> > > mapping in stage-1 instead of being reserved.
> > 
> > Only the kernel can do this, so there is no action for user space to
> > take beyond knowing that is is not mappable IOVA.
> > 
> > The merit for "SW_MSI" may be to inform the rest of the system about
> > the IOVA of the ITS page, but with the current situation that isn't
> > required since only the kernel needs that information.
> 
> IIUC guest kernel needs to know the "SW_MSI" region and then setup an
> 1:1 mapping for it in S1.

Yes, but qemu hardcodes this and for some reason people thought that
was a good idea back when.

> > I think the long term way forward is to somehow arrange for the SW_MSI
> > to not become mapped when creating the parent HWPT and instead cause
> > the ITS page to be mapped through some explicit IOCTL.
> 
> yes this is a cleaner approach. Qemu selects the intermediate address of
> vITS page and maps it to physical ITS page in S2. Then the guest kernel
> just pick whatever "SW_MSI" address in S1 to vITS as it does today on 
> bare metal.

Right, so I've been inclined to minimize the amount of special stuff
created for this way of doing the MSI and hope we can reach a better
way sooner than later

Jason
Jason Gunthorpe July 31, 2023, 1:23 p.m. UTC | #8
On Mon, Jul 31, 2023 at 06:21:44AM +0000, Tian, Kevin wrote:

> > As it is, userspace will have to aggregate the list, sort it, merge
> > adjacent overlapping reserved ranges then invert the list to get an
> > allowed list. This is not entirely simple..
> > 
> > Did you already write an algorithm to do this in qemu someplace?
> 
> Qemu is optional to aggregate it for S2 given IOMMU_IOAS_IOVA_RANGES
> is still being used. If the only purpose of using this new cmd is to report
> per-device reserved ranges to the guest then aggregation is not required.

I don't think it is entirely optional.. If qmeu doesn't track this,
then it will have failures when attaching the S2 to the device. It
needs to make sure it punches the right holes in the guest memory map
to be compatible with the VFIO HW.

I suppose in reality the reserved regions are fairly predictable and
probably always match the existing qemu memory map so you can ignore
this and still work.

Plus most qemu cases don't deal with hotplug so you can build up the
identity ioas with all the devices and then use IOMMU_IOAS_IOVA_RANGES
as you say and still work.

> Arguably IOMMU_IOAS_IOVA_RANGES becomes redundant with this
> new cmd. But it's already there and as you said it's actually more
> convenient to be used if the user doesn't care about per-device
> reserved ranges...

Yes and yes

So, I guess we should leave it like this?

Jason
Tian, Kevin Aug. 1, 2023, 2:40 a.m. UTC | #9
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, July 31, 2023 9:24 PM
> 
> On Mon, Jul 31, 2023 at 06:21:44AM +0000, Tian, Kevin wrote:
> 
> > > As it is, userspace will have to aggregate the list, sort it, merge
> > > adjacent overlapping reserved ranges then invert the list to get an
> > > allowed list. This is not entirely simple..
> > >
> > > Did you already write an algorithm to do this in qemu someplace?
> >
> > Qemu is optional to aggregate it for S2 given IOMMU_IOAS_IOVA_RANGES
> > is still being used. If the only purpose of using this new cmd is to report
> > per-device reserved ranges to the guest then aggregation is not required.
> 
> I don't think it is entirely optional.. If qmeu doesn't track this,
> then it will have failures when attaching the S2 to the device. It
> needs to make sure it punches the right holes in the guest memory map
> to be compatible with the VFIO HW.
> 
> I suppose in reality the reserved regions are fairly predictable and
> probably always match the existing qemu memory map so you can ignore
> this and still work.
> 
> Plus most qemu cases don't deal with hotplug so you can build up the
> identity ioas with all the devices and then use IOMMU_IOAS_IOVA_RANGES
> as you say and still work.
> 
> > Arguably IOMMU_IOAS_IOVA_RANGES becomes redundant with this
> > new cmd. But it's already there and as you said it's actually more
> > convenient to be used if the user doesn't care about per-device
> > reserved ranges...
> 
> Yes and yes
> 
> So, I guess we should leave it like this?
> 

Yes. Along with this discussion (including what you explained for sw_msi)
let's abandon this new cmd and leave it as today.
Jason Gunthorpe Aug. 1, 2023, 6:22 p.m. UTC | #10
On Tue, Aug 01, 2023 at 02:40:44AM +0000, Tian, Kevin wrote:

> > So, I guess we should leave it like this?
> > 
> 
> Yes. Along with this discussion (including what you explained for sw_msi)
> let's abandon this new cmd and leave it as today.

You sure? This makes it basically impossible to write a "correct" vmm
that is aware of what the physical memory map must be early on

Jason
Tian, Kevin Aug. 2, 2023, 1:09 a.m. UTC | #11
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, August 2, 2023 2:23 AM
> 
> On Tue, Aug 01, 2023 at 02:40:44AM +0000, Tian, Kevin wrote:
> 
> > > So, I guess we should leave it like this?
> > >
> >
> > Yes. Along with this discussion (including what you explained for sw_msi)
> > let's abandon this new cmd and leave it as today.
> 
> You sure? This makes it basically impossible to write a "correct" vmm
> that is aware of what the physical memory map must be early on
> 

emmm... I thought it's what you meant by "leave it like this" and the
fact that existing VMM's memory layout happens to match the reserved
regions. Nobody complains lacking of such a interface for years then
we may postpone supporting it until it's really required.

btw even if we add this new cmd now, getting the Qemu support to
use the aggregated list when creating the guest memory map is not
a simple task given currently vfio only passively acts on change
notifications in the guest memory layout. It requires a new mechanism
to enforce strict order (probe all vfio devices before creating the memory
layout) and then injects vfio reserved regions into the layout.

Preferably let's not making it a hard dependency for this series.
Jason Gunthorpe Aug. 2, 2023, 12:22 p.m. UTC | #12
On Wed, Aug 02, 2023 at 01:09:28AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, August 2, 2023 2:23 AM
> > 
> > On Tue, Aug 01, 2023 at 02:40:44AM +0000, Tian, Kevin wrote:
> > 
> > > > So, I guess we should leave it like this?
> > > >
> > >
> > > Yes. Along with this discussion (including what you explained for sw_msi)
> > > let's abandon this new cmd and leave it as today.
> > 
> > You sure? This makes it basically impossible to write a "correct" vmm
> > that is aware of what the physical memory map must be early on
> > 
> 
> emmm... I thought it's what you meant by "leave it like this" and the
> fact that existing VMM's memory layout happens to match the reserved
> regions. Nobody complains lacking of such a interface for years then
> we may postpone supporting it until it's really required.
> 
> btw even if we add this new cmd now, getting the Qemu support to
> use the aggregated list when creating the guest memory map is not
> a simple task given currently vfio only passively acts on change
> notifications in the guest memory layout. It requires a new mechanism
> to enforce strict order (probe all vfio devices before creating the memory
> layout) and then injects vfio reserved regions into the layout.
> 
> Preferably let's not making it a hard dependency for this series.

I see, if qemu won't use it then lets not do it

Jason
Nicolin Chen Aug. 3, 2023, 1:23 a.m. UTC | #13
On Wed, Aug 02, 2023 at 01:09:28AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, August 2, 2023 2:23 AM
> >
> > On Tue, Aug 01, 2023 at 02:40:44AM +0000, Tian, Kevin wrote:
> >
> > > > So, I guess we should leave it like this?
> > > >
> > >
> > > Yes. Along with this discussion (including what you explained for sw_msi)
> > > let's abandon this new cmd and leave it as today.
> >
> > You sure? This makes it basically impossible to write a "correct" vmm
> > that is aware of what the physical memory map must be early on
> >
> 
> emmm... I thought it's what you meant by "leave it like this" and the
> fact that existing VMM's memory layout happens to match the reserved
> regions. Nobody complains lacking of such a interface for years then
> we may postpone supporting it until it's really required.
> 
> btw even if we add this new cmd now, getting the Qemu support to
> use the aggregated list when creating the guest memory map is not
> a simple task given currently vfio only passively acts on change
> notifications in the guest memory layout. It requires a new mechanism
> to enforce strict order (probe all vfio devices before creating the memory
> layout) and then injects vfio reserved regions into the layout.
> 
> Preferably let's not making it a hard dependency for this series.

Should we drop this and its selftest patch from this series?

Thanks
Nic
Tian, Kevin Aug. 3, 2023, 1:25 a.m. UTC | #14
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Thursday, August 3, 2023 9:23 AM
> 
> On Wed, Aug 02, 2023 at 01:09:28AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, August 2, 2023 2:23 AM
> > >
> > > On Tue, Aug 01, 2023 at 02:40:44AM +0000, Tian, Kevin wrote:
> > >
> > > > > So, I guess we should leave it like this?
> > > > >
> > > >
> > > > Yes. Along with this discussion (including what you explained for
> sw_msi)
> > > > let's abandon this new cmd and leave it as today.
> > >
> > > You sure? This makes it basically impossible to write a "correct" vmm
> > > that is aware of what the physical memory map must be early on
> > >
> >
> > emmm... I thought it's what you meant by "leave it like this" and the
> > fact that existing VMM's memory layout happens to match the reserved
> > regions. Nobody complains lacking of such a interface for years then
> > we may postpone supporting it until it's really required.
> >
> > btw even if we add this new cmd now, getting the Qemu support to
> > use the aggregated list when creating the guest memory map is not
> > a simple task given currently vfio only passively acts on change
> > notifications in the guest memory layout. It requires a new mechanism
> > to enforce strict order (probe all vfio devices before creating the memory
> > layout) and then injects vfio reserved regions into the layout.
> >
> > Preferably let's not making it a hard dependency for this series.
> 
> Should we drop this and its selftest patch from this series?
> 

Yes. let's drop it.
Nicolin Chen Aug. 3, 2023, 2:17 a.m. UTC | #15
On Thu, Aug 03, 2023 at 01:25:37AM +0000, Tian, Kevin wrote:
> > > Preferably let's not making it a hard dependency for this series.
> >
> > Should we drop this and its selftest patch from this series?
> >
> 
> Yes. let's drop it.

Ack. Thanks!
diff mbox series

Patch

diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index bd3efc1d8509..510db114fc61 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -250,6 +250,57 @@  static int iommufd_get_hw_info(struct iommufd_ucmd *ucmd)
 	return rc;
 }
 
+static int iommufd_resv_iova_ranges(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_resv_iova_ranges *cmd = ucmd->cmd;
+	struct iommu_resv_iova_range __user *ranges;
+	struct iommu_resv_region *resv;
+	struct iommufd_device *idev;
+	LIST_HEAD(resv_regions);
+	u32 max_iovas;
+	int rc;
+
+	if (cmd->__reserved)
+		return -EOPNOTSUPP;
+
+	idev = iommufd_get_device(ucmd, cmd->dev_id);
+	if (IS_ERR(idev))
+		return PTR_ERR(idev);
+
+	max_iovas = cmd->num_iovas;
+	ranges = u64_to_user_ptr(cmd->resv_iovas);
+	cmd->num_iovas = 0;
+
+	iommu_get_resv_regions(idev->dev, &resv_regions);
+
+	list_for_each_entry(resv, &resv_regions, list) {
+		if (resv->type == IOMMU_RESV_DIRECT_RELAXABLE)
+			continue;
+		if (cmd->num_iovas < max_iovas) {
+			struct iommu_resv_iova_range elm = {
+				.start = resv->start,
+				.last = resv->length - 1 + resv->start,
+			};
+
+			if (copy_to_user(&ranges[cmd->num_iovas], &elm,
+					 sizeof(elm))) {
+				rc = -EFAULT;
+				goto out_put;
+			}
+		}
+		cmd->num_iovas++;
+	}
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+	if (rc)
+		goto out_put;
+	if (cmd->num_iovas > max_iovas)
+		rc = -EMSGSIZE;
+out_put:
+	iommu_put_resv_regions(idev->dev, &resv_regions);
+	iommufd_put_object(&idev->obj);
+	return rc;
+}
+
 static int iommufd_fops_open(struct inode *inode, struct file *filp)
 {
 	struct iommufd_ctx *ictx;
@@ -347,6 +398,7 @@  union ucmd_buffer {
 	struct iommu_ioas_map map;
 	struct iommu_ioas_unmap unmap;
 	struct iommu_option option;
+	struct iommu_resv_iova_ranges resv_ranges;
 	struct iommu_vfio_ioas vfio_ioas;
 #ifdef CONFIG_IOMMUFD_TEST
 	struct iommu_test_cmd test;
@@ -389,6 +441,8 @@  static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 		 length),
 	IOCTL_OP(IOMMU_OPTION, iommufd_option, struct iommu_option,
 		 val64),
+	IOCTL_OP(IOMMU_RESV_IOVA_RANGES, iommufd_resv_iova_ranges,
+		 struct iommu_resv_iova_ranges, resv_iovas),
 	IOCTL_OP(IOMMU_VFIO_IOAS, iommufd_vfio_ioas, struct iommu_vfio_ioas,
 		 __reserved),
 #ifdef CONFIG_IOMMUFD_TEST
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index d38bc54fd5f2..f2026cde2d64 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -47,6 +47,7 @@  enum {
 	IOMMUFD_CMD_VFIO_IOAS,
 	IOMMUFD_CMD_HWPT_ALLOC,
 	IOMMUFD_CMD_GET_HW_INFO,
+	IOMMUFD_CMD_RESV_IOVA_RANGES,
 };
 
 /**
@@ -422,4 +423,49 @@  struct iommu_hw_info {
 	__u32 __reserved;
 };
 #define IOMMU_GET_HW_INFO _IO(IOMMUFD_TYPE, IOMMUFD_CMD_GET_HW_INFO)
+
+/**
+ * struct iommu_resv_iova_range - ioctl(IOMMU_RESV_IOVA_RANGE)
+ * @start: First IOVA
+ * @last: Inclusive last IOVA
+ *
+ * An interval in IOVA space.
+ */
+struct iommu_resv_iova_range {
+	__aligned_u64 start;
+	__aligned_u64 last;
+};
+
+/**
+ * struct iommu_resv_iova_ranges - ioctl(IOMMU_RESV_IOVA_RANGES)
+ * @size: sizeof(struct iommu_resv_iova_ranges)
+ * @dev_id: device to read resv iova ranges for
+ * @num_iovas: Input/Output total number of resv ranges for the device
+ * @__reserved: Must be 0
+ * @resv_iovas: Pointer to the output array of struct iommu_resv_iova_range
+ *
+ * Query a device for ranges of reserved IOVAs. num_iovas will be set to the
+ * total number of iovas and the resv_iovas[] will be filled in as space
+ * permits.
+ *
+ * On input num_iovas is the length of the resv_iovas array. On output it is
+ * the total number of iovas filled in. The ioctl will return -EMSGSIZE and
+ * set num_iovas to the required value if num_iovas is too small. In this
+ * case the caller should allocate a larger output array and re-issue the
+ * ioctl.
+ *
+ * Under nested translation, userspace should query the reserved IOVAs for a
+ * given device, and report it to the stage-1 I/O page table owner to exclude
+ * the reserved IOVAs. The reserved IOVAs can also be used to figure out the
+ * allowed IOVA ranges for the IOAS that the device is attached to. For detail
+ * see ioctl IOMMU_IOAS_IOVA_RANGES.
+ */
+struct iommu_resv_iova_ranges {
+	__u32 size;
+	__u32 dev_id;
+	__u32 num_iovas;
+	__u32 __reserved;
+	__aligned_u64 resv_iovas;
+};
+#define IOMMU_RESV_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_RESV_IOVA_RANGES)
 #endif