diff mbox series

[v8,04/10] vfio/type1: Support binding guest page tables to PASID

Message ID 20210302203545.436623-5-yi.l.liu@intel.com (mailing list archive)
State New, archived
Headers show
Series vfio: expose virtual Shared Virtual Addressing to VMs | expand

Commit Message

Liu, Yi L March 2, 2021, 8:35 p.m. UTC
Nesting translation allows two-levels/stages page tables, with 1st level
for guest translations (e.g. GVA->GPA), 2nd level for host translations
(e.g. GPA->HPA). This patch adds interface for binding guest page tables
to a PASID. This PASID must have been allocated by the userspace before
the binding request. e.g. allocated from /dev/ioasid. As the bind data
is parsed by iommu abstract layer, so this patch doesn't have the ownership
check against the PASID from userspace. It would be done in the iommu sub-
system.

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
v7 -> v8:
*) adapt to /dev/ioasid
*) address comments from Alex on v7.
*) adapt to latest iommu_sva_unbind_gpasid() implementation.
*) remove the OP check against VFIO_IOMMU_NESTING_OP_NUM as it's redundant
   to the default switch case in vfio_iommu_handle_pgtbl_op().

v6 -> v7:
*) introduced @user in struct domain_capsule to simplify the code per Eric's
   suggestion.
*) introduced VFIO_IOMMU_NESTING_OP_NUM for sanitizing op from userspace.
*) corrected the @argsz value of unbind_data in vfio_group_unbind_gpasid_fn().

v5 -> v6:
*) dropped vfio_find_nesting_group() and add vfio_get_nesting_domain_capsule().
   per comment from Eric.
*) use iommu_uapi_sva_bind/unbind_gpasid() and iommu_sva_unbind_gpasid() in
   linux/iommu.h for userspace operation and in-kernel operation.

v3 -> v4:
*) address comments from Alex on v3

v2 -> v3:
*) use __iommu_sva_unbind_gpasid() for unbind call issued by VFIO
   https://lore.kernel.org/linux-iommu/1592931837-58223-6-git-send-email-jacob.jun.pan@linux.intel.com/

v1 -> v2:
*) rename subject from "vfio/type1: Bind guest page tables to host"
*) remove VFIO_IOMMU_BIND, introduce VFIO_IOMMU_NESTING_OP to support bind/
   unbind guet page table
*) replaced vfio_iommu_for_each_dev() with a group level loop since this
   series enforces one group per container w/ nesting type as start.
*) rename vfio_bind/unbind_gpasid_fn() to vfio_dev_bind/unbind_gpasid_fn()
*) vfio_dev_unbind_gpasid() always successful
*) use vfio_mm->pasid_lock to avoid race between PASID free and page table
   bind/unbind
---
 drivers/vfio/vfio_iommu_type1.c | 156 ++++++++++++++++++++++++++++++++
 include/uapi/linux/vfio.h       |  35 +++++++
 2 files changed, 191 insertions(+)

Comments

Jason Gunthorpe March 2, 2021, 12:56 p.m. UTC | #1
On Wed, Mar 03, 2021 at 04:35:39AM +0800, Liu Yi L wrote:
>  
> +static int vfio_dev_bind_gpasid_fn(struct device *dev, void *data)
> +{
> +	struct domain_capsule *dc = (struct domain_capsule *)data;
> +	unsigned long arg = *(unsigned long *)dc->data;
> +
> +	return iommu_uapi_sva_bind_gpasid(dc->domain, dev,
> +					  (void __user *)arg);

This arg buisness is really tortured. The type should be set at the
ioctl, not constantly passed down as unsigned long or worse void *.

And why is this passing a __user pointer deep into an iommu_* API??

> +/**
> + * VFIO_IOMMU_NESTING_OP - _IOW(VFIO_TYPE, VFIO_BASE + 18,
> + *				struct vfio_iommu_type1_nesting_op)
> + *
> + * This interface allows userspace to utilize the nesting IOMMU
> + * capabilities as reported in VFIO_IOMMU_TYPE1_INFO_CAP_NESTING
> + * cap through VFIO_IOMMU_GET_INFO. For platforms which require
> + * system wide PASID, PASID will be allocated by VFIO_IOMMU_PASID
> + * _REQUEST.
> + *
> + * @data[] types defined for each op:
> + * +=================+===============================================+
> + * | NESTING OP      |      @data[]                                  |
> + * +=================+===============================================+
> + * | BIND_PGTBL      |      struct iommu_gpasid_bind_data            |
> + * +-----------------+-----------------------------------------------+
> + * | UNBIND_PGTBL    |      struct iommu_gpasid_bind_data            |
> + *
> +-----------------+-----------------------------------------------+

If the type is known why does the struct have a flex array?

Jason
Jacob Pan March 2, 2021, 5:13 p.m. UTC | #2
Hi Jason,

On Tue, 2 Mar 2021 08:56:28 -0400, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Mar 03, 2021 at 04:35:39AM +0800, Liu Yi L wrote:
> >  
> > +static int vfio_dev_bind_gpasid_fn(struct device *dev, void *data)
> > +{
> > +	struct domain_capsule *dc = (struct domain_capsule *)data;
> > +	unsigned long arg = *(unsigned long *)dc->data;
> > +
> > +	return iommu_uapi_sva_bind_gpasid(dc->domain, dev,
> > +					  (void __user *)arg);  
> 
> This arg buisness is really tortured. The type should be set at the
> ioctl, not constantly passed down as unsigned long or worse void *.
> 
> And why is this passing a __user pointer deep into an iommu_* API??
> 
The idea was that IOMMU UAPI (not API) is independent of VFIO or other user
driver frameworks. The design is documented here:
Documentation/userspace-api/iommu.rst
IOMMU UAPI handles the type and sanitation of user provided data.

Could you be more specific about your concerns?

> > +/**
> > + * VFIO_IOMMU_NESTING_OP - _IOW(VFIO_TYPE, VFIO_BASE + 18,
> > + *				struct vfio_iommu_type1_nesting_op)
> > + *
> > + * This interface allows userspace to utilize the nesting IOMMU
> > + * capabilities as reported in VFIO_IOMMU_TYPE1_INFO_CAP_NESTING
> > + * cap through VFIO_IOMMU_GET_INFO. For platforms which require
> > + * system wide PASID, PASID will be allocated by VFIO_IOMMU_PASID
> > + * _REQUEST.
> > + *
> > + * @data[] types defined for each op:
> > + * +=================+===============================================+
> > + * | NESTING OP      |      @data[]                                  |
> > + * +=================+===============================================+
> > + * | BIND_PGTBL      |      struct iommu_gpasid_bind_data            |
> > + * +-----------------+-----------------------------------------------+
> > + * | UNBIND_PGTBL    |      struct iommu_gpasid_bind_data            |
> > + *
> > +-----------------+-----------------------------------------------+  
> 
> If the type is known why does the struct have a flex array?
> 
This will be extended to other types in the next patches.

> Jason


Thanks,

Jacob
Jason Gunthorpe March 2, 2021, 5:15 p.m. UTC | #3
On Tue, Mar 02, 2021 at 09:13:19AM -0800, Jacob Pan wrote:
> Hi Jason,
> 
> On Tue, 2 Mar 2021 08:56:28 -0400, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Mar 03, 2021 at 04:35:39AM +0800, Liu Yi L wrote:
> > >  
> > > +static int vfio_dev_bind_gpasid_fn(struct device *dev, void *data)
> > > +{
> > > +	struct domain_capsule *dc = (struct domain_capsule *)data;
> > > +	unsigned long arg = *(unsigned long *)dc->data;
> > > +
> > > +	return iommu_uapi_sva_bind_gpasid(dc->domain, dev,
> > > +					  (void __user *)arg);  
> > 
> > This arg buisness is really tortured. The type should be set at the
> > ioctl, not constantly passed down as unsigned long or worse void *.
> > 
> > And why is this passing a __user pointer deep into an iommu_* API??
> > 
> The idea was that IOMMU UAPI (not API) is independent of VFIO or other user
> driver frameworks. The design is documented here:
> Documentation/userspace-api/iommu.rst
> IOMMU UAPI handles the type and sanitation of user provided data.

Why? If it is uapi it has defined types and those types should be
completely clear from the C code, not obfuscated.

I haven't looked at the design doc yet, but this is a just a big red
flag, you shouldn't be tunneling one subsytems uAPI through another
subsystem.

If you need to hook two subsystems together it should be more
directly, like VFIO takes in the IOMMU FD and 'registers' itself in
some way with the IOMMU then you can do the IOMMU actions through the
IOMMU FD and it can call back to VFIO as needed.

At least in this way we can swap VFIO for other things in the API.

Having every subsystem that wants to implement IOMMU also implement
tunneled ops seems very backwards.

> Could you be more specific about your concerns?

Avoid using unsigned long, void * and flex arrays to describe
concretely typed things.

Jason
Jacob Pan March 3, 2021, 7:42 p.m. UTC | #4
Hi Jason,

On Tue, 2 Mar 2021 13:15:51 -0400, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Mar 02, 2021 at 09:13:19AM -0800, Jacob Pan wrote:
> > Hi Jason,
> > 
> > On Tue, 2 Mar 2021 08:56:28 -0400, Jason Gunthorpe <jgg@nvidia.com>
> > wrote: 
> > > On Wed, Mar 03, 2021 at 04:35:39AM +0800, Liu Yi L wrote:  
> > > >  
> > > > +static int vfio_dev_bind_gpasid_fn(struct device *dev, void *data)
> > > > +{
> > > > +	struct domain_capsule *dc = (struct domain_capsule *)data;
> > > > +	unsigned long arg = *(unsigned long *)dc->data;
> > > > +
> > > > +	return iommu_uapi_sva_bind_gpasid(dc->domain, dev,
> > > > +					  (void __user *)arg);    
> > > 
> > > This arg buisness is really tortured. The type should be set at the
> > > ioctl, not constantly passed down as unsigned long or worse void *.
> > > 
> > > And why is this passing a __user pointer deep into an iommu_* API??
> > >   
> > The idea was that IOMMU UAPI (not API) is independent of VFIO or other
> > user driver frameworks. The design is documented here:
> > Documentation/userspace-api/iommu.rst
> > IOMMU UAPI handles the type and sanitation of user provided data.  
> 
> Why? If it is uapi it has defined types and those types should be
> completely clear from the C code, not obfuscated.
> 
From the user's p.o.v., it is plain c code nothing obfuscated. As for
kernel handling of the data types, it has to be answered by the bigger
question of how we deal with sharing IOMMU among multiple subsystems with
UAPIs.

> I haven't looked at the design doc yet, but this is a just a big red
> flag, you shouldn't be tunneling one subsytems uAPI through another
> subsystem.
>
> If you need to hook two subsystems together it should be more
> directly, like VFIO takes in the IOMMU FD and 'registers' itself in
> some way with the IOMMU then you can do the IOMMU actions through the
> IOMMU FD and it can call back to VFIO as needed.
> 
> At least in this way we can swap VFIO for other things in the API.
> 
> Having every subsystem that wants to implement IOMMU also implement
> tunneled ops seems very backwards.
>
Let me try to understand your proposal, you are saying:
1. Create a new user interface such that FDs can be obtained as a reference
to an IOMMU.
2. Handle over the IOMMU FD to VFIO or other subsystem which wish to
register for IOMMU service.

However, IOMMU is a system device which has little value to be exposed to
the userspace. Not to mention the device-IOMMU affinity/topology. VFIO
nicely abstracts IOMMU from the userspace, why do we want to reverse that?

With this UAPI layering approach, we converge common code at the IOMMU
layer. Without introducing new user interfaces, we can support multiple
subsystems that want to use IOMMU service. e.g. VDPA and VFIO.

> > Could you be more specific about your concerns?  
> 
> Avoid using unsigned long, void * and flex arrays to describe
> concretely typed things.
> 
> Jason


Thanks,

Jacob
Jason Gunthorpe March 3, 2021, 7:45 p.m. UTC | #5
On Wed, Mar 03, 2021 at 11:42:12AM -0800, Jacob Pan wrote:
> Hi Jason,
> 
> On Tue, 2 Mar 2021 13:15:51 -0400, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Mar 02, 2021 at 09:13:19AM -0800, Jacob Pan wrote:
> > > Hi Jason,
> > > 
> > > On Tue, 2 Mar 2021 08:56:28 -0400, Jason Gunthorpe <jgg@nvidia.com>
> > > wrote: 
> > > > On Wed, Mar 03, 2021 at 04:35:39AM +0800, Liu Yi L wrote:  
> > > > >  
> > > > > +static int vfio_dev_bind_gpasid_fn(struct device *dev, void *data)
> > > > > +{
> > > > > +	struct domain_capsule *dc = (struct domain_capsule *)data;
> > > > > +	unsigned long arg = *(unsigned long *)dc->data;
> > > > > +
> > > > > +	return iommu_uapi_sva_bind_gpasid(dc->domain, dev,
> > > > > +					  (void __user *)arg);    
> > > > 
> > > > This arg buisness is really tortured. The type should be set at the
> > > > ioctl, not constantly passed down as unsigned long or worse void *.
> > > > 
> > > > And why is this passing a __user pointer deep into an iommu_* API??
> > > >   
> > > The idea was that IOMMU UAPI (not API) is independent of VFIO or other
> > > user driver frameworks. The design is documented here:
> > > Documentation/userspace-api/iommu.rst
> > > IOMMU UAPI handles the type and sanitation of user provided data.  
> > 
> > Why? If it is uapi it has defined types and those types should be
> > completely clear from the C code, not obfuscated.
> > 
> From the user's p.o.v., it is plain c code nothing obfuscated. As for
> kernel handling of the data types, it has to be answered by the bigger
> question of how we deal with sharing IOMMU among multiple subsystems with
> UAPIs.
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=nvidia.com; s=n1;
	t=1614800733; bh=ESjraPQ1U+x3dvWd7l3HBKlTBu3ySX2nWO4QM44ApEU=;
	h=ARC-Seal:ARC-Message-Signature:ARC-Authentication-Results:Date:
	 From:To:CC:Subject:Message-ID:References:Content-Type:
	 Content-Disposition:In-Reply-To:X-ClientProxiedBy:MIME-Version:
	 X-MS-Exchange-MessageSentRepresentingType:X-Header;
	b=Jyxs64SLsrGfmoPaAHc58Bpre6/+/wDGs/dAMDviWTwRc+2Miw0+jOBGQXNLz/lE8
	 KbS/K02BZmgn/jJwl994Po6nS3kGgTg0of6AFSd2MqtaZPMx+aMJ3prec9hwpHaXmN
	 SiosC+FviWoFQHmEvWSQoVeEMS093zQ+sjcsymUkMXYHGYQRqebW101Mii6/0hT3iv
	 Su4YvAyaKmOpWT8sayI4K0ICIhRxWAOT5P78FrXofij9o3X9T9F/9Bo+S5BWSCMzMr
	 +dM2KsTd4Ecac9hNemigs87T/tiCh50XaZoc8WMRFWGYMpha6KFdCV5wWL0Yzauyt7
	 H7uCWkmg/QTCw==

As I said, don't obfuscate types like this in the kernel. It is not
good style.

> However, IOMMU is a system device which has little value to be exposed to
> the userspace. Not to mention the device-IOMMU affinity/topology. VFIO
> nicely abstracts IOMMU from the userspace, why do we want to reverse that?

The other patch was talking about a /dev/ioasid - why can't this stuff
be run over that?

Jason
Liu, Yi L March 4, 2021, 7:20 a.m. UTC | #6
Hi Jason,

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, March 4, 2021 3:45 AM
> 
> On Wed, Mar 03, 2021 at 11:42:12AM -0800, Jacob Pan wrote:
> > Hi Jason,
> >
> > On Tue, 2 Mar 2021 13:15:51 -0400, Jason Gunthorpe <jgg@nvidia.com>
> wrote:
> >
> > > On Tue, Mar 02, 2021 at 09:13:19AM -0800, Jacob Pan wrote:
> > > > Hi Jason,
> > > >
> > > > On Tue, 2 Mar 2021 08:56:28 -0400, Jason Gunthorpe <jgg@nvidia.com>
> > > > wrote:
> > > > > On Wed, Mar 03, 2021 at 04:35:39AM +0800, Liu Yi L wrote:
> > > > > >
> > > > > > +static int vfio_dev_bind_gpasid_fn(struct device *dev, void *data)
> > > > > > +{
> > > > > > +	struct domain_capsule *dc = (struct domain_capsule *)data;
> > > > > > +	unsigned long arg = *(unsigned long *)dc->data;
> > > > > > +
> > > > > > +	return iommu_uapi_sva_bind_gpasid(dc->domain, dev,
> > > > > > +					  (void __user *)arg);
> > > > >
> > > > > This arg buisness is really tortured. The type should be set at the
> > > > > ioctl, not constantly passed down as unsigned long or worse void *.
> > > > >
> > > > > And why is this passing a __user pointer deep into an iommu_* API??
> > > > >
> > > > The idea was that IOMMU UAPI (not API) is independent of VFIO or
> other
> > > > user driver frameworks. The design is documented here:
> > > > Documentation/userspace-api/iommu.rst
> > > > IOMMU UAPI handles the type and sanitation of user provided data.
> > >
> > > Why? If it is uapi it has defined types and those types should be
> > > completely clear from the C code, not obfuscated.
> > >
> > From the user's p.o.v., it is plain c code nothing obfuscated. As for
> > kernel handling of the data types, it has to be answered by the bigger
> > question of how we deal with sharing IOMMU among multiple
> subsystems with
> > UAPIs.
> 
> As I said, don't obfuscate types like this in the kernel. It is not
> good style.
> 
> > However, IOMMU is a system device which has little value to be exposed
> to
> > the userspace. Not to mention the device-IOMMU affinity/topology. VFIO
> > nicely abstracts IOMMU from the userspace, why do we want to reverse
> that?
> 
> The other patch was talking about a /dev/ioasid - why can't this stuff
> be run over that?

The stuff in this patch are actually iommu domain operations, which are
finally supported by iommu domain ops. While /dev/ioasid in another patch
is created for IOASID allocation/free to fit the PASID allocation requirement
from both vSVA and vDPA. It has no idea about iommu domain and neither the
device information. Without such info, /dev/ioasid is unable to run this
stuff.

Thanks,
Yi Liu
Jason Gunthorpe March 4, 2021, 12:52 p.m. UTC | #7
On Thu, Mar 04, 2021 at 07:20:22AM +0000, Liu, Yi L wrote:
> > > However, IOMMU is a system device which has little value to be exposed
> > to
> > > the userspace. Not to mention the device-IOMMU affinity/topology. VFIO
> > > nicely abstracts IOMMU from the userspace, why do we want to reverse
> > that?
> > 
> > The other patch was talking about a /dev/ioasid - why can't this stuff
> > be run over that?
> 
> The stuff in this patch are actually iommu domain operations, which are
> finally supported by iommu domain ops. While /dev/ioasid in another patch
> is created for IOASID allocation/free to fit the PASID allocation requirement
> from both vSVA and vDPA. It has no idea about iommu domain and neither the
> device information. Without such info, /dev/ioasid is unable to run this
> stuff.

Why can't it know? My point was that VFIO should interact with
/dev/ioasid to exchange the information both sides need and once
exchanged the control over IOMMU should be done through /dev/ioasid,
not inside VFIO.

Jason
diff mbox series

Patch

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 3a5c84d4f19b..0044931b80dc 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -164,6 +164,34 @@  struct vfio_regions {
 
 #define WAITED 1
 
+struct domain_capsule {
+	struct vfio_group	*group;
+	struct iommu_domain	*domain;
+	void			*data;
+	/* set if @data contains a user pointer*/
+	bool			user;
+};
+
+/* iommu->lock must be held */
+static int vfio_prepare_nesting_domain_capsule(struct vfio_iommu *iommu,
+					       struct domain_capsule *dc)
+{
+	struct vfio_domain *domain;
+	struct vfio_group *group;
+
+	if (!iommu->nesting_info)
+		return -EINVAL;
+
+	domain = list_first_entry(&iommu->domain_list,
+				  struct vfio_domain, next);
+	group = list_first_entry(&domain->group_list,
+				 struct vfio_group, next);
+	dc->group = group;
+	dc->domain = domain->domain;
+	dc->user = true;
+	return 0;
+}
+
 static int put_pfn(unsigned long pfn, int prot);
 
 static struct vfio_group *vfio_iommu_find_iommu_group(struct vfio_iommu *iommu,
@@ -2607,6 +2635,51 @@  static int vfio_iommu_resv_refresh(struct vfio_iommu *iommu,
 	return ret;
 }
 
+static int vfio_dev_bind_gpasid_fn(struct device *dev, void *data)
+{
+	struct domain_capsule *dc = (struct domain_capsule *)data;
+	unsigned long arg = *(unsigned long *)dc->data;
+
+	return iommu_uapi_sva_bind_gpasid(dc->domain, dev,
+					  (void __user *)arg);
+}
+
+static int vfio_dev_unbind_gpasid_fn(struct device *dev, void *data)
+{
+	struct domain_capsule *dc = (struct domain_capsule *)data;
+
+	/*
+	 * dc->user is a toggle for the unbind operation. When user
+	 * set, the dc->data passes in a __user pointer and requires
+	 * to use iommu_uapi_sva_unbind_gpasid(), in which it will
+	 * copy the unbind data from the user buffer. When user is
+	 * clear, the dc->data passes in a pasid which is going to
+	 * be unbind no need to copy data from userspace.
+	 */
+	if (dc->user) {
+		unsigned long arg = *(unsigned long *)dc->data;
+
+		iommu_uapi_sva_unbind_gpasid(dc->domain,
+					     dev, (void __user *)arg);
+	} else {
+		ioasid_t pasid = *(ioasid_t *)dc->data;
+
+		iommu_sva_unbind_gpasid(dc->domain, dev, pasid);
+	}
+	return 0;
+}
+
+static void vfio_group_unbind_gpasid_fn(ioasid_t pasid, void *data)
+{
+	struct domain_capsule *dc = (struct domain_capsule *)data;
+
+	dc->user = false;
+	dc->data = &pasid;
+
+	iommu_group_for_each_dev(dc->group->iommu_group,
+				 dc, vfio_dev_unbind_gpasid_fn);
+}
+
 static void vfio_iommu_type1_detach_group(void *iommu_data,
 					  struct iommu_group *iommu_group)
 {
@@ -2650,6 +2723,27 @@  static void vfio_iommu_type1_detach_group(void *iommu_data,
 		if (!group)
 			continue;
 
+		if (iommu->nesting_info &&
+		    iommu->nesting_info->features &
+					IOMMU_NESTING_FEAT_BIND_PGTBL) {
+			struct domain_capsule dc = { .group = group,
+						     .domain = domain->domain,
+						     .data = NULL };
+			struct ioasid_user *iuser;
+
+			/*
+			 * For devices attached to nesting type iommu,
+			 * VFIO should unbind page tables bound with the
+			 * devices in the iommu group before detaching.
+			 */
+			iuser = ioasid_user_get_from_task(current);
+			if (!(IS_ERR(iuser) || !iuser)) {
+				ioasid_user_for_each_id(iuser, &dc,
+					       vfio_group_unbind_gpasid_fn);
+				ioasid_user_put(iuser);
+			}
+		}
+
 		vfio_iommu_detach_group(domain, group);
 		update_dirty_scope = !group->pinned_page_dirty_scope;
 		list_del(&group->next);
@@ -3166,6 +3260,66 @@  static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
 	return -EINVAL;
 }
 
+static long vfio_iommu_handle_pgtbl_op(struct vfio_iommu *iommu,
+				       bool is_bind, unsigned long arg)
+{
+	struct domain_capsule dc = { .data = &arg, .user = true };
+	struct iommu_nesting_info *info;
+	int ret;
+
+	mutex_lock(&iommu->lock);
+
+	info = iommu->nesting_info;
+	if (!info || !(info->features & IOMMU_NESTING_FEAT_BIND_PGTBL)) {
+		ret = -EOPNOTSUPP;
+		goto out_unlock;
+	}
+
+	ret = vfio_prepare_nesting_domain_capsule(iommu, &dc);
+	if (ret)
+		goto out_unlock;
+
+	if (is_bind)
+		ret = iommu_group_for_each_dev(dc.group->iommu_group, &dc,
+					       vfio_dev_bind_gpasid_fn);
+	if (ret || !is_bind)
+		iommu_group_for_each_dev(dc.group->iommu_group,
+					 &dc, vfio_dev_unbind_gpasid_fn);
+
+out_unlock:
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
+static long vfio_iommu_type1_nesting_op(struct vfio_iommu *iommu,
+					unsigned long arg)
+{
+	struct vfio_iommu_type1_nesting_op hdr;
+	unsigned int minsz;
+	int ret;
+
+	minsz = offsetofend(struct vfio_iommu_type1_nesting_op, flags);
+
+	if (copy_from_user(&hdr, (void __user *)arg, minsz))
+		return -EFAULT;
+
+	if (hdr.argsz < minsz || hdr.flags & ~VFIO_NESTING_OP_MASK)
+		return -EINVAL;
+
+	switch (hdr.flags & VFIO_NESTING_OP_MASK) {
+	case VFIO_IOMMU_NESTING_OP_BIND_PGTBL:
+		ret = vfio_iommu_handle_pgtbl_op(iommu, true, arg + minsz);
+		break;
+	case VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL:
+		ret = vfio_iommu_handle_pgtbl_op(iommu, false, arg + minsz);
+		break;
+	default:
+		ret = -EINVAL;
+	}
+
+	return ret;
+}
+
 static long vfio_iommu_type1_ioctl(void *iommu_data,
 				   unsigned int cmd, unsigned long arg)
 {
@@ -3182,6 +3336,8 @@  static long vfio_iommu_type1_ioctl(void *iommu_data,
 		return vfio_iommu_type1_unmap_dma(iommu, arg);
 	case VFIO_IOMMU_DIRTY_PAGES:
 		return vfio_iommu_type1_dirty_pages(iommu, arg);
+	case VFIO_IOMMU_NESTING_OP:
+		return vfio_iommu_type1_nesting_op(iommu, arg);
 	default:
 		return -ENOTTY;
 	}
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 5dab20aaf7e7..985e6cf4c52d 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1227,6 +1227,41 @@  struct vfio_iommu_type1_dirty_bitmap_get {
 
 #define VFIO_IOMMU_DIRTY_PAGES             _IO(VFIO_TYPE, VFIO_BASE + 17)
 
+/**
+ * VFIO_IOMMU_NESTING_OP - _IOW(VFIO_TYPE, VFIO_BASE + 18,
+ *				struct vfio_iommu_type1_nesting_op)
+ *
+ * This interface allows userspace to utilize the nesting IOMMU
+ * capabilities as reported in VFIO_IOMMU_TYPE1_INFO_CAP_NESTING
+ * cap through VFIO_IOMMU_GET_INFO. For platforms which require
+ * system wide PASID, PASID will be allocated by VFIO_IOMMU_PASID
+ * _REQUEST.
+ *
+ * @data[] types defined for each op:
+ * +=================+===============================================+
+ * | NESTING OP      |      @data[]                                  |
+ * +=================+===============================================+
+ * | BIND_PGTBL      |      struct iommu_gpasid_bind_data            |
+ * +-----------------+-----------------------------------------------+
+ * | UNBIND_PGTBL    |      struct iommu_gpasid_bind_data            |
+ * +-----------------+-----------------------------------------------+
+ *
+ * returns: 0 on success, -errno on failure.
+ */
+struct vfio_iommu_type1_nesting_op {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_NESTING_OP_MASK	(0xffff) /* lower 16-bits for op */
+	__u8	data[];
+};
+
+enum {
+	VFIO_IOMMU_NESTING_OP_BIND_PGTBL,
+	VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL,
+};
+
+#define VFIO_IOMMU_NESTING_OP		_IO(VFIO_TYPE, VFIO_BASE + 18)
+
 /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
 
 /*