From patchwork Fri Apr  9 03:44:13 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Shenming Lu <lushenming@huawei.com>
X-Patchwork-Id: 12192933
Return-Path: <kvm-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-16.7 required=3.0 tests=BAYES_00,
	HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH,
	MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT
	autolearn=unavailable autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 4A1BDC433ED
	for <kvm@archiver.kernel.org>; Fri,  9 Apr 2021 03:44:45 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 13FFF61042
	for <kvm@archiver.kernel.org>; Fri,  9 Apr 2021 03:44:45 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S233233AbhDIDo4 (ORCPT <rfc822;kvm@archiver.kernel.org>);
        Thu, 8 Apr 2021 23:44:56 -0400
Received: from szxga04-in.huawei.com ([45.249.212.190]:15638 "EHLO
        szxga04-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S232941AbhDIDoz (ORCPT <rfc822;kvm@vger.kernel.org>);
        Thu, 8 Apr 2021 23:44:55 -0400
Received: from DGGEMS414-HUB.china.huawei.com (unknown [172.30.72.59])
        by szxga04-in.huawei.com (SkyGuard) with ESMTP id 4FGkTF2QDmznZ7g;
        Fri,  9 Apr 2021 11:41:53 +0800 (CST)
Received: from DESKTOP-7FEPK9S.china.huawei.com (10.174.184.135) by
 DGGEMS414-HUB.china.huawei.com (10.3.19.214) with Microsoft SMTP Server id
 14.3.498.0; Fri, 9 Apr 2021 11:44:33 +0800
From: Shenming Lu <lushenming@huawei.com>
To: Alex Williamson <alex.williamson@redhat.com>,
        Cornelia Huck <cohuck@redhat.com>,
        Will Deacon <will@kernel.org>,
        Robin Murphy <robin.murphy@arm.com>,
        Joerg Roedel <joro@8bytes.org>,
        Jean-Philippe Brucker <jean-philippe@linaro.org>,
        Eric Auger <eric.auger@redhat.com>, <kvm@vger.kernel.org>,
        <linux-kernel@vger.kernel.org>,
        <linux-arm-kernel@lists.infradead.org>,
        <iommu@lists.linux-foundation.org>, <linux-api@vger.kernel.org>
CC: Kevin Tian <kevin.tian@intel.com>,
        Lu Baolu <baolu.lu@linux.intel.com>, <yi.l.liu@intel.com>,
        Christoph Hellwig <hch@infradead.org>,
        Jonathan Cameron <Jonathan.Cameron@huawei.com>,
        Barry Song <song.bao.hua@hisilicon.com>,
        <wanghaibin.wang@huawei.com>, <yuzenghui@huawei.com>,
        <lushenming@huawei.com>
Subject: [RFC PATCH v3 1/8] iommu: Evolve the device fault reporting framework
Date: Fri, 9 Apr 2021 11:44:13 +0800
Message-ID: <20210409034420.1799-2-lushenming@huawei.com>
X-Mailer: git-send-email 2.27.0.windows.1
In-Reply-To: <20210409034420.1799-1-lushenming@huawei.com>
References: <20210409034420.1799-1-lushenming@huawei.com>
MIME-Version: 1.0
X-Originating-IP: [10.174.184.135]
X-CFilter-Loop: Reflected
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

This patch follows the discussion here:

https://lore.kernel.org/linux-acpi/YAaxjmJW+ZMvrhac@myrica/

Besides SVA/vSVA, such as VFIO may also enable (2nd level) IOPF to remove
pinning restriction. In order to better support more scenarios of using
device faults, we extend iommu_register_fault_handler() with flags and
introduce FAULT_REPORT_ to describe the device fault reporting capability
under a specific configuration.

Note that we don't further distinguish recoverable and unrecoverable faults
by flags in the fault reporting cap, having PAGE_FAULT_REPORT_ +
UNRECOV_FAULT_REPORT_ seems not a clean way.

In addition, still take VFIO as an example, in nested mode, the 1st level
and 2nd level fault reporting may be configured separately and currently
each device can only register one iommu dev fault handler, so we add a
handler update interface for this.

Signed-off-by: Shenming Lu <lushenming@huawei.com>
---
 .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c   |  3 +-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   | 18 ++++--
 drivers/iommu/iommu.c                         | 56 ++++++++++++++++++-
 include/linux/iommu.h                         | 19 ++++++-
 include/uapi/linux/iommu.h                    |  4 ++
 5 files changed, 90 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
index ee66d1f4cb81..e6d766fb8f1a 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
@@ -482,7 +482,8 @@ static int arm_smmu_master_sva_enable_iopf(struct arm_smmu_master *master)
 	if (ret)
 		return ret;
 
-	ret = iommu_register_device_fault_handler(dev, iommu_queue_iopf, dev);
+	ret = iommu_register_device_fault_handler(dev, iommu_queue_iopf,
+						  FAULT_REPORT_FLAT, dev);
 	if (ret) {
 		iopf_queue_remove_device(master->smmu->evtq.iopf, dev);
 		return ret;
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 53abad8fdd91..51843f54a87f 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1448,10 +1448,6 @@ static int arm_smmu_handle_evt(struct arm_smmu_device *smmu, u64 *evt)
 		return -EOPNOTSUPP;
 	}
 
-	/* Stage-2 is always pinned at the moment */
-	if (evt[1] & EVTQ_1_S2)
-		return -EFAULT;
-
 	if (evt[1] & EVTQ_1_RnW)
 		perm |= IOMMU_FAULT_PERM_READ;
 	else
@@ -1469,26 +1465,36 @@ static int arm_smmu_handle_evt(struct arm_smmu_device *smmu, u64 *evt)
 			.flags = IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE,
 			.grpid = FIELD_GET(EVTQ_1_STAG, evt[1]),
 			.perm = perm,
-			.addr = FIELD_GET(EVTQ_2_ADDR, evt[2]),
 		};
 
 		if (ssid_valid) {
 			flt->prm.flags |= IOMMU_FAULT_PAGE_REQUEST_PASID_VALID;
 			flt->prm.pasid = FIELD_GET(EVTQ_0_SSID, evt[0]);
 		}
+
+		if (evt[1] & EVTQ_1_S2) {
+			flt->prm.flags |= IOMMU_FAULT_PAGE_REQUEST_L2;
+			flt->prm.addr = FIELD_GET(EVTQ_3_IPA, evt[3]);
+		} else
+			flt->prm.addr = FIELD_GET(EVTQ_2_ADDR, evt[2]);
 	} else {
 		flt->type = IOMMU_FAULT_DMA_UNRECOV;
 		flt->event = (struct iommu_fault_unrecoverable) {
 			.reason = reason,
 			.flags = IOMMU_FAULT_UNRECOV_ADDR_VALID,
 			.perm = perm,
-			.addr = FIELD_GET(EVTQ_2_ADDR, evt[2]),
 		};
 
 		if (ssid_valid) {
 			flt->event.flags |= IOMMU_FAULT_UNRECOV_PASID_VALID;
 			flt->event.pasid = FIELD_GET(EVTQ_0_SSID, evt[0]);
 		}
+
+		if (evt[1] & EVTQ_1_S2) {
+			flt->event.flags |= IOMMU_FAULT_UNRECOV_L2;
+			flt->event.addr = FIELD_GET(EVTQ_3_IPA, evt[3]);
+		} else
+			flt->event.addr = FIELD_GET(EVTQ_2_ADDR, evt[2]);
 	}
 
 	mutex_lock(&smmu->streams_mutex);
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index d0b0a15dba84..b50b526b45ac 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1056,6 +1056,40 @@ int iommu_group_unregister_notifier(struct iommu_group *group,
 }
 EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
 
+/*
+ * iommu_update_device_fault_handler - Update the device fault handler via flags
+ * @dev: the device
+ * @mask: bits(not set) to clear
+ * @set: bits to set
+ *
+ * Update the device fault handler installed by
+ * iommu_register_device_fault_handler().
+ *
+ * Return 0 on success, or an error.
+ */
+int iommu_update_device_fault_handler(struct device *dev, u32 mask, u32 set)
+{
+	struct dev_iommu *param = dev->iommu;
+	int ret = 0;
+
+	if (!param)
+		return -EINVAL;
+
+	mutex_lock(&param->lock);
+
+	if (param->fault_param) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	param->fault_param->flags = (param->fault_param->flags & mask) | set;
+
+out_unlock:
+	mutex_unlock(&param->lock);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_update_device_fault_handler);
+
 /**
  * iommu_register_device_fault_handler() - Register a device fault handler
  * @dev: the device
@@ -1076,11 +1110,16 @@ EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
  */
 int iommu_register_device_fault_handler(struct device *dev,
 					iommu_dev_fault_handler_t handler,
-					void *data)
+					u32 flags, void *data)
 {
 	struct dev_iommu *param = dev->iommu;
 	int ret = 0;
 
+	/* Only under one configuration. */
+	if (flags & FAULT_REPORT_FLAT &&
+	    flags & (FAULT_REPORT_NESTED_L1 | FAULT_REPORT_NESTED_L2))
+		return -EINVAL;
+
 	if (!param)
 		return -EINVAL;
 
@@ -1099,6 +1138,7 @@ int iommu_register_device_fault_handler(struct device *dev,
 		goto done_unlock;
 	}
 	param->fault_param->handler = handler;
+	param->fault_param->flags = flags;
 	param->fault_param->data = data;
 	mutex_init(&param->fault_param->lock);
 	INIT_LIST_HEAD(&param->fault_param->faults);
@@ -1177,6 +1217,20 @@ int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
 		goto done_unlock;
 	}
 
+	if (!(fparam->flags & FAULT_REPORT_FLAT)) {
+		bool l2;
+
+		if (evt->fault.type == IOMMU_FAULT_PAGE_REQ)
+			l2 = evt->fault.prm.flags & IOMMU_FAULT_PAGE_REQUEST_L2;
+		if (evt->fault.type == IOMMU_FAULT_DMA_UNRECOV)
+			l2 = evt->fault.event.flags & IOMMU_FAULT_UNRECOV_L2;
+
+		if (l2 && !(fparam->flags & FAULT_REPORT_NESTED_L2))
+			return -EOPNOTSUPP;
+		if (!l2 && !(fparam->flags & FAULT_REPORT_NESTED_L1))
+			return -EOPNOTSUPP;
+	}
+
 	if (evt->fault.type == IOMMU_FAULT_PAGE_REQ &&
 	    (evt->fault.prm.flags & IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE)) {
 		evt_pending = kmemdup(evt, sizeof(struct iommu_fault_event),
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 86d688c4418f..28dbca3c6d60 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -352,12 +352,19 @@ struct iommu_fault_event {
 /**
  * struct iommu_fault_param - per-device IOMMU fault data
  * @handler: Callback function to handle IOMMU faults at device level
+ * @flags: FAULT_REPORT_ indicates the fault reporting capability under
+ *	   a specific configuration (1st/2nd-level-only(FLAT), or nested).
+ *	   Nested mode needs to specify which level/stage is concerned.
  * @data: handler private data
  * @faults: holds the pending faults which needs response
  * @lock: protect pending faults list
  */
 struct iommu_fault_param {
 	iommu_dev_fault_handler_t handler;
+#define FAULT_REPORT_FLAT			(1 << 0)
+#define FAULT_REPORT_NESTED_L1			(1 << 1)
+#define FAULT_REPORT_NESTED_L2			(1 << 2)
+	u32 flags;
 	void *data;
 	struct list_head faults;
 	struct mutex lock;
@@ -509,9 +516,11 @@ extern int iommu_group_register_notifier(struct iommu_group *group,
 					 struct notifier_block *nb);
 extern int iommu_group_unregister_notifier(struct iommu_group *group,
 					   struct notifier_block *nb);
+extern int iommu_update_device_fault_handler(struct device *dev,
+					     u32 mask, u32 set);
 extern int iommu_register_device_fault_handler(struct device *dev,
 					iommu_dev_fault_handler_t handler,
-					void *data);
+					u32 flags, void *data);
 
 extern int iommu_unregister_device_fault_handler(struct device *dev);
 
@@ -873,10 +882,16 @@ static inline int iommu_group_unregister_notifier(struct iommu_group *group,
 	return 0;
 }
 
+static inline int iommu_update_device_fault_handler(struct device *dev,
+						    u32 mask, u32 set)
+{
+	return -ENODEV;
+}
+
 static inline
 int iommu_register_device_fault_handler(struct device *dev,
 					iommu_dev_fault_handler_t handler,
-					void *data)
+					u32 flags, void *data)
 {
 	return -ENODEV;
 }
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index e1d9e75f2c94..f6db9c8c970c 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -71,6 +71,7 @@ struct iommu_fault_unrecoverable {
 #define IOMMU_FAULT_UNRECOV_PASID_VALID		(1 << 0)
 #define IOMMU_FAULT_UNRECOV_ADDR_VALID		(1 << 1)
 #define IOMMU_FAULT_UNRECOV_FETCH_ADDR_VALID	(1 << 2)
+#define IOMMU_FAULT_UNRECOV_L2			(1 << 3)
 	__u32	flags;
 	__u32	pasid;
 	__u32	perm;
@@ -85,6 +86,8 @@ struct iommu_fault_unrecoverable {
  *         When IOMMU_FAULT_PAGE_RESPONSE_NEEDS_PASID is set, the page response
  *         must have the same PASID value as the page request. When it is clear,
  *         the page response should not have a PASID.
+ *         If IOMMU_FAULT_PAGE_REQUEST_L2 is set, the fault occurred at the
+ *         second level/stage, otherwise, occurred at the first level.
  * @pasid: Process Address Space ID
  * @grpid: Page Request Group Index
  * @perm: requested page permissions (IOMMU_FAULT_PERM_* values)
@@ -96,6 +99,7 @@ struct iommu_fault_page_request {
 #define IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE	(1 << 1)
 #define IOMMU_FAULT_PAGE_REQUEST_PRIV_DATA	(1 << 2)
 #define IOMMU_FAULT_PAGE_RESPONSE_NEEDS_PASID	(1 << 3)
+#define IOMMU_FAULT_PAGE_REQUEST_L2		(1 << 4)
 	__u32	flags;
 	__u32	pasid;
 	__u32	grpid;

From patchwork Fri Apr  9 03:44:14 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Shenming Lu <lushenming@huawei.com>
X-Patchwork-Id: 12192935
Return-Path: <kvm-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-13.9 required=3.0 tests=BAYES_00,
	HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH,
	MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,UNWANTED_LANGUAGE_BODY,
	URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no
	version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 41C0FC43470
	for <kvm@archiver.kernel.org>; Fri,  9 Apr 2021 03:44:47 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 1D24561055
	for <kvm@archiver.kernel.org>; Fri,  9 Apr 2021 03:44:47 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S233263AbhDIDo5 (ORCPT <rfc822;kvm@archiver.kernel.org>);
        Thu, 8 Apr 2021 23:44:57 -0400
Received: from szxga04-in.huawei.com ([45.249.212.190]:15636 "EHLO
        szxga04-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S232937AbhDIDoz (ORCPT <rfc822;kvm@vger.kernel.org>);
        Thu, 8 Apr 2021 23:44:55 -0400
Received: from DGGEMS414-HUB.china.huawei.com (unknown [172.30.72.59])
        by szxga04-in.huawei.com (SkyGuard) with ESMTP id 4FGkTF1XZ0znZ7R;
        Fri,  9 Apr 2021 11:41:53 +0800 (CST)
Received: from DESKTOP-7FEPK9S.china.huawei.com (10.174.184.135) by
 DGGEMS414-HUB.china.huawei.com (10.3.19.214) with Microsoft SMTP Server id
 14.3.498.0; Fri, 9 Apr 2021 11:44:34 +0800
From: Shenming Lu <lushenming@huawei.com>
To: Alex Williamson <alex.williamson@redhat.com>,
        Cornelia Huck <cohuck@redhat.com>,
        Will Deacon <will@kernel.org>,
        Robin Murphy <robin.murphy@arm.com>,
        Joerg Roedel <joro@8bytes.org>,
        Jean-Philippe Brucker <jean-philippe@linaro.org>,
        Eric Auger <eric.auger@redhat.com>, <kvm@vger.kernel.org>,
        <linux-kernel@vger.kernel.org>,
        <linux-arm-kernel@lists.infradead.org>,
        <iommu@lists.linux-foundation.org>, <linux-api@vger.kernel.org>
CC: Kevin Tian <kevin.tian@intel.com>,
        Lu Baolu <baolu.lu@linux.intel.com>, <yi.l.liu@intel.com>,
        Christoph Hellwig <hch@infradead.org>,
        Jonathan Cameron <Jonathan.Cameron@huawei.com>,
        Barry Song <song.bao.hua@hisilicon.com>,
        <wanghaibin.wang@huawei.com>, <yuzenghui@huawei.com>,
        <lushenming@huawei.com>
Subject: [RFC PATCH v3 2/8] vfio/type1: Add a page fault handler
Date: Fri, 9 Apr 2021 11:44:14 +0800
Message-ID: <20210409034420.1799-3-lushenming@huawei.com>
X-Mailer: git-send-email 2.27.0.windows.1
In-Reply-To: <20210409034420.1799-1-lushenming@huawei.com>
References: <20210409034420.1799-1-lushenming@huawei.com>
MIME-Version: 1.0
X-Originating-IP: [10.174.184.135]
X-CFilter-Loop: Reflected
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

VFIO manages the DMA mapping itself. To support IOPF (on-demand paging)
for VFIO (IOMMU capable) devices, we add a VFIO page fault handler to
serve the reported page faults from the IOMMU driver.

Signed-off-by: Shenming Lu <lushenming@huawei.com>
---
 drivers/vfio/vfio_iommu_type1.c | 114 ++++++++++++++++++++++++++++++++
 1 file changed, 114 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 45cbfd4879a5..ab0ff60ee207 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -101,6 +101,7 @@ struct vfio_dma {
 	struct task_struct	*task;
 	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
 	unsigned long		*bitmap;
+	unsigned long		*iopf_mapped_bitmap;
 };
 
 struct vfio_batch {
@@ -141,6 +142,16 @@ struct vfio_regions {
 	size_t len;
 };
 
+/* A global IOPF enabled group list */
+static struct rb_root iopf_group_list = RB_ROOT;
+static DEFINE_MUTEX(iopf_group_list_lock);
+
+struct vfio_iopf_group {
+	struct rb_node		node;
+	struct iommu_group	*iommu_group;
+	struct vfio_iommu	*iommu;
+};
+
 #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
 					(!list_empty(&iommu->domain_list))
 
@@ -157,6 +168,10 @@ struct vfio_regions {
 #define DIRTY_BITMAP_PAGES_MAX	 ((u64)INT_MAX)
 #define DIRTY_BITMAP_SIZE_MAX	 DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
 
+#define IOPF_MAPPED_BITMAP_GET(dma, i)	\
+			      ((dma->iopf_mapped_bitmap[(i) / BITS_PER_LONG]	\
+			       >> ((i) % BITS_PER_LONG)) & 0x1)
+
 #define WAITED 1
 
 static int put_pfn(unsigned long pfn, int prot);
@@ -416,6 +431,34 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
 	return ret;
 }
 
+/*
+ * Helper functions for iopf_group_list
+ */
+static struct vfio_iopf_group *
+vfio_find_iopf_group(struct iommu_group *iommu_group)
+{
+	struct vfio_iopf_group *iopf_group;
+	struct rb_node *node;
+
+	mutex_lock(&iopf_group_list_lock);
+
+	node = iopf_group_list.rb_node;
+
+	while (node) {
+		iopf_group = rb_entry(node, struct vfio_iopf_group, node);
+
+		if (iommu_group < iopf_group->iommu_group)
+			node = node->rb_left;
+		else if (iommu_group > iopf_group->iommu_group)
+			node = node->rb_right;
+		else
+			break;
+	}
+
+	mutex_unlock(&iopf_group_list_lock);
+	return node ? iopf_group : NULL;
+}
+
 static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
 {
 	struct mm_struct *mm;
@@ -3106,6 +3149,77 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
 	return -EINVAL;
 }
 
+/* VFIO I/O Page Fault handler */
+static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
+{
+	struct device *dev = (struct device *)data;
+	struct iommu_group *iommu_group;
+	struct vfio_iopf_group *iopf_group;
+	struct vfio_iommu *iommu;
+	struct vfio_dma *dma;
+	dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE);
+	int access_flags = 0;
+	unsigned long bit_offset, vaddr, pfn;
+	int ret;
+	enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID;
+	struct iommu_page_response resp = {0};
+
+	if (fault->type != IOMMU_FAULT_PAGE_REQ)
+		return -EOPNOTSUPP;
+
+	iommu_group = iommu_group_get(dev);
+	if (!iommu_group)
+		return -ENODEV;
+
+	iopf_group = vfio_find_iopf_group(iommu_group);
+	iommu_group_put(iommu_group);
+	if (!iopf_group)
+		return -ENODEV;
+
+	iommu = iopf_group->iommu;
+
+	mutex_lock(&iommu->lock);
+
+	ret = vfio_find_dma_valid(iommu, iova, PAGE_SIZE, &dma);
+	if (ret < 0)
+		goto out_invalid;
+
+	if (fault->prm.perm & IOMMU_FAULT_PERM_READ)
+		access_flags |= IOMMU_READ;
+	if (fault->prm.perm & IOMMU_FAULT_PERM_WRITE)
+		access_flags |= IOMMU_WRITE;
+	if ((dma->prot & access_flags) != access_flags)
+		goto out_invalid;
+
+	bit_offset = (iova - dma->iova) >> PAGE_SHIFT;
+	if (IOPF_MAPPED_BITMAP_GET(dma, bit_offset))
+		goto out_success;
+
+	vaddr = iova - dma->iova + dma->vaddr;
+
+	if (vfio_pin_page_external(dma, vaddr, &pfn, true))
+		goto out_invalid;
+
+	if (vfio_iommu_map(iommu, iova, pfn, 1, dma->prot)) {
+		if (put_pfn(pfn, dma->prot))
+			vfio_lock_acct(dma, -1, true);
+		goto out_invalid;
+	}
+
+	bitmap_set(dma->iopf_mapped_bitmap, bit_offset, 1);
+
+out_success:
+	status = IOMMU_PAGE_RESP_SUCCESS;
+
+out_invalid:
+	mutex_unlock(&iommu->lock);
+	resp.version		= IOMMU_PAGE_RESP_VERSION_1;
+	resp.grpid		= fault->prm.grpid;
+	resp.code		= status;
+	iommu_page_response(dev, &resp);
+	return 0;
+}
+
 static long vfio_iommu_type1_ioctl(void *iommu_data,
 				   unsigned int cmd, unsigned long arg)
 {

From patchwork Fri Apr  9 03:44:15 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Shenming Lu <lushenming@huawei.com>
X-Patchwork-Id: 12192941
Return-Path: <kvm-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-16.7 required=3.0 tests=BAYES_00,
	HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH,
	MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT
	autolearn=unavailable autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 4800EC433ED
	for <kvm@archiver.kernel.org>; Fri,  9 Apr 2021 03:45:20 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 20C2E61055
	for <kvm@archiver.kernel.org>; Fri,  9 Apr 2021 03:45:20 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S233308AbhDIDpJ (ORCPT <rfc822;kvm@archiver.kernel.org>);
        Thu, 8 Apr 2021 23:45:09 -0400
Received: from szxga04-in.huawei.com ([45.249.212.190]:15641 "EHLO
        szxga04-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S232662AbhDIDo7 (ORCPT <rfc822;kvm@vger.kernel.org>);
        Thu, 8 Apr 2021 23:44:59 -0400
Received: from DGGEMS414-HUB.china.huawei.com (unknown [172.30.72.60])
        by szxga04-in.huawei.com (SkyGuard) with ESMTP id 4FGkTL29g4znYts;
        Fri,  9 Apr 2021 11:41:58 +0800 (CST)
Received: from DESKTOP-7FEPK9S.china.huawei.com (10.174.184.135) by
 DGGEMS414-HUB.china.huawei.com (10.3.19.214) with Microsoft SMTP Server id
 14.3.498.0; Fri, 9 Apr 2021 11:44:36 +0800
From: Shenming Lu <lushenming@huawei.com>
To: Alex Williamson <alex.williamson@redhat.com>,
        Cornelia Huck <cohuck@redhat.com>,
        Will Deacon <will@kernel.org>,
        Robin Murphy <robin.murphy@arm.com>,
        Joerg Roedel <joro@8bytes.org>,
        Jean-Philippe Brucker <jean-philippe@linaro.org>,
        Eric Auger <eric.auger@redhat.com>, <kvm@vger.kernel.org>,
        <linux-kernel@vger.kernel.org>,
        <linux-arm-kernel@lists.infradead.org>,
        <iommu@lists.linux-foundation.org>, <linux-api@vger.kernel.org>
CC: Kevin Tian <kevin.tian@intel.com>,
        Lu Baolu <baolu.lu@linux.intel.com>, <yi.l.liu@intel.com>,
        Christoph Hellwig <hch@infradead.org>,
        Jonathan Cameron <Jonathan.Cameron@huawei.com>,
        Barry Song <song.bao.hua@hisilicon.com>,
        <wanghaibin.wang@huawei.com>, <yuzenghui@huawei.com>,
        <lushenming@huawei.com>
Subject: [RFC PATCH v3 3/8] vfio/type1: Add an MMU notifier to avoid pinning
Date: Fri, 9 Apr 2021 11:44:15 +0800
Message-ID: <20210409034420.1799-4-lushenming@huawei.com>
X-Mailer: git-send-email 2.27.0.windows.1
In-Reply-To: <20210409034420.1799-1-lushenming@huawei.com>
References: <20210409034420.1799-1-lushenming@huawei.com>
MIME-Version: 1.0
X-Originating-IP: [10.174.184.135]
X-CFilter-Loop: Reflected
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

To avoid pinning pages when they are mapped in IOMMU page tables, we
add an MMU notifier to tell the addresses which are no longer valid
and try to unmap them.

Signed-off-by: Shenming Lu <lushenming@huawei.com>
---
 drivers/vfio/vfio_iommu_type1.c | 112 +++++++++++++++++++++++++++++++-
 1 file changed, 109 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index ab0ff60ee207..1cb9d1f2717b 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -40,6 +40,7 @@
 #include <linux/notifier.h>
 #include <linux/dma-iommu.h>
 #include <linux/irqdomain.h>
+#include <linux/mmu_notifier.h>
 
 #define DRIVER_VERSION  "0.2"
 #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
@@ -69,6 +70,7 @@ struct vfio_iommu {
 	struct mutex		lock;
 	struct rb_root		dma_list;
 	struct blocking_notifier_head notifier;
+	struct mmu_notifier	mn;
 	unsigned int		dma_avail;
 	unsigned int		vaddr_invalid_count;
 	uint64_t		pgsize_bitmap;
@@ -1204,6 +1206,72 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
 	return unlocked;
 }
 
+/* Unmap the IOPF mapped pages in the specified range. */
+static void vfio_unmap_partial_iopf(struct vfio_iommu *iommu,
+				    struct vfio_dma *dma,
+				    dma_addr_t start, dma_addr_t end)
+{
+	struct iommu_iotlb_gather *gathers;
+	struct vfio_domain *d;
+	int i, num_domains = 0;
+
+	list_for_each_entry(d, &iommu->domain_list, next)
+		num_domains++;
+
+	gathers = kzalloc(sizeof(*gathers) * num_domains, GFP_KERNEL);
+	if (gathers) {
+		for (i = 0; i < num_domains; i++)
+			iommu_iotlb_gather_init(&gathers[i]);
+	}
+
+	while (start < end) {
+		unsigned long bit_offset;
+		size_t len;
+
+		bit_offset = (start - dma->iova) >> PAGE_SHIFT;
+
+		for (len = 0; start + len < end; len += PAGE_SIZE) {
+			if (!IOPF_MAPPED_BITMAP_GET(dma,
+					bit_offset + (len >> PAGE_SHIFT)))
+				break;
+		}
+
+		if (len) {
+			i = 0;
+			list_for_each_entry(d, &iommu->domain_list, next) {
+				size_t unmapped;
+
+				if (gathers)
+					unmapped = iommu_unmap_fast(d->domain,
+								    start, len,
+								    &gathers[i++]);
+				else
+					unmapped = iommu_unmap(d->domain,
+							       start, len);
+
+				if (WARN_ON(unmapped != len))
+					goto out;
+			}
+
+			bitmap_clear(dma->iopf_mapped_bitmap,
+				     bit_offset, len >> PAGE_SHIFT);
+
+			cond_resched();
+		}
+
+		start += (len + PAGE_SIZE);
+	}
+
+out:
+	if (gathers) {
+		i = 0;
+		list_for_each_entry(d, &iommu->domain_list, next)
+			iommu_iotlb_sync(d->domain, &gathers[i++]);
+
+		kfree(gathers);
+	}
+}
+
 static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
 {
 	WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list));
@@ -3197,17 +3265,18 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
 
 	vaddr = iova - dma->iova + dma->vaddr;
 
-	if (vfio_pin_page_external(dma, vaddr, &pfn, true))
+	if (vfio_pin_page_external(dma, vaddr, &pfn, false))
 		goto out_invalid;
 
 	if (vfio_iommu_map(iommu, iova, pfn, 1, dma->prot)) {
-		if (put_pfn(pfn, dma->prot))
-			vfio_lock_acct(dma, -1, true);
+		put_pfn(pfn, dma->prot);
 		goto out_invalid;
 	}
 
 	bitmap_set(dma->iopf_mapped_bitmap, bit_offset, 1);
 
+	put_pfn(pfn, dma->prot);
+
 out_success:
 	status = IOMMU_PAGE_RESP_SUCCESS;
 
@@ -3220,6 +3289,43 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
 	return 0;
 }
 
+static void mn_invalidate_range(struct mmu_notifier *mn, struct mm_struct *mm,
+				unsigned long start, unsigned long end)
+{
+	struct vfio_iommu *iommu = container_of(mn, struct vfio_iommu, mn);
+	struct rb_node *n;
+	int ret;
+
+	mutex_lock(&iommu->lock);
+
+	ret = vfio_wait_all_valid(iommu);
+	if (WARN_ON(ret < 0))
+		return;
+
+	for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
+		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
+		unsigned long start_n, end_n;
+
+		if (end <= dma->vaddr || start >= dma->vaddr + dma->size)
+			continue;
+
+		start_n = ALIGN_DOWN(max_t(unsigned long, start, dma->vaddr),
+				     PAGE_SIZE);
+		end_n = ALIGN(min_t(unsigned long, end, dma->vaddr + dma->size),
+			      PAGE_SIZE);
+
+		vfio_unmap_partial_iopf(iommu, dma,
+					start_n - dma->vaddr + dma->iova,
+					end_n - dma->vaddr + dma->iova);
+	}
+
+	mutex_unlock(&iommu->lock);
+}
+
+static const struct mmu_notifier_ops vfio_iommu_type1_mn_ops = {
+	.invalidate_range	= mn_invalidate_range,
+};
+
 static long vfio_iommu_type1_ioctl(void *iommu_data,
 				   unsigned int cmd, unsigned long arg)
 {

From patchwork Fri Apr  9 03:44:16 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Shenming Lu <lushenming@huawei.com>
X-Patchwork-Id: 12192937
Return-Path: <kvm-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-16.7 required=3.0 tests=BAYES_00,
	HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH,
	MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT
	autolearn=unavailable autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 2CE53C43462
	for <kvm@archiver.kernel.org>; Fri,  9 Apr 2021 03:44:56 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 0794161042
	for <kvm@archiver.kernel.org>; Fri,  9 Apr 2021 03:44:56 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S233310AbhDIDpF (ORCPT <rfc822;kvm@archiver.kernel.org>);
        Thu, 8 Apr 2021 23:45:05 -0400
Received: from szxga04-in.huawei.com ([45.249.212.190]:15640 "EHLO
        szxga04-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S233273AbhDIDo7 (ORCPT <rfc822;kvm@vger.kernel.org>);
        Thu, 8 Apr 2021 23:44:59 -0400
Received: from DGGEMS414-HUB.china.huawei.com (unknown [172.30.72.60])
        by szxga04-in.huawei.com (SkyGuard) with ESMTP id 4FGkTL39vRznYwQ;
        Fri,  9 Apr 2021 11:41:58 +0800 (CST)
Received: from DESKTOP-7FEPK9S.china.huawei.com (10.174.184.135) by
 DGGEMS414-HUB.china.huawei.com (10.3.19.214) with Microsoft SMTP Server id
 14.3.498.0; Fri, 9 Apr 2021 11:44:38 +0800
From: Shenming Lu <lushenming@huawei.com>
To: Alex Williamson <alex.williamson@redhat.com>,
        Cornelia Huck <cohuck@redhat.com>,
        Will Deacon <will@kernel.org>,
        Robin Murphy <robin.murphy@arm.com>,
        Joerg Roedel <joro@8bytes.org>,
        Jean-Philippe Brucker <jean-philippe@linaro.org>,
        Eric Auger <eric.auger@redhat.com>, <kvm@vger.kernel.org>,
        <linux-kernel@vger.kernel.org>,
        <linux-arm-kernel@lists.infradead.org>,
        <iommu@lists.linux-foundation.org>, <linux-api@vger.kernel.org>
CC: Kevin Tian <kevin.tian@intel.com>,
        Lu Baolu <baolu.lu@linux.intel.com>, <yi.l.liu@intel.com>,
        Christoph Hellwig <hch@infradead.org>,
        Jonathan Cameron <Jonathan.Cameron@huawei.com>,
        Barry Song <song.bao.hua@hisilicon.com>,
        <wanghaibin.wang@huawei.com>, <yuzenghui@huawei.com>,
        <lushenming@huawei.com>
Subject: [RFC PATCH v3 4/8] vfio/type1: Pre-map more pages than requested in
 the IOPF handling
Date: Fri, 9 Apr 2021 11:44:16 +0800
Message-ID: <20210409034420.1799-5-lushenming@huawei.com>
X-Mailer: git-send-email 2.27.0.windows.1
In-Reply-To: <20210409034420.1799-1-lushenming@huawei.com>
References: <20210409034420.1799-1-lushenming@huawei.com>
MIME-Version: 1.0
X-Originating-IP: [10.174.184.135]
X-CFilter-Loop: Reflected
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

To optimize for fewer page fault handlings, we can pre-map more pages
than requested at once.

Note that IOPF_PREMAP_LEN is just an arbitrary value for now, which we
could try further tuning.

Signed-off-by: Shenming Lu <lushenming@huawei.com>
---
 drivers/vfio/vfio_iommu_type1.c | 131 ++++++++++++++++++++++++++++++--
 1 file changed, 123 insertions(+), 8 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 1cb9d1f2717b..01e296c6dc9e 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -3217,6 +3217,91 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
 	return -EINVAL;
 }
 
+/*
+ * To optimize for fewer page fault handlings, try to
+ * pre-map more pages than requested.
+ */
+#define IOPF_PREMAP_LEN		512
+
+/*
+ * Return 0 on success or a negative error code, the
+ * number of pages contiguously pinned is in @pinned.
+ */
+static int pin_pages_iopf(struct vfio_dma *dma, unsigned long vaddr,
+			  unsigned long npages, unsigned long *pfn_base,
+			  unsigned long *pinned, struct vfio_batch *batch)
+{
+	struct mm_struct *mm;
+	unsigned long pfn;
+	int ret = 0;
+	*pinned = 0;
+
+	mm = get_task_mm(dma->task);
+	if (!mm)
+		return -ENODEV;
+
+	if (batch->size) {
+		*pfn_base = page_to_pfn(batch->pages[batch->offset]);
+		pfn = *pfn_base;
+	} else {
+		*pfn_base = 0;
+	}
+
+	while (npages) {
+		if (!batch->size) {
+			unsigned long req_pages = min_t(unsigned long, npages,
+							batch->capacity);
+
+			ret = vaddr_get_pfns(mm, vaddr, req_pages, dma->prot,
+					     &pfn, batch->pages);
+			if (ret < 0)
+				goto out;
+
+			batch->size = ret;
+			batch->offset = 0;
+			ret = 0;
+
+			if (!*pfn_base)
+				*pfn_base = pfn;
+		}
+
+		while (true) {
+			if (pfn != *pfn_base + *pinned)
+				goto out;
+
+			(*pinned)++;
+			npages--;
+			vaddr += PAGE_SIZE;
+			batch->offset++;
+			batch->size--;
+
+			if (!batch->size)
+				break;
+
+			pfn = page_to_pfn(batch->pages[batch->offset]);
+		}
+
+		if (unlikely(disable_hugepages))
+			break;
+	}
+
+out:
+	if (batch->size == 1 && !batch->offset) {
+		put_pfn(pfn, dma->prot);
+		batch->size = 0;
+	}
+
+	mmput(mm);
+	return ret;
+}
+
+static void unpin_pages_iopf(struct vfio_dma *dma,
+			     unsigned long pfn, unsigned long npages)
+{
+	while (npages--)
+		put_pfn(pfn++, dma->prot);
+}
+
 /* VFIO I/O Page Fault handler */
 static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
 {
@@ -3225,9 +3310,11 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
 	struct vfio_iopf_group *iopf_group;
 	struct vfio_iommu *iommu;
 	struct vfio_dma *dma;
+	struct vfio_batch batch;
 	dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE);
 	int access_flags = 0;
-	unsigned long bit_offset, vaddr, pfn;
+	size_t premap_len, map_len, mapped_len = 0;
+	unsigned long bit_offset, vaddr, pfn, i, npages;
 	int ret;
 	enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID;
 	struct iommu_page_response resp = {0};
@@ -3263,19 +3350,47 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
 	if (IOPF_MAPPED_BITMAP_GET(dma, bit_offset))
 		goto out_success;
 
+	premap_len = IOPF_PREMAP_LEN << PAGE_SHIFT;
+	npages = dma->size >> PAGE_SHIFT;
+	map_len = PAGE_SIZE;
+	for (i = bit_offset + 1; i < npages; i++) {
+		if (map_len >= premap_len || IOPF_MAPPED_BITMAP_GET(dma, i))
+			break;
+		map_len += PAGE_SIZE;
+	}
 	vaddr = iova - dma->iova + dma->vaddr;
+	vfio_batch_init(&batch);
 
-	if (vfio_pin_page_external(dma, vaddr, &pfn, false))
-		goto out_invalid;
+	while (map_len) {
+		ret = pin_pages_iopf(dma, vaddr + mapped_len,
+				     map_len >> PAGE_SHIFT, &pfn,
+				     &npages, &batch);
+		if (!npages)
+			break;
 
-	if (vfio_iommu_map(iommu, iova, pfn, 1, dma->prot)) {
-		put_pfn(pfn, dma->prot);
-		goto out_invalid;
+		if (vfio_iommu_map(iommu, iova + mapped_len, pfn,
+				   npages, dma->prot)) {
+			unpin_pages_iopf(dma, pfn, npages);
+			vfio_batch_unpin(&batch, dma);
+			break;
+		}
+
+		bitmap_set(dma->iopf_mapped_bitmap,
+			   bit_offset + (mapped_len >> PAGE_SHIFT), npages);
+
+		unpin_pages_iopf(dma, pfn, npages);
+
+		map_len -= npages << PAGE_SHIFT;
+		mapped_len += npages << PAGE_SHIFT;
+
+		if (ret)
+			break;
 	}
 
-	bitmap_set(dma->iopf_mapped_bitmap, bit_offset, 1);
+	vfio_batch_fini(&batch);
 
-	put_pfn(pfn, dma->prot);
+	if (!mapped_len)
+		goto out_invalid;
 
 out_success:
 	status = IOMMU_PAGE_RESP_SUCCESS;

From patchwork Fri Apr  9 03:44:17 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Shenming Lu <lushenming@huawei.com>
X-Patchwork-Id: 12192939
Return-Path: <kvm-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-16.7 required=3.0 tests=BAYES_00,
	HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH,
	MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT
	autolearn=unavailable autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 81A12C43460
	for <kvm@archiver.kernel.org>; Fri,  9 Apr 2021 03:44:57 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 5C44761055
	for <kvm@archiver.kernel.org>; Fri,  9 Apr 2021 03:44:57 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S233330AbhDIDpI (ORCPT <rfc822;kvm@archiver.kernel.org>);
        Thu, 8 Apr 2021 23:45:08 -0400
Received: from szxga04-in.huawei.com ([45.249.212.190]:15639 "EHLO
        szxga04-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S233272AbhDIDo7 (ORCPT <rfc822;kvm@vger.kernel.org>);
        Thu, 8 Apr 2021 23:44:59 -0400
Received: from DGGEMS414-HUB.china.huawei.com (unknown [172.30.72.60])
        by szxga04-in.huawei.com (SkyGuard) with ESMTP id 4FGkTL2gmrznYv1;
        Fri,  9 Apr 2021 11:41:58 +0800 (CST)
Received: from DESKTOP-7FEPK9S.china.huawei.com (10.174.184.135) by
 DGGEMS414-HUB.china.huawei.com (10.3.19.214) with Microsoft SMTP Server id
 14.3.498.0; Fri, 9 Apr 2021 11:44:39 +0800
From: Shenming Lu <lushenming@huawei.com>
To: Alex Williamson <alex.williamson@redhat.com>,
        Cornelia Huck <cohuck@redhat.com>,
        Will Deacon <will@kernel.org>,
        Robin Murphy <robin.murphy@arm.com>,
        Joerg Roedel <joro@8bytes.org>,
        Jean-Philippe Brucker <jean-philippe@linaro.org>,
        Eric Auger <eric.auger@redhat.com>, <kvm@vger.kernel.org>,
        <linux-kernel@vger.kernel.org>,
        <linux-arm-kernel@lists.infradead.org>,
        <iommu@lists.linux-foundation.org>, <linux-api@vger.kernel.org>
CC: Kevin Tian <kevin.tian@intel.com>,
        Lu Baolu <baolu.lu@linux.intel.com>, <yi.l.liu@intel.com>,
        Christoph Hellwig <hch@infradead.org>,
        Jonathan Cameron <Jonathan.Cameron@huawei.com>,
        Barry Song <song.bao.hua@hisilicon.com>,
        <wanghaibin.wang@huawei.com>, <yuzenghui@huawei.com>,
        <lushenming@huawei.com>
Subject: [RFC PATCH v3 5/8] vfio/type1: VFIO_IOMMU_ENABLE_IOPF
Date: Fri, 9 Apr 2021 11:44:17 +0800
Message-ID: <20210409034420.1799-6-lushenming@huawei.com>
X-Mailer: git-send-email 2.27.0.windows.1
In-Reply-To: <20210409034420.1799-1-lushenming@huawei.com>
References: <20210409034420.1799-1-lushenming@huawei.com>
MIME-Version: 1.0
X-Originating-IP: [10.174.184.135]
X-CFilter-Loop: Reflected
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

Since enabling IOPF for devices may lead to a slow ramp up of performance,
we add an ioctl VFIO_IOMMU_ENABLE_IOPF to make it configurable. And the
IOPF enabling of a VFIO device includes setting IOMMU_DEV_FEAT_IOPF and
registering the VFIO IOPF handler.

Note that VFIO_IOMMU_DISABLE_IOPF is not supported since there may be
inflight page faults when disabling.

Signed-off-by: Shenming Lu <lushenming@huawei.com>
---
 drivers/vfio/vfio_iommu_type1.c | 223 +++++++++++++++++++++++++++++++-
 include/uapi/linux/vfio.h       |   6 +
 2 files changed, 226 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 01e296c6dc9e..7df5711e743a 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -71,6 +71,7 @@ struct vfio_iommu {
 	struct rb_root		dma_list;
 	struct blocking_notifier_head notifier;
 	struct mmu_notifier	mn;
+	struct mm_struct	*mm;
 	unsigned int		dma_avail;
 	unsigned int		vaddr_invalid_count;
 	uint64_t		pgsize_bitmap;
@@ -81,6 +82,7 @@ struct vfio_iommu {
 	bool			dirty_page_tracking;
 	bool			pinned_page_dirty_scope;
 	bool			container_open;
+	bool			iopf_enabled;
 };
 
 struct vfio_domain {
@@ -461,6 +463,38 @@ vfio_find_iopf_group(struct iommu_group *iommu_group)
 	return node ? iopf_group : NULL;
 }
 
+static void vfio_link_iopf_group(struct vfio_iopf_group *new)
+{
+	struct rb_node **link, *parent = NULL;
+	struct vfio_iopf_group *iopf_group;
+
+	mutex_lock(&iopf_group_list_lock);
+
+	link = &iopf_group_list.rb_node;
+
+	while (*link) {
+		parent = *link;
+		iopf_group = rb_entry(parent, struct vfio_iopf_group, node);
+
+		if (new->iommu_group < iopf_group->iommu_group)
+			link = &(*link)->rb_left;
+		else
+			link = &(*link)->rb_right;
+	}
+
+	rb_link_node(&new->node, parent, link);
+	rb_insert_color(&new->node, &iopf_group_list);
+
+	mutex_unlock(&iopf_group_list_lock);
+}
+
+static void vfio_unlink_iopf_group(struct vfio_iopf_group *old)
+{
+	mutex_lock(&iopf_group_list_lock);
+	rb_erase(&old->node, &iopf_group_list);
+	mutex_unlock(&iopf_group_list_lock);
+}
+
 static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
 {
 	struct mm_struct *mm;
@@ -2363,6 +2397,68 @@ static void vfio_iommu_iova_insert_copy(struct vfio_iommu *iommu,
 	list_splice_tail(iova_copy, iova);
 }
 
+static int vfio_dev_domian_nested(struct device *dev, int *nested)
+{
+	struct iommu_domain *domain;
+
+	domain = iommu_get_domain_for_dev(dev);
+	if (!domain)
+		return -ENODEV;
+
+	return iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, nested);
+}
+
+static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data);
+
+static int dev_enable_iopf(struct device *dev, void *data)
+{
+	int *enabled_dev_cnt = data;
+	int nested;
+	u32 flags;
+	int ret;
+
+	ret = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_IOPF);
+	if (ret)
+		return ret;
+
+	ret = vfio_dev_domian_nested(dev, &nested);
+	if (ret)
+		goto out_disable;
+
+	if (nested)
+		flags = FAULT_REPORT_NESTED_L2;
+	else
+		flags = FAULT_REPORT_FLAT;
+
+	ret = iommu_register_device_fault_handler(dev,
+				vfio_iommu_type1_dma_map_iopf, flags, dev);
+	if (ret)
+		goto out_disable;
+
+	(*enabled_dev_cnt)++;
+	return 0;
+
+out_disable:
+	iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_IOPF);
+	return ret;
+}
+
+static int dev_disable_iopf(struct device *dev, void *data)
+{
+	int *enabled_dev_cnt = data;
+
+	if (enabled_dev_cnt && *enabled_dev_cnt <= 0)
+		return -1;
+
+	WARN_ON(iommu_unregister_device_fault_handler(dev));
+	WARN_ON(iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_IOPF));
+
+	if (enabled_dev_cnt)
+		(*enabled_dev_cnt)--;
+
+	return 0;
+}
+
 static int vfio_iommu_type1_attach_group(void *iommu_data,
 					 struct iommu_group *iommu_group)
 {
@@ -2376,6 +2472,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	struct iommu_domain_geometry geo;
 	LIST_HEAD(iova_copy);
 	LIST_HEAD(group_resv_regions);
+	int iopf_enabled_dev_cnt = 0;
+	struct vfio_iopf_group *iopf_group = NULL;
 
 	mutex_lock(&iommu->lock);
 
@@ -2453,6 +2551,24 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	if (ret)
 		goto out_domain;
 
+	if (iommu->iopf_enabled) {
+		ret = iommu_group_for_each_dev(iommu_group, &iopf_enabled_dev_cnt,
+					       dev_enable_iopf);
+		if (ret)
+			goto out_detach;
+
+		iopf_group = kzalloc(sizeof(*iopf_group), GFP_KERNEL);
+		if (!iopf_group) {
+			ret = -ENOMEM;
+			goto out_detach;
+		}
+
+		iopf_group->iommu_group = iommu_group;
+		iopf_group->iommu = iommu;
+
+		vfio_link_iopf_group(iopf_group);
+	}
+
 	/* Get aperture info */
 	iommu_domain_get_attr(domain->domain, DOMAIN_ATTR_GEOMETRY, &geo);
 
@@ -2534,9 +2650,11 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	vfio_test_domain_fgsp(domain);
 
 	/* replay mappings on new domains */
-	ret = vfio_iommu_replay(iommu, domain);
-	if (ret)
-		goto out_detach;
+	if (!iommu->iopf_enabled) {
+		ret = vfio_iommu_replay(iommu, domain);
+		if (ret)
+			goto out_detach;
+	}
 
 	if (resv_msi) {
 		ret = iommu_get_msi_cookie(domain->domain, resv_msi_base);
@@ -2567,6 +2685,15 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	iommu_domain_free(domain->domain);
 	vfio_iommu_iova_free(&iova_copy);
 	vfio_iommu_resv_free(&group_resv_regions);
+	if (iommu->iopf_enabled) {
+		if (iopf_group) {
+			vfio_unlink_iopf_group(iopf_group);
+			kfree(iopf_group);
+		}
+
+		iommu_group_for_each_dev(iommu_group, &iopf_enabled_dev_cnt,
+					 dev_disable_iopf);
+	}
 out_free:
 	kfree(domain);
 	kfree(group);
@@ -2728,6 +2855,19 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
 		if (!group)
 			continue;
 
+		if (iommu->iopf_enabled) {
+			struct vfio_iopf_group *iopf_group;
+
+			iopf_group = vfio_find_iopf_group(iommu_group);
+			if (!WARN_ON(!iopf_group)) {
+				vfio_unlink_iopf_group(iopf_group);
+				kfree(iopf_group);
+			}
+
+			iommu_group_for_each_dev(iommu_group, NULL,
+						 dev_disable_iopf);
+		}
+
 		vfio_iommu_detach_group(domain, group);
 		update_dirty_scope = !group->pinned_page_dirty_scope;
 		list_del(&group->next);
@@ -2846,6 +2986,11 @@ static void vfio_iommu_type1_release(void *iommu_data)
 
 	vfio_iommu_iova_free(&iommu->iova_list);
 
+	if (iommu->iopf_enabled) {
+		mmu_notifier_unregister(&iommu->mn, iommu->mm);
+		mmdrop(iommu->mm);
+	}
+
 	kfree(iommu);
 }
 
@@ -3441,6 +3586,76 @@ static const struct mmu_notifier_ops vfio_iommu_type1_mn_ops = {
 	.invalidate_range	= mn_invalidate_range,
 };
 
+static int vfio_iommu_type1_enable_iopf(struct vfio_iommu *iommu)
+{
+	struct vfio_domain *d;
+	struct vfio_group *g;
+	struct vfio_iopf_group *iopf_group;
+	int enabled_dev_cnt = 0;
+	int ret;
+
+	if (!current->mm)
+		return -ENODEV;
+
+	mutex_lock(&iommu->lock);
+
+	mmgrab(current->mm);
+	iommu->mm = current->mm;
+	iommu->mn.ops = &vfio_iommu_type1_mn_ops;
+	ret = mmu_notifier_register(&iommu->mn, current->mm);
+	if (ret)
+		goto out_drop;
+
+	list_for_each_entry(d, &iommu->domain_list, next) {
+		list_for_each_entry(g, &d->group_list, next) {
+			ret = iommu_group_for_each_dev(g->iommu_group,
+					&enabled_dev_cnt, dev_enable_iopf);
+			if (ret)
+				goto out_unwind;
+
+			iopf_group = kzalloc(sizeof(*iopf_group), GFP_KERNEL);
+			if (!iopf_group) {
+				ret = -ENOMEM;
+				goto out_unwind;
+			}
+
+			iopf_group->iommu_group = g->iommu_group;
+			iopf_group->iommu = iommu;
+
+			vfio_link_iopf_group(iopf_group);
+		}
+	}
+
+	iommu->iopf_enabled = true;
+	goto out_unlock;
+
+out_unwind:
+	list_for_each_entry(d, &iommu->domain_list, next) {
+		list_for_each_entry(g, &d->group_list, next) {
+			iopf_group = vfio_find_iopf_group(g->iommu_group);
+			if (iopf_group) {
+				vfio_unlink_iopf_group(iopf_group);
+				kfree(iopf_group);
+			}
+
+			if (iommu_group_for_each_dev(g->iommu_group,
+					&enabled_dev_cnt, dev_disable_iopf))
+				goto out_unregister;
+		}
+	}
+
+out_unregister:
+	mmu_notifier_unregister(&iommu->mn, current->mm);
+
+out_drop:
+	iommu->mm = NULL;
+	mmdrop(current->mm);
+
+out_unlock:
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
 static long vfio_iommu_type1_ioctl(void *iommu_data,
 				   unsigned int cmd, unsigned long arg)
 {
@@ -3457,6 +3672,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 		return vfio_iommu_type1_unmap_dma(iommu, arg);
 	case VFIO_IOMMU_DIRTY_PAGES:
 		return vfio_iommu_type1_dirty_pages(iommu, arg);
+	case VFIO_IOMMU_ENABLE_IOPF:
+		return vfio_iommu_type1_enable_iopf(iommu);
 	default:
 		return -ENOTTY;
 	}
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 8ce36c1d53ca..5497036bebdc 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1208,6 +1208,12 @@ struct vfio_iommu_type1_dirty_bitmap_get {
 
 #define VFIO_IOMMU_DIRTY_PAGES             _IO(VFIO_TYPE, VFIO_BASE + 17)
 
+/*
+ * IOCTL to enable IOPF for the container.
+ * Called right after VFIO_SET_IOMMU.
+ */
+#define VFIO_IOMMU_ENABLE_IOPF             _IO(VFIO_TYPE, VFIO_BASE + 18)
+
 /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
 
 /*

From patchwork Fri Apr  9 03:44:18 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Shenming Lu <lushenming@huawei.com>
X-Patchwork-Id: 12192945
Return-Path: <kvm-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-16.7 required=3.0 tests=BAYES_00,
	HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH,
	MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT
	autolearn=unavailable autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 024EBC433ED
	for <kvm@archiver.kernel.org>; Fri,  9 Apr 2021 03:45:27 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id B686061042
	for <kvm@archiver.kernel.org>; Fri,  9 Apr 2021 03:45:26 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S233403AbhDIDpg (ORCPT <rfc822;kvm@archiver.kernel.org>);
        Thu, 8 Apr 2021 23:45:36 -0400
Received: from szxga06-in.huawei.com ([45.249.212.32]:16423 "EHLO
        szxga06-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S233173AbhDIDpF (ORCPT <rfc822;kvm@vger.kernel.org>);
        Thu, 8 Apr 2021 23:45:05 -0400
Received: from DGGEMS414-HUB.china.huawei.com (unknown [172.30.72.60])
        by szxga06-in.huawei.com (SkyGuard) with ESMTP id 4FGkVZ6mXCzlWqh;
        Fri,  9 Apr 2021 11:43:02 +0800 (CST)
Received: from DESKTOP-7FEPK9S.china.huawei.com (10.174.184.135) by
 DGGEMS414-HUB.china.huawei.com (10.3.19.214) with Microsoft SMTP Server id
 14.3.498.0; Fri, 9 Apr 2021 11:44:41 +0800
From: Shenming Lu <lushenming@huawei.com>
To: Alex Williamson <alex.williamson@redhat.com>,
        Cornelia Huck <cohuck@redhat.com>,
        Will Deacon <will@kernel.org>,
        Robin Murphy <robin.murphy@arm.com>,
        Joerg Roedel <joro@8bytes.org>,
        Jean-Philippe Brucker <jean-philippe@linaro.org>,
        Eric Auger <eric.auger@redhat.com>, <kvm@vger.kernel.org>,
        <linux-kernel@vger.kernel.org>,
        <linux-arm-kernel@lists.infradead.org>,
        <iommu@lists.linux-foundation.org>, <linux-api@vger.kernel.org>
CC: Kevin Tian <kevin.tian@intel.com>,
        Lu Baolu <baolu.lu@linux.intel.com>, <yi.l.liu@intel.com>,
        Christoph Hellwig <hch@infradead.org>,
        Jonathan Cameron <Jonathan.Cameron@huawei.com>,
        Barry Song <song.bao.hua@hisilicon.com>,
        <wanghaibin.wang@huawei.com>, <yuzenghui@huawei.com>,
        <lushenming@huawei.com>
Subject: [RFC PATCH v3 6/8] vfio/type1: No need to statically pin and map if
 IOPF enabled
Date: Fri, 9 Apr 2021 11:44:18 +0800
Message-ID: <20210409034420.1799-7-lushenming@huawei.com>
X-Mailer: git-send-email 2.27.0.windows.1
In-Reply-To: <20210409034420.1799-1-lushenming@huawei.com>
References: <20210409034420.1799-1-lushenming@huawei.com>
MIME-Version: 1.0
X-Originating-IP: [10.174.184.135]
X-CFilter-Loop: Reflected
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

If IOPF enabled for the VFIO container, there is no need to statically
pin and map the entire DMA range, we can do it on demand. And unmap
according to the IOPF mapped bitmap when removing vfio_dma.

Note that we still mark all pages dirty even if IOPF enabled, we may
add IOPF-based fine grained dirty tracking support in the future.

Signed-off-by: Shenming Lu <lushenming@huawei.com>
---
 drivers/vfio/vfio_iommu_type1.c | 38 +++++++++++++++++++++++++++------
 1 file changed, 32 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 7df5711e743a..dcc93c3b258c 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -175,6 +175,7 @@ struct vfio_iopf_group {
 #define IOPF_MAPPED_BITMAP_GET(dma, i)	\
 			      ((dma->iopf_mapped_bitmap[(i) / BITS_PER_LONG]	\
 			       >> ((i) % BITS_PER_LONG)) & 0x1)
+#define IOPF_MAPPED_BITMAP_BYTES(n)	DIRTY_BITMAP_BYTES(n)
 
 #define WAITED 1
 
@@ -959,7 +960,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 	 * already pinned and accounted. Accouting should be done if there is no
 	 * iommu capable domain in the container.
 	 */
-	do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu);
+	do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) ||
+			iommu->iopf_enabled;
 
 	for (i = 0; i < npage; i++) {
 		struct vfio_pfn *vpfn;
@@ -1048,7 +1050,8 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
 
 	mutex_lock(&iommu->lock);
 
-	do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu);
+	do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) ||
+			iommu->iopf_enabled;
 	for (i = 0; i < npage; i++) {
 		struct vfio_dma *dma;
 		dma_addr_t iova;
@@ -1169,7 +1172,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
 	if (!dma->size)
 		return 0;
 
-	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
+	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) || iommu->iopf_enabled)
 		return 0;
 
 	/*
@@ -1306,11 +1309,20 @@ static void vfio_unmap_partial_iopf(struct vfio_iommu *iommu,
 	}
 }
 
+static void vfio_dma_clean_iopf(struct vfio_iommu *iommu, struct vfio_dma *dma)
+{
+	vfio_unmap_partial_iopf(iommu, dma, dma->iova, dma->iova + dma->size);
+
+	kfree(dma->iopf_mapped_bitmap);
+}
+
 static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
 {
 	WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list));
 	vfio_unmap_unpin(iommu, dma, true);
 	vfio_unlink_dma(iommu, dma);
+	if (iommu->iopf_enabled)
+		vfio_dma_clean_iopf(iommu, dma);
 	put_task_struct(dma->task);
 	vfio_dma_bitmap_free(dma);
 	if (dma->vaddr_invalid) {
@@ -1359,7 +1371,8 @@ static int update_user_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
 	 * mark all pages dirty if any IOMMU capable device is not able
 	 * to report dirty pages and all pages are pinned and mapped.
 	 */
-	if (iommu->num_non_pinned_groups && dma->iommu_mapped)
+	if (iommu->num_non_pinned_groups &&
+	    (dma->iommu_mapped || iommu->iopf_enabled))
 		bitmap_set(dma->bitmap, 0, nbits);
 
 	if (shift) {
@@ -1772,6 +1785,16 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 		goto out_unlock;
 	}
 
+	if (iommu->iopf_enabled) {
+		dma->iopf_mapped_bitmap = kvzalloc(IOPF_MAPPED_BITMAP_BYTES(
+						size >> PAGE_SHIFT), GFP_KERNEL);
+		if (!dma->iopf_mapped_bitmap) {
+			ret = -ENOMEM;
+			kfree(dma);
+			goto out_unlock;
+		}
+	}
+
 	iommu->dma_avail--;
 	dma->iova = iova;
 	dma->vaddr = vaddr;
@@ -1811,8 +1834,11 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	/* Insert zero-sized and grow as we map chunks of it */
 	vfio_link_dma(iommu, dma);
 
-	/* Don't pin and map if container doesn't contain IOMMU capable domain*/
-	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
+	/*
+	 * Don't pin and map if container doesn't contain IOMMU capable domain,
+	 * or IOPF enabled for the container.
+	 */
+	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) || iommu->iopf_enabled)
 		dma->size = size;
 	else
 		ret = vfio_pin_map_dma(iommu, dma, size);

From patchwork Fri Apr  9 03:44:19 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Shenming Lu <lushenming@huawei.com>
X-Patchwork-Id: 12192947
Return-Path: <kvm-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-16.7 required=3.0 tests=BAYES_00,
	HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH,
	MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT
	autolearn=unavailable autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 4FC4AC43462
	for <kvm@archiver.kernel.org>; Fri,  9 Apr 2021 03:45:28 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 079FD610F7
	for <kvm@archiver.kernel.org>; Fri,  9 Apr 2021 03:45:27 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S233344AbhDIDpi (ORCPT <rfc822;kvm@archiver.kernel.org>);
        Thu, 8 Apr 2021 23:45:38 -0400
Received: from szxga06-in.huawei.com ([45.249.212.32]:16424 "EHLO
        szxga06-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S233300AbhDIDpF (ORCPT <rfc822;kvm@vger.kernel.org>);
        Thu, 8 Apr 2021 23:45:05 -0400
Received: from DGGEMS414-HUB.china.huawei.com (unknown [172.30.72.60])
        by szxga06-in.huawei.com (SkyGuard) with ESMTP id 4FGkVZ6HPszlWqS;
        Fri,  9 Apr 2021 11:43:02 +0800 (CST)
Received: from DESKTOP-7FEPK9S.china.huawei.com (10.174.184.135) by
 DGGEMS414-HUB.china.huawei.com (10.3.19.214) with Microsoft SMTP Server id
 14.3.498.0; Fri, 9 Apr 2021 11:44:43 +0800
From: Shenming Lu <lushenming@huawei.com>
To: Alex Williamson <alex.williamson@redhat.com>,
        Cornelia Huck <cohuck@redhat.com>,
        Will Deacon <will@kernel.org>,
        Robin Murphy <robin.murphy@arm.com>,
        Joerg Roedel <joro@8bytes.org>,
        Jean-Philippe Brucker <jean-philippe@linaro.org>,
        Eric Auger <eric.auger@redhat.com>, <kvm@vger.kernel.org>,
        <linux-kernel@vger.kernel.org>,
        <linux-arm-kernel@lists.infradead.org>,
        <iommu@lists.linux-foundation.org>, <linux-api@vger.kernel.org>
CC: Kevin Tian <kevin.tian@intel.com>,
        Lu Baolu <baolu.lu@linux.intel.com>, <yi.l.liu@intel.com>,
        Christoph Hellwig <hch@infradead.org>,
        Jonathan Cameron <Jonathan.Cameron@huawei.com>,
        Barry Song <song.bao.hua@hisilicon.com>,
        <wanghaibin.wang@huawei.com>, <yuzenghui@huawei.com>,
        <lushenming@huawei.com>
Subject: [RFC PATCH v3 7/8] vfio/type1: Add selective DMA faulting support
Date: Fri, 9 Apr 2021 11:44:19 +0800
Message-ID: <20210409034420.1799-8-lushenming@huawei.com>
X-Mailer: git-send-email 2.27.0.windows.1
In-Reply-To: <20210409034420.1799-1-lushenming@huawei.com>
References: <20210409034420.1799-1-lushenming@huawei.com>
MIME-Version: 1.0
X-Originating-IP: [10.174.184.135]
X-CFilter-Loop: Reflected
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

Some devices only allow selective DMA faulting. Similar to the selective
dirty page tracking, the vendor driver can call vfio_pin_pages() to
indicate the non-faultable scope, we add a new struct vfio_range to
record it, then when the IOPF handler receives any page request out
of the scope, we can directly return with an invalid response.

Suggested-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Shenming Lu <lushenming@huawei.com>
---
 drivers/vfio/vfio.c             |   4 +-
 drivers/vfio/vfio_iommu_type1.c | 357 +++++++++++++++++++++++++++++++-
 include/linux/vfio.h            |   1 +
 3 files changed, 358 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 38779e6fd80c..44c8dfabf7de 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -2013,7 +2013,8 @@ int vfio_unpin_pages(struct device *dev, unsigned long *user_pfn, int npage)
 	container = group->container;
 	driver = container->iommu_driver;
 	if (likely(driver && driver->ops->unpin_pages))
-		ret = driver->ops->unpin_pages(container->iommu_data, user_pfn,
+		ret = driver->ops->unpin_pages(container->iommu_data,
+					       group->iommu_group, user_pfn,
 					       npage);
 	else
 		ret = -ENOTTY;
@@ -2112,6 +2113,7 @@ int vfio_group_unpin_pages(struct vfio_group *group,
 	driver = container->iommu_driver;
 	if (likely(driver && driver->ops->unpin_pages))
 		ret = driver->ops->unpin_pages(container->iommu_data,
+					       group->iommu_group,
 					       user_iova_pfn, npage);
 	else
 		ret = -ENOTTY;
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index dcc93c3b258c..ba2b5a1cf6e9 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -150,10 +150,19 @@ struct vfio_regions {
 static struct rb_root iopf_group_list = RB_ROOT;
 static DEFINE_MUTEX(iopf_group_list_lock);
 
+struct vfio_range {
+	struct rb_node		node;
+	dma_addr_t		base_iova;
+	size_t			span;
+	unsigned int		ref_count;
+};
+
 struct vfio_iopf_group {
 	struct rb_node		node;
 	struct iommu_group	*iommu_group;
 	struct vfio_iommu	*iommu;
+	struct rb_root		pinned_range_list;
+	bool			selective_faulting;
 };
 
 #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
@@ -496,6 +505,255 @@ static void vfio_unlink_iopf_group(struct vfio_iopf_group *old)
 	mutex_unlock(&iopf_group_list_lock);
 }
 
+/*
+ * Helper functions for range list, handle one page at a time.
+ */
+static struct vfio_range *vfio_find_range(struct rb_root *range_list,
+					  dma_addr_t iova)
+{
+	struct rb_node *node = range_list->rb_node;
+	struct vfio_range *range;
+
+	while (node) {
+		range = rb_entry(node, struct vfio_range, node);
+
+		if (iova + PAGE_SIZE <= range->base_iova)
+			node = node->rb_left;
+		else if (iova >= range->base_iova + range->span)
+			node = node->rb_right;
+		else
+			return range;
+	}
+
+	return NULL;
+}
+
+/* Do the possible merge adjacent to the input range. */
+static void vfio_merge_range_list(struct rb_root *range_list,
+				  struct vfio_range *range)
+{
+	struct rb_node *node_prev = rb_prev(&range->node);
+	struct rb_node *node_next = rb_next(&range->node);
+
+	if (node_next) {
+		struct vfio_range *range_next = rb_entry(node_next,
+							 struct vfio_range,
+							 node);
+
+		if (range_next->base_iova == (range->base_iova + range->span) &&
+		    range_next->ref_count == range->ref_count) {
+			rb_erase(node_next, range_list);
+			range->span += range_next->span;
+			kfree(range_next);
+		}
+	}
+
+	if (node_prev) {
+		struct vfio_range *range_prev = rb_entry(node_prev,
+							 struct vfio_range,
+							 node);
+
+		if (range->base_iova == (range_prev->base_iova + range_prev->span)
+		    && range->ref_count == range_prev->ref_count) {
+			rb_erase(&range->node, range_list);
+			range_prev->span += range->span;
+			kfree(range);
+		}
+	}
+}
+
+static void vfio_link_range(struct rb_root *range_list, struct vfio_range *new)
+{
+	struct rb_node **link, *parent = NULL;
+	struct vfio_range *range;
+
+	link = &range_list->rb_node;
+
+	while (*link) {
+		parent = *link;
+		range = rb_entry(parent, struct vfio_range, node);
+
+		if (new->base_iova < range->base_iova)
+			link = &(*link)->rb_left;
+		else
+			link = &(*link)->rb_right;
+	}
+
+	rb_link_node(&new->node, parent, link);
+	rb_insert_color(&new->node, range_list);
+
+	vfio_merge_range_list(range_list, new);
+}
+
+static int vfio_add_to_range_list(struct rb_root *range_list,
+				  dma_addr_t iova)
+{
+	struct vfio_range *range = vfio_find_range(range_list, iova);
+
+	if (range) {
+		struct vfio_range *new_prev, *new_next;
+		size_t span_prev, span_next;
+
+		/* May split the found range into three parts. */
+		span_prev = iova - range->base_iova;
+		span_next = range->span - span_prev - PAGE_SIZE;
+
+		if (span_prev) {
+			new_prev = kzalloc(sizeof(*new_prev), GFP_KERNEL);
+			if (!new_prev)
+				return -ENOMEM;
+
+			new_prev->base_iova = range->base_iova;
+			new_prev->span = span_prev;
+			new_prev->ref_count = range->ref_count;
+		}
+
+		if (span_next) {
+			new_next = kzalloc(sizeof(*new_next), GFP_KERNEL);
+			if (!new_next) {
+				if (span_prev)
+					kfree(new_prev);
+				return -ENOMEM;
+			}
+
+			new_next->base_iova = iova + PAGE_SIZE;
+			new_next->span = span_next;
+			new_next->ref_count = range->ref_count;
+		}
+
+		range->base_iova = iova;
+		range->span = PAGE_SIZE;
+		range->ref_count++;
+		vfio_merge_range_list(range_list, range);
+
+		if (span_prev)
+			vfio_link_range(range_list, new_prev);
+
+		if (span_next)
+			vfio_link_range(range_list, new_next);
+	} else {
+		struct vfio_range *new;
+
+		new = kzalloc(sizeof(*new), GFP_KERNEL);
+		if (!new)
+			return -ENOMEM;
+
+		new->base_iova = iova;
+		new->span = PAGE_SIZE;
+		new->ref_count = 1;
+
+		vfio_link_range(range_list, new);
+	}
+
+	return 0;
+}
+
+static int vfio_remove_from_range_list(struct rb_root *range_list,
+				       dma_addr_t iova)
+{
+	struct vfio_range *range = vfio_find_range(range_list, iova);
+	struct vfio_range *news[3];
+	size_t span_prev, span_in, span_next;
+	int i, num_news;
+
+	if (!range)
+		return 0;
+
+	span_prev = iova - range->base_iova;
+	span_in = range->ref_count > 1 ? PAGE_SIZE : 0;
+	span_next = range->span - span_prev - PAGE_SIZE;
+
+	num_news = (int)!!span_prev + (int)!!span_in + (int)!!span_next;
+	if (!num_news) {
+		rb_erase(&range->node, range_list);
+		kfree(range);
+		return 0;
+	}
+
+	for (i = 0; i < num_news - 1; i++) {
+		news[i] = kzalloc(sizeof(struct vfio_range), GFP_KERNEL);
+		if (!news[i]) {
+			if (i > 0)
+				kfree(news[0]);
+			return -ENOMEM;
+		}
+	}
+	/* Reuse the found range. */
+	news[i] = range;
+
+	i = 0;
+	if (span_prev) {
+		news[i]->base_iova = range->base_iova;
+		news[i]->span = span_prev;
+		news[i++]->ref_count = range->ref_count;
+	}
+	if (span_in) {
+		news[i]->base_iova = iova;
+		news[i]->span = span_in;
+		news[i++]->ref_count = range->ref_count - 1;
+	}
+	if (span_next) {
+		news[i]->base_iova = iova + PAGE_SIZE;
+		news[i]->span = span_next;
+		news[i]->ref_count = range->ref_count;
+	}
+
+	vfio_merge_range_list(range_list, range);
+
+	for (i = 0; i < num_news - 1; i++)
+		vfio_link_range(range_list, news[i]);
+
+	return 0;
+}
+
+static void vfio_range_list_free(struct rb_root *range_list)
+{
+	struct rb_node *n;
+
+	while ((n = rb_first(range_list))) {
+		struct vfio_range *range = rb_entry(n, struct vfio_range, node);
+
+		rb_erase(&range->node, range_list);
+		kfree(range);
+	}
+}
+
+static int vfio_range_list_get_copy(struct vfio_iopf_group *iopf_group,
+				    struct rb_root *range_list_copy)
+{
+	struct rb_root *range_list = &iopf_group->pinned_range_list;
+	struct rb_node *n, **link = &range_list_copy->rb_node, *parent = NULL;
+	int ret;
+
+	for (n = rb_first(range_list); n; n = rb_next(n)) {
+		struct vfio_range *range, *range_copy;
+
+		range = rb_entry(n, struct vfio_range, node);
+
+		range_copy = kzalloc(sizeof(*range_copy), GFP_KERNEL);
+		if (!range_copy) {
+			ret = -ENOMEM;
+			goto out_free;
+		}
+
+		range_copy->base_iova = range->base_iova;
+		range_copy->span = range->span;
+		range_copy->ref_count = range->ref_count;
+
+		rb_link_node(&range_copy->node, parent, link);
+		rb_insert_color(&range_copy->node, range_list_copy);
+
+		parent = *link;
+		link = &(*link)->rb_right;
+	}
+
+	return 0;
+
+out_free:
+	vfio_range_list_free(range_list_copy);
+	return ret;
+}
+
 static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
 {
 	struct mm_struct *mm;
@@ -910,6 +1168,9 @@ static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
 	return unlocked;
 }
 
+static struct vfio_group *find_iommu_group(struct vfio_domain *domain,
+					   struct iommu_group *iommu_group);
+
 static int vfio_iommu_type1_pin_pages(void *iommu_data,
 				      struct iommu_group *iommu_group,
 				      unsigned long *user_pfn,
@@ -923,6 +1184,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 	struct vfio_dma *dma;
 	bool do_accounting;
 	dma_addr_t iova;
+	struct vfio_iopf_group *iopf_group = NULL;
+	struct rb_root range_list_copy = RB_ROOT;
 
 	if (!iommu || !user_pfn || !phys_pfn)
 		return -EINVAL;
@@ -955,6 +1218,31 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 		goto pin_done;
 	}
 
+	/*
+	 * Some devices only allow selective DMA faulting. Similar to the
+	 * selective dirty tracking, the vendor driver can call vfio_pin_pages()
+	 * to indicate the non-faultable scope, and we record it to filter
+	 * out the invalid page requests in the IOPF handler.
+	 */
+	if (iommu->iopf_enabled) {
+		iopf_group = vfio_find_iopf_group(iommu_group);
+		if (iopf_group) {
+			/*
+			 * We don't want to work on the original range
+			 * list as the list gets modified and in case
+			 * of failure we have to retain the original
+			 * list. Get a copy here.
+			 */
+			ret = vfio_range_list_get_copy(iopf_group,
+						       &range_list_copy);
+			if (ret)
+				goto pin_done;
+		} else {
+			WARN_ON(!find_iommu_group(iommu->external_domain,
+						  iommu_group));
+		}
+	}
+
 	/*
 	 * If iommu capable domain exist in the container then all pages are
 	 * already pinned and accounted. Accouting should be done if there is no
@@ -981,6 +1269,15 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 		vpfn = vfio_iova_get_vfio_pfn(dma, iova);
 		if (vpfn) {
 			phys_pfn[i] = vpfn->pfn;
+			if (iopf_group) {
+				ret = vfio_add_to_range_list(&range_list_copy,
+							     iova);
+				if (ret) {
+					vfio_unpin_page_external(dma, iova,
+								 do_accounting);
+					goto pin_unwind;
+				}
+			}
 			continue;
 		}
 
@@ -997,6 +1294,15 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 			goto pin_unwind;
 		}
 
+		if (iopf_group) {
+			ret = vfio_add_to_range_list(&range_list_copy, iova);
+			if (ret) {
+				vfio_unpin_page_external(dma, iova,
+							 do_accounting);
+				goto pin_unwind;
+			}
+		}
+
 		if (iommu->dirty_page_tracking) {
 			unsigned long pgshift = __ffs(iommu->pgsize_bitmap);
 
@@ -1010,6 +1316,13 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 	}
 	ret = i;
 
+	if (iopf_group) {
+		vfio_range_list_free(&iopf_group->pinned_range_list);
+		iopf_group->pinned_range_list.rb_node = range_list_copy.rb_node;
+		if (!iopf_group->selective_faulting)
+			iopf_group->selective_faulting = true;
+	}
+
 	group = vfio_iommu_find_iommu_group(iommu, iommu_group);
 	if (!group->pinned_page_dirty_scope) {
 		group->pinned_page_dirty_scope = true;
@@ -1019,6 +1332,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 	goto pin_done;
 
 pin_unwind:
+	if (iopf_group)
+		vfio_range_list_free(&range_list_copy);
 	phys_pfn[i] = 0;
 	for (j = 0; j < i; j++) {
 		dma_addr_t iova;
@@ -1034,12 +1349,14 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 }
 
 static int vfio_iommu_type1_unpin_pages(void *iommu_data,
+					struct iommu_group *iommu_group,
 					unsigned long *user_pfn,
 					int npage)
 {
 	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_iopf_group *iopf_group = NULL;
 	bool do_accounting;
-	int i;
+	int i, ret;
 
 	if (!iommu || !user_pfn)
 		return -EINVAL;
@@ -1050,6 +1367,13 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
 
 	mutex_lock(&iommu->lock);
 
+	if (iommu->iopf_enabled) {
+		iopf_group = vfio_find_iopf_group(iommu_group);
+		if (!iopf_group)
+			WARN_ON(!find_iommu_group(iommu->external_domain,
+						  iommu_group));
+	}
+
 	do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) ||
 			iommu->iopf_enabled;
 	for (i = 0; i < npage; i++) {
@@ -1058,14 +1382,24 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
 
 		iova = user_pfn[i] << PAGE_SHIFT;
 		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
-		if (!dma)
+		if (!dma) {
+			ret = -EINVAL;
 			goto unpin_exit;
+		}
+
+		if (iopf_group) {
+			ret = vfio_remove_from_range_list(
+					&iopf_group->pinned_range_list, iova);
+			if (ret)
+				goto unpin_exit;
+		}
+
 		vfio_unpin_page_external(dma, iova, do_accounting);
 	}
 
 unpin_exit:
 	mutex_unlock(&iommu->lock);
-	return i > npage ? npage : (i > 0 ? i : -EINVAL);
+	return i > npage ? npage : (i > 0 ? i : ret);
 }
 
 static long vfio_sync_unpin(struct vfio_dma *dma, struct vfio_domain *domain,
@@ -2591,6 +2925,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 
 		iopf_group->iommu_group = iommu_group;
 		iopf_group->iommu = iommu;
+		iopf_group->pinned_range_list = RB_ROOT;
 
 		vfio_link_iopf_group(iopf_group);
 	}
@@ -2886,6 +3221,8 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
 
 			iopf_group = vfio_find_iopf_group(iommu_group);
 			if (!WARN_ON(!iopf_group)) {
+				WARN_ON(!RB_EMPTY_ROOT(
+						&iopf_group->pinned_range_list));
 				vfio_unlink_iopf_group(iopf_group);
 				kfree(iopf_group);
 			}
@@ -3482,6 +3819,7 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
 	struct vfio_iommu *iommu;
 	struct vfio_dma *dma;
 	struct vfio_batch batch;
+	struct vfio_range *range;
 	dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE);
 	int access_flags = 0;
 	size_t premap_len, map_len, mapped_len = 0;
@@ -3506,6 +3844,12 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
 
 	mutex_lock(&iommu->lock);
 
+	if (iopf_group->selective_faulting) {
+		range = vfio_find_range(&iopf_group->pinned_range_list, iova);
+		if (!range)
+			goto out_invalid;
+	}
+
 	ret = vfio_find_dma_valid(iommu, iova, PAGE_SIZE, &dma);
 	if (ret < 0)
 		goto out_invalid;
@@ -3523,6 +3867,12 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
 
 	premap_len = IOPF_PREMAP_LEN << PAGE_SHIFT;
 	npages = dma->size >> PAGE_SHIFT;
+	if (iopf_group->selective_faulting) {
+		dma_addr_t range_end = range->base_iova + range->span;
+
+		if (range_end < dma->iova + dma->size)
+			npages = (range_end - dma->iova) >> PAGE_SHIFT;
+	}
 	map_len = PAGE_SIZE;
 	for (i = bit_offset + 1; i < npages; i++) {
 		if (map_len >= premap_len || IOPF_MAPPED_BITMAP_GET(dma, i))
@@ -3647,6 +3997,7 @@ static int vfio_iommu_type1_enable_iopf(struct vfio_iommu *iommu)
 
 			iopf_group->iommu_group = g->iommu_group;
 			iopf_group->iommu = iommu;
+			iopf_group->pinned_range_list = RB_ROOT;
 
 			vfio_link_iopf_group(iopf_group);
 		}
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index b7e18bde5aa8..a7b426d579df 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -87,6 +87,7 @@ struct vfio_iommu_driver_ops {
 				     int npage, int prot,
 				     unsigned long *phys_pfn);
 	int		(*unpin_pages)(void *iommu_data,
+				       struct iommu_group *group,
 				       unsigned long *user_pfn, int npage);
 	int		(*register_notifier)(void *iommu_data,
 					     unsigned long *events,

From patchwork Fri Apr  9 03:44:20 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Shenming Lu <lushenming@huawei.com>
X-Patchwork-Id: 12192943
Return-Path: <kvm-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-16.7 required=3.0 tests=BAYES_00,
	HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH,
	MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT
	autolearn=unavailable autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id D3F05C43460
	for <kvm@archiver.kernel.org>; Fri,  9 Apr 2021 03:45:25 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 7E86F61055
	for <kvm@archiver.kernel.org>; Fri,  9 Apr 2021 03:45:25 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S233153AbhDIDpb (ORCPT <rfc822;kvm@archiver.kernel.org>);
        Thu, 8 Apr 2021 23:45:31 -0400
Received: from szxga06-in.huawei.com ([45.249.212.32]:16422 "EHLO
        szxga06-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S233299AbhDIDpF (ORCPT <rfc822;kvm@vger.kernel.org>);
        Thu, 8 Apr 2021 23:45:05 -0400
Received: from DGGEMS414-HUB.china.huawei.com (unknown [172.30.72.60])
        by szxga06-in.huawei.com (SkyGuard) with ESMTP id 4FGkVZ5q3qzlWq4;
        Fri,  9 Apr 2021 11:43:02 +0800 (CST)
Received: from DESKTOP-7FEPK9S.china.huawei.com (10.174.184.135) by
 DGGEMS414-HUB.china.huawei.com (10.3.19.214) with Microsoft SMTP Server id
 14.3.498.0; Fri, 9 Apr 2021 11:44:45 +0800
From: Shenming Lu <lushenming@huawei.com>
To: Alex Williamson <alex.williamson@redhat.com>,
        Cornelia Huck <cohuck@redhat.com>,
        Will Deacon <will@kernel.org>,
        Robin Murphy <robin.murphy@arm.com>,
        Joerg Roedel <joro@8bytes.org>,
        Jean-Philippe Brucker <jean-philippe@linaro.org>,
        Eric Auger <eric.auger@redhat.com>, <kvm@vger.kernel.org>,
        <linux-kernel@vger.kernel.org>,
        <linux-arm-kernel@lists.infradead.org>,
        <iommu@lists.linux-foundation.org>, <linux-api@vger.kernel.org>
CC: Kevin Tian <kevin.tian@intel.com>,
        Lu Baolu <baolu.lu@linux.intel.com>, <yi.l.liu@intel.com>,
        Christoph Hellwig <hch@infradead.org>,
        Jonathan Cameron <Jonathan.Cameron@huawei.com>,
        Barry Song <song.bao.hua@hisilicon.com>,
        <wanghaibin.wang@huawei.com>, <yuzenghui@huawei.com>,
        <lushenming@huawei.com>
Subject: [RFC PATCH v3 8/8] vfio: Add nested IOPF support
Date: Fri, 9 Apr 2021 11:44:20 +0800
Message-ID: <20210409034420.1799-9-lushenming@huawei.com>
X-Mailer: git-send-email 2.27.0.windows.1
In-Reply-To: <20210409034420.1799-1-lushenming@huawei.com>
References: <20210409034420.1799-1-lushenming@huawei.com>
MIME-Version: 1.0
X-Originating-IP: [10.174.184.135]
X-CFilter-Loop: Reflected
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

To set up nested mode, drivers such as vfio_pci need to register a
handler to receive stage/level 1 faults from the IOMMU, but since
currently each device can only have one iommu dev fault handler,
and if stage 2 IOPF is already enabled (VFIO_IOMMU_ENABLE_IOPF),
we choose to update the registered handler (a consolidated one) via
flags (set FAULT_REPORT_NESTED_L1), and further deliver the received
stage 1 faults in the handler to the guest through a newly added
vfio_device_ops callback.

Signed-off-by: Shenming Lu <lushenming@huawei.com>
---
 drivers/vfio/vfio.c             | 81 +++++++++++++++++++++++++++++++++
 drivers/vfio/vfio_iommu_type1.c | 49 +++++++++++++++++++-
 include/linux/vfio.h            | 12 +++++
 3 files changed, 141 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 44c8dfabf7de..4245f15914bf 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -2356,6 +2356,87 @@ struct iommu_domain *vfio_group_iommu_domain(struct vfio_group *group)
 }
 EXPORT_SYMBOL_GPL(vfio_group_iommu_domain);
 
+/*
+ * Register/Update the VFIO IOPF handler to receive
+ * nested stage/level 1 faults.
+ */
+int vfio_iommu_dev_fault_handler_register_nested(struct device *dev)
+{
+	struct vfio_container *container;
+	struct vfio_group *group;
+	struct vfio_iommu_driver *driver;
+	int ret;
+
+	if (!dev)
+		return -EINVAL;
+
+	group = vfio_group_get_from_dev(dev);
+	if (!group)
+		return -ENODEV;
+
+	ret = vfio_group_add_container_user(group);
+	if (ret)
+		goto out;
+
+	container = group->container;
+	driver = container->iommu_driver;
+	if (likely(driver && driver->ops->register_handler))
+		ret = driver->ops->register_handler(container->iommu_data, dev);
+	else
+		ret = -ENOTTY;
+
+	vfio_group_try_dissolve_container(group);
+
+out:
+	vfio_group_put(group);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_register_nested);
+
+int vfio_iommu_dev_fault_handler_unregister_nested(struct device *dev)
+{
+	struct vfio_container *container;
+	struct vfio_group *group;
+	struct vfio_iommu_driver *driver;
+	int ret;
+
+	if (!dev)
+		return -EINVAL;
+
+	group = vfio_group_get_from_dev(dev);
+	if (!group)
+		return -ENODEV;
+
+	ret = vfio_group_add_container_user(group);
+	if (ret)
+		goto out;
+
+	container = group->container;
+	driver = container->iommu_driver;
+	if (likely(driver && driver->ops->unregister_handler))
+		ret = driver->ops->unregister_handler(container->iommu_data, dev);
+	else
+		ret = -ENOTTY;
+
+	vfio_group_try_dissolve_container(group);
+
+out:
+	vfio_group_put(group);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_unregister_nested);
+
+int vfio_transfer_iommu_fault(struct device *dev, struct iommu_fault *fault)
+{
+	struct vfio_device *device = dev_get_drvdata(dev);
+
+	if (unlikely(!device->ops->transfer))
+		return -EOPNOTSUPP;
+
+	return device->ops->transfer(device->device_data, fault);
+}
+EXPORT_SYMBOL_GPL(vfio_transfer_iommu_fault);
+
 /**
  * Module/class support
  */
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index ba2b5a1cf6e9..9d1adeddb303 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -3821,13 +3821,32 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
 	struct vfio_batch batch;
 	struct vfio_range *range;
 	dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE);
-	int access_flags = 0;
+	int access_flags = 0, nested;
 	size_t premap_len, map_len, mapped_len = 0;
 	unsigned long bit_offset, vaddr, pfn, i, npages;
 	int ret;
 	enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID;
 	struct iommu_page_response resp = {0};
 
+	if (vfio_dev_domian_nested(dev, &nested))
+		return -ENODEV;
+
+	/*
+	 * When configured in nested mode, further deliver the
+	 * stage/level 1 faults to the guest.
+	 */
+	if (nested) {
+		bool l2;
+
+		if (fault->type == IOMMU_FAULT_PAGE_REQ)
+			l2 = fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_L2;
+		if (fault->type == IOMMU_FAULT_DMA_UNRECOV)
+			l2 = fault->event.flags & IOMMU_FAULT_UNRECOV_L2;
+
+		if (!l2)
+			return vfio_transfer_iommu_fault(dev, fault);
+	}
+
 	if (fault->type != IOMMU_FAULT_PAGE_REQ)
 		return -EOPNOTSUPP;
 
@@ -4201,6 +4220,32 @@ static void vfio_iommu_type1_notify(void *iommu_data,
 	wake_up_all(&iommu->vaddr_wait);
 }
 
+static int vfio_iommu_type1_register_handler(void *iommu_data,
+					     struct device *dev)
+{
+	struct vfio_iommu *iommu = iommu_data;
+
+	if (iommu->iopf_enabled)
+		return iommu_update_device_fault_handler(dev, ~0,
+						FAULT_REPORT_NESTED_L1);
+	else
+		return iommu_register_device_fault_handler(dev,
+						vfio_iommu_type1_dma_map_iopf,
+						FAULT_REPORT_NESTED_L1, dev);
+}
+
+static int vfio_iommu_type1_unregister_handler(void *iommu_data,
+					       struct device *dev)
+{
+	struct vfio_iommu *iommu = iommu_data;
+
+	if (iommu->iopf_enabled)
+		return iommu_update_device_fault_handler(dev,
+						~FAULT_REPORT_NESTED_L1, 0);
+	else
+		return iommu_unregister_device_fault_handler(dev);
+}
+
 static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
 	.name			= "vfio-iommu-type1",
 	.owner			= THIS_MODULE,
@@ -4216,6 +4261,8 @@ static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
 	.dma_rw			= vfio_iommu_type1_dma_rw,
 	.group_iommu_domain	= vfio_iommu_type1_group_iommu_domain,
 	.notify			= vfio_iommu_type1_notify,
+	.register_handler	= vfio_iommu_type1_register_handler,
+	.unregister_handler	= vfio_iommu_type1_unregister_handler,
 };
 
 static int __init vfio_iommu_type1_init(void)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index a7b426d579df..4621d8f0395d 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -29,6 +29,8 @@
  * @match: Optional device name match callback (return: 0 for no-match, >0 for
  *         match, -errno for abort (ex. match with insufficient or incorrect
  *         additional args)
+ * @transfer: Optional. Transfer the received stage/level 1 faults to the guest
+ *            for nested mode.
  */
 struct vfio_device_ops {
 	char	*name;
@@ -43,6 +45,7 @@ struct vfio_device_ops {
 	int	(*mmap)(void *device_data, struct vm_area_struct *vma);
 	void	(*request)(void *device_data, unsigned int count);
 	int	(*match)(void *device_data, char *buf);
+	int	(*transfer)(void *device_data, struct iommu_fault *fault);
 };
 
 extern struct iommu_group *vfio_iommu_group_get(struct device *dev);
@@ -100,6 +103,10 @@ struct vfio_iommu_driver_ops {
 						   struct iommu_group *group);
 	void		(*notify)(void *iommu_data,
 				  enum vfio_iommu_notify_type event);
+	int		(*register_handler)(void *iommu_data,
+					    struct device *dev);
+	int		(*unregister_handler)(void *iommu_data,
+					      struct device *dev);
 };
 
 extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
@@ -161,6 +168,11 @@ extern int vfio_unregister_notifier(struct device *dev,
 struct kvm;
 extern void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm);
 
+extern int vfio_iommu_dev_fault_handler_register_nested(struct device *dev);
+extern int vfio_iommu_dev_fault_handler_unregister_nested(struct device *dev);
+extern int vfio_transfer_iommu_fault(struct device *dev,
+				     struct iommu_fault *fault);
+
 /*
  * Sub-module helpers
  */