From patchwork Wed Jun 8 12:31:25 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yi Liu X-Patchwork-Id: 12873446 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 90BD2C43334 for ; Wed, 8 Jun 2022 12:36:29 +0000 (UTC) Received: from localhost ([::1]:57898 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nyuvE-0003f9-LO for qemu-devel@archiver.kernel.org; Wed, 08 Jun 2022 08:36:28 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:57816) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nyuqh-00008D-8U for qemu-devel@nongnu.org; Wed, 08 Jun 2022 08:31:47 -0400 Received: from mga05.intel.com ([192.55.52.43]:56952) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nyuqe-0005kl-Pz for qemu-devel@nongnu.org; Wed, 08 Jun 2022 08:31:46 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1654691504; x=1686227504; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=tSB2DyD8BeADWS4sEGxP5x5vma8/I3aEjte+rXvo4Lw=; b=cPA53Oeq/fEzpKdZsD0z47IW37iKZsi6tNzWz1IitoauWn+icdez/uP5 Bhk1eXe7Oi4wjVxwi+ybFNWhYYFm+86yTmEznZFvrWdZLkJtD40hUepPC /Vb3AjN3X/6Wu2GgeypZ0ch7bfJg53qVAjHjJ9YbpR7VVhTAevKQkt67+ Z2FZnJSx4PoI5Fg1Wykc3yJBY9CFhQf44K3FU7XqQBy98tRetP1G6gfzl qYDB5XO1ANTS+8UwrZpEbuXxw/ok5t6i8rkIOb/I0K8CxWjXoM5ppNQ0B cBABbu6SCl5ILFVKrh34VA8OgT9XXMgJ52EJ6Ac9Gr4mGkslGi/h8xFB3 Q==; X-IronPort-AV: E=McAfee;i="6400,9594,10371"; a="363210074" X-IronPort-AV: E=Sophos;i="5.91,286,1647327600"; d="scan'208";a="363210074" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Jun 2022 05:31:40 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.91,286,1647327600"; d="scan'208";a="670529692" Received: from 984fee00a4c6.jf.intel.com ([10.165.58.231]) by FMSMGA003.fm.intel.com with ESMTP; 08 Jun 2022 05:31:40 -0700 From: Yi Liu To: alex.williamson@redhat.com, cohuck@redhat.com, qemu-devel@nongnu.org Cc: david@gibson.dropbear.id.au, thuth@redhat.com, farman@linux.ibm.com, mjrosato@linux.ibm.com, akrowiak@linux.ibm.com, pasic@linux.ibm.com, jjherne@linux.ibm.com, jasowang@redhat.com, kvm@vger.kernel.org, jgg@nvidia.com, nicolinc@nvidia.com, eric.auger@redhat.com, eric.auger.pro@gmail.com, kevin.tian@intel.com, yi.l.liu@intel.com, chao.p.peng@intel.com, yi.y.sun@intel.com, peterx@redhat.com, shameerali.kolothum.thodi@huawei.com, zhangfei.gao@linaro.org, berrange@redhat.com Subject: [RFC v2 01/15] scripts/update-linux-headers: Add iommufd.h Date: Wed, 8 Jun 2022 05:31:25 -0700 Message-Id: <20220608123139.19356-2-yi.l.liu@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220608123139.19356-1-yi.l.liu@intel.com> References: <20220608123139.19356-1-yi.l.liu@intel.com> MIME-Version: 1.0 Received-SPF: pass client-ip=192.55.52.43; envelope-from=yi.l.liu@intel.com; helo=mga05.intel.com X-Spam_score_int: -44 X-Spam_score: -4.5 X-Spam_bar: ---- X-Spam_report: (-4.5 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" From: Eric Auger Update the script to import iommufd.h Signed-off-by: Eric Auger Signed-off-by: Yi Liu --- scripts/update-linux-headers.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/scripts/update-linux-headers.sh b/scripts/update-linux-headers.sh index 839a5ec614..a89b83e6d6 100755 --- a/scripts/update-linux-headers.sh +++ b/scripts/update-linux-headers.sh @@ -160,7 +160,7 @@ done rm -rf "$output/linux-headers/linux" mkdir -p "$output/linux-headers/linux" -for header in kvm.h vfio.h vfio_ccw.h vfio_zdev.h vhost.h \ +for header in kvm.h vfio.h iommufd.h vfio_ccw.h vfio_zdev.h vhost.h \ psci.h psp-sev.h userfaultfd.h mman.h; do cp "$tmpdir/include/linux/$header" "$output/linux-headers/linux" done From patchwork Wed Jun 8 12:31:26 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yi Liu X-Patchwork-Id: 12873444 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id A7317C433EF for ; Wed, 8 Jun 2022 12:36:15 +0000 (UTC) Received: from localhost ([::1]:57048 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nyuv0-00032h-6X for qemu-devel@archiver.kernel.org; Wed, 08 Jun 2022 08:36:14 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:57850) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nyuqi-0000Aj-Kd for qemu-devel@nongnu.org; Wed, 08 Jun 2022 08:31:48 -0400 Received: from mga05.intel.com ([192.55.52.43]:56958) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nyuqf-0005ks-2W for qemu-devel@nongnu.org; Wed, 08 Jun 2022 08:31:48 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1654691505; x=1686227505; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=5qWRnjvGyNwaSgv5oXZe2n49XMKJfJ+r4yqF2S9an0k=; b=YLRZxP/AuyQe7C5WPIeBsMNV87WCGNeVbl+scAwe2KBxIdjiPhPrk7/T vc+z6NSxiwBZXDB8N1TXcE0CcVDlNy/YpqZqGz0iGEuFSvsXvEe7dczy2 nKcD1B6oIJ5RHH25XXLUMqded4lYLzJyjTc5HOujZgGD3AwPQm8afY32N 0A++eu42VC95W8VLID7epr09JYJiYUNTNhExPRc0SX3xGp0xZH1ZpwjDK irkrdFp7ntZqxinHE0iYTnPLe1mFnu6oe27pvhAXSxOtGkCILS/XYPwzI m38HzAL7RyNz2Uy/6yfaeOvSfR9U25497fcZGqV/nKmw/y2MQXbshw9h5 g==; X-IronPort-AV: E=McAfee;i="6400,9594,10371"; a="363210085" X-IronPort-AV: E=Sophos;i="5.91,286,1647327600"; d="scan'208";a="363210085" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Jun 2022 05:31:41 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.91,286,1647327600"; d="scan'208";a="670529709" Received: from 984fee00a4c6.jf.intel.com ([10.165.58.231]) by FMSMGA003.fm.intel.com with ESMTP; 08 Jun 2022 05:31:40 -0700 From: Yi Liu To: alex.williamson@redhat.com, cohuck@redhat.com, qemu-devel@nongnu.org Cc: david@gibson.dropbear.id.au, thuth@redhat.com, farman@linux.ibm.com, mjrosato@linux.ibm.com, akrowiak@linux.ibm.com, pasic@linux.ibm.com, jjherne@linux.ibm.com, jasowang@redhat.com, kvm@vger.kernel.org, jgg@nvidia.com, nicolinc@nvidia.com, eric.auger@redhat.com, eric.auger.pro@gmail.com, kevin.tian@intel.com, yi.l.liu@intel.com, chao.p.peng@intel.com, yi.y.sun@intel.com, peterx@redhat.com, shameerali.kolothum.thodi@huawei.com, zhangfei.gao@linaro.org, berrange@redhat.com Subject: [RFC v2 02/15] linux-headers: Import latest vfio.h and iommufd.h Date: Wed, 8 Jun 2022 05:31:26 -0700 Message-Id: <20220608123139.19356-3-yi.l.liu@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220608123139.19356-1-yi.l.liu@intel.com> References: <20220608123139.19356-1-yi.l.liu@intel.com> MIME-Version: 1.0 Received-SPF: pass client-ip=192.55.52.43; envelope-from=yi.l.liu@intel.com; helo=mga05.intel.com X-Spam_score_int: -44 X-Spam_score: -4.5 X-Spam_bar: ---- X-Spam_report: (-4.5 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" From: Eric Auger Imported from https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6 Signed-off-by: Eric Auger Signed-off-by: Yi Liu --- linux-headers/linux/iommufd.h | 223 ++++++++++++++++++++++++++++++++++ linux-headers/linux/vfio.h | 84 +++++++++++++ 2 files changed, 307 insertions(+) create mode 100644 linux-headers/linux/iommufd.h diff --git a/linux-headers/linux/iommufd.h b/linux-headers/linux/iommufd.h new file mode 100644 index 0000000000..6c3cd9e259 --- /dev/null +++ b/linux-headers/linux/iommufd.h @@ -0,0 +1,223 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. + */ +#ifndef _IOMMUFD_H +#define _IOMMUFD_H + +#include +#include + +#define IOMMUFD_TYPE (';') + +/** + * DOC: General ioctl format + * + * The ioctl mechanims follows a general format to allow for extensibility. Each + * ioctl is passed in a structure pointer as the argument providing the size of + * the structure in the first u32. The kernel checks that any structure space + * beyond what it understands is 0. This allows userspace to use the backward + * compatible portion while consistently using the newer, larger, structures. + * + * ioctls use a standard meaning for common errnos: + * + * - ENOTTY: The IOCTL number itself is not supported at all + * - E2BIG: The IOCTL number is supported, but the provided structure has + * non-zero in a part the kernel does not understand. + * - EOPNOTSUPP: The IOCTL number is supported, and the structure is + * understood, however a known field has a value the kernel does not + * understand or support. + * - EINVAL: Everything about the IOCTL was understood, but a field is not + * correct. + * - ENOENT: An ID or IOVA provided does not exist. + * - ENOMEM: Out of memory. + * - EOVERFLOW: Mathematics oveflowed. + * + * As well as additional errnos. within specific ioctls. + */ +enum { + IOMMUFD_CMD_BASE = 0x80, + IOMMUFD_CMD_DESTROY = IOMMUFD_CMD_BASE, + IOMMUFD_CMD_IOAS_ALLOC, + IOMMUFD_CMD_IOAS_IOVA_RANGES, + IOMMUFD_CMD_IOAS_MAP, + IOMMUFD_CMD_IOAS_COPY, + IOMMUFD_CMD_IOAS_UNMAP, + IOMMUFD_CMD_VFIO_IOAS, +}; + +/** + * struct iommu_destroy - ioctl(IOMMU_DESTROY) + * @size: sizeof(struct iommu_destroy) + * @id: iommufd object ID to destroy. Can by any destroyable object type. + * + * Destroy any object held within iommufd. + */ +struct iommu_destroy { + __u32 size; + __u32 id; +}; +#define IOMMU_DESTROY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_DESTROY) + +/** + * struct iommu_ioas_alloc - ioctl(IOMMU_IOAS_ALLOC) + * @size: sizeof(struct iommu_ioas_alloc) + * @flags: Must be 0 + * @out_ioas_id: Output IOAS ID for the allocated object + * + * Allocate an IO Address Space (IOAS) which holds an IO Virtual Address (IOVA) + * to memory mapping. + */ +struct iommu_ioas_alloc { + __u32 size; + __u32 flags; + __u32 out_ioas_id; +}; +#define IOMMU_IOAS_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_ALLOC) + +/** + * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES) + * @size: sizeof(struct iommu_ioas_iova_ranges) + * @ioas_id: IOAS ID to read ranges from + * @out_num_iovas: Output total number of ranges in the IOAS + * @__reserved: Must be 0 + * @out_valid_iovas: Array of valid IOVA ranges. The array length is the smaller + * of out_num_iovas or the length implied by size. + * @out_valid_iovas.start: First IOVA in the allowed range + * @out_valid_iovas.last: Inclusive last IOVA in the allowed range + * + * Query an IOAS for ranges of allowed IOVAs. Operation outside these ranges is + * not allowed. out_num_iovas will be set to the total number of iovas + * and the out_valid_iovas[] will be filled in as space permits. + * size should include the allocated flex array. + */ +struct iommu_ioas_iova_ranges { + __u32 size; + __u32 ioas_id; + __u32 out_num_iovas; + __u32 __reserved; + struct iommu_valid_iovas { + __aligned_u64 start; + __aligned_u64 last; + } out_valid_iovas[]; +}; +#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_IOVA_RANGES) + +/** + * enum iommufd_ioas_map_flags - Flags for map and copy + * @IOMMU_IOAS_MAP_FIXED_IOVA: If clear the kernel will compute an appropriate + * IOVA to place the mapping at + * @IOMMU_IOAS_MAP_WRITEABLE: DMA is allowed to write to this mapping + * @IOMMU_IOAS_MAP_READABLE: DMA is allowed to read from this mapping + */ +enum iommufd_ioas_map_flags { + IOMMU_IOAS_MAP_FIXED_IOVA = 1 << 0, + IOMMU_IOAS_MAP_WRITEABLE = 1 << 1, + IOMMU_IOAS_MAP_READABLE = 1 << 2, +}; + +/** + * struct iommu_ioas_map - ioctl(IOMMU_IOAS_MAP) + * @size: sizeof(struct iommu_ioas_map) + * @flags: Combination of enum iommufd_ioas_map_flags + * @ioas_id: IOAS ID to change the mapping of + * @__reserved: Must be 0 + * @user_va: Userspace pointer to start mapping from + * @length: Number of bytes to map + * @iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is set + * then this must be provided as input. + * + * Set an IOVA mapping from a user pointer. If FIXED_IOVA is specified then the + * mapping will be established at iova, otherwise a suitable location will be + * automatically selected and returned in iova. + */ +struct iommu_ioas_map { + __u32 size; + __u32 flags; + __u32 ioas_id; + __u32 __reserved; + __aligned_u64 user_va; + __aligned_u64 length; + __aligned_u64 iova; +}; +#define IOMMU_IOAS_MAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_MAP) + +/** + * struct iommu_ioas_copy - ioctl(IOMMU_IOAS_COPY) + * @size: sizeof(struct iommu_ioas_copy) + * @flags: Combination of enum iommufd_ioas_map_flags + * @dst_ioas_id: IOAS ID to change the mapping of + * @src_ioas_id: IOAS ID to copy from + * @length: Number of bytes to copy and map + * @dst_iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is + * set then this must be provided as input. + * @src_iova: IOVA to start the copy + * + * Copy an already existing mapping from src_ioas_id and establish it in + * dst_ioas_id. The src iova/length must exactly match a range used with + * IOMMU_IOAS_MAP. + */ +struct iommu_ioas_copy { + __u32 size; + __u32 flags; + __u32 dst_ioas_id; + __u32 src_ioas_id; + __aligned_u64 length; + __aligned_u64 dst_iova; + __aligned_u64 src_iova; +}; +#define IOMMU_IOAS_COPY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_COPY) + +/** + * struct iommu_ioas_unmap - ioctl(IOMMU_IOAS_UNMAP) + * @size: sizeof(struct iommu_ioas_copy) + * @ioas_id: IOAS ID to change the mapping of + * @iova: IOVA to start the unmapping at + * @length: Number of bytes to unmap + * + * Unmap an IOVA range. The iova/length must exactly match a range + * used with IOMMU_IOAS_PAGETABLE_MAP, or be the values 0 & U64_MAX. + * In the latter case all IOVAs will be unmaped. + */ +struct iommu_ioas_unmap { + __u32 size; + __u32 ioas_id; + __aligned_u64 iova; + __aligned_u64 length; +}; +#define IOMMU_IOAS_UNMAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_UNMAP) + +/** + * enum iommufd_vfio_ioas_op + * @IOMMU_VFIO_IOAS_GET: Get the current compatibility IOAS + * @IOMMU_VFIO_IOAS_SET: Change the current compatibility IOAS + * @IOMMU_VFIO_IOAS_CLEAR: Disable VFIO compatibility + */ +enum iommufd_vfio_ioas_op { + IOMMU_VFIO_IOAS_GET = 0, + IOMMU_VFIO_IOAS_SET = 1, + IOMMU_VFIO_IOAS_CLEAR = 2, +}; + +/** + * struct iommu_vfio_ioas - ioctl(IOMMU_VFIO_IOAS) + * @size: sizeof(struct iommu_ioas_copy) + * @ioas_id: For IOMMU_VFIO_IOAS_SET the input IOAS ID to set + * For IOMMU_VFIO_IOAS_GET will output the IOAS ID + * @op: One of enum iommufd_vfio_ioas_op + * @__reserved: Must be 0 + * + * The VFIO compatibility support uses a single ioas because VFIO APIs do not + * support the ID field. Set or Get the IOAS that VFIO compatibility will use. + * When VFIO_GROUP_SET_CONTAINER is used on an iommufd it will get the + * compatibility ioas, either by taking what is already set, or auto creating + * one. From then on VFIO will continue to use that ioas and is not effected by + * this ioctl. SET or CLEAR does not destroy any auto-created IOAS. + */ +struct iommu_vfio_ioas { + __u32 size; + __u32 ioas_id; + __u16 op; + __u16 __reserved; +}; +#define IOMMU_VFIO_IOAS _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VFIO_IOAS) +#endif diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h index e9f7795c39..d1aa1ccd9f 100644 --- a/linux-headers/linux/vfio.h +++ b/linux-headers/linux/vfio.h @@ -190,6 +190,90 @@ struct vfio_group_status { /* --------------- IOCTLs for DEVICE file descriptors --------------- */ +/* + * VFIO_DEVICE_BIND_IOMMUFD - _IOR(VFIO_TYPE, VFIO_BASE + 19, + * struct vfio_device_bind_iommufd) + * + * Bind a vfio_device to the specified iommufd + * + * The user should provide a device cookie when calling this ioctl. The + * cookie is carried only in event e.g. I/O fault reported to userspace + * via iommufd. The user should use devid returned by this ioctl to mark + * the target device in other ioctls (e.g. capability query via iommufd). + * + * User is not allowed to access the device before the binding operation + * is completed. + * + * Unbind is automatically conducted when device fd is closed. + * + * Input parameters: + * - iommufd; + * - dev_cookie; + * + * Output parameters: + * - devid; + * + * Return: 0 on success, -errno on failure. + */ +struct vfio_device_bind_iommufd { + __u32 argsz; + __u32 flags; + __aligned_u64 dev_cookie; + __s32 iommufd; + __u32 out_devid; +}; + +#define VFIO_DEVICE_BIND_IOMMUFD _IO(VFIO_TYPE, VFIO_BASE + 19) + +/* + * VFIO_DEVICE_ATTACH_IOAS - _IOW(VFIO_TYPE, VFIO_BASE + 21, + * struct vfio_device_attach_ioas) + * + * Attach a vfio device to the specified IOAS. + * + * Multiple vfio devices can be attached to the same IOAS Page Table. One + * device can be attached to only one ioas at this point. + * + * @argsz: user filled size of this data. + * @flags: reserved for future extension. + * @iommufd: iommufd where the ioas comes from. + * @ioas_id: Input the target I/O address space page table. + * @hwpt_id: Output the hw page table id + * + * Return: 0 on success, -errno on failure. + */ +struct vfio_device_attach_ioas { + __u32 argsz; + __u32 flags; + __s32 iommufd; + __u32 ioas_id; + __u32 out_hwpt_id; +}; + +#define VFIO_DEVICE_ATTACH_IOAS _IO(VFIO_TYPE, VFIO_BASE + 20) + +/* + * VFIO_DEVICE_DETACH_IOAS - _IOW(VFIO_TYPE, VFIO_BASE + 21, + * struct vfio_device_detach_ioas) + * + * Detach a vfio device from the specified IOAS. + * + * @argsz: user filled size of this data. + * @flags: reserved for future extension. + * @iommufd: iommufd where the ioas comes from. + * @ioas_id: Input the target I/O address space page table. + * + * Return: 0 on success, -errno on failure. + */ +struct vfio_device_detach_ioas { + __u32 argsz; + __u32 flags; + __s32 iommufd; + __u32 ioas_id; +}; + +#define VFIO_DEVICE_DETACH_IOAS _IO(VFIO_TYPE, VFIO_BASE + 21) + /** * VFIO_DEVICE_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 7, * struct vfio_device_info) From patchwork Wed Jun 8 12:31:27 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yi Liu X-Patchwork-Id: 12873450 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 4287FC43334 for ; Wed, 8 Jun 2022 12:40:57 +0000 (UTC) Received: from localhost ([::1]:37600 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nyuzY-0000mo-8T for qemu-devel@archiver.kernel.org; Wed, 08 Jun 2022 08:40:56 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:57910) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nyuqn-0000HM-CU for qemu-devel@nongnu.org; Wed, 08 Jun 2022 08:31:53 -0400 Received: from mga05.intel.com ([192.55.52.43]:56952) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nyuqh-0005kl-7D for qemu-devel@nongnu.org; Wed, 08 Jun 2022 08:31:51 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1654691507; x=1686227507; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=R+rINfGuKEGx4S/HB5/iMpGulUK/r1nTXz1UYK53w7U=; b=f/1NU+ytucd4W46fhyIPrdXoP2FjkgRxqjovmIlfwBV4IkuwciLVDWQM +9WbfaT2KeVSkwXdzvC2b9iIvUYvR9AxlqkIahsp03eoPyC3oQKJ7wAQ+ HNA48gJliH0JSr+Is/DUTbRNnVQ7mvoYTKht5KY5bCEWLPWbrTCV0Dd5p C13WaTS+livHFibzZp2OzN+I3HBp2K2GF8wakKtWy/Xc7yGupqRRgePne 36ztkc3cD+pgMtynMxRmt+oTtI9rd16+ZBaWrDrP4FezRLdV9E6FE5Qrd QO8TwXeQE9XR93ngewqEOJu63cvz4uJ6wEHA1QBJglL6rs21B2dcuA5mw g==; X-IronPort-AV: E=McAfee;i="6400,9594,10371"; a="363210100" X-IronPort-AV: E=Sophos;i="5.91,286,1647327600"; d="scan'208";a="363210100" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Jun 2022 05:31:42 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.91,286,1647327600"; d="scan'208";a="670529728" Received: from 984fee00a4c6.jf.intel.com ([10.165.58.231]) by FMSMGA003.fm.intel.com with ESMTP; 08 Jun 2022 05:31:41 -0700 From: Yi Liu To: alex.williamson@redhat.com, cohuck@redhat.com, qemu-devel@nongnu.org Cc: david@gibson.dropbear.id.au, thuth@redhat.com, farman@linux.ibm.com, mjrosato@linux.ibm.com, akrowiak@linux.ibm.com, pasic@linux.ibm.com, jjherne@linux.ibm.com, jasowang@redhat.com, kvm@vger.kernel.org, jgg@nvidia.com, nicolinc@nvidia.com, eric.auger@redhat.com, eric.auger.pro@gmail.com, kevin.tian@intel.com, yi.l.liu@intel.com, chao.p.peng@intel.com, yi.y.sun@intel.com, peterx@redhat.com, shameerali.kolothum.thodi@huawei.com, zhangfei.gao@linaro.org, berrange@redhat.com Subject: [RFC v2 03/15] vfio/common: Split common.c into common.c, container.c and as.c Date: Wed, 8 Jun 2022 05:31:27 -0700 Message-Id: <20220608123139.19356-4-yi.l.liu@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220608123139.19356-1-yi.l.liu@intel.com> References: <20220608123139.19356-1-yi.l.liu@intel.com> MIME-Version: 1.0 Received-SPF: pass client-ip=192.55.52.43; envelope-from=yi.l.liu@intel.com; helo=mga05.intel.com X-Spam_score_int: -44 X-Spam_score: -4.5 X-Spam_bar: ---- X-Spam_report: (-4.5 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" Before introducing the support for the new /dev/iommu backend in VFIO let's try to split common.c file into 3 parts: - in common.c we keep backend agnostic code unrelated to dma mapping - as.c is created and contains code related to VFIOAddressSpace and MemoryListeners. This code will be backend agnostic. - container.c is created and will contain code related to the legacy VFIO backend (containers, groups, ...). No functional change intended Signed-off-by: Yi Liu Signed-off-by: Eric Auger --- hw/vfio/as.c | 893 +++++++++++++ hw/vfio/common.c | 2365 +++------------------------------ hw/vfio/container.c | 1193 +++++++++++++++++ hw/vfio/meson.build | 2 + include/hw/vfio/vfio-common.h | 28 + 5 files changed, 2303 insertions(+), 2178 deletions(-) create mode 100644 hw/vfio/as.c create mode 100644 hw/vfio/container.c diff --git a/hw/vfio/as.c b/hw/vfio/as.c new file mode 100644 index 0000000000..01eb60105b --- /dev/null +++ b/hw/vfio/as.c @@ -0,0 +1,893 @@ +/* + * generic functions used by VFIO devices + * + * Copyright Red Hat, Inc. 2012 + * + * Authors: + * Alex Williamson + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + * + * Based on qemu-kvm device-assignment: + * Adapted for KVM by Qumranet. + * Copyright (c) 2007, Neocleus, Alex Novik (alex@neocleus.com) + * Copyright (c) 2007, Neocleus, Guy Zana (guy@neocleus.com) + * Copyright (C) 2008, Qumranet, Amit Shah (amit.shah@qumranet.com) + * Copyright (C) 2008, Red Hat, Amit Shah (amit.shah@redhat.com) + * Copyright (C) 2008, IBM, Muli Ben-Yehuda (muli@il.ibm.com) + */ + +#include "qemu/osdep.h" +#include +#ifdef CONFIG_KVM +#include +#endif +#include + +#include "hw/vfio/vfio-common.h" +#include "hw/vfio/vfio.h" +#include "exec/address-spaces.h" +#include "exec/memory.h" +#include "exec/ram_addr.h" +#include "hw/hw.h" +#include "qemu/error-report.h" +#include "qemu/main-loop.h" +#include "qemu/range.h" +#include "sysemu/kvm.h" +#include "sysemu/reset.h" +#include "sysemu/runstate.h" +#include "trace.h" +#include "qapi/error.h" +#include "migration/migration.h" +#include "sysemu/tpm.h" + +static QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces = + QLIST_HEAD_INITIALIZER(vfio_address_spaces); + +void vfio_host_win_add(VFIOContainer *container, + hwaddr min_iova, hwaddr max_iova, + uint64_t iova_pgsizes) +{ + VFIOHostDMAWindow *hostwin; + + QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) { + if (ranges_overlap(hostwin->min_iova, + hostwin->max_iova - hostwin->min_iova + 1, + min_iova, + max_iova - min_iova + 1)) { + hw_error("%s: Overlapped IOMMU are not enabled", __func__); + } + } + + hostwin = g_malloc0(sizeof(*hostwin)); + + hostwin->min_iova = min_iova; + hostwin->max_iova = max_iova; + hostwin->iova_pgsizes = iova_pgsizes; + QLIST_INSERT_HEAD(&container->hostwin_list, hostwin, hostwin_next); +} + +int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova, + hwaddr max_iova) +{ + VFIOHostDMAWindow *hostwin; + + QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) { + if (hostwin->min_iova == min_iova && hostwin->max_iova == max_iova) { + QLIST_REMOVE(hostwin, hostwin_next); + g_free(hostwin); + return 0; + } + } + + return -1; +} + +static bool vfio_listener_skipped_section(MemoryRegionSection *section) +{ + return (!memory_region_is_ram(section->mr) && + !memory_region_is_iommu(section->mr)) || + memory_region_is_protected(section->mr) || + /* + * Sizing an enabled 64-bit BAR can cause spurious mappings to + * addresses in the upper part of the 64-bit address space. These + * are never accessed by the CPU and beyond the address width of + * some IOMMU hardware. TODO: VFIO should tell us the IOMMU width. + */ + section->offset_within_address_space & (1ULL << 63); +} + +/* Called with rcu_read_lock held. */ +static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr, + ram_addr_t *ram_addr, bool *read_only) +{ + MemoryRegion *mr; + hwaddr xlat; + hwaddr len = iotlb->addr_mask + 1; + bool writable = iotlb->perm & IOMMU_WO; + + /* + * The IOMMU TLB entry we have just covers translation through + * this IOMMU to its immediate target. We need to translate + * it the rest of the way through to memory. + */ + mr = address_space_translate(&address_space_memory, + iotlb->translated_addr, + &xlat, &len, writable, + MEMTXATTRS_UNSPECIFIED); + if (!memory_region_is_ram(mr)) { + error_report("iommu map to non memory area %"HWADDR_PRIx"", + xlat); + return false; + } else if (memory_region_has_ram_discard_manager(mr)) { + RamDiscardManager *rdm = memory_region_get_ram_discard_manager(mr); + MemoryRegionSection tmp = { + .mr = mr, + .offset_within_region = xlat, + .size = int128_make64(len), + }; + + /* + * Malicious VMs can map memory into the IOMMU, which is expected + * to remain discarded. vfio will pin all pages, populating memory. + * Disallow that. vmstate priorities make sure any RamDiscardManager + * were already restored before IOMMUs are restored. + */ + if (!ram_discard_manager_is_populated(rdm, &tmp)) { + error_report("iommu map to discarded memory (e.g., unplugged via" + " virtio-mem): %"HWADDR_PRIx"", + iotlb->translated_addr); + return false; + } + + /* + * Malicious VMs might trigger discarding of IOMMU-mapped memory. The + * pages will remain pinned inside vfio until unmapped, resulting in a + * higher memory consumption than expected. If memory would get + * populated again later, there would be an inconsistency between pages + * pinned by vfio and pages seen by QEMU. This is the case until + * unmapped from the IOMMU (e.g., during device reset). + * + * With malicious guests, we really only care about pinning more memory + * than expected. RLIMIT_MEMLOCK set for the user/process can never be + * exceeded and can be used to mitigate this problem. + */ + warn_report_once("Using vfio with vIOMMUs and coordinated discarding of" + " RAM (e.g., virtio-mem) works, however, malicious" + " guests can trigger pinning of more memory than" + " intended via an IOMMU. It's possible to mitigate " + " by setting/adjusting RLIMIT_MEMLOCK."); + } + + /* + * Translation truncates length to the IOMMU page size, + * check that it did not truncate too much. + */ + if (len & iotlb->addr_mask) { + error_report("iommu has granularity incompatible with target AS"); + return false; + } + + if (vaddr) { + *vaddr = memory_region_get_ram_ptr(mr) + xlat; + } + + if (ram_addr) { + *ram_addr = memory_region_get_ram_addr(mr) + xlat; + } + + if (read_only) { + *read_only = !writable || mr->readonly; + } + + return true; +} + +static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) +{ + VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n); + VFIOContainer *container = giommu->container; + hwaddr iova = iotlb->iova + giommu->iommu_offset; + void *vaddr; + int ret; + + trace_vfio_iommu_map_notify(iotlb->perm == IOMMU_NONE ? "UNMAP" : "MAP", + iova, iova + iotlb->addr_mask); + + if (iotlb->target_as != &address_space_memory) { + error_report("Wrong target AS \"%s\", only system memory is allowed", + iotlb->target_as->name ? iotlb->target_as->name : "none"); + return; + } + + rcu_read_lock(); + + if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) { + bool read_only; + + if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only)) { + goto out; + } + /* + * vaddr is only valid until rcu_read_unlock(). But after + * vfio_dma_map has set up the mapping the pages will be + * pinned by the kernel. This makes sure that the RAM backend + * of vaddr will always be there, even if the memory object is + * destroyed and its backing memory munmap-ed. + */ + ret = vfio_dma_map(container, iova, + iotlb->addr_mask + 1, vaddr, + read_only); + if (ret) { + error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", " + "0x%"HWADDR_PRIx", %p) = %d (%m)", + container, iova, + iotlb->addr_mask + 1, vaddr, ret); + } + } else { + ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1, iotlb); + if (ret) { + error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", " + "0x%"HWADDR_PRIx") = %d (%m)", + container, iova, + iotlb->addr_mask + 1, ret); + } + } +out: + rcu_read_unlock(); +} + +static void vfio_ram_discard_notify_discard(RamDiscardListener *rdl, + MemoryRegionSection *section) +{ + VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener, + listener); + const hwaddr size = int128_get64(section->size); + const hwaddr iova = section->offset_within_address_space; + int ret; + + /* Unmap with a single call. */ + ret = vfio_dma_unmap(vrdl->container, iova, size , NULL); + if (ret) { + error_report("%s: vfio_dma_unmap() failed: %s", __func__, + strerror(-ret)); + } +} + +static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl, + MemoryRegionSection *section) +{ + VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener, + listener); + const hwaddr end = section->offset_within_region + + int128_get64(section->size); + hwaddr start, next, iova; + void *vaddr; + int ret; + + /* + * Map in (aligned within memory region) minimum granularity, so we can + * unmap in minimum granularity later. + */ + for (start = section->offset_within_region; start < end; start = next) { + next = ROUND_UP(start + 1, vrdl->granularity); + next = MIN(next, end); + + iova = start - section->offset_within_region + + section->offset_within_address_space; + vaddr = memory_region_get_ram_ptr(section->mr) + start; + + ret = vfio_dma_map(vrdl->container, iova, next - start, + vaddr, section->readonly); + if (ret) { + /* Rollback */ + vfio_ram_discard_notify_discard(rdl, section); + return ret; + } + } + return 0; +} + +static void vfio_register_ram_discard_listener(VFIOContainer *container, + MemoryRegionSection *section) +{ + RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr); + VFIORamDiscardListener *vrdl; + + /* Ignore some corner cases not relevant in practice. */ + g_assert(QEMU_IS_ALIGNED(section->offset_within_region, TARGET_PAGE_SIZE)); + g_assert(QEMU_IS_ALIGNED(section->offset_within_address_space, + TARGET_PAGE_SIZE)); + g_assert(QEMU_IS_ALIGNED(int128_get64(section->size), TARGET_PAGE_SIZE)); + + vrdl = g_new0(VFIORamDiscardListener, 1); + vrdl->container = container; + vrdl->mr = section->mr; + vrdl->offset_within_address_space = section->offset_within_address_space; + vrdl->size = int128_get64(section->size); + vrdl->granularity = ram_discard_manager_get_min_granularity(rdm, + section->mr); + + g_assert(vrdl->granularity && is_power_of_2(vrdl->granularity)); + g_assert(container->pgsizes && + vrdl->granularity >= 1ULL << ctz64(container->pgsizes)); + + ram_discard_listener_init(&vrdl->listener, + vfio_ram_discard_notify_populate, + vfio_ram_discard_notify_discard, true); + ram_discard_manager_register_listener(rdm, &vrdl->listener, section); + QLIST_INSERT_HEAD(&container->vrdl_list, vrdl, next); + + /* + * Sanity-check if we have a theoretically problematic setup where we could + * exceed the maximum number of possible DMA mappings over time. We assume + * that each mapped section in the same address space as a RamDiscardManager + * section consumes exactly one DMA mapping, with the exception of + * RamDiscardManager sections; i.e., we don't expect to have gIOMMU sections + * in the same address space as RamDiscardManager sections. + * + * We assume that each section in the address space consumes one memslot. + * We take the number of KVM memory slots as a best guess for the maximum + * number of sections in the address space we could have over time, + * also consuming DMA mappings. + */ + if (container->dma_max_mappings) { + unsigned int vrdl_count = 0, vrdl_mappings = 0, max_memslots = 512; + +#ifdef CONFIG_KVM + if (kvm_enabled()) { + max_memslots = kvm_get_max_memslots(); + } +#endif + + QLIST_FOREACH(vrdl, &container->vrdl_list, next) { + hwaddr start, end; + + start = QEMU_ALIGN_DOWN(vrdl->offset_within_address_space, + vrdl->granularity); + end = ROUND_UP(vrdl->offset_within_address_space + vrdl->size, + vrdl->granularity); + vrdl_mappings += (end - start) / vrdl->granularity; + vrdl_count++; + } + + if (vrdl_mappings + max_memslots - vrdl_count > + container->dma_max_mappings) { + warn_report("%s: possibly running out of DMA mappings. E.g., try" + " increasing the 'block-size' of virtio-mem devies." + " Maximum possible DMA mappings: %d, Maximum possible" + " memslots: %d", __func__, container->dma_max_mappings, + max_memslots); + } + } +} + +static void vfio_unregister_ram_discard_listener(VFIOContainer *container, + MemoryRegionSection *section) +{ + RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr); + VFIORamDiscardListener *vrdl = NULL; + + QLIST_FOREACH(vrdl, &container->vrdl_list, next) { + if (vrdl->mr == section->mr && + vrdl->offset_within_address_space == + section->offset_within_address_space) { + break; + } + } + + if (!vrdl) { + hw_error("vfio: Trying to unregister missing RAM discard listener"); + } + + ram_discard_manager_unregister_listener(rdm, &vrdl->listener); + QLIST_REMOVE(vrdl, next); + g_free(vrdl); +} + +static bool vfio_known_safe_misalignment(MemoryRegionSection *section) +{ + MemoryRegion *mr = section->mr; + + if (!TPM_IS_CRB(mr->owner)) { + return false; + } + + /* this is a known safe misaligned region, just trace for debug purpose */ + trace_vfio_known_safe_misalignment(memory_region_name(mr), + section->offset_within_address_space, + section->offset_within_region, + qemu_real_host_page_size()); + return true; +} + +static void vfio_listener_region_add(MemoryListener *listener, + MemoryRegionSection *section) +{ + VFIOContainer *container = container_of(listener, VFIOContainer, listener); + hwaddr iova, end; + Int128 llend, llsize; + void *vaddr; + int ret; + VFIOHostDMAWindow *hostwin; + bool hostwin_found; + Error *err = NULL; + + if (vfio_listener_skipped_section(section)) { + trace_vfio_listener_region_add_skip( + section->offset_within_address_space, + section->offset_within_address_space + + int128_get64(int128_sub(section->size, int128_one()))); + return; + } + + if (unlikely((section->offset_within_address_space & + ~qemu_real_host_page_mask()) != + (section->offset_within_region & ~qemu_real_host_page_mask()))) { + if (!vfio_known_safe_misalignment(section)) { + error_report("%s received unaligned region %s iova=0x%"PRIx64 + " offset_within_region=0x%"PRIx64 + " qemu_real_host_page_size=0x%"PRIxPTR, + __func__, memory_region_name(section->mr), + section->offset_within_address_space, + section->offset_within_region, + qemu_real_host_page_size()); + } + return; + } + + iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space); + llend = int128_make64(section->offset_within_address_space); + llend = int128_add(llend, section->size); + llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask())); + + if (int128_ge(int128_make64(iova), llend)) { + if (memory_region_is_ram_device(section->mr)) { + trace_vfio_listener_region_add_no_dma_map( + memory_region_name(section->mr), + section->offset_within_address_space, + int128_getlo(section->size), + qemu_real_host_page_size()); + } + return; + } + end = int128_get64(int128_sub(llend, int128_one())); + + if (vfio_container_add_section_window(container, section, &err)) { + goto fail; + } + + hostwin_found = false; + QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) { + if (hostwin->min_iova <= iova && end <= hostwin->max_iova) { + hostwin_found = true; + break; + } + } + + if (!hostwin_found) { + error_setg(&err, "Container %p can't map guest IOVA region" + " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx, container, iova, end); + goto fail; + } + + memory_region_ref(section->mr); + + if (memory_region_is_iommu(section->mr)) { + VFIOGuestIOMMU *giommu; + IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr); + int iommu_idx; + + trace_vfio_listener_region_add_iommu(iova, end); + /* + * FIXME: For VFIO iommu types which have KVM acceleration to + * avoid bouncing all map/unmaps through qemu this way, this + * would be the right place to wire that up (tell the KVM + * device emulation the VFIO iommu handles to use). + */ + giommu = g_malloc0(sizeof(*giommu)); + giommu->iommu_mr = iommu_mr; + giommu->iommu_offset = section->offset_within_address_space - + section->offset_within_region; + giommu->container = container; + llend = int128_add(int128_make64(section->offset_within_region), + section->size); + llend = int128_sub(llend, int128_one()); + iommu_idx = memory_region_iommu_attrs_to_index(iommu_mr, + MEMTXATTRS_UNSPECIFIED); + iommu_notifier_init(&giommu->n, vfio_iommu_map_notify, + IOMMU_NOTIFIER_IOTLB_EVENTS, + section->offset_within_region, + int128_get64(llend), + iommu_idx); + + ret = memory_region_iommu_set_page_size_mask(giommu->iommu_mr, + container->pgsizes, + &err); + if (ret) { + g_free(giommu); + goto fail; + } + + ret = memory_region_register_iommu_notifier(section->mr, &giommu->n, + &err); + if (ret) { + g_free(giommu); + goto fail; + } + QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next); + memory_region_iommu_replay(giommu->iommu_mr, &giommu->n); + + return; + } + + /* Here we assume that memory_region_is_ram(section->mr)==true */ + + /* + * For RAM memory regions with a RamDiscardManager, we only want to map the + * actually populated parts - and update the mapping whenever we're notified + * about changes. + */ + if (memory_region_has_ram_discard_manager(section->mr)) { + vfio_register_ram_discard_listener(container, section); + return; + } + + vaddr = memory_region_get_ram_ptr(section->mr) + + section->offset_within_region + + (iova - section->offset_within_address_space); + + trace_vfio_listener_region_add_ram(iova, end, vaddr); + + llsize = int128_sub(llend, int128_make64(iova)); + + if (memory_region_is_ram_device(section->mr)) { + hwaddr pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1; + + if ((iova & pgmask) || (int128_get64(llsize) & pgmask)) { + trace_vfio_listener_region_add_no_dma_map( + memory_region_name(section->mr), + section->offset_within_address_space, + int128_getlo(section->size), + pgmask + 1); + return; + } + } + + ret = vfio_dma_map(container, iova, int128_get64(llsize), + vaddr, section->readonly); + if (ret) { + error_setg(&err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", " + "0x%"HWADDR_PRIx", %p) = %d (%m)", + container, iova, int128_get64(llsize), vaddr, ret); + if (memory_region_is_ram_device(section->mr)) { + /* Allow unexpected mappings not to be fatal for RAM devices */ + error_report_err(err); + return; + } + goto fail; + } + + return; + +fail: + if (memory_region_is_ram_device(section->mr)) { + error_report("failed to vfio_dma_map. pci p2p may not work"); + return; + } + /* + * On the initfn path, store the first error in the container so we + * can gracefully fail. Runtime, there's not much we can do other + * than throw a hardware error. + */ + if (!container->initialized) { + if (!container->error) { + error_propagate_prepend(&container->error, err, + "Region %s: ", + memory_region_name(section->mr)); + } else { + error_free(err); + } + } else { + error_report_err(err); + hw_error("vfio: DMA mapping failed, unable to continue"); + } +} + +static void vfio_listener_region_del(MemoryListener *listener, + MemoryRegionSection *section) +{ + VFIOContainer *container = container_of(listener, VFIOContainer, listener); + hwaddr iova, end; + Int128 llend, llsize; + int ret; + bool try_unmap = true; + + if (vfio_listener_skipped_section(section)) { + trace_vfio_listener_region_del_skip( + section->offset_within_address_space, + section->offset_within_address_space + + int128_get64(int128_sub(section->size, int128_one()))); + return; + } + + if (unlikely((section->offset_within_address_space & + ~qemu_real_host_page_mask()) != + (section->offset_within_region & ~qemu_real_host_page_mask()))) { + error_report("%s received unaligned region", __func__); + return; + } + + if (memory_region_is_iommu(section->mr)) { + VFIOGuestIOMMU *giommu; + + QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) { + if (MEMORY_REGION(giommu->iommu_mr) == section->mr && + giommu->n.start == section->offset_within_region) { + memory_region_unregister_iommu_notifier(section->mr, + &giommu->n); + QLIST_REMOVE(giommu, giommu_next); + g_free(giommu); + break; + } + } + + /* + * FIXME: We assume the one big unmap below is adequate to + * remove any individual page mappings in the IOMMU which + * might have been copied into VFIO. This works for a page table + * based IOMMU where a big unmap flattens a large range of IO-PTEs. + * That may not be true for all IOMMU types. + */ + } + + iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space); + llend = int128_make64(section->offset_within_address_space); + llend = int128_add(llend, section->size); + llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask())); + + if (int128_ge(int128_make64(iova), llend)) { + return; + } + end = int128_get64(int128_sub(llend, int128_one())); + + llsize = int128_sub(llend, int128_make64(iova)); + + trace_vfio_listener_region_del(iova, end); + + if (memory_region_is_ram_device(section->mr)) { + hwaddr pgmask; + VFIOHostDMAWindow *hostwin; + bool hostwin_found = false; + + QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) { + if (hostwin->min_iova <= iova && end <= hostwin->max_iova) { + hostwin_found = true; + break; + } + } + assert(hostwin_found); /* or region_add() would have failed */ + + pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1; + try_unmap = !((iova & pgmask) || (int128_get64(llsize) & pgmask)); + } else if (memory_region_has_ram_discard_manager(section->mr)) { + vfio_unregister_ram_discard_listener(container, section); + /* Unregistering will trigger an unmap. */ + try_unmap = false; + } + + if (try_unmap) { + if (int128_eq(llsize, int128_2_64())) { + /* The unmap ioctl doesn't accept a full 64-bit span. */ + llsize = int128_rshift(llsize, 1); + ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL); + if (ret) { + error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", " + "0x%"HWADDR_PRIx") = %d (%m)", + container, iova, int128_get64(llsize), ret); + } + iova += int128_get64(llsize); + } + ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL); + if (ret) { + error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", " + "0x%"HWADDR_PRIx") = %d (%m)", + container, iova, int128_get64(llsize), ret); + } + } + + memory_region_unref(section->mr); + + vfio_container_del_section_window(container, section); +} + +static void vfio_listener_log_global_start(MemoryListener *listener) +{ + VFIOContainer *container = container_of(listener, VFIOContainer, listener); + + vfio_set_dirty_page_tracking(container, true); +} + +static void vfio_listener_log_global_stop(MemoryListener *listener) +{ + VFIOContainer *container = container_of(listener, VFIOContainer, listener); + + vfio_set_dirty_page_tracking(container, false); +} + +typedef struct { + IOMMUNotifier n; + VFIOGuestIOMMU *giommu; +} vfio_giommu_dirty_notifier; + +static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) +{ + vfio_giommu_dirty_notifier *gdn = container_of(n, + vfio_giommu_dirty_notifier, n); + VFIOGuestIOMMU *giommu = gdn->giommu; + VFIOContainer *container = giommu->container; + hwaddr iova = iotlb->iova + giommu->iommu_offset; + ram_addr_t translated_addr; + + trace_vfio_iommu_map_dirty_notify(iova, iova + iotlb->addr_mask); + + if (iotlb->target_as != &address_space_memory) { + error_report("Wrong target AS \"%s\", only system memory is allowed", + iotlb->target_as->name ? iotlb->target_as->name : "none"); + return; + } + + rcu_read_lock(); + if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL)) { + int ret; + + ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1, + translated_addr); + if (ret) { + error_report("vfio_iommu_map_dirty_notify(%p, 0x%"HWADDR_PRIx", " + "0x%"HWADDR_PRIx") = %d (%m)", + container, iova, + iotlb->addr_mask + 1, ret); + } + } + rcu_read_unlock(); +} + +static int vfio_ram_discard_get_dirty_bitmap(MemoryRegionSection *section, + void *opaque) +{ + const hwaddr size = int128_get64(section->size); + const hwaddr iova = section->offset_within_address_space; + const ram_addr_t ram_addr = memory_region_get_ram_addr(section->mr) + + section->offset_within_region; + VFIORamDiscardListener *vrdl = opaque; + + /* + * Sync the whole mapped region (spanning multiple individual mappings) + * in one go. + */ + return vfio_get_dirty_bitmap(vrdl->container, iova, size, ram_addr); +} + +static int vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainer *container, + MemoryRegionSection *section) +{ + RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr); + VFIORamDiscardListener *vrdl = NULL; + + QLIST_FOREACH(vrdl, &container->vrdl_list, next) { + if (vrdl->mr == section->mr && + vrdl->offset_within_address_space == + section->offset_within_address_space) { + break; + } + } + + if (!vrdl) { + hw_error("vfio: Trying to sync missing RAM discard listener"); + } + + /* + * We only want/can synchronize the bitmap for actually mapped parts - + * which correspond to populated parts. Replay all populated parts. + */ + return ram_discard_manager_replay_populated(rdm, section, + vfio_ram_discard_get_dirty_bitmap, + &vrdl); +} + +static int vfio_sync_dirty_bitmap(VFIOContainer *container, + MemoryRegionSection *section) +{ + ram_addr_t ram_addr; + + if (memory_region_is_iommu(section->mr)) { + VFIOGuestIOMMU *giommu; + + QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) { + if (MEMORY_REGION(giommu->iommu_mr) == section->mr && + giommu->n.start == section->offset_within_region) { + Int128 llend; + vfio_giommu_dirty_notifier gdn = { .giommu = giommu }; + int idx = memory_region_iommu_attrs_to_index(giommu->iommu_mr, + MEMTXATTRS_UNSPECIFIED); + + llend = int128_add(int128_make64(section->offset_within_region), + section->size); + llend = int128_sub(llend, int128_one()); + + iommu_notifier_init(&gdn.n, + vfio_iommu_map_dirty_notify, + IOMMU_NOTIFIER_MAP, + section->offset_within_region, + int128_get64(llend), + idx); + memory_region_iommu_replay(giommu->iommu_mr, &gdn.n); + break; + } + } + return 0; + } else if (memory_region_has_ram_discard_manager(section->mr)) { + return vfio_sync_ram_discard_listener_dirty_bitmap(container, section); + } + + ram_addr = memory_region_get_ram_addr(section->mr) + + section->offset_within_region; + + return vfio_get_dirty_bitmap(container, + REAL_HOST_PAGE_ALIGN(section->offset_within_address_space), + int128_get64(section->size), ram_addr); +} + +static void vfio_listener_log_sync(MemoryListener *listener, + MemoryRegionSection *section) +{ + VFIOContainer *container = container_of(listener, VFIOContainer, listener); + + if (vfio_listener_skipped_section(section) || + !container->dirty_pages_supported) { + return; + } + + if (vfio_devices_all_dirty_tracking(container)) { + vfio_sync_dirty_bitmap(container, section); + } +} + +const MemoryListener vfio_memory_listener = { + .name = "vfio", + .region_add = vfio_listener_region_add, + .region_del = vfio_listener_region_del, + .log_global_start = vfio_listener_log_global_start, + .log_global_stop = vfio_listener_log_global_stop, + .log_sync = vfio_listener_log_sync, +}; + +VFIOAddressSpace *vfio_get_address_space(AddressSpace *as) +{ + VFIOAddressSpace *space; + + QLIST_FOREACH(space, &vfio_address_spaces, list) { + if (space->as == as) { + return space; + } + } + + /* No suitable VFIOAddressSpace, create a new one */ + space = g_malloc0(sizeof(*space)); + space->as = as; + QLIST_INIT(&space->containers); + + QLIST_INSERT_HEAD(&vfio_address_spaces, space, list); + + return space; +} + +void vfio_put_address_space(VFIOAddressSpace *space) +{ + if (QLIST_EMPTY(&space->containers)) { + QLIST_REMOVE(space, list); + g_free(space); + } +} diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 29982c7af8..5c5c735a28 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -20,43 +20,13 @@ #include "qemu/osdep.h" #include -#ifdef CONFIG_KVM -#include -#endif #include #include "hw/vfio/vfio-common.h" #include "hw/vfio/vfio.h" -#include "exec/address-spaces.h" -#include "exec/memory.h" -#include "exec/ram_addr.h" #include "hw/hw.h" -#include "qemu/error-report.h" -#include "qemu/main-loop.h" -#include "qemu/range.h" -#include "sysemu/kvm.h" -#include "sysemu/reset.h" -#include "sysemu/runstate.h" #include "trace.h" #include "qapi/error.h" -#include "migration/migration.h" -#include "sysemu/tpm.h" - -VFIOGroupList vfio_group_list = - QLIST_HEAD_INITIALIZER(vfio_group_list); -static QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces = - QLIST_HEAD_INITIALIZER(vfio_address_spaces); - -#ifdef CONFIG_KVM -/* - * We have a single VFIO pseudo device per KVM VM. Once created it lives - * for the life of the VM. Closing the file descriptor only drops our - * reference to it and the device's reference to kvm. Therefore once - * initialized, this file descriptor is only released on QEMU exit and - * we'll re-use it should another vfio device be attached before then. - */ -static int vfio_kvm_device_fd = -1; -#endif /* * Common VFIO interrupt disable @@ -136,29 +106,6 @@ static const char *index_to_str(VFIODevice *vbasedev, int index) } } -static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state) -{ - switch (container->iommu_type) { - case VFIO_TYPE1v2_IOMMU: - case VFIO_TYPE1_IOMMU: - /* - * We support coordinated discarding of RAM via the RamDiscardManager. - */ - return ram_block_uncoordinated_discard_disable(state); - default: - /* - * VFIO_SPAPR_TCE_IOMMU most probably works just fine with - * RamDiscardManager, however, it is completely untested. - * - * VFIO_SPAPR_TCE_v2_IOMMU with "DMA memory preregistering" does - * completely the opposite of managing mapping/pinning dynamically as - * required by RamDiscardManager. We would have to special-case sections - * with a RamDiscardManager. - */ - return ram_block_discard_disable(state); - } -} - int vfio_set_irq_signaling(VFIODevice *vbasedev, int index, int subindex, int action, int fd, Error **errp) { @@ -313,2138 +260,295 @@ const MemoryRegionOps vfio_region_ops = { }, }; -/* - * Device state interfaces - */ - -bool vfio_mig_active(void) +static struct vfio_info_cap_header * +vfio_get_cap(void *ptr, uint32_t cap_offset, uint16_t id) { - VFIOGroup *group; - VFIODevice *vbasedev; - - if (QLIST_EMPTY(&vfio_group_list)) { - return false; - } + struct vfio_info_cap_header *hdr; - QLIST_FOREACH(group, &vfio_group_list, next) { - QLIST_FOREACH(vbasedev, &group->device_list, next) { - if (vbasedev->migration_blocker) { - return false; - } + for (hdr = ptr + cap_offset; hdr != ptr; hdr = ptr + hdr->next) { + if (hdr->id == id) { + return hdr; } } - return true; + + return NULL; } -static bool vfio_devices_all_dirty_tracking(VFIOContainer *container) +struct vfio_info_cap_header * +vfio_get_region_info_cap(struct vfio_region_info *info, uint16_t id) { - VFIOGroup *group; - VFIODevice *vbasedev; - MigrationState *ms = migrate_get_current(); - - if (!migration_is_setup_or_active(ms->state)) { - return false; + if (!(info->flags & VFIO_REGION_INFO_FLAG_CAPS)) { + return NULL; } - QLIST_FOREACH(group, &container->group_list, container_next) { - QLIST_FOREACH(vbasedev, &group->device_list, next) { - VFIOMigration *migration = vbasedev->migration; + return vfio_get_cap((void *)info, info->cap_offset, id); +} + +static struct vfio_info_cap_header * +vfio_get_iommu_type1_info_cap(struct vfio_iommu_type1_info *info, uint16_t id) +{ + if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) { + return NULL; + } - if (!migration) { - return false; - } + return vfio_get_cap((void *)info, info->cap_offset, id); +} - if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) - && (migration->device_state & VFIO_DEVICE_STATE_V1_RUNNING)) { - return false; - } - } +struct vfio_info_cap_header * +vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id) +{ + if (!(info->flags & VFIO_DEVICE_FLAGS_CAPS)) { + return NULL; } - return true; + + return vfio_get_cap((void *)info, info->cap_offset, id); } -static bool vfio_devices_all_running_and_saving(VFIOContainer *container) +bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info, + unsigned int *avail) { - VFIOGroup *group; - VFIODevice *vbasedev; - MigrationState *ms = migrate_get_current(); + struct vfio_info_cap_header *hdr; + struct vfio_iommu_type1_info_dma_avail *cap; - if (!migration_is_setup_or_active(ms->state)) { + /* If the capability cannot be found, assume no DMA limiting */ + hdr = vfio_get_iommu_type1_info_cap(info, + VFIO_IOMMU_TYPE1_INFO_DMA_AVAIL); + if (hdr == NULL) { return false; } - QLIST_FOREACH(group, &container->group_list, container_next) { - QLIST_FOREACH(vbasedev, &group->device_list, next) { - VFIOMigration *migration = vbasedev->migration; - - if (!migration) { - return false; - } - - if ((migration->device_state & VFIO_DEVICE_STATE_V1_SAVING) && - (migration->device_state & VFIO_DEVICE_STATE_V1_RUNNING)) { - continue; - } else { - return false; - } - } + if (avail != NULL) { + cap = (void *) hdr; + *avail = cap->avail; } + return true; } -static int vfio_dma_unmap_bitmap(VFIOContainer *container, - hwaddr iova, ram_addr_t size, - IOMMUTLBEntry *iotlb) +static int vfio_setup_region_sparse_mmaps(VFIORegion *region, + struct vfio_region_info *info) { - struct vfio_iommu_type1_dma_unmap *unmap; - struct vfio_bitmap *bitmap; - uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size(); - int ret; - - unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap)); + struct vfio_info_cap_header *hdr; + struct vfio_region_info_cap_sparse_mmap *sparse; + int i, j; - unmap->argsz = sizeof(*unmap) + sizeof(*bitmap); - unmap->iova = iova; - unmap->size = size; - unmap->flags |= VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP; - bitmap = (struct vfio_bitmap *)&unmap->data; + hdr = vfio_get_region_info_cap(info, VFIO_REGION_INFO_CAP_SPARSE_MMAP); + if (!hdr) { + return -ENODEV; + } - /* - * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of - * qemu_real_host_page_size to mark those dirty. Hence set bitmap_pgsize - * to qemu_real_host_page_size. - */ + sparse = container_of(hdr, struct vfio_region_info_cap_sparse_mmap, header); - bitmap->pgsize = qemu_real_host_page_size(); - bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) / - BITS_PER_BYTE; + trace_vfio_region_sparse_mmap_header(region->vbasedev->name, + region->nr, sparse->nr_areas); - if (bitmap->size > container->max_dirty_bitmap_size) { - error_report("UNMAP: Size of bitmap too big 0x%"PRIx64, - (uint64_t)bitmap->size); - ret = -E2BIG; - goto unmap_exit; - } + region->mmaps = g_new0(VFIOMmap, sparse->nr_areas); - bitmap->data = g_try_malloc0(bitmap->size); - if (!bitmap->data) { - ret = -ENOMEM; - goto unmap_exit; + for (i = 0, j = 0; i < sparse->nr_areas; i++) { + if (sparse->areas[i].size) { + trace_vfio_region_sparse_mmap_entry(i, sparse->areas[i].offset, + sparse->areas[i].offset + + sparse->areas[i].size - 1); + region->mmaps[j].offset = sparse->areas[i].offset; + region->mmaps[j].size = sparse->areas[i].size; + j++; + } } - ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap); - if (!ret) { - cpu_physical_memory_set_dirty_lebitmap((unsigned long *)bitmap->data, - iotlb->translated_addr, pages); - } else { - error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m"); - } + region->nr_mmaps = j; + region->mmaps = g_realloc(region->mmaps, j * sizeof(VFIOMmap)); - g_free(bitmap->data); -unmap_exit: - g_free(unmap); - return ret; + return 0; } -/* - * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86 - */ -static int vfio_dma_unmap(VFIOContainer *container, - hwaddr iova, ram_addr_t size, - IOMMUTLBEntry *iotlb) +int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region, + int index, const char *name) { - struct vfio_iommu_type1_dma_unmap unmap = { - .argsz = sizeof(unmap), - .flags = 0, - .iova = iova, - .size = size, - }; + struct vfio_region_info *info; + int ret; - if (iotlb && container->dirty_pages_supported && - vfio_devices_all_running_and_saving(container)) { - return vfio_dma_unmap_bitmap(container, iova, size, iotlb); + ret = vfio_get_region_info(vbasedev, index, &info); + if (ret) { + return ret; } - while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) { - /* - * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c - * v4.15) where an overflow in its wrap-around check prevents us from - * unmapping the last page of the address space. Test for the error - * condition and re-try the unmap excluding the last page. The - * expectation is that we've never mapped the last page anyway and this - * unmap request comes via vIOMMU support which also makes it unlikely - * that this page is used. This bug was introduced well after type1 v2 - * support was introduced, so we shouldn't need to test for v1. A fix - * is queued for kernel v5.0 so this workaround can be removed once - * affected kernels are sufficiently deprecated. - */ - if (errno == EINVAL && unmap.size && !(unmap.iova + unmap.size) && - container->iommu_type == VFIO_TYPE1v2_IOMMU) { - trace_vfio_dma_unmap_overflow_workaround(); - unmap.size -= 1ULL << ctz64(container->pgsizes); - continue; - } - error_report("VFIO_UNMAP_DMA failed: %s", strerror(errno)); - return -errno; - } + region->vbasedev = vbasedev; + region->flags = info->flags; + region->size = info->size; + region->fd_offset = info->offset; + region->nr = index; - return 0; -} + if (region->size) { + region->mem = g_new0(MemoryRegion, 1); + memory_region_init_io(region->mem, obj, &vfio_region_ops, + region, name, region->size); -static int vfio_dma_map(VFIOContainer *container, hwaddr iova, - ram_addr_t size, void *vaddr, bool readonly) -{ - struct vfio_iommu_type1_dma_map map = { - .argsz = sizeof(map), - .flags = VFIO_DMA_MAP_FLAG_READ, - .vaddr = (__u64)(uintptr_t)vaddr, - .iova = iova, - .size = size, - }; + if (!vbasedev->no_mmap && + region->flags & VFIO_REGION_INFO_FLAG_MMAP) { - if (!readonly) { - map.flags |= VFIO_DMA_MAP_FLAG_WRITE; - } + ret = vfio_setup_region_sparse_mmaps(region, info); - /* - * Try the mapping, if it fails with EBUSY, unmap the region and try - * again. This shouldn't be necessary, but we sometimes see it in - * the VGA ROM space. - */ - if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 || - (errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 && - ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) { - return 0; + if (ret) { + region->nr_mmaps = 1; + region->mmaps = g_new0(VFIOMmap, region->nr_mmaps); + region->mmaps[0].offset = 0; + region->mmaps[0].size = region->size; + } + } } - error_report("VFIO_MAP_DMA failed: %s", strerror(errno)); - return -errno; + g_free(info); + + trace_vfio_region_setup(vbasedev->name, index, name, + region->flags, region->fd_offset, region->size); + return 0; } -static void vfio_host_win_add(VFIOContainer *container, - hwaddr min_iova, hwaddr max_iova, - uint64_t iova_pgsizes) +static void vfio_subregion_unmap(VFIORegion *region, int index) { - VFIOHostDMAWindow *hostwin; - - QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) { - if (ranges_overlap(hostwin->min_iova, - hostwin->max_iova - hostwin->min_iova + 1, - min_iova, - max_iova - min_iova + 1)) { - hw_error("%s: Overlapped IOMMU are not enabled", __func__); - } - } - - hostwin = g_malloc0(sizeof(*hostwin)); - - hostwin->min_iova = min_iova; - hostwin->max_iova = max_iova; - hostwin->iova_pgsizes = iova_pgsizes; - QLIST_INSERT_HEAD(&container->hostwin_list, hostwin, hostwin_next); + trace_vfio_region_unmap(memory_region_name(®ion->mmaps[index].mem), + region->mmaps[index].offset, + region->mmaps[index].offset + + region->mmaps[index].size - 1); + memory_region_del_subregion(region->mem, ®ion->mmaps[index].mem); + munmap(region->mmaps[index].mmap, region->mmaps[index].size); + object_unparent(OBJECT(®ion->mmaps[index].mem)); + region->mmaps[index].mmap = NULL; } -static int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova, - hwaddr max_iova) +int vfio_region_mmap(VFIORegion *region) { - VFIOHostDMAWindow *hostwin; + int i, prot = 0; + char *name; - QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) { - if (hostwin->min_iova == min_iova && hostwin->max_iova == max_iova) { - QLIST_REMOVE(hostwin, hostwin_next); - g_free(hostwin); - return 0; - } + if (!region->mem) { + return 0; } - return -1; -} - -static bool vfio_listener_skipped_section(MemoryRegionSection *section) -{ - return (!memory_region_is_ram(section->mr) && - !memory_region_is_iommu(section->mr)) || - memory_region_is_protected(section->mr) || - /* - * Sizing an enabled 64-bit BAR can cause spurious mappings to - * addresses in the upper part of the 64-bit address space. These - * are never accessed by the CPU and beyond the address width of - * some IOMMU hardware. TODO: VFIO should tell us the IOMMU width. - */ - section->offset_within_address_space & (1ULL << 63); -} + prot |= region->flags & VFIO_REGION_INFO_FLAG_READ ? PROT_READ : 0; + prot |= region->flags & VFIO_REGION_INFO_FLAG_WRITE ? PROT_WRITE : 0; -/* Called with rcu_read_lock held. */ -static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr, - ram_addr_t *ram_addr, bool *read_only) -{ - MemoryRegion *mr; - hwaddr xlat; - hwaddr len = iotlb->addr_mask + 1; - bool writable = iotlb->perm & IOMMU_WO; + for (i = 0; i < region->nr_mmaps; i++) { + region->mmaps[i].mmap = mmap(NULL, region->mmaps[i].size, prot, + MAP_SHARED, region->vbasedev->fd, + region->fd_offset + + region->mmaps[i].offset); + if (region->mmaps[i].mmap == MAP_FAILED) { + int ret = -errno; - /* - * The IOMMU TLB entry we have just covers translation through - * this IOMMU to its immediate target. We need to translate - * it the rest of the way through to memory. - */ - mr = address_space_translate(&address_space_memory, - iotlb->translated_addr, - &xlat, &len, writable, - MEMTXATTRS_UNSPECIFIED); - if (!memory_region_is_ram(mr)) { - error_report("iommu map to non memory area %"HWADDR_PRIx"", - xlat); - return false; - } else if (memory_region_has_ram_discard_manager(mr)) { - RamDiscardManager *rdm = memory_region_get_ram_discard_manager(mr); - MemoryRegionSection tmp = { - .mr = mr, - .offset_within_region = xlat, - .size = int128_make64(len), - }; - - /* - * Malicious VMs can map memory into the IOMMU, which is expected - * to remain discarded. vfio will pin all pages, populating memory. - * Disallow that. vmstate priorities make sure any RamDiscardManager - * were already restored before IOMMUs are restored. - */ - if (!ram_discard_manager_is_populated(rdm, &tmp)) { - error_report("iommu map to discarded memory (e.g., unplugged via" - " virtio-mem): %"HWADDR_PRIx"", - iotlb->translated_addr); - return false; - } + trace_vfio_region_mmap_fault(memory_region_name(region->mem), i, + region->fd_offset + + region->mmaps[i].offset, + region->fd_offset + + region->mmaps[i].offset + + region->mmaps[i].size - 1, ret); - /* - * Malicious VMs might trigger discarding of IOMMU-mapped memory. The - * pages will remain pinned inside vfio until unmapped, resulting in a - * higher memory consumption than expected. If memory would get - * populated again later, there would be an inconsistency between pages - * pinned by vfio and pages seen by QEMU. This is the case until - * unmapped from the IOMMU (e.g., during device reset). - * - * With malicious guests, we really only care about pinning more memory - * than expected. RLIMIT_MEMLOCK set for the user/process can never be - * exceeded and can be used to mitigate this problem. - */ - warn_report_once("Using vfio with vIOMMUs and coordinated discarding of" - " RAM (e.g., virtio-mem) works, however, malicious" - " guests can trigger pinning of more memory than" - " intended via an IOMMU. It's possible to mitigate " - " by setting/adjusting RLIMIT_MEMLOCK."); - } + region->mmaps[i].mmap = NULL; - /* - * Translation truncates length to the IOMMU page size, - * check that it did not truncate too much. - */ - if (len & iotlb->addr_mask) { - error_report("iommu has granularity incompatible with target AS"); - return false; - } + for (i--; i >= 0; i--) { + vfio_subregion_unmap(region, i); + } - if (vaddr) { - *vaddr = memory_region_get_ram_ptr(mr) + xlat; - } + return ret; + } - if (ram_addr) { - *ram_addr = memory_region_get_ram_addr(mr) + xlat; - } + name = g_strdup_printf("%s mmaps[%d]", + memory_region_name(region->mem), i); + memory_region_init_ram_device_ptr(®ion->mmaps[i].mem, + memory_region_owner(region->mem), + name, region->mmaps[i].size, + region->mmaps[i].mmap); + g_free(name); + memory_region_add_subregion(region->mem, region->mmaps[i].offset, + ®ion->mmaps[i].mem); - if (read_only) { - *read_only = !writable || mr->readonly; + trace_vfio_region_mmap(memory_region_name(®ion->mmaps[i].mem), + region->mmaps[i].offset, + region->mmaps[i].offset + + region->mmaps[i].size - 1); } - return true; + return 0; } -static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) +void vfio_region_unmap(VFIORegion *region) { - VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n); - VFIOContainer *container = giommu->container; - hwaddr iova = iotlb->iova + giommu->iommu_offset; - void *vaddr; - int ret; - - trace_vfio_iommu_map_notify(iotlb->perm == IOMMU_NONE ? "UNMAP" : "MAP", - iova, iova + iotlb->addr_mask); + int i; - if (iotlb->target_as != &address_space_memory) { - error_report("Wrong target AS \"%s\", only system memory is allowed", - iotlb->target_as->name ? iotlb->target_as->name : "none"); + if (!region->mem) { return; } - rcu_read_lock(); - - if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) { - bool read_only; - - if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only)) { - goto out; - } - /* - * vaddr is only valid until rcu_read_unlock(). But after - * vfio_dma_map has set up the mapping the pages will be - * pinned by the kernel. This makes sure that the RAM backend - * of vaddr will always be there, even if the memory object is - * destroyed and its backing memory munmap-ed. - */ - ret = vfio_dma_map(container, iova, - iotlb->addr_mask + 1, vaddr, - read_only); - if (ret) { - error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", " - "0x%"HWADDR_PRIx", %p) = %d (%m)", - container, iova, - iotlb->addr_mask + 1, vaddr, ret); - } - } else { - ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1, iotlb); - if (ret) { - error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", " - "0x%"HWADDR_PRIx") = %d (%m)", - container, iova, - iotlb->addr_mask + 1, ret); + for (i = 0; i < region->nr_mmaps; i++) { + if (region->mmaps[i].mmap) { + vfio_subregion_unmap(region, i); } } -out: - rcu_read_unlock(); } -static void vfio_ram_discard_notify_discard(RamDiscardListener *rdl, - MemoryRegionSection *section) +void vfio_region_exit(VFIORegion *region) { - VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener, - listener); - const hwaddr size = int128_get64(section->size); - const hwaddr iova = section->offset_within_address_space; - int ret; + int i; - /* Unmap with a single call. */ - ret = vfio_dma_unmap(vrdl->container, iova, size , NULL); - if (ret) { - error_report("%s: vfio_dma_unmap() failed: %s", __func__, - strerror(-ret)); + if (!region->mem) { + return; } -} - -static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl, - MemoryRegionSection *section) -{ - VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener, - listener); - const hwaddr end = section->offset_within_region + - int128_get64(section->size); - hwaddr start, next, iova; - void *vaddr; - int ret; - /* - * Map in (aligned within memory region) minimum granularity, so we can - * unmap in minimum granularity later. - */ - for (start = section->offset_within_region; start < end; start = next) { - next = ROUND_UP(start + 1, vrdl->granularity); - next = MIN(next, end); - - iova = start - section->offset_within_region + - section->offset_within_address_space; - vaddr = memory_region_get_ram_ptr(section->mr) + start; - - ret = vfio_dma_map(vrdl->container, iova, next - start, - vaddr, section->readonly); - if (ret) { - /* Rollback */ - vfio_ram_discard_notify_discard(rdl, section); - return ret; + for (i = 0; i < region->nr_mmaps; i++) { + if (region->mmaps[i].mmap) { + memory_region_del_subregion(region->mem, ®ion->mmaps[i].mem); } } - return 0; + + trace_vfio_region_exit(region->vbasedev->name, region->nr); } -static void vfio_register_ram_discard_listener(VFIOContainer *container, - MemoryRegionSection *section) +void vfio_region_finalize(VFIORegion *region) { - RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr); - VFIORamDiscardListener *vrdl; - - /* Ignore some corner cases not relevant in practice. */ - g_assert(QEMU_IS_ALIGNED(section->offset_within_region, TARGET_PAGE_SIZE)); - g_assert(QEMU_IS_ALIGNED(section->offset_within_address_space, - TARGET_PAGE_SIZE)); - g_assert(QEMU_IS_ALIGNED(int128_get64(section->size), TARGET_PAGE_SIZE)); - - vrdl = g_new0(VFIORamDiscardListener, 1); - vrdl->container = container; - vrdl->mr = section->mr; - vrdl->offset_within_address_space = section->offset_within_address_space; - vrdl->size = int128_get64(section->size); - vrdl->granularity = ram_discard_manager_get_min_granularity(rdm, - section->mr); - - g_assert(vrdl->granularity && is_power_of_2(vrdl->granularity)); - g_assert(container->pgsizes && - vrdl->granularity >= 1ULL << ctz64(container->pgsizes)); - - ram_discard_listener_init(&vrdl->listener, - vfio_ram_discard_notify_populate, - vfio_ram_discard_notify_discard, true); - ram_discard_manager_register_listener(rdm, &vrdl->listener, section); - QLIST_INSERT_HEAD(&container->vrdl_list, vrdl, next); + int i; - /* - * Sanity-check if we have a theoretically problematic setup where we could - * exceed the maximum number of possible DMA mappings over time. We assume - * that each mapped section in the same address space as a RamDiscardManager - * section consumes exactly one DMA mapping, with the exception of - * RamDiscardManager sections; i.e., we don't expect to have gIOMMU sections - * in the same address space as RamDiscardManager sections. - * - * We assume that each section in the address space consumes one memslot. - * We take the number of KVM memory slots as a best guess for the maximum - * number of sections in the address space we could have over time, - * also consuming DMA mappings. - */ - if (container->dma_max_mappings) { - unsigned int vrdl_count = 0, vrdl_mappings = 0, max_memslots = 512; - -#ifdef CONFIG_KVM - if (kvm_enabled()) { - max_memslots = kvm_get_max_memslots(); - } -#endif - - QLIST_FOREACH(vrdl, &container->vrdl_list, next) { - hwaddr start, end; - - start = QEMU_ALIGN_DOWN(vrdl->offset_within_address_space, - vrdl->granularity); - end = ROUND_UP(vrdl->offset_within_address_space + vrdl->size, - vrdl->granularity); - vrdl_mappings += (end - start) / vrdl->granularity; - vrdl_count++; - } - - if (vrdl_mappings + max_memslots - vrdl_count > - container->dma_max_mappings) { - warn_report("%s: possibly running out of DMA mappings. E.g., try" - " increasing the 'block-size' of virtio-mem devies." - " Maximum possible DMA mappings: %d, Maximum possible" - " memslots: %d", __func__, container->dma_max_mappings, - max_memslots); - } - } -} - -static void vfio_unregister_ram_discard_listener(VFIOContainer *container, - MemoryRegionSection *section) -{ - RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr); - VFIORamDiscardListener *vrdl = NULL; - - QLIST_FOREACH(vrdl, &container->vrdl_list, next) { - if (vrdl->mr == section->mr && - vrdl->offset_within_address_space == - section->offset_within_address_space) { - break; - } - } - - if (!vrdl) { - hw_error("vfio: Trying to unregister missing RAM discard listener"); - } - - ram_discard_manager_unregister_listener(rdm, &vrdl->listener); - QLIST_REMOVE(vrdl, next); - g_free(vrdl); -} - -static bool vfio_known_safe_misalignment(MemoryRegionSection *section) -{ - MemoryRegion *mr = section->mr; - - if (!TPM_IS_CRB(mr->owner)) { - return false; - } - - /* this is a known safe misaligned region, just trace for debug purpose */ - trace_vfio_known_safe_misalignment(memory_region_name(mr), - section->offset_within_address_space, - section->offset_within_region, - qemu_real_host_page_size()); - return true; -} - -static void vfio_listener_region_add(MemoryListener *listener, - MemoryRegionSection *section) -{ - VFIOContainer *container = container_of(listener, VFIOContainer, listener); - hwaddr iova, end; - Int128 llend, llsize; - void *vaddr; - int ret; - VFIOHostDMAWindow *hostwin; - bool hostwin_found; - Error *err = NULL; - - if (vfio_listener_skipped_section(section)) { - trace_vfio_listener_region_add_skip( - section->offset_within_address_space, - section->offset_within_address_space + - int128_get64(int128_sub(section->size, int128_one()))); - return; - } - - if (unlikely((section->offset_within_address_space & - ~qemu_real_host_page_mask()) != - (section->offset_within_region & ~qemu_real_host_page_mask()))) { - if (!vfio_known_safe_misalignment(section)) { - error_report("%s received unaligned region %s iova=0x%"PRIx64 - " offset_within_region=0x%"PRIx64 - " qemu_real_host_page_size=0x%"PRIxPTR, - __func__, memory_region_name(section->mr), - section->offset_within_address_space, - section->offset_within_region, - qemu_real_host_page_size()); - } - return; - } - - iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space); - llend = int128_make64(section->offset_within_address_space); - llend = int128_add(llend, section->size); - llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask())); - - if (int128_ge(int128_make64(iova), llend)) { - if (memory_region_is_ram_device(section->mr)) { - trace_vfio_listener_region_add_no_dma_map( - memory_region_name(section->mr), - section->offset_within_address_space, - int128_getlo(section->size), - qemu_real_host_page_size()); - } - return; - } - end = int128_get64(int128_sub(llend, int128_one())); - - if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) { - hwaddr pgsize = 0; - - /* For now intersections are not allowed, we may relax this later */ - QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) { - if (ranges_overlap(hostwin->min_iova, - hostwin->max_iova - hostwin->min_iova + 1, - section->offset_within_address_space, - int128_get64(section->size))) { - error_setg(&err, - "region [0x%"PRIx64",0x%"PRIx64"] overlaps with existing" - "host DMA window [0x%"PRIx64",0x%"PRIx64"]", - section->offset_within_address_space, - section->offset_within_address_space + - int128_get64(section->size) - 1, - hostwin->min_iova, hostwin->max_iova); - goto fail; - } - } - - ret = vfio_spapr_create_window(container, section, &pgsize); - if (ret) { - error_setg_errno(&err, -ret, "Failed to create SPAPR window"); - goto fail; - } - - vfio_host_win_add(container, section->offset_within_address_space, - section->offset_within_address_space + - int128_get64(section->size) - 1, pgsize); -#ifdef CONFIG_KVM - if (kvm_enabled()) { - VFIOGroup *group; - IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr); - struct kvm_vfio_spapr_tce param; - struct kvm_device_attr attr = { - .group = KVM_DEV_VFIO_GROUP, - .attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE, - .addr = (uint64_t)(unsigned long)¶m, - }; - - if (!memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_SPAPR_TCE_FD, - ¶m.tablefd)) { - QLIST_FOREACH(group, &container->group_list, container_next) { - param.groupfd = group->fd; - if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) { - error_report("vfio: failed to setup fd %d " - "for a group with fd %d: %s", - param.tablefd, param.groupfd, - strerror(errno)); - return; - } - trace_vfio_spapr_group_attach(param.groupfd, param.tablefd); - } - } - } -#endif - } - - hostwin_found = false; - QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) { - if (hostwin->min_iova <= iova && end <= hostwin->max_iova) { - hostwin_found = true; - break; - } - } - - if (!hostwin_found) { - error_setg(&err, "Container %p can't map guest IOVA region" - " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx, container, iova, end); - goto fail; - } - - memory_region_ref(section->mr); - - if (memory_region_is_iommu(section->mr)) { - VFIOGuestIOMMU *giommu; - IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr); - int iommu_idx; - - trace_vfio_listener_region_add_iommu(iova, end); - /* - * FIXME: For VFIO iommu types which have KVM acceleration to - * avoid bouncing all map/unmaps through qemu this way, this - * would be the right place to wire that up (tell the KVM - * device emulation the VFIO iommu handles to use). - */ - giommu = g_malloc0(sizeof(*giommu)); - giommu->iommu_mr = iommu_mr; - giommu->iommu_offset = section->offset_within_address_space - - section->offset_within_region; - giommu->container = container; - llend = int128_add(int128_make64(section->offset_within_region), - section->size); - llend = int128_sub(llend, int128_one()); - iommu_idx = memory_region_iommu_attrs_to_index(iommu_mr, - MEMTXATTRS_UNSPECIFIED); - iommu_notifier_init(&giommu->n, vfio_iommu_map_notify, - IOMMU_NOTIFIER_IOTLB_EVENTS, - section->offset_within_region, - int128_get64(llend), - iommu_idx); - - ret = memory_region_iommu_set_page_size_mask(giommu->iommu_mr, - container->pgsizes, - &err); - if (ret) { - g_free(giommu); - goto fail; - } - - ret = memory_region_register_iommu_notifier(section->mr, &giommu->n, - &err); - if (ret) { - g_free(giommu); - goto fail; - } - QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next); - memory_region_iommu_replay(giommu->iommu_mr, &giommu->n); - - return; - } - - /* Here we assume that memory_region_is_ram(section->mr)==true */ - - /* - * For RAM memory regions with a RamDiscardManager, we only want to map the - * actually populated parts - and update the mapping whenever we're notified - * about changes. - */ - if (memory_region_has_ram_discard_manager(section->mr)) { - vfio_register_ram_discard_listener(container, section); - return; - } - - vaddr = memory_region_get_ram_ptr(section->mr) + - section->offset_within_region + - (iova - section->offset_within_address_space); - - trace_vfio_listener_region_add_ram(iova, end, vaddr); - - llsize = int128_sub(llend, int128_make64(iova)); - - if (memory_region_is_ram_device(section->mr)) { - hwaddr pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1; - - if ((iova & pgmask) || (int128_get64(llsize) & pgmask)) { - trace_vfio_listener_region_add_no_dma_map( - memory_region_name(section->mr), - section->offset_within_address_space, - int128_getlo(section->size), - pgmask + 1); - return; - } - } - - ret = vfio_dma_map(container, iova, int128_get64(llsize), - vaddr, section->readonly); - if (ret) { - error_setg(&err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", " - "0x%"HWADDR_PRIx", %p) = %d (%m)", - container, iova, int128_get64(llsize), vaddr, ret); - if (memory_region_is_ram_device(section->mr)) { - /* Allow unexpected mappings not to be fatal for RAM devices */ - error_report_err(err); - return; - } - goto fail; - } - - return; - -fail: - if (memory_region_is_ram_device(section->mr)) { - error_report("failed to vfio_dma_map. pci p2p may not work"); - return; - } - /* - * On the initfn path, store the first error in the container so we - * can gracefully fail. Runtime, there's not much we can do other - * than throw a hardware error. - */ - if (!container->initialized) { - if (!container->error) { - error_propagate_prepend(&container->error, err, - "Region %s: ", - memory_region_name(section->mr)); - } else { - error_free(err); - } - } else { - error_report_err(err); - hw_error("vfio: DMA mapping failed, unable to continue"); - } -} - -static void vfio_listener_region_del(MemoryListener *listener, - MemoryRegionSection *section) -{ - VFIOContainer *container = container_of(listener, VFIOContainer, listener); - hwaddr iova, end; - Int128 llend, llsize; - int ret; - bool try_unmap = true; - - if (vfio_listener_skipped_section(section)) { - trace_vfio_listener_region_del_skip( - section->offset_within_address_space, - section->offset_within_address_space + - int128_get64(int128_sub(section->size, int128_one()))); - return; - } - - if (unlikely((section->offset_within_address_space & - ~qemu_real_host_page_mask()) != - (section->offset_within_region & ~qemu_real_host_page_mask()))) { - error_report("%s received unaligned region", __func__); - return; - } - - if (memory_region_is_iommu(section->mr)) { - VFIOGuestIOMMU *giommu; - - QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) { - if (MEMORY_REGION(giommu->iommu_mr) == section->mr && - giommu->n.start == section->offset_within_region) { - memory_region_unregister_iommu_notifier(section->mr, - &giommu->n); - QLIST_REMOVE(giommu, giommu_next); - g_free(giommu); - break; - } - } - - /* - * FIXME: We assume the one big unmap below is adequate to - * remove any individual page mappings in the IOMMU which - * might have been copied into VFIO. This works for a page table - * based IOMMU where a big unmap flattens a large range of IO-PTEs. - * That may not be true for all IOMMU types. - */ - } - - iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space); - llend = int128_make64(section->offset_within_address_space); - llend = int128_add(llend, section->size); - llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask())); - - if (int128_ge(int128_make64(iova), llend)) { - return; - } - end = int128_get64(int128_sub(llend, int128_one())); - - llsize = int128_sub(llend, int128_make64(iova)); - - trace_vfio_listener_region_del(iova, end); - - if (memory_region_is_ram_device(section->mr)) { - hwaddr pgmask; - VFIOHostDMAWindow *hostwin; - bool hostwin_found = false; - - QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) { - if (hostwin->min_iova <= iova && end <= hostwin->max_iova) { - hostwin_found = true; - break; - } - } - assert(hostwin_found); /* or region_add() would have failed */ - - pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1; - try_unmap = !((iova & pgmask) || (int128_get64(llsize) & pgmask)); - } else if (memory_region_has_ram_discard_manager(section->mr)) { - vfio_unregister_ram_discard_listener(container, section); - /* Unregistering will trigger an unmap. */ - try_unmap = false; - } - - if (try_unmap) { - if (int128_eq(llsize, int128_2_64())) { - /* The unmap ioctl doesn't accept a full 64-bit span. */ - llsize = int128_rshift(llsize, 1); - ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL); - if (ret) { - error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", " - "0x%"HWADDR_PRIx") = %d (%m)", - container, iova, int128_get64(llsize), ret); - } - iova += int128_get64(llsize); - } - ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL); - if (ret) { - error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", " - "0x%"HWADDR_PRIx") = %d (%m)", - container, iova, int128_get64(llsize), ret); - } - } - - memory_region_unref(section->mr); - - if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) { - vfio_spapr_remove_window(container, - section->offset_within_address_space); - if (vfio_host_win_del(container, - section->offset_within_address_space, - section->offset_within_address_space + - int128_get64(section->size) - 1) < 0) { - hw_error("%s: Cannot delete missing window at %"HWADDR_PRIx, - __func__, section->offset_within_address_space); - } - } -} - -static void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start) -{ - int ret; - struct vfio_iommu_type1_dirty_bitmap dirty = { - .argsz = sizeof(dirty), - }; - - if (start) { - dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START; - } else { - dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP; - } - - ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty); - if (ret) { - error_report("Failed to set dirty tracking flag 0x%x errno: %d", - dirty.flags, errno); - } -} - -static void vfio_listener_log_global_start(MemoryListener *listener) -{ - VFIOContainer *container = container_of(listener, VFIOContainer, listener); - - vfio_set_dirty_page_tracking(container, true); -} - -static void vfio_listener_log_global_stop(MemoryListener *listener) -{ - VFIOContainer *container = container_of(listener, VFIOContainer, listener); - - vfio_set_dirty_page_tracking(container, false); -} - -static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova, - uint64_t size, ram_addr_t ram_addr) -{ - struct vfio_iommu_type1_dirty_bitmap *dbitmap; - struct vfio_iommu_type1_dirty_bitmap_get *range; - uint64_t pages; - int ret; - - dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range)); - - dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range); - dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP; - range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data; - range->iova = iova; - range->size = size; - - /* - * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of - * qemu_real_host_page_size to mark those dirty. Hence set bitmap's pgsize - * to qemu_real_host_page_size. - */ - range->bitmap.pgsize = qemu_real_host_page_size(); - - pages = REAL_HOST_PAGE_ALIGN(range->size) / qemu_real_host_page_size(); - range->bitmap.size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) / - BITS_PER_BYTE; - range->bitmap.data = g_try_malloc0(range->bitmap.size); - if (!range->bitmap.data) { - ret = -ENOMEM; - goto err_out; - } - - ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap); - if (ret) { - error_report("Failed to get dirty bitmap for iova: 0x%"PRIx64 - " size: 0x%"PRIx64" err: %d", (uint64_t)range->iova, - (uint64_t)range->size, errno); - goto err_out; - } - - cpu_physical_memory_set_dirty_lebitmap((unsigned long *)range->bitmap.data, - ram_addr, pages); - - trace_vfio_get_dirty_bitmap(container->fd, range->iova, range->size, - range->bitmap.size, ram_addr); -err_out: - g_free(range->bitmap.data); - g_free(dbitmap); - - return ret; -} - -typedef struct { - IOMMUNotifier n; - VFIOGuestIOMMU *giommu; -} vfio_giommu_dirty_notifier; - -static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) -{ - vfio_giommu_dirty_notifier *gdn = container_of(n, - vfio_giommu_dirty_notifier, n); - VFIOGuestIOMMU *giommu = gdn->giommu; - VFIOContainer *container = giommu->container; - hwaddr iova = iotlb->iova + giommu->iommu_offset; - ram_addr_t translated_addr; - - trace_vfio_iommu_map_dirty_notify(iova, iova + iotlb->addr_mask); - - if (iotlb->target_as != &address_space_memory) { - error_report("Wrong target AS \"%s\", only system memory is allowed", - iotlb->target_as->name ? iotlb->target_as->name : "none"); - return; - } - - rcu_read_lock(); - if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL)) { - int ret; - - ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1, - translated_addr); - if (ret) { - error_report("vfio_iommu_map_dirty_notify(%p, 0x%"HWADDR_PRIx", " - "0x%"HWADDR_PRIx") = %d (%m)", - container, iova, - iotlb->addr_mask + 1, ret); - } - } - rcu_read_unlock(); -} - -static int vfio_ram_discard_get_dirty_bitmap(MemoryRegionSection *section, - void *opaque) -{ - const hwaddr size = int128_get64(section->size); - const hwaddr iova = section->offset_within_address_space; - const ram_addr_t ram_addr = memory_region_get_ram_addr(section->mr) + - section->offset_within_region; - VFIORamDiscardListener *vrdl = opaque; - - /* - * Sync the whole mapped region (spanning multiple individual mappings) - * in one go. - */ - return vfio_get_dirty_bitmap(vrdl->container, iova, size, ram_addr); -} - -static int vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainer *container, - MemoryRegionSection *section) -{ - RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr); - VFIORamDiscardListener *vrdl = NULL; - - QLIST_FOREACH(vrdl, &container->vrdl_list, next) { - if (vrdl->mr == section->mr && - vrdl->offset_within_address_space == - section->offset_within_address_space) { - break; - } - } - - if (!vrdl) { - hw_error("vfio: Trying to sync missing RAM discard listener"); - } - - /* - * We only want/can synchronize the bitmap for actually mapped parts - - * which correspond to populated parts. Replay all populated parts. - */ - return ram_discard_manager_replay_populated(rdm, section, - vfio_ram_discard_get_dirty_bitmap, - &vrdl); -} - -static int vfio_sync_dirty_bitmap(VFIOContainer *container, - MemoryRegionSection *section) -{ - ram_addr_t ram_addr; - - if (memory_region_is_iommu(section->mr)) { - VFIOGuestIOMMU *giommu; - - QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) { - if (MEMORY_REGION(giommu->iommu_mr) == section->mr && - giommu->n.start == section->offset_within_region) { - Int128 llend; - vfio_giommu_dirty_notifier gdn = { .giommu = giommu }; - int idx = memory_region_iommu_attrs_to_index(giommu->iommu_mr, - MEMTXATTRS_UNSPECIFIED); - - llend = int128_add(int128_make64(section->offset_within_region), - section->size); - llend = int128_sub(llend, int128_one()); - - iommu_notifier_init(&gdn.n, - vfio_iommu_map_dirty_notify, - IOMMU_NOTIFIER_MAP, - section->offset_within_region, - int128_get64(llend), - idx); - memory_region_iommu_replay(giommu->iommu_mr, &gdn.n); - break; - } - } - return 0; - } else if (memory_region_has_ram_discard_manager(section->mr)) { - return vfio_sync_ram_discard_listener_dirty_bitmap(container, section); - } - - ram_addr = memory_region_get_ram_addr(section->mr) + - section->offset_within_region; - - return vfio_get_dirty_bitmap(container, - REAL_HOST_PAGE_ALIGN(section->offset_within_address_space), - int128_get64(section->size), ram_addr); -} - -static void vfio_listener_log_sync(MemoryListener *listener, - MemoryRegionSection *section) -{ - VFIOContainer *container = container_of(listener, VFIOContainer, listener); - - if (vfio_listener_skipped_section(section) || - !container->dirty_pages_supported) { - return; - } - - if (vfio_devices_all_dirty_tracking(container)) { - vfio_sync_dirty_bitmap(container, section); - } -} - -static const MemoryListener vfio_memory_listener = { - .name = "vfio", - .region_add = vfio_listener_region_add, - .region_del = vfio_listener_region_del, - .log_global_start = vfio_listener_log_global_start, - .log_global_stop = vfio_listener_log_global_stop, - .log_sync = vfio_listener_log_sync, -}; - -static void vfio_listener_release(VFIOContainer *container) -{ - memory_listener_unregister(&container->listener); - if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) { - memory_listener_unregister(&container->prereg_listener); - } -} - -static struct vfio_info_cap_header * -vfio_get_cap(void *ptr, uint32_t cap_offset, uint16_t id) -{ - struct vfio_info_cap_header *hdr; - - for (hdr = ptr + cap_offset; hdr != ptr; hdr = ptr + hdr->next) { - if (hdr->id == id) { - return hdr; - } - } - - return NULL; -} - -struct vfio_info_cap_header * -vfio_get_region_info_cap(struct vfio_region_info *info, uint16_t id) -{ - if (!(info->flags & VFIO_REGION_INFO_FLAG_CAPS)) { - return NULL; - } - - return vfio_get_cap((void *)info, info->cap_offset, id); -} - -static struct vfio_info_cap_header * -vfio_get_iommu_type1_info_cap(struct vfio_iommu_type1_info *info, uint16_t id) -{ - if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) { - return NULL; - } - - return vfio_get_cap((void *)info, info->cap_offset, id); -} - -struct vfio_info_cap_header * -vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id) -{ - if (!(info->flags & VFIO_DEVICE_FLAGS_CAPS)) { - return NULL; - } - - return vfio_get_cap((void *)info, info->cap_offset, id); -} - -bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info, - unsigned int *avail) -{ - struct vfio_info_cap_header *hdr; - struct vfio_iommu_type1_info_dma_avail *cap; - - /* If the capability cannot be found, assume no DMA limiting */ - hdr = vfio_get_iommu_type1_info_cap(info, - VFIO_IOMMU_TYPE1_INFO_DMA_AVAIL); - if (hdr == NULL) { - return false; - } - - if (avail != NULL) { - cap = (void *) hdr; - *avail = cap->avail; - } - - return true; -} - -static int vfio_setup_region_sparse_mmaps(VFIORegion *region, - struct vfio_region_info *info) -{ - struct vfio_info_cap_header *hdr; - struct vfio_region_info_cap_sparse_mmap *sparse; - int i, j; - - hdr = vfio_get_region_info_cap(info, VFIO_REGION_INFO_CAP_SPARSE_MMAP); - if (!hdr) { - return -ENODEV; - } - - sparse = container_of(hdr, struct vfio_region_info_cap_sparse_mmap, header); - - trace_vfio_region_sparse_mmap_header(region->vbasedev->name, - region->nr, sparse->nr_areas); - - region->mmaps = g_new0(VFIOMmap, sparse->nr_areas); - - for (i = 0, j = 0; i < sparse->nr_areas; i++) { - if (sparse->areas[i].size) { - trace_vfio_region_sparse_mmap_entry(i, sparse->areas[i].offset, - sparse->areas[i].offset + - sparse->areas[i].size - 1); - region->mmaps[j].offset = sparse->areas[i].offset; - region->mmaps[j].size = sparse->areas[i].size; - j++; - } - } - - region->nr_mmaps = j; - region->mmaps = g_realloc(region->mmaps, j * sizeof(VFIOMmap)); - - return 0; -} - -int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region, - int index, const char *name) -{ - struct vfio_region_info *info; - int ret; - - ret = vfio_get_region_info(vbasedev, index, &info); - if (ret) { - return ret; - } - - region->vbasedev = vbasedev; - region->flags = info->flags; - region->size = info->size; - region->fd_offset = info->offset; - region->nr = index; - - if (region->size) { - region->mem = g_new0(MemoryRegion, 1); - memory_region_init_io(region->mem, obj, &vfio_region_ops, - region, name, region->size); - - if (!vbasedev->no_mmap && - region->flags & VFIO_REGION_INFO_FLAG_MMAP) { - - ret = vfio_setup_region_sparse_mmaps(region, info); - - if (ret) { - region->nr_mmaps = 1; - region->mmaps = g_new0(VFIOMmap, region->nr_mmaps); - region->mmaps[0].offset = 0; - region->mmaps[0].size = region->size; - } - } - } - - g_free(info); - - trace_vfio_region_setup(vbasedev->name, index, name, - region->flags, region->fd_offset, region->size); - return 0; -} - -static void vfio_subregion_unmap(VFIORegion *region, int index) -{ - trace_vfio_region_unmap(memory_region_name(®ion->mmaps[index].mem), - region->mmaps[index].offset, - region->mmaps[index].offset + - region->mmaps[index].size - 1); - memory_region_del_subregion(region->mem, ®ion->mmaps[index].mem); - munmap(region->mmaps[index].mmap, region->mmaps[index].size); - object_unparent(OBJECT(®ion->mmaps[index].mem)); - region->mmaps[index].mmap = NULL; -} - -int vfio_region_mmap(VFIORegion *region) -{ - int i, prot = 0; - char *name; - - if (!region->mem) { - return 0; - } - - prot |= region->flags & VFIO_REGION_INFO_FLAG_READ ? PROT_READ : 0; - prot |= region->flags & VFIO_REGION_INFO_FLAG_WRITE ? PROT_WRITE : 0; - - for (i = 0; i < region->nr_mmaps; i++) { - region->mmaps[i].mmap = mmap(NULL, region->mmaps[i].size, prot, - MAP_SHARED, region->vbasedev->fd, - region->fd_offset + - region->mmaps[i].offset); - if (region->mmaps[i].mmap == MAP_FAILED) { - int ret = -errno; - - trace_vfio_region_mmap_fault(memory_region_name(region->mem), i, - region->fd_offset + - region->mmaps[i].offset, - region->fd_offset + - region->mmaps[i].offset + - region->mmaps[i].size - 1, ret); - - region->mmaps[i].mmap = NULL; - - for (i--; i >= 0; i--) { - vfio_subregion_unmap(region, i); - } - - return ret; - } - - name = g_strdup_printf("%s mmaps[%d]", - memory_region_name(region->mem), i); - memory_region_init_ram_device_ptr(®ion->mmaps[i].mem, - memory_region_owner(region->mem), - name, region->mmaps[i].size, - region->mmaps[i].mmap); - g_free(name); - memory_region_add_subregion(region->mem, region->mmaps[i].offset, - ®ion->mmaps[i].mem); - - trace_vfio_region_mmap(memory_region_name(®ion->mmaps[i].mem), - region->mmaps[i].offset, - region->mmaps[i].offset + - region->mmaps[i].size - 1); - } - - return 0; -} - -void vfio_region_unmap(VFIORegion *region) -{ - int i; - - if (!region->mem) { - return; - } - - for (i = 0; i < region->nr_mmaps; i++) { - if (region->mmaps[i].mmap) { - vfio_subregion_unmap(region, i); - } - } -} - -void vfio_region_exit(VFIORegion *region) -{ - int i; - - if (!region->mem) { - return; - } - - for (i = 0; i < region->nr_mmaps; i++) { - if (region->mmaps[i].mmap) { - memory_region_del_subregion(region->mem, ®ion->mmaps[i].mem); - } - } - - trace_vfio_region_exit(region->vbasedev->name, region->nr); -} - -void vfio_region_finalize(VFIORegion *region) -{ - int i; - - if (!region->mem) { - return; - } - - for (i = 0; i < region->nr_mmaps; i++) { - if (region->mmaps[i].mmap) { - munmap(region->mmaps[i].mmap, region->mmaps[i].size); - object_unparent(OBJECT(®ion->mmaps[i].mem)); - } - } - - object_unparent(OBJECT(region->mem)); - - g_free(region->mem); - g_free(region->mmaps); - - trace_vfio_region_finalize(region->vbasedev->name, region->nr); - - region->mem = NULL; - region->mmaps = NULL; - region->nr_mmaps = 0; - region->size = 0; - region->flags = 0; - region->nr = 0; -} - -void vfio_region_mmaps_set_enabled(VFIORegion *region, bool enabled) -{ - int i; - - if (!region->mem) { - return; - } - - for (i = 0; i < region->nr_mmaps; i++) { - if (region->mmaps[i].mmap) { - memory_region_set_enabled(®ion->mmaps[i].mem, enabled); - } - } - - trace_vfio_region_mmaps_set_enabled(memory_region_name(region->mem), - enabled); -} - -void vfio_reset_handler(void *opaque) -{ - VFIOGroup *group; - VFIODevice *vbasedev; - - QLIST_FOREACH(group, &vfio_group_list, next) { - QLIST_FOREACH(vbasedev, &group->device_list, next) { - if (vbasedev->dev->realized) { - vbasedev->ops->vfio_compute_needs_reset(vbasedev); - } - } - } - - QLIST_FOREACH(group, &vfio_group_list, next) { - QLIST_FOREACH(vbasedev, &group->device_list, next) { - if (vbasedev->dev->realized && vbasedev->needs_reset) { - vbasedev->ops->vfio_hot_reset_multi(vbasedev); - } - } - } -} - -static void vfio_kvm_device_add_group(VFIOGroup *group) -{ -#ifdef CONFIG_KVM - struct kvm_device_attr attr = { - .group = KVM_DEV_VFIO_GROUP, - .attr = KVM_DEV_VFIO_GROUP_ADD, - .addr = (uint64_t)(unsigned long)&group->fd, - }; - - if (!kvm_enabled()) { - return; - } - - if (vfio_kvm_device_fd < 0) { - struct kvm_create_device cd = { - .type = KVM_DEV_TYPE_VFIO, - }; - - if (kvm_vm_ioctl(kvm_state, KVM_CREATE_DEVICE, &cd)) { - error_report("Failed to create KVM VFIO device: %m"); - return; - } - - vfio_kvm_device_fd = cd.fd; - } - - if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) { - error_report("Failed to add group %d to KVM VFIO device: %m", - group->groupid); - } -#endif -} - -static void vfio_kvm_device_del_group(VFIOGroup *group) -{ -#ifdef CONFIG_KVM - struct kvm_device_attr attr = { - .group = KVM_DEV_VFIO_GROUP, - .attr = KVM_DEV_VFIO_GROUP_DEL, - .addr = (uint64_t)(unsigned long)&group->fd, - }; - - if (vfio_kvm_device_fd < 0) { - return; - } - - if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) { - error_report("Failed to remove group %d from KVM VFIO device: %m", - group->groupid); - } -#endif -} - -static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as) -{ - VFIOAddressSpace *space; - - QLIST_FOREACH(space, &vfio_address_spaces, list) { - if (space->as == as) { - return space; - } - } - - /* No suitable VFIOAddressSpace, create a new one */ - space = g_malloc0(sizeof(*space)); - space->as = as; - QLIST_INIT(&space->containers); - - QLIST_INSERT_HEAD(&vfio_address_spaces, space, list); - - return space; -} - -static void vfio_put_address_space(VFIOAddressSpace *space) -{ - if (QLIST_EMPTY(&space->containers)) { - QLIST_REMOVE(space, list); - g_free(space); - } -} - -/* - * vfio_get_iommu_type - selects the richest iommu_type (v2 first) - */ -static int vfio_get_iommu_type(VFIOContainer *container, - Error **errp) -{ - int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU, - VFIO_SPAPR_TCE_v2_IOMMU, VFIO_SPAPR_TCE_IOMMU }; - int i; - - for (i = 0; i < ARRAY_SIZE(iommu_types); i++) { - if (ioctl(container->fd, VFIO_CHECK_EXTENSION, iommu_types[i])) { - return iommu_types[i]; - } - } - error_setg(errp, "No available IOMMU models"); - return -EINVAL; -} - -static int vfio_init_container(VFIOContainer *container, int group_fd, - Error **errp) -{ - int iommu_type, ret; - - iommu_type = vfio_get_iommu_type(container, errp); - if (iommu_type < 0) { - return iommu_type; - } - - ret = ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container->fd); - if (ret) { - error_setg_errno(errp, errno, "Failed to set group container"); - return -errno; - } - - while (ioctl(container->fd, VFIO_SET_IOMMU, iommu_type)) { - if (iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) { - /* - * On sPAPR, despite the IOMMU subdriver always advertises v1 and - * v2, the running platform may not support v2 and there is no - * way to guess it until an IOMMU group gets added to the container. - * So in case it fails with v2, try v1 as a fallback. - */ - iommu_type = VFIO_SPAPR_TCE_IOMMU; - continue; - } - error_setg_errno(errp, errno, "Failed to set iommu for container"); - return -errno; - } - - container->iommu_type = iommu_type; - return 0; -} - -static int vfio_get_iommu_info(VFIOContainer *container, - struct vfio_iommu_type1_info **info) -{ - - size_t argsz = sizeof(struct vfio_iommu_type1_info); - - *info = g_new0(struct vfio_iommu_type1_info, 1); -again: - (*info)->argsz = argsz; - - if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) { - g_free(*info); - *info = NULL; - return -errno; - } - - if (((*info)->argsz > argsz)) { - argsz = (*info)->argsz; - *info = g_realloc(*info, argsz); - goto again; - } - - return 0; -} - -static struct vfio_info_cap_header * -vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id) -{ - struct vfio_info_cap_header *hdr; - void *ptr = info; - - if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) { - return NULL; - } - - for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) { - if (hdr->id == id) { - return hdr; - } - } - - return NULL; -} - -static void vfio_get_iommu_info_migration(VFIOContainer *container, - struct vfio_iommu_type1_info *info) -{ - struct vfio_info_cap_header *hdr; - struct vfio_iommu_type1_info_cap_migration *cap_mig; - - hdr = vfio_get_iommu_info_cap(info, VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION); - if (!hdr) { + if (!region->mem) { return; } - cap_mig = container_of(hdr, struct vfio_iommu_type1_info_cap_migration, - header); - - /* - * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of - * qemu_real_host_page_size to mark those dirty. - */ - if (cap_mig->pgsize_bitmap & qemu_real_host_page_size()) { - container->dirty_pages_supported = true; - container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size; - container->dirty_pgsizes = cap_mig->pgsize_bitmap; - } -} - -static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, - Error **errp) -{ - VFIOContainer *container; - int ret, fd; - VFIOAddressSpace *space; - - space = vfio_get_address_space(as); - - /* - * VFIO is currently incompatible with discarding of RAM insofar as the - * madvise to purge (zap) the page from QEMU's address space does not - * interact with the memory API and therefore leaves stale virtual to - * physical mappings in the IOMMU if the page was previously pinned. We - * therefore set discarding broken for each group added to a container, - * whether the container is used individually or shared. This provides - * us with options to allow devices within a group to opt-in and allow - * discarding, so long as it is done consistently for a group (for instance - * if the device is an mdev device where it is known that the host vendor - * driver will never pin pages outside of the working set of the guest - * driver, which would thus not be discarding candidates). - * - * The first opportunity to induce pinning occurs here where we attempt to - * attach the group to existing containers within the AddressSpace. If any - * pages are already zapped from the virtual address space, such as from - * previous discards, new pinning will cause valid mappings to be - * re-established. Likewise, when the overall MemoryListener for a new - * container is registered, a replay of mappings within the AddressSpace - * will occur, re-establishing any previously zapped pages as well. - * - * Especially virtio-balloon is currently only prevented from discarding - * new memory, it will not yet set ram_block_discard_set_required() and - * therefore, neither stops us here or deals with the sudden memory - * consumption of inflated memory. - * - * We do support discarding of memory coordinated via the RamDiscardManager - * with some IOMMU types. vfio_ram_block_discard_disable() handles the - * details once we know which type of IOMMU we are using. - */ - - QLIST_FOREACH(container, &space->containers, next) { - if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) { - ret = vfio_ram_block_discard_disable(container, true); - if (ret) { - error_setg_errno(errp, -ret, - "Cannot set discarding of RAM broken"); - if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, - &container->fd)) { - error_report("vfio: error disconnecting group %d from" - " container", group->groupid); - } - return ret; - } - group->container = container; - QLIST_INSERT_HEAD(&container->group_list, group, container_next); - vfio_kvm_device_add_group(group); - return 0; - } - } - - fd = qemu_open_old("/dev/vfio/vfio", O_RDWR); - if (fd < 0) { - error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio"); - ret = -errno; - goto put_space_exit; - } - - ret = ioctl(fd, VFIO_GET_API_VERSION); - if (ret != VFIO_API_VERSION) { - error_setg(errp, "supported vfio version: %d, " - "reported version: %d", VFIO_API_VERSION, ret); - ret = -EINVAL; - goto close_fd_exit; - } - - container = g_malloc0(sizeof(*container)); - container->space = space; - container->fd = fd; - container->error = NULL; - container->dirty_pages_supported = false; - container->dma_max_mappings = 0; - QLIST_INIT(&container->giommu_list); - QLIST_INIT(&container->hostwin_list); - QLIST_INIT(&container->vrdl_list); - - ret = vfio_init_container(container, group->fd, errp); - if (ret) { - goto free_container_exit; - } - - ret = vfio_ram_block_discard_disable(container, true); - if (ret) { - error_setg_errno(errp, -ret, "Cannot set discarding of RAM broken"); - goto free_container_exit; - } - - switch (container->iommu_type) { - case VFIO_TYPE1v2_IOMMU: - case VFIO_TYPE1_IOMMU: - { - struct vfio_iommu_type1_info *info; - - /* - * FIXME: This assumes that a Type1 IOMMU can map any 64-bit - * IOVA whatsoever. That's not actually true, but the current - * kernel interface doesn't tell us what it can map, and the - * existing Type1 IOMMUs generally support any IOVA we're - * going to actually try in practice. - */ - ret = vfio_get_iommu_info(container, &info); - - if (ret || !(info->flags & VFIO_IOMMU_INFO_PGSIZES)) { - /* Assume 4k IOVA page size */ - info->iova_pgsizes = 4096; - } - vfio_host_win_add(container, 0, (hwaddr)-1, info->iova_pgsizes); - container->pgsizes = info->iova_pgsizes; - - /* The default in the kernel ("dma_entry_limit") is 65535. */ - container->dma_max_mappings = 65535; - if (!ret) { - vfio_get_info_dma_avail(info, &container->dma_max_mappings); - vfio_get_iommu_info_migration(container, info); - } - g_free(info); - break; - } - case VFIO_SPAPR_TCE_v2_IOMMU: - case VFIO_SPAPR_TCE_IOMMU: - { - struct vfio_iommu_spapr_tce_info info; - bool v2 = container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU; - - /* - * The host kernel code implementing VFIO_IOMMU_DISABLE is called - * when container fd is closed so we do not call it explicitly - * in this file. - */ - if (!v2) { - ret = ioctl(fd, VFIO_IOMMU_ENABLE); - if (ret) { - error_setg_errno(errp, errno, "failed to enable container"); - ret = -errno; - goto enable_discards_exit; - } - } else { - container->prereg_listener = vfio_prereg_listener; - - memory_listener_register(&container->prereg_listener, - &address_space_memory); - if (container->error) { - memory_listener_unregister(&container->prereg_listener); - ret = -1; - error_propagate_prepend(errp, container->error, - "RAM memory listener initialization failed: "); - goto enable_discards_exit; - } - } - - info.argsz = sizeof(info); - ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info); - if (ret) { - error_setg_errno(errp, errno, - "VFIO_IOMMU_SPAPR_TCE_GET_INFO failed"); - ret = -errno; - if (v2) { - memory_listener_unregister(&container->prereg_listener); - } - goto enable_discards_exit; - } - - if (v2) { - container->pgsizes = info.ddw.pgsizes; - /* - * There is a default window in just created container. - * To make region_add/del simpler, we better remove this - * window now and let those iommu_listener callbacks - * create/remove them when needed. - */ - ret = vfio_spapr_remove_window(container, info.dma32_window_start); - if (ret) { - error_setg_errno(errp, -ret, - "failed to remove existing window"); - goto enable_discards_exit; - } - } else { - /* The default table uses 4K pages */ - container->pgsizes = 0x1000; - vfio_host_win_add(container, info.dma32_window_start, - info.dma32_window_start + - info.dma32_window_size - 1, - 0x1000); + for (i = 0; i < region->nr_mmaps; i++) { + if (region->mmaps[i].mmap) { + munmap(region->mmaps[i].mmap, region->mmaps[i].size); + object_unparent(OBJECT(®ion->mmaps[i].mem)); } } - } - - vfio_kvm_device_add_group(group); - - QLIST_INIT(&container->group_list); - QLIST_INSERT_HEAD(&space->containers, container, next); - - group->container = container; - QLIST_INSERT_HEAD(&container->group_list, group, container_next); - - container->listener = vfio_memory_listener; - - memory_listener_register(&container->listener, container->space->as); - - if (container->error) { - ret = -1; - error_propagate_prepend(errp, container->error, - "memory listener initialization failed: "); - goto listener_release_exit; - } - - container->initialized = true; - - return 0; -listener_release_exit: - QLIST_REMOVE(group, container_next); - QLIST_REMOVE(container, next); - vfio_kvm_device_del_group(group); - vfio_listener_release(container); - -enable_discards_exit: - vfio_ram_block_discard_disable(container, false); - -free_container_exit: - g_free(container); - -close_fd_exit: - close(fd); - -put_space_exit: - vfio_put_address_space(space); - - return ret; -} - -static void vfio_disconnect_container(VFIOGroup *group) -{ - VFIOContainer *container = group->container; - - QLIST_REMOVE(group, container_next); - group->container = NULL; - - /* - * Explicitly release the listener first before unset container, - * since unset may destroy the backend container if it's the last - * group. - */ - if (QLIST_EMPTY(&container->group_list)) { - vfio_listener_release(container); - } - - if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) { - error_report("vfio: error disconnecting group %d from container", - group->groupid); - } - - if (QLIST_EMPTY(&container->group_list)) { - VFIOAddressSpace *space = container->space; - VFIOGuestIOMMU *giommu, *tmp; - VFIOHostDMAWindow *hostwin, *next; - QLIST_REMOVE(container, next); - - QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) { - memory_region_unregister_iommu_notifier( - MEMORY_REGION(giommu->iommu_mr), &giommu->n); - QLIST_REMOVE(giommu, giommu_next); - g_free(giommu); - } + object_unparent(OBJECT(region->mem)); - QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next, - next) { - QLIST_REMOVE(hostwin, hostwin_next); - g_free(hostwin); - } + g_free(region->mem); + g_free(region->mmaps); - trace_vfio_disconnect_container(container->fd); - close(container->fd); - g_free(container); + trace_vfio_region_finalize(region->vbasedev->name, region->nr); - vfio_put_address_space(space); - } + region->mem = NULL; + region->mmaps = NULL; + region->nr_mmaps = 0; + region->size = 0; + region->flags = 0; + region->nr = 0; } -VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp) +void vfio_region_mmaps_set_enabled(VFIORegion *region, bool enabled) { - VFIOGroup *group; - char path[32]; - struct vfio_group_status status = { .argsz = sizeof(status) }; - - QLIST_FOREACH(group, &vfio_group_list, next) { - if (group->groupid == groupid) { - /* Found it. Now is it already in the right context? */ - if (group->container->space->as == as) { - return group; - } else { - error_setg(errp, "group %d used in multiple address spaces", - group->groupid); - return NULL; - } - } - } - - group = g_malloc0(sizeof(*group)); - - snprintf(path, sizeof(path), "/dev/vfio/%d", groupid); - group->fd = qemu_open_old(path, O_RDWR); - if (group->fd < 0) { - error_setg_errno(errp, errno, "failed to open %s", path); - goto free_group_exit; - } - - if (ioctl(group->fd, VFIO_GROUP_GET_STATUS, &status)) { - error_setg_errno(errp, errno, "failed to get group %d status", groupid); - goto close_fd_exit; - } - - if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) { - error_setg(errp, "group %d is not viable", groupid); - error_append_hint(errp, - "Please ensure all devices within the iommu_group " - "are bound to their vfio bus driver.\n"); - goto close_fd_exit; - } - - group->groupid = groupid; - QLIST_INIT(&group->device_list); - - if (vfio_connect_container(group, as, errp)) { - error_prepend(errp, "failed to setup container for group %d: ", - groupid); - goto close_fd_exit; - } - - if (QLIST_EMPTY(&vfio_group_list)) { - qemu_register_reset(vfio_reset_handler, NULL); - } - - QLIST_INSERT_HEAD(&vfio_group_list, group, next); - - return group; - -close_fd_exit: - close(group->fd); - -free_group_exit: - g_free(group); - - return NULL; -} + int i; -void vfio_put_group(VFIOGroup *group) -{ - if (!group || !QLIST_EMPTY(&group->device_list)) { + if (!region->mem) { return; } - if (!group->ram_block_discard_allowed) { - vfio_ram_block_discard_disable(group->container, false); - } - vfio_kvm_device_del_group(group); - vfio_disconnect_container(group); - QLIST_REMOVE(group, next); - trace_vfio_put_group(group->fd); - close(group->fd); - g_free(group); - - if (QLIST_EMPTY(&vfio_group_list)) { - qemu_unregister_reset(vfio_reset_handler, NULL); - } -} - -int vfio_get_device(VFIOGroup *group, const char *name, - VFIODevice *vbasedev, Error **errp) -{ - struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) }; - int ret, fd; - - fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name); - if (fd < 0) { - error_setg_errno(errp, errno, "error getting device from group %d", - group->groupid); - error_append_hint(errp, - "Verify all devices in group %d are bound to vfio- " - "or pci-stub and not already in use\n", group->groupid); - return fd; - } - - ret = ioctl(fd, VFIO_DEVICE_GET_INFO, &dev_info); - if (ret) { - error_setg_errno(errp, errno, "error getting device info"); - close(fd); - return ret; - } - - /* - * Set discarding of RAM as not broken for this group if the driver knows - * the device operates compatibly with discarding. Setting must be - * consistent per group, but since compatibility is really only possible - * with mdev currently, we expect singleton groups. - */ - if (vbasedev->ram_block_discard_allowed != - group->ram_block_discard_allowed) { - if (!QLIST_EMPTY(&group->device_list)) { - error_setg(errp, "Inconsistent setting of support for discarding " - "RAM (e.g., balloon) within group"); - close(fd); - return -1; - } - - if (!group->ram_block_discard_allowed) { - group->ram_block_discard_allowed = true; - vfio_ram_block_discard_disable(group->container, false); + for (i = 0; i < region->nr_mmaps; i++) { + if (region->mmaps[i].mmap) { + memory_region_set_enabled(®ion->mmaps[i].mem, enabled); } } - vbasedev->fd = fd; - vbasedev->group = group; - QLIST_INSERT_HEAD(&group->device_list, vbasedev, next); - - vbasedev->num_irqs = dev_info.num_irqs; - vbasedev->num_regions = dev_info.num_regions; - vbasedev->flags = dev_info.flags; - - trace_vfio_get_device(name, dev_info.flags, dev_info.num_regions, - dev_info.num_irqs); - - vbasedev->reset_works = !!(dev_info.flags & VFIO_DEVICE_FLAGS_RESET); - return 0; -} - -void vfio_put_base_device(VFIODevice *vbasedev) -{ - if (!vbasedev->group) { - return; - } - QLIST_REMOVE(vbasedev, next); - vbasedev->group = NULL; - trace_vfio_put_base_device(vbasedev->fd); - close(vbasedev->fd); + trace_vfio_region_mmaps_set_enabled(memory_region_name(region->mem), + enabled); } int vfio_get_region_info(VFIODevice *vbasedev, int index, @@ -2523,98 +627,3 @@ bool vfio_has_region_cap(VFIODevice *vbasedev, int region, uint16_t cap_type) return ret; } - -/* - * Interfaces for IBM EEH (Enhanced Error Handling) - */ -static bool vfio_eeh_container_ok(VFIOContainer *container) -{ - /* - * As of 2016-03-04 (linux-4.5) the host kernel EEH/VFIO - * implementation is broken if there are multiple groups in a - * container. The hardware works in units of Partitionable - * Endpoints (== IOMMU groups) and the EEH operations naively - * iterate across all groups in the container, without any logic - * to make sure the groups have their state synchronized. For - * certain operations (ENABLE) that might be ok, until an error - * occurs, but for others (GET_STATE) it's clearly broken. - */ - - /* - * XXX Once fixed kernels exist, test for them here - */ - - if (QLIST_EMPTY(&container->group_list)) { - return false; - } - - if (QLIST_NEXT(QLIST_FIRST(&container->group_list), container_next)) { - return false; - } - - return true; -} - -static int vfio_eeh_container_op(VFIOContainer *container, uint32_t op) -{ - struct vfio_eeh_pe_op pe_op = { - .argsz = sizeof(pe_op), - .op = op, - }; - int ret; - - if (!vfio_eeh_container_ok(container)) { - error_report("vfio/eeh: EEH_PE_OP 0x%x: " - "kernel requires a container with exactly one group", op); - return -EPERM; - } - - ret = ioctl(container->fd, VFIO_EEH_PE_OP, &pe_op); - if (ret < 0) { - error_report("vfio/eeh: EEH_PE_OP 0x%x failed: %m", op); - return -errno; - } - - return ret; -} - -static VFIOContainer *vfio_eeh_as_container(AddressSpace *as) -{ - VFIOAddressSpace *space = vfio_get_address_space(as); - VFIOContainer *container = NULL; - - if (QLIST_EMPTY(&space->containers)) { - /* No containers to act on */ - goto out; - } - - container = QLIST_FIRST(&space->containers); - - if (QLIST_NEXT(container, next)) { - /* We don't yet have logic to synchronize EEH state across - * multiple containers */ - container = NULL; - goto out; - } - -out: - vfio_put_address_space(space); - return container; -} - -bool vfio_eeh_as_ok(AddressSpace *as) -{ - VFIOContainer *container = vfio_eeh_as_container(as); - - return (container != NULL) && vfio_eeh_container_ok(container); -} - -int vfio_eeh_as_op(AddressSpace *as, uint32_t op) -{ - VFIOContainer *container = vfio_eeh_as_container(as); - - if (!container) { - return -ENODEV; - } - return vfio_eeh_container_op(container, op); -} diff --git a/hw/vfio/container.c b/hw/vfio/container.c new file mode 100644 index 0000000000..8ee7c9b980 --- /dev/null +++ b/hw/vfio/container.c @@ -0,0 +1,1193 @@ +/* + * generic functions used by VFIO devices + * + * Copyright Red Hat, Inc. 2012 + * + * Authors: + * Alex Williamson + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + * + * Based on qemu-kvm device-assignment: + * Adapted for KVM by Qumranet. + * Copyright (c) 2007, Neocleus, Alex Novik (alex@neocleus.com) + * Copyright (c) 2007, Neocleus, Guy Zana (guy@neocleus.com) + * Copyright (C) 2008, Qumranet, Amit Shah (amit.shah@qumranet.com) + * Copyright (C) 2008, Red Hat, Amit Shah (amit.shah@redhat.com) + * Copyright (C) 2008, IBM, Muli Ben-Yehuda (muli@il.ibm.com) + */ + +#include "qemu/osdep.h" +#include +#ifdef CONFIG_KVM +#include +#endif +#include + +#include "hw/vfio/vfio-common.h" +#include "hw/vfio/vfio.h" +#include "exec/address-spaces.h" +#include "exec/memory.h" +#include "exec/ram_addr.h" +#include "hw/hw.h" +#include "qemu/error-report.h" +#include "qemu/range.h" +#include "sysemu/kvm.h" +#include "sysemu/reset.h" +#include "trace.h" +#include "qapi/error.h" +#include "migration/migration.h" + +#ifdef CONFIG_KVM +/* + * We have a single VFIO pseudo device per KVM VM. Once created it lives + * for the life of the VM. Closing the file descriptor only drops our + * reference to it and the device's reference to kvm. Therefore once + * initialized, this file descriptor is only released on QEMU exit and + * we'll re-use it should another vfio device be attached before then. + */ +static int vfio_kvm_device_fd = -1; +#endif + +VFIOGroupList vfio_group_list = + QLIST_HEAD_INITIALIZER(vfio_group_list); + +/* + * Device state interfaces + */ + +bool vfio_mig_active(void) +{ + VFIOGroup *group; + VFIODevice *vbasedev; + + if (QLIST_EMPTY(&vfio_group_list)) { + return false; + } + + QLIST_FOREACH(group, &vfio_group_list, next) { + QLIST_FOREACH(vbasedev, &group->device_list, next) { + if (vbasedev->migration_blocker) { + return false; + } + } + } + return true; +} + +bool vfio_devices_all_dirty_tracking(VFIOContainer *container) +{ + VFIOGroup *group; + VFIODevice *vbasedev; + MigrationState *ms = migrate_get_current(); + + if (!migration_is_setup_or_active(ms->state)) { + return false; + } + + QLIST_FOREACH(group, &container->group_list, container_next) { + QLIST_FOREACH(vbasedev, &group->device_list, next) { + VFIOMigration *migration = vbasedev->migration; + + if (!migration) { + return false; + } + + if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) + && (migration->device_state & VFIO_DEVICE_STATE_V1_RUNNING)) { + return false; + } + } + } + return true; +} + +bool vfio_devices_all_running_and_saving(VFIOContainer *container) +{ + VFIOGroup *group; + VFIODevice *vbasedev; + MigrationState *ms = migrate_get_current(); + + if (!migration_is_setup_or_active(ms->state)) { + return false; + } + + QLIST_FOREACH(group, &container->group_list, container_next) { + QLIST_FOREACH(vbasedev, &group->device_list, next) { + VFIOMigration *migration = vbasedev->migration; + + if (!migration) { + return false; + } + + if ((migration->device_state & VFIO_DEVICE_STATE_V1_SAVING) && + (migration->device_state & VFIO_DEVICE_STATE_V1_RUNNING)) { + continue; + } else { + return false; + } + } + } + return true; +} + +static int vfio_dma_unmap_bitmap(VFIOContainer *container, + hwaddr iova, ram_addr_t size, + IOMMUTLBEntry *iotlb) +{ + struct vfio_iommu_type1_dma_unmap *unmap; + struct vfio_bitmap *bitmap; + uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size(); + int ret; + + unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap)); + + unmap->argsz = sizeof(*unmap) + sizeof(*bitmap); + unmap->iova = iova; + unmap->size = size; + unmap->flags |= VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP; + bitmap = (struct vfio_bitmap *)&unmap->data; + + /* + * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of + * qemu_real_host_page_size to mark those dirty. Hence set bitmap_pgsize + * to qemu_real_host_page_size. + */ + + bitmap->pgsize = qemu_real_host_page_size(); + bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) / + BITS_PER_BYTE; + + if (bitmap->size > container->max_dirty_bitmap_size) { + error_report("UNMAP: Size of bitmap too big 0x%"PRIx64, + (uint64_t)bitmap->size); + ret = -E2BIG; + goto unmap_exit; + } + + bitmap->data = g_try_malloc0(bitmap->size); + if (!bitmap->data) { + ret = -ENOMEM; + goto unmap_exit; + } + + ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap); + if (!ret) { + cpu_physical_memory_set_dirty_lebitmap((unsigned long *)bitmap->data, + iotlb->translated_addr, pages); + } else { + error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m"); + } + + g_free(bitmap->data); +unmap_exit: + g_free(unmap); + return ret; +} + +/* + * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86 + */ +int vfio_dma_unmap(VFIOContainer *container, + hwaddr iova, ram_addr_t size, + IOMMUTLBEntry *iotlb) +{ + struct vfio_iommu_type1_dma_unmap unmap = { + .argsz = sizeof(unmap), + .flags = 0, + .iova = iova, + .size = size, + }; + + if (iotlb && container->dirty_pages_supported && + vfio_devices_all_running_and_saving(container)) { + return vfio_dma_unmap_bitmap(container, iova, size, iotlb); + } + + while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) { + /* + * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c + * v4.15) where an overflow in its wrap-around check prevents us from + * unmapping the last page of the address space. Test for the error + * condition and re-try the unmap excluding the last page. The + * expectation is that we've never mapped the last page anyway and this + * unmap request comes via vIOMMU support which also makes it unlikely + * that this page is used. This bug was introduced well after type1 v2 + * support was introduced, so we shouldn't need to test for v1. A fix + * is queued for kernel v5.0 so this workaround can be removed once + * affected kernels are sufficiently deprecated. + */ + if (errno == EINVAL && unmap.size && !(unmap.iova + unmap.size) && + container->iommu_type == VFIO_TYPE1v2_IOMMU) { + trace_vfio_dma_unmap_overflow_workaround(); + unmap.size -= 1ULL << ctz64(container->pgsizes); + continue; + } + error_report("VFIO_UNMAP_DMA failed: %s", strerror(errno)); + return -errno; + } + + return 0; +} + +int vfio_dma_map(VFIOContainer *container, hwaddr iova, + ram_addr_t size, void *vaddr, bool readonly) +{ + struct vfio_iommu_type1_dma_map map = { + .argsz = sizeof(map), + .flags = VFIO_DMA_MAP_FLAG_READ, + .vaddr = (__u64)(uintptr_t)vaddr, + .iova = iova, + .size = size, + }; + + if (!readonly) { + map.flags |= VFIO_DMA_MAP_FLAG_WRITE; + } + + /* + * Try the mapping, if it fails with EBUSY, unmap the region and try + * again. This shouldn't be necessary, but we sometimes see it in + * the VGA ROM space. + */ + if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 || + (errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 && + ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) { + return 0; + } + + error_report("VFIO_MAP_DMA failed: %s", strerror(errno)); + return -errno; +} + +void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start) +{ + int ret; + struct vfio_iommu_type1_dirty_bitmap dirty = { + .argsz = sizeof(dirty), + }; + + if (start) { + dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START; + } else { + dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP; + } + + ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty); + if (ret) { + error_report("Failed to set dirty tracking flag 0x%x errno: %d", + dirty.flags, errno); + } +} + +int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova, + uint64_t size, ram_addr_t ram_addr) +{ + struct vfio_iommu_type1_dirty_bitmap *dbitmap; + struct vfio_iommu_type1_dirty_bitmap_get *range; + uint64_t pages; + int ret; + + dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range)); + + dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range); + dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP; + range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data; + range->iova = iova; + range->size = size; + + /* + * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of + * qemu_real_host_page_size to mark those dirty. Hence set bitmap's pgsize + * to qemu_real_host_page_size. + */ + range->bitmap.pgsize = qemu_real_host_page_size(); + + pages = REAL_HOST_PAGE_ALIGN(range->size) / qemu_real_host_page_size(); + range->bitmap.size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) / + BITS_PER_BYTE; + range->bitmap.data = g_try_malloc0(range->bitmap.size); + if (!range->bitmap.data) { + ret = -ENOMEM; + goto err_out; + } + + ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap); + if (ret) { + error_report("Failed to get dirty bitmap for iova: 0x%"PRIx64 + " size: 0x%"PRIx64" err: %d", (uint64_t)range->iova, + (uint64_t)range->size, errno); + goto err_out; + } + + cpu_physical_memory_set_dirty_lebitmap((unsigned long *)range->bitmap.data, + ram_addr, pages); + + trace_vfio_get_dirty_bitmap(container->fd, range->iova, range->size, + range->bitmap.size, ram_addr); +err_out: + g_free(range->bitmap.data); + g_free(dbitmap); + + return ret; +} + +static void vfio_listener_release(VFIOContainer *container) +{ + memory_listener_unregister(&container->listener); + if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) { + memory_listener_unregister(&container->prereg_listener); + } +} + +int vfio_container_add_section_window(VFIOContainer *container, + MemoryRegionSection *section, + Error **errp) +{ + VFIOHostDMAWindow *hostwin; + hwaddr pgsize = 0; + int ret; + + if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) { + return 0; + } + + /* For now intersections are not allowed, we may relax this later */ + QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) { + if (ranges_overlap(hostwin->min_iova, + hostwin->max_iova - hostwin->min_iova + 1, + section->offset_within_address_space, + int128_get64(section->size))) { + error_setg(errp, + "region [0x%"PRIx64",0x%"PRIx64"] overlaps with existing" + "host DMA window [0x%"PRIx64",0x%"PRIx64"]", + section->offset_within_address_space, + section->offset_within_address_space + + int128_get64(section->size) - 1, + hostwin->min_iova, hostwin->max_iova); + return -1; + } + } + + ret = vfio_spapr_create_window(container, section, &pgsize); + if (ret) { + error_setg_errno(errp, -ret, "Failed to create SPAPR window"); + return ret; + } + + vfio_host_win_add(container, section->offset_within_address_space, + section->offset_within_address_space + + int128_get64(section->size) - 1, pgsize); +#ifdef CONFIG_KVM + if (kvm_enabled()) { + VFIOGroup *group; + IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr); + struct kvm_vfio_spapr_tce param; + struct kvm_device_attr attr = { + .group = KVM_DEV_VFIO_GROUP, + .attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE, + .addr = (uint64_t)(unsigned long)¶m, + }; + + if (!memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_SPAPR_TCE_FD, + ¶m.tablefd)) { + QLIST_FOREACH(group, &container->group_list, container_next) { + param.groupfd = group->fd; + if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) { + error_report("vfio: failed to setup fd %d " + "for a group with fd %d: %s", + param.tablefd, param.groupfd, + strerror(errno)); + return -1; + } + trace_vfio_spapr_group_attach(param.groupfd, param.tablefd); + } + } + } +#endif + return 0; +} + +void vfio_container_del_section_window(VFIOContainer *container, + MemoryRegionSection *section) +{ + if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) { + return; + } + + vfio_spapr_remove_window(container, + section->offset_within_address_space); + if (vfio_host_win_del(container, + section->offset_within_address_space, + section->offset_within_address_space + + int128_get64(section->size) - 1) < 0) { + hw_error("%s: Cannot delete missing window at %"HWADDR_PRIx, + __func__, section->offset_within_address_space); + } +} + +void vfio_reset_handler(void *opaque) +{ + VFIOGroup *group; + VFIODevice *vbasedev; + + QLIST_FOREACH(group, &vfio_group_list, next) { + QLIST_FOREACH(vbasedev, &group->device_list, next) { + if (vbasedev->dev->realized) { + vbasedev->ops->vfio_compute_needs_reset(vbasedev); + } + } + } + + QLIST_FOREACH(group, &vfio_group_list, next) { + QLIST_FOREACH(vbasedev, &group->device_list, next) { + if (vbasedev->dev->realized && vbasedev->needs_reset) { + vbasedev->ops->vfio_hot_reset_multi(vbasedev); + } + } + } +} + +static void vfio_kvm_device_add_group(VFIOGroup *group) +{ +#ifdef CONFIG_KVM + struct kvm_device_attr attr = { + .group = KVM_DEV_VFIO_GROUP, + .attr = KVM_DEV_VFIO_GROUP_ADD, + .addr = (uint64_t)(unsigned long)&group->fd, + }; + + if (!kvm_enabled()) { + return; + } + + if (vfio_kvm_device_fd < 0) { + struct kvm_create_device cd = { + .type = KVM_DEV_TYPE_VFIO, + }; + + if (kvm_vm_ioctl(kvm_state, KVM_CREATE_DEVICE, &cd)) { + error_report("Failed to create KVM VFIO device: %m"); + return; + } + + vfio_kvm_device_fd = cd.fd; + } + + if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) { + error_report("Failed to add group %d to KVM VFIO device: %m", + group->groupid); + } +#endif +} + +static void vfio_kvm_device_del_group(VFIOGroup *group) +{ +#ifdef CONFIG_KVM + struct kvm_device_attr attr = { + .group = KVM_DEV_VFIO_GROUP, + .attr = KVM_DEV_VFIO_GROUP_DEL, + .addr = (uint64_t)(unsigned long)&group->fd, + }; + + if (vfio_kvm_device_fd < 0) { + return; + } + + if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) { + error_report("Failed to remove group %d from KVM VFIO device: %m", + group->groupid); + } +#endif +} + +/* + * vfio_get_iommu_type - selects the richest iommu_type (v2 first) + */ +static int vfio_get_iommu_type(VFIOContainer *container, + Error **errp) +{ + int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU, + VFIO_SPAPR_TCE_v2_IOMMU, VFIO_SPAPR_TCE_IOMMU }; + int i; + + for (i = 0; i < ARRAY_SIZE(iommu_types); i++) { + if (ioctl(container->fd, VFIO_CHECK_EXTENSION, iommu_types[i])) { + return iommu_types[i]; + } + } + error_setg(errp, "No available IOMMU models"); + return -EINVAL; +} + +static int vfio_init_container(VFIOContainer *container, int group_fd, + Error **errp) +{ + int iommu_type, ret; + + iommu_type = vfio_get_iommu_type(container, errp); + if (iommu_type < 0) { + return iommu_type; + } + + ret = ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container->fd); + if (ret) { + error_setg_errno(errp, errno, "Failed to set group container"); + return -errno; + } + + while (ioctl(container->fd, VFIO_SET_IOMMU, iommu_type)) { + if (iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) { + /* + * On sPAPR, despite the IOMMU subdriver always advertises v1 and + * v2, the running platform may not support v2 and there is no + * way to guess it until an IOMMU group gets added to the container. + * So in case it fails with v2, try v1 as a fallback. + */ + iommu_type = VFIO_SPAPR_TCE_IOMMU; + continue; + } + error_setg_errno(errp, errno, "Failed to set iommu for container"); + return -errno; + } + + container->iommu_type = iommu_type; + return 0; +} + +static int vfio_get_iommu_info(VFIOContainer *container, + struct vfio_iommu_type1_info **info) +{ + + size_t argsz = sizeof(struct vfio_iommu_type1_info); + + *info = g_new0(struct vfio_iommu_type1_info, 1); +again: + (*info)->argsz = argsz; + + if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) { + g_free(*info); + *info = NULL; + return -errno; + } + + if (((*info)->argsz > argsz)) { + argsz = (*info)->argsz; + *info = g_realloc(*info, argsz); + goto again; + } + + return 0; +} + +static struct vfio_info_cap_header * +vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id) +{ + struct vfio_info_cap_header *hdr; + void *ptr = info; + + if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) { + return NULL; + } + + for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) { + if (hdr->id == id) { + return hdr; + } + } + + return NULL; +} + +static void vfio_get_iommu_info_migration(VFIOContainer *container, + struct vfio_iommu_type1_info *info) +{ + struct vfio_info_cap_header *hdr; + struct vfio_iommu_type1_info_cap_migration *cap_mig; + + hdr = vfio_get_iommu_info_cap(info, VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION); + if (!hdr) { + return; + } + + cap_mig = container_of(hdr, struct vfio_iommu_type1_info_cap_migration, + header); + + /* + * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of + * qemu_real_host_page_size to mark those dirty. + */ + if (cap_mig->pgsize_bitmap & qemu_real_host_page_size()) { + container->dirty_pages_supported = true; + container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size; + container->dirty_pgsizes = cap_mig->pgsize_bitmap; + } +} + +static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state) +{ + switch (container->iommu_type) { + case VFIO_TYPE1v2_IOMMU: + case VFIO_TYPE1_IOMMU: + /* + * We support coordinated discarding of RAM via the RamDiscardManager. + */ + return ram_block_uncoordinated_discard_disable(state); + default: + /* + * VFIO_SPAPR_TCE_IOMMU most probably works just fine with + * RamDiscardManager, however, it is completely untested. + * + * VFIO_SPAPR_TCE_v2_IOMMU with "DMA memory preregistering" does + * completely the opposite of managing mapping/pinning dynamically as + * required by RamDiscardManager. We would have to special-case sections + * with a RamDiscardManager. + */ + return ram_block_discard_disable(state); + } +} + +static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, + Error **errp) +{ + VFIOContainer *container; + int ret, fd; + VFIOAddressSpace *space; + + space = vfio_get_address_space(as); + + /* + * VFIO is currently incompatible with discarding of RAM insofar as the + * madvise to purge (zap) the page from QEMU's address space does not + * interact with the memory API and therefore leaves stale virtual to + * physical mappings in the IOMMU if the page was previously pinned. We + * therefore set discarding broken for each group added to a container, + * whether the container is used individually or shared. This provides + * us with options to allow devices within a group to opt-in and allow + * discarding, so long as it is done consistently for a group (for instance + * if the device is an mdev device where it is known that the host vendor + * driver will never pin pages outside of the working set of the guest + * driver, which would thus not be discarding candidates). + * + * The first opportunity to induce pinning occurs here where we attempt to + * attach the group to existing containers within the AddressSpace. If any + * pages are already zapped from the virtual address space, such as from + * previous discards, new pinning will cause valid mappings to be + * re-established. Likewise, when the overall MemoryListener for a new + * container is registered, a replay of mappings within the AddressSpace + * will occur, re-establishing any previously zapped pages as well. + * + * Especially virtio-balloon is currently only prevented from discarding + * new memory, it will not yet set ram_block_discard_set_required() and + * therefore, neither stops us here or deals with the sudden memory + * consumption of inflated memory. + * + * We do support discarding of memory coordinated via the RamDiscardManager + * with some IOMMU types. vfio_ram_block_discard_disable() handles the + * details once we know which type of IOMMU we are using. + */ + + QLIST_FOREACH(container, &space->containers, next) { + if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) { + ret = vfio_ram_block_discard_disable(container, true); + if (ret) { + error_setg_errno(errp, -ret, + "Cannot set discarding of RAM broken"); + if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, + &container->fd)) { + error_report("vfio: error disconnecting group %d from" + " container", group->groupid); + } + return ret; + } + group->container = container; + QLIST_INSERT_HEAD(&container->group_list, group, container_next); + vfio_kvm_device_add_group(group); + return 0; + } + } + + fd = qemu_open_old("/dev/vfio/vfio", O_RDWR); + if (fd < 0) { + error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio"); + ret = -errno; + goto put_space_exit; + } + + ret = ioctl(fd, VFIO_GET_API_VERSION); + if (ret != VFIO_API_VERSION) { + error_setg(errp, "supported vfio version: %d, " + "reported version: %d", VFIO_API_VERSION, ret); + ret = -EINVAL; + goto close_fd_exit; + } + + container = g_malloc0(sizeof(*container)); + container->space = space; + container->fd = fd; + container->error = NULL; + container->dirty_pages_supported = false; + container->dma_max_mappings = 0; + QLIST_INIT(&container->giommu_list); + QLIST_INIT(&container->hostwin_list); + QLIST_INIT(&container->vrdl_list); + + ret = vfio_init_container(container, group->fd, errp); + if (ret) { + goto free_container_exit; + } + + ret = vfio_ram_block_discard_disable(container, true); + if (ret) { + error_setg_errno(errp, -ret, "Cannot set discarding of RAM broken"); + goto free_container_exit; + } + + switch (container->iommu_type) { + case VFIO_TYPE1v2_IOMMU: + case VFIO_TYPE1_IOMMU: + { + struct vfio_iommu_type1_info *info; + + /* + * FIXME: This assumes that a Type1 IOMMU can map any 64-bit + * IOVA whatsoever. That's not actually true, but the current + * kernel interface doesn't tell us what it can map, and the + * existing Type1 IOMMUs generally support any IOVA we're + * going to actually try in practice. + */ + ret = vfio_get_iommu_info(container, &info); + + if (ret || !(info->flags & VFIO_IOMMU_INFO_PGSIZES)) { + /* Assume 4k IOVA page size */ + info->iova_pgsizes = 4096; + } + vfio_host_win_add(container, 0, (hwaddr)-1, info->iova_pgsizes); + container->pgsizes = info->iova_pgsizes; + + /* The default in the kernel ("dma_entry_limit") is 65535. */ + container->dma_max_mappings = 65535; + if (!ret) { + vfio_get_info_dma_avail(info, &container->dma_max_mappings); + vfio_get_iommu_info_migration(container, info); + } + g_free(info); + break; + } + case VFIO_SPAPR_TCE_v2_IOMMU: + case VFIO_SPAPR_TCE_IOMMU: + { + struct vfio_iommu_spapr_tce_info info; + bool v2 = container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU; + + /* + * The host kernel code implementing VFIO_IOMMU_DISABLE is called + * when container fd is closed so we do not call it explicitly + * in this file. + */ + if (!v2) { + ret = ioctl(fd, VFIO_IOMMU_ENABLE); + if (ret) { + error_setg_errno(errp, errno, "failed to enable container"); + ret = -errno; + goto enable_discards_exit; + } + } else { + container->prereg_listener = vfio_prereg_listener; + + memory_listener_register(&container->prereg_listener, + &address_space_memory); + if (container->error) { + memory_listener_unregister(&container->prereg_listener); + ret = -1; + error_propagate_prepend(errp, container->error, + "RAM memory listener initialization failed: "); + goto enable_discards_exit; + } + } + + info.argsz = sizeof(info); + ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info); + if (ret) { + error_setg_errno(errp, errno, + "VFIO_IOMMU_SPAPR_TCE_GET_INFO failed"); + ret = -errno; + if (v2) { + memory_listener_unregister(&container->prereg_listener); + } + goto enable_discards_exit; + } + + if (v2) { + container->pgsizes = info.ddw.pgsizes; + /* + * There is a default window in just created container. + * To make region_add/del simpler, we better remove this + * window now and let those iommu_listener callbacks + * create/remove them when needed. + */ + ret = vfio_spapr_remove_window(container, info.dma32_window_start); + if (ret) { + error_setg_errno(errp, -ret, + "failed to remove existing window"); + goto enable_discards_exit; + } + } else { + /* The default table uses 4K pages */ + container->pgsizes = 0x1000; + vfio_host_win_add(container, info.dma32_window_start, + info.dma32_window_start + + info.dma32_window_size - 1, + 0x1000); + } + } + } + + vfio_kvm_device_add_group(group); + + QLIST_INIT(&container->group_list); + QLIST_INSERT_HEAD(&space->containers, container, next); + + group->container = container; + QLIST_INSERT_HEAD(&container->group_list, group, container_next); + + container->listener = vfio_memory_listener; + + memory_listener_register(&container->listener, container->space->as); + + if (container->error) { + ret = -1; + error_propagate_prepend(errp, container->error, + "memory listener initialization failed: "); + goto listener_release_exit; + } + + container->initialized = true; + + return 0; +listener_release_exit: + QLIST_REMOVE(group, container_next); + QLIST_REMOVE(container, next); + vfio_kvm_device_del_group(group); + vfio_listener_release(container); + +enable_discards_exit: + vfio_ram_block_discard_disable(container, false); + +free_container_exit: + g_free(container); + +close_fd_exit: + close(fd); + +put_space_exit: + vfio_put_address_space(space); + + return ret; +} + +static void vfio_disconnect_container(VFIOGroup *group) +{ + VFIOContainer *container = group->container; + + QLIST_REMOVE(group, container_next); + group->container = NULL; + + /* + * Explicitly release the listener first before unset container, + * since unset may destroy the backend container if it's the last + * group. + */ + if (QLIST_EMPTY(&container->group_list)) { + vfio_listener_release(container); + } + + if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) { + error_report("vfio: error disconnecting group %d from container", + group->groupid); + } + + if (QLIST_EMPTY(&container->group_list)) { + VFIOAddressSpace *space = container->space; + VFIOGuestIOMMU *giommu, *tmp; + VFIOHostDMAWindow *hostwin, *next; + + QLIST_REMOVE(container, next); + + QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) { + memory_region_unregister_iommu_notifier( + MEMORY_REGION(giommu->iommu_mr), &giommu->n); + QLIST_REMOVE(giommu, giommu_next); + g_free(giommu); + } + + QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next, + next) { + QLIST_REMOVE(hostwin, hostwin_next); + g_free(hostwin); + } + + trace_vfio_disconnect_container(container->fd); + close(container->fd); + g_free(container); + + vfio_put_address_space(space); + } +} + +VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp) +{ + VFIOGroup *group; + char path[32]; + struct vfio_group_status status = { .argsz = sizeof(status) }; + + QLIST_FOREACH(group, &vfio_group_list, next) { + if (group->groupid == groupid) { + /* Found it. Now is it already in the right context? */ + if (group->container->space->as == as) { + return group; + } else { + error_setg(errp, "group %d used in multiple address spaces", + group->groupid); + return NULL; + } + } + } + + group = g_malloc0(sizeof(*group)); + + snprintf(path, sizeof(path), "/dev/vfio/%d", groupid); + group->fd = qemu_open_old(path, O_RDWR); + if (group->fd < 0) { + error_setg_errno(errp, errno, "failed to open %s", path); + goto free_group_exit; + } + + if (ioctl(group->fd, VFIO_GROUP_GET_STATUS, &status)) { + error_setg_errno(errp, errno, "failed to get group %d status", groupid); + goto close_fd_exit; + } + + if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) { + error_setg(errp, "group %d is not viable", groupid); + error_append_hint(errp, + "Please ensure all devices within the iommu_group " + "are bound to their vfio bus driver.\n"); + goto close_fd_exit; + } + + group->groupid = groupid; + QLIST_INIT(&group->device_list); + + if (vfio_connect_container(group, as, errp)) { + error_prepend(errp, "failed to setup container for group %d: ", + groupid); + goto close_fd_exit; + } + + if (QLIST_EMPTY(&vfio_group_list)) { + qemu_register_reset(vfio_reset_handler, NULL); + } + + QLIST_INSERT_HEAD(&vfio_group_list, group, next); + + return group; + +close_fd_exit: + close(group->fd); + +free_group_exit: + g_free(group); + + return NULL; +} + +void vfio_put_group(VFIOGroup *group) +{ + if (!group || !QLIST_EMPTY(&group->device_list)) { + return; + } + + if (!group->ram_block_discard_allowed) { + vfio_ram_block_discard_disable(group->container, false); + } + vfio_kvm_device_del_group(group); + vfio_disconnect_container(group); + QLIST_REMOVE(group, next); + trace_vfio_put_group(group->fd); + close(group->fd); + g_free(group); + + if (QLIST_EMPTY(&vfio_group_list)) { + qemu_unregister_reset(vfio_reset_handler, NULL); + } +} + +int vfio_get_device(VFIOGroup *group, const char *name, + VFIODevice *vbasedev, Error **errp) +{ + struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) }; + int ret, fd; + + fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name); + if (fd < 0) { + error_setg_errno(errp, errno, "error getting device from group %d", + group->groupid); + error_append_hint(errp, + "Verify all devices in group %d are bound to vfio- " + "or pci-stub and not already in use\n", group->groupid); + return fd; + } + + ret = ioctl(fd, VFIO_DEVICE_GET_INFO, &dev_info); + if (ret) { + error_setg_errno(errp, errno, "error getting device info"); + close(fd); + return ret; + } + + /* + * Set discarding of RAM as not broken for this group if the driver knows + * the device operates compatibly with discarding. Setting must be + * consistent per group, but since compatibility is really only possible + * with mdev currently, we expect singleton groups. + */ + if (vbasedev->ram_block_discard_allowed != + group->ram_block_discard_allowed) { + if (!QLIST_EMPTY(&group->device_list)) { + error_setg(errp, "Inconsistent setting of support for discarding " + "RAM (e.g., balloon) within group"); + close(fd); + return -1; + } + + if (!group->ram_block_discard_allowed) { + group->ram_block_discard_allowed = true; + vfio_ram_block_discard_disable(group->container, false); + } + } + + vbasedev->fd = fd; + vbasedev->group = group; + QLIST_INSERT_HEAD(&group->device_list, vbasedev, next); + + vbasedev->num_irqs = dev_info.num_irqs; + vbasedev->num_regions = dev_info.num_regions; + vbasedev->flags = dev_info.flags; + + trace_vfio_get_device(name, dev_info.flags, dev_info.num_regions, + dev_info.num_irqs); + + vbasedev->reset_works = !!(dev_info.flags & VFIO_DEVICE_FLAGS_RESET); + return 0; +} + +void vfio_put_base_device(VFIODevice *vbasedev) +{ + if (!vbasedev->group) { + return; + } + QLIST_REMOVE(vbasedev, next); + vbasedev->group = NULL; + trace_vfio_put_base_device(vbasedev->fd); + close(vbasedev->fd); +} + +/* FIXME: should below code be in common.c? */ +/* + * Interfaces for IBM EEH (Enhanced Error Handling) + */ +static bool vfio_eeh_container_ok(VFIOContainer *container) +{ + /* + * As of 2016-03-04 (linux-4.5) the host kernel EEH/VFIO + * implementation is broken if there are multiple groups in a + * container. The hardware works in units of Partitionable + * Endpoints (== IOMMU groups) and the EEH operations naively + * iterate across all groups in the container, without any logic + * to make sure the groups have their state synchronized. For + * certain operations (ENABLE) that might be ok, until an error + * occurs, but for others (GET_STATE) it's clearly broken. + */ + + /* + * XXX Once fixed kernels exist, test for them here + */ + + if (QLIST_EMPTY(&container->group_list)) { + return false; + } + + if (QLIST_NEXT(QLIST_FIRST(&container->group_list), container_next)) { + return false; + } + + return true; +} + +static int vfio_eeh_container_op(VFIOContainer *container, uint32_t op) +{ + struct vfio_eeh_pe_op pe_op = { + .argsz = sizeof(pe_op), + .op = op, + }; + int ret; + + if (!vfio_eeh_container_ok(container)) { + error_report("vfio/eeh: EEH_PE_OP 0x%x: " + "kernel requires a container with exactly one group", op); + return -EPERM; + } + + ret = ioctl(container->fd, VFIO_EEH_PE_OP, &pe_op); + if (ret < 0) { + error_report("vfio/eeh: EEH_PE_OP 0x%x failed: %m", op); + return -errno; + } + + return ret; +} + +static VFIOContainer *vfio_eeh_as_container(AddressSpace *as) +{ + VFIOAddressSpace *space = vfio_get_address_space(as); + VFIOContainer *container = NULL; + + if (QLIST_EMPTY(&space->containers)) { + /* No containers to act on */ + goto out; + } + + container = QLIST_FIRST(&space->containers); + + if (QLIST_NEXT(container, next)) { + /* + * We don't yet have logic to synchronize EEH state across + * multiple containers. + */ + container = NULL; + goto out; + } + +out: + vfio_put_address_space(space); + return container; +} + +bool vfio_eeh_as_ok(AddressSpace *as) +{ + VFIOContainer *container = vfio_eeh_as_container(as); + + return (container != NULL) && vfio_eeh_container_ok(container); +} + +int vfio_eeh_as_op(AddressSpace *as, uint32_t op) +{ + VFIOContainer *container = vfio_eeh_as_container(as); + + if (!container) { + return -ENODEV; + } + return vfio_eeh_container_op(container, op); +} diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build index da9af297a0..e3b6d6e2cb 100644 --- a/hw/vfio/meson.build +++ b/hw/vfio/meson.build @@ -1,6 +1,8 @@ vfio_ss = ss.source_set() vfio_ss.add(files( 'common.c', + 'as.c', + 'container.c', 'spapr.c', 'migration.c', )) diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index e573f5a9f1..03ff7944cb 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -33,6 +33,8 @@ #define VFIO_MSG_PREFIX "vfio %s: " +extern const MemoryListener vfio_memory_listener; + enum { VFIO_DEVICE_TYPE_PCI = 0, VFIO_DEVICE_TYPE_PLATFORM = 1, @@ -190,6 +192,32 @@ typedef struct VFIODisplay { } dmabuf; } VFIODisplay; +void vfio_host_win_add(VFIOContainer *container, + hwaddr min_iova, hwaddr max_iova, + uint64_t iova_pgsizes); +int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova, + hwaddr max_iova); +VFIOAddressSpace *vfio_get_address_space(AddressSpace *as); +void vfio_put_address_space(VFIOAddressSpace *space); +bool vfio_devices_all_running_and_saving(VFIOContainer *container); +bool vfio_devices_all_dirty_tracking(VFIOContainer *container); + +/* container->fd */ +int vfio_dma_unmap(VFIOContainer *container, + hwaddr iova, ram_addr_t size, + IOMMUTLBEntry *iotlb); +int vfio_dma_map(VFIOContainer *container, hwaddr iova, + ram_addr_t size, void *vaddr, bool readonly); +void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start); +int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova, + uint64_t size, ram_addr_t ram_addr); + +int vfio_container_add_section_window(VFIOContainer *container, + MemoryRegionSection *section, + Error **errp); +void vfio_container_del_section_window(VFIOContainer *container, + MemoryRegionSection *section); + void vfio_put_base_device(VFIODevice *vbasedev); void vfio_disable_irqindex(VFIODevice *vbasedev, int index); void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index); From patchwork Wed Jun 8 12:31:28 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yi Liu X-Patchwork-Id: 12873445 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id B0ECAC43334 for ; Wed, 8 Jun 2022 12:36:24 +0000 (UTC) Received: from localhost ([::1]:57456 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nyuv9-0003Ll-J4 for qemu-devel@archiver.kernel.org; Wed, 08 Jun 2022 08:36:23 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:57916) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nyuqm-0000IF-Ka for qemu-devel@nongnu.org; Wed, 08 Jun 2022 08:31:52 -0400 Received: from mga05.intel.com ([192.55.52.43]:56947) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nyuqh-0005ka-Mx for qemu-devel@nongnu.org; Wed, 08 Jun 2022 08:31:52 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1654691507; x=1686227507; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=HQgNDGaxtm/9gknl6zsGMUe+qWdG9WyLhYuLDEvJUpg=; b=miGPVJbvbEDHpHf39j41yZcbwg8JEwf2eCqzNb8piVOzgJd3VTTUNQj3 t8iNaKuqvZ2OGllv4mmH2HfDDs0yYdBAQ01jFqczqws56wc5LHWHRBGt7 qqGJNCik1y5B7Gr27bny0IOA4ccAKhhiGWAJB4SQC7jBzkuOL+FjpG+rs P4/SNdUGZANq9UkRAoo15yGehnrTSmZT0Q4GWwIEobWZZRRehIeG4b7kn fegUPD0X5lRZAM3V5E59ahBJZirMAqLHqLeP/294y+t418rAiJXX2JLKW Htd/I5shzed4Hfv6CYjVSuooI5ohHffD0a5utuMcPyPkVRRBcOicHvEjk w==; X-IronPort-AV: E=McAfee;i="6400,9594,10371"; a="363210107" X-IronPort-AV: E=Sophos;i="5.91,286,1647327600"; d="scan'208";a="363210107" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Jun 2022 05:31:43 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.91,286,1647327600"; d="scan'208";a="670529749" Received: from 984fee00a4c6.jf.intel.com ([10.165.58.231]) by FMSMGA003.fm.intel.com with ESMTP; 08 Jun 2022 05:31:42 -0700 From: Yi Liu To: alex.williamson@redhat.com, cohuck@redhat.com, qemu-devel@nongnu.org Cc: david@gibson.dropbear.id.au, thuth@redhat.com, farman@linux.ibm.com, mjrosato@linux.ibm.com, akrowiak@linux.ibm.com, pasic@linux.ibm.com, jjherne@linux.ibm.com, jasowang@redhat.com, kvm@vger.kernel.org, jgg@nvidia.com, nicolinc@nvidia.com, eric.auger@redhat.com, eric.auger.pro@gmail.com, kevin.tian@intel.com, yi.l.liu@intel.com, chao.p.peng@intel.com, yi.y.sun@intel.com, peterx@redhat.com, shameerali.kolothum.thodi@huawei.com, zhangfei.gao@linaro.org, berrange@redhat.com Subject: [RFC v2 04/15] vfio: Add base container Date: Wed, 8 Jun 2022 05:31:28 -0700 Message-Id: <20220608123139.19356-5-yi.l.liu@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220608123139.19356-1-yi.l.liu@intel.com> References: <20220608123139.19356-1-yi.l.liu@intel.com> MIME-Version: 1.0 Received-SPF: pass client-ip=192.55.52.43; envelope-from=yi.l.liu@intel.com; helo=mga05.intel.com X-Spam_score_int: -44 X-Spam_score: -4.5 X-Spam_bar: ---- X-Spam_report: (-4.5 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" Abstract the VFIOContainer to be a base object. It is supposed to be embedded by legacy VFIO container and later on, into the new iommufd based container. The base container implements generic code such as code related to memory_listener and address space management. The VFIOContainerOps implements callbacks that depend on the kernel user space being used. 'as.c' only manipulates the base container with wrapper functions that calls the functions defined in VFIOContainerOps. Existing 'container.c' code is converted to implement the legacy container ops functions. Existing migration code only works with the legacy container. Also 'spapr.c' isn't BE agnostic. Below is the base container. It's named as VFIOContainer, old VFIOContainer is replaced with VFIOLegacyContainer. struct VFIOContainer { VFIOContainerOps *ops; VFIOAddressSpace *space; MemoryListener listener; Error *error; bool initialized; bool dirty_pages_supported; uint64_t dirty_pgsizes; uint64_t max_dirty_bitmap_size; unsigned long pgsizes; unsigned int dma_max_mappings; QLIST_HEAD(, VFIOGuestIOMMU) giommu_list; QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list; QLIST_HEAD(, VFIORamDiscardListener) vrdl_list; QLIST_ENTRY(VFIOContainer) next; }; struct VFIOLegacyContainer { VFIOContainer bcontainer; int fd; /* /dev/vfio/vfio, empowered by the attached groups */ MemoryListener prereg_listener; unsigned iommu_type; QLIST_HEAD(, VFIOGroup) group_list; }; Co-authored-by: Eric Auger Signed-off-by: Eric Auger Signed-off-by: Yi Liu --- v1 -> v2: - Remove QOM for VFIOContainer object, use callback instead per David's comment. - Rename container-obj.c/.h to be container-base.c/.h --- hw/vfio/as.c | 48 +++--- hw/vfio/container-base.c | 154 +++++++++++++++++++ hw/vfio/container.c | 211 +++++++++++++++----------- hw/vfio/meson.build | 1 + hw/vfio/migration.c | 5 +- hw/vfio/pci.c | 4 +- hw/vfio/spapr.c | 22 +-- include/hw/vfio/vfio-common.h | 79 ++-------- include/hw/vfio/vfio-container-base.h | 136 +++++++++++++++++ 9 files changed, 470 insertions(+), 190 deletions(-) create mode 100644 hw/vfio/container-base.c create mode 100644 include/hw/vfio/vfio-container-base.h diff --git a/hw/vfio/as.c b/hw/vfio/as.c index 01eb60105b..ec58914001 100644 --- a/hw/vfio/as.c +++ b/hw/vfio/as.c @@ -216,9 +216,9 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) * of vaddr will always be there, even if the memory object is * destroyed and its backing memory munmap-ed. */ - ret = vfio_dma_map(container, iova, - iotlb->addr_mask + 1, vaddr, - read_only); + ret = vfio_container_dma_map(container, iova, + iotlb->addr_mask + 1, vaddr, + read_only); if (ret) { error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", " "0x%"HWADDR_PRIx", %p) = %d (%m)", @@ -226,7 +226,8 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) iotlb->addr_mask + 1, vaddr, ret); } } else { - ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1, iotlb); + ret = vfio_container_dma_unmap(container, iova, + iotlb->addr_mask + 1, iotlb); if (ret) { error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", " "0x%"HWADDR_PRIx") = %d (%m)", @@ -243,12 +244,13 @@ static void vfio_ram_discard_notify_discard(RamDiscardListener *rdl, { VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener, listener); + VFIOContainer *container = vrdl->container; const hwaddr size = int128_get64(section->size); const hwaddr iova = section->offset_within_address_space; int ret; /* Unmap with a single call. */ - ret = vfio_dma_unmap(vrdl->container, iova, size , NULL); + ret = vfio_container_dma_unmap(container, iova, size , NULL); if (ret) { error_report("%s: vfio_dma_unmap() failed: %s", __func__, strerror(-ret)); @@ -260,6 +262,7 @@ static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl, { VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener, listener); + VFIOContainer *container = vrdl->container; const hwaddr end = section->offset_within_region + int128_get64(section->size); hwaddr start, next, iova; @@ -278,8 +281,8 @@ static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl, section->offset_within_address_space; vaddr = memory_region_get_ram_ptr(section->mr) + start; - ret = vfio_dma_map(vrdl->container, iova, next - start, - vaddr, section->readonly); + ret = vfio_container_dma_map(container, iova, next - start, + vaddr, section->readonly); if (ret) { /* Rollback */ vfio_ram_discard_notify_discard(rdl, section); @@ -555,8 +558,8 @@ static void vfio_listener_region_add(MemoryListener *listener, } } - ret = vfio_dma_map(container, iova, int128_get64(llsize), - vaddr, section->readonly); + ret = vfio_container_dma_map(container, iova, int128_get64(llsize), + vaddr, section->readonly); if (ret) { error_setg(&err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", " "0x%"HWADDR_PRIx", %p) = %d (%m)", @@ -681,7 +684,8 @@ static void vfio_listener_region_del(MemoryListener *listener, if (int128_eq(llsize, int128_2_64())) { /* The unmap ioctl doesn't accept a full 64-bit span. */ llsize = int128_rshift(llsize, 1); - ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL); + ret = vfio_container_dma_unmap(container, iova, + int128_get64(llsize), NULL); if (ret) { error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", " "0x%"HWADDR_PRIx") = %d (%m)", @@ -689,7 +693,8 @@ static void vfio_listener_region_del(MemoryListener *listener, } iova += int128_get64(llsize); } - ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL); + ret = vfio_container_dma_unmap(container, iova, + int128_get64(llsize), NULL); if (ret) { error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", " "0x%"HWADDR_PRIx") = %d (%m)", @@ -706,14 +711,14 @@ static void vfio_listener_log_global_start(MemoryListener *listener) { VFIOContainer *container = container_of(listener, VFIOContainer, listener); - vfio_set_dirty_page_tracking(container, true); + vfio_container_set_dirty_page_tracking(container, true); } static void vfio_listener_log_global_stop(MemoryListener *listener) { VFIOContainer *container = container_of(listener, VFIOContainer, listener); - vfio_set_dirty_page_tracking(container, false); + vfio_container_set_dirty_page_tracking(container, false); } typedef struct { @@ -742,8 +747,9 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL)) { int ret; - ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1, - translated_addr); + ret = vfio_container_get_dirty_bitmap(container, iova, + iotlb->addr_mask + 1, + translated_addr); if (ret) { error_report("vfio_iommu_map_dirty_notify(%p, 0x%"HWADDR_PRIx", " "0x%"HWADDR_PRIx") = %d (%m)", @@ -767,11 +773,13 @@ static int vfio_ram_discard_get_dirty_bitmap(MemoryRegionSection *section, * Sync the whole mapped region (spanning multiple individual mappings) * in one go. */ - return vfio_get_dirty_bitmap(vrdl->container, iova, size, ram_addr); + return vfio_container_get_dirty_bitmap(vrdl->container, iova, + size, ram_addr); } -static int vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainer *container, - MemoryRegionSection *section) +static int +vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainer *container, + MemoryRegionSection *section) { RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr); VFIORamDiscardListener *vrdl = NULL; @@ -835,7 +843,7 @@ static int vfio_sync_dirty_bitmap(VFIOContainer *container, ram_addr = memory_region_get_ram_addr(section->mr) + section->offset_within_region; - return vfio_get_dirty_bitmap(container, + return vfio_container_get_dirty_bitmap(container, REAL_HOST_PAGE_ALIGN(section->offset_within_address_space), int128_get64(section->size), ram_addr); } @@ -850,7 +858,7 @@ static void vfio_listener_log_sync(MemoryListener *listener, return; } - if (vfio_devices_all_dirty_tracking(container)) { + if (vfio_container_devices_all_dirty_tracking(container)) { vfio_sync_dirty_bitmap(container, section); } } diff --git a/hw/vfio/container-base.c b/hw/vfio/container-base.c new file mode 100644 index 0000000000..6aaf0e0faa --- /dev/null +++ b/hw/vfio/container-base.c @@ -0,0 +1,154 @@ +/* + * VFIO BASE CONTAINER + * + * Copyright (C) 2022 Intel Corporation. + * Copyright Red Hat, Inc. 2022 + * + * Authors: Yi Liu + * Eric Auger + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + + * You should have received a copy of the GNU General Public License along + * with this program; if not, see . + */ + +#include "qemu/osdep.h" +#include "qapi/error.h" +#include "qemu/error-report.h" +#include "hw/vfio/vfio-container-base.h" + +bool vfio_container_check_extension(VFIOContainer *container, + VFIOContainerFeature feat) +{ + if (!container->ops->check_extension) { + return false; + } + + return container->ops->check_extension(container, feat); +} + +int vfio_container_dma_map(VFIOContainer *container, + hwaddr iova, ram_addr_t size, + void *vaddr, bool readonly) +{ + if (!container->ops->dma_map) { + return -EINVAL; + } + + return container->ops->dma_map(container, iova, size, vaddr, readonly); +} + +int vfio_container_dma_unmap(VFIOContainer *container, + hwaddr iova, ram_addr_t size, + IOMMUTLBEntry *iotlb) +{ + if (!container->ops->dma_unmap) { + return -EINVAL; + } + + return container->ops->dma_unmap(container, iova, size, iotlb); +} + +void vfio_container_set_dirty_page_tracking(VFIOContainer *container, + bool start) +{ + if (!container->ops->set_dirty_page_tracking) { + return; + } + + container->ops->set_dirty_page_tracking(container, start); +} + +bool vfio_container_devices_all_dirty_tracking(VFIOContainer *container) +{ + if (!container->ops->devices_all_dirty_tracking) { + return false; + } + + return container->ops->devices_all_dirty_tracking(container); +} + +int vfio_container_get_dirty_bitmap(VFIOContainer *container, uint64_t iova, + uint64_t size, ram_addr_t ram_addr) +{ + if (!container->ops->get_dirty_bitmap) { + return -EINVAL; + } + + return container->ops->get_dirty_bitmap(container, iova, size, ram_addr); +} + +int vfio_container_add_section_window(VFIOContainer *container, + MemoryRegionSection *section, + Error **errp) +{ + if (!container->ops->add_window) { + return 0; + } + + return container->ops->add_window(container, section, errp); +} + +void vfio_container_del_section_window(VFIOContainer *container, + MemoryRegionSection *section) +{ + if (!container->ops->del_window) { + return; + } + + return container->ops->del_window(container, section); +} + +void vfio_container_init(VFIOContainer *container, + VFIOAddressSpace *space, + const VFIOContainerOps *ops) +{ + container->ops = ops; + container->space = space; + container->error = NULL; + container->dirty_pages_supported = false; + container->dma_max_mappings = 0; + QLIST_INIT(&container->giommu_list); + QLIST_INIT(&container->hostwin_list); + QLIST_INIT(&container->vrdl_list); +} + +void vfio_container_destroy(VFIOContainer *container) +{ + VFIORamDiscardListener *vrdl, *vrdl_tmp; + VFIOGuestIOMMU *giommu, *tmp; + VFIOHostDMAWindow *hostwin, *next; + + QLIST_SAFE_REMOVE(container, next); + + QLIST_FOREACH_SAFE(vrdl, &container->vrdl_list, next, vrdl_tmp) { + RamDiscardManager *rdm; + + rdm = memory_region_get_ram_discard_manager(vrdl->mr); + ram_discard_manager_unregister_listener(rdm, &vrdl->listener); + QLIST_REMOVE(vrdl, next); + g_free(vrdl); + } + + QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) { + memory_region_unregister_iommu_notifier( + MEMORY_REGION(giommu->iommu_mr), &giommu->n); + QLIST_REMOVE(giommu, giommu_next); + g_free(giommu); + } + + QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next, + next) { + QLIST_REMOVE(hostwin, hostwin_next); + g_free(hostwin); + } +} diff --git a/hw/vfio/container.c b/hw/vfio/container.c index 8ee7c9b980..dfc5183d5d 100644 --- a/hw/vfio/container.c +++ b/hw/vfio/container.c @@ -76,8 +76,11 @@ bool vfio_mig_active(void) return true; } -bool vfio_devices_all_dirty_tracking(VFIOContainer *container) +static bool vfio_devices_all_dirty_tracking(VFIOContainer *bcontainer) { + VFIOLegacyContainer *container = container_of(bcontainer, + VFIOLegacyContainer, + bcontainer); VFIOGroup *group; VFIODevice *vbasedev; MigrationState *ms = migrate_get_current(); @@ -103,7 +106,7 @@ bool vfio_devices_all_dirty_tracking(VFIOContainer *container) return true; } -bool vfio_devices_all_running_and_saving(VFIOContainer *container) +static bool vfio_devices_all_running_and_saving(VFIOLegacyContainer *container) { VFIOGroup *group; VFIODevice *vbasedev; @@ -132,10 +135,11 @@ bool vfio_devices_all_running_and_saving(VFIOContainer *container) return true; } -static int vfio_dma_unmap_bitmap(VFIOContainer *container, +static int vfio_dma_unmap_bitmap(VFIOLegacyContainer *container, hwaddr iova, ram_addr_t size, IOMMUTLBEntry *iotlb) { + VFIOContainer *bcontainer = &container->bcontainer; struct vfio_iommu_type1_dma_unmap *unmap; struct vfio_bitmap *bitmap; uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size(); @@ -159,7 +163,7 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container, bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) / BITS_PER_BYTE; - if (bitmap->size > container->max_dirty_bitmap_size) { + if (bitmap->size > bcontainer->max_dirty_bitmap_size) { error_report("UNMAP: Size of bitmap too big 0x%"PRIx64, (uint64_t)bitmap->size); ret = -E2BIG; @@ -189,10 +193,13 @@ unmap_exit: /* * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86 */ -int vfio_dma_unmap(VFIOContainer *container, - hwaddr iova, ram_addr_t size, - IOMMUTLBEntry *iotlb) +static int vfio_dma_unmap(VFIOContainer *bcontainer, + hwaddr iova, ram_addr_t size, + IOMMUTLBEntry *iotlb) { + VFIOLegacyContainer *container = container_of(bcontainer, + VFIOLegacyContainer, + bcontainer); struct vfio_iommu_type1_dma_unmap unmap = { .argsz = sizeof(unmap), .flags = 0, @@ -200,7 +207,7 @@ int vfio_dma_unmap(VFIOContainer *container, .size = size, }; - if (iotlb && container->dirty_pages_supported && + if (iotlb && bcontainer->dirty_pages_supported && vfio_devices_all_running_and_saving(container)) { return vfio_dma_unmap_bitmap(container, iova, size, iotlb); } @@ -221,7 +228,7 @@ int vfio_dma_unmap(VFIOContainer *container, if (errno == EINVAL && unmap.size && !(unmap.iova + unmap.size) && container->iommu_type == VFIO_TYPE1v2_IOMMU) { trace_vfio_dma_unmap_overflow_workaround(); - unmap.size -= 1ULL << ctz64(container->pgsizes); + unmap.size -= 1ULL << ctz64(bcontainer->pgsizes); continue; } error_report("VFIO_UNMAP_DMA failed: %s", strerror(errno)); @@ -231,9 +238,23 @@ int vfio_dma_unmap(VFIOContainer *container, return 0; } -int vfio_dma_map(VFIOContainer *container, hwaddr iova, - ram_addr_t size, void *vaddr, bool readonly) +static bool vfio_legacy_container_check_extension(VFIOContainer *bcontainer, + VFIOContainerFeature feat) { + switch (feat) { + case VFIO_FEAT_LIVE_MIGRATION: + return true; + default: + return false; + }; +} + +static int vfio_dma_map(VFIOContainer *bcontainer, hwaddr iova, + ram_addr_t size, void *vaddr, bool readonly) +{ + VFIOLegacyContainer *container = container_of(bcontainer, + VFIOLegacyContainer, + bcontainer); struct vfio_iommu_type1_dma_map map = { .argsz = sizeof(map), .flags = VFIO_DMA_MAP_FLAG_READ, @@ -252,7 +273,7 @@ int vfio_dma_map(VFIOContainer *container, hwaddr iova, * the VGA ROM space. */ if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 || - (errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 && + (errno == EBUSY && vfio_dma_unmap(bcontainer, iova, size, NULL) == 0 && ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) { return 0; } @@ -261,8 +282,11 @@ int vfio_dma_map(VFIOContainer *container, hwaddr iova, return -errno; } -void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start) +static void vfio_set_dirty_page_tracking(VFIOContainer *bcontainer, bool start) { + VFIOLegacyContainer *container = container_of(bcontainer, + VFIOLegacyContainer, + bcontainer); int ret; struct vfio_iommu_type1_dirty_bitmap dirty = { .argsz = sizeof(dirty), @@ -281,9 +305,12 @@ void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start) } } -int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova, - uint64_t size, ram_addr_t ram_addr) +static int vfio_get_dirty_bitmap(VFIOContainer *bcontainer, uint64_t iova, + uint64_t size, ram_addr_t ram_addr) { + VFIOLegacyContainer *container = container_of(bcontainer, + VFIOLegacyContainer, + bcontainer); struct vfio_iommu_type1_dirty_bitmap *dbitmap; struct vfio_iommu_type1_dirty_bitmap_get *range; uint64_t pages; @@ -333,18 +360,24 @@ err_out: return ret; } -static void vfio_listener_release(VFIOContainer *container) +static void vfio_listener_release(VFIOLegacyContainer *container) { - memory_listener_unregister(&container->listener); + VFIOContainer *bcontainer = &container->bcontainer; + + memory_listener_unregister(&bcontainer->listener); if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) { memory_listener_unregister(&container->prereg_listener); } } -int vfio_container_add_section_window(VFIOContainer *container, - MemoryRegionSection *section, - Error **errp) +static int +vfio_legacy_container_add_section_window(VFIOContainer *bcontainer, + MemoryRegionSection *section, + Error **errp) { + VFIOLegacyContainer *container = container_of(bcontainer, + VFIOLegacyContainer, + bcontainer); VFIOHostDMAWindow *hostwin; hwaddr pgsize = 0; int ret; @@ -354,7 +387,7 @@ int vfio_container_add_section_window(VFIOContainer *container, } /* For now intersections are not allowed, we may relax this later */ - QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) { + QLIST_FOREACH(hostwin, &bcontainer->hostwin_list, hostwin_next) { if (ranges_overlap(hostwin->min_iova, hostwin->max_iova - hostwin->min_iova + 1, section->offset_within_address_space, @@ -376,7 +409,7 @@ int vfio_container_add_section_window(VFIOContainer *container, return ret; } - vfio_host_win_add(container, section->offset_within_address_space, + vfio_host_win_add(bcontainer, section->offset_within_address_space, section->offset_within_address_space + int128_get64(section->size) - 1, pgsize); #ifdef CONFIG_KVM @@ -409,16 +442,21 @@ int vfio_container_add_section_window(VFIOContainer *container, return 0; } -void vfio_container_del_section_window(VFIOContainer *container, - MemoryRegionSection *section) +static void +vfio_legacy_container_del_section_window(VFIOContainer *bcontainer, + MemoryRegionSection *section) { + VFIOLegacyContainer *container = container_of(bcontainer, + VFIOLegacyContainer, + bcontainer); + if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) { return; } vfio_spapr_remove_window(container, section->offset_within_address_space); - if (vfio_host_win_del(container, + if (vfio_host_win_del(bcontainer, section->offset_within_address_space, section->offset_within_address_space + int128_get64(section->size) - 1) < 0) { @@ -505,7 +543,7 @@ static void vfio_kvm_device_del_group(VFIOGroup *group) /* * vfio_get_iommu_type - selects the richest iommu_type (v2 first) */ -static int vfio_get_iommu_type(VFIOContainer *container, +static int vfio_get_iommu_type(VFIOLegacyContainer *container, Error **errp) { int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU, @@ -521,7 +559,7 @@ static int vfio_get_iommu_type(VFIOContainer *container, return -EINVAL; } -static int vfio_init_container(VFIOContainer *container, int group_fd, +static int vfio_init_container(VFIOLegacyContainer *container, int group_fd, Error **errp) { int iommu_type, ret; @@ -556,7 +594,7 @@ static int vfio_init_container(VFIOContainer *container, int group_fd, return 0; } -static int vfio_get_iommu_info(VFIOContainer *container, +static int vfio_get_iommu_info(VFIOLegacyContainer *container, struct vfio_iommu_type1_info **info) { @@ -600,11 +638,12 @@ vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id) return NULL; } -static void vfio_get_iommu_info_migration(VFIOContainer *container, - struct vfio_iommu_type1_info *info) +static void vfio_get_iommu_info_migration(VFIOLegacyContainer *container, + struct vfio_iommu_type1_info *info) { struct vfio_info_cap_header *hdr; struct vfio_iommu_type1_info_cap_migration *cap_mig; + VFIOContainer *bcontainer = &container->bcontainer; hdr = vfio_get_iommu_info_cap(info, VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION); if (!hdr) { @@ -619,13 +658,14 @@ static void vfio_get_iommu_info_migration(VFIOContainer *container, * qemu_real_host_page_size to mark those dirty. */ if (cap_mig->pgsize_bitmap & qemu_real_host_page_size()) { - container->dirty_pages_supported = true; - container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size; - container->dirty_pgsizes = cap_mig->pgsize_bitmap; + bcontainer->dirty_pages_supported = true; + bcontainer->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size; + bcontainer->dirty_pgsizes = cap_mig->pgsize_bitmap; } } -static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state) +static int +vfio_ram_block_discard_disable(VFIOLegacyContainer *container, bool state) { switch (container->iommu_type) { case VFIO_TYPE1v2_IOMMU: @@ -651,7 +691,8 @@ static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state) static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, Error **errp) { - VFIOContainer *container; + VFIOContainer *bcontainer; + VFIOLegacyContainer *container; int ret, fd; VFIOAddressSpace *space; @@ -688,7 +729,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, * details once we know which type of IOMMU we are using. */ - QLIST_FOREACH(container, &space->containers, next) { + QLIST_FOREACH(bcontainer, &space->containers, next) { + container = container_of(bcontainer, VFIOLegacyContainer, bcontainer); if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) { ret = vfio_ram_block_discard_disable(container, true); if (ret) { @@ -724,14 +766,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, } container = g_malloc0(sizeof(*container)); - container->space = space; container->fd = fd; - container->error = NULL; - container->dirty_pages_supported = false; - container->dma_max_mappings = 0; - QLIST_INIT(&container->giommu_list); - QLIST_INIT(&container->hostwin_list); - QLIST_INIT(&container->vrdl_list); + bcontainer = &container->bcontainer; + vfio_container_init(bcontainer, space, &legacy_container_ops); ret = vfio_init_container(container, group->fd, errp); if (ret) { @@ -763,13 +800,13 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, /* Assume 4k IOVA page size */ info->iova_pgsizes = 4096; } - vfio_host_win_add(container, 0, (hwaddr)-1, info->iova_pgsizes); - container->pgsizes = info->iova_pgsizes; + vfio_host_win_add(bcontainer, 0, (hwaddr)-1, info->iova_pgsizes); + bcontainer->pgsizes = info->iova_pgsizes; /* The default in the kernel ("dma_entry_limit") is 65535. */ - container->dma_max_mappings = 65535; + bcontainer->dma_max_mappings = 65535; if (!ret) { - vfio_get_info_dma_avail(info, &container->dma_max_mappings); + vfio_get_info_dma_avail(info, &bcontainer->dma_max_mappings); vfio_get_iommu_info_migration(container, info); } g_free(info); @@ -798,10 +835,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, memory_listener_register(&container->prereg_listener, &address_space_memory); - if (container->error) { + if (bcontainer->error) { memory_listener_unregister(&container->prereg_listener); ret = -1; - error_propagate_prepend(errp, container->error, + error_propagate_prepend(errp, bcontainer->error, "RAM memory listener initialization failed: "); goto enable_discards_exit; } @@ -820,7 +857,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, } if (v2) { - container->pgsizes = info.ddw.pgsizes; + bcontainer->pgsizes = info.ddw.pgsizes; /* * There is a default window in just created container. * To make region_add/del simpler, we better remove this @@ -835,8 +872,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, } } else { /* The default table uses 4K pages */ - container->pgsizes = 0x1000; - vfio_host_win_add(container, info.dma32_window_start, + bcontainer->pgsizes = 0x1000; + vfio_host_win_add(bcontainer, info.dma32_window_start, info.dma32_window_start + info.dma32_window_size - 1, 0x1000); @@ -847,28 +884,28 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, vfio_kvm_device_add_group(group); QLIST_INIT(&container->group_list); - QLIST_INSERT_HEAD(&space->containers, container, next); + QLIST_INSERT_HEAD(&space->containers, bcontainer, next); group->container = container; QLIST_INSERT_HEAD(&container->group_list, group, container_next); - container->listener = vfio_memory_listener; + bcontainer->listener = vfio_memory_listener; - memory_listener_register(&container->listener, container->space->as); + memory_listener_register(&bcontainer->listener, bcontainer->space->as); - if (container->error) { + if (bcontainer->error) { ret = -1; - error_propagate_prepend(errp, container->error, + error_propagate_prepend(errp, bcontainer->error, "memory listener initialization failed: "); goto listener_release_exit; } - container->initialized = true; + bcontainer->initialized = true; return 0; listener_release_exit: QLIST_REMOVE(group, container_next); - QLIST_REMOVE(container, next); + QLIST_REMOVE(bcontainer, next); vfio_kvm_device_del_group(group); vfio_listener_release(container); @@ -889,7 +926,8 @@ put_space_exit: static void vfio_disconnect_container(VFIOGroup *group) { - VFIOContainer *container = group->container; + VFIOLegacyContainer *container = group->container; + VFIOContainer *bcontainer = &container->bcontainer; QLIST_REMOVE(group, container_next); group->container = NULL; @@ -909,25 +947,9 @@ static void vfio_disconnect_container(VFIOGroup *group) } if (QLIST_EMPTY(&container->group_list)) { - VFIOAddressSpace *space = container->space; - VFIOGuestIOMMU *giommu, *tmp; - VFIOHostDMAWindow *hostwin, *next; - - QLIST_REMOVE(container, next); - - QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) { - memory_region_unregister_iommu_notifier( - MEMORY_REGION(giommu->iommu_mr), &giommu->n); - QLIST_REMOVE(giommu, giommu_next); - g_free(giommu); - } - - QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next, - next) { - QLIST_REMOVE(hostwin, hostwin_next); - g_free(hostwin); - } + VFIOAddressSpace *space = bcontainer->space; + vfio_container_destroy(bcontainer); trace_vfio_disconnect_container(container->fd); close(container->fd); g_free(container); @@ -939,13 +961,15 @@ static void vfio_disconnect_container(VFIOGroup *group) VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp) { VFIOGroup *group; + VFIOContainer *bcontainer; char path[32]; struct vfio_group_status status = { .argsz = sizeof(status) }; QLIST_FOREACH(group, &vfio_group_list, next) { if (group->groupid == groupid) { /* Found it. Now is it already in the right context? */ - if (group->container->space->as == as) { + bcontainer = &group->container->bcontainer; + if (bcontainer->space->as == as) { return group; } else { error_setg(errp, "group %d used in multiple address spaces", @@ -1098,7 +1122,7 @@ void vfio_put_base_device(VFIODevice *vbasedev) /* * Interfaces for IBM EEH (Enhanced Error Handling) */ -static bool vfio_eeh_container_ok(VFIOContainer *container) +static bool vfio_eeh_container_ok(VFIOLegacyContainer *container) { /* * As of 2016-03-04 (linux-4.5) the host kernel EEH/VFIO @@ -1126,7 +1150,7 @@ static bool vfio_eeh_container_ok(VFIOContainer *container) return true; } -static int vfio_eeh_container_op(VFIOContainer *container, uint32_t op) +static int vfio_eeh_container_op(VFIOLegacyContainer *container, uint32_t op) { struct vfio_eeh_pe_op pe_op = { .argsz = sizeof(pe_op), @@ -1149,19 +1173,21 @@ static int vfio_eeh_container_op(VFIOContainer *container, uint32_t op) return ret; } -static VFIOContainer *vfio_eeh_as_container(AddressSpace *as) +static VFIOLegacyContainer *vfio_eeh_as_container(AddressSpace *as) { VFIOAddressSpace *space = vfio_get_address_space(as); - VFIOContainer *container = NULL; + VFIOLegacyContainer *container = NULL; + VFIOContainer *bcontainer = NULL; if (QLIST_EMPTY(&space->containers)) { /* No containers to act on */ goto out; } - container = QLIST_FIRST(&space->containers); + bcontainer = QLIST_FIRST(&space->containers); + container = container_of(bcontainer, VFIOLegacyContainer, bcontainer); - if (QLIST_NEXT(container, next)) { + if (QLIST_NEXT(bcontainer, next)) { /* * We don't yet have logic to synchronize EEH state across * multiple containers. @@ -1177,17 +1203,28 @@ out: bool vfio_eeh_as_ok(AddressSpace *as) { - VFIOContainer *container = vfio_eeh_as_container(as); + VFIOLegacyContainer *container = vfio_eeh_as_container(as); return (container != NULL) && vfio_eeh_container_ok(container); } int vfio_eeh_as_op(AddressSpace *as, uint32_t op) { - VFIOContainer *container = vfio_eeh_as_container(as); + VFIOLegacyContainer *container = vfio_eeh_as_container(as); if (!container) { return -ENODEV; } return vfio_eeh_container_op(container, op); } + +const VFIOContainerOps legacy_container_ops = { + .dma_map = vfio_dma_map, + .dma_unmap = vfio_dma_unmap, + .devices_all_dirty_tracking = vfio_devices_all_dirty_tracking, + .set_dirty_page_tracking = vfio_set_dirty_page_tracking, + .get_dirty_bitmap = vfio_get_dirty_bitmap, + .add_window = vfio_legacy_container_add_section_window, + .del_window = vfio_legacy_container_del_section_window, + .check_extension = vfio_legacy_container_check_extension, +}; diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build index e3b6d6e2cb..067c3c3c5f 100644 --- a/hw/vfio/meson.build +++ b/hw/vfio/meson.build @@ -2,6 +2,7 @@ vfio_ss = ss.source_set() vfio_ss.add(files( 'common.c', 'as.c', + 'container-base.c', 'container.c', 'spapr.c', 'migration.c', diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index a6ad1f8945..6abbf54d14 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -857,11 +857,12 @@ int64_t vfio_mig_bytes_transferred(void) int vfio_migration_probe(VFIODevice *vbasedev, Error **errp) { - VFIOContainer *container = vbasedev->group->container; + VFIOLegacyContainer *container = vbasedev->group->container; struct vfio_region_info *info = NULL; int ret = -ENOTSUP; - if (!vbasedev->enable_migration || !container->dirty_pages_supported) { + if (!vbasedev->enable_migration || + !container->bcontainer.dirty_pages_supported) { goto add_blocker; } diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index 939dcc3d4a..a9973a6d6a 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -3144,7 +3144,9 @@ static void vfio_realize(PCIDevice *pdev, Error **errp) } } - if (!pdev->failover_pair_id) { + if (!pdev->failover_pair_id && + vfio_container_check_extension(&vbasedev->group->container->bcontainer, + VFIO_FEAT_LIVE_MIGRATION)) { ret = vfio_migration_probe(vbasedev, errp); if (ret) { error_report("%s: Migration disabled", vbasedev->name); diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c index 9ec1e95f6d..7647e7d492 100644 --- a/hw/vfio/spapr.c +++ b/hw/vfio/spapr.c @@ -39,8 +39,8 @@ static void *vfio_prereg_gpa_to_vaddr(MemoryRegionSection *section, hwaddr gpa) static void vfio_prereg_listener_region_add(MemoryListener *listener, MemoryRegionSection *section) { - VFIOContainer *container = container_of(listener, VFIOContainer, - prereg_listener); + VFIOLegacyContainer *container = container_of(listener, VFIOLegacyContainer, + prereg_listener); const hwaddr gpa = section->offset_within_address_space; hwaddr end; int ret; @@ -83,9 +83,9 @@ static void vfio_prereg_listener_region_add(MemoryListener *listener, * can gracefully fail. Runtime, there's not much we can do other * than throw a hardware error. */ - if (!container->initialized) { - if (!container->error) { - error_setg_errno(&container->error, -ret, + if (!container->bcontainer.initialized) { + if (!container->bcontainer.error) { + error_setg_errno(&container->bcontainer.error, -ret, "Memory registering failed"); } } else { @@ -97,8 +97,8 @@ static void vfio_prereg_listener_region_add(MemoryListener *listener, static void vfio_prereg_listener_region_del(MemoryListener *listener, MemoryRegionSection *section) { - VFIOContainer *container = container_of(listener, VFIOContainer, - prereg_listener); + VFIOLegacyContainer *container = container_of(listener, VFIOLegacyContainer, + prereg_listener); const hwaddr gpa = section->offset_within_address_space; hwaddr end; int ret; @@ -141,7 +141,7 @@ const MemoryListener vfio_prereg_listener = { .region_del = vfio_prereg_listener_region_del, }; -int vfio_spapr_create_window(VFIOContainer *container, +int vfio_spapr_create_window(VFIOLegacyContainer *container, MemoryRegionSection *section, hwaddr *pgsize) { @@ -159,13 +159,13 @@ int vfio_spapr_create_window(VFIOContainer *container, if (pagesize > rampagesize) { pagesize = rampagesize; } - pgmask = container->pgsizes & (pagesize | (pagesize - 1)); + pgmask = container->bcontainer.pgsizes & (pagesize | (pagesize - 1)); pagesize = pgmask ? (1ULL << (63 - clz64(pgmask))) : 0; if (!pagesize) { error_report("Host doesn't support page size 0x%"PRIx64 ", the supported mask is 0x%lx", memory_region_iommu_get_min_page_size(iommu_mr), - container->pgsizes); + container->bcontainer.pgsizes); return -EINVAL; } @@ -233,7 +233,7 @@ int vfio_spapr_create_window(VFIOContainer *container, return 0; } -int vfio_spapr_remove_window(VFIOContainer *container, +int vfio_spapr_remove_window(VFIOLegacyContainer *container, hwaddr offset_within_address_space) { struct vfio_iommu_spapr_tce_remove remove = { diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index 03ff7944cb..5cc0413b5c 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -30,10 +30,12 @@ #include #endif #include "sysemu/sysemu.h" +#include "hw/vfio/vfio-container-base.h" #define VFIO_MSG_PREFIX "vfio %s: " extern const MemoryListener vfio_memory_listener; +extern const VFIOContainerOps legacy_container_ops; enum { VFIO_DEVICE_TYPE_PCI = 0, @@ -70,58 +72,15 @@ typedef struct VFIOMigration { uint64_t pending_bytes; } VFIOMigration; -typedef struct VFIOAddressSpace { - AddressSpace *as; - QLIST_HEAD(, VFIOContainer) containers; - QLIST_ENTRY(VFIOAddressSpace) list; -} VFIOAddressSpace; - struct VFIOGroup; -typedef struct VFIOContainer { - VFIOAddressSpace *space; +typedef struct VFIOLegacyContainer { + VFIOContainer bcontainer; int fd; /* /dev/vfio/vfio, empowered by the attached groups */ - MemoryListener listener; MemoryListener prereg_listener; unsigned iommu_type; - Error *error; - bool initialized; - bool dirty_pages_supported; - uint64_t dirty_pgsizes; - uint64_t max_dirty_bitmap_size; - unsigned long pgsizes; - unsigned int dma_max_mappings; - QLIST_HEAD(, VFIOGuestIOMMU) giommu_list; - QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list; QLIST_HEAD(, VFIOGroup) group_list; - QLIST_HEAD(, VFIORamDiscardListener) vrdl_list; - QLIST_ENTRY(VFIOContainer) next; -} VFIOContainer; - -typedef struct VFIOGuestIOMMU { - VFIOContainer *container; - IOMMUMemoryRegion *iommu_mr; - hwaddr iommu_offset; - IOMMUNotifier n; - QLIST_ENTRY(VFIOGuestIOMMU) giommu_next; -} VFIOGuestIOMMU; - -typedef struct VFIORamDiscardListener { - VFIOContainer *container; - MemoryRegion *mr; - hwaddr offset_within_address_space; - hwaddr size; - uint64_t granularity; - RamDiscardListener listener; - QLIST_ENTRY(VFIORamDiscardListener) next; -} VFIORamDiscardListener; - -typedef struct VFIOHostDMAWindow { - hwaddr min_iova; - hwaddr max_iova; - uint64_t iova_pgsizes; - QLIST_ENTRY(VFIOHostDMAWindow) hostwin_next; -} VFIOHostDMAWindow; +} VFIOLegacyContainer; typedef struct VFIODeviceOps VFIODeviceOps; @@ -159,7 +118,7 @@ struct VFIODeviceOps { typedef struct VFIOGroup { int fd; int groupid; - VFIOContainer *container; + VFIOLegacyContainer *container; QLIST_HEAD(, VFIODevice) device_list; QLIST_ENTRY(VFIOGroup) next; QLIST_ENTRY(VFIOGroup) container_next; @@ -192,31 +151,13 @@ typedef struct VFIODisplay { } dmabuf; } VFIODisplay; -void vfio_host_win_add(VFIOContainer *container, +void vfio_host_win_add(VFIOContainer *bcontainer, hwaddr min_iova, hwaddr max_iova, uint64_t iova_pgsizes); -int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova, +int vfio_host_win_del(VFIOContainer *bcontainer, hwaddr min_iova, hwaddr max_iova); VFIOAddressSpace *vfio_get_address_space(AddressSpace *as); void vfio_put_address_space(VFIOAddressSpace *space); -bool vfio_devices_all_running_and_saving(VFIOContainer *container); -bool vfio_devices_all_dirty_tracking(VFIOContainer *container); - -/* container->fd */ -int vfio_dma_unmap(VFIOContainer *container, - hwaddr iova, ram_addr_t size, - IOMMUTLBEntry *iotlb); -int vfio_dma_map(VFIOContainer *container, hwaddr iova, - ram_addr_t size, void *vaddr, bool readonly); -void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start); -int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova, - uint64_t size, ram_addr_t ram_addr); - -int vfio_container_add_section_window(VFIOContainer *container, - MemoryRegionSection *section, - Error **errp); -void vfio_container_del_section_window(VFIOContainer *container, - MemoryRegionSection *section); void vfio_put_base_device(VFIODevice *vbasedev); void vfio_disable_irqindex(VFIODevice *vbasedev, int index); @@ -263,10 +204,10 @@ vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id); #endif extern const MemoryListener vfio_prereg_listener; -int vfio_spapr_create_window(VFIOContainer *container, +int vfio_spapr_create_window(VFIOLegacyContainer *container, MemoryRegionSection *section, hwaddr *pgsize); -int vfio_spapr_remove_window(VFIOContainer *container, +int vfio_spapr_remove_window(VFIOLegacyContainer *container, hwaddr offset_within_address_space); int vfio_migration_probe(VFIODevice *vbasedev, Error **errp); diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h new file mode 100644 index 0000000000..fa5d7fcb85 --- /dev/null +++ b/include/hw/vfio/vfio-container-base.h @@ -0,0 +1,136 @@ +/* + * VFIO BASE CONTAINER + * + * Copyright (C) 2022 Intel Corporation. + * Copyright Red Hat, Inc. 2022 + * + * Authors: Yi Liu + * Eric Auger + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + + * You should have received a copy of the GNU General Public License along + * with this program; if not, see . + */ + +#ifndef HW_VFIO_VFIO_BASE_CONTAINER_H +#define HW_VFIO_VFIO_BASE_CONTAINER_H + +#include "exec/memory.h" +#ifndef CONFIG_USER_ONLY +#include "exec/hwaddr.h" +#endif + +typedef enum VFIOContainerFeature { + VFIO_FEAT_LIVE_MIGRATION, +} VFIOContainerFeature; + +typedef struct VFIOContainer VFIOContainer; + +typedef struct VFIOAddressSpace { + AddressSpace *as; + QLIST_HEAD(, VFIOContainer) containers; + QLIST_ENTRY(VFIOAddressSpace) list; +} VFIOAddressSpace; + +typedef struct VFIOGuestIOMMU { + VFIOContainer *container; + IOMMUMemoryRegion *iommu_mr; + hwaddr iommu_offset; + IOMMUNotifier n; + QLIST_ENTRY(VFIOGuestIOMMU) giommu_next; +} VFIOGuestIOMMU; + +typedef struct VFIORamDiscardListener { + VFIOContainer *container; + MemoryRegion *mr; + hwaddr offset_within_address_space; + hwaddr size; + uint64_t granularity; + RamDiscardListener listener; + QLIST_ENTRY(VFIORamDiscardListener) next; +} VFIORamDiscardListener; + +typedef struct VFIOHostDMAWindow { + hwaddr min_iova; + hwaddr max_iova; + uint64_t iova_pgsizes; + QLIST_ENTRY(VFIOHostDMAWindow) hostwin_next; +} VFIOHostDMAWindow; + +typedef struct VFIOContainerOps { + /* required */ + bool (*check_extension)(VFIOContainer *container, + VFIOContainerFeature feat); + int (*dma_map)(VFIOContainer *container, + hwaddr iova, ram_addr_t size, + void *vaddr, bool readonly); + int (*dma_unmap)(VFIOContainer *container, + hwaddr iova, ram_addr_t size, + IOMMUTLBEntry *iotlb); + /* migration feature */ + bool (*devices_all_dirty_tracking)(VFIOContainer *container); + void (*set_dirty_page_tracking)(VFIOContainer *container, bool start); + int (*get_dirty_bitmap)(VFIOContainer *container, uint64_t iova, + uint64_t size, ram_addr_t ram_addr); + + /* SPAPR specific */ + int (*add_window)(VFIOContainer *container, + MemoryRegionSection *section, + Error **errp); + void (*del_window)(VFIOContainer *container, + MemoryRegionSection *section); +} VFIOContainerOps; + +/* + * This is the base object for vfio container backends + */ +struct VFIOContainer { + const VFIOContainerOps *ops; + VFIOAddressSpace *space; + MemoryListener listener; + Error *error; + bool initialized; + bool dirty_pages_supported; + uint64_t dirty_pgsizes; + uint64_t max_dirty_bitmap_size; + unsigned long pgsizes; + unsigned int dma_max_mappings; + QLIST_HEAD(, VFIOGuestIOMMU) giommu_list; + QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list; + QLIST_HEAD(, VFIORamDiscardListener) vrdl_list; + QLIST_ENTRY(VFIOContainer) next; +}; + +bool vfio_container_check_extension(VFIOContainer *container, + VFIOContainerFeature feat); +int vfio_container_dma_map(VFIOContainer *container, + hwaddr iova, ram_addr_t size, + void *vaddr, bool readonly); +int vfio_container_dma_unmap(VFIOContainer *container, + hwaddr iova, ram_addr_t size, + IOMMUTLBEntry *iotlb); +bool vfio_container_devices_all_dirty_tracking(VFIOContainer *container); +void vfio_container_set_dirty_page_tracking(VFIOContainer *container, + bool start); +int vfio_container_get_dirty_bitmap(VFIOContainer *container, uint64_t iova, + uint64_t size, ram_addr_t ram_addr); +int vfio_container_add_section_window(VFIOContainer *container, + MemoryRegionSection *section, + Error **errp); +void vfio_container_del_section_window(VFIOContainer *container, + MemoryRegionSection *section); + +void vfio_container_init(VFIOContainer *container, + VFIOAddressSpace *space, + const VFIOContainerOps *ops); +void vfio_container_destroy(VFIOContainer *container); +#endif /* HW_VFIO_VFIO_BASE_CONTAINER_H */ From patchwork Wed Jun 8 12:31:29 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yi Liu X-Patchwork-Id: 12873451 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 36CAEC43334 for ; Wed, 8 Jun 2022 12:44:27 +0000 (UTC) Received: from localhost ([::1]:43924 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nyv2w-0005UP-6T for qemu-devel@archiver.kernel.org; Wed, 08 Jun 2022 08:44:26 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:57880) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nyuql-0000E7-2f for qemu-devel@nongnu.org; Wed, 08 Jun 2022 08:31:51 -0400 Received: from mga05.intel.com ([192.55.52.43]:56958) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nyuqi-0005ks-Vc for qemu-devel@nongnu.org; Wed, 08 Jun 2022 08:31:50 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1654691508; x=1686227508; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=OpRRvY8SRmH4ZeE0LHY8drQ9pIrGi/jNOxQ0+lVaXF8=; b=J07jUs7nUn0KZyFsCyWW5td31oxOg0MiC1WLr34x+ONDDGHmW8PyJ7B0 tkLpC245zqmd4EaJgewzA8hia3bCVBUXDy4e4RGHTNh9XfvUkVG9dzId5 QWvUF4mEHSoji6rsjXrU1q1GaGc4g7vbUz3niJmIBN/7M3QXiq3MzuiNg 5ngXxXhxRBtXdj3RR8ENVzvtBEdqBCfNx2IXpYrze1TP7NKd5huJPjy5A VcEefEWsFN1WPeOEAkIlkZ3a7smUPcFSksldUrGOZGdv6dkeOSLMsTuDH adrnsnhC1BIsc7YcqC4awUMD8OBxUv9d7huD42BRu0hCgnpXhK9L9ip5b g==; X-IronPort-AV: E=McAfee;i="6400,9594,10371"; a="363210120" X-IronPort-AV: E=Sophos;i="5.91,286,1647327600"; d="scan'208";a="363210120" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Jun 2022 05:31:44 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.91,286,1647327600"; d="scan'208";a="670529772" Received: from 984fee00a4c6.jf.intel.com ([10.165.58.231]) by FMSMGA003.fm.intel.com with ESMTP; 08 Jun 2022 05:31:43 -0700 From: Yi Liu To: alex.williamson@redhat.com, cohuck@redhat.com, qemu-devel@nongnu.org Cc: david@gibson.dropbear.id.au, thuth@redhat.com, farman@linux.ibm.com, mjrosato@linux.ibm.com, akrowiak@linux.ibm.com, pasic@linux.ibm.com, jjherne@linux.ibm.com, jasowang@redhat.com, kvm@vger.kernel.org, jgg@nvidia.com, nicolinc@nvidia.com, eric.auger@redhat.com, eric.auger.pro@gmail.com, kevin.tian@intel.com, yi.l.liu@intel.com, chao.p.peng@intel.com, yi.y.sun@intel.com, peterx@redhat.com, shameerali.kolothum.thodi@huawei.com, zhangfei.gao@linaro.org, berrange@redhat.com Subject: [RFC v2 05/15] vfio/container: Introduce vfio_[attach/detach]_device Date: Wed, 8 Jun 2022 05:31:29 -0700 Message-Id: <20220608123139.19356-6-yi.l.liu@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220608123139.19356-1-yi.l.liu@intel.com> References: <20220608123139.19356-1-yi.l.liu@intel.com> MIME-Version: 1.0 Received-SPF: pass client-ip=192.55.52.43; envelope-from=yi.l.liu@intel.com; helo=mga05.intel.com X-Spam_score_int: -44 X-Spam_score: -4.5 X-Spam_bar: ---- X-Spam_report: (-4.5 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" From: Eric Auger We want the VFIO devices to be able to use two different IOMMU callbacks, the legacy VFIO one and the new iommufd one. Introduce vfio_[attach/detach]_device which aim at hiding the underlying IOMMU backend (IOCTLs, datatypes, ...). Once vfio_attach_device completes, the device is attached to a security context and its fd can be used. Conversely When vfio_detach_device completes, the device has been detached to the security context. In this patch, only the vfio-pci device gets converted to use the new API. Subsequent patches will handle other devices. Signed-off-by: Eric Auger Signed-off-by: Yi Liu --- hw/vfio/container.c | 65 +++++++++++++++++++++++++++++++++++ hw/vfio/pci.c | 50 +++------------------------ include/hw/vfio/vfio-common.h | 2 ++ 3 files changed, 72 insertions(+), 45 deletions(-) diff --git a/hw/vfio/container.c b/hw/vfio/container.c index dfc5183d5d..74e6eeba74 100644 --- a/hw/vfio/container.c +++ b/hw/vfio/container.c @@ -1218,6 +1218,71 @@ int vfio_eeh_as_op(AddressSpace *as, uint32_t op) return vfio_eeh_container_op(container, op); } +static int vfio_device_groupid(VFIODevice *vbasedev, Error **errp) +{ + char *tmp, group_path[PATH_MAX], *group_name; + int ret, groupid; + ssize_t len; + + tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev); + len = readlink(tmp, group_path, sizeof(group_path)); + g_free(tmp); + + if (len <= 0 || len >= sizeof(group_path)) { + ret = len < 0 ? -errno : -ENAMETOOLONG; + error_setg_errno(errp, -ret, "no iommu_group found"); + return ret; + } + + group_path[len] = 0; + + group_name = basename(group_path); + if (sscanf(group_name, "%d", &groupid) != 1) { + error_setg_errno(errp, errno, "failed to read %s", group_path); + return -errno; + } + return groupid; +} + +int vfio_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp) +{ + int groupid = vfio_device_groupid(vbasedev, errp); + VFIODevice *vbasedev_iter; + VFIOGroup *group; + int ret; + + if (groupid < 0) { + return groupid; + } + + trace_vfio_realize(vbasedev->name, groupid); + group = vfio_get_group(groupid, as, errp); + if (!group) { + return -1; + } + + QLIST_FOREACH(vbasedev_iter, &group->device_list, next) { + if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) { + error_setg(errp, "device is already attached"); + vfio_put_group(group); + return -1; + } + } + ret = vfio_get_device(group, vbasedev->name, vbasedev, errp); + if (ret) { + vfio_put_group(group); + return -1; + } + + return 0; +} + +void vfio_detach_device(VFIODevice *vbasedev) +{ + vfio_put_base_device(vbasedev); + vfio_put_group(vbasedev->group); +} + const VFIOContainerOps legacy_container_ops = { .dma_map = vfio_dma_map, .dma_unmap = vfio_dma_unmap, diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index a9973a6d6a..9856d81819 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -2697,10 +2697,9 @@ static void vfio_populate_device(VFIOPCIDevice *vdev, Error **errp) static void vfio_put_device(VFIOPCIDevice *vdev) { - g_free(vdev->vbasedev.name); g_free(vdev->msix); - vfio_put_base_device(&vdev->vbasedev); + vfio_detach_device(&vdev->vbasedev); } static void vfio_err_notifier_handler(void *opaque) @@ -2847,13 +2846,9 @@ static void vfio_realize(PCIDevice *pdev, Error **errp) { VFIOPCIDevice *vdev = VFIO_PCI(pdev); VFIODevice *vbasedev = &vdev->vbasedev; - VFIODevice *vbasedev_iter; - VFIOGroup *group; - char *tmp, *subsys, group_path[PATH_MAX], *group_name; + char *tmp, *subsys; Error *err = NULL; - ssize_t len; struct stat st; - int groupid; int i, ret; bool is_mdev; @@ -2882,39 +2877,6 @@ static void vfio_realize(PCIDevice *pdev, Error **errp) vbasedev->type = VFIO_DEVICE_TYPE_PCI; vbasedev->dev = DEVICE(vdev); - tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev); - len = readlink(tmp, group_path, sizeof(group_path)); - g_free(tmp); - - if (len <= 0 || len >= sizeof(group_path)) { - error_setg_errno(errp, len < 0 ? errno : ENAMETOOLONG, - "no iommu_group found"); - goto error; - } - - group_path[len] = 0; - - group_name = basename(group_path); - if (sscanf(group_name, "%d", &groupid) != 1) { - error_setg_errno(errp, errno, "failed to read %s", group_path); - goto error; - } - - trace_vfio_realize(vbasedev->name, groupid); - - group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev), errp); - if (!group) { - goto error; - } - - QLIST_FOREACH(vbasedev_iter, &group->device_list, next) { - if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) { - error_setg(errp, "device is already attached"); - vfio_put_group(group); - goto error; - } - } - /* * Mediated devices *might* operate compatibly with discarding of RAM, but * we cannot know for certain, it depends on whether the mdev vendor driver @@ -2932,13 +2894,12 @@ static void vfio_realize(PCIDevice *pdev, Error **errp) if (vbasedev->ram_block_discard_allowed && !is_mdev) { error_setg(errp, "x-balloon-allowed only potentially compatible " "with mdev devices"); - vfio_put_group(group); goto error; } - ret = vfio_get_device(group, vbasedev->name, vbasedev, errp); + ret = vfio_attach_device(vbasedev, + pci_device_iommu_address_space(pdev), errp); if (ret) { - vfio_put_group(group); goto error; } @@ -3167,12 +3128,12 @@ out_teardown: vfio_bars_exit(vdev); error: error_prepend(errp, VFIO_MSG_PREFIX, vbasedev->name); + vfio_detach_device(vbasedev); } static void vfio_instance_finalize(Object *obj) { VFIOPCIDevice *vdev = VFIO_PCI(obj); - VFIOGroup *group = vdev->vbasedev.group; vfio_display_finalize(vdev); vfio_bars_finalize(vdev); @@ -3186,7 +3147,6 @@ static void vfio_instance_finalize(Object *obj) * g_free(vdev->igd_opregion); */ vfio_put_device(vdev); - vfio_put_group(group); } static void vfio_exitfn(PCIDevice *pdev) diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index 5cc0413b5c..1b68cd8ece 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -181,6 +181,8 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp); void vfio_put_group(VFIOGroup *group); int vfio_get_device(VFIOGroup *group, const char *name, VFIODevice *vbasedev, Error **errp); +int vfio_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp); +void vfio_detach_device(VFIODevice *vbasedev); extern const MemoryRegionOps vfio_region_ops; typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList; From patchwork Wed Jun 8 12:31:30 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yi Liu X-Patchwork-Id: 12873449 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 73000C433EF for ; Wed, 8 Jun 2022 12:40:55 +0000 (UTC) Received: from localhost ([::1]:37722 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nyuzW-0000rj-Fq for qemu-devel@archiver.kernel.org; Wed, 08 Jun 2022 08:40:54 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:57946) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nyuqp-0000M9-CI for qemu-devel@nongnu.org; Wed, 08 Jun 2022 08:31:55 -0400 Received: from mga05.intel.com ([192.55.52.43]:56958) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nyuql-0005ks-F7 for qemu-devel@nongnu.org; Wed, 08 Jun 2022 08:31:54 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1654691511; x=1686227511; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=gM0xGzwg7MLnTWpYcUR+4zIK9VWgzL3tJ/HAYadjIXo=; b=dAgjbmh1lRSDcXnrvKHRmDV9q9Tpzqia2wER0b51eCpg88ZuyXRFDwpF paCW/U4lN1Bz5qbZk4CzxxeUAdBizxalwMWoyIJaSCw1V0/3F0i0OM7ML 5n+q4EGzhgQ92m+FJPa8Rv6WXnLB0VLfjCBNNDqn8pzlZJMDYKuPOX5bD V9Wskcg3iwXfVW7e4/j1+9efZ5IU6P0PItehyQZJM2gEjfMpT55LF0VQt Jj6pYNrZq14/NB3SFwN6Yj6N9HlXiviS423W0OEH4/8Mj7Ht4ZSFP/lJS 4iNmOvu4RAh4eLAsChqeBaMJio9KJuGJDCWQ+YvYNuruHJp+UnH0Ns+ih g==; X-IronPort-AV: E=McAfee;i="6400,9594,10371"; a="363210127" X-IronPort-AV: E=Sophos;i="5.91,286,1647327600"; d="scan'208";a="363210127" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Jun 2022 05:31:44 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.91,286,1647327600"; d="scan'208";a="670529786" Received: from 984fee00a4c6.jf.intel.com ([10.165.58.231]) by FMSMGA003.fm.intel.com with ESMTP; 08 Jun 2022 05:31:44 -0700 From: Yi Liu To: alex.williamson@redhat.com, cohuck@redhat.com, qemu-devel@nongnu.org Cc: david@gibson.dropbear.id.au, thuth@redhat.com, farman@linux.ibm.com, mjrosato@linux.ibm.com, akrowiak@linux.ibm.com, pasic@linux.ibm.com, jjherne@linux.ibm.com, jasowang@redhat.com, kvm@vger.kernel.org, jgg@nvidia.com, nicolinc@nvidia.com, eric.auger@redhat.com, eric.auger.pro@gmail.com, kevin.tian@intel.com, yi.l.liu@intel.com, chao.p.peng@intel.com, yi.y.sun@intel.com, peterx@redhat.com, shameerali.kolothum.thodi@huawei.com, zhangfei.gao@linaro.org, berrange@redhat.com Subject: [RFC v2 06/15] vfio/platform: Use vfio_[attach/detach]_device Date: Wed, 8 Jun 2022 05:31:30 -0700 Message-Id: <20220608123139.19356-7-yi.l.liu@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220608123139.19356-1-yi.l.liu@intel.com> References: <20220608123139.19356-1-yi.l.liu@intel.com> MIME-Version: 1.0 Received-SPF: pass client-ip=192.55.52.43; envelope-from=yi.l.liu@intel.com; helo=mga05.intel.com X-Spam_score_int: -44 X-Spam_score: -4.5 X-Spam_bar: ---- X-Spam_report: (-4.5 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" From: Eric Auger Let the vfio-platform device use vfio_attach_device() and vfio_detach_device(), hence hiding the details of the used IOMMU backend. Signed-off-by: Eric Auger Signed-off-by: Yi Liu --- hw/vfio/platform.c | 42 ++---------------------------------------- 1 file changed, 2 insertions(+), 40 deletions(-) diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c index 5af73f9287..3bcdc20667 100644 --- a/hw/vfio/platform.c +++ b/hw/vfio/platform.c @@ -529,12 +529,7 @@ static VFIODeviceOps vfio_platform_ops = { */ static int vfio_base_device_init(VFIODevice *vbasedev, Error **errp) { - VFIOGroup *group; - VFIODevice *vbasedev_iter; - char *tmp, group_path[PATH_MAX], *group_name; - ssize_t len; struct stat st; - int groupid; int ret; /* @sysfsdev takes precedence over @host */ @@ -557,47 +552,14 @@ static int vfio_base_device_init(VFIODevice *vbasedev, Error **errp) return -errno; } - tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev); - len = readlink(tmp, group_path, sizeof(group_path)); - g_free(tmp); - - if (len < 0 || len >= sizeof(group_path)) { - ret = len < 0 ? -errno : -ENAMETOOLONG; - error_setg_errno(errp, -ret, "no iommu_group found"); - return ret; - } - - group_path[len] = 0; - - group_name = basename(group_path); - if (sscanf(group_name, "%d", &groupid) != 1) { - error_setg_errno(errp, errno, "failed to read %s", group_path); - return -errno; - } - - trace_vfio_platform_base_device_init(vbasedev->name, groupid); - - group = vfio_get_group(groupid, &address_space_memory, errp); - if (!group) { - return -ENOENT; - } - - QLIST_FOREACH(vbasedev_iter, &group->device_list, next) { - if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) { - error_setg(errp, "device is already attached"); - vfio_put_group(group); - return -EBUSY; - } - } - ret = vfio_get_device(group, vbasedev->name, vbasedev, errp); + ret = vfio_attach_device(vbasedev, &address_space_memory, errp); if (ret) { - vfio_put_group(group); return ret; } ret = vfio_populate_device(vbasedev, errp); if (ret) { - vfio_put_group(group); + vfio_detach_device(vbasedev); } return ret; From patchwork Wed Jun 8 12:31:31 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yi Liu X-Patchwork-Id: 12873455 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id B65CDC43334 for ; Wed, 8 Jun 2022 12:48:48 +0000 (UTC) Received: from localhost ([::1]:50694 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nyv79-0002UY-R9 for qemu-devel@archiver.kernel.org; Wed, 08 Jun 2022 08:48:47 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:57990) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nyuqr-0000OL-FT for qemu-devel@nongnu.org; Wed, 08 Jun 2022 08:31:57 -0400 Received: from mga05.intel.com ([192.55.52.43]:56952) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nyuqm-0005kl-Np for qemu-devel@nongnu.org; Wed, 08 Jun 2022 08:31:56 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1654691512; x=1686227512; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=k4VcpDgjB2tGSrAgzMFAbDoDoLC4lC0u8qknfORhCBg=; b=liqOcruGz6UzKEqTx/fd/sVR/9dWLCkXXDdhFq+yJBPlddoS0DDFOiUq Q+bt1Ic4G3kEskJ1W6FeWtF/lXweAQCybYpftbhwvXgmNnNXF3feRIHJu G0y7hYXDL+Hdhpx34jjV4U5Vd6/1wFRBYlaZO//tVOwQPg0LmPkp2JfYG G7A6+fIIS9mvMWd0BwL0TmL9Vkju9l6kftTczUxvi6UoBFecYPkIghaFH Nx7VgZehXntx5f/i02tK+Nz/hbNiDh8wsNANxGJaPj2jDdkrLZ+emVrM8 h9G72pXpIIqmvixabhJsPVWNyEbe6Q3EoSpvtMaM6NgqSqZ/EfQ7uO1ID w==; X-IronPort-AV: E=McAfee;i="6400,9594,10371"; a="363210135" X-IronPort-AV: E=Sophos;i="5.91,286,1647327600"; d="scan'208";a="363210135" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Jun 2022 05:31:45 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.91,286,1647327600"; d="scan'208";a="670529803" Received: from 984fee00a4c6.jf.intel.com ([10.165.58.231]) by FMSMGA003.fm.intel.com with ESMTP; 08 Jun 2022 05:31:44 -0700 From: Yi Liu To: alex.williamson@redhat.com, cohuck@redhat.com, qemu-devel@nongnu.org Cc: david@gibson.dropbear.id.au, thuth@redhat.com, farman@linux.ibm.com, mjrosato@linux.ibm.com, akrowiak@linux.ibm.com, pasic@linux.ibm.com, jjherne@linux.ibm.com, jasowang@redhat.com, kvm@vger.kernel.org, jgg@nvidia.com, nicolinc@nvidia.com, eric.auger@redhat.com, eric.auger.pro@gmail.com, kevin.tian@intel.com, yi.l.liu@intel.com, chao.p.peng@intel.com, yi.y.sun@intel.com, peterx@redhat.com, shameerali.kolothum.thodi@huawei.com, zhangfei.gao@linaro.org, berrange@redhat.com Subject: [RFC v2 07/15] vfio/ap: Use vfio_[attach/detach]_device Date: Wed, 8 Jun 2022 05:31:31 -0700 Message-Id: <20220608123139.19356-8-yi.l.liu@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220608123139.19356-1-yi.l.liu@intel.com> References: <20220608123139.19356-1-yi.l.liu@intel.com> MIME-Version: 1.0 Received-SPF: pass client-ip=192.55.52.43; envelope-from=yi.l.liu@intel.com; helo=mga05.intel.com X-Spam_score_int: -44 X-Spam_score: -4.5 X-Spam_bar: ---- X-Spam_report: (-4.5 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" From: Eric Auger Let the vfio-ap device use vfio_attach_device() and vfio_detach_device(), hence hiding the details of the used IOMMU backend. Signed-off-by: Eric Auger Signed-off-by: Yi Liu --- hw/vfio/ap.c | 62 ++++++++-------------------------------------------- 1 file changed, 9 insertions(+), 53 deletions(-) diff --git a/hw/vfio/ap.c b/hw/vfio/ap.c index e0dd561e85..286ac638e5 100644 --- a/hw/vfio/ap.c +++ b/hw/vfio/ap.c @@ -50,58 +50,17 @@ struct VFIODeviceOps vfio_ap_ops = { .vfio_compute_needs_reset = vfio_ap_compute_needs_reset, }; -static void vfio_ap_put_device(VFIOAPDevice *vapdev) -{ - g_free(vapdev->vdev.name); - vfio_put_base_device(&vapdev->vdev); -} - -static VFIOGroup *vfio_ap_get_group(VFIOAPDevice *vapdev, Error **errp) -{ - GError *gerror = NULL; - char *symlink, *group_path; - int groupid; - - symlink = g_strdup_printf("%s/iommu_group", vapdev->vdev.sysfsdev); - group_path = g_file_read_link(symlink, &gerror); - g_free(symlink); - - if (!group_path) { - error_setg(errp, "%s: no iommu_group found for %s: %s", - TYPE_VFIO_AP_DEVICE, vapdev->vdev.sysfsdev, gerror->message); - g_error_free(gerror); - return NULL; - } - - if (sscanf(basename(group_path), "%d", &groupid) != 1) { - error_setg(errp, "vfio: failed to read %s", group_path); - g_free(group_path); - return NULL; - } - - g_free(group_path); - - return vfio_get_group(groupid, &address_space_memory, errp); -} - static void vfio_ap_realize(DeviceState *dev, Error **errp) { - int ret; - char *mdevid; - VFIOGroup *vfio_group; APDevice *apdev = AP_DEVICE(dev); VFIOAPDevice *vapdev = VFIO_AP_DEVICE(apdev); + VFIODevice *vbasedev = &vapdev->vdev; + int ret; - vfio_group = vfio_ap_get_group(vapdev, errp); - if (!vfio_group) { - return; - } - - vapdev->vdev.ops = &vfio_ap_ops; - vapdev->vdev.type = VFIO_DEVICE_TYPE_AP; - mdevid = basename(vapdev->vdev.sysfsdev); - vapdev->vdev.name = g_strdup_printf("%s", mdevid); - vapdev->vdev.dev = dev; + vbasedev->name = g_path_get_basename(vbasedev->sysfsdev); + vbasedev->ops = &vfio_ap_ops; + vbasedev->type = VFIO_DEVICE_TYPE_AP; + vbasedev->dev = dev; /* * vfio-ap devices operate in a way compatible with discarding of @@ -111,7 +70,7 @@ static void vfio_ap_realize(DeviceState *dev, Error **errp) */ vapdev->vdev.ram_block_discard_allowed = true; - ret = vfio_get_device(vfio_group, mdevid, &vapdev->vdev, errp); + ret = vfio_attach_device(vbasedev, &address_space_memory, errp); if (ret) { goto out_get_dev_err; } @@ -119,18 +78,15 @@ static void vfio_ap_realize(DeviceState *dev, Error **errp) return; out_get_dev_err: - vfio_ap_put_device(vapdev); - vfio_put_group(vfio_group); + vfio_detach_device(vbasedev); } static void vfio_ap_unrealize(DeviceState *dev) { APDevice *apdev = AP_DEVICE(dev); VFIOAPDevice *vapdev = VFIO_AP_DEVICE(apdev); - VFIOGroup *group = vapdev->vdev.group; - vfio_ap_put_device(vapdev); - vfio_put_group(group); + vfio_detach_device(&vapdev->vdev); } static Property vfio_ap_properties[] = { From patchwork Wed Jun 8 12:31:32 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yi Liu X-Patchwork-Id: 12873452 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 0A29CC433EF for ; Wed, 8 Jun 2022 12:44:28 +0000 (UTC) Received: from localhost ([::1]:43896 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nyv2x-0005TI-0S for qemu-devel@archiver.kernel.org; Wed, 08 Jun 2022 08:44:27 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:58030) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nyuqt-0000RA-5x for qemu-devel@nongnu.org; Wed, 08 Jun 2022 08:31:59 -0400 Received: from mga05.intel.com ([192.55.52.43]:56947) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nyuqo-0005ka-1D for qemu-devel@nongnu.org; Wed, 08 Jun 2022 08:31:57 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1654691514; x=1686227514; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=dBhh34Fn2hj15XF0RFxWzOWwQ/vwct5pqhM//y9VRbY=; b=X9htwzEtvJx3EYL1JTlERvQBUJ3hRn1bAnhkzPy01fMbaY4q1JiaSBCk 4kumdHckyqXttCL/6Px0FgHveKiuMrmaxtmJftnG8ssXsCQl2mOtajemH Lm+ezvdnVG60bED1ZppD7lzZPMws7AcSJpbo3pmeIeeUzYdx6jeyO+h1b PNQUWSWuRT6Vz4NrFfbOSFRZvQmJfVtNNBp/dka4qHE6vr5Xr/BfHiYwS 4hteR7ifirYvgqKnIgzcEmGh0nFl3PpU9KfoF8NE1jSwMAtm6KYokuCM9 LnpbW+zXS7AEJD3fmRurg/r1z3qJg40iTpQmWrn5d1EkKYP9e61ZYtOIB A==; X-IronPort-AV: E=McAfee;i="6400,9594,10371"; a="363210142" X-IronPort-AV: E=Sophos;i="5.91,286,1647327600"; d="scan'208";a="363210142" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Jun 2022 05:31:46 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.91,286,1647327600"; d="scan'208";a="670529818" Received: from 984fee00a4c6.jf.intel.com ([10.165.58.231]) by FMSMGA003.fm.intel.com with ESMTP; 08 Jun 2022 05:31:45 -0700 From: Yi Liu To: alex.williamson@redhat.com, cohuck@redhat.com, qemu-devel@nongnu.org Cc: david@gibson.dropbear.id.au, thuth@redhat.com, farman@linux.ibm.com, mjrosato@linux.ibm.com, akrowiak@linux.ibm.com, pasic@linux.ibm.com, jjherne@linux.ibm.com, jasowang@redhat.com, kvm@vger.kernel.org, jgg@nvidia.com, nicolinc@nvidia.com, eric.auger@redhat.com, eric.auger.pro@gmail.com, kevin.tian@intel.com, yi.l.liu@intel.com, chao.p.peng@intel.com, yi.y.sun@intel.com, peterx@redhat.com, shameerali.kolothum.thodi@huawei.com, zhangfei.gao@linaro.org, berrange@redhat.com Subject: [RFC v2 08/15] vfio/ccw: Use vfio_[attach/detach]_device Date: Wed, 8 Jun 2022 05:31:32 -0700 Message-Id: <20220608123139.19356-9-yi.l.liu@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220608123139.19356-1-yi.l.liu@intel.com> References: <20220608123139.19356-1-yi.l.liu@intel.com> MIME-Version: 1.0 Received-SPF: pass client-ip=192.55.52.43; envelope-from=yi.l.liu@intel.com; helo=mga05.intel.com X-Spam_score_int: -44 X-Spam_score: -4.5 X-Spam_bar: ---- X-Spam_report: (-4.5 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" From: Eric Auger Let the vfio-ccw device use vfio_attach_device() and vfio_detach_device(), hence hiding the details of the used IOMMU backend. Also now all the devices have been migrated to use the new vfio_attach_device/vfio_detach_device API, let's turn the legacy functions into static functions, local to container.c. Signed-off-by: Eric Auger Signed-off-by: Yi Liu --- hw/vfio/ccw.c | 118 ++++++++-------------------------- hw/vfio/container.c | 8 +-- include/hw/vfio/vfio-common.h | 4 -- 3 files changed, 32 insertions(+), 98 deletions(-) diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c index 0354737666..6fde7849cc 100644 --- a/hw/vfio/ccw.c +++ b/hw/vfio/ccw.c @@ -579,27 +579,32 @@ static void vfio_ccw_put_region(VFIOCCWDevice *vcdev) g_free(vcdev->io_region); } -static void vfio_ccw_put_device(VFIOCCWDevice *vcdev) -{ - g_free(vcdev->vdev.name); - vfio_put_base_device(&vcdev->vdev); -} - -static void vfio_ccw_get_device(VFIOGroup *group, VFIOCCWDevice *vcdev, - Error **errp) +static void vfio_ccw_realize(DeviceState *dev, Error **errp) { + CcwDevice *ccw_dev = DO_UPCAST(CcwDevice, parent_obj, dev); + S390CCWDevice *cdev = DO_UPCAST(S390CCWDevice, parent_obj, ccw_dev); + VFIOCCWDevice *vcdev = DO_UPCAST(VFIOCCWDevice, cdev, cdev); + S390CCWDeviceClass *cdc = S390_CCW_DEVICE_GET_CLASS(cdev); + VFIODevice *vbasedev = &vcdev->vdev; + Error *err = NULL; char *name = g_strdup_printf("%x.%x.%04x", vcdev->cdev.hostid.cssid, vcdev->cdev.hostid.ssid, vcdev->cdev.hostid.devid); - VFIODevice *vbasedev; + int ret; - QLIST_FOREACH(vbasedev, &group->device_list, next) { - if (strcmp(vbasedev->name, name) == 0) { - error_setg(errp, "vfio: subchannel %s has already been attached", - name); - goto out_err; + /* Call the class init function for subchannel. */ + if (cdc->realize) { + cdc->realize(cdev, vcdev->vdev.sysfsdev, &err); + if (err) { + goto out_err_propagate; } } + vbasedev->sysfsdev = g_strdup_printf("/sys/bus/css/devices/%s/%s", + name, cdev->mdevid); + vbasedev->ops = &vfio_ccw_ops; + vbasedev->type = VFIO_DEVICE_TYPE_CCW; + vbasedev->name = name; + vbasedev->dev = &vcdev->cdev.parent_obj.parent_obj; /* * All vfio-ccw devices are believed to operate in a way compatible with @@ -609,80 +614,18 @@ static void vfio_ccw_get_device(VFIOGroup *group, VFIOCCWDevice *vcdev, * needs to be set before vfio_get_device() for vfio common to handle * ram_block_discard_disable(). */ - vcdev->vdev.ram_block_discard_allowed = true; - - if (vfio_get_device(group, vcdev->cdev.mdevid, &vcdev->vdev, errp)) { - goto out_err; - } - - vcdev->vdev.ops = &vfio_ccw_ops; - vcdev->vdev.type = VFIO_DEVICE_TYPE_CCW; - vcdev->vdev.name = name; - vcdev->vdev.dev = &vcdev->cdev.parent_obj.parent_obj; - - return; - -out_err: - g_free(name); -} - -static VFIOGroup *vfio_ccw_get_group(S390CCWDevice *cdev, Error **errp) -{ - char *tmp, group_path[PATH_MAX]; - ssize_t len; - int groupid; - tmp = g_strdup_printf("/sys/bus/css/devices/%x.%x.%04x/%s/iommu_group", - cdev->hostid.cssid, cdev->hostid.ssid, - cdev->hostid.devid, cdev->mdevid); - len = readlink(tmp, group_path, sizeof(group_path)); - g_free(tmp); + vbasedev->ram_block_discard_allowed = true; - if (len <= 0 || len >= sizeof(group_path)) { - error_setg(errp, "vfio: no iommu_group found"); - return NULL; - } - - group_path[len] = 0; - - if (sscanf(basename(group_path), "%d", &groupid) != 1) { - error_setg(errp, "vfio: failed to read %s", group_path); - return NULL; - } - - return vfio_get_group(groupid, &address_space_memory, errp); -} - -static void vfio_ccw_realize(DeviceState *dev, Error **errp) -{ - VFIOGroup *group; - CcwDevice *ccw_dev = DO_UPCAST(CcwDevice, parent_obj, dev); - S390CCWDevice *cdev = DO_UPCAST(S390CCWDevice, parent_obj, ccw_dev); - VFIOCCWDevice *vcdev = DO_UPCAST(VFIOCCWDevice, cdev, cdev); - S390CCWDeviceClass *cdc = S390_CCW_DEVICE_GET_CLASS(cdev); - Error *err = NULL; - - /* Call the class init function for subchannel. */ - if (cdc->realize) { - cdc->realize(cdev, vcdev->vdev.sysfsdev, &err); - if (err) { - goto out_err_propagate; - } - } - - group = vfio_ccw_get_group(cdev, &err); - if (!group) { - goto out_group_err; - } - - vfio_ccw_get_device(group, vcdev, &err); - if (err) { - goto out_device_err; + ret = vfio_attach_device(vbasedev, &address_space_memory, errp); + if (ret) { + g_free(vbasedev->name); + g_free(vbasedev->sysfsdev); } vfio_ccw_get_region(vcdev, &err); if (err) { - goto out_region_err; + goto out_get_dev_err; } vfio_ccw_register_irq_notifier(vcdev, VFIO_CCW_IO_IRQ_INDEX, &err); @@ -714,11 +657,8 @@ out_irq_notifier_err: vfio_ccw_unregister_irq_notifier(vcdev, VFIO_CCW_IO_IRQ_INDEX); out_io_notifier_err: vfio_ccw_put_region(vcdev); -out_region_err: - vfio_ccw_put_device(vcdev); -out_device_err: - vfio_put_group(group); -out_group_err: +out_get_dev_err: + vfio_detach_device(vbasedev); if (cdc->unrealize) { cdc->unrealize(cdev); } @@ -732,14 +672,12 @@ static void vfio_ccw_unrealize(DeviceState *dev) S390CCWDevice *cdev = DO_UPCAST(S390CCWDevice, parent_obj, ccw_dev); VFIOCCWDevice *vcdev = DO_UPCAST(VFIOCCWDevice, cdev, cdev); S390CCWDeviceClass *cdc = S390_CCW_DEVICE_GET_CLASS(cdev); - VFIOGroup *group = vcdev->vdev.group; vfio_ccw_unregister_irq_notifier(vcdev, VFIO_CCW_REQ_IRQ_INDEX); vfio_ccw_unregister_irq_notifier(vcdev, VFIO_CCW_CRW_IRQ_INDEX); vfio_ccw_unregister_irq_notifier(vcdev, VFIO_CCW_IO_IRQ_INDEX); vfio_ccw_put_region(vcdev); - vfio_ccw_put_device(vcdev); - vfio_put_group(group); + vfio_detach_device(&vcdev->vdev); if (cdc->unrealize) { cdc->unrealize(cdev); diff --git a/hw/vfio/container.c b/hw/vfio/container.c index 74e6eeba74..fbde5b0d31 100644 --- a/hw/vfio/container.c +++ b/hw/vfio/container.c @@ -958,7 +958,7 @@ static void vfio_disconnect_container(VFIOGroup *group) } } -VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp) +static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp) { VFIOGroup *group; VFIOContainer *bcontainer; @@ -1027,7 +1027,7 @@ free_group_exit: return NULL; } -void vfio_put_group(VFIOGroup *group) +static void vfio_put_group(VFIOGroup *group) { if (!group || !QLIST_EMPTY(&group->device_list)) { return; @@ -1048,8 +1048,8 @@ void vfio_put_group(VFIOGroup *group) } } -int vfio_get_device(VFIOGroup *group, const char *name, - VFIODevice *vbasedev, Error **errp) +static int vfio_get_device(VFIOGroup *group, const char *name, + VFIODevice *vbasedev, Error **errp) { struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) }; int ret, fd; diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index 1b68cd8ece..669d761728 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -177,10 +177,6 @@ void vfio_region_unmap(VFIORegion *region); void vfio_region_exit(VFIORegion *region); void vfio_region_finalize(VFIORegion *region); void vfio_reset_handler(void *opaque); -VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp); -void vfio_put_group(VFIOGroup *group); -int vfio_get_device(VFIOGroup *group, const char *name, - VFIODevice *vbasedev, Error **errp); int vfio_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp); void vfio_detach_device(VFIODevice *vbasedev); From patchwork Wed Jun 8 12:31:33 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yi Liu X-Patchwork-Id: 12873470 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 559CFC43334 for ; Wed, 8 Jun 2022 12:57:05 +0000 (UTC) Received: from localhost ([::1]:59274 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nyvFA-0000TM-5J for qemu-devel@archiver.kernel.org; Wed, 08 Jun 2022 08:57:04 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:58036) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nyuqu-0000T8-Su for qemu-devel@nongnu.org; Wed, 08 Jun 2022 08:32:00 -0400 Received: from mga05.intel.com ([192.55.52.43]:56958) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nyuqq-0005ks-9K for qemu-devel@nongnu.org; Wed, 08 Jun 2022 08:31:59 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1654691516; x=1686227516; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=MGTE2TVi4Z/OLr5qovQ9Ivt4dKH7Id4WEAoyPHQsjKw=; b=H2UDP1Im+2WxP9CLJoQa2G62H0wMrY/UejRRIQL2CUXiiiik4Gb7941W Uxw0dFSk8rrFVJ0tWGHHU9CxKds7N+KBnTyRZ/vcBpI1BxDPMxuBi9X5E pQhVJIEel0Y8VJtz+pD+Gs3vX5UZjxF41+wpH6qHK3lYU5EWHrQJcX8WI nXxal5xpOLvJ4Do++KNbF9rkEKszT9Ijpx95sZet93vBHPevosasKodUR STq8wv6qfWPrgw9qst1Ksq+nCVYqItBUpQgbQMfcPvy42yamUoKj2BYZE 0hWNC/dnTaQFCmCITqWE1AaqD5eeoS2wMgl3nw0+qHDlXEYook2RJ0w81 w==; X-IronPort-AV: E=McAfee;i="6400,9594,10371"; a="363210150" X-IronPort-AV: E=Sophos;i="5.91,286,1647327600"; d="scan'208";a="363210150" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Jun 2022 05:31:47 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.91,286,1647327600"; d="scan'208";a="670529839" Received: from 984fee00a4c6.jf.intel.com ([10.165.58.231]) by FMSMGA003.fm.intel.com with ESMTP; 08 Jun 2022 05:31:46 -0700 From: Yi Liu To: alex.williamson@redhat.com, cohuck@redhat.com, qemu-devel@nongnu.org Cc: david@gibson.dropbear.id.au, thuth@redhat.com, farman@linux.ibm.com, mjrosato@linux.ibm.com, akrowiak@linux.ibm.com, pasic@linux.ibm.com, jjherne@linux.ibm.com, jasowang@redhat.com, kvm@vger.kernel.org, jgg@nvidia.com, nicolinc@nvidia.com, eric.auger@redhat.com, eric.auger.pro@gmail.com, kevin.tian@intel.com, yi.l.liu@intel.com, chao.p.peng@intel.com, yi.y.sun@intel.com, peterx@redhat.com, shameerali.kolothum.thodi@huawei.com, zhangfei.gao@linaro.org, berrange@redhat.com Subject: [RFC v2 09/15] vfio/container-base: Introduce [attach/detach]_device container callbacks Date: Wed, 8 Jun 2022 05:31:33 -0700 Message-Id: <20220608123139.19356-10-yi.l.liu@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220608123139.19356-1-yi.l.liu@intel.com> References: <20220608123139.19356-1-yi.l.liu@intel.com> MIME-Version: 1.0 Received-SPF: pass client-ip=192.55.52.43; envelope-from=yi.l.liu@intel.com; helo=mga05.intel.com X-Spam_score_int: -44 X-Spam_score: -4.5 X-Spam_bar: ---- X-Spam_report: (-4.5 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" From: Eric Auger Let's turn attach/detach_device as container callbacks. That way, their implementation can be easily customized for a given backend. For the time being, only the legacy container is supported. Signed-off-by: Eric Auger Signed-off-by: Yi Liu --- hw/vfio/as.c | 30 +++++++++++++++++++++++++++ hw/vfio/container.c | 9 ++++++-- hw/vfio/pci.c | 2 +- include/hw/vfio/vfio-common.h | 7 +++++++ include/hw/vfio/vfio-container-base.h | 4 ++++ 5 files changed, 49 insertions(+), 3 deletions(-) diff --git a/hw/vfio/as.c b/hw/vfio/as.c index ec58914001..2c83b8e1fe 100644 --- a/hw/vfio/as.c +++ b/hw/vfio/as.c @@ -899,3 +899,33 @@ void vfio_put_address_space(VFIOAddressSpace *space) g_free(space); } } + +static const VFIOContainerOps * +vfio_get_container_ops(VFIOIOMMUBackendType be) +{ + switch (be) { + case VFIO_IOMMU_BACKEND_TYPE_LEGACY: + return &legacy_container_ops; + default: + return NULL; + } +} + +int vfio_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp) +{ + const VFIOContainerOps *ops; + + ops = vfio_get_container_ops(VFIO_IOMMU_BACKEND_TYPE_LEGACY); + if (!ops) { + return -ENOENT; + } + return ops->attach_device(vbasedev, as, errp); +} + +void vfio_detach_device(VFIODevice *vbasedev) +{ + if (!vbasedev->container) { + return; + } + vbasedev->container->ops->detach_device(vbasedev); +} diff --git a/hw/vfio/container.c b/hw/vfio/container.c index fbde5b0d31..f303c08aa5 100644 --- a/hw/vfio/container.c +++ b/hw/vfio/container.c @@ -1244,7 +1244,8 @@ static int vfio_device_groupid(VFIODevice *vbasedev, Error **errp) return groupid; } -int vfio_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp) +static int +legacy_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp) { int groupid = vfio_device_groupid(vbasedev, errp); VFIODevice *vbasedev_iter; @@ -1273,14 +1274,16 @@ int vfio_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp) vfio_put_group(group); return -1; } + vbasedev->container = &group->container->bcontainer; return 0; } -void vfio_detach_device(VFIODevice *vbasedev) +static void legacy_detach_device(VFIODevice *vbasedev) { vfio_put_base_device(vbasedev); vfio_put_group(vbasedev->group); + vbasedev->container = NULL; } const VFIOContainerOps legacy_container_ops = { @@ -1292,4 +1295,6 @@ const VFIOContainerOps legacy_container_ops = { .add_window = vfio_legacy_container_add_section_window, .del_window = vfio_legacy_container_del_section_window, .check_extension = vfio_legacy_container_check_extension, + .attach_device = legacy_attach_device, + .detach_device = legacy_detach_device, }; diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index 9856d81819..07361a6fcd 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -3106,7 +3106,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp) } if (!pdev->failover_pair_id && - vfio_container_check_extension(&vbasedev->group->container->bcontainer, + vfio_container_check_extension(vbasedev->container, VFIO_FEAT_LIVE_MIGRATION)) { ret = vfio_migration_probe(vbasedev, errp); if (ret) { diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index 669d761728..635397d391 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -84,9 +84,15 @@ typedef struct VFIOLegacyContainer { typedef struct VFIODeviceOps VFIODeviceOps; +typedef enum VFIOIOMMUBackendType { + VFIO_IOMMU_BACKEND_TYPE_LEGACY = 0, + VFIO_IOMMU_BACKEND_TYPE_IOMMUFD = 1, +} VFIOIOMMUBackendType; + typedef struct VFIODevice { QLIST_ENTRY(VFIODevice) next; struct VFIOGroup *group; + VFIOContainer *container; char *sysfsdev; char *name; DeviceState *dev; @@ -98,6 +104,7 @@ typedef struct VFIODevice { bool ram_block_discard_allowed; bool enable_migration; VFIODeviceOps *ops; + VFIOIOMMUBackendType be; unsigned int num_irqs; unsigned int num_regions; unsigned int flags; diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h index fa5d7fcb85..71df8743fb 100644 --- a/include/hw/vfio/vfio-container-base.h +++ b/include/hw/vfio/vfio-container-base.h @@ -66,6 +66,8 @@ typedef struct VFIOHostDMAWindow { QLIST_ENTRY(VFIOHostDMAWindow) hostwin_next; } VFIOHostDMAWindow; +typedef struct VFIODevice VFIODevice; + typedef struct VFIOContainerOps { /* required */ bool (*check_extension)(VFIOContainer *container, @@ -88,6 +90,8 @@ typedef struct VFIOContainerOps { Error **errp); void (*del_window)(VFIOContainer *container, MemoryRegionSection *section); + int (*attach_device)(VFIODevice *vbasedev, AddressSpace *as, Error **errp); + void (*detach_device)(VFIODevice *vbasedev); } VFIOContainerOps; /* From patchwork Wed Jun 8 12:31:34 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yi Liu X-Patchwork-Id: 12873453 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id DF290C433EF for ; Wed, 8 Jun 2022 12:45:05 +0000 (UTC) Received: from localhost ([::1]:45836 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nyv3Y-0006rx-VQ for qemu-devel@archiver.kernel.org; Wed, 08 Jun 2022 08:45:04 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:58046) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nyuqw-0000WH-QL for qemu-devel@nongnu.org; Wed, 08 Jun 2022 08:32:02 -0400 Received: from mga05.intel.com ([192.55.52.43]:56952) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nyuqr-0005kl-I9 for qemu-devel@nongnu.org; Wed, 08 Jun 2022 08:32:02 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1654691517; x=1686227517; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=lifPEPAqd0/u6utnK71bM3OvcAqpE8N/lgs7hmF7yN4=; b=QJqP/85CiFD1IS17L/4kap2orbCkAsy62jo0b1RmhY+Xd5EBs4NrGyPx T8wmZWqqPY1Bcw99LArF2836ikz9/SwRD06Kwr5MrvN169b0R2MsgfDxq GjS5s09wXs5f3+Vvo+LSR+CsgeDlFJCbqBp9Twmref1jEWo0+qyrhloPI JbPmPeZStDHu3BOAnxpsKV1xlsDAGTgybyuy6HUoORa+jIJNEBlOg9Q/G EQkjzXaboE0HaM1nE+90EIMheXSecK/LCrhF6/txa0d13Ck2C0U7avK4D xS1TunPUHscLUPawNMobY2oHwfjgMgCYZasjpUYuWO5Iuin9W2wmFrH5S A==; X-IronPort-AV: E=McAfee;i="6400,9594,10371"; a="363210154" X-IronPort-AV: E=Sophos;i="5.91,286,1647327600"; d="scan'208";a="363210154" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Jun 2022 05:31:47 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.91,286,1647327600"; d="scan'208";a="670529854" Received: from 984fee00a4c6.jf.intel.com ([10.165.58.231]) by FMSMGA003.fm.intel.com with ESMTP; 08 Jun 2022 05:31:46 -0700 From: Yi Liu To: alex.williamson@redhat.com, cohuck@redhat.com, qemu-devel@nongnu.org Cc: david@gibson.dropbear.id.au, thuth@redhat.com, farman@linux.ibm.com, mjrosato@linux.ibm.com, akrowiak@linux.ibm.com, pasic@linux.ibm.com, jjherne@linux.ibm.com, jasowang@redhat.com, kvm@vger.kernel.org, jgg@nvidia.com, nicolinc@nvidia.com, eric.auger@redhat.com, eric.auger.pro@gmail.com, kevin.tian@intel.com, yi.l.liu@intel.com, chao.p.peng@intel.com, yi.y.sun@intel.com, peterx@redhat.com, shameerali.kolothum.thodi@huawei.com, zhangfei.gao@linaro.org, berrange@redhat.com Subject: [RFC v2 10/15] vfio/container-base: Introduce VFIOContainer reset callback Date: Wed, 8 Jun 2022 05:31:34 -0700 Message-Id: <20220608123139.19356-11-yi.l.liu@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220608123139.19356-1-yi.l.liu@intel.com> References: <20220608123139.19356-1-yi.l.liu@intel.com> MIME-Version: 1.0 Received-SPF: pass client-ip=192.55.52.43; envelope-from=yi.l.liu@intel.com; helo=mga05.intel.com X-Spam_score_int: -43 X-Spam_score: -4.4 X-Spam_bar: ---- X-Spam_report: (-4.4 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" From: Eric Auger Reset implementation depends on the container backend. Let's introduce a VFIOContainer class function and register a generic reset handler that will be able to call the right reset function depending on the container type. Also, let's move the registration/unregistration to a place that is not backend-specific (first vfio address space created instead of the first group). Signed-off-by: Eric Auger Signed-off-by: Yi Liu --- hw/vfio/as.c | 18 ++++++++++++++++++ hw/vfio/container-base.c | 9 +++++++++ hw/vfio/container.c | 27 +++++++++++++++------------ include/hw/vfio/vfio-container-base.h | 2 ++ 4 files changed, 44 insertions(+), 12 deletions(-) diff --git a/hw/vfio/as.c b/hw/vfio/as.c index 2c83b8e1fe..1a3ceb5e62 100644 --- a/hw/vfio/as.c +++ b/hw/vfio/as.c @@ -872,6 +872,18 @@ const MemoryListener vfio_memory_listener = { .log_sync = vfio_listener_log_sync, }; +void vfio_reset_handler(void *opaque) +{ + VFIOAddressSpace *space; + VFIOContainer *bcontainer; + + QLIST_FOREACH(space, &vfio_address_spaces, list) { + QLIST_FOREACH(bcontainer, &space->containers, next) { + vfio_container_reset(bcontainer); + } + } +} + VFIOAddressSpace *vfio_get_address_space(AddressSpace *as) { VFIOAddressSpace *space; @@ -887,6 +899,9 @@ VFIOAddressSpace *vfio_get_address_space(AddressSpace *as) space->as = as; QLIST_INIT(&space->containers); + if (QLIST_EMPTY(&vfio_address_spaces)) { + qemu_register_reset(vfio_reset_handler, NULL); + } QLIST_INSERT_HEAD(&vfio_address_spaces, space, list); return space; @@ -898,6 +913,9 @@ void vfio_put_address_space(VFIOAddressSpace *space) QLIST_REMOVE(space, list); g_free(space); } + if (QLIST_EMPTY(&vfio_address_spaces)) { + qemu_unregister_reset(vfio_reset_handler, NULL); + } } static const VFIOContainerOps * diff --git a/hw/vfio/container-base.c b/hw/vfio/container-base.c index 6aaf0e0faa..a9f28e4b9d 100644 --- a/hw/vfio/container-base.c +++ b/hw/vfio/container-base.c @@ -58,6 +58,15 @@ int vfio_container_dma_unmap(VFIOContainer *container, return container->ops->dma_unmap(container, iova, size, iotlb); } +int vfio_container_reset(VFIOContainer *container) +{ + if (!container->ops->reset) { + return -ENOENT; + } + + return container->ops->reset(container); +} + void vfio_container_set_dirty_page_tracking(VFIOContainer *container, bool start) { diff --git a/hw/vfio/container.c b/hw/vfio/container.c index f303c08aa5..2d9704bc1a 100644 --- a/hw/vfio/container.c +++ b/hw/vfio/container.c @@ -465,12 +465,16 @@ vfio_legacy_container_del_section_window(VFIOContainer *bcontainer, } } -void vfio_reset_handler(void *opaque) +static int vfio_legacy_container_reset(VFIOContainer *bcontainer) { + VFIOLegacyContainer *container = container_of(bcontainer, + VFIOLegacyContainer, + bcontainer); VFIOGroup *group; VFIODevice *vbasedev; + int ret, final_ret = 0; - QLIST_FOREACH(group, &vfio_group_list, next) { + QLIST_FOREACH(group, &container->group_list, container_next) { QLIST_FOREACH(vbasedev, &group->device_list, next) { if (vbasedev->dev->realized) { vbasedev->ops->vfio_compute_needs_reset(vbasedev); @@ -478,13 +482,19 @@ void vfio_reset_handler(void *opaque) } } - QLIST_FOREACH(group, &vfio_group_list, next) { + QLIST_FOREACH(group, &container->group_list, next) { QLIST_FOREACH(vbasedev, &group->device_list, next) { if (vbasedev->dev->realized && vbasedev->needs_reset) { - vbasedev->ops->vfio_hot_reset_multi(vbasedev); + ret = vbasedev->ops->vfio_hot_reset_multi(vbasedev); + if (ret) { + error_report("failed to reset %s (%d)", + vbasedev->name, ret); + final_ret = ret; + } } } } + return final_ret; } static void vfio_kvm_device_add_group(VFIOGroup *group) @@ -1010,10 +1020,6 @@ static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp) goto close_fd_exit; } - if (QLIST_EMPTY(&vfio_group_list)) { - qemu_register_reset(vfio_reset_handler, NULL); - } - QLIST_INSERT_HEAD(&vfio_group_list, group, next); return group; @@ -1042,10 +1048,6 @@ static void vfio_put_group(VFIOGroup *group) trace_vfio_put_group(group->fd); close(group->fd); g_free(group); - - if (QLIST_EMPTY(&vfio_group_list)) { - qemu_unregister_reset(vfio_reset_handler, NULL); - } } static int vfio_get_device(VFIOGroup *group, const char *name, @@ -1295,6 +1297,7 @@ const VFIOContainerOps legacy_container_ops = { .add_window = vfio_legacy_container_add_section_window, .del_window = vfio_legacy_container_del_section_window, .check_extension = vfio_legacy_container_check_extension, + .reset = vfio_legacy_container_reset, .attach_device = legacy_attach_device, .detach_device = legacy_detach_device, }; diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h index 71df8743fb..f9fb8b6af7 100644 --- a/include/hw/vfio/vfio-container-base.h +++ b/include/hw/vfio/vfio-container-base.h @@ -78,6 +78,7 @@ typedef struct VFIOContainerOps { int (*dma_unmap)(VFIOContainer *container, hwaddr iova, ram_addr_t size, IOMMUTLBEntry *iotlb); + int (*reset)(VFIOContainer *container); /* migration feature */ bool (*devices_all_dirty_tracking)(VFIOContainer *container); void (*set_dirty_page_tracking)(VFIOContainer *container, bool start); @@ -122,6 +123,7 @@ int vfio_container_dma_map(VFIOContainer *container, int vfio_container_dma_unmap(VFIOContainer *container, hwaddr iova, ram_addr_t size, IOMMUTLBEntry *iotlb); +int vfio_container_reset(VFIOContainer *container); bool vfio_container_devices_all_dirty_tracking(VFIOContainer *container); void vfio_container_set_dirty_page_tracking(VFIOContainer *container, bool start); From patchwork Wed Jun 8 12:31:35 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yi Liu X-Patchwork-Id: 12873476 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id DD164C43334 for ; Wed, 8 Jun 2022 12:58:17 +0000 (UTC) Received: from localhost ([::1]:33012 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nyvGK-0001y7-L2 for qemu-devel@archiver.kernel.org; Wed, 08 Jun 2022 08:58:16 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:58132) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nyurD-00010S-Kx for qemu-devel@nongnu.org; Wed, 08 Jun 2022 08:32:19 -0400 Received: from mga05.intel.com ([192.55.52.43]:56947) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nyur3-0005ka-JF for qemu-devel@nongnu.org; Wed, 08 Jun 2022 08:32:19 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1654691529; x=1686227529; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=VDUDde22/mftjnQk63dwew8A4G2v7RPHGmdbSSc+W0I=; b=PhBrSNcQGVztsKXURobPq9v7TwnDNu841FV/K9+R3dgVIQb92gpye3IO 16djMmg9z4tbbITv0bZCaNTj8mJQGmBc7xiROaWIGoRAxMjEyImvEZ6Xb /3TFH4eOG7fV+xxXs86dwIyG4wAtYaaMMmdz+6pGij6Hh8nG8pewDfLFV MMZY3k+sFHOC6YJss1WnYUz/JLgpPsocpcMhyKiSZGvBQDZYKYXtwYn4x NlR/KiXDkOpa9fZ+Vw6IubrmaARcJGrb4HS3pUKCbxYUavk7tGItxjt3I 51BycZj4+uCfVZmS0W0rlCm7FhkSmEIFha5XyXGLRgEL0XG25JddqiSmY A==; X-IronPort-AV: E=McAfee;i="6400,9594,10371"; a="363210161" X-IronPort-AV: E=Sophos;i="5.91,286,1647327600"; d="scan'208";a="363210161" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Jun 2022 05:31:48 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.91,286,1647327600"; d="scan'208";a="670529867" Received: from 984fee00a4c6.jf.intel.com ([10.165.58.231]) by FMSMGA003.fm.intel.com with ESMTP; 08 Jun 2022 05:31:47 -0700 From: Yi Liu To: alex.williamson@redhat.com, cohuck@redhat.com, qemu-devel@nongnu.org Cc: david@gibson.dropbear.id.au, thuth@redhat.com, farman@linux.ibm.com, mjrosato@linux.ibm.com, akrowiak@linux.ibm.com, pasic@linux.ibm.com, jjherne@linux.ibm.com, jasowang@redhat.com, kvm@vger.kernel.org, jgg@nvidia.com, nicolinc@nvidia.com, eric.auger@redhat.com, eric.auger.pro@gmail.com, kevin.tian@intel.com, yi.l.liu@intel.com, chao.p.peng@intel.com, yi.y.sun@intel.com, peterx@redhat.com, shameerali.kolothum.thodi@huawei.com, zhangfei.gao@linaro.org, berrange@redhat.com Subject: [RFC v2 11/15] backends/iommufd: Introduce the iommufd object Date: Wed, 8 Jun 2022 05:31:35 -0700 Message-Id: <20220608123139.19356-12-yi.l.liu@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220608123139.19356-1-yi.l.liu@intel.com> References: <20220608123139.19356-1-yi.l.liu@intel.com> MIME-Version: 1.0 Received-SPF: pass client-ip=192.55.52.43; envelope-from=yi.l.liu@intel.com; helo=mga05.intel.com X-Spam_score_int: -44 X-Spam_score: -4.5 X-Spam_bar: ---- X-Spam_report: (-4.5 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_NONE=0.001, T_SCC_BODY_TEXT_LINE=-0.01, T_SPF_TEMPERROR=0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" From: Eric Auger Introduce an iommufd object which allows the interaction with the host /dev/iommu device. The /dev/iommu can have been already pre-opened outside of qemu, in which case the fd can be passed directly along with the iommufd object: This allows the iommufd object to be shared accross several subsystems (VFIO, VDPA, ...). For example, libvirt would open the /dev/iommu once. If no fd is passed along with the iommufd object, the /dev/iommu is opened by the qemu code. The CONFIG_IOMMUFD option must be set to compile this new object. Signed-off-by: Eric Auger Signed-off-by: Yi Liu Suggested-by: Alex Williamson --- MAINTAINERS | 7 ++ backends/Kconfig | 5 + backends/iommufd.c | 265 +++++++++++++++++++++++++++++++++++++++ backends/meson.build | 1 + backends/trace-events | 12 ++ include/sysemu/iommufd.h | 47 +++++++ qapi/qom.json | 16 ++- qemu-options.hx | 12 ++ 8 files changed, 364 insertions(+), 1 deletion(-) create mode 100644 backends/iommufd.c create mode 100644 include/sysemu/iommufd.h diff --git a/MAINTAINERS b/MAINTAINERS index 5580a36b68..c6aa4e7fd0 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1984,6 +1984,13 @@ F: hw/vfio/ap.c F: docs/system/s390x/vfio-ap.rst L: qemu-s390x@nongnu.org +iommufd +M: Yi Liu +M: Eric Auger +S: Supported +F: backends/iommufd.c +F: include/sysemu/iommufd.h + vhost M: Michael S. Tsirkin S: Supported diff --git a/backends/Kconfig b/backends/Kconfig index f35abc1609..aad57e8f53 100644 --- a/backends/Kconfig +++ b/backends/Kconfig @@ -1 +1,6 @@ source tpm/Kconfig + +config IOMMUFD + bool + default y + depends on LINUX diff --git a/backends/iommufd.c b/backends/iommufd.c new file mode 100644 index 0000000000..6a66948d5d --- /dev/null +++ b/backends/iommufd.c @@ -0,0 +1,265 @@ +/* + * iommufd container backend + * + * Copyright (C) 2022 Intel Corporation. + * Copyright Red Hat, Inc. 2022 + * + * Authors: Yi Liu + * Eric Auger + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + + * You should have received a copy of the GNU General Public License along + * with this program; if not, see . + */ + +#include "qemu/osdep.h" +#include "sysemu/iommufd.h" +#include "qapi/error.h" +#include "qapi/qmp/qerror.h" +#include "qemu/module.h" +#include "qom/object_interfaces.h" +#include "qemu/error-report.h" +#include "monitor/monitor.h" +#include "trace.h" +#include +#include + +static void iommufd_backend_init(Object *obj) +{ + IOMMUFDBackend *be = IOMMUFD_BACKEND(obj); + + be->fd = -1; + be->users = 0; + be->owned = true; + qemu_mutex_init(&be->lock); +} + +static void iommufd_backend_finalize(Object *obj) +{ + IOMMUFDBackend *be = IOMMUFD_BACKEND(obj); + + if (be->owned) { + close(be->fd); + be->fd = -1; + } +} + +static void iommufd_backend_set_fd(Object *obj, const char *str, Error **errp) +{ + IOMMUFDBackend *be = IOMMUFD_BACKEND(obj); + int fd = -1; + + fd = monitor_fd_param(monitor_cur(), str, errp); + if (fd == -1) { + error_prepend(errp, "Could not parse remote object fd %s:", str); + return; + } + qemu_mutex_lock(&be->lock); + be->fd = fd; + be->owned = false; + qemu_mutex_unlock(&be->lock); + trace_iommu_backend_set_fd(be->fd); +} + +static void iommufd_backend_class_init(ObjectClass *oc, void *data) +{ + object_class_property_add_str(oc, "fd", NULL, iommufd_backend_set_fd); +} + +int iommufd_backend_connect(IOMMUFDBackend *be, Error **errp) +{ + int fd, ret = 0; + + qemu_mutex_lock(&be->lock); + if (be->users == UINT32_MAX) { + error_setg(errp, "too many connections"); + ret = -E2BIG; + goto out; + } + if (be->owned && !be->users) { + fd = qemu_open_old("/dev/iommu", O_RDWR); + if (fd < 0) { + error_setg_errno(errp, errno, "/dev/iommu opening failed"); + ret = fd; + goto out; + } + be->fd = fd; + } + be->users++; +out: + trace_iommufd_backend_connect(be->fd, be->owned, + be->users, ret); + qemu_mutex_unlock(&be->lock); + return ret; +} + +void iommufd_backend_disconnect(IOMMUFDBackend *be) +{ + qemu_mutex_lock(&be->lock); + if (!be->users) { + goto out; + } + be->users--; + if (!be->users && be->owned) { + close(be->fd); + be->fd = -1; + } +out: + trace_iommufd_backend_disconnect(be->fd, be->users); + qemu_mutex_unlock(&be->lock); +} + +static int iommufd_backend_alloc_ioas(int fd, uint32_t *ioas) +{ + int ret; + struct iommu_ioas_alloc alloc_data = { + .size = sizeof(alloc_data), + .flags = 0, + }; + + ret = ioctl(fd, IOMMU_IOAS_ALLOC, &alloc_data); + if (ret) { + error_report("Failed to allocate ioas %m"); + } + + *ioas = alloc_data.out_ioas_id; + trace_iommufd_backend_alloc_ioas(fd, *ioas, ret); + + return ret; +} + +static void iommufd_backend_free_ioas(int fd, uint32_t ioas) +{ + int ret; + struct iommu_destroy des = { + .size = sizeof(des), + .id = ioas, + }; + + ret = ioctl(fd, IOMMU_DESTROY, &des); + trace_iommufd_backend_free_ioas(fd, ioas, ret); + if (ret) { + error_report("Failed to free ioas: %u %m", ioas); + } +} + +int iommufd_backend_get_ioas(IOMMUFDBackend *be, uint32_t *ioas_id) +{ + int ret; + + ret = iommufd_backend_alloc_ioas(be->fd, ioas_id); + trace_iommufd_backend_get_ioas(be->fd, *ioas_id, ret); + return ret; +} + +void iommufd_backend_put_ioas(IOMMUFDBackend *be, uint32_t ioas) +{ + trace_iommufd_backend_put_ioas(be->fd, ioas); + iommufd_backend_free_ioas(be->fd, ioas); +} + +int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas, + hwaddr iova, ram_addr_t size) +{ + int ret; + struct iommu_ioas_unmap unmap = { + .size = sizeof(unmap), + .ioas_id = ioas, + .iova = iova, + .length = size, + }; + + ret = ioctl(be->fd, IOMMU_IOAS_UNMAP, &unmap); + trace_iommufd_backend_unmap_dma(be->fd, ioas, iova, size, ret); + if (ret) { + error_report("IOMMU_IOAS_UNMAP failed: %s", strerror(errno)); + } + return !ret ? 0 : -errno; +} + +int iommufd_backend_map_dma(IOMMUFDBackend *be, uint32_t ioas, hwaddr iova, + ram_addr_t size, void *vaddr, bool readonly) +{ + int ret; + struct iommu_ioas_map map = { + .size = sizeof(map), + .flags = IOMMU_IOAS_MAP_READABLE | + IOMMU_IOAS_MAP_FIXED_IOVA, + .ioas_id = ioas, + .__reserved = 0, + .user_va = (int64_t)vaddr, + .iova = iova, + .length = size, + }; + + if (!readonly) { + map.flags |= IOMMU_IOAS_MAP_WRITEABLE; + } + + ret = ioctl(be->fd, IOMMU_IOAS_MAP, &map); + trace_iommufd_backend_map_dma(be->fd, ioas, iova, size, + vaddr, readonly, ret); + if (ret) { + error_report("IOMMU_IOAS_MAP failed: %s", strerror(errno)); + } + return !ret ? 0 : -errno; +} + +int iommufd_backend_copy_dma(IOMMUFDBackend *be, uint32_t src_ioas, + uint32_t dst_ioas, hwaddr iova, + ram_addr_t size, bool readonly) +{ + int ret; + struct iommu_ioas_copy copy = { + .size = sizeof(copy), + .flags = IOMMU_IOAS_MAP_READABLE | + IOMMU_IOAS_MAP_FIXED_IOVA, + .dst_ioas_id = dst_ioas, + .src_ioas_id = src_ioas, + .length = size, + .dst_iova = iova, + .src_iova = iova, + }; + + if (!readonly) { + copy.flags |= IOMMU_IOAS_MAP_WRITEABLE; + } + + ret = ioctl(be->fd, IOMMU_IOAS_COPY, ©); + trace_iommufd_backend_copy_dma(be->fd, src_ioas, dst_ioas, + iova, size, readonly, ret); + if (ret) { + error_report("IOMMU_IOAS_COPY failed: %s", strerror(errno)); + } + return !ret ? 0 : -errno; +} + +static const TypeInfo iommufd_backend_info = { + .name = TYPE_IOMMUFD_BACKEND, + .parent = TYPE_OBJECT, + .instance_size = sizeof(IOMMUFDBackend), + .instance_init = iommufd_backend_init, + .instance_finalize = iommufd_backend_finalize, + .class_size = sizeof(IOMMUFDBackendClass), + .class_init = iommufd_backend_class_init, + .interfaces = (InterfaceInfo[]) { + { TYPE_USER_CREATABLE }, + { } + } +}; + +static void register_types(void) +{ + type_register_static(&iommufd_backend_info); +} + +type_init(register_types); diff --git a/backends/meson.build b/backends/meson.build index b1884a88ec..5cbdd3a858 100644 --- a/backends/meson.build +++ b/backends/meson.build @@ -15,6 +15,7 @@ softmmu_ss.add(when: 'CONFIG_LINUX', if_true: files('hostmem-memfd.c')) if have_vhost_user softmmu_ss.add(when: 'CONFIG_VIRTIO', if_true: files('vhost-user.c')) endif +specific_ss.add(when: 'CONFIG_IOMMUFD', if_true: files('iommufd.c')) softmmu_ss.add(when: 'CONFIG_VIRTIO_CRYPTO', if_true: files('cryptodev-vhost.c')) if have_vhost_user_crypto softmmu_ss.add(when: 'CONFIG_VIRTIO_CRYPTO', if_true: files('cryptodev-vhost-user.c')) diff --git a/backends/trace-events b/backends/trace-events index 652eb76a57..2c8af3e726 100644 --- a/backends/trace-events +++ b/backends/trace-events @@ -5,3 +5,15 @@ dbus_vmstate_pre_save(void) dbus_vmstate_post_load(int version_id) "version_id: %d" dbus_vmstate_loading(const char *id) "id: %s" dbus_vmstate_saving(const char *id) "id: %s" + +# iommufd.c +iommufd_backend_connect(int fd, bool owned, uint32_t users, int ret) "fd=%d owned=%d users=%d (%d)" +iommufd_backend_disconnect(int fd, uint32_t users) "fd=%d users=%d" +iommu_backend_set_fd(int fd) "pre-opened /dev/iommu fd=%d" +iommufd_backend_get_ioas(int iommufd, uint32_t ioas, int ret) " iommufd=%d ioas=%d (%d)" +iommufd_backend_put_ioas(int iommufd, uint32_t ioas) " iommufd=%d ioas=%d" +iommufd_backend_unmap_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)" +iommufd_backend_map_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, void *vaddr, bool readonly, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" addr=%p readonly=%d (%d)" +iommufd_backend_copy_dma(int iommufd, uint32_t src_ioas, uint32_t dst_ioas, uint64_t iova, uint64_t size, bool readonly, int ret) " iommufd=%d src_ioas=%d dst_ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" readonly=%d (%d)" +iommufd_backend_alloc_ioas(int iommufd, uint32_t ioas, int ret) " iommufd=%d ioas=%d (%d)" +iommufd_backend_free_ioas(int iommufd, uint32_t ioas, int ret) " iommufd=%d ioas=%d (%d)" diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h new file mode 100644 index 0000000000..06a866d1bd --- /dev/null +++ b/include/sysemu/iommufd.h @@ -0,0 +1,47 @@ +#ifndef SYSEMU_IOMMUFD_H +#define SYSEMU_IOMMUFD_H + +#include "qom/object.h" +#include "qemu/thread.h" +#include "exec/hwaddr.h" +#include "exec/ram_addr.h" + +#define TYPE_IOMMUFD_BACKEND "iommufd" +OBJECT_DECLARE_TYPE(IOMMUFDBackend, IOMMUFDBackendClass, + IOMMUFD_BACKEND) +#define IOMMUFD_BACKEND(obj) \ + OBJECT_CHECK(IOMMUFDBackend, (obj), TYPE_IOMMUFD_BACKEND) +#define IOMMUFD_BACKEND_GET_CLASS(obj) \ + OBJECT_GET_CLASS(IOMMUFDBackendClass, (obj), TYPE_IOMMUFD_BACKEND) +#define IOMMUFD_BACKEND_CLASS(klass) \ + OBJECT_CLASS_CHECK(IOMMUFDBackendClass, (klass), TYPE_IOMMUFD_BACKEND) +struct IOMMUFDBackendClass { + ObjectClass parent_class; +}; + +struct IOMMUFDBackend { + Object parent; + + /*< protected >*/ + int fd; /* /dev/iommu file descriptor */ + bool owned; /* is the /dev/iommu opened internally */ + QemuMutex lock; + uint32_t users; + + /*< public >*/ +}; + +int iommufd_backend_connect(IOMMUFDBackend *be, Error **errp); +void iommufd_backend_disconnect(IOMMUFDBackend *be); + +int iommufd_backend_get_ioas(IOMMUFDBackend *be, uint32_t *ioas_id); +void iommufd_backend_put_ioas(IOMMUFDBackend *be, uint32_t ioas_id); +int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas, + hwaddr iova, ram_addr_t size); +int iommufd_backend_map_dma(IOMMUFDBackend *be, uint32_t ioas, hwaddr iova, + ram_addr_t size, void *vaddr, bool readonly); +int iommufd_backend_copy_dma(IOMMUFDBackend *be, uint32_t src_ioas, + uint32_t dst_ioas, hwaddr iova, + ram_addr_t size, bool readonly); + +#endif diff --git a/qapi/qom.json b/qapi/qom.json index 6a653c6636..f609e7f667 100644 --- a/qapi/qom.json +++ b/qapi/qom.json @@ -734,6 +734,18 @@ { 'struct': 'RemoteObjectProperties', 'data': { 'fd': 'str', 'devid': 'str' } } +## +# @IOMMUFDProperties: +# +# Properties for IOMMUFDbackend objects. +# +# fd: file descriptor name +# +# Since: 7.2 +## +{ 'struct': 'IOMMUFDProperties', + 'data': { '*fd': 'str' } } + ## # @RngProperties: # @@ -862,6 +874,7 @@ 'qtest', 'rng-builtin', 'rng-egd', + 'iommufd', { 'name': 'rng-random', 'if': 'CONFIG_POSIX' }, 'secret', @@ -938,7 +951,8 @@ 'tls-creds-psk': 'TlsCredsPskProperties', 'tls-creds-x509': 'TlsCredsX509Properties', 'tls-cipher-suites': 'TlsCredsProperties', - 'x-remote-object': 'RemoteObjectProperties' + 'x-remote-object': 'RemoteObjectProperties', + 'iommufd': 'IOMMUFDProperties' } } ## diff --git a/qemu-options.hx b/qemu-options.hx index 60cf188da4..2f7204d826 100644 --- a/qemu-options.hx +++ b/qemu-options.hx @@ -4860,6 +4860,18 @@ SRST The ``share`` boolean option is on by default with memfd. + ``-object iommufd,id=id[,fd=fd]`` + Creates an iommufd backend which allows control of DMA mapping + through the /dev/iommu device. + + The ``id`` parameter is a unique ID which frontends (such as + vfio-pci of vdpa) will use to connect withe the iommufd backend. + + The ``fd`` parameter is an optional pre-opened file descriptor + resulting from /dev/iommu opening. Usually the iommufd is shared + accross all subsystems, bringing the benefit of centralized + reference counting. + ``-object rng-builtin,id=id`` Creates a random number generator backend which obtains entropy from QEMU builtin functions. The ``id`` parameter is a unique ID From patchwork Wed Jun 8 12:31:36 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yi Liu X-Patchwork-Id: 12873456 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id B3B18C43334 for ; Wed, 8 Jun 2022 12:48:53 +0000 (UTC) Received: from localhost ([::1]:50692 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nyv7E-0002UU-3A for qemu-devel@archiver.kernel.org; Wed, 08 Jun 2022 08:48:52 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:58092) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nyur6-0000p9-KA for qemu-devel@nongnu.org; Wed, 08 Jun 2022 08:32:12 -0400 Received: from mga05.intel.com ([192.55.52.43]:56958) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nyur4-0005ks-Aw for qemu-devel@nongnu.org; Wed, 08 Jun 2022 08:32:12 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1654691530; x=1686227530; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=9oajes8EV6rgYzGmYnPn9mJY8K7V4O6AG1IziEyV2Os=; b=c+H0sX7YvznzcOjJmycWUJ0Ph5qDMybIV6qo581E6qnstSW/o0DM+bVd 4nZH/l7iERJXUOQpVjWfA3yNWTB4HtDS2eSh8Qz4nYnNsEoXRZl+W2I7x vahrH5y7gVnF9oDXECUXiEWKP2GXiUf8REaN42otPcxHpkMg9uCjFJJiV IibCZd7SjvEr+dhxlZ+wWTCVNZyLWJ+tEASJcjwW4H4E4dXR4LC7cQjsB ASaVZ6TdeTlabFyY+FmImNHoQ/6em7N+OyOlZAbM+vfYWA9MesW3PPVex BUjSEC+rF3KcmNwb2g5OfqW6ZXkP371CrFmWPro7X97WP6pk4qvg/+neG Q==; X-IronPort-AV: E=McAfee;i="6400,9594,10371"; a="363210165" X-IronPort-AV: E=Sophos;i="5.91,286,1647327600"; d="scan'208";a="363210165" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Jun 2022 05:31:48 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.91,286,1647327600"; d="scan'208";a="670529878" Received: from 984fee00a4c6.jf.intel.com ([10.165.58.231]) by FMSMGA003.fm.intel.com with ESMTP; 08 Jun 2022 05:31:48 -0700 From: Yi Liu To: alex.williamson@redhat.com, cohuck@redhat.com, qemu-devel@nongnu.org Cc: david@gibson.dropbear.id.au, thuth@redhat.com, farman@linux.ibm.com, mjrosato@linux.ibm.com, akrowiak@linux.ibm.com, pasic@linux.ibm.com, jjherne@linux.ibm.com, jasowang@redhat.com, kvm@vger.kernel.org, jgg@nvidia.com, nicolinc@nvidia.com, eric.auger@redhat.com, eric.auger.pro@gmail.com, kevin.tian@intel.com, yi.l.liu@intel.com, chao.p.peng@intel.com, yi.y.sun@intel.com, peterx@redhat.com, shameerali.kolothum.thodi@huawei.com, zhangfei.gao@linaro.org, berrange@redhat.com Subject: [RFC v2 12/15] util/char_dev: Add open_cdev() Date: Wed, 8 Jun 2022 05:31:36 -0700 Message-Id: <20220608123139.19356-13-yi.l.liu@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220608123139.19356-1-yi.l.liu@intel.com> References: <20220608123139.19356-1-yi.l.liu@intel.com> MIME-Version: 1.0 Received-SPF: pass client-ip=192.55.52.43; envelope-from=yi.l.liu@intel.com; helo=mga05.intel.com X-Spam_score_int: -44 X-Spam_score: -4.5 X-Spam_bar: ---- X-Spam_report: (-4.5 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" /dev/vfio/devices/vfioX may not exist. In that case it is still possible to open /dev/char/$major:$minor instead. Add helper function to abstract the cdev open. Suggested-by: Jason Gunthorpe Signed-off-by: Yi Liu --- MAINTAINERS | 6 +++++ include/qemu/char_dev.h | 16 ++++++++++++ util/chardev_open.c | 58 +++++++++++++++++++++++++++++++++++++++++ util/meson.build | 1 + 4 files changed, 81 insertions(+) create mode 100644 include/qemu/char_dev.h create mode 100644 util/chardev_open.c diff --git a/MAINTAINERS b/MAINTAINERS index c6aa4e7fd0..0b3ade4170 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -3171,6 +3171,12 @@ S: Maintained F: include/qemu/iova-tree.h F: util/iova-tree.c +cdev Open +M: Yi Liu +S: Maintained +F: include/qemu/char_dev.h +F: util/chardev_open.c + elf2dmp M: Viktor Prutyanov S: Maintained diff --git a/include/qemu/char_dev.h b/include/qemu/char_dev.h new file mode 100644 index 0000000000..79877fb24d --- /dev/null +++ b/include/qemu/char_dev.h @@ -0,0 +1,16 @@ +/* + * QEMU Chardev Helper + * + * Copyright (C) 2022 Intel Corporation. + * + * Authors: Yi Liu + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + */ + +#ifndef QEMU_CHARDEV_HELPERS_H +#define QEMU_CHARDEV_HELPERS_H + +int open_cdev(const char *devpath, dev_t cdev); +#endif diff --git a/util/chardev_open.c b/util/chardev_open.c new file mode 100644 index 0000000000..133fad06ec --- /dev/null +++ b/util/chardev_open.c @@ -0,0 +1,58 @@ +/* + * Copyright (C) 2022 Intel Corporation. + * Copyright (c) 2019, Mellanox Technologies. All rights reserved. + * + * Authors: Yi Liu + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + * + * Copied from https://github.com/linux-rdma/rdma-core/blob/master/util/open_cdev.c + * + */ +#ifndef _GNU_SOURCE +#define _GNU_SOURCE +#endif +#include "qemu/osdep.h" +#include "qemu/char_dev.h" + +static int open_cdev_internal(const char *path, dev_t cdev) +{ + struct stat st; + int fd; + + fd = qemu_open_old(path, O_RDWR); + if (fd == -1) + return -1; + if (fstat(fd, &st) || !S_ISCHR(st.st_mode) || + (cdev != 0 && st.st_rdev != cdev)) { + close(fd); + return -1; + } + return fd; +} + +static int open_cdev_robust(dev_t cdev) +{ + char *devpath; + int ret; + + /* + * This assumes that udev is being used and is creating the /dev/char/ + * symlinks. + */ + devpath = g_strdup_printf("/dev/char/%u:%u", major(cdev), minor(cdev)); + ret = open_cdev_internal(devpath, cdev); + g_free(devpath); + return ret; +} + +int open_cdev(const char *devpath, dev_t cdev) +{ + int fd; + + fd = open_cdev_internal(devpath, cdev); + if (fd == -1 && cdev != 0) + return open_cdev_robust(cdev); + return fd; +} diff --git a/util/meson.build b/util/meson.build index 8f16018cd4..eff15d3826 100644 --- a/util/meson.build +++ b/util/meson.build @@ -95,4 +95,5 @@ if have_block util_ss.add(files('filemonitor-stub.c')) endif util_ss.add(when: 'CONFIG_LINUX', if_true: files('vfio-helpers.c')) + util_ss.add(when: 'CONFIG_LINUX', if_true: files('chardev_open.c')) endif From patchwork Wed Jun 8 12:31:37 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yi Liu X-Patchwork-Id: 12873469 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 676EBC43334 for ; Wed, 8 Jun 2022 12:52:49 +0000 (UTC) Received: from localhost ([::1]:52878 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nyvB2-0004Fj-E7 for qemu-devel@archiver.kernel.org; Wed, 08 Jun 2022 08:52:48 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:58126) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nyurA-0000vE-2X for qemu-devel@nongnu.org; Wed, 08 Jun 2022 08:32:16 -0400 Received: from mga05.intel.com ([192.55.52.43]:56952) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nyur7-0005kl-5B for qemu-devel@nongnu.org; Wed, 08 Jun 2022 08:32:15 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1654691533; x=1686227533; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=jAyo3pwMPZQg0heMxXYQSaTslfVs+jhdT9vHrNmEINw=; b=CdjM75bAw+TkN5sdx1X5/ce2BuPyFnr6HmgnOyLxrl39gl+lBazeUCoi NXjT4vs65g8jAXyaAI0hNXYTe4X+TjDvgJfc0b7KYwut4IGorOTIG9zfN lZ2q5BCKicPA1i8QXnUyDFa2+unys4QNawHnPWD5smZNpr7GlnLJXUXo8 q8PUjxSowsLY64D6tRrgGDmvnUSAsdPQHed8YlCyelub/k/zCtcADJpyY Z1Fb1gJeU/ruOPsUYRgiFhNF9FHsEz1mFU2vfOZzI2geLM5JLo9Ie425N GQNsq/EmsyUBe43fkQpALhaWPOw+aBTB5+TOF1lMaND14UjyBQYEElgW5 w==; X-IronPort-AV: E=McAfee;i="6400,9594,10371"; a="363210169" X-IronPort-AV: E=Sophos;i="5.91,286,1647327600"; d="scan'208";a="363210169" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Jun 2022 05:31:49 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.91,286,1647327600"; d="scan'208";a="670529889" Received: from 984fee00a4c6.jf.intel.com ([10.165.58.231]) by FMSMGA003.fm.intel.com with ESMTP; 08 Jun 2022 05:31:48 -0700 From: Yi Liu To: alex.williamson@redhat.com, cohuck@redhat.com, qemu-devel@nongnu.org Cc: david@gibson.dropbear.id.au, thuth@redhat.com, farman@linux.ibm.com, mjrosato@linux.ibm.com, akrowiak@linux.ibm.com, pasic@linux.ibm.com, jjherne@linux.ibm.com, jasowang@redhat.com, kvm@vger.kernel.org, jgg@nvidia.com, nicolinc@nvidia.com, eric.auger@redhat.com, eric.auger.pro@gmail.com, kevin.tian@intel.com, yi.l.liu@intel.com, chao.p.peng@intel.com, yi.y.sun@intel.com, peterx@redhat.com, shameerali.kolothum.thodi@huawei.com, zhangfei.gao@linaro.org, berrange@redhat.com Subject: [RFC v2 13/15] vfio/iommufd: Implement the iommufd backend Date: Wed, 8 Jun 2022 05:31:37 -0700 Message-Id: <20220608123139.19356-14-yi.l.liu@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220608123139.19356-1-yi.l.liu@intel.com> References: <20220608123139.19356-1-yi.l.liu@intel.com> MIME-Version: 1.0 Received-SPF: pass client-ip=192.55.52.43; envelope-from=yi.l.liu@intel.com; helo=mga05.intel.com X-Spam_score_int: -44 X-Spam_score: -4.5 X-Spam_bar: ---- X-Spam_report: (-4.5 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" Add the iommufd backend. The IOMMUFD container class is implemented based on the new /dev/iommu user API. This backend obviously depends on CONFIG_IOMMUFD. So far, the iommufd backend doesn't support live migration and cache coherency yet due to missing support in the host kernel meaning that only a subset of the container class callbacks is implemented. Co-authored-by: Eric Auger Signed-off-by: Eric Auger Signed-off-by: Yi Liu --- v1 -> v2: - arbitrarily set bcontainer->pgsizes to 4K to fix interoperability with virtio-iommu (Nicolin) - add tear down in iommufd_attach_device error path - use cdev open for /dev/vfio/devices/vfioX open --- hw/vfio/as.c | 2 +- hw/vfio/iommufd.c | 517 ++++++++++++++++++++++++++++++++++ hw/vfio/meson.build | 3 + hw/vfio/pci.c | 10 + hw/vfio/trace-events | 11 + include/hw/vfio/vfio-common.h | 22 ++ 6 files changed, 564 insertions(+), 1 deletion(-) create mode 100644 hw/vfio/iommufd.c diff --git a/hw/vfio/as.c b/hw/vfio/as.c index 1a3ceb5e62..3ff9d4215f 100644 --- a/hw/vfio/as.c +++ b/hw/vfio/as.c @@ -42,7 +42,7 @@ #include "migration/migration.h" #include "sysemu/tpm.h" -static QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces = +VFIOAddressSpaceList vfio_address_spaces = QLIST_HEAD_INITIALIZER(vfio_address_spaces); void vfio_host_win_add(VFIOContainer *container, diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c new file mode 100644 index 0000000000..7417c6ce44 --- /dev/null +++ b/hw/vfio/iommufd.c @@ -0,0 +1,517 @@ +/* + * iommufd container backend + * + * Copyright (C) 2022 Intel Corporation. + * Copyright Red Hat, Inc. 2022 + * + * Authors: Yi Liu + * Eric Auger + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + + * You should have received a copy of the GNU General Public License along + * with this program; if not, see . + */ + +#include "qemu/osdep.h" +#include +#include + +#include "hw/vfio/vfio-common.h" +#include "qemu/error-report.h" +#include "trace.h" +#include "qapi/error.h" +#include "sysemu/iommufd.h" +#include "hw/qdev-core.h" +#include "sysemu/reset.h" +#include "qemu/cutils.h" +#include "qemu/char_dev.h" + +static bool iommufd_check_extension(VFIOContainer *bcontainer, + VFIOContainerFeature feat) +{ + switch (feat) { + default: + return false; + }; +} + +static int iommufd_map(VFIOContainer *bcontainer, hwaddr iova, + ram_addr_t size, void *vaddr, bool readonly) +{ + VFIOIOMMUFDContainer *container = container_of(bcontainer, + VFIOIOMMUFDContainer, bcontainer); + + return iommufd_backend_map_dma(container->be, + container->ioas_id, + iova, size, vaddr, readonly); +} + +static int iommufd_unmap(VFIOContainer *bcontainer, + hwaddr iova, ram_addr_t size, + IOMMUTLBEntry *iotlb) +{ + VFIOIOMMUFDContainer *container = container_of(bcontainer, + VFIOIOMMUFDContainer, bcontainer); + + /* TODO: Handle dma_unmap_bitmap with iotlb args (migration) */ + return iommufd_backend_unmap_dma(container->be, + container->ioas_id, iova, size); +} + +static int vfio_get_devicefd(const char *sysfs_path, Error **errp) +{ + long int ret = -ENOTTY; + char *path, *vfio_dev_path = NULL, *vfio_path = NULL; + DIR *dir; + struct dirent *dent; + gchar *contents; + struct stat st; + gsize length; + int major, minor; + dev_t vfio_devt; + + path = g_strdup_printf("%s/vfio-device", sysfs_path); + if (stat(path, &st) < 0) { + error_setg_errno(errp, errno, "no such host device"); + goto out_free_path; + } + + dir = opendir(path); + if (!dir) { + error_setg_errno(errp, errno, "couldn't open dirrectory %s", path); + goto out_free_path; + } + + while ((dent = readdir(dir))) { + if (!strncmp(dent->d_name, "vfio", 4)) { + vfio_dev_path = g_strdup_printf("%s/%s/dev", path, dent->d_name); + break; + } + } + + if (!vfio_dev_path) { + error_setg(errp, "failed to find vfio-device/vfioX/dev"); + goto out_free_path; + } + + if (!g_file_get_contents(vfio_dev_path, &contents, &length, NULL)) { + error_setg(errp, "failed to load \"%s\"", vfio_dev_path); + goto out_free_dev_path; + } + + if (sscanf(contents, "%d:%d", &major, &minor) != 2) { + error_setg(errp, "failed to get major:mino for \"%s\"", vfio_dev_path); + goto out_free_dev_path; + } + g_free(contents); + vfio_devt = makedev(major, minor); + + vfio_path = g_strdup_printf("/dev/vfio/devices/%s", dent->d_name); + ret = open_cdev(vfio_path, vfio_devt); + if (ret < 0) { + error_setg(errp, "Failed to open %s", vfio_path); + } + + trace_vfio_iommufd_get_devicefd(vfio_path, ret); + g_free(vfio_path); + +out_free_dev_path: + g_free(vfio_dev_path); +out_free_path: + g_free(path); + + if (*errp) { + error_prepend(errp, VFIO_MSG_PREFIX, path); + } + return ret; +} + +static VFIOIOASHwpt *vfio_container_get_hwpt(VFIOIOMMUFDContainer *container, + uint32_t hwpt_id) +{ + VFIOIOASHwpt *hwpt; + + QLIST_FOREACH(hwpt, &container->hwpt_list, next) { + if (hwpt->hwpt_id == hwpt_id) { + return hwpt; + } + } + + hwpt = g_malloc0(sizeof(*hwpt)); + + hwpt->hwpt_id = hwpt_id; + QLIST_INIT(&hwpt->device_list); + QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next); + + return hwpt; +} + +static void vfio_container_put_hwpt(VFIOIOASHwpt *hwpt) +{ + if (!QLIST_EMPTY(&hwpt->device_list)) { + g_assert_not_reached(); + } + QLIST_REMOVE(hwpt, next); + g_free(hwpt); +} + +static VFIOIOASHwpt *vfio_find_hwpt_for_dev(VFIOIOMMUFDContainer *container, + VFIODevice *vbasedev) +{ + VFIOIOASHwpt *hwpt; + VFIODevice *vbasedev_iter; + + QLIST_FOREACH(hwpt, &container->hwpt_list, next) { + QLIST_FOREACH(vbasedev_iter, &hwpt->device_list, hwpt_next) { + if (vbasedev_iter == vbasedev) { + return hwpt; + } + } + } + return NULL; +} + +static void +__vfio_device_detach_container(VFIODevice *vbasedev, + VFIOIOMMUFDContainer *container, Error **errp) +{ + struct vfio_device_detach_ioas detach_data = { + .argsz = sizeof(detach_data), + .flags = 0, + .iommufd = container->be->fd, + .ioas_id = container->ioas_id, + }; + + if (ioctl(vbasedev->fd, VFIO_DEVICE_DETACH_IOAS, &detach_data)) { + error_setg_errno(errp, errno, "detach %s from ioas id=%d failed", + vbasedev->name, container->ioas_id); + } + trace_vfio_iommufd_detach_device(container->be->fd, vbasedev->name, + container->ioas_id); + + /* iommufd unbind is done per device fd close */ +} + +static void vfio_device_detach_container(VFIODevice *vbasedev, + VFIOIOMMUFDContainer *container, + Error **errp) +{ + VFIOIOASHwpt *hwpt; + + hwpt = vfio_find_hwpt_for_dev(container, vbasedev); + if (hwpt) { + QLIST_REMOVE(vbasedev, hwpt_next); + if (QLIST_EMPTY(&hwpt->device_list)) { + vfio_container_put_hwpt(hwpt); + } + } + + __vfio_device_detach_container(vbasedev, container, errp); +} + +static int vfio_device_attach_container(VFIODevice *vbasedev, + VFIOIOMMUFDContainer *container, + Error **errp) +{ + struct vfio_device_bind_iommufd bind = { + .argsz = sizeof(bind), + .flags = 0, + .iommufd = container->be->fd, + .dev_cookie = (uint64_t)vbasedev, + }; + struct vfio_device_attach_ioas attach_data = { + .argsz = sizeof(attach_data), + .flags = 0, + .iommufd = container->be->fd, + .ioas_id = container->ioas_id, + }; + VFIOIOASHwpt *hwpt; + int ret; + + /* Bind device to iommufd */ + ret = ioctl(vbasedev->fd, VFIO_DEVICE_BIND_IOMMUFD, &bind); + if (ret) { + error_setg_errno(errp, errno, "error bind device fd=%d to iommufd=%d", + vbasedev->fd, bind.iommufd); + return ret; + } + + vbasedev->devid = bind.out_devid; + trace_vfio_iommufd_bind_device(bind.iommufd, vbasedev->name, + vbasedev->fd, vbasedev->devid); + + /* Attach device to an ioas within iommufd */ + ret = ioctl(vbasedev->fd, VFIO_DEVICE_ATTACH_IOAS, &attach_data); + if (ret) { + error_setg_errno(errp, errno, + "[iommufd=%d] error attach %s (%d) to ioasid=%d", + container->be->fd, vbasedev->name, vbasedev->fd, + attach_data.ioas_id); + return ret; + + } + trace_vfio_iommufd_attach_device(bind.iommufd, vbasedev->name, + vbasedev->fd, container->ioas_id, + attach_data.out_hwpt_id); + + hwpt = vfio_container_get_hwpt(container, attach_data.out_hwpt_id); + + QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next); + return 0; +} + +static int vfio_device_reset(VFIODevice *vbasedev) +{ + if (vbasedev->dev->realized) { + vbasedev->ops->vfio_compute_needs_reset(vbasedev); + if (vbasedev->needs_reset) { + return vbasedev->ops->vfio_hot_reset_multi(vbasedev); + } + } + return 0; +} + +static int vfio_iommufd_container_reset(VFIOContainer *bcontainer) +{ + VFIOIOMMUFDContainer *container; + int ret, final_ret = 0; + VFIODevice *vbasedev; + VFIOIOASHwpt *hwpt; + + container = container_of(bcontainer, VFIOIOMMUFDContainer, bcontainer); + + QLIST_FOREACH(hwpt, &container->hwpt_list, next) { + QLIST_FOREACH(vbasedev, &hwpt->device_list, hwpt_next) { + ret = vfio_device_reset(vbasedev); + if (ret) { + error_report("failed to reset %s (%d)", vbasedev->name, ret); + final_ret = ret; + } else { + trace_vfio_iommufd_container_reset(vbasedev->name); + } + } + } + return final_ret; +} + +static void vfio_iommufd_container_destroy(VFIOIOMMUFDContainer *container) +{ + vfio_container_destroy(&container->bcontainer); + g_free(container); +} + +static int vfio_ram_block_discard_disable(bool state) +{ + /* + * We support coordinated discarding of RAM via the RamDiscardManager. + */ + return ram_block_uncoordinated_discard_disable(state); +} + +static void iommufd_detach_device(VFIODevice *vbasedev); + +static int iommufd_attach_device(VFIODevice *vbasedev, AddressSpace *as, + Error **errp) +{ + VFIOContainer *bcontainer; + VFIOIOMMUFDContainer *container; + VFIOAddressSpace *space; + struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) }; + int ret, devfd; + uint32_t ioas_id; + Error *err = NULL; + + devfd = vfio_get_devicefd(vbasedev->sysfsdev, errp); + if (devfd < 0) { + return devfd; + } + vbasedev->fd = devfd; + + space = vfio_get_address_space(as); + + /* try to attach to an existing container in this space */ + QLIST_FOREACH(bcontainer, &space->containers, next) { + if (bcontainer->ops != &iommufd_container_ops) { + continue; + } + container = container_of(bcontainer, VFIOIOMMUFDContainer, bcontainer); + if (vfio_device_attach_container(vbasedev, container, &err)) { + const char *msg = error_get_pretty(err); + + trace_vfio_iommufd_fail_attach_existing_container(msg); + error_free(err); + err = NULL; + } else { + ret = vfio_ram_block_discard_disable(true); + if (ret) { + vfio_device_detach_container(vbasedev, container, &err); + error_propagate(errp, err); + vfio_put_address_space(space); + close(vbasedev->fd); + error_prepend(errp, + "Cannot set discarding of RAM broken (%d)", ret); + return ret; + } + goto out; + } + } + + /* Need to allocate a new dedicated container */ + ret = iommufd_backend_get_ioas(vbasedev->iommufd, &ioas_id); + if (ret < 0) { + vfio_put_address_space(space); + close(vbasedev->fd); + error_report("Failed to alloc ioas (%s)", strerror(errno)); + return ret; + } + + trace_vfio_iommufd_alloc_ioas(vbasedev->iommufd->fd, ioas_id); + + container = g_malloc0(sizeof(*container)); + container->be = vbasedev->iommufd; + container->ioas_id = ioas_id; + QLIST_INIT(&container->hwpt_list); + + bcontainer = &container->bcontainer; + vfio_container_init(bcontainer, space, &iommufd_container_ops); + + ret = vfio_device_attach_container(vbasedev, container, &err); + if (ret) { + /* todo check if any other thing to do */ + error_propagate(errp, err); + vfio_iommufd_container_destroy(container); + iommufd_backend_put_ioas(vbasedev->iommufd, ioas_id); + vfio_put_address_space(space); + close(vbasedev->fd); + return ret; + } + + ret = vfio_ram_block_discard_disable(true); + if (ret) { + goto error; + } + + /* + * TODO: for now iommufd BE is on par with vfio iommu type1, so it's + * fine to add the whole range as window. For SPAPR, below code + * should be updated. + */ + vfio_host_win_add(bcontainer, 0, (hwaddr)-1, 4096); + bcontainer->pgsizes = 4096; + + /* + * TODO: kvmgroup, unable to do it before the protocol done + * between iommufd and kvm. + */ + + QLIST_INSERT_HEAD(&space->containers, bcontainer, next); + + bcontainer->listener = vfio_memory_listener; + + memory_listener_register(&bcontainer->listener, bcontainer->space->as); + + bcontainer->initialized = true; + +out: + vbasedev->container = bcontainer; + + /* + * TODO: examine RAM_BLOCK_DISCARD stuff, should we do group level + * for discarding incompatibility check as well? + */ + if (vbasedev->ram_block_discard_allowed) { + vfio_ram_block_discard_disable(false); + } + + ret = ioctl(devfd, VFIO_DEVICE_GET_INFO, &dev_info); + if (ret) { + error_setg_errno(errp, errno, "error getting device info"); + memory_listener_unregister(&bcontainer->listener); + QLIST_SAFE_REMOVE(bcontainer, next); + goto error; + } + + vbasedev->group = 0; + vbasedev->num_irqs = dev_info.num_irqs; + vbasedev->num_regions = dev_info.num_regions; + vbasedev->flags = dev_info.flags; + vbasedev->reset_works = !!(dev_info.flags & VFIO_DEVICE_FLAGS_RESET); + + trace_vfio_iommufd_device_info(vbasedev->name, devfd, vbasedev->num_irqs, + vbasedev->num_regions, vbasedev->flags); + return 0; +error: + vfio_device_detach_container(vbasedev, container, &err); + error_propagate(errp, err); + vfio_iommufd_container_destroy(container); + iommufd_backend_put_ioas(vbasedev->iommufd, ioas_id); + vfio_put_address_space(space); + close(vbasedev->fd); + return ret; +} + +static void iommufd_detach_device(VFIODevice *vbasedev) +{ + VFIOContainer *bcontainer = vbasedev->container; + VFIOIOMMUFDContainer *container; + VFIODevice *vbasedev_iter; + VFIOIOASHwpt *hwpt; + Error *err = NULL; + + if (!bcontainer) { + goto out; + } + + if (!vbasedev->ram_block_discard_allowed) { + vfio_ram_block_discard_disable(false); + } + + container = container_of(bcontainer, VFIOIOMMUFDContainer, bcontainer); + QLIST_FOREACH(hwpt, &container->hwpt_list, next) { + QLIST_FOREACH(vbasedev_iter, &hwpt->device_list, hwpt_next) { + if (vbasedev_iter == vbasedev) { + goto found; + } + } + } + g_assert_not_reached(); +found: + QLIST_REMOVE(vbasedev, hwpt_next); + if (QLIST_EMPTY(&hwpt->device_list)) { + vfio_container_put_hwpt(hwpt); + } + + __vfio_device_detach_container(vbasedev, container, &err); + if (err) { + error_report_err(err); + } + if (QLIST_EMPTY(&container->hwpt_list)) { + VFIOAddressSpace *space = bcontainer->space; + + iommufd_backend_put_ioas(container->be, container->ioas_id); + vfio_iommufd_container_destroy(container); + vfio_put_address_space(space); + } + vbasedev->container = NULL; +out: + close(vbasedev->fd); + g_free(vbasedev->name); +} + +const VFIOContainerOps iommufd_container_ops = { + .check_extension = iommufd_check_extension, + .dma_map = iommufd_map, + .dma_unmap = iommufd_unmap, + .attach_device = iommufd_attach_device, + .detach_device = iommufd_detach_device, + .reset = vfio_iommufd_container_reset, +}; diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build index 067c3c3c5f..50e370a409 100644 --- a/hw/vfio/meson.build +++ b/hw/vfio/meson.build @@ -7,6 +7,9 @@ vfio_ss.add(files( 'spapr.c', 'migration.c', )) +vfio_ss.add(when: 'CONFIG_IOMMUFD', if_true: files( + 'iommufd.c', +)) vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files( 'display.c', 'pci-quirks.c', diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index 07361a6fcd..6d10e86331 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -3191,6 +3191,16 @@ static void vfio_pci_reset(DeviceState *dev) goto post_reset; } + /* + * This is a temporary check, long term iommufd should + * support hot reset as well + */ + if (vdev->vbasedev.be == VFIO_IOMMU_BACKEND_TYPE_IOMMUFD) { + error_report("Dangerous: iommufd BE doesn't support hot " + "reset, please stop the VM"); + goto post_reset; + } + /* See if we can do our own bus reset */ if (!vfio_pci_hot_reset_one(vdev)) { goto post_reset; diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events index 582882db91..68675c6f8f 100644 --- a/hw/vfio/trace-events +++ b/hw/vfio/trace-events @@ -166,3 +166,14 @@ vfio_load_state_device_data(const char *name, uint64_t data_offset, uint64_t dat vfio_load_cleanup(const char *name) " (%s)" vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64 vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64 + +#iommufd.c + +vfio_iommufd_get_devicefd(const char *dev, int devfd) " %s (fd=%d)" +vfio_iommufd_bind_device(int iommufd, const char *name, int devfd, int devid) " [iommufd=%d] Succesfully bound device %s (fd=%d): output devid=%d" +vfio_iommufd_attach_device(int iommufd, const char *name, int devfd, int ioasid, int hwptid) " [iommufd=%d] Succesfully attached device %s (%d) to ioasid=%d: output hwptd=%d" +vfio_iommufd_detach_device(int iommufd, const char *name, int ioasid) " [iommufd=%d] Detached %s from ioasid=%d" +vfio_iommufd_alloc_ioas(int iommufd, int ioas_id) " [iommufd=%d] new IOMMUFD container with ioasid=%d" +vfio_iommufd_device_info(char *name, int devfd, int num_irqs, int num_regions, int flags) " %s (%d) num_irqs=%d num_regions=%d flags=%d" +vfio_iommufd_fail_attach_existing_container(const char *msg) " %s" +vfio_iommufd_container_reset(char *name) " Successfully reset %s" diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index 635397d391..2470b6d58a 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -36,6 +36,7 @@ extern const MemoryListener vfio_memory_listener; extern const VFIOContainerOps legacy_container_ops; +extern const VFIOContainerOps iommufd_container_ops; enum { VFIO_DEVICE_TYPE_PCI = 0, @@ -82,6 +83,24 @@ typedef struct VFIOLegacyContainer { QLIST_HEAD(, VFIOGroup) group_list; } VFIOLegacyContainer; +typedef struct VFIOIOASHwpt { + uint32_t hwpt_id; + QLIST_HEAD(, VFIODevice) device_list; + QLIST_ENTRY(VFIOIOASHwpt) next; +} VFIOIOASHwpt; + +typedef struct IOMMUFDBackend IOMMUFDBackend; + +typedef struct VFIOIOMMUFDContainer { + VFIOContainer bcontainer; + IOMMUFDBackend *be; + uint32_t ioas_id; + QLIST_HEAD(, VFIOIOASHwpt) hwpt_list; +} VFIOIOMMUFDContainer; + +typedef QLIST_HEAD(VFIOAddressSpaceList, VFIOAddressSpace) VFIOAddressSpaceList; +extern VFIOAddressSpaceList vfio_address_spaces; + typedef struct VFIODeviceOps VFIODeviceOps; typedef enum VFIOIOMMUBackendType { @@ -91,6 +110,7 @@ typedef enum VFIOIOMMUBackendType { typedef struct VFIODevice { QLIST_ENTRY(VFIODevice) next; + QLIST_ENTRY(VFIODevice) hwpt_next; struct VFIOGroup *group; VFIOContainer *container; char *sysfsdev; @@ -98,6 +118,7 @@ typedef struct VFIODevice { DeviceState *dev; int fd; int type; + int devid; bool reset_works; bool needs_reset; bool no_mmap; @@ -111,6 +132,7 @@ typedef struct VFIODevice { VFIOMigration *migration; Error *migration_blocker; OnOffAuto pre_copy_dirty_page_tracking; + IOMMUFDBackend *iommufd; } VFIODevice; struct VFIODeviceOps { From patchwork Wed Jun 8 12:31:38 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yi Liu X-Patchwork-Id: 12873471 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 1BEECC43334 for ; Wed, 8 Jun 2022 12:57:11 +0000 (UTC) Received: from localhost ([::1]:59230 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nyvFD-0000RJ-DV for qemu-devel@archiver.kernel.org; Wed, 08 Jun 2022 08:57:07 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:58158) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nyurJ-0001F4-Jy for qemu-devel@nongnu.org; Wed, 08 Jun 2022 08:32:26 -0400 Received: from mga05.intel.com ([192.55.52.43]:56958) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nyurG-0005ks-UH for qemu-devel@nongnu.org; Wed, 08 Jun 2022 08:32:25 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1654691542; x=1686227542; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=QOWvCiC5xeNIhozB8NOdNSZTOHuvSql4zTaZ4nNSoFA=; b=QTYm4nmtsKy9oDYsFqDOGNxC5fTe9XoeoRYzOpSPfG9G19K5xUL6guKq FTNw+oLLlkO23RbhPgN6/ICaCk6P557VQjTlktjp35d0ypO7wAnzIBUYn dOvZrru7GwazaK5uhlQrQHUom5pyVRRml9g+3eUhhPhBHgOPDZ2XiDmr5 h/HhkOmBSvnoO1rQXhm+WiBNq91mStFWgjYQwdtEpe/Ga3wUuStcYeyh0 aG3SPPhHDob5rW3vkG9oY4E3SV9DINe7wtc9aOIelfLs9k+pW6EylrseY ES8iZH45AcG6QaFvVLA7izdMliN74Vtm/M+o87qc0uiw5WMq7a04UP/Lu Q==; X-IronPort-AV: E=McAfee;i="6400,9594,10371"; a="363210174" X-IronPort-AV: E=Sophos;i="5.91,286,1647327600"; d="scan'208";a="363210174" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Jun 2022 05:31:50 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.91,286,1647327600"; d="scan'208";a="670529900" Received: from 984fee00a4c6.jf.intel.com ([10.165.58.231]) by FMSMGA003.fm.intel.com with ESMTP; 08 Jun 2022 05:31:49 -0700 From: Yi Liu To: alex.williamson@redhat.com, cohuck@redhat.com, qemu-devel@nongnu.org Cc: david@gibson.dropbear.id.au, thuth@redhat.com, farman@linux.ibm.com, mjrosato@linux.ibm.com, akrowiak@linux.ibm.com, pasic@linux.ibm.com, jjherne@linux.ibm.com, jasowang@redhat.com, kvm@vger.kernel.org, jgg@nvidia.com, nicolinc@nvidia.com, eric.auger@redhat.com, eric.auger.pro@gmail.com, kevin.tian@intel.com, yi.l.liu@intel.com, chao.p.peng@intel.com, yi.y.sun@intel.com, peterx@redhat.com, shameerali.kolothum.thodi@huawei.com, zhangfei.gao@linaro.org, berrange@redhat.com Subject: [RFC v2 14/15] vfio/iommufd: Add IOAS_COPY_DMA support Date: Wed, 8 Jun 2022 05:31:38 -0700 Message-Id: <20220608123139.19356-15-yi.l.liu@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220608123139.19356-1-yi.l.liu@intel.com> References: <20220608123139.19356-1-yi.l.liu@intel.com> MIME-Version: 1.0 Received-SPF: pass client-ip=192.55.52.43; envelope-from=yi.l.liu@intel.com; helo=mga05.intel.com X-Spam_score_int: -44 X-Spam_score: -4.5 X-Spam_bar: ---- X-Spam_report: (-4.5 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" Compared with legacy vfio container BE, one of the benefits provided by iommufd is to reduce the redundant page pinning on kernel side through the usage of IOAS_COPY_DMA. For iommufd containers within the same address space, IOVA mappings can be copied from a source container to destination container. To achieve this, move the vfio_memory_listener to be per address space. In the memory listener callbacks, all the containers within the address space will be looped. For the iommufd containers, QEMU uses IOAS_MAP_DMA on the first one, and then uses IOAS_COPY_DMA to copy the IOVA mappings from the first iommufd container to other iommufd containers within the address space. For legacy containers, IOVA mapping is done by VFIO_IOMMU_MAP_DMA. Signed-off-by: Yi Liu --- hw/vfio/as.c | 117 ++++++++++++++++++++++---- hw/vfio/container-base.c | 13 ++- hw/vfio/container.c | 19 ++--- hw/vfio/iommufd.c | 47 +++++++++-- include/hw/vfio/vfio-common.h | 5 +- include/hw/vfio/vfio-container-base.h | 8 +- 6 files changed, 167 insertions(+), 42 deletions(-) diff --git a/hw/vfio/as.c b/hw/vfio/as.c index 3ff9d4215f..56485f9299 100644 --- a/hw/vfio/as.c +++ b/hw/vfio/as.c @@ -405,16 +405,16 @@ static bool vfio_known_safe_misalignment(MemoryRegionSection *section) return true; } -static void vfio_listener_region_add(MemoryListener *listener, - MemoryRegionSection *section) +static void vfio_container_region_add(VFIOContainer *container, + VFIOContainer **src_container, + MemoryRegionSection *section) { - VFIOContainer *container = container_of(listener, VFIOContainer, listener); hwaddr iova, end; Int128 llend, llsize; void *vaddr; int ret; VFIOHostDMAWindow *hostwin; - bool hostwin_found; + bool hostwin_found, copy_dma_supported = false; Error *err = NULL; if (vfio_listener_skipped_section(section)) { @@ -558,12 +558,25 @@ static void vfio_listener_region_add(MemoryListener *listener, } } + copy_dma_supported = vfio_container_check_extension(container, + VFIO_FEAT_DMA_COPY); + + if (copy_dma_supported && *src_container) { + if (!vfio_container_dma_copy(*src_container, container, + iova, int128_get64(llsize), + section->readonly)) { + return; + } else { + info_report("IOAS copy failed try map for container: %p", container); + } + } + ret = vfio_container_dma_map(container, iova, int128_get64(llsize), vaddr, section->readonly); if (ret) { - error_setg(&err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", " - "0x%"HWADDR_PRIx", %p) = %d (%m)", - container, iova, int128_get64(llsize), vaddr, ret); + error_setg(&err, "vfio_container_dma_map(%p, 0x%"HWADDR_PRIx", " + "0x%"HWADDR_PRIx", %p) = %d (%m)", container, iova, + int128_get64(llsize), vaddr, ret); if (memory_region_is_ram_device(section->mr)) { /* Allow unexpected mappings not to be fatal for RAM devices */ error_report_err(err); @@ -572,6 +585,9 @@ static void vfio_listener_region_add(MemoryListener *listener, goto fail; } + if (copy_dma_supported) { + *src_container = container; + } return; fail: @@ -598,10 +614,22 @@ fail: } } -static void vfio_listener_region_del(MemoryListener *listener, +static void vfio_listener_region_add(MemoryListener *listener, MemoryRegionSection *section) { - VFIOContainer *container = container_of(listener, VFIOContainer, listener); + VFIOAddressSpace *space = container_of(listener, + VFIOAddressSpace, listener); + VFIOContainer *container, *src_container; + + src_container = NULL; + QLIST_FOREACH(container, &space->containers, next) { + vfio_container_region_add(container, &src_container, section); + } +} + +static void vfio_container_region_del(VFIOContainer *container, + MemoryRegionSection *section) +{ hwaddr iova, end; Int128 llend, llsize; int ret; @@ -707,18 +735,38 @@ static void vfio_listener_region_del(MemoryListener *listener, vfio_container_del_section_window(container, section); } +static void vfio_listener_region_del(MemoryListener *listener, + MemoryRegionSection *section) +{ + VFIOAddressSpace *space = container_of(listener, + VFIOAddressSpace, listener); + VFIOContainer *container; + + QLIST_FOREACH(container, &space->containers, next) { + vfio_container_region_del(container, section); + } +} + static void vfio_listener_log_global_start(MemoryListener *listener) { - VFIOContainer *container = container_of(listener, VFIOContainer, listener); + VFIOAddressSpace *space = container_of(listener, + VFIOAddressSpace, listener); + VFIOContainer *container; - vfio_container_set_dirty_page_tracking(container, true); + QLIST_FOREACH(container, &space->containers, next) { + vfio_container_set_dirty_page_tracking(container, true); + } } static void vfio_listener_log_global_stop(MemoryListener *listener) { - VFIOContainer *container = container_of(listener, VFIOContainer, listener); + VFIOAddressSpace *space = container_of(listener, + VFIOAddressSpace, listener); + VFIOContainer *container; - vfio_container_set_dirty_page_tracking(container, false); + QLIST_FOREACH(container, &space->containers, next) { + vfio_container_set_dirty_page_tracking(container, false); + } } typedef struct { @@ -848,11 +896,9 @@ static int vfio_sync_dirty_bitmap(VFIOContainer *container, int128_get64(section->size), ram_addr); } -static void vfio_listener_log_sync(MemoryListener *listener, - MemoryRegionSection *section) +static void vfio_container_log_sync(VFIOContainer *container, + MemoryRegionSection *section) { - VFIOContainer *container = container_of(listener, VFIOContainer, listener); - if (vfio_listener_skipped_section(section) || !container->dirty_pages_supported) { return; @@ -863,6 +909,18 @@ static void vfio_listener_log_sync(MemoryListener *listener, } } +static void vfio_listener_log_sync(MemoryListener *listener, + MemoryRegionSection *section) +{ + VFIOAddressSpace *space = container_of(listener, + VFIOAddressSpace, listener); + VFIOContainer *container; + + QLIST_FOREACH(container, &space->containers, next) { + vfio_container_log_sync(container, section); + } +} + const MemoryListener vfio_memory_listener = { .name = "vfio", .region_add = vfio_listener_region_add, @@ -907,6 +965,31 @@ VFIOAddressSpace *vfio_get_address_space(AddressSpace *as) return space; } +void vfio_as_add_container(VFIOAddressSpace *space, + VFIOContainer *container) +{ + if (space->listener_initialized) { + memory_listener_unregister(&space->listener); + } + + QLIST_INSERT_HEAD(&space->containers, container, next); + + /* Unregistration happen in vfio_as_del_container() */ + space->listener = vfio_memory_listener; + memory_listener_register(&space->listener, space->as); + space->listener_initialized = true; +} + +void vfio_as_del_container(VFIOAddressSpace *space, + VFIOContainer *container) +{ + QLIST_SAFE_REMOVE(container, next); + + if (QLIST_EMPTY(&space->containers)) { + memory_listener_unregister(&space->listener); + } +} + void vfio_put_address_space(VFIOAddressSpace *space) { if (QLIST_EMPTY(&space->containers)) { diff --git a/hw/vfio/container-base.c b/hw/vfio/container-base.c index a9f28e4b9d..0a6bc31686 100644 --- a/hw/vfio/container-base.c +++ b/hw/vfio/container-base.c @@ -47,6 +47,17 @@ int vfio_container_dma_map(VFIOContainer *container, return container->ops->dma_map(container, iova, size, vaddr, readonly); } +int vfio_container_dma_copy(VFIOContainer *src, VFIOContainer *dst, + hwaddr iova, ram_addr_t size, bool readonly) +{ + if (!src->ops->dma_copy || src->ops->dma_copy != dst->ops->dma_copy) { + error_report("Incompatible container: unable to copy dma"); + return -EINVAL; + } + + return src->ops->dma_copy(src, dst, iova, size, readonly); +} + int vfio_container_dma_unmap(VFIOContainer *container, hwaddr iova, ram_addr_t size, IOMMUTLBEntry *iotlb) @@ -137,8 +148,6 @@ void vfio_container_destroy(VFIOContainer *container) VFIOGuestIOMMU *giommu, *tmp; VFIOHostDMAWindow *hostwin, *next; - QLIST_SAFE_REMOVE(container, next); - QLIST_FOREACH_SAFE(vrdl, &container->vrdl_list, next, vrdl_tmp) { RamDiscardManager *rdm; diff --git a/hw/vfio/container.c b/hw/vfio/container.c index 2d9704bc1a..760ba56da4 100644 --- a/hw/vfio/container.c +++ b/hw/vfio/container.c @@ -362,9 +362,6 @@ err_out: static void vfio_listener_release(VFIOLegacyContainer *container) { - VFIOContainer *bcontainer = &container->bcontainer; - - memory_listener_unregister(&bcontainer->listener); if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) { memory_listener_unregister(&container->prereg_listener); } @@ -894,14 +891,11 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, vfio_kvm_device_add_group(group); QLIST_INIT(&container->group_list); - QLIST_INSERT_HEAD(&space->containers, bcontainer, next); group->container = container; QLIST_INSERT_HEAD(&container->group_list, group, container_next); - bcontainer->listener = vfio_memory_listener; - - memory_listener_register(&bcontainer->listener, bcontainer->space->as); + vfio_as_add_container(space, bcontainer); if (bcontainer->error) { ret = -1; @@ -914,8 +908,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, return 0; listener_release_exit: + vfio_as_del_container(space, bcontainer); QLIST_REMOVE(group, container_next); - QLIST_REMOVE(bcontainer, next); vfio_kvm_device_del_group(group); vfio_listener_release(container); @@ -938,6 +932,7 @@ static void vfio_disconnect_container(VFIOGroup *group) { VFIOLegacyContainer *container = group->container; VFIOContainer *bcontainer = &container->bcontainer; + VFIOAddressSpace *space = bcontainer->space; QLIST_REMOVE(group, container_next); group->container = NULL; @@ -945,10 +940,12 @@ static void vfio_disconnect_container(VFIOGroup *group) /* * Explicitly release the listener first before unset container, * since unset may destroy the backend container if it's the last - * group. + * group. By removing container from the list, container is disconnected + * with address space memory listener. */ if (QLIST_EMPTY(&container->group_list)) { vfio_listener_release(container); + vfio_as_del_container(space, bcontainer); } if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) { @@ -957,10 +954,8 @@ static void vfio_disconnect_container(VFIOGroup *group) } if (QLIST_EMPTY(&container->group_list)) { - VFIOAddressSpace *space = bcontainer->space; - - vfio_container_destroy(bcontainer); trace_vfio_disconnect_container(container->fd); + vfio_container_destroy(bcontainer); close(container->fd); g_free(container); diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c index 7417c6ce44..75f43345f0 100644 --- a/hw/vfio/iommufd.c +++ b/hw/vfio/iommufd.c @@ -39,6 +39,8 @@ static bool iommufd_check_extension(VFIOContainer *bcontainer, VFIOContainerFeature feat) { switch (feat) { + case VFIO_FEAT_DMA_COPY: + return true; default: return false; }; @@ -55,6 +57,20 @@ static int iommufd_map(VFIOContainer *bcontainer, hwaddr iova, iova, size, vaddr, readonly); } +static int iommufd_copy(VFIOContainer *src, VFIOContainer *dst, + hwaddr iova, ram_addr_t size, bool readonly) +{ + VFIOIOMMUFDContainer *container_src = container_of(src, + VFIOIOMMUFDContainer, bcontainer); + VFIOIOMMUFDContainer *container_dst = container_of(dst, + VFIOIOMMUFDContainer, bcontainer); + + assert(container_src->be->fd == container_dst->be->fd); + + return iommufd_backend_copy_dma(container_src->be, container_src->ioas_id, + container_dst->ioas_id, iova, size, readonly); +} + static int iommufd_unmap(VFIOContainer *bcontainer, hwaddr iova, ram_addr_t size, IOMMUTLBEntry *iotlb) @@ -413,12 +429,14 @@ static int iommufd_attach_device(VFIODevice *vbasedev, AddressSpace *as, * between iommufd and kvm. */ - QLIST_INSERT_HEAD(&space->containers, bcontainer, next); - - bcontainer->listener = vfio_memory_listener; - - memory_listener_register(&bcontainer->listener, bcontainer->space->as); + vfio_as_add_container(space, bcontainer); + if (bcontainer->error) { + ret = -1; + error_propagate_prepend(errp, bcontainer->error, + "memory listener initialization failed: "); + goto error; + } bcontainer->initialized = true; out: @@ -435,8 +453,7 @@ out: ret = ioctl(devfd, VFIO_DEVICE_GET_INFO, &dev_info); if (ret) { error_setg_errno(errp, errno, "error getting device info"); - memory_listener_unregister(&bcontainer->listener); - QLIST_SAFE_REMOVE(bcontainer, next); + vfio_as_del_container(space, bcontainer); goto error; } @@ -465,6 +482,7 @@ static void iommufd_detach_device(VFIODevice *vbasedev) VFIOIOMMUFDContainer *container; VFIODevice *vbasedev_iter; VFIOIOASHwpt *hwpt; + VFIOAddressSpace *space; Error *err = NULL; if (!bcontainer) { @@ -490,15 +508,25 @@ found: vfio_container_put_hwpt(hwpt); } + space = bcontainer->space; + /* + * Needs to remove the bcontainer from space->containers list before + * detach container. Otherwise, detach container may destroy the + * container if it's the last device. By removing bcontainer from the + * list, container is disconnected with address space memory listener. + */ + if (QLIST_EMPTY(&container->hwpt_list)) { + vfio_as_del_container(space, bcontainer); + } __vfio_device_detach_container(vbasedev, container, &err); if (err) { error_report_err(err); } if (QLIST_EMPTY(&container->hwpt_list)) { - VFIOAddressSpace *space = bcontainer->space; + uint32_t ioas_id = container->ioas_id; - iommufd_backend_put_ioas(container->be, container->ioas_id); vfio_iommufd_container_destroy(container); + iommufd_backend_put_ioas(vbasedev->iommufd, ioas_id); vfio_put_address_space(space); } vbasedev->container = NULL; @@ -510,6 +538,7 @@ out: const VFIOContainerOps iommufd_container_ops = { .check_extension = iommufd_check_extension, .dma_map = iommufd_map, + .dma_copy = iommufd_copy, .dma_unmap = iommufd_unmap, .attach_device = iommufd_attach_device, .detach_device = iommufd_detach_device, diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index 2470b6d58a..6269672712 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -34,7 +34,6 @@ #define VFIO_MSG_PREFIX "vfio %s: " -extern const MemoryListener vfio_memory_listener; extern const VFIOContainerOps legacy_container_ops; extern const VFIOContainerOps iommufd_container_ops; @@ -186,6 +185,10 @@ void vfio_host_win_add(VFIOContainer *bcontainer, int vfio_host_win_del(VFIOContainer *bcontainer, hwaddr min_iova, hwaddr max_iova); VFIOAddressSpace *vfio_get_address_space(AddressSpace *as); +void vfio_as_add_container(VFIOAddressSpace *space, + VFIOContainer *bcontainer); +void vfio_as_del_container(VFIOAddressSpace *space, + VFIOContainer *container); void vfio_put_address_space(VFIOAddressSpace *space); void vfio_put_base_device(VFIODevice *vbasedev); diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h index f9fb8b6af7..27ed29e081 100644 --- a/include/hw/vfio/vfio-container-base.h +++ b/include/hw/vfio/vfio-container-base.h @@ -31,12 +31,15 @@ typedef enum VFIOContainerFeature { VFIO_FEAT_LIVE_MIGRATION, + VFIO_FEAT_DMA_COPY, } VFIOContainerFeature; typedef struct VFIOContainer VFIOContainer; typedef struct VFIOAddressSpace { AddressSpace *as; + MemoryListener listener; + bool listener_initialized; QLIST_HEAD(, VFIOContainer) containers; QLIST_ENTRY(VFIOAddressSpace) list; } VFIOAddressSpace; @@ -75,6 +78,8 @@ typedef struct VFIOContainerOps { int (*dma_map)(VFIOContainer *container, hwaddr iova, ram_addr_t size, void *vaddr, bool readonly); + int (*dma_copy)(VFIOContainer *src, VFIOContainer *dst, + hwaddr iova, ram_addr_t size, bool readonly); int (*dma_unmap)(VFIOContainer *container, hwaddr iova, ram_addr_t size, IOMMUTLBEntry *iotlb); @@ -101,7 +106,6 @@ typedef struct VFIOContainerOps { struct VFIOContainer { const VFIOContainerOps *ops; VFIOAddressSpace *space; - MemoryListener listener; Error *error; bool initialized; bool dirty_pages_supported; @@ -120,6 +124,8 @@ bool vfio_container_check_extension(VFIOContainer *container, int vfio_container_dma_map(VFIOContainer *container, hwaddr iova, ram_addr_t size, void *vaddr, bool readonly); +int vfio_container_dma_copy(VFIOContainer *src, VFIOContainer *dst, + hwaddr iova, ram_addr_t size, bool readonly); int vfio_container_dma_unmap(VFIOContainer *container, hwaddr iova, ram_addr_t size, IOMMUTLBEntry *iotlb); From patchwork Wed Jun 8 12:31:39 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yi Liu X-Patchwork-Id: 12873478 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id AB86CC433EF for ; Wed, 8 Jun 2022 13:03:09 +0000 (UTC) Received: from localhost ([::1]:38158 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nyvL2-00060n-GO for qemu-devel@archiver.kernel.org; Wed, 08 Jun 2022 09:03:08 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:58186) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nyurP-0001NA-2Y for qemu-devel@nongnu.org; Wed, 08 Jun 2022 08:32:31 -0400 Received: from mga05.intel.com ([192.55.52.43]:56952) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nyurN-0005kl-56 for qemu-devel@nongnu.org; Wed, 08 Jun 2022 08:32:30 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1654691549; x=1686227549; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=Wbr6N/45Rts/aHBvR6LbIKg6keUVNwa4C9q2Z+KpQtw=; b=alzXnTrllnanHfO/agbpk0B+Cl24JtsHlhTrS1UU/Vi3FUEIJbam7Y9a fx7Y64LR6zwLa9XvSuHJob8K+/YWtwuDafleqOCN9CQMSMl/ABx2ZQq6A bX02gJ2xKWLH4wGPiQYGgLl+rnoMgXzP/PvrsSHBmWt9O2HbrmzDVmZYQ go2i9BPzBkuPPDVZ2qWZgAg+YTsrKbZsXPeoBtvNHX5EjJfOh29JM4mQS bdwAv/O56E70m2APk5pbtCIWKMP+OJ1tR95Ehm9NC90MR1o29Jmj8NW3x dcukw6qX+JffhpA5Pyp6GLbLwlDQyYp6Ed5n0c/Jxennl+eoJrMSTpUAa w==; X-IronPort-AV: E=McAfee;i="6400,9594,10371"; a="363210177" X-IronPort-AV: E=Sophos;i="5.91,286,1647327600"; d="scan'208";a="363210177" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Jun 2022 05:31:50 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.91,286,1647327600"; d="scan'208";a="670529911" Received: from 984fee00a4c6.jf.intel.com ([10.165.58.231]) by FMSMGA003.fm.intel.com with ESMTP; 08 Jun 2022 05:31:50 -0700 From: Yi Liu To: alex.williamson@redhat.com, cohuck@redhat.com, qemu-devel@nongnu.org Cc: david@gibson.dropbear.id.au, thuth@redhat.com, farman@linux.ibm.com, mjrosato@linux.ibm.com, akrowiak@linux.ibm.com, pasic@linux.ibm.com, jjherne@linux.ibm.com, jasowang@redhat.com, kvm@vger.kernel.org, jgg@nvidia.com, nicolinc@nvidia.com, eric.auger@redhat.com, eric.auger.pro@gmail.com, kevin.tian@intel.com, yi.l.liu@intel.com, chao.p.peng@intel.com, yi.y.sun@intel.com, peterx@redhat.com, shameerali.kolothum.thodi@huawei.com, zhangfei.gao@linaro.org, berrange@redhat.com Subject: [RFC v2 15/15] vfio/as: Allow the selection of a given iommu backend Date: Wed, 8 Jun 2022 05:31:39 -0700 Message-Id: <20220608123139.19356-16-yi.l.liu@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220608123139.19356-1-yi.l.liu@intel.com> References: <20220608123139.19356-1-yi.l.liu@intel.com> MIME-Version: 1.0 Received-SPF: pass client-ip=192.55.52.43; envelope-from=yi.l.liu@intel.com; helo=mga05.intel.com X-Spam_score_int: -44 X-Spam_score: -4.5 X-Spam_bar: ---- X-Spam_report: (-4.5 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" From: Eric Auger Now we support two types of iommu backends, let's add the capability to select one of them. This depends on whether an iommufd object has been linked with the vfio-pci device: if the user wants to use the legacy backend, it shall not link the vfio-pci device with any iommufd object: -device vfio-pci,host=0000:02:00.0 This is called the legacy mode/backend. If the user wants to use the iommufd backend (/dev/iommu) it shall pass an iommufd object id in the vfio-pci device options: -object iommufd,id=iommufd0 -device vfio-pci,host=0000:02:00.0,iommufd=iommufd0 Note the /dev/iommu device may have been pre-opened by a management tool such as libvirt. This mode is no more considered for the legacy backend. So let's remove the "TODO" comment. Signed-off-by: Eric Auger Signed-off-by: Yi Liu Suggested-by: Alex Williamson --- hw/vfio/as.c | 9 ++++++--- hw/vfio/pci.c | 19 ++++++++++++++----- 2 files changed, 20 insertions(+), 8 deletions(-) diff --git a/hw/vfio/as.c b/hw/vfio/as.c index 56485f9299..e799750104 100644 --- a/hw/vfio/as.c +++ b/hw/vfio/as.c @@ -1007,6 +1007,8 @@ vfio_get_container_ops(VFIOIOMMUBackendType be) switch (be) { case VFIO_IOMMU_BACKEND_TYPE_LEGACY: return &legacy_container_ops; + case VFIO_IOMMU_BACKEND_TYPE_IOMMUFD: + return &iommufd_container_ops; default: return NULL; } @@ -1016,9 +1018,10 @@ int vfio_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp) { const VFIOContainerOps *ops; - ops = vfio_get_container_ops(VFIO_IOMMU_BACKEND_TYPE_LEGACY); - if (!ops) { - return -ENOENT; + if (vbasedev->iommufd) { + ops = vfio_get_container_ops(VFIO_IOMMU_BACKEND_TYPE_IOMMUFD); + } else { + ops = vfio_get_container_ops(VFIO_IOMMU_BACKEND_TYPE_LEGACY); } return ops->attach_device(vbasedev, as, errp); } diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index 6d10e86331..7efd3382ca 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -42,6 +42,7 @@ #include "qapi/error.h" #include "migration/blocker.h" #include "migration/qemu-file.h" +#include "sysemu/iommufd.h" #define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug" @@ -2852,6 +2853,13 @@ static void vfio_realize(PCIDevice *pdev, Error **errp) int i, ret; bool is_mdev; + if (vbasedev->iommufd) { + iommufd_backend_connect(vbasedev->iommufd, errp); + if (*errp) { + return; + } + } + if (!vbasedev->sysfsdev) { if (!(~vdev->host.domain || ~vdev->host.bus || ~vdev->host.slot || ~vdev->host.function)) { @@ -3134,6 +3142,7 @@ error: static void vfio_instance_finalize(Object *obj) { VFIOPCIDevice *vdev = VFIO_PCI(obj); + VFIODevice *vbasedev = &vdev->vbasedev; vfio_display_finalize(vdev); vfio_bars_finalize(vdev); @@ -3146,6 +3155,9 @@ static void vfio_instance_finalize(Object *obj) * * g_free(vdev->igd_opregion); */ + if (vbasedev->iommufd) { + iommufd_backend_disconnect(vbasedev->iommufd); + } vfio_put_device(vdev); } @@ -3281,11 +3293,8 @@ static Property vfio_pci_dev_properties[] = { qdev_prop_nv_gpudirect_clique, uint8_t), DEFINE_PROP_OFF_AUTO_PCIBAR("x-msix-relocation", VFIOPCIDevice, msix_relo, OFF_AUTOPCIBAR_OFF), - /* - * TODO - support passed fds... is this necessary? - * DEFINE_PROP_STRING("vfiofd", VFIOPCIDevice, vfiofd_name), - * DEFINE_PROP_STRING("vfiogroupfd, VFIOPCIDevice, vfiogroupfd_name), - */ + DEFINE_PROP_LINK("iommufd", VFIOPCIDevice, vbasedev.iommufd, + TYPE_IOMMUFD_BACKEND, IOMMUFDBackend *), DEFINE_PROP_END_OF_LIST(), };