From patchwork Thu Apr 14 10:46:53 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yi Liu X-Patchwork-Id: 12813327 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 866F0C433EF for ; Thu, 14 Apr 2022 10:48:57 +0000 (UTC) Received: from localhost ([::1]:46992 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nex1z-0004SD-UE for qemu-devel@archiver.kernel.org; Thu, 14 Apr 2022 06:48:55 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:55336) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nex0R-0001rb-PW for qemu-devel@nongnu.org; Thu, 14 Apr 2022 06:47:19 -0400 Received: from mga12.intel.com ([192.55.52.136]:34770) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nex0P-0005Ke-43 for qemu-devel@nongnu.org; Thu, 14 Apr 2022 06:47:19 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1649933237; x=1681469237; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=tSB2DyD8BeADWS4sEGxP5x5vma8/I3aEjte+rXvo4Lw=; b=oBkm3rGawgIupEsVuP6u/Ul6C7TXqqcl8OMUKKpLjRCEFymGcDo8fS3M GCZY6ZqdRopdHMCsxaTlmbxU+ssevRXOw9DDJclgm6XsdMUcxXwWE2RfW awyP1ZNJN9AxB47444vEzUer3YDxZPysKVjnpscFhoqPXNfRvaF63x+mO 4gEz4OkaxlWNT/J8oPaarejZnKtjYewGZyS6LhSnj6PEJ8QfEuDkO+Wu7 3qexCi6XlGVY/hFqTOgYn+t9ZoSqgI4IHbyDbxoI0Y2I1H+kujJy7z1Y4 LK9KUMgQ2T3thCa6KU+psc/e7XzrJmEgGmzaJG5unC82sOWneC89sSzUB Q==; X-IronPort-AV: E=McAfee;i="6400,9594,10316"; a="242836469" X-IronPort-AV: E=Sophos;i="5.90,259,1643702400"; 
d="scan'208";a="242836469" Received: from fmsmga006.fm.intel.com ([10.253.24.20]) by fmsmga106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 Apr 2022 03:47:12 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,259,1643702400"; d="scan'208";a="803091171" Received: from 984fee00a4c6.jf.intel.com ([10.165.58.231]) by fmsmga006.fm.intel.com with ESMTP; 14 Apr 2022 03:47:11 -0700 From: Yi Liu To: alex.williamson@redhat.com, cohuck@redhat.com, qemu-devel@nongnu.org Subject: [RFC 01/18] scripts/update-linux-headers: Add iommufd.h Date: Thu, 14 Apr 2022 03:46:53 -0700 Message-Id: <20220414104710.28534-2-yi.l.liu@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220414104710.28534-1-yi.l.liu@intel.com> References: <20220414104710.28534-1-yi.l.liu@intel.com> MIME-Version: 1.0 Received-SPF: pass client-ip=192.55.52.136; envelope-from=yi.l.liu@intel.com; helo=mga12.intel.com X-Spam_score_int: -44 X-Spam_score: -4.5 X-Spam_bar: ---- X-Spam_report: (-4.5 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: akrowiak@linux.ibm.com, jjherne@linux.ibm.com, thuth@redhat.com, yi.l.liu@intel.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com, jasowang@redhat.com, farman@linux.ibm.com, peterx@redhat.com, pasic@linux.ibm.com, eric.auger@redhat.com, yi.y.sun@intel.com, chao.p.peng@intel.com, nicolinc@nvidia.com, kevin.tian@intel.com, jgg@nvidia.com, eric.auger.pro@gmail.com, david@gibson.dropbear.id.au Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" From: Eric Auger Update the script to import iommufd.h Signed-off-by: Eric Auger Signed-off-by: Yi 
Liu --- scripts/update-linux-headers.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/scripts/update-linux-headers.sh b/scripts/update-linux-headers.sh index 839a5ec614..a89b83e6d6 100755 --- a/scripts/update-linux-headers.sh +++ b/scripts/update-linux-headers.sh @@ -160,7 +160,7 @@ done rm -rf "$output/linux-headers/linux" mkdir -p "$output/linux-headers/linux" -for header in kvm.h vfio.h vfio_ccw.h vfio_zdev.h vhost.h \ +for header in kvm.h vfio.h iommufd.h vfio_ccw.h vfio_zdev.h vhost.h \ psci.h psp-sev.h userfaultfd.h mman.h; do cp "$tmpdir/include/linux/$header" "$output/linux-headers/linux" done From patchwork Thu Apr 14 10:46:54 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yi Liu X-Patchwork-Id: 12813333 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 09B31C433EF for ; Thu, 14 Apr 2022 10:51:58 +0000 (UTC) Received: from localhost ([::1]:55898 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nex4v-0001wt-3g for qemu-devel@archiver.kernel.org; Thu, 14 Apr 2022 06:51:57 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:55380) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nex0T-0001rl-1Y for qemu-devel@nongnu.org; Thu, 14 Apr 2022 06:47:21 -0400 Received: from mga12.intel.com ([192.55.52.136]:34772) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nex0P-0005Kn-Cp for qemu-devel@nongnu.org; Thu, 14 Apr 2022 06:47:20 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1649933237; 
x=1681469237; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=9bASU3PcIsP6cXPBsHkUwGLzWqEnhVxW+X+6cFAlEpo=; b=PGAoghB8uIq+u0uNLFKAh+S7khSgLWZxC/XMtfUns6e/v6fOssatvYSe HhIJjaXkK03Gupq73p5W50SFEhKa7sPVjjzA+5C/tf5zBpbAgqE8L9b9J 8O4RixVWgZq8kxj4LqoZFNTUkOBUtJg6Lff6C1cfDrY9aRo1aPa3Gdl22 vU9qQUk4E7qwdDCx24iCToIKMgA1gHSKQAFJ1K6otIuRKswZkqEtDpmPs hWfpS/eSumFZnFuDRuzbz/RXjqedm6dZA0yIp8aK13c0Tjcm7U5NcjUEI 9A/OtXoKbGGJPz4evilBYZAecUSsbq+/G8Cb78ICEg9DDUftcD53qzfuk Q==; X-IronPort-AV: E=McAfee;i="6400,9594,10316"; a="242836471" X-IronPort-AV: E=Sophos;i="5.90,259,1643702400"; d="scan'208";a="242836471" Received: from fmsmga006.fm.intel.com ([10.253.24.20]) by fmsmga106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 Apr 2022 03:47:12 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,259,1643702400"; d="scan'208";a="803091179" Received: from 984fee00a4c6.jf.intel.com ([10.165.58.231]) by fmsmga006.fm.intel.com with ESMTP; 14 Apr 2022 03:47:11 -0700 From: Yi Liu To: alex.williamson@redhat.com, cohuck@redhat.com, qemu-devel@nongnu.org Subject: [RFC 02/18] linux-headers: Import latest vfio.h and iommufd.h Date: Thu, 14 Apr 2022 03:46:54 -0700 Message-Id: <20220414104710.28534-3-yi.l.liu@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220414104710.28534-1-yi.l.liu@intel.com> References: <20220414104710.28534-1-yi.l.liu@intel.com> MIME-Version: 1.0 Received-SPF: pass client-ip=192.55.52.136; envelope-from=yi.l.liu@intel.com; helo=mga12.intel.com X-Spam_score_int: -44 X-Spam_score: -4.5 X-Spam_bar: ---- X-Spam_report: (-4.5 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , 
List-Archive: List-Post: List-Help: List-Subscribe: , Cc: akrowiak@linux.ibm.com, jjherne@linux.ibm.com, thuth@redhat.com, yi.l.liu@intel.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com, jasowang@redhat.com, farman@linux.ibm.com, peterx@redhat.com, pasic@linux.ibm.com, eric.auger@redhat.com, yi.y.sun@intel.com, chao.p.peng@intel.com, nicolinc@nvidia.com, kevin.tian@intel.com, jgg@nvidia.com, eric.auger.pro@gmail.com, david@gibson.dropbear.id.au Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" From: Eric Auger Imported from https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6 Signed-off-by: Eric Auger Signed-off-by: Yi Liu --- linux-headers/linux/iommufd.h | 223 ++++++++++++++++++++++++++++++++++ linux-headers/linux/vfio.h | 84 +++++++++++++ 2 files changed, 307 insertions(+) create mode 100644 linux-headers/linux/iommufd.h diff --git a/linux-headers/linux/iommufd.h b/linux-headers/linux/iommufd.h new file mode 100644 index 0000000000..6c3cd9e259 --- /dev/null +++ b/linux-headers/linux/iommufd.h @@ -0,0 +1,223 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. + */ +#ifndef _IOMMUFD_H +#define _IOMMUFD_H + +#include +#include + +#define IOMMUFD_TYPE (';') + +/** + * DOC: General ioctl format + * + * The ioctl mechanism follows a general format to allow for extensibility. Each + * ioctl is passed in a structure pointer as the argument providing the size of + * the structure in the first u32. The kernel checks that any structure space + * beyond what it understands is 0. This allows userspace to use the backward + * compatible portion while consistently using the newer, larger, structures. + * + * ioctls use a standard meaning for common errnos: + * + * - ENOTTY: The IOCTL number itself is not supported at all + * - E2BIG: The IOCTL number is supported, but the provided structure has + * non-zero in a part the kernel does not understand.
+ * - EOPNOTSUPP: The IOCTL number is supported, and the structure is + * understood, however a known field has a value the kernel does not + * understand or support. + * - EINVAL: Everything about the IOCTL was understood, but a field is not + * correct. + * - ENOENT: An ID or IOVA provided does not exist. + * - ENOMEM: Out of memory. + * - EOVERFLOW: Mathematics overflowed. + * + * As well as additional errnos within specific ioctls. + */ +enum { + IOMMUFD_CMD_BASE = 0x80, + IOMMUFD_CMD_DESTROY = IOMMUFD_CMD_BASE, + IOMMUFD_CMD_IOAS_ALLOC, + IOMMUFD_CMD_IOAS_IOVA_RANGES, + IOMMUFD_CMD_IOAS_MAP, + IOMMUFD_CMD_IOAS_COPY, + IOMMUFD_CMD_IOAS_UNMAP, + IOMMUFD_CMD_VFIO_IOAS, +}; + +/** + * struct iommu_destroy - ioctl(IOMMU_DESTROY) + * @size: sizeof(struct iommu_destroy) + * @id: iommufd object ID to destroy. Can be any destroyable object type. + * + * Destroy any object held within iommufd. + */ +struct iommu_destroy { + __u32 size; + __u32 id; +}; +#define IOMMU_DESTROY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_DESTROY) + +/** + * struct iommu_ioas_alloc - ioctl(IOMMU_IOAS_ALLOC) + * @size: sizeof(struct iommu_ioas_alloc) + * @flags: Must be 0 + * @out_ioas_id: Output IOAS ID for the allocated object + * + * Allocate an IO Address Space (IOAS) which holds an IO Virtual Address (IOVA) + * to memory mapping. + */ +struct iommu_ioas_alloc { + __u32 size; + __u32 flags; + __u32 out_ioas_id; +}; +#define IOMMU_IOAS_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_ALLOC) + +/** + * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES) + * @size: sizeof(struct iommu_ioas_iova_ranges) + * @ioas_id: IOAS ID to read ranges from + * @out_num_iovas: Output total number of ranges in the IOAS + * @__reserved: Must be 0 + * @out_valid_iovas: Array of valid IOVA ranges. The array length is the smaller + * of out_num_iovas or the length implied by size.
+ * @out_valid_iovas.start: First IOVA in the allowed range + * @out_valid_iovas.last: Inclusive last IOVA in the allowed range + * + * Query an IOAS for ranges of allowed IOVAs. Operation outside these ranges is + * not allowed. out_num_iovas will be set to the total number of iovas + * and the out_valid_iovas[] will be filled in as space permits. + * size should include the allocated flex array. + */ +struct iommu_ioas_iova_ranges { + __u32 size; + __u32 ioas_id; + __u32 out_num_iovas; + __u32 __reserved; + struct iommu_valid_iovas { + __aligned_u64 start; + __aligned_u64 last; + } out_valid_iovas[]; +}; +#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_IOVA_RANGES) + +/** + * enum iommufd_ioas_map_flags - Flags for map and copy + * @IOMMU_IOAS_MAP_FIXED_IOVA: If clear the kernel will compute an appropriate + * IOVA to place the mapping at + * @IOMMU_IOAS_MAP_WRITEABLE: DMA is allowed to write to this mapping + * @IOMMU_IOAS_MAP_READABLE: DMA is allowed to read from this mapping + */ +enum iommufd_ioas_map_flags { + IOMMU_IOAS_MAP_FIXED_IOVA = 1 << 0, + IOMMU_IOAS_MAP_WRITEABLE = 1 << 1, + IOMMU_IOAS_MAP_READABLE = 1 << 2, +}; + +/** + * struct iommu_ioas_map - ioctl(IOMMU_IOAS_MAP) + * @size: sizeof(struct iommu_ioas_map) + * @flags: Combination of enum iommufd_ioas_map_flags + * @ioas_id: IOAS ID to change the mapping of + * @__reserved: Must be 0 + * @user_va: Userspace pointer to start mapping from + * @length: Number of bytes to map + * @iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is set + * then this must be provided as input. + * + * Set an IOVA mapping from a user pointer. If FIXED_IOVA is specified then the + * mapping will be established at iova, otherwise a suitable location will be + * automatically selected and returned in iova. 
+ */ +struct iommu_ioas_map { + __u32 size; + __u32 flags; + __u32 ioas_id; + __u32 __reserved; + __aligned_u64 user_va; + __aligned_u64 length; + __aligned_u64 iova; +}; +#define IOMMU_IOAS_MAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_MAP) + +/** + * struct iommu_ioas_copy - ioctl(IOMMU_IOAS_COPY) + * @size: sizeof(struct iommu_ioas_copy) + * @flags: Combination of enum iommufd_ioas_map_flags + * @dst_ioas_id: IOAS ID to change the mapping of + * @src_ioas_id: IOAS ID to copy from + * @length: Number of bytes to copy and map + * @dst_iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is + * set then this must be provided as input. + * @src_iova: IOVA to start the copy + * + * Copy an already existing mapping from src_ioas_id and establish it in + * dst_ioas_id. The src iova/length must exactly match a range used with + * IOMMU_IOAS_MAP. + */ +struct iommu_ioas_copy { + __u32 size; + __u32 flags; + __u32 dst_ioas_id; + __u32 src_ioas_id; + __aligned_u64 length; + __aligned_u64 dst_iova; + __aligned_u64 src_iova; +}; +#define IOMMU_IOAS_COPY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_COPY) + +/** + * struct iommu_ioas_unmap - ioctl(IOMMU_IOAS_UNMAP) + * @size: sizeof(struct iommu_ioas_unmap) + * @ioas_id: IOAS ID to change the mapping of + * @iova: IOVA to start the unmapping at + * @length: Number of bytes to unmap + * + * Unmap an IOVA range. The iova/length must exactly match a range + * used with IOMMU_IOAS_MAP, or be the values 0 & U64_MAX. + * In the latter case all IOVAs will be unmapped.
+ */ +struct iommu_ioas_unmap { + __u32 size; + __u32 ioas_id; + __aligned_u64 iova; + __aligned_u64 length; +}; +#define IOMMU_IOAS_UNMAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_UNMAP) + +/** + * enum iommufd_vfio_ioas_op + * @IOMMU_VFIO_IOAS_GET: Get the current compatibility IOAS + * @IOMMU_VFIO_IOAS_SET: Change the current compatibility IOAS + * @IOMMU_VFIO_IOAS_CLEAR: Disable VFIO compatibility + */ +enum iommufd_vfio_ioas_op { + IOMMU_VFIO_IOAS_GET = 0, + IOMMU_VFIO_IOAS_SET = 1, + IOMMU_VFIO_IOAS_CLEAR = 2, +}; + +/** + * struct iommu_vfio_ioas - ioctl(IOMMU_VFIO_IOAS) + * @size: sizeof(struct iommu_vfio_ioas) + * @ioas_id: For IOMMU_VFIO_IOAS_SET the input IOAS ID to set + * For IOMMU_VFIO_IOAS_GET will output the IOAS ID + * @op: One of enum iommufd_vfio_ioas_op + * @__reserved: Must be 0 + * + * The VFIO compatibility support uses a single ioas because VFIO APIs do not + * support the ID field. Set or Get the IOAS that VFIO compatibility will use. + * When VFIO_GROUP_SET_CONTAINER is used on an iommufd it will get the + * compatibility ioas, either by taking what is already set, or auto creating + * one. From then on VFIO will continue to use that ioas and is not affected by + * this ioctl. SET or CLEAR does not destroy any auto-created IOAS. + */ +struct iommu_vfio_ioas { + __u32 size; + __u32 ioas_id; + __u16 op; + __u16 __reserved; +}; +#define IOMMU_VFIO_IOAS _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VFIO_IOAS) +#endif diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h index e680594f27..0e7b1159ca 100644 --- a/linux-headers/linux/vfio.h +++ b/linux-headers/linux/vfio.h @@ -190,6 +190,90 @@ struct vfio_group_status { /* --------------- IOCTLs for DEVICE file descriptors --------------- */ +/* + * VFIO_DEVICE_BIND_IOMMUFD - _IOR(VFIO_TYPE, VFIO_BASE + 19, + * struct vfio_device_bind_iommufd) + * + * Bind a vfio_device to the specified iommufd + * + * The user should provide a device cookie when calling this ioctl.
The + * cookie is carried only in event e.g. I/O fault reported to userspace + * via iommufd. The user should use devid returned by this ioctl to mark + * the target device in other ioctls (e.g. capability query via iommufd). + * + * User is not allowed to access the device before the binding operation + * is completed. + * + * Unbind is automatically conducted when device fd is closed. + * + * Input parameters: + * - iommufd; + * - dev_cookie; + * + * Output parameters: + * - devid; + * + * Return: 0 on success, -errno on failure. + */ +struct vfio_device_bind_iommufd { + __u32 argsz; + __u32 flags; + __aligned_u64 dev_cookie; + __s32 iommufd; + __u32 out_devid; +}; + +#define VFIO_DEVICE_BIND_IOMMUFD _IO(VFIO_TYPE, VFIO_BASE + 19) + +/* + * VFIO_DEVICE_ATTACH_IOAS - _IOW(VFIO_TYPE, VFIO_BASE + 20, + * struct vfio_device_attach_ioas) + * + * Attach a vfio device to the specified IOAS. + * + * Multiple vfio devices can be attached to the same IOAS Page Table. One + * device can be attached to only one ioas at this point. + * + * @argsz: user filled size of this data. + * @flags: reserved for future extension. + * @iommufd: iommufd where the ioas comes from. + * @ioas_id: Input the target I/O address space page table. + * @hwpt_id: Output the hw page table id + * + * Return: 0 on success, -errno on failure. + */ +struct vfio_device_attach_ioas { + __u32 argsz; + __u32 flags; + __s32 iommufd; + __u32 ioas_id; + __u32 out_hwpt_id; +}; + +#define VFIO_DEVICE_ATTACH_IOAS _IO(VFIO_TYPE, VFIO_BASE + 20) + +/* + * VFIO_DEVICE_DETACH_IOAS - _IOW(VFIO_TYPE, VFIO_BASE + 21, + * struct vfio_device_detach_ioas) + * + * Detach a vfio device from the specified IOAS. + * + * @argsz: user filled size of this data. + * @flags: reserved for future extension. + * @iommufd: iommufd where the ioas comes from. + * @ioas_id: Input the target I/O address space page table. + * + * Return: 0 on success, -errno on failure.
+ */ +struct vfio_device_detach_ioas { + __u32 argsz; + __u32 flags; + __s32 iommufd; + __u32 ioas_id; +}; + +#define VFIO_DEVICE_DETACH_IOAS _IO(VFIO_TYPE, VFIO_BASE + 21) + /** * VFIO_DEVICE_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 7, * struct vfio_device_info) From patchwork Thu Apr 14 10:46:55 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yi Liu X-Patchwork-Id: 12813328 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 0B0E0C433F5 for ; Thu, 14 Apr 2022 10:48:58 +0000 (UTC) Received: from localhost ([::1]:47166 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nex22-0004YX-2B for qemu-devel@archiver.kernel.org; Thu, 14 Apr 2022 06:48:58 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:55388) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nex0T-0001sE-FP for qemu-devel@nongnu.org; Thu, 14 Apr 2022 06:47:21 -0400 Received: from mga12.intel.com ([192.55.52.136]:34770) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nex0R-0005Ke-Mh for qemu-devel@nongnu.org; Thu, 14 Apr 2022 06:47:21 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1649933239; x=1681469239; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=G8R8kN5Bu9OPLz5yK28BftrdZsFf6rfmLlwn/OnqjUE=; b=L2mAz/oabLh7//DYEjbUgARxBSjm0TTnhu+THliP8NeUqNo9Mf8mv5Ln HxmJOJfkYOdvQ7PleOxHma7+ucq7cbtuGT6y/fjyJe8BLmYrRj2DP+8+B GvJudE8JG4zZV/rDeLHzbmA5RKizfbHgG6SJx8OL+4n3CQ9VKsf+R594S 
h14Qppr7j50+t7B6of0APIXEn/46ef8Y9dVSIZPqrqOJEW6vresxu74zF NiovDBHbncMcohvVXubNjAiO41Q99B/vaYSEe06wz+36xlbwkj3gbLnu/ UhBU6J7W6ar1S4nd6CpNtBg/hQEbts+4gHyJoToNE1WzihNeCmQtMoPRk A==; X-IronPort-AV: E=McAfee;i="6400,9594,10316"; a="242836473" X-IronPort-AV: E=Sophos;i="5.90,259,1643702400"; d="scan'208";a="242836473" Received: from fmsmga006.fm.intel.com ([10.253.24.20]) by fmsmga106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 Apr 2022 03:47:13 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,259,1643702400"; d="scan'208";a="803091185" Received: from 984fee00a4c6.jf.intel.com ([10.165.58.231]) by fmsmga006.fm.intel.com with ESMTP; 14 Apr 2022 03:47:12 -0700 From: Yi Liu To: alex.williamson@redhat.com, cohuck@redhat.com, qemu-devel@nongnu.org Subject: [RFC 03/18] hw/vfio/pci: fix vfio_pci_hot_reset_result trace point Date: Thu, 14 Apr 2022 03:46:55 -0700 Message-Id: <20220414104710.28534-4-yi.l.liu@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220414104710.28534-1-yi.l.liu@intel.com> References: <20220414104710.28534-1-yi.l.liu@intel.com> MIME-Version: 1.0 Received-SPF: pass client-ip=192.55.52.136; envelope-from=yi.l.liu@intel.com; helo=mga12.intel.com X-Spam_score_int: -44 X-Spam_score: -4.5 X-Spam_bar: ---- X-Spam_report: (-4.5 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: akrowiak@linux.ibm.com, jjherne@linux.ibm.com, thuth@redhat.com, yi.l.liu@intel.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com, jasowang@redhat.com, farman@linux.ibm.com, peterx@redhat.com, pasic@linux.ibm.com, eric.auger@redhat.com, yi.y.sun@intel.com, chao.p.peng@intel.com, 
nicolinc@nvidia.com, kevin.tian@intel.com, jgg@nvidia.com, eric.auger.pro@gmail.com, david@gibson.dropbear.id.au Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" From: Eric Auger Properly output the errno string. Signed-off-by: Eric Auger Signed-off-by: Yi Liu --- hw/vfio/pci.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index 67a183f17b..e26e65bb1f 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -2337,7 +2337,7 @@ static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool single) g_free(reset); trace_vfio_pci_hot_reset_result(vdev->vbasedev.name, - ret ? "%m" : "Success"); + ret ? strerror(errno) : "Success"); out: /* Re-enable INTx on affected devices */ From patchwork Thu Apr 14 10:46:56 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yi Liu X-Patchwork-Id: 12813332 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 563E0C433F5 for ; Thu, 14 Apr 2022 10:51:47 +0000 (UTC) Received: from localhost ([::1]:55706 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nex4k-0001nk-G9 for qemu-devel@archiver.kernel.org; Thu, 14 Apr 2022 06:51:46 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:55418) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nex0V-0001uW-4R for qemu-devel@nongnu.org; Thu, 14 Apr 2022 06:47:23 -0400 Received: from mga12.intel.com ([192.55.52.136]:34768) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nex0S-0005Ka-Pu for qemu-devel@nongnu.org; Thu, 14 Apr 2022 
06:47:22 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1649933240; x=1681469240; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=YuSxzN1sk1kFCryZVtMQM+rLrQtwU641Y1zSVCUmBuE=; b=iGYuiXNFZ1j1XXBK1+PMn5CA/qns0xh1Q0uLI6YMKapX9bf9hDScIlcw WatsAERprcQfMIdT8cYhLSjoIMLhlM869lae8xtHE3+gBKsT7ti5Kw1qL iFpihgSptJ0l0luit14dSckgKn4xq3sRXNY3IEKquMETfH2uiQb4smxP+ NOZ/E6yND1yLTUXk35TDkMTRMuPmukaluLb2Z4Zs0hHHIq/Q391+2hK1/ r8Ig8+zzrJTAMWqjgShdflRNe58Wl7WCdM9sKxEYfMox1B1F8f/G8U17t pnzsoMdt2JzsiPaUYyONPc6x/7MUyAzzNJi9M3CumZ8TMDu3CzMLzJYTN g==; X-IronPort-AV: E=McAfee;i="6400,9594,10316"; a="242836475" X-IronPort-AV: E=Sophos;i="5.90,259,1643702400"; d="scan'208";a="242836475" Received: from fmsmga006.fm.intel.com ([10.253.24.20]) by fmsmga106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 Apr 2022 03:47:14 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,259,1643702400"; d="scan'208";a="803091189" Received: from 984fee00a4c6.jf.intel.com ([10.165.58.231]) by fmsmga006.fm.intel.com with ESMTP; 14 Apr 2022 03:47:13 -0700 From: Yi Liu To: alex.williamson@redhat.com, cohuck@redhat.com, qemu-devel@nongnu.org Subject: [RFC 04/18] vfio/pci: Use vbasedev local variable in vfio_realize() Date: Thu, 14 Apr 2022 03:46:56 -0700 Message-Id: <20220414104710.28534-5-yi.l.liu@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220414104710.28534-1-yi.l.liu@intel.com> References: <20220414104710.28534-1-yi.l.liu@intel.com> MIME-Version: 1.0 Received-SPF: pass client-ip=192.55.52.136; envelope-from=yi.l.liu@intel.com; helo=mga12.intel.com X-Spam_score_int: -44 X-Spam_score: -4.5 X-Spam_bar: ---- X-Spam_report: (-4.5 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no 
X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: akrowiak@linux.ibm.com, jjherne@linux.ibm.com, thuth@redhat.com, yi.l.liu@intel.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com, jasowang@redhat.com, farman@linux.ibm.com, peterx@redhat.com, pasic@linux.ibm.com, eric.auger@redhat.com, yi.y.sun@intel.com, chao.p.peng@intel.com, nicolinc@nvidia.com, kevin.tian@intel.com, jgg@nvidia.com, eric.auger.pro@gmail.com, david@gibson.dropbear.id.au Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" From: Eric Auger Use a VFIODevice local variable to improve code readability. No functional change intended. Signed-off-by: Eric Auger Signed-off-by: Yi Liu --- hw/vfio/pci.c | 49 +++++++++++++++++++++++++------------------------ 1 file changed, 25 insertions(+), 24 deletions(-) diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index e26e65bb1f..e707329394 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -2803,6 +2803,7 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev) static void vfio_realize(PCIDevice *pdev, Error **errp) { VFIOPCIDevice *vdev = VFIO_PCI(pdev); + VFIODevice *vbasedev = &vdev->vbasedev; VFIODevice *vbasedev_iter; VFIOGroup *group; char *tmp, *subsys, group_path[PATH_MAX], *group_name; @@ -2813,7 +2814,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp) int i, ret; bool is_mdev; - if (!vdev->vbasedev.sysfsdev) { + if (!vbasedev->sysfsdev) { if (!(~vdev->host.domain || ~vdev->host.bus || ~vdev->host.slot || ~vdev->host.function)) { error_setg(errp, "No provided host device"); @@ -2821,24 +2822,24 @@ static void vfio_realize(PCIDevice *pdev, Error **errp) "or -device vfio-pci,sysfsdev=PATH_TO_DEVICE\n"); return; } - vdev->vbasedev.sysfsdev = + vbasedev->sysfsdev = g_strdup_printf("/sys/bus/pci/devices/%04x:%02x:%02x.%01x", vdev->host.domain, vdev->host.bus,
vdev->host.slot, vdev->host.function); } - if (stat(vdev->vbasedev.sysfsdev, &st) < 0) { + if (stat(vbasedev->sysfsdev, &st) < 0) { error_setg_errno(errp, errno, "no such host device"); - error_prepend(errp, VFIO_MSG_PREFIX, vdev->vbasedev.sysfsdev); + error_prepend(errp, VFIO_MSG_PREFIX, vbasedev->sysfsdev); return; } - vdev->vbasedev.name = g_path_get_basename(vdev->vbasedev.sysfsdev); - vdev->vbasedev.ops = &vfio_pci_ops; - vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI; - vdev->vbasedev.dev = DEVICE(vdev); + vbasedev->name = g_path_get_basename(vbasedev->sysfsdev); + vbasedev->ops = &vfio_pci_ops; + vbasedev->type = VFIO_DEVICE_TYPE_PCI; + vbasedev->dev = DEVICE(vdev); - tmp = g_strdup_printf("%s/iommu_group", vdev->vbasedev.sysfsdev); + tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev); len = readlink(tmp, group_path, sizeof(group_path)); g_free(tmp); @@ -2856,7 +2857,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp) goto error; } - trace_vfio_realize(vdev->vbasedev.name, groupid); + trace_vfio_realize(vbasedev->name, groupid); group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev), errp); if (!group) { @@ -2864,7 +2865,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp) } QLIST_FOREACH(vbasedev_iter, &group->device_list, next) { - if (strcmp(vbasedev_iter->name, vdev->vbasedev.name) == 0) { + if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) { error_setg(errp, "device is already attached"); vfio_put_group(group); goto error; @@ -2877,22 +2878,22 @@ static void vfio_realize(PCIDevice *pdev, Error **errp) * stays in sync with the active working set of the guest driver. Prevent * the x-balloon-allowed option unless this is minimally an mdev device. 
*/ - tmp = g_strdup_printf("%s/subsystem", vdev->vbasedev.sysfsdev); + tmp = g_strdup_printf("%s/subsystem", vbasedev->sysfsdev); subsys = realpath(tmp, NULL); g_free(tmp); is_mdev = subsys && (strcmp(subsys, "/sys/bus/mdev") == 0); free(subsys); - trace_vfio_mdev(vdev->vbasedev.name, is_mdev); + trace_vfio_mdev(vbasedev->name, is_mdev); - if (vdev->vbasedev.ram_block_discard_allowed && !is_mdev) { + if (vbasedev->ram_block_discard_allowed && !is_mdev) { error_setg(errp, "x-balloon-allowed only potentially compatible " "with mdev devices"); vfio_put_group(group); goto error; } - ret = vfio_get_device(group, vdev->vbasedev.name, &vdev->vbasedev, errp); + ret = vfio_get_device(group, vbasedev->name, vbasedev, errp); if (ret) { vfio_put_group(group); goto error; @@ -2905,7 +2906,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp) } /* Get a copy of config space */ - ret = pread(vdev->vbasedev.fd, vdev->pdev.config, + ret = pread(vbasedev->fd, vdev->pdev.config, MIN(pci_config_size(&vdev->pdev), vdev->config_size), vdev->config_offset); if (ret < (int)MIN(pci_config_size(&vdev->pdev), vdev->config_size)) { @@ -2933,7 +2934,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp) goto error; } vfio_add_emulated_word(vdev, PCI_VENDOR_ID, vdev->vendor_id, ~0); - trace_vfio_pci_emulated_vendor_id(vdev->vbasedev.name, vdev->vendor_id); + trace_vfio_pci_emulated_vendor_id(vbasedev->name, vdev->vendor_id); } else { vdev->vendor_id = pci_get_word(pdev->config + PCI_VENDOR_ID); } @@ -2944,7 +2945,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp) goto error; } vfio_add_emulated_word(vdev, PCI_DEVICE_ID, vdev->device_id, ~0); - trace_vfio_pci_emulated_device_id(vdev->vbasedev.name, vdev->device_id); + trace_vfio_pci_emulated_device_id(vbasedev->name, vdev->device_id); } else { vdev->device_id = pci_get_word(pdev->config + PCI_DEVICE_ID); } @@ -2956,7 +2957,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp) } vfio_add_emulated_word(vdev, 
PCI_SUBSYSTEM_VENDOR_ID, vdev->sub_vendor_id, ~0); - trace_vfio_pci_emulated_sub_vendor_id(vdev->vbasedev.name, + trace_vfio_pci_emulated_sub_vendor_id(vbasedev->name, vdev->sub_vendor_id); } @@ -2966,7 +2967,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp) goto error; } vfio_add_emulated_word(vdev, PCI_SUBSYSTEM_ID, vdev->sub_device_id, ~0); - trace_vfio_pci_emulated_sub_device_id(vdev->vbasedev.name, + trace_vfio_pci_emulated_sub_device_id(vbasedev->name, vdev->sub_device_id); } @@ -3025,7 +3026,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp) goto out_teardown; } - ret = vfio_get_dev_region_info(&vdev->vbasedev, + ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_PCI_VENDOR_TYPE | PCI_VENDOR_ID_INTEL, VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION, &opregion); if (ret) { @@ -3101,9 +3102,9 @@ static void vfio_realize(PCIDevice *pdev, Error **errp) } if (!pdev->failover_pair_id) { - ret = vfio_migration_probe(&vdev->vbasedev, errp); + ret = vfio_migration_probe(vbasedev, errp); if (ret) { - error_report("%s: Migration disabled", vdev->vbasedev.name); + error_report("%s: Migration disabled", vbasedev->name); } } @@ -3120,7 +3121,7 @@ out_teardown: vfio_teardown_msi(vdev); vfio_bars_exit(vdev); error: - error_prepend(errp, VFIO_MSG_PREFIX, vdev->vbasedev.name); + error_prepend(errp, VFIO_MSG_PREFIX, vbasedev->name); } static void vfio_instance_finalize(Object *obj) From patchwork Thu Apr 14 10:46:57 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yi Liu X-Patchwork-Id: 12813331 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 082F5C433F5 for ; Thu, 14 Apr 2022 10:51:24 +0000 (UTC) Received: 
from localhost ([::1]:55486 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nex4M-0001f7-SC for qemu-devel@archiver.kernel.org; Thu, 14 Apr 2022 06:51:23 -0400 From: Yi Liu To: alex.williamson@redhat.com, cohuck@redhat.com, qemu-devel@nongnu.org Subject: [RFC 05/18] vfio/common: Rename VFIOGuestIOMMU::iommu into ::iommu_mr Date: Thu, 14 Apr 2022 03:46:57
-0700 Message-Id: <20220414104710.28534-6-yi.l.liu@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220414104710.28534-1-yi.l.liu@intel.com> References: <20220414104710.28534-1-yi.l.liu@intel.com> MIME-Version: 1.0 Received-SPF: pass client-ip=192.55.52.136; envelope-from=yi.l.liu@intel.com; helo=mga12.intel.com X-Spam_score_int: -44 X-Spam_score: -4.5 X-Spam_bar: ---- X-Spam_report: (-4.5 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: akrowiak@linux.ibm.com, jjherne@linux.ibm.com, thuth@redhat.com, yi.l.liu@intel.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com, jasowang@redhat.com, farman@linux.ibm.com, peterx@redhat.com, pasic@linux.ibm.com, eric.auger@redhat.com, yi.y.sun@intel.com, chao.p.peng@intel.com, nicolinc@nvidia.com, kevin.tian@intel.com, jgg@nvidia.com, eric.auger.pro@gmail.com, david@gibson.dropbear.id.au Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel"

Rename the VFIOGuestIOMMU iommu field to iommu_mr, which makes it
clearer that the field is an IOMMU memory region.

No functional change intended.

Signed-off-by: Yi Liu
---
 hw/vfio/common.c              | 16 ++++++++--------
 include/hw/vfio/vfio-common.h |  2 +-
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 080046e3f5..b05f68b5c7 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -992,7 +992,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
          * device emulation the VFIO iommu handles to use).
*/ giommu = g_malloc0(sizeof(*giommu)); - giommu->iommu = iommu_mr; + giommu->iommu_mr = iommu_mr; giommu->iommu_offset = section->offset_within_address_space - section->offset_within_region; giommu->container = container; @@ -1007,7 +1007,7 @@ static void vfio_listener_region_add(MemoryListener *listener, int128_get64(llend), iommu_idx); - ret = memory_region_iommu_set_page_size_mask(giommu->iommu, + ret = memory_region_iommu_set_page_size_mask(giommu->iommu_mr, container->pgsizes, &err); if (ret) { @@ -1022,7 +1022,7 @@ static void vfio_listener_region_add(MemoryListener *listener, goto fail; } QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next); - memory_region_iommu_replay(giommu->iommu, &giommu->n); + memory_region_iommu_replay(giommu->iommu_mr, &giommu->n); return; } @@ -1128,7 +1128,7 @@ static void vfio_listener_region_del(MemoryListener *listener, VFIOGuestIOMMU *giommu; QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) { - if (MEMORY_REGION(giommu->iommu) == section->mr && + if (MEMORY_REGION(giommu->iommu_mr) == section->mr && giommu->n.start == section->offset_within_region) { memory_region_unregister_iommu_notifier(section->mr, &giommu->n); @@ -1393,11 +1393,11 @@ static int vfio_sync_dirty_bitmap(VFIOContainer *container, VFIOGuestIOMMU *giommu; QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) { - if (MEMORY_REGION(giommu->iommu) == section->mr && + if (MEMORY_REGION(giommu->iommu_mr) == section->mr && giommu->n.start == section->offset_within_region) { Int128 llend; vfio_giommu_dirty_notifier gdn = { .giommu = giommu }; - int idx = memory_region_iommu_attrs_to_index(giommu->iommu, + int idx = memory_region_iommu_attrs_to_index(giommu->iommu_mr, MEMTXATTRS_UNSPECIFIED); llend = int128_add(int128_make64(section->offset_within_region), @@ -1410,7 +1410,7 @@ static int vfio_sync_dirty_bitmap(VFIOContainer *container, section->offset_within_region, int128_get64(llend), idx); - memory_region_iommu_replay(giommu->iommu, 
&gdn.n); + memory_region_iommu_replay(giommu->iommu_mr, &gdn.n); break; } } @@ -2246,7 +2246,7 @@ static void vfio_disconnect_container(VFIOGroup *group) QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) { memory_region_unregister_iommu_notifier( - MEMORY_REGION(giommu->iommu), &giommu->n); + MEMORY_REGION(giommu->iommu_mr), &giommu->n); QLIST_REMOVE(giommu, giommu_next); g_free(giommu); } diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index 8af11b0a76..e573f5a9f1 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -98,7 +98,7 @@ typedef struct VFIOContainer { typedef struct VFIOGuestIOMMU { VFIOContainer *container; - IOMMUMemoryRegion *iommu; + IOMMUMemoryRegion *iommu_mr; hwaddr iommu_offset; IOMMUNotifier n; QLIST_ENTRY(VFIOGuestIOMMU) giommu_next; From patchwork Thu Apr 14 10:46:58 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yi Liu X-Patchwork-Id: 12813347 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 8D162C433F5 for ; Thu, 14 Apr 2022 10:54:51 +0000 (UTC) Received: from localhost ([::1]:36460 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nex7i-0007qL-Ll for qemu-devel@archiver.kernel.org; Thu, 14 Apr 2022 06:54:50 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:55458) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nex0Y-00026l-Nl for qemu-devel@nongnu.org; Thu, 14 Apr 2022 06:47:26 -0400 Received: from mga12.intel.com ([192.55.52.136]:34770) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) 
(envelope-from ) id 1nex0T-0005Ke-PW for qemu-devel@nongnu.org; Thu, 14 Apr 2022 06:47:26 -0400 From: Yi Liu To: alex.williamson@redhat.com, cohuck@redhat.com, qemu-devel@nongnu.org Subject: [RFC 06/18] vfio/common: Split common.c into common.c, container.c and as.c Date: Thu, 14 Apr 2022 03:46:58 -0700 Message-Id: <20220414104710.28534-7-yi.l.liu@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220414104710.28534-1-yi.l.liu@intel.com> References: <20220414104710.28534-1-yi.l.liu@intel.com> MIME-Version: 1.0 Received-SPF: pass client-ip=192.55.52.136; envelope-from=yi.l.liu@intel.com; helo=mga12.intel.com X-Spam_score_int: -44 X-Spam_score: -4.5 X-Spam_bar: ---- X-Spam_report: (-4.5 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3,
SPF_HELO_PASS=-0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: akrowiak@linux.ibm.com, jjherne@linux.ibm.com, thuth@redhat.com, yi.l.liu@intel.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com, jasowang@redhat.com, farman@linux.ibm.com, peterx@redhat.com, pasic@linux.ibm.com, eric.auger@redhat.com, yi.y.sun@intel.com, chao.p.peng@intel.com, nicolinc@nvidia.com, kevin.tian@intel.com, jgg@nvidia.com, eric.auger.pro@gmail.com, david@gibson.dropbear.id.au Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel"

Before introducing support for the new /dev/iommu backend in VFIO,
split common.c into three parts:

- common.c keeps the backend-agnostic code that is unrelated to DMA
  mapping.
- as.c is created and contains the code related to VFIOAddressSpace
  and MemoryListeners. This code is intended to be backend agnostic.
- container.c is created and contains the code related to the legacy
  VFIO backend (containers, groups, ...).

No functional change intended.

Signed-off-by: Yi Liu
Signed-off-by: Eric Auger
Signed-off-by: Yi Liu
---
 hw/vfio/as.c                  |  868 ++++++++++++
 hw/vfio/common.c              | 2340 +++------------------------------
 hw/vfio/container.c           | 1193 +++++++++++++++++
 hw/vfio/meson.build           |    2 +
 include/hw/vfio/vfio-common.h |   28 +
 5 files changed, 2278 insertions(+), 2153 deletions(-)
 create mode 100644 hw/vfio/as.c
 create mode 100644 hw/vfio/container.c

diff --git a/hw/vfio/as.c b/hw/vfio/as.c
new file mode 100644
index 0000000000..4181182808
--- /dev/null
+++ b/hw/vfio/as.c
@@ -0,0 +1,868 @@
+/*
+ * generic functions used by VFIO devices
+ *
+ * Copyright Red Hat, Inc. 2012
+ *
+ * Authors:
+ *  Alex Williamson
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
See + * the COPYING file in the top-level directory. + * + * Based on qemu-kvm device-assignment: + * Adapted for KVM by Qumranet. + * Copyright (c) 2007, Neocleus, Alex Novik (alex@neocleus.com) + * Copyright (c) 2007, Neocleus, Guy Zana (guy@neocleus.com) + * Copyright (C) 2008, Qumranet, Amit Shah (amit.shah@qumranet.com) + * Copyright (C) 2008, Red Hat, Amit Shah (amit.shah@redhat.com) + * Copyright (C) 2008, IBM, Muli Ben-Yehuda (muli@il.ibm.com) + */ + +#include "qemu/osdep.h" +#include +#ifdef CONFIG_KVM +#include +#endif +#include + +#include "hw/vfio/vfio-common.h" +#include "hw/vfio/vfio.h" +#include "exec/address-spaces.h" +#include "exec/memory.h" +#include "exec/ram_addr.h" +#include "hw/hw.h" +#include "qemu/error-report.h" +#include "qemu/main-loop.h" +#include "qemu/range.h" +#include "sysemu/kvm.h" +#include "sysemu/reset.h" +#include "sysemu/runstate.h" +#include "trace.h" +#include "qapi/error.h" +#include "migration/migration.h" + +static QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces = + QLIST_HEAD_INITIALIZER(vfio_address_spaces); + +void vfio_host_win_add(VFIOContainer *container, + hwaddr min_iova, hwaddr max_iova, + uint64_t iova_pgsizes) +{ + VFIOHostDMAWindow *hostwin; + + QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) { + if (ranges_overlap(hostwin->min_iova, + hostwin->max_iova - hostwin->min_iova + 1, + min_iova, + max_iova - min_iova + 1)) { + hw_error("%s: Overlapped IOMMU are not enabled", __func__); + } + } + + hostwin = g_malloc0(sizeof(*hostwin)); + + hostwin->min_iova = min_iova; + hostwin->max_iova = max_iova; + hostwin->iova_pgsizes = iova_pgsizes; + QLIST_INSERT_HEAD(&container->hostwin_list, hostwin, hostwin_next); +} + +int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova, + hwaddr max_iova) +{ + VFIOHostDMAWindow *hostwin; + + QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) { + if (hostwin->min_iova == min_iova && hostwin->max_iova == max_iova) { + QLIST_REMOVE(hostwin, 
hostwin_next); + g_free(hostwin); + return 0; + } + } + + return -1; +} + +static bool vfio_listener_skipped_section(MemoryRegionSection *section) +{ + return (!memory_region_is_ram(section->mr) && + !memory_region_is_iommu(section->mr)) || + memory_region_is_protected(section->mr) || + /* + * Sizing an enabled 64-bit BAR can cause spurious mappings to + * addresses in the upper part of the 64-bit address space. These + * are never accessed by the CPU and beyond the address width of + * some IOMMU hardware. TODO: VFIO should tell us the IOMMU width. + */ + section->offset_within_address_space & (1ULL << 63); +} + +/* Called with rcu_read_lock held. */ +static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr, + ram_addr_t *ram_addr, bool *read_only) +{ + MemoryRegion *mr; + hwaddr xlat; + hwaddr len = iotlb->addr_mask + 1; + bool writable = iotlb->perm & IOMMU_WO; + + /* + * The IOMMU TLB entry we have just covers translation through + * this IOMMU to its immediate target. We need to translate + * it the rest of the way through to memory. + */ + mr = address_space_translate(&address_space_memory, + iotlb->translated_addr, + &xlat, &len, writable, + MEMTXATTRS_UNSPECIFIED); + if (!memory_region_is_ram(mr)) { + error_report("iommu map to non memory area %"HWADDR_PRIx"", + xlat); + return false; + } else if (memory_region_has_ram_discard_manager(mr)) { + RamDiscardManager *rdm = memory_region_get_ram_discard_manager(mr); + MemoryRegionSection tmp = { + .mr = mr, + .offset_within_region = xlat, + .size = int128_make64(len), + }; + + /* + * Malicious VMs can map memory into the IOMMU, which is expected + * to remain discarded. vfio will pin all pages, populating memory. + * Disallow that. vmstate priorities make sure any RamDiscardManager + * were already restored before IOMMUs are restored. 
+ */ + if (!ram_discard_manager_is_populated(rdm, &tmp)) { + error_report("iommu map to discarded memory (e.g., unplugged via" + " virtio-mem): %"HWADDR_PRIx"", + iotlb->translated_addr); + return false; + } + + /* + * Malicious VMs might trigger discarding of IOMMU-mapped memory. The + * pages will remain pinned inside vfio until unmapped, resulting in a + * higher memory consumption than expected. If memory would get + * populated again later, there would be an inconsistency between pages + * pinned by vfio and pages seen by QEMU. This is the case until + * unmapped from the IOMMU (e.g., during device reset). + * + * With malicious guests, we really only care about pinning more memory + * than expected. RLIMIT_MEMLOCK set for the user/process can never be + * exceeded and can be used to mitigate this problem. + */ + warn_report_once("Using vfio with vIOMMUs and coordinated discarding of" + " RAM (e.g., virtio-mem) works, however, malicious" + " guests can trigger pinning of more memory than" + " intended via an IOMMU. It's possible to mitigate " + " by setting/adjusting RLIMIT_MEMLOCK."); + } + + /* + * Translation truncates length to the IOMMU page size, + * check that it did not truncate too much. + */ + if (len & iotlb->addr_mask) { + error_report("iommu has granularity incompatible with target AS"); + return false; + } + + if (vaddr) { + *vaddr = memory_region_get_ram_ptr(mr) + xlat; + } + + if (ram_addr) { + *ram_addr = memory_region_get_ram_addr(mr) + xlat; + } + + if (read_only) { + *read_only = !writable || mr->readonly; + } + + return true; +} + +static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) +{ + VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n); + VFIOContainer *container = giommu->container; + hwaddr iova = iotlb->iova + giommu->iommu_offset; + void *vaddr; + int ret; + + trace_vfio_iommu_map_notify(iotlb->perm == IOMMU_NONE ? 
"UNMAP" : "MAP", + iova, iova + iotlb->addr_mask); + + if (iotlb->target_as != &address_space_memory) { + error_report("Wrong target AS \"%s\", only system memory is allowed", + iotlb->target_as->name ? iotlb->target_as->name : "none"); + return; + } + + rcu_read_lock(); + + if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) { + bool read_only; + + if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only)) { + goto out; + } + /* + * vaddr is only valid until rcu_read_unlock(). But after + * vfio_dma_map has set up the mapping the pages will be + * pinned by the kernel. This makes sure that the RAM backend + * of vaddr will always be there, even if the memory object is + * destroyed and its backing memory munmap-ed. + */ + ret = vfio_dma_map(container, iova, + iotlb->addr_mask + 1, vaddr, + read_only); + if (ret) { + error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", " + "0x%"HWADDR_PRIx", %p) = %d (%m)", + container, iova, + iotlb->addr_mask + 1, vaddr, ret); + } + } else { + ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1, iotlb); + if (ret) { + error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", " + "0x%"HWADDR_PRIx") = %d (%m)", + container, iova, + iotlb->addr_mask + 1, ret); + } + } +out: + rcu_read_unlock(); +} + +static void vfio_ram_discard_notify_discard(RamDiscardListener *rdl, + MemoryRegionSection *section) +{ + VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener, + listener); + const hwaddr size = int128_get64(section->size); + const hwaddr iova = section->offset_within_address_space; + int ret; + + /* Unmap with a single call. 
*/ + ret = vfio_dma_unmap(vrdl->container, iova, size , NULL); + if (ret) { + error_report("%s: vfio_dma_unmap() failed: %s", __func__, + strerror(-ret)); + } +} + +static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl, + MemoryRegionSection *section) +{ + VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener, + listener); + const hwaddr end = section->offset_within_region + + int128_get64(section->size); + hwaddr start, next, iova; + void *vaddr; + int ret; + + /* + * Map in (aligned within memory region) minimum granularity, so we can + * unmap in minimum granularity later. + */ + for (start = section->offset_within_region; start < end; start = next) { + next = ROUND_UP(start + 1, vrdl->granularity); + next = MIN(next, end); + + iova = start - section->offset_within_region + + section->offset_within_address_space; + vaddr = memory_region_get_ram_ptr(section->mr) + start; + + ret = vfio_dma_map(vrdl->container, iova, next - start, + vaddr, section->readonly); + if (ret) { + /* Rollback */ + vfio_ram_discard_notify_discard(rdl, section); + return ret; + } + } + return 0; +} + +static void vfio_register_ram_discard_listener(VFIOContainer *container, + MemoryRegionSection *section) +{ + RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr); + VFIORamDiscardListener *vrdl; + + /* Ignore some corner cases not relevant in practice. 
*/ + g_assert(QEMU_IS_ALIGNED(section->offset_within_region, TARGET_PAGE_SIZE)); + g_assert(QEMU_IS_ALIGNED(section->offset_within_address_space, + TARGET_PAGE_SIZE)); + g_assert(QEMU_IS_ALIGNED(int128_get64(section->size), TARGET_PAGE_SIZE)); + + vrdl = g_new0(VFIORamDiscardListener, 1); + vrdl->container = container; + vrdl->mr = section->mr; + vrdl->offset_within_address_space = section->offset_within_address_space; + vrdl->size = int128_get64(section->size); + vrdl->granularity = ram_discard_manager_get_min_granularity(rdm, + section->mr); + + g_assert(vrdl->granularity && is_power_of_2(vrdl->granularity)); + g_assert(container->pgsizes && + vrdl->granularity >= 1ULL << ctz64(container->pgsizes)); + + ram_discard_listener_init(&vrdl->listener, + vfio_ram_discard_notify_populate, + vfio_ram_discard_notify_discard, true); + ram_discard_manager_register_listener(rdm, &vrdl->listener, section); + QLIST_INSERT_HEAD(&container->vrdl_list, vrdl, next); + + /* + * Sanity-check if we have a theoretically problematic setup where we could + * exceed the maximum number of possible DMA mappings over time. We assume + * that each mapped section in the same address space as a RamDiscardManager + * section consumes exactly one DMA mapping, with the exception of + * RamDiscardManager sections; i.e., we don't expect to have gIOMMU sections + * in the same address space as RamDiscardManager sections. + * + * We assume that each section in the address space consumes one memslot. + * We take the number of KVM memory slots as a best guess for the maximum + * number of sections in the address space we could have over time, + * also consuming DMA mappings. 
+ */ + if (container->dma_max_mappings) { + unsigned int vrdl_count = 0, vrdl_mappings = 0, max_memslots = 512; + +#ifdef CONFIG_KVM + if (kvm_enabled()) { + max_memslots = kvm_get_max_memslots(); + } +#endif + + QLIST_FOREACH(vrdl, &container->vrdl_list, next) { + hwaddr start, end; + + start = QEMU_ALIGN_DOWN(vrdl->offset_within_address_space, + vrdl->granularity); + end = ROUND_UP(vrdl->offset_within_address_space + vrdl->size, + vrdl->granularity); + vrdl_mappings += (end - start) / vrdl->granularity; + vrdl_count++; + } + + if (vrdl_mappings + max_memslots - vrdl_count > + container->dma_max_mappings) { + warn_report("%s: possibly running out of DMA mappings. E.g., try" + " increasing the 'block-size' of virtio-mem devies." + " Maximum possible DMA mappings: %d, Maximum possible" + " memslots: %d", __func__, container->dma_max_mappings, + max_memslots); + } + } +} + +static void vfio_unregister_ram_discard_listener(VFIOContainer *container, + MemoryRegionSection *section) +{ + RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr); + VFIORamDiscardListener *vrdl = NULL; + + QLIST_FOREACH(vrdl, &container->vrdl_list, next) { + if (vrdl->mr == section->mr && + vrdl->offset_within_address_space == + section->offset_within_address_space) { + break; + } + } + + if (!vrdl) { + hw_error("vfio: Trying to unregister missing RAM discard listener"); + } + + ram_discard_manager_unregister_listener(rdm, &vrdl->listener); + QLIST_REMOVE(vrdl, next); + g_free(vrdl); +} + +static void vfio_listener_region_add(MemoryListener *listener, + MemoryRegionSection *section) +{ + VFIOContainer *container = container_of(listener, VFIOContainer, listener); + hwaddr iova, end; + Int128 llend, llsize; + void *vaddr; + int ret; + VFIOHostDMAWindow *hostwin; + bool hostwin_found; + Error *err = NULL; + + if (vfio_listener_skipped_section(section)) { + trace_vfio_listener_region_add_skip( + section->offset_within_address_space, + section->offset_within_address_space + + 
int128_get64(int128_sub(section->size, int128_one()))); + return; + } + + if (unlikely((section->offset_within_address_space & + ~qemu_real_host_page_mask) != + (section->offset_within_region & ~qemu_real_host_page_mask))) { + error_report("%s received unaligned region", __func__); + return; + } + + iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space); + llend = int128_make64(section->offset_within_address_space); + llend = int128_add(llend, section->size); + llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask)); + + if (int128_ge(int128_make64(iova), llend)) { + if (memory_region_is_ram_device(section->mr)) { + trace_vfio_listener_region_add_no_dma_map( + memory_region_name(section->mr), + section->offset_within_address_space, + int128_getlo(section->size), + qemu_real_host_page_size); + } + return; + } + end = int128_get64(int128_sub(llend, int128_one())); + + if (vfio_container_add_section_window(container, section, &err)) { + goto fail; + } + + hostwin_found = false; + QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) { + if (hostwin->min_iova <= iova && end <= hostwin->max_iova) { + hostwin_found = true; + break; + } + } + + if (!hostwin_found) { + error_setg(&err, "Container %p can't map guest IOVA region" + " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx, container, iova, end); + goto fail; + } + + memory_region_ref(section->mr); + + if (memory_region_is_iommu(section->mr)) { + VFIOGuestIOMMU *giommu; + IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr); + int iommu_idx; + + trace_vfio_listener_region_add_iommu(iova, end); + /* + * FIXME: For VFIO iommu types which have KVM acceleration to + * avoid bouncing all map/unmaps through qemu this way, this + * would be the right place to wire that up (tell the KVM + * device emulation the VFIO iommu handles to use). 
+ */ + giommu = g_malloc0(sizeof(*giommu)); + giommu->iommu_mr = iommu_mr; + giommu->iommu_offset = section->offset_within_address_space - + section->offset_within_region; + giommu->container = container; + llend = int128_add(int128_make64(section->offset_within_region), + section->size); + llend = int128_sub(llend, int128_one()); + iommu_idx = memory_region_iommu_attrs_to_index(iommu_mr, + MEMTXATTRS_UNSPECIFIED); + iommu_notifier_init(&giommu->n, vfio_iommu_map_notify, + IOMMU_NOTIFIER_IOTLB_EVENTS, + section->offset_within_region, + int128_get64(llend), + iommu_idx); + + ret = memory_region_iommu_set_page_size_mask(giommu->iommu_mr, + container->pgsizes, + &err); + if (ret) { + g_free(giommu); + goto fail; + } + + ret = memory_region_register_iommu_notifier(section->mr, &giommu->n, + &err); + if (ret) { + g_free(giommu); + goto fail; + } + QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next); + memory_region_iommu_replay(giommu->iommu_mr, &giommu->n); + + return; + } + + /* Here we assume that memory_region_is_ram(section->mr)==true */ + + /* + * For RAM memory regions with a RamDiscardManager, we only want to map the + * actually populated parts - and update the mapping whenever we're notified + * about changes. 
+ */ + if (memory_region_has_ram_discard_manager(section->mr)) { + vfio_register_ram_discard_listener(container, section); + return; + } + + vaddr = memory_region_get_ram_ptr(section->mr) + + section->offset_within_region + + (iova - section->offset_within_address_space); + + trace_vfio_listener_region_add_ram(iova, end, vaddr); + + llsize = int128_sub(llend, int128_make64(iova)); + + if (memory_region_is_ram_device(section->mr)) { + hwaddr pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1; + + if ((iova & pgmask) || (int128_get64(llsize) & pgmask)) { + trace_vfio_listener_region_add_no_dma_map( + memory_region_name(section->mr), + section->offset_within_address_space, + int128_getlo(section->size), + pgmask + 1); + return; + } + } + + ret = vfio_dma_map(container, iova, int128_get64(llsize), + vaddr, section->readonly); + if (ret) { + error_setg(&err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", " + "0x%"HWADDR_PRIx", %p) = %d (%m)", + container, iova, int128_get64(llsize), vaddr, ret); + if (memory_region_is_ram_device(section->mr)) { + /* Allow unexpected mappings not to be fatal for RAM devices */ + error_report_err(err); + return; + } + goto fail; + } + + return; + +fail: + if (memory_region_is_ram_device(section->mr)) { + error_report("failed to vfio_dma_map. pci p2p may not work"); + return; + } + /* + * On the initfn path, store the first error in the container so we + * can gracefully fail. Runtime, there's not much we can do other + * than throw a hardware error. 
+ */ + if (!container->initialized) { + if (!container->error) { + error_propagate_prepend(&container->error, err, + "Region %s: ", + memory_region_name(section->mr)); + } else { + error_free(err); + } + } else { + error_report_err(err); + hw_error("vfio: DMA mapping failed, unable to continue"); + } +} + +static void vfio_listener_region_del(MemoryListener *listener, + MemoryRegionSection *section) +{ + VFIOContainer *container = container_of(listener, VFIOContainer, listener); + hwaddr iova, end; + Int128 llend, llsize; + int ret; + bool try_unmap = true; + + if (vfio_listener_skipped_section(section)) { + trace_vfio_listener_region_del_skip( + section->offset_within_address_space, + section->offset_within_address_space + + int128_get64(int128_sub(section->size, int128_one()))); + return; + } + + if (unlikely((section->offset_within_address_space & + ~qemu_real_host_page_mask) != + (section->offset_within_region & ~qemu_real_host_page_mask))) { + error_report("%s received unaligned region", __func__); + return; + } + + if (memory_region_is_iommu(section->mr)) { + VFIOGuestIOMMU *giommu; + + QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) { + if (MEMORY_REGION(giommu->iommu_mr) == section->mr && + giommu->n.start == section->offset_within_region) { + memory_region_unregister_iommu_notifier(section->mr, + &giommu->n); + QLIST_REMOVE(giommu, giommu_next); + g_free(giommu); + break; + } + } + + /* + * FIXME: We assume the one big unmap below is adequate to + * remove any individual page mappings in the IOMMU which + * might have been copied into VFIO. This works for a page table + * based IOMMU where a big unmap flattens a large range of IO-PTEs. + * That may not be true for all IOMMU types. 
+ */ + } + + iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space); + llend = int128_make64(section->offset_within_address_space); + llend = int128_add(llend, section->size); + llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask)); + + if (int128_ge(int128_make64(iova), llend)) { + return; + } + end = int128_get64(int128_sub(llend, int128_one())); + + llsize = int128_sub(llend, int128_make64(iova)); + + trace_vfio_listener_region_del(iova, end); + + if (memory_region_is_ram_device(section->mr)) { + hwaddr pgmask; + VFIOHostDMAWindow *hostwin; + bool hostwin_found = false; + + QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) { + if (hostwin->min_iova <= iova && end <= hostwin->max_iova) { + hostwin_found = true; + break; + } + } + assert(hostwin_found); /* or region_add() would have failed */ + + pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1; + try_unmap = !((iova & pgmask) || (int128_get64(llsize) & pgmask)); + } else if (memory_region_has_ram_discard_manager(section->mr)) { + vfio_unregister_ram_discard_listener(container, section); + /* Unregistering will trigger an unmap. */ + try_unmap = false; + } + + if (try_unmap) { + if (int128_eq(llsize, int128_2_64())) { + /* The unmap ioctl doesn't accept a full 64-bit span. 
*/ + llsize = int128_rshift(llsize, 1); + ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL); + if (ret) { + error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", " + "0x%"HWADDR_PRIx") = %d (%m)", + container, iova, int128_get64(llsize), ret); + } + iova += int128_get64(llsize); + } + ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL); + if (ret) { + error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", " + "0x%"HWADDR_PRIx") = %d (%m)", + container, iova, int128_get64(llsize), ret); + } + } + + memory_region_unref(section->mr); + + vfio_container_del_section_window(container, section); +} + +static void vfio_listener_log_global_start(MemoryListener *listener) +{ + VFIOContainer *container = container_of(listener, VFIOContainer, listener); + + vfio_set_dirty_page_tracking(container, true); +} + +static void vfio_listener_log_global_stop(MemoryListener *listener) +{ + VFIOContainer *container = container_of(listener, VFIOContainer, listener); + + vfio_set_dirty_page_tracking(container, false); +} + +typedef struct { + IOMMUNotifier n; + VFIOGuestIOMMU *giommu; +} vfio_giommu_dirty_notifier; + +static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) +{ + vfio_giommu_dirty_notifier *gdn = container_of(n, + vfio_giommu_dirty_notifier, n); + VFIOGuestIOMMU *giommu = gdn->giommu; + VFIOContainer *container = giommu->container; + hwaddr iova = iotlb->iova + giommu->iommu_offset; + ram_addr_t translated_addr; + + trace_vfio_iommu_map_dirty_notify(iova, iova + iotlb->addr_mask); + + if (iotlb->target_as != &address_space_memory) { + error_report("Wrong target AS \"%s\", only system memory is allowed", + iotlb->target_as->name ? 
iotlb->target_as->name : "none"); + return; + } + + rcu_read_lock(); + if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL)) { + int ret; + + ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1, + translated_addr); + if (ret) { + error_report("vfio_iommu_map_dirty_notify(%p, 0x%"HWADDR_PRIx", " + "0x%"HWADDR_PRIx") = %d (%m)", + container, iova, + iotlb->addr_mask + 1, ret); + } + } + rcu_read_unlock(); +} + +static int vfio_ram_discard_get_dirty_bitmap(MemoryRegionSection *section, + void *opaque) +{ + const hwaddr size = int128_get64(section->size); + const hwaddr iova = section->offset_within_address_space; + const ram_addr_t ram_addr = memory_region_get_ram_addr(section->mr) + + section->offset_within_region; + VFIORamDiscardListener *vrdl = opaque; + + /* + * Sync the whole mapped region (spanning multiple individual mappings) + * in one go. + */ + return vfio_get_dirty_bitmap(vrdl->container, iova, size, ram_addr); +} + +static int vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainer *container, + MemoryRegionSection *section) +{ + RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr); + VFIORamDiscardListener *vrdl = NULL; + + QLIST_FOREACH(vrdl, &container->vrdl_list, next) { + if (vrdl->mr == section->mr && + vrdl->offset_within_address_space == + section->offset_within_address_space) { + break; + } + } + + if (!vrdl) { + hw_error("vfio: Trying to sync missing RAM discard listener"); + } + + /* + * We only want/can synchronize the bitmap for actually mapped parts - + * which correspond to populated parts. Replay all populated parts. 
+ */ + return ram_discard_manager_replay_populated(rdm, section, + vfio_ram_discard_get_dirty_bitmap, + &vrdl); +} + +static int vfio_sync_dirty_bitmap(VFIOContainer *container, + MemoryRegionSection *section) +{ + ram_addr_t ram_addr; + + if (memory_region_is_iommu(section->mr)) { + VFIOGuestIOMMU *giommu; + + QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) { + if (MEMORY_REGION(giommu->iommu_mr) == section->mr && + giommu->n.start == section->offset_within_region) { + Int128 llend; + vfio_giommu_dirty_notifier gdn = { .giommu = giommu }; + int idx = memory_region_iommu_attrs_to_index(giommu->iommu_mr, + MEMTXATTRS_UNSPECIFIED); + + llend = int128_add(int128_make64(section->offset_within_region), + section->size); + llend = int128_sub(llend, int128_one()); + + iommu_notifier_init(&gdn.n, + vfio_iommu_map_dirty_notify, + IOMMU_NOTIFIER_MAP, + section->offset_within_region, + int128_get64(llend), + idx); + memory_region_iommu_replay(giommu->iommu_mr, &gdn.n); + break; + } + } + return 0; + } else if (memory_region_has_ram_discard_manager(section->mr)) { + return vfio_sync_ram_discard_listener_dirty_bitmap(container, section); + } + + ram_addr = memory_region_get_ram_addr(section->mr) + + section->offset_within_region; + + return vfio_get_dirty_bitmap(container, + REAL_HOST_PAGE_ALIGN(section->offset_within_address_space), + int128_get64(section->size), ram_addr); +} + +static void vfio_listener_log_sync(MemoryListener *listener, + MemoryRegionSection *section) +{ + VFIOContainer *container = container_of(listener, VFIOContainer, listener); + + if (vfio_listener_skipped_section(section) || + !container->dirty_pages_supported) { + return; + } + + if (vfio_devices_all_dirty_tracking(container)) { + vfio_sync_dirty_bitmap(container, section); + } +} + +const MemoryListener vfio_memory_listener = { + .name = "vfio", + .region_add = vfio_listener_region_add, + .region_del = vfio_listener_region_del, + .log_global_start = vfio_listener_log_global_start, + 
.log_global_stop = vfio_listener_log_global_stop, + .log_sync = vfio_listener_log_sync, +}; + +VFIOAddressSpace *vfio_get_address_space(AddressSpace *as) +{ + VFIOAddressSpace *space; + + QLIST_FOREACH(space, &vfio_address_spaces, list) { + if (space->as == as) { + return space; + } + } + + /* No suitable VFIOAddressSpace, create a new one */ + space = g_malloc0(sizeof(*space)); + space->as = as; + QLIST_INIT(&space->containers); + + QLIST_INSERT_HEAD(&vfio_address_spaces, space, list); + + return space; +} + +void vfio_put_address_space(VFIOAddressSpace *space) +{ + if (QLIST_EMPTY(&space->containers)) { + QLIST_REMOVE(space, list); + g_free(space); + } +} diff --git a/hw/vfio/common.c b/hw/vfio/common.c index b05f68b5c7..892aa47113 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -20,42 +20,13 @@ #include "qemu/osdep.h" #include -#ifdef CONFIG_KVM -#include -#endif #include #include "hw/vfio/vfio-common.h" #include "hw/vfio/vfio.h" -#include "exec/address-spaces.h" -#include "exec/memory.h" -#include "exec/ram_addr.h" #include "hw/hw.h" -#include "qemu/error-report.h" -#include "qemu/main-loop.h" -#include "qemu/range.h" -#include "sysemu/kvm.h" -#include "sysemu/reset.h" -#include "sysemu/runstate.h" #include "trace.h" #include "qapi/error.h" -#include "migration/migration.h" - -VFIOGroupList vfio_group_list = - QLIST_HEAD_INITIALIZER(vfio_group_list); -static QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces = - QLIST_HEAD_INITIALIZER(vfio_address_spaces); - -#ifdef CONFIG_KVM -/* - * We have a single VFIO pseudo device per KVM VM. Once created it lives - * for the life of the VM. Closing the file descriptor only drops our - * reference to it and the device's reference to kvm. Therefore once - * initialized, this file descriptor is only released on QEMU exit and - * we'll re-use it should another vfio device be attached before then. 
- */ -static int vfio_kvm_device_fd = -1; -#endif /* * Common VFIO interrupt disable @@ -135,29 +106,6 @@ static const char *index_to_str(VFIODevice *vbasedev, int index) } } -static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state) -{ - switch (container->iommu_type) { - case VFIO_TYPE1v2_IOMMU: - case VFIO_TYPE1_IOMMU: - /* - * We support coordinated discarding of RAM via the RamDiscardManager. - */ - return ram_block_uncoordinated_discard_disable(state); - default: - /* - * VFIO_SPAPR_TCE_IOMMU most probably works just fine with - * RamDiscardManager, however, it is completely untested. - * - * VFIO_SPAPR_TCE_v2_IOMMU with "DMA memory preregistering" does - * completely the opposite of managing mapping/pinning dynamically as - * required by RamDiscardManager. We would have to special-case sections - * with a RamDiscardManager. - */ - return ram_block_discard_disable(state); - } -} - int vfio_set_irq_signaling(VFIODevice *vbasedev, int index, int subindex, int action, int fd, Error **errp) { @@ -312,2115 +260,296 @@ const MemoryRegionOps vfio_region_ops = { }, }; -/* - * Device state interfaces - */ - -bool vfio_mig_active(void) +static struct vfio_info_cap_header * +vfio_get_cap(void *ptr, uint32_t cap_offset, uint16_t id) { - VFIOGroup *group; - VFIODevice *vbasedev; - - if (QLIST_EMPTY(&vfio_group_list)) { - return false; - } + struct vfio_info_cap_header *hdr; - QLIST_FOREACH(group, &vfio_group_list, next) { - QLIST_FOREACH(vbasedev, &group->device_list, next) { - if (vbasedev->migration_blocker) { - return false; - } + for (hdr = ptr + cap_offset; hdr != ptr; hdr = ptr + hdr->next) { + if (hdr->id == id) { + return hdr; } } - return true; + + return NULL; } -static bool vfio_devices_all_dirty_tracking(VFIOContainer *container) +struct vfio_info_cap_header * +vfio_get_region_info_cap(struct vfio_region_info *info, uint16_t id) { - VFIOGroup *group; - VFIODevice *vbasedev; - MigrationState *ms = migrate_get_current(); - - if 
(!migration_is_setup_or_active(ms->state)) { - return false; + if (!(info->flags & VFIO_REGION_INFO_FLAG_CAPS)) { + return NULL; } - QLIST_FOREACH(group, &container->group_list, container_next) { - QLIST_FOREACH(vbasedev, &group->device_list, next) { - VFIOMigration *migration = vbasedev->migration; + return vfio_get_cap((void *)info, info->cap_offset, id); +} + +static struct vfio_info_cap_header * +vfio_get_iommu_type1_info_cap(struct vfio_iommu_type1_info *info, uint16_t id) +{ + if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) { + return NULL; + } - if (!migration) { - return false; - } + return vfio_get_cap((void *)info, info->cap_offset, id); +} - if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) - && (migration->device_state & VFIO_DEVICE_STATE_RUNNING)) { - return false; - } - } +struct vfio_info_cap_header * +vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id) +{ + if (!(info->flags & VFIO_DEVICE_FLAGS_CAPS)) { + return NULL; } - return true; + + return vfio_get_cap((void *)info, info->cap_offset, id); } -static bool vfio_devices_all_running_and_saving(VFIOContainer *container) +bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info, + unsigned int *avail) { - VFIOGroup *group; - VFIODevice *vbasedev; - MigrationState *ms = migrate_get_current(); + struct vfio_info_cap_header *hdr; + struct vfio_iommu_type1_info_dma_avail *cap; - if (!migration_is_setup_or_active(ms->state)) { + /* If the capability cannot be found, assume no DMA limiting */ + hdr = vfio_get_iommu_type1_info_cap(info, + VFIO_IOMMU_TYPE1_INFO_DMA_AVAIL); + if (hdr == NULL) { return false; } - QLIST_FOREACH(group, &container->group_list, container_next) { - QLIST_FOREACH(vbasedev, &group->device_list, next) { - VFIOMigration *migration = vbasedev->migration; - - if (!migration) { - return false; - } - - if ((migration->device_state & VFIO_DEVICE_STATE_SAVING) && - (migration->device_state & VFIO_DEVICE_STATE_RUNNING)) { - continue; - } else { - return false; 
- } - } + if (avail != NULL) { + cap = (void *) hdr; + *avail = cap->avail; } + return true; } -static int vfio_dma_unmap_bitmap(VFIOContainer *container, - hwaddr iova, ram_addr_t size, - IOMMUTLBEntry *iotlb) +static int vfio_setup_region_sparse_mmaps(VFIORegion *region, + struct vfio_region_info *info) { - struct vfio_iommu_type1_dma_unmap *unmap; - struct vfio_bitmap *bitmap; - uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size; - int ret; + struct vfio_info_cap_header *hdr; + struct vfio_region_info_cap_sparse_mmap *sparse; + int i, j; - unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap)); + hdr = vfio_get_region_info_cap(info, VFIO_REGION_INFO_CAP_SPARSE_MMAP); + if (!hdr) { + return -ENODEV; + } - unmap->argsz = sizeof(*unmap) + sizeof(*bitmap); - unmap->iova = iova; - unmap->size = size; - unmap->flags |= VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP; - bitmap = (struct vfio_bitmap *)&unmap->data; + sparse = container_of(hdr, struct vfio_region_info_cap_sparse_mmap, header); - /* - * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of - * qemu_real_host_page_size to mark those dirty. Hence set bitmap_pgsize - * to qemu_real_host_page_size. 
- */ + trace_vfio_region_sparse_mmap_header(region->vbasedev->name, + region->nr, sparse->nr_areas); - bitmap->pgsize = qemu_real_host_page_size; - bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) / - BITS_PER_BYTE; + region->mmaps = g_new0(VFIOMmap, sparse->nr_areas); - if (bitmap->size > container->max_dirty_bitmap_size) { - error_report("UNMAP: Size of bitmap too big 0x%"PRIx64, - (uint64_t)bitmap->size); - ret = -E2BIG; - goto unmap_exit; - } + for (i = 0, j = 0; i < sparse->nr_areas; i++) { + trace_vfio_region_sparse_mmap_entry(i, sparse->areas[i].offset, + sparse->areas[i].offset + + sparse->areas[i].size); - bitmap->data = g_try_malloc0(bitmap->size); - if (!bitmap->data) { - ret = -ENOMEM; - goto unmap_exit; + if (sparse->areas[i].size) { + region->mmaps[j].offset = sparse->areas[i].offset; + region->mmaps[j].size = sparse->areas[i].size; + j++; + } } - ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap); - if (!ret) { - cpu_physical_memory_set_dirty_lebitmap((unsigned long *)bitmap->data, - iotlb->translated_addr, pages); - } else { - error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m"); - } + region->nr_mmaps = j; + region->mmaps = g_realloc(region->mmaps, j * sizeof(VFIOMmap)); - g_free(bitmap->data); -unmap_exit: - g_free(unmap); - return ret; + return 0; } -/* - * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86 - */ -static int vfio_dma_unmap(VFIOContainer *container, - hwaddr iova, ram_addr_t size, - IOMMUTLBEntry *iotlb) +int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region, + int index, const char *name) { - struct vfio_iommu_type1_dma_unmap unmap = { - .argsz = sizeof(unmap), - .flags = 0, - .iova = iova, - .size = size, - }; + struct vfio_region_info *info; + int ret; - if (iotlb && container->dirty_pages_supported && - vfio_devices_all_running_and_saving(container)) { - return vfio_dma_unmap_bitmap(container, iova, size, iotlb); + ret = vfio_get_region_info(vbasedev, index, &info); 
+ if (ret) { + return ret; } - while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) { - /* - * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c - * v4.15) where an overflow in its wrap-around check prevents us from - * unmapping the last page of the address space. Test for the error - * condition and re-try the unmap excluding the last page. The - * expectation is that we've never mapped the last page anyway and this - * unmap request comes via vIOMMU support which also makes it unlikely - * that this page is used. This bug was introduced well after type1 v2 - * support was introduced, so we shouldn't need to test for v1. A fix - * is queued for kernel v5.0 so this workaround can be removed once - * affected kernels are sufficiently deprecated. - */ - if (errno == EINVAL && unmap.size && !(unmap.iova + unmap.size) && - container->iommu_type == VFIO_TYPE1v2_IOMMU) { - trace_vfio_dma_unmap_overflow_workaround(); - unmap.size -= 1ULL << ctz64(container->pgsizes); - continue; - } - error_report("VFIO_UNMAP_DMA failed: %s", strerror(errno)); - return -errno; - } + region->vbasedev = vbasedev; + region->flags = info->flags; + region->size = info->size; + region->fd_offset = info->offset; + region->nr = index; - return 0; -} + if (region->size) { + region->mem = g_new0(MemoryRegion, 1); + memory_region_init_io(region->mem, obj, &vfio_region_ops, + region, name, region->size); -static int vfio_dma_map(VFIOContainer *container, hwaddr iova, - ram_addr_t size, void *vaddr, bool readonly) -{ - struct vfio_iommu_type1_dma_map map = { - .argsz = sizeof(map), - .flags = VFIO_DMA_MAP_FLAG_READ, - .vaddr = (__u64)(uintptr_t)vaddr, - .iova = iova, - .size = size, - }; + if (!vbasedev->no_mmap && + region->flags & VFIO_REGION_INFO_FLAG_MMAP) { - if (!readonly) { - map.flags |= VFIO_DMA_MAP_FLAG_WRITE; - } + ret = vfio_setup_region_sparse_mmaps(region, info); - /* - * Try the mapping, if it fails with EBUSY, unmap the region and try - * again. 
This shouldn't be necessary, but we sometimes see it in - * the VGA ROM space. - */ - if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 || - (errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 && - ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) { - return 0; + if (ret) { + region->nr_mmaps = 1; + region->mmaps = g_new0(VFIOMmap, region->nr_mmaps); + region->mmaps[0].offset = 0; + region->mmaps[0].size = region->size; + } + } } - error_report("VFIO_MAP_DMA failed: %s", strerror(errno)); - return -errno; + g_free(info); + + trace_vfio_region_setup(vbasedev->name, index, name, + region->flags, region->fd_offset, region->size); + return 0; } -static void vfio_host_win_add(VFIOContainer *container, - hwaddr min_iova, hwaddr max_iova, - uint64_t iova_pgsizes) +static void vfio_subregion_unmap(VFIORegion *region, int index) { - VFIOHostDMAWindow *hostwin; - - QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) { - if (ranges_overlap(hostwin->min_iova, - hostwin->max_iova - hostwin->min_iova + 1, - min_iova, - max_iova - min_iova + 1)) { - hw_error("%s: Overlapped IOMMU are not enabled", __func__); - } - } - - hostwin = g_malloc0(sizeof(*hostwin)); - - hostwin->min_iova = min_iova; - hostwin->max_iova = max_iova; - hostwin->iova_pgsizes = iova_pgsizes; - QLIST_INSERT_HEAD(&container->hostwin_list, hostwin, hostwin_next); + trace_vfio_region_unmap(memory_region_name(&region->mmaps[index].mem), + region->mmaps[index].offset, + region->mmaps[index].offset + + region->mmaps[index].size - 1); + memory_region_del_subregion(region->mem, &region->mmaps[index].mem); + munmap(region->mmaps[index].mmap, region->mmaps[index].size); + object_unparent(OBJECT(&region->mmaps[index].mem)); + region->mmaps[index].mmap = NULL; } -static int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova, - hwaddr max_iova) +int vfio_region_mmap(VFIORegion *region) { - VFIOHostDMAWindow *hostwin; + int i, prot = 0; + char *name; - QLIST_FOREACH(hostwin, 
&container->hostwin_list, hostwin_next) { - if (hostwin->min_iova == min_iova && hostwin->max_iova == max_iova) { - QLIST_REMOVE(hostwin, hostwin_next); - g_free(hostwin); - return 0; - } + if (!region->mem) { + return 0; } - return -1; -} - -static bool vfio_listener_skipped_section(MemoryRegionSection *section) -{ - return (!memory_region_is_ram(section->mr) && - !memory_region_is_iommu(section->mr)) || - memory_region_is_protected(section->mr) || - /* - * Sizing an enabled 64-bit BAR can cause spurious mappings to - * addresses in the upper part of the 64-bit address space. These - * are never accessed by the CPU and beyond the address width of - * some IOMMU hardware. TODO: VFIO should tell us the IOMMU width. - */ - section->offset_within_address_space & (1ULL << 63); -} + prot |= region->flags & VFIO_REGION_INFO_FLAG_READ ? PROT_READ : 0; + prot |= region->flags & VFIO_REGION_INFO_FLAG_WRITE ? PROT_WRITE : 0; -/* Called with rcu_read_lock held. */ -static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr, - ram_addr_t *ram_addr, bool *read_only) -{ - MemoryRegion *mr; - hwaddr xlat; - hwaddr len = iotlb->addr_mask + 1; - bool writable = iotlb->perm & IOMMU_WO; + for (i = 0; i < region->nr_mmaps; i++) { + region->mmaps[i].mmap = mmap(NULL, region->mmaps[i].size, prot, + MAP_SHARED, region->vbasedev->fd, + region->fd_offset + + region->mmaps[i].offset); + if (region->mmaps[i].mmap == MAP_FAILED) { + int ret = -errno; - /* - * The IOMMU TLB entry we have just covers translation through - * this IOMMU to its immediate target. We need to translate - * it the rest of the way through to memory. 
- */ - mr = address_space_translate(&address_space_memory, - iotlb->translated_addr, - &xlat, &len, writable, - MEMTXATTRS_UNSPECIFIED); - if (!memory_region_is_ram(mr)) { - error_report("iommu map to non memory area %"HWADDR_PRIx"", - xlat); - return false; - } else if (memory_region_has_ram_discard_manager(mr)) { - RamDiscardManager *rdm = memory_region_get_ram_discard_manager(mr); - MemoryRegionSection tmp = { - .mr = mr, - .offset_within_region = xlat, - .size = int128_make64(len), - }; - - /* - * Malicious VMs can map memory into the IOMMU, which is expected - * to remain discarded. vfio will pin all pages, populating memory. - * Disallow that. vmstate priorities make sure any RamDiscardManager - * were already restored before IOMMUs are restored. - */ - if (!ram_discard_manager_is_populated(rdm, &tmp)) { - error_report("iommu map to discarded memory (e.g., unplugged via" - " virtio-mem): %"HWADDR_PRIx"", - iotlb->translated_addr); - return false; - } + trace_vfio_region_mmap_fault(memory_region_name(region->mem), i, + region->fd_offset + + region->mmaps[i].offset, + region->fd_offset + + region->mmaps[i].offset + + region->mmaps[i].size - 1, ret); - /* - * Malicious VMs might trigger discarding of IOMMU-mapped memory. The - * pages will remain pinned inside vfio until unmapped, resulting in a - * higher memory consumption than expected. If memory would get - * populated again later, there would be an inconsistency between pages - * pinned by vfio and pages seen by QEMU. This is the case until - * unmapped from the IOMMU (e.g., during device reset). - * - * With malicious guests, we really only care about pinning more memory - * than expected. RLIMIT_MEMLOCK set for the user/process can never be - * exceeded and can be used to mitigate this problem. - */ - warn_report_once("Using vfio with vIOMMUs and coordinated discarding of" - " RAM (e.g., virtio-mem) works, however, malicious" - " guests can trigger pinning of more memory than" - " intended via an IOMMU. 
It's possible to mitigate " - " by setting/adjusting RLIMIT_MEMLOCK."); - } + region->mmaps[i].mmap = NULL; - /* - * Translation truncates length to the IOMMU page size, - * check that it did not truncate too much. - */ - if (len & iotlb->addr_mask) { - error_report("iommu has granularity incompatible with target AS"); - return false; - } + for (i--; i >= 0; i--) { + vfio_subregion_unmap(region, i); + } - if (vaddr) { - *vaddr = memory_region_get_ram_ptr(mr) + xlat; - } + return ret; + } - if (ram_addr) { - *ram_addr = memory_region_get_ram_addr(mr) + xlat; - } + name = g_strdup_printf("%s mmaps[%d]", + memory_region_name(region->mem), i); + memory_region_init_ram_device_ptr(&region->mmaps[i].mem, + memory_region_owner(region->mem), + name, region->mmaps[i].size, + region->mmaps[i].mmap); + g_free(name); + memory_region_add_subregion(region->mem, region->mmaps[i].offset, + &region->mmaps[i].mem); - if (read_only) { - *read_only = !writable || mr->readonly; + trace_vfio_region_mmap(memory_region_name(&region->mmaps[i].mem), + region->mmaps[i].offset, + region->mmaps[i].offset + + region->mmaps[i].size - 1); } - return true; + return 0; } -static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) +void vfio_region_unmap(VFIORegion *region) { - VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n); - VFIOContainer *container = giommu->container; - hwaddr iova = iotlb->iova + giommu->iommu_offset; - void *vaddr; - int ret; - - trace_vfio_iommu_map_notify(iotlb->perm == IOMMU_NONE ? "UNMAP" : "MAP", - iova, iova + iotlb->addr_mask); + int i; - if (iotlb->target_as != &address_space_memory) { - error_report("Wrong target AS \"%s\", only system memory is allowed", - iotlb->target_as->name ? 
iotlb->target_as->name : "none"); + if (!region->mem) { return; } - rcu_read_lock(); - - if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) { - bool read_only; - - if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only)) { - goto out; - } - /* - * vaddr is only valid until rcu_read_unlock(). But after - * vfio_dma_map has set up the mapping the pages will be - * pinned by the kernel. This makes sure that the RAM backend - * of vaddr will always be there, even if the memory object is - * destroyed and its backing memory munmap-ed. - */ - ret = vfio_dma_map(container, iova, - iotlb->addr_mask + 1, vaddr, - read_only); - if (ret) { - error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", " - "0x%"HWADDR_PRIx", %p) = %d (%m)", - container, iova, - iotlb->addr_mask + 1, vaddr, ret); - } - } else { - ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1, iotlb); - if (ret) { - error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", " - "0x%"HWADDR_PRIx") = %d (%m)", - container, iova, - iotlb->addr_mask + 1, ret); + for (i = 0; i < region->nr_mmaps; i++) { + if (region->mmaps[i].mmap) { + vfio_subregion_unmap(region, i); } } -out: - rcu_read_unlock(); } -static void vfio_ram_discard_notify_discard(RamDiscardListener *rdl, - MemoryRegionSection *section) +void vfio_region_exit(VFIORegion *region) { - VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener, - listener); - const hwaddr size = int128_get64(section->size); - const hwaddr iova = section->offset_within_address_space; - int ret; + int i; - /* Unmap with a single call. 
*/ - ret = vfio_dma_unmap(vrdl->container, iova, size , NULL); - if (ret) { - error_report("%s: vfio_dma_unmap() failed: %s", __func__, - strerror(-ret)); + if (!region->mem) { + return; } -} - -static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl, - MemoryRegionSection *section) -{ - VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener, - listener); - const hwaddr end = section->offset_within_region + - int128_get64(section->size); - hwaddr start, next, iova; - void *vaddr; - int ret; - /* - * Map in (aligned within memory region) minimum granularity, so we can - * unmap in minimum granularity later. - */ - for (start = section->offset_within_region; start < end; start = next) { - next = ROUND_UP(start + 1, vrdl->granularity); - next = MIN(next, end); - - iova = start - section->offset_within_region + - section->offset_within_address_space; - vaddr = memory_region_get_ram_ptr(section->mr) + start; - - ret = vfio_dma_map(vrdl->container, iova, next - start, - vaddr, section->readonly); - if (ret) { - /* Rollback */ - vfio_ram_discard_notify_discard(rdl, section); - return ret; + for (i = 0; i < region->nr_mmaps; i++) { + if (region->mmaps[i].mmap) { + memory_region_del_subregion(region->mem, &region->mmaps[i].mem); } } - return 0; + + trace_vfio_region_exit(region->vbasedev->name, region->nr); } -static void vfio_register_ram_discard_listener(VFIOContainer *container, - MemoryRegionSection *section) +void vfio_region_finalize(VFIORegion *region) { - RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr); - VFIORamDiscardListener *vrdl; - - /* Ignore some corner cases not relevant in practice. 
*/ - g_assert(QEMU_IS_ALIGNED(section->offset_within_region, TARGET_PAGE_SIZE)); - g_assert(QEMU_IS_ALIGNED(section->offset_within_address_space, - TARGET_PAGE_SIZE)); - g_assert(QEMU_IS_ALIGNED(int128_get64(section->size), TARGET_PAGE_SIZE)); - - vrdl = g_new0(VFIORamDiscardListener, 1); - vrdl->container = container; - vrdl->mr = section->mr; - vrdl->offset_within_address_space = section->offset_within_address_space; - vrdl->size = int128_get64(section->size); - vrdl->granularity = ram_discard_manager_get_min_granularity(rdm, - section->mr); - - g_assert(vrdl->granularity && is_power_of_2(vrdl->granularity)); - g_assert(container->pgsizes && - vrdl->granularity >= 1ULL << ctz64(container->pgsizes)); - - ram_discard_listener_init(&vrdl->listener, - vfio_ram_discard_notify_populate, - vfio_ram_discard_notify_discard, true); - ram_discard_manager_register_listener(rdm, &vrdl->listener, section); - QLIST_INSERT_HEAD(&container->vrdl_list, vrdl, next); + int i; - /* - * Sanity-check if we have a theoretically problematic setup where we could - * exceed the maximum number of possible DMA mappings over time. We assume - * that each mapped section in the same address space as a RamDiscardManager - * section consumes exactly one DMA mapping, with the exception of - * RamDiscardManager sections; i.e., we don't expect to have gIOMMU sections - * in the same address space as RamDiscardManager sections. - * - * We assume that each section in the address space consumes one memslot. - * We take the number of KVM memory slots as a best guess for the maximum - * number of sections in the address space we could have over time, - * also consuming DMA mappings. 
- */ - if (container->dma_max_mappings) { - unsigned int vrdl_count = 0, vrdl_mappings = 0, max_memslots = 512; - -#ifdef CONFIG_KVM - if (kvm_enabled()) { - max_memslots = kvm_get_max_memslots(); - } -#endif - - QLIST_FOREACH(vrdl, &container->vrdl_list, next) { - hwaddr start, end; - - start = QEMU_ALIGN_DOWN(vrdl->offset_within_address_space, - vrdl->granularity); - end = ROUND_UP(vrdl->offset_within_address_space + vrdl->size, - vrdl->granularity); - vrdl_mappings += (end - start) / vrdl->granularity; - vrdl_count++; - } - - if (vrdl_mappings + max_memslots - vrdl_count > - container->dma_max_mappings) { - warn_report("%s: possibly running out of DMA mappings. E.g., try" - " increasing the 'block-size' of virtio-mem devies." - " Maximum possible DMA mappings: %d, Maximum possible" - " memslots: %d", __func__, container->dma_max_mappings, - max_memslots); - } - } -} - -static void vfio_unregister_ram_discard_listener(VFIOContainer *container, - MemoryRegionSection *section) -{ - RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr); - VFIORamDiscardListener *vrdl = NULL; - - QLIST_FOREACH(vrdl, &container->vrdl_list, next) { - if (vrdl->mr == section->mr && - vrdl->offset_within_address_space == - section->offset_within_address_space) { - break; - } - } - - if (!vrdl) { - hw_error("vfio: Trying to unregister missing RAM discard listener"); - } - - ram_discard_manager_unregister_listener(rdm, &vrdl->listener); - QLIST_REMOVE(vrdl, next); - g_free(vrdl); -} - -static void vfio_listener_region_add(MemoryListener *listener, - MemoryRegionSection *section) -{ - VFIOContainer *container = container_of(listener, VFIOContainer, listener); - hwaddr iova, end; - Int128 llend, llsize; - void *vaddr; - int ret; - VFIOHostDMAWindow *hostwin; - bool hostwin_found; - Error *err = NULL; - - if (vfio_listener_skipped_section(section)) { - trace_vfio_listener_region_add_skip( - section->offset_within_address_space, - section->offset_within_address_space + - 
int128_get64(int128_sub(section->size, int128_one()))); - return; - } - - if (unlikely((section->offset_within_address_space & - ~qemu_real_host_page_mask) != - (section->offset_within_region & ~qemu_real_host_page_mask))) { - error_report("%s received unaligned region", __func__); - return; - } - - iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space); - llend = int128_make64(section->offset_within_address_space); - llend = int128_add(llend, section->size); - llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask)); - - if (int128_ge(int128_make64(iova), llend)) { - if (memory_region_is_ram_device(section->mr)) { - trace_vfio_listener_region_add_no_dma_map( - memory_region_name(section->mr), - section->offset_within_address_space, - int128_getlo(section->size), - qemu_real_host_page_size); - } - return; - } - end = int128_get64(int128_sub(llend, int128_one())); - - if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) { - hwaddr pgsize = 0; - - /* For now intersections are not allowed, we may relax this later */ - QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) { - if (ranges_overlap(hostwin->min_iova, - hostwin->max_iova - hostwin->min_iova + 1, - section->offset_within_address_space, - int128_get64(section->size))) { - error_setg(&err, - "region [0x%"PRIx64",0x%"PRIx64"] overlaps with existing" - "host DMA window [0x%"PRIx64",0x%"PRIx64"]", - section->offset_within_address_space, - section->offset_within_address_space + - int128_get64(section->size) - 1, - hostwin->min_iova, hostwin->max_iova); - goto fail; - } - } - - ret = vfio_spapr_create_window(container, section, &pgsize); - if (ret) { - error_setg_errno(&err, -ret, "Failed to create SPAPR window"); - goto fail; - } - - vfio_host_win_add(container, section->offset_within_address_space, - section->offset_within_address_space + - int128_get64(section->size) - 1, pgsize); -#ifdef CONFIG_KVM - if (kvm_enabled()) { - VFIOGroup *group; - IOMMUMemoryRegion *iommu_mr = 
IOMMU_MEMORY_REGION(section->mr); - struct kvm_vfio_spapr_tce param; - struct kvm_device_attr attr = { - .group = KVM_DEV_VFIO_GROUP, - .attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE, - .addr = (uint64_t)(unsigned long)&param, - }; - - if (!memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_SPAPR_TCE_FD, - &param.tablefd)) { - QLIST_FOREACH(group, &container->group_list, container_next) { - param.groupfd = group->fd; - if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) { - error_report("vfio: failed to setup fd %d " - "for a group with fd %d: %s", - param.tablefd, param.groupfd, - strerror(errno)); - return; - } - trace_vfio_spapr_group_attach(param.groupfd, param.tablefd); - } - } - } -#endif - } - - hostwin_found = false; - QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) { - if (hostwin->min_iova <= iova && end <= hostwin->max_iova) { - hostwin_found = true; - break; - } - } - - if (!hostwin_found) { - error_setg(&err, "Container %p can't map guest IOVA region" - " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx, container, iova, end); - goto fail; - } - - memory_region_ref(section->mr); - - if (memory_region_is_iommu(section->mr)) { - VFIOGuestIOMMU *giommu; - IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr); - int iommu_idx; - - trace_vfio_listener_region_add_iommu(iova, end); - /* - * FIXME: For VFIO iommu types which have KVM acceleration to - * avoid bouncing all map/unmaps through qemu this way, this - * would be the right place to wire that up (tell the KVM - * device emulation the VFIO iommu handles to use). 
- */ - giommu = g_malloc0(sizeof(*giommu)); - giommu->iommu_mr = iommu_mr; - giommu->iommu_offset = section->offset_within_address_space - - section->offset_within_region; - giommu->container = container; - llend = int128_add(int128_make64(section->offset_within_region), - section->size); - llend = int128_sub(llend, int128_one()); - iommu_idx = memory_region_iommu_attrs_to_index(iommu_mr, - MEMTXATTRS_UNSPECIFIED); - iommu_notifier_init(&giommu->n, vfio_iommu_map_notify, - IOMMU_NOTIFIER_IOTLB_EVENTS, - section->offset_within_region, - int128_get64(llend), - iommu_idx); - - ret = memory_region_iommu_set_page_size_mask(giommu->iommu_mr, - container->pgsizes, - &err); - if (ret) { - g_free(giommu); - goto fail; - } - - ret = memory_region_register_iommu_notifier(section->mr, &giommu->n, - &err); - if (ret) { - g_free(giommu); - goto fail; - } - QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next); - memory_region_iommu_replay(giommu->iommu_mr, &giommu->n); - - return; - } - - /* Here we assume that memory_region_is_ram(section->mr)==true */ - - /* - * For RAM memory regions with a RamDiscardManager, we only want to map the - * actually populated parts - and update the mapping whenever we're notified - * about changes. 
- */ - if (memory_region_has_ram_discard_manager(section->mr)) { - vfio_register_ram_discard_listener(container, section); - return; - } - - vaddr = memory_region_get_ram_ptr(section->mr) + - section->offset_within_region + - (iova - section->offset_within_address_space); - - trace_vfio_listener_region_add_ram(iova, end, vaddr); - - llsize = int128_sub(llend, int128_make64(iova)); - - if (memory_region_is_ram_device(section->mr)) { - hwaddr pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1; - - if ((iova & pgmask) || (int128_get64(llsize) & pgmask)) { - trace_vfio_listener_region_add_no_dma_map( - memory_region_name(section->mr), - section->offset_within_address_space, - int128_getlo(section->size), - pgmask + 1); - return; - } - } - - ret = vfio_dma_map(container, iova, int128_get64(llsize), - vaddr, section->readonly); - if (ret) { - error_setg(&err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", " - "0x%"HWADDR_PRIx", %p) = %d (%m)", - container, iova, int128_get64(llsize), vaddr, ret); - if (memory_region_is_ram_device(section->mr)) { - /* Allow unexpected mappings not to be fatal for RAM devices */ - error_report_err(err); - return; - } - goto fail; - } - - return; - -fail: - if (memory_region_is_ram_device(section->mr)) { - error_report("failed to vfio_dma_map. pci p2p may not work"); - return; - } - /* - * On the initfn path, store the first error in the container so we - * can gracefully fail. Runtime, there's not much we can do other - * than throw a hardware error. 
- */ - if (!container->initialized) { - if (!container->error) { - error_propagate_prepend(&container->error, err, - "Region %s: ", - memory_region_name(section->mr)); - } else { - error_free(err); - } - } else { - error_report_err(err); - hw_error("vfio: DMA mapping failed, unable to continue"); - } -} - -static void vfio_listener_region_del(MemoryListener *listener, - MemoryRegionSection *section) -{ - VFIOContainer *container = container_of(listener, VFIOContainer, listener); - hwaddr iova, end; - Int128 llend, llsize; - int ret; - bool try_unmap = true; - - if (vfio_listener_skipped_section(section)) { - trace_vfio_listener_region_del_skip( - section->offset_within_address_space, - section->offset_within_address_space + - int128_get64(int128_sub(section->size, int128_one()))); - return; - } - - if (unlikely((section->offset_within_address_space & - ~qemu_real_host_page_mask) != - (section->offset_within_region & ~qemu_real_host_page_mask))) { - error_report("%s received unaligned region", __func__); - return; - } - - if (memory_region_is_iommu(section->mr)) { - VFIOGuestIOMMU *giommu; - - QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) { - if (MEMORY_REGION(giommu->iommu_mr) == section->mr && - giommu->n.start == section->offset_within_region) { - memory_region_unregister_iommu_notifier(section->mr, - &giommu->n); - QLIST_REMOVE(giommu, giommu_next); - g_free(giommu); - break; - } - } - - /* - * FIXME: We assume the one big unmap below is adequate to - * remove any individual page mappings in the IOMMU which - * might have been copied into VFIO. This works for a page table - * based IOMMU where a big unmap flattens a large range of IO-PTEs. - * That may not be true for all IOMMU types. 
- */ - } - - iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space); - llend = int128_make64(section->offset_within_address_space); - llend = int128_add(llend, section->size); - llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask)); - - if (int128_ge(int128_make64(iova), llend)) { - return; - } - end = int128_get64(int128_sub(llend, int128_one())); - - llsize = int128_sub(llend, int128_make64(iova)); - - trace_vfio_listener_region_del(iova, end); - - if (memory_region_is_ram_device(section->mr)) { - hwaddr pgmask; - VFIOHostDMAWindow *hostwin; - bool hostwin_found = false; - - QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) { - if (hostwin->min_iova <= iova && end <= hostwin->max_iova) { - hostwin_found = true; - break; - } - } - assert(hostwin_found); /* or region_add() would have failed */ - - pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1; - try_unmap = !((iova & pgmask) || (int128_get64(llsize) & pgmask)); - } else if (memory_region_has_ram_discard_manager(section->mr)) { - vfio_unregister_ram_discard_listener(container, section); - /* Unregistering will trigger an unmap. */ - try_unmap = false; - } - - if (try_unmap) { - if (int128_eq(llsize, int128_2_64())) { - /* The unmap ioctl doesn't accept a full 64-bit span. 
*/ - llsize = int128_rshift(llsize, 1); - ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL); - if (ret) { - error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", " - "0x%"HWADDR_PRIx") = %d (%m)", - container, iova, int128_get64(llsize), ret); - } - iova += int128_get64(llsize); - } - ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL); - if (ret) { - error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", " - "0x%"HWADDR_PRIx") = %d (%m)", - container, iova, int128_get64(llsize), ret); - } - } - - memory_region_unref(section->mr); - - if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) { - vfio_spapr_remove_window(container, - section->offset_within_address_space); - if (vfio_host_win_del(container, - section->offset_within_address_space, - section->offset_within_address_space + - int128_get64(section->size) - 1) < 0) { - hw_error("%s: Cannot delete missing window at %"HWADDR_PRIx, - __func__, section->offset_within_address_space); - } - } -} - -static void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start) -{ - int ret; - struct vfio_iommu_type1_dirty_bitmap dirty = { - .argsz = sizeof(dirty), - }; - - if (start) { - dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START; - } else { - dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP; - } - - ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty); - if (ret) { - error_report("Failed to set dirty tracking flag 0x%x errno: %d", - dirty.flags, errno); - } -} - -static void vfio_listener_log_global_start(MemoryListener *listener) -{ - VFIOContainer *container = container_of(listener, VFIOContainer, listener); - - vfio_set_dirty_page_tracking(container, true); -} - -static void vfio_listener_log_global_stop(MemoryListener *listener) -{ - VFIOContainer *container = container_of(listener, VFIOContainer, listener); - - vfio_set_dirty_page_tracking(container, false); -} - -static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova, - uint64_t size, ram_addr_t ram_addr) -{ - 
struct vfio_iommu_type1_dirty_bitmap *dbitmap; - struct vfio_iommu_type1_dirty_bitmap_get *range; - uint64_t pages; - int ret; - - dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range)); - - dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range); - dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP; - range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data; - range->iova = iova; - range->size = size; - - /* - * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of - * qemu_real_host_page_size to mark those dirty. Hence set bitmap's pgsize - * to qemu_real_host_page_size. - */ - range->bitmap.pgsize = qemu_real_host_page_size; - - pages = REAL_HOST_PAGE_ALIGN(range->size) / qemu_real_host_page_size; - range->bitmap.size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) / - BITS_PER_BYTE; - range->bitmap.data = g_try_malloc0(range->bitmap.size); - if (!range->bitmap.data) { - ret = -ENOMEM; - goto err_out; - } - - ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap); - if (ret) { - error_report("Failed to get dirty bitmap for iova: 0x%"PRIx64 - " size: 0x%"PRIx64" err: %d", (uint64_t)range->iova, - (uint64_t)range->size, errno); - goto err_out; - } - - cpu_physical_memory_set_dirty_lebitmap((unsigned long *)range->bitmap.data, - ram_addr, pages); - - trace_vfio_get_dirty_bitmap(container->fd, range->iova, range->size, - range->bitmap.size, ram_addr); -err_out: - g_free(range->bitmap.data); - g_free(dbitmap); - - return ret; -} - -typedef struct { - IOMMUNotifier n; - VFIOGuestIOMMU *giommu; -} vfio_giommu_dirty_notifier; - -static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) -{ - vfio_giommu_dirty_notifier *gdn = container_of(n, - vfio_giommu_dirty_notifier, n); - VFIOGuestIOMMU *giommu = gdn->giommu; - VFIOContainer *container = giommu->container; - hwaddr iova = iotlb->iova + giommu->iommu_offset; - ram_addr_t translated_addr; - - trace_vfio_iommu_map_dirty_notify(iova, iova + iotlb->addr_mask); - - if 
(iotlb->target_as != &address_space_memory) { - error_report("Wrong target AS \"%s\", only system memory is allowed", - iotlb->target_as->name ? iotlb->target_as->name : "none"); - return; - } - - rcu_read_lock(); - if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL)) { - int ret; - - ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1, - translated_addr); - if (ret) { - error_report("vfio_iommu_map_dirty_notify(%p, 0x%"HWADDR_PRIx", " - "0x%"HWADDR_PRIx") = %d (%m)", - container, iova, - iotlb->addr_mask + 1, ret); - } - } - rcu_read_unlock(); -} - -static int vfio_ram_discard_get_dirty_bitmap(MemoryRegionSection *section, - void *opaque) -{ - const hwaddr size = int128_get64(section->size); - const hwaddr iova = section->offset_within_address_space; - const ram_addr_t ram_addr = memory_region_get_ram_addr(section->mr) + - section->offset_within_region; - VFIORamDiscardListener *vrdl = opaque; - - /* - * Sync the whole mapped region (spanning multiple individual mappings) - * in one go. - */ - return vfio_get_dirty_bitmap(vrdl->container, iova, size, ram_addr); -} - -static int vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainer *container, - MemoryRegionSection *section) -{ - RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr); - VFIORamDiscardListener *vrdl = NULL; - - QLIST_FOREACH(vrdl, &container->vrdl_list, next) { - if (vrdl->mr == section->mr && - vrdl->offset_within_address_space == - section->offset_within_address_space) { - break; - } - } - - if (!vrdl) { - hw_error("vfio: Trying to sync missing RAM discard listener"); - } - - /* - * We only want/can synchronize the bitmap for actually mapped parts - - * which correspond to populated parts. Replay all populated parts. 
- */ - return ram_discard_manager_replay_populated(rdm, section, - vfio_ram_discard_get_dirty_bitmap, - &vrdl); -} - -static int vfio_sync_dirty_bitmap(VFIOContainer *container, - MemoryRegionSection *section) -{ - ram_addr_t ram_addr; - - if (memory_region_is_iommu(section->mr)) { - VFIOGuestIOMMU *giommu; - - QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) { - if (MEMORY_REGION(giommu->iommu_mr) == section->mr && - giommu->n.start == section->offset_within_region) { - Int128 llend; - vfio_giommu_dirty_notifier gdn = { .giommu = giommu }; - int idx = memory_region_iommu_attrs_to_index(giommu->iommu_mr, - MEMTXATTRS_UNSPECIFIED); - - llend = int128_add(int128_make64(section->offset_within_region), - section->size); - llend = int128_sub(llend, int128_one()); - - iommu_notifier_init(&gdn.n, - vfio_iommu_map_dirty_notify, - IOMMU_NOTIFIER_MAP, - section->offset_within_region, - int128_get64(llend), - idx); - memory_region_iommu_replay(giommu->iommu_mr, &gdn.n); - break; - } - } - return 0; - } else if (memory_region_has_ram_discard_manager(section->mr)) { - return vfio_sync_ram_discard_listener_dirty_bitmap(container, section); - } - - ram_addr = memory_region_get_ram_addr(section->mr) + - section->offset_within_region; - - return vfio_get_dirty_bitmap(container, - REAL_HOST_PAGE_ALIGN(section->offset_within_address_space), - int128_get64(section->size), ram_addr); -} - -static void vfio_listener_log_sync(MemoryListener *listener, - MemoryRegionSection *section) -{ - VFIOContainer *container = container_of(listener, VFIOContainer, listener); - - if (vfio_listener_skipped_section(section) || - !container->dirty_pages_supported) { - return; - } - - if (vfio_devices_all_dirty_tracking(container)) { - vfio_sync_dirty_bitmap(container, section); - } -} - -static const MemoryListener vfio_memory_listener = { - .name = "vfio", - .region_add = vfio_listener_region_add, - .region_del = vfio_listener_region_del, - .log_global_start = vfio_listener_log_global_start, 
- .log_global_stop = vfio_listener_log_global_stop, - .log_sync = vfio_listener_log_sync, -}; - -static void vfio_listener_release(VFIOContainer *container) -{ - memory_listener_unregister(&container->listener); - if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) { - memory_listener_unregister(&container->prereg_listener); - } -} - -static struct vfio_info_cap_header * -vfio_get_cap(void *ptr, uint32_t cap_offset, uint16_t id) -{ - struct vfio_info_cap_header *hdr; - - for (hdr = ptr + cap_offset; hdr != ptr; hdr = ptr + hdr->next) { - if (hdr->id == id) { - return hdr; - } - } - - return NULL; -} - -struct vfio_info_cap_header * -vfio_get_region_info_cap(struct vfio_region_info *info, uint16_t id) -{ - if (!(info->flags & VFIO_REGION_INFO_FLAG_CAPS)) { - return NULL; - } - - return vfio_get_cap((void *)info, info->cap_offset, id); -} - -static struct vfio_info_cap_header * -vfio_get_iommu_type1_info_cap(struct vfio_iommu_type1_info *info, uint16_t id) -{ - if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) { - return NULL; - } - - return vfio_get_cap((void *)info, info->cap_offset, id); -} - -struct vfio_info_cap_header * -vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id) -{ - if (!(info->flags & VFIO_DEVICE_FLAGS_CAPS)) { - return NULL; - } - - return vfio_get_cap((void *)info, info->cap_offset, id); -} - -bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info, - unsigned int *avail) -{ - struct vfio_info_cap_header *hdr; - struct vfio_iommu_type1_info_dma_avail *cap; - - /* If the capability cannot be found, assume no DMA limiting */ - hdr = vfio_get_iommu_type1_info_cap(info, - VFIO_IOMMU_TYPE1_INFO_DMA_AVAIL); - if (hdr == NULL) { - return false; - } - - if (avail != NULL) { - cap = (void *) hdr; - *avail = cap->avail; - } - - return true; -} - -static int vfio_setup_region_sparse_mmaps(VFIORegion *region, - struct vfio_region_info *info) -{ - struct vfio_info_cap_header *hdr; - struct vfio_region_info_cap_sparse_mmap *sparse; - int i, 
j; - - hdr = vfio_get_region_info_cap(info, VFIO_REGION_INFO_CAP_SPARSE_MMAP); - if (!hdr) { - return -ENODEV; - } - - sparse = container_of(hdr, struct vfio_region_info_cap_sparse_mmap, header); - - trace_vfio_region_sparse_mmap_header(region->vbasedev->name, - region->nr, sparse->nr_areas); - - region->mmaps = g_new0(VFIOMmap, sparse->nr_areas); - - for (i = 0, j = 0; i < sparse->nr_areas; i++) { - trace_vfio_region_sparse_mmap_entry(i, sparse->areas[i].offset, - sparse->areas[i].offset + - sparse->areas[i].size); - - if (sparse->areas[i].size) { - region->mmaps[j].offset = sparse->areas[i].offset; - region->mmaps[j].size = sparse->areas[i].size; - j++; - } - } - - region->nr_mmaps = j; - region->mmaps = g_realloc(region->mmaps, j * sizeof(VFIOMmap)); - - return 0; -} - -int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region, - int index, const char *name) -{ - struct vfio_region_info *info; - int ret; - - ret = vfio_get_region_info(vbasedev, index, &info); - if (ret) { - return ret; - } - - region->vbasedev = vbasedev; - region->flags = info->flags; - region->size = info->size; - region->fd_offset = info->offset; - region->nr = index; - - if (region->size) { - region->mem = g_new0(MemoryRegion, 1); - memory_region_init_io(region->mem, obj, &vfio_region_ops, - region, name, region->size); - - if (!vbasedev->no_mmap && - region->flags & VFIO_REGION_INFO_FLAG_MMAP) { - - ret = vfio_setup_region_sparse_mmaps(region, info); - - if (ret) { - region->nr_mmaps = 1; - region->mmaps = g_new0(VFIOMmap, region->nr_mmaps); - region->mmaps[0].offset = 0; - region->mmaps[0].size = region->size; - } - } - } - - g_free(info); - - trace_vfio_region_setup(vbasedev->name, index, name, - region->flags, region->fd_offset, region->size); - return 0; -} - -static void vfio_subregion_unmap(VFIORegion *region, int index) -{ - trace_vfio_region_unmap(memory_region_name(&region->mmaps[index].mem), - region->mmaps[index].offset, - region->mmaps[index].offset + - 
region->mmaps[index].size - 1); - memory_region_del_subregion(region->mem, &region->mmaps[index].mem); - munmap(region->mmaps[index].mmap, region->mmaps[index].size); - object_unparent(OBJECT(&region->mmaps[index].mem)); - region->mmaps[index].mmap = NULL; -} - -int vfio_region_mmap(VFIORegion *region) -{ - int i, prot = 0; - char *name; - - if (!region->mem) { - return 0; - } - - prot |= region->flags & VFIO_REGION_INFO_FLAG_READ ? PROT_READ : 0; - prot |= region->flags & VFIO_REGION_INFO_FLAG_WRITE ? PROT_WRITE : 0; - - for (i = 0; i < region->nr_mmaps; i++) { - region->mmaps[i].mmap = mmap(NULL, region->mmaps[i].size, prot, - MAP_SHARED, region->vbasedev->fd, - region->fd_offset + - region->mmaps[i].offset); - if (region->mmaps[i].mmap == MAP_FAILED) { - int ret = -errno; - - trace_vfio_region_mmap_fault(memory_region_name(region->mem), i, - region->fd_offset + - region->mmaps[i].offset, - region->fd_offset + - region->mmaps[i].offset + - region->mmaps[i].size - 1, ret); - - region->mmaps[i].mmap = NULL; - - for (i--; i >= 0; i--) { - vfio_subregion_unmap(region, i); - } - - return ret; - } - - name = g_strdup_printf("%s mmaps[%d]", - memory_region_name(region->mem), i); - memory_region_init_ram_device_ptr(&region->mmaps[i].mem, - memory_region_owner(region->mem), - name, region->mmaps[i].size, - region->mmaps[i].mmap); - g_free(name); - memory_region_add_subregion(region->mem, region->mmaps[i].offset, - &region->mmaps[i].mem); - - trace_vfio_region_mmap(memory_region_name(&region->mmaps[i].mem), - region->mmaps[i].offset, - region->mmaps[i].offset + - region->mmaps[i].size - 1); - } - - return 0; -} - -void vfio_region_unmap(VFIORegion *region) -{ - int i; - - if (!region->mem) { - return; - } - - for (i = 0; i < region->nr_mmaps; i++) { - if (region->mmaps[i].mmap) { - vfio_subregion_unmap(region, i); - } - } -} - -void vfio_region_exit(VFIORegion *region) -{ - int i; - - if (!region->mem) { - return; - } - - for (i = 0; i < region->nr_mmaps; i++) { - if 
(region->mmaps[i].mmap) { - memory_region_del_subregion(region->mem, &region->mmaps[i].mem); - } - } - - trace_vfio_region_exit(region->vbasedev->name, region->nr); -} - -void vfio_region_finalize(VFIORegion *region) -{ - int i; - - if (!region->mem) { - return; - } - - for (i = 0; i < region->nr_mmaps; i++) { - if (region->mmaps[i].mmap) { - munmap(region->mmaps[i].mmap, region->mmaps[i].size); - object_unparent(OBJECT(&region->mmaps[i].mem)); - } - } - - object_unparent(OBJECT(region->mem)); - - g_free(region->mem); - g_free(region->mmaps); - - trace_vfio_region_finalize(region->vbasedev->name, region->nr); - - region->mem = NULL; - region->mmaps = NULL; - region->nr_mmaps = 0; - region->size = 0; - region->flags = 0; - region->nr = 0; -} - -void vfio_region_mmaps_set_enabled(VFIORegion *region, bool enabled) -{ - int i; - - if (!region->mem) { - return; - } - - for (i = 0; i < region->nr_mmaps; i++) { - if (region->mmaps[i].mmap) { - memory_region_set_enabled(&region->mmaps[i].mem, enabled); - } - } - - trace_vfio_region_mmaps_set_enabled(memory_region_name(region->mem), - enabled); -} - -void vfio_reset_handler(void *opaque) -{ - VFIOGroup *group; - VFIODevice *vbasedev; - - QLIST_FOREACH(group, &vfio_group_list, next) { - QLIST_FOREACH(vbasedev, &group->device_list, next) { - if (vbasedev->dev->realized) { - vbasedev->ops->vfio_compute_needs_reset(vbasedev); - } - } - } - - QLIST_FOREACH(group, &vfio_group_list, next) { - QLIST_FOREACH(vbasedev, &group->device_list, next) { - if (vbasedev->dev->realized && vbasedev->needs_reset) { - vbasedev->ops->vfio_hot_reset_multi(vbasedev); - } - } - } -} - -static void vfio_kvm_device_add_group(VFIOGroup *group) -{ -#ifdef CONFIG_KVM - struct kvm_device_attr attr = { - .group = KVM_DEV_VFIO_GROUP, - .attr = KVM_DEV_VFIO_GROUP_ADD, - .addr = (uint64_t)(unsigned long)&group->fd, - }; - - if (!kvm_enabled()) { - return; - } - - if (vfio_kvm_device_fd < 0) { - struct kvm_create_device cd = { - .type = KVM_DEV_TYPE_VFIO, - }; - - if 
(kvm_vm_ioctl(kvm_state, KVM_CREATE_DEVICE, &cd)) { - error_report("Failed to create KVM VFIO device: %m"); - return; - } - - vfio_kvm_device_fd = cd.fd; - } - - if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) { - error_report("Failed to add group %d to KVM VFIO device: %m", - group->groupid); - } -#endif -} - -static void vfio_kvm_device_del_group(VFIOGroup *group) -{ -#ifdef CONFIG_KVM - struct kvm_device_attr attr = { - .group = KVM_DEV_VFIO_GROUP, - .attr = KVM_DEV_VFIO_GROUP_DEL, - .addr = (uint64_t)(unsigned long)&group->fd, - }; - - if (vfio_kvm_device_fd < 0) { - return; - } - - if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) { - error_report("Failed to remove group %d from KVM VFIO device: %m", - group->groupid); - } -#endif -} - -static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as) -{ - VFIOAddressSpace *space; - - QLIST_FOREACH(space, &vfio_address_spaces, list) { - if (space->as == as) { - return space; - } - } - - /* No suitable VFIOAddressSpace, create a new one */ - space = g_malloc0(sizeof(*space)); - space->as = as; - QLIST_INIT(&space->containers); - - QLIST_INSERT_HEAD(&vfio_address_spaces, space, list); - - return space; -} - -static void vfio_put_address_space(VFIOAddressSpace *space) -{ - if (QLIST_EMPTY(&space->containers)) { - QLIST_REMOVE(space, list); - g_free(space); - } -} - -/* - * vfio_get_iommu_type - selects the richest iommu_type (v2 first) - */ -static int vfio_get_iommu_type(VFIOContainer *container, - Error **errp) -{ - int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU, - VFIO_SPAPR_TCE_v2_IOMMU, VFIO_SPAPR_TCE_IOMMU }; - int i; - - for (i = 0; i < ARRAY_SIZE(iommu_types); i++) { - if (ioctl(container->fd, VFIO_CHECK_EXTENSION, iommu_types[i])) { - return iommu_types[i]; - } - } - error_setg(errp, "No available IOMMU models"); - return -EINVAL; -} - -static int vfio_init_container(VFIOContainer *container, int group_fd, - Error **errp) -{ - int iommu_type, ret; - - iommu_type = 
vfio_get_iommu_type(container, errp); - if (iommu_type < 0) { - return iommu_type; - } - - ret = ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container->fd); - if (ret) { - error_setg_errno(errp, errno, "Failed to set group container"); - return -errno; - } - - while (ioctl(container->fd, VFIO_SET_IOMMU, iommu_type)) { - if (iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) { - /* - * On sPAPR, despite the IOMMU subdriver always advertises v1 and - * v2, the running platform may not support v2 and there is no - * way to guess it until an IOMMU group gets added to the container. - * So in case it fails with v2, try v1 as a fallback. - */ - iommu_type = VFIO_SPAPR_TCE_IOMMU; - continue; - } - error_setg_errno(errp, errno, "Failed to set iommu for container"); - return -errno; - } - - container->iommu_type = iommu_type; - return 0; -} - -static int vfio_get_iommu_info(VFIOContainer *container, - struct vfio_iommu_type1_info **info) -{ - - size_t argsz = sizeof(struct vfio_iommu_type1_info); - - *info = g_new0(struct vfio_iommu_type1_info, 1); -again: - (*info)->argsz = argsz; - - if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) { - g_free(*info); - *info = NULL; - return -errno; - } - - if (((*info)->argsz > argsz)) { - argsz = (*info)->argsz; - *info = g_realloc(*info, argsz); - goto again; - } - - return 0; -} - -static struct vfio_info_cap_header * -vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id) -{ - struct vfio_info_cap_header *hdr; - void *ptr = info; - - if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) { - return NULL; - } - - for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) { - if (hdr->id == id) { - return hdr; - } - } - - return NULL; -} - -static void vfio_get_iommu_info_migration(VFIOContainer *container, - struct vfio_iommu_type1_info *info) -{ - struct vfio_info_cap_header *hdr; - struct vfio_iommu_type1_info_cap_migration *cap_mig; - - hdr = vfio_get_iommu_info_cap(info, VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION); - if (!hdr) { 
+ if (!region->mem) { return; } - cap_mig = container_of(hdr, struct vfio_iommu_type1_info_cap_migration, - header); - - /* - * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of - * qemu_real_host_page_size to mark those dirty. - */ - if (cap_mig->pgsize_bitmap & qemu_real_host_page_size) { - container->dirty_pages_supported = true; - container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size; - container->dirty_pgsizes = cap_mig->pgsize_bitmap; - } -} - -static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, - Error **errp) -{ - VFIOContainer *container; - int ret, fd; - VFIOAddressSpace *space; - - space = vfio_get_address_space(as); - - /* - * VFIO is currently incompatible with discarding of RAM insofar as the - * madvise to purge (zap) the page from QEMU's address space does not - * interact with the memory API and therefore leaves stale virtual to - * physical mappings in the IOMMU if the page was previously pinned. We - * therefore set discarding broken for each group added to a container, - * whether the container is used individually or shared. This provides - * us with options to allow devices within a group to opt-in and allow - * discarding, so long as it is done consistently for a group (for instance - * if the device is an mdev device where it is known that the host vendor - * driver will never pin pages outside of the working set of the guest - * driver, which would thus not be discarding candidates). - * - * The first opportunity to induce pinning occurs here where we attempt to - * attach the group to existing containers within the AddressSpace. If any - * pages are already zapped from the virtual address space, such as from - * previous discards, new pinning will cause valid mappings to be - * re-established. Likewise, when the overall MemoryListener for a new - * container is registered, a replay of mappings within the AddressSpace - * will occur, re-establishing any previously zapped pages as well. 
- * - * Especially virtio-balloon is currently only prevented from discarding - * new memory, it will not yet set ram_block_discard_set_required() and - * therefore, neither stops us here or deals with the sudden memory - * consumption of inflated memory. - * - * We do support discarding of memory coordinated via the RamDiscardManager - * with some IOMMU types. vfio_ram_block_discard_disable() handles the - * details once we know which type of IOMMU we are using. - */ - - QLIST_FOREACH(container, &space->containers, next) { - if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) { - ret = vfio_ram_block_discard_disable(container, true); - if (ret) { - error_setg_errno(errp, -ret, - "Cannot set discarding of RAM broken"); - if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, - &container->fd)) { - error_report("vfio: error disconnecting group %d from" - " container", group->groupid); - } - return ret; - } - group->container = container; - QLIST_INSERT_HEAD(&container->group_list, group, container_next); - vfio_kvm_device_add_group(group); - return 0; - } - } - - fd = qemu_open_old("/dev/vfio/vfio", O_RDWR); - if (fd < 0) { - error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio"); - ret = -errno; - goto put_space_exit; - } - - ret = ioctl(fd, VFIO_GET_API_VERSION); - if (ret != VFIO_API_VERSION) { - error_setg(errp, "supported vfio version: %d, " - "reported version: %d", VFIO_API_VERSION, ret); - ret = -EINVAL; - goto close_fd_exit; - } - - container = g_malloc0(sizeof(*container)); - container->space = space; - container->fd = fd; - container->error = NULL; - container->dirty_pages_supported = false; - container->dma_max_mappings = 0; - QLIST_INIT(&container->giommu_list); - QLIST_INIT(&container->hostwin_list); - QLIST_INIT(&container->vrdl_list); - - ret = vfio_init_container(container, group->fd, errp); - if (ret) { - goto free_container_exit; - } - - ret = vfio_ram_block_discard_disable(container, true); - if (ret) { - error_setg_errno(errp, 
-ret, "Cannot set discarding of RAM broken"); - goto free_container_exit; - } - - switch (container->iommu_type) { - case VFIO_TYPE1v2_IOMMU: - case VFIO_TYPE1_IOMMU: - { - struct vfio_iommu_type1_info *info; - - /* - * FIXME: This assumes that a Type1 IOMMU can map any 64-bit - * IOVA whatsoever. That's not actually true, but the current - * kernel interface doesn't tell us what it can map, and the - * existing Type1 IOMMUs generally support any IOVA we're - * going to actually try in practice. - */ - ret = vfio_get_iommu_info(container, &info); - - if (ret || !(info->flags & VFIO_IOMMU_INFO_PGSIZES)) { - /* Assume 4k IOVA page size */ - info->iova_pgsizes = 4096; - } - vfio_host_win_add(container, 0, (hwaddr)-1, info->iova_pgsizes); - container->pgsizes = info->iova_pgsizes; - - /* The default in the kernel ("dma_entry_limit") is 65535. */ - container->dma_max_mappings = 65535; - if (!ret) { - vfio_get_info_dma_avail(info, &container->dma_max_mappings); - vfio_get_iommu_info_migration(container, info); - } - g_free(info); - break; - } - case VFIO_SPAPR_TCE_v2_IOMMU: - case VFIO_SPAPR_TCE_IOMMU: - { - struct vfio_iommu_spapr_tce_info info; - bool v2 = container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU; - - /* - * The host kernel code implementing VFIO_IOMMU_DISABLE is called - * when container fd is closed so we do not call it explicitly - * in this file. 
- */ - if (!v2) { - ret = ioctl(fd, VFIO_IOMMU_ENABLE); - if (ret) { - error_setg_errno(errp, errno, "failed to enable container"); - ret = -errno; - goto enable_discards_exit; - } - } else { - container->prereg_listener = vfio_prereg_listener; - - memory_listener_register(&container->prereg_listener, - &address_space_memory); - if (container->error) { - memory_listener_unregister(&container->prereg_listener); - ret = -1; - error_propagate_prepend(errp, container->error, - "RAM memory listener initialization failed: "); - goto enable_discards_exit; - } - } - - info.argsz = sizeof(info); - ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info); - if (ret) { - error_setg_errno(errp, errno, - "VFIO_IOMMU_SPAPR_TCE_GET_INFO failed"); - ret = -errno; - if (v2) { - memory_listener_unregister(&container->prereg_listener); - } - goto enable_discards_exit; - } - - if (v2) { - container->pgsizes = info.ddw.pgsizes; - /* - * There is a default window in just created container. - * To make region_add/del simpler, we better remove this - * window now and let those iommu_listener callbacks - * create/remove them when needed. 
- */ - ret = vfio_spapr_remove_window(container, info.dma32_window_start); - if (ret) { - error_setg_errno(errp, -ret, - "failed to remove existing window"); - goto enable_discards_exit; - } - } else { - /* The default table uses 4K pages */ - container->pgsizes = 0x1000; - vfio_host_win_add(container, info.dma32_window_start, - info.dma32_window_start + - info.dma32_window_size - 1, - 0x1000); + for (i = 0; i < region->nr_mmaps; i++) { + if (region->mmaps[i].mmap) { + munmap(region->mmaps[i].mmap, region->mmaps[i].size); + object_unparent(OBJECT(&region->mmaps[i].mem)); } } - } - - vfio_kvm_device_add_group(group); - - QLIST_INIT(&container->group_list); - QLIST_INSERT_HEAD(&space->containers, container, next); - - group->container = container; - QLIST_INSERT_HEAD(&container->group_list, group, container_next); - - container->listener = vfio_memory_listener; - - memory_listener_register(&container->listener, container->space->as); - - if (container->error) { - ret = -1; - error_propagate_prepend(errp, container->error, - "memory listener initialization failed: "); - goto listener_release_exit; - } - - container->initialized = true; - - return 0; -listener_release_exit: - QLIST_REMOVE(group, container_next); - QLIST_REMOVE(container, next); - vfio_kvm_device_del_group(group); - vfio_listener_release(container); - -enable_discards_exit: - vfio_ram_block_discard_disable(container, false); - -free_container_exit: - g_free(container); - -close_fd_exit: - close(fd); - -put_space_exit: - vfio_put_address_space(space); - - return ret; -} - -static void vfio_disconnect_container(VFIOGroup *group) -{ - VFIOContainer *container = group->container; - - QLIST_REMOVE(group, container_next); - group->container = NULL; - - /* - * Explicitly release the listener first before unset container, - * since unset may destroy the backend container if it's the last - * group. 
- */ - if (QLIST_EMPTY(&container->group_list)) { - vfio_listener_release(container); - } - - if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) { - error_report("vfio: error disconnecting group %d from container", - group->groupid); - } - - if (QLIST_EMPTY(&container->group_list)) { - VFIOAddressSpace *space = container->space; - VFIOGuestIOMMU *giommu, *tmp; - VFIOHostDMAWindow *hostwin, *next; - QLIST_REMOVE(container, next); - - QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) { - memory_region_unregister_iommu_notifier( - MEMORY_REGION(giommu->iommu_mr), &giommu->n); - QLIST_REMOVE(giommu, giommu_next); - g_free(giommu); - } + object_unparent(OBJECT(region->mem)); - QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next, - next) { - QLIST_REMOVE(hostwin, hostwin_next); - g_free(hostwin); - } + g_free(region->mem); + g_free(region->mmaps); - trace_vfio_disconnect_container(container->fd); - close(container->fd); - g_free(container); + trace_vfio_region_finalize(region->vbasedev->name, region->nr); - vfio_put_address_space(space); - } + region->mem = NULL; + region->mmaps = NULL; + region->nr_mmaps = 0; + region->size = 0; + region->flags = 0; + region->nr = 0; } -VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp) +void vfio_region_mmaps_set_enabled(VFIORegion *region, bool enabled) { - VFIOGroup *group; - char path[32]; - struct vfio_group_status status = { .argsz = sizeof(status) }; - - QLIST_FOREACH(group, &vfio_group_list, next) { - if (group->groupid == groupid) { - /* Found it. Now is it already in the right context? 
*/ - if (group->container->space->as == as) { - return group; - } else { - error_setg(errp, "group %d used in multiple address spaces", - group->groupid); - return NULL; - } - } - } - - group = g_malloc0(sizeof(*group)); - - snprintf(path, sizeof(path), "/dev/vfio/%d", groupid); - group->fd = qemu_open_old(path, O_RDWR); - if (group->fd < 0) { - error_setg_errno(errp, errno, "failed to open %s", path); - goto free_group_exit; - } - - if (ioctl(group->fd, VFIO_GROUP_GET_STATUS, &status)) { - error_setg_errno(errp, errno, "failed to get group %d status", groupid); - goto close_fd_exit; - } - - if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) { - error_setg(errp, "group %d is not viable", groupid); - error_append_hint(errp, - "Please ensure all devices within the iommu_group " - "are bound to their vfio bus driver.\n"); - goto close_fd_exit; - } - - group->groupid = groupid; - QLIST_INIT(&group->device_list); - - if (vfio_connect_container(group, as, errp)) { - error_prepend(errp, "failed to setup container for group %d: ", - groupid); - goto close_fd_exit; - } - - if (QLIST_EMPTY(&vfio_group_list)) { - qemu_register_reset(vfio_reset_handler, NULL); - } - - QLIST_INSERT_HEAD(&vfio_group_list, group, next); - - return group; - -close_fd_exit: - close(group->fd); - -free_group_exit: - g_free(group); - - return NULL; -} + int i; -void vfio_put_group(VFIOGroup *group) -{ - if (!group || !QLIST_EMPTY(&group->device_list)) { + if (!region->mem) { return; } - if (!group->ram_block_discard_allowed) { - vfio_ram_block_discard_disable(group->container, false); - } - vfio_kvm_device_del_group(group); - vfio_disconnect_container(group); - QLIST_REMOVE(group, next); - trace_vfio_put_group(group->fd); - close(group->fd); - g_free(group); - - if (QLIST_EMPTY(&vfio_group_list)) { - qemu_unregister_reset(vfio_reset_handler, NULL); - } -} - -int vfio_get_device(VFIOGroup *group, const char *name, - VFIODevice *vbasedev, Error **errp) -{ - struct vfio_device_info dev_info = { .argsz = 
sizeof(dev_info) }; - int ret, fd; - - fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name); - if (fd < 0) { - error_setg_errno(errp, errno, "error getting device from group %d", - group->groupid); - error_append_hint(errp, - "Verify all devices in group %d are bound to vfio- " - "or pci-stub and not already in use\n", group->groupid); - return fd; - } - - ret = ioctl(fd, VFIO_DEVICE_GET_INFO, &dev_info); - if (ret) { - error_setg_errno(errp, errno, "error getting device info"); - close(fd); - return ret; - } - - /* - * Set discarding of RAM as not broken for this group if the driver knows - * the device operates compatibly with discarding. Setting must be - * consistent per group, but since compatibility is really only possible - * with mdev currently, we expect singleton groups. - */ - if (vbasedev->ram_block_discard_allowed != - group->ram_block_discard_allowed) { - if (!QLIST_EMPTY(&group->device_list)) { - error_setg(errp, "Inconsistent setting of support for discarding " - "RAM (e.g., balloon) within group"); - close(fd); - return -1; - } - - if (!group->ram_block_discard_allowed) { - group->ram_block_discard_allowed = true; - vfio_ram_block_discard_disable(group->container, false); + for (i = 0; i < region->nr_mmaps; i++) { + if (region->mmaps[i].mmap) { + memory_region_set_enabled(&region->mmaps[i].mem, enabled); } } - vbasedev->fd = fd; - vbasedev->group = group; - QLIST_INSERT_HEAD(&group->device_list, vbasedev, next); - - vbasedev->num_irqs = dev_info.num_irqs; - vbasedev->num_regions = dev_info.num_regions; - vbasedev->flags = dev_info.flags; - - trace_vfio_get_device(name, dev_info.flags, dev_info.num_regions, - dev_info.num_irqs); - - vbasedev->reset_works = !!(dev_info.flags & VFIO_DEVICE_FLAGS_RESET); - return 0; -} - -void vfio_put_base_device(VFIODevice *vbasedev) -{ - if (!vbasedev->group) { - return; - } - QLIST_REMOVE(vbasedev, next); - vbasedev->group = NULL; - trace_vfio_put_base_device(vbasedev->fd); - close(vbasedev->fd); + 
trace_vfio_region_mmaps_set_enabled(memory_region_name(region->mem), + enabled); } int vfio_get_region_info(VFIODevice *vbasedev, int index, @@ -2499,98 +628,3 @@ bool vfio_has_region_cap(VFIODevice *vbasedev, int region, uint16_t cap_type) return ret; } - -/* - * Interfaces for IBM EEH (Enhanced Error Handling) - */ -static bool vfio_eeh_container_ok(VFIOContainer *container) -{ - /* - * As of 2016-03-04 (linux-4.5) the host kernel EEH/VFIO - * implementation is broken if there are multiple groups in a - * container. The hardware works in units of Partitionable - * Endpoints (== IOMMU groups) and the EEH operations naively - * iterate across all groups in the container, without any logic - * to make sure the groups have their state synchronized. For - * certain operations (ENABLE) that might be ok, until an error - * occurs, but for others (GET_STATE) it's clearly broken. - */ - - /* - * XXX Once fixed kernels exist, test for them here - */ - - if (QLIST_EMPTY(&container->group_list)) { - return false; - } - - if (QLIST_NEXT(QLIST_FIRST(&container->group_list), container_next)) { - return false; - } - - return true; -} - -static int vfio_eeh_container_op(VFIOContainer *container, uint32_t op) -{ - struct vfio_eeh_pe_op pe_op = { - .argsz = sizeof(pe_op), - .op = op, - }; - int ret; - - if (!vfio_eeh_container_ok(container)) { - error_report("vfio/eeh: EEH_PE_OP 0x%x: " - "kernel requires a container with exactly one group", op); - return -EPERM; - } - - ret = ioctl(container->fd, VFIO_EEH_PE_OP, &pe_op); - if (ret < 0) { - error_report("vfio/eeh: EEH_PE_OP 0x%x failed: %m", op); - return -errno; - } - - return ret; -} - -static VFIOContainer *vfio_eeh_as_container(AddressSpace *as) -{ - VFIOAddressSpace *space = vfio_get_address_space(as); - VFIOContainer *container = NULL; - - if (QLIST_EMPTY(&space->containers)) { - /* No containers to act on */ - goto out; - } - - container = QLIST_FIRST(&space->containers); - - if (QLIST_NEXT(container, next)) { - /* We don't 
yet have logic to synchronize EEH state across - * multiple containers */ - container = NULL; - goto out; - } - -out: - vfio_put_address_space(space); - return container; -} - -bool vfio_eeh_as_ok(AddressSpace *as) -{ - VFIOContainer *container = vfio_eeh_as_container(as); - - return (container != NULL) && vfio_eeh_container_ok(container); -} - -int vfio_eeh_as_op(AddressSpace *as, uint32_t op) -{ - VFIOContainer *container = vfio_eeh_as_container(as); - - if (!container) { - return -ENODEV; - } - return vfio_eeh_container_op(container, op); -} diff --git a/hw/vfio/container.c b/hw/vfio/container.c new file mode 100644 index 0000000000..9c665c1720 --- /dev/null +++ b/hw/vfio/container.c @@ -0,0 +1,1193 @@ +/* + * generic functions used by VFIO devices + * + * Copyright Red Hat, Inc. 2012 + * + * Authors: + * Alex Williamson + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + * + * Based on qemu-kvm device-assignment: + * Adapted for KVM by Qumranet. + * Copyright (c) 2007, Neocleus, Alex Novik (alex@neocleus.com) + * Copyright (c) 2007, Neocleus, Guy Zana (guy@neocleus.com) + * Copyright (C) 2008, Qumranet, Amit Shah (amit.shah@qumranet.com) + * Copyright (C) 2008, Red Hat, Amit Shah (amit.shah@redhat.com) + * Copyright (C) 2008, IBM, Muli Ben-Yehuda (muli@il.ibm.com) + */ + +#include "qemu/osdep.h" +#include +#ifdef CONFIG_KVM +#include +#endif +#include + +#include "hw/vfio/vfio-common.h" +#include "hw/vfio/vfio.h" +#include "exec/address-spaces.h" +#include "exec/memory.h" +#include "exec/ram_addr.h" +#include "hw/hw.h" +#include "qemu/error-report.h" +#include "qemu/range.h" +#include "sysemu/kvm.h" +#include "sysemu/reset.h" +#include "trace.h" +#include "qapi/error.h" +#include "migration/migration.h" + +#ifdef CONFIG_KVM +/* + * We have a single VFIO pseudo device per KVM VM. Once created it lives + * for the life of the VM. 
Closing the file descriptor only drops our + * reference to it and the device's reference to kvm. Therefore once + * initialized, this file descriptor is only released on QEMU exit and + * we'll re-use it should another vfio device be attached before then. + */ +static int vfio_kvm_device_fd = -1; +#endif + +VFIOGroupList vfio_group_list = + QLIST_HEAD_INITIALIZER(vfio_group_list); + +/* + * Device state interfaces + */ + +bool vfio_mig_active(void) +{ + VFIOGroup *group; + VFIODevice *vbasedev; + + if (QLIST_EMPTY(&vfio_group_list)) { + return false; + } + + QLIST_FOREACH(group, &vfio_group_list, next) { + QLIST_FOREACH(vbasedev, &group->device_list, next) { + if (vbasedev->migration_blocker) { + return false; + } + } + } + return true; +} + +bool vfio_devices_all_dirty_tracking(VFIOContainer *container) +{ + VFIOGroup *group; + VFIODevice *vbasedev; + MigrationState *ms = migrate_get_current(); + + if (!migration_is_setup_or_active(ms->state)) { + return false; + } + + QLIST_FOREACH(group, &container->group_list, container_next) { + QLIST_FOREACH(vbasedev, &group->device_list, next) { + VFIOMigration *migration = vbasedev->migration; + + if (!migration) { + return false; + } + + if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) + && (migration->device_state & VFIO_DEVICE_STATE_RUNNING)) { + return false; + } + } + } + return true; +} + +bool vfio_devices_all_running_and_saving(VFIOContainer *container) +{ + VFIOGroup *group; + VFIODevice *vbasedev; + MigrationState *ms = migrate_get_current(); + + if (!migration_is_setup_or_active(ms->state)) { + return false; + } + + QLIST_FOREACH(group, &container->group_list, container_next) { + QLIST_FOREACH(vbasedev, &group->device_list, next) { + VFIOMigration *migration = vbasedev->migration; + + if (!migration) { + return false; + } + + if ((migration->device_state & VFIO_DEVICE_STATE_SAVING) && + (migration->device_state & VFIO_DEVICE_STATE_RUNNING)) { + continue; + } else { + return false; + } + } + } + 
return true; +} + +static int vfio_dma_unmap_bitmap(VFIOContainer *container, + hwaddr iova, ram_addr_t size, + IOMMUTLBEntry *iotlb) +{ + struct vfio_iommu_type1_dma_unmap *unmap; + struct vfio_bitmap *bitmap; + uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size; + int ret; + + unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap)); + + unmap->argsz = sizeof(*unmap) + sizeof(*bitmap); + unmap->iova = iova; + unmap->size = size; + unmap->flags |= VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP; + bitmap = (struct vfio_bitmap *)&unmap->data; + + /* + * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of + * qemu_real_host_page_size to mark those dirty. Hence set bitmap_pgsize + * to qemu_real_host_page_size. + */ + + bitmap->pgsize = qemu_real_host_page_size; + bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) / + BITS_PER_BYTE; + + if (bitmap->size > container->max_dirty_bitmap_size) { + error_report("UNMAP: Size of bitmap too big 0x%"PRIx64, + (uint64_t)bitmap->size); + ret = -E2BIG; + goto unmap_exit; + } + + bitmap->data = g_try_malloc0(bitmap->size); + if (!bitmap->data) { + ret = -ENOMEM; + goto unmap_exit; + } + + ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap); + if (!ret) { + cpu_physical_memory_set_dirty_lebitmap((unsigned long *)bitmap->data, + iotlb->translated_addr, pages); + } else { + error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m"); + } + + g_free(bitmap->data); +unmap_exit: + g_free(unmap); + return ret; +} + +/* + * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86 + */ +int vfio_dma_unmap(VFIOContainer *container, + hwaddr iova, ram_addr_t size, + IOMMUTLBEntry *iotlb) +{ + struct vfio_iommu_type1_dma_unmap unmap = { + .argsz = sizeof(unmap), + .flags = 0, + .iova = iova, + .size = size, + }; + + if (iotlb && container->dirty_pages_supported && + vfio_devices_all_running_and_saving(container)) { + return vfio_dma_unmap_bitmap(container, iova, size, iotlb); + } + + while 
(ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) { + /* + * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c + * v4.15) where an overflow in its wrap-around check prevents us from + * unmapping the last page of the address space. Test for the error + * condition and re-try the unmap excluding the last page. The + * expectation is that we've never mapped the last page anyway and this + * unmap request comes via vIOMMU support which also makes it unlikely + * that this page is used. This bug was introduced well after type1 v2 + * support was introduced, so we shouldn't need to test for v1. A fix + * is queued for kernel v5.0 so this workaround can be removed once + * affected kernels are sufficiently deprecated. + */ + if (errno == EINVAL && unmap.size && !(unmap.iova + unmap.size) && + container->iommu_type == VFIO_TYPE1v2_IOMMU) { + trace_vfio_dma_unmap_overflow_workaround(); + unmap.size -= 1ULL << ctz64(container->pgsizes); + continue; + } + error_report("VFIO_UNMAP_DMA failed: %s", strerror(errno)); + return -errno; + } + + return 0; +} + +int vfio_dma_map(VFIOContainer *container, hwaddr iova, + ram_addr_t size, void *vaddr, bool readonly) +{ + struct vfio_iommu_type1_dma_map map = { + .argsz = sizeof(map), + .flags = VFIO_DMA_MAP_FLAG_READ, + .vaddr = (__u64)(uintptr_t)vaddr, + .iova = iova, + .size = size, + }; + + if (!readonly) { + map.flags |= VFIO_DMA_MAP_FLAG_WRITE; + } + + /* + * Try the mapping, if it fails with EBUSY, unmap the region and try + * again. This shouldn't be necessary, but we sometimes see it in + * the VGA ROM space. 
+ */ + if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 || + (errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 && + ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) { + return 0; + } + + error_report("VFIO_MAP_DMA failed: %s", strerror(errno)); + return -errno; +} + +void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start) +{ + int ret; + struct vfio_iommu_type1_dirty_bitmap dirty = { + .argsz = sizeof(dirty), + }; + + if (start) { + dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START; + } else { + dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP; + } + + ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty); + if (ret) { + error_report("Failed to set dirty tracking flag 0x%x errno: %d", + dirty.flags, errno); + } +} + +int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova, + uint64_t size, ram_addr_t ram_addr) +{ + struct vfio_iommu_type1_dirty_bitmap *dbitmap; + struct vfio_iommu_type1_dirty_bitmap_get *range; + uint64_t pages; + int ret; + + dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range)); + + dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range); + dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP; + range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data; + range->iova = iova; + range->size = size; + + /* + * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of + * qemu_real_host_page_size to mark those dirty. Hence set bitmap's pgsize + * to qemu_real_host_page_size. 
+ */ + range->bitmap.pgsize = qemu_real_host_page_size; + + pages = REAL_HOST_PAGE_ALIGN(range->size) / qemu_real_host_page_size; + range->bitmap.size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) / + BITS_PER_BYTE; + range->bitmap.data = g_try_malloc0(range->bitmap.size); + if (!range->bitmap.data) { + ret = -ENOMEM; + goto err_out; + } + + ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap); + if (ret) { + error_report("Failed to get dirty bitmap for iova: 0x%"PRIx64 + " size: 0x%"PRIx64" err: %d", (uint64_t)range->iova, + (uint64_t)range->size, errno); + goto err_out; + } + + cpu_physical_memory_set_dirty_lebitmap((unsigned long *)range->bitmap.data, + ram_addr, pages); + + trace_vfio_get_dirty_bitmap(container->fd, range->iova, range->size, + range->bitmap.size, ram_addr); +err_out: + g_free(range->bitmap.data); + g_free(dbitmap); + + return ret; +} + +static void vfio_listener_release(VFIOContainer *container) +{ + memory_listener_unregister(&container->listener); + if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) { + memory_listener_unregister(&container->prereg_listener); + } +} + +int vfio_container_add_section_window(VFIOContainer *container, + MemoryRegionSection *section, + Error **errp) +{ + VFIOHostDMAWindow *hostwin; + hwaddr pgsize = 0; + int ret; + + if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) { + return 0; + } + + /* For now intersections are not allowed, we may relax this later */ + QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) { + if (ranges_overlap(hostwin->min_iova, + hostwin->max_iova - hostwin->min_iova + 1, + section->offset_within_address_space, + int128_get64(section->size))) { + error_setg(errp, + "region [0x%"PRIx64",0x%"PRIx64"] overlaps with existing" + "host DMA window [0x%"PRIx64",0x%"PRIx64"]", + section->offset_within_address_space, + section->offset_within_address_space + + int128_get64(section->size) - 1, + hostwin->min_iova, hostwin->max_iova); + return -1; + } + } + + ret = 
vfio_spapr_create_window(container, section, &pgsize); + if (ret) { + error_setg_errno(errp, -ret, "Failed to create SPAPR window"); + return ret; + } + + vfio_host_win_add(container, section->offset_within_address_space, + section->offset_within_address_space + + int128_get64(section->size) - 1, pgsize); +#ifdef CONFIG_KVM + if (kvm_enabled()) { + VFIOGroup *group; + IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr); + struct kvm_vfio_spapr_tce param; + struct kvm_device_attr attr = { + .group = KVM_DEV_VFIO_GROUP, + .attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE, + .addr = (uint64_t)(unsigned long)&param, + }; + + if (!memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_SPAPR_TCE_FD, + &param.tablefd)) { + QLIST_FOREACH(group, &container->group_list, container_next) { + param.groupfd = group->fd; + if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) { + error_report("vfio: failed to setup fd %d " + "for a group with fd %d: %s", + param.tablefd, param.groupfd, + strerror(errno)); + return -1; + } + trace_vfio_spapr_group_attach(param.groupfd, param.tablefd); + } + } + } +#endif + return 0; +} + +void vfio_container_del_section_window(VFIOContainer *container, + MemoryRegionSection *section) +{ + if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) { + return; + } + + vfio_spapr_remove_window(container, + section->offset_within_address_space); + if (vfio_host_win_del(container, + section->offset_within_address_space, + section->offset_within_address_space + + int128_get64(section->size) - 1) < 0) { + hw_error("%s: Cannot delete missing window at %"HWADDR_PRIx, + __func__, section->offset_within_address_space); + } +} + +void vfio_reset_handler(void *opaque) +{ + VFIOGroup *group; + VFIODevice *vbasedev; + + QLIST_FOREACH(group, &vfio_group_list, next) { + QLIST_FOREACH(vbasedev, &group->device_list, next) { + if (vbasedev->dev->realized) { + vbasedev->ops->vfio_compute_needs_reset(vbasedev); + } + } + } + + QLIST_FOREACH(group, &vfio_group_list, next) { + 
QLIST_FOREACH(vbasedev, &group->device_list, next) { + if (vbasedev->dev->realized && vbasedev->needs_reset) { + vbasedev->ops->vfio_hot_reset_multi(vbasedev); + } + } + } +} + +static void vfio_kvm_device_add_group(VFIOGroup *group) +{ +#ifdef CONFIG_KVM + struct kvm_device_attr attr = { + .group = KVM_DEV_VFIO_GROUP, + .attr = KVM_DEV_VFIO_GROUP_ADD, + .addr = (uint64_t)(unsigned long)&group->fd, + }; + + if (!kvm_enabled()) { + return; + } + + if (vfio_kvm_device_fd < 0) { + struct kvm_create_device cd = { + .type = KVM_DEV_TYPE_VFIO, + }; + + if (kvm_vm_ioctl(kvm_state, KVM_CREATE_DEVICE, &cd)) { + error_report("Failed to create KVM VFIO device: %m"); + return; + } + + vfio_kvm_device_fd = cd.fd; + } + + if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) { + error_report("Failed to add group %d to KVM VFIO device: %m", + group->groupid); + } +#endif +} + +static void vfio_kvm_device_del_group(VFIOGroup *group) +{ +#ifdef CONFIG_KVM + struct kvm_device_attr attr = { + .group = KVM_DEV_VFIO_GROUP, + .attr = KVM_DEV_VFIO_GROUP_DEL, + .addr = (uint64_t)(unsigned long)&group->fd, + }; + + if (vfio_kvm_device_fd < 0) { + return; + } + + if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) { + error_report("Failed to remove group %d from KVM VFIO device: %m", + group->groupid); + } +#endif +} + +/* + * vfio_get_iommu_type - selects the richest iommu_type (v2 first) + */ +static int vfio_get_iommu_type(VFIOContainer *container, + Error **errp) +{ + int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU, + VFIO_SPAPR_TCE_v2_IOMMU, VFIO_SPAPR_TCE_IOMMU }; + int i; + + for (i = 0; i < ARRAY_SIZE(iommu_types); i++) { + if (ioctl(container->fd, VFIO_CHECK_EXTENSION, iommu_types[i])) { + return iommu_types[i]; + } + } + error_setg(errp, "No available IOMMU models"); + return -EINVAL; +} + +static int vfio_init_container(VFIOContainer *container, int group_fd, + Error **errp) +{ + int iommu_type, ret; + + iommu_type = vfio_get_iommu_type(container, errp); + 
if (iommu_type < 0) { + return iommu_type; + } + + ret = ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container->fd); + if (ret) { + error_setg_errno(errp, errno, "Failed to set group container"); + return -errno; + } + + while (ioctl(container->fd, VFIO_SET_IOMMU, iommu_type)) { + if (iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) { + /* + * On sPAPR, despite the IOMMU subdriver always advertises v1 and + * v2, the running platform may not support v2 and there is no + * way to guess it until an IOMMU group gets added to the container. + * So in case it fails with v2, try v1 as a fallback. + */ + iommu_type = VFIO_SPAPR_TCE_IOMMU; + continue; + } + error_setg_errno(errp, errno, "Failed to set iommu for container"); + return -errno; + } + + container->iommu_type = iommu_type; + return 0; +} + +static int vfio_get_iommu_info(VFIOContainer *container, + struct vfio_iommu_type1_info **info) +{ + + size_t argsz = sizeof(struct vfio_iommu_type1_info); + + *info = g_new0(struct vfio_iommu_type1_info, 1); +again: + (*info)->argsz = argsz; + + if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) { + g_free(*info); + *info = NULL; + return -errno; + } + + if (((*info)->argsz > argsz)) { + argsz = (*info)->argsz; + *info = g_realloc(*info, argsz); + goto again; + } + + return 0; +} + +static struct vfio_info_cap_header * +vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id) +{ + struct vfio_info_cap_header *hdr; + void *ptr = info; + + if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) { + return NULL; + } + + for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) { + if (hdr->id == id) { + return hdr; + } + } + + return NULL; +} + +static void vfio_get_iommu_info_migration(VFIOContainer *container, + struct vfio_iommu_type1_info *info) +{ + struct vfio_info_cap_header *hdr; + struct vfio_iommu_type1_info_cap_migration *cap_mig; + + hdr = vfio_get_iommu_info_cap(info, VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION); + if (!hdr) { + return; + } + + cap_mig = 
container_of(hdr, struct vfio_iommu_type1_info_cap_migration, + header); + + /* + * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of + * qemu_real_host_page_size to mark those dirty. + */ + if (cap_mig->pgsize_bitmap & qemu_real_host_page_size) { + container->dirty_pages_supported = true; + container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size; + container->dirty_pgsizes = cap_mig->pgsize_bitmap; + } +} + +static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state) +{ + switch (container->iommu_type) { + case VFIO_TYPE1v2_IOMMU: + case VFIO_TYPE1_IOMMU: + /* + * We support coordinated discarding of RAM via the RamDiscardManager. + */ + return ram_block_uncoordinated_discard_disable(state); + default: + /* + * VFIO_SPAPR_TCE_IOMMU most probably works just fine with + * RamDiscardManager, however, it is completely untested. + * + * VFIO_SPAPR_TCE_v2_IOMMU with "DMA memory preregistering" does + * completely the opposite of managing mapping/pinning dynamically as + * required by RamDiscardManager. We would have to special-case sections + * with a RamDiscardManager. + */ + return ram_block_discard_disable(state); + } +} + +static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, + Error **errp) +{ + VFIOContainer *container; + int ret, fd; + VFIOAddressSpace *space; + + space = vfio_get_address_space(as); + + /* + * VFIO is currently incompatible with discarding of RAM insofar as the + * madvise to purge (zap) the page from QEMU's address space does not + * interact with the memory API and therefore leaves stale virtual to + * physical mappings in the IOMMU if the page was previously pinned. We + * therefore set discarding broken for each group added to a container, + * whether the container is used individually or shared. 
This provides + * us with options to allow devices within a group to opt-in and allow + * discarding, so long as it is done consistently for a group (for instance + * if the device is an mdev device where it is known that the host vendor + * driver will never pin pages outside of the working set of the guest + * driver, which would thus not be discarding candidates). + * + * The first opportunity to induce pinning occurs here where we attempt to + * attach the group to existing containers within the AddressSpace. If any + * pages are already zapped from the virtual address space, such as from + * previous discards, new pinning will cause valid mappings to be + * re-established. Likewise, when the overall MemoryListener for a new + * container is registered, a replay of mappings within the AddressSpace + * will occur, re-establishing any previously zapped pages as well. + * + * Especially virtio-balloon is currently only prevented from discarding + * new memory, it will not yet set ram_block_discard_set_required() and + * therefore, neither stops us here or deals with the sudden memory + * consumption of inflated memory. + * + * We do support discarding of memory coordinated via the RamDiscardManager + * with some IOMMU types. vfio_ram_block_discard_disable() handles the + * details once we know which type of IOMMU we are using. 
+ */ + + QLIST_FOREACH(container, &space->containers, next) { + if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) { + ret = vfio_ram_block_discard_disable(container, true); + if (ret) { + error_setg_errno(errp, -ret, + "Cannot set discarding of RAM broken"); + if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, + &container->fd)) { + error_report("vfio: error disconnecting group %d from" + " container", group->groupid); + } + return ret; + } + group->container = container; + QLIST_INSERT_HEAD(&container->group_list, group, container_next); + vfio_kvm_device_add_group(group); + return 0; + } + } + + fd = qemu_open_old("/dev/vfio/vfio", O_RDWR); + if (fd < 0) { + error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio"); + ret = -errno; + goto put_space_exit; + } + + ret = ioctl(fd, VFIO_GET_API_VERSION); + if (ret != VFIO_API_VERSION) { + error_setg(errp, "supported vfio version: %d, " + "reported version: %d", VFIO_API_VERSION, ret); + ret = -EINVAL; + goto close_fd_exit; + } + + container = g_malloc0(sizeof(*container)); + container->space = space; + container->fd = fd; + container->error = NULL; + container->dirty_pages_supported = false; + container->dma_max_mappings = 0; + QLIST_INIT(&container->giommu_list); + QLIST_INIT(&container->hostwin_list); + QLIST_INIT(&container->vrdl_list); + + ret = vfio_init_container(container, group->fd, errp); + if (ret) { + goto free_container_exit; + } + + ret = vfio_ram_block_discard_disable(container, true); + if (ret) { + error_setg_errno(errp, -ret, "Cannot set discarding of RAM broken"); + goto free_container_exit; + } + + switch (container->iommu_type) { + case VFIO_TYPE1v2_IOMMU: + case VFIO_TYPE1_IOMMU: + { + struct vfio_iommu_type1_info *info; + + /* + * FIXME: This assumes that a Type1 IOMMU can map any 64-bit + * IOVA whatsoever. 
That's not actually true, but the current + * kernel interface doesn't tell us what it can map, and the + * existing Type1 IOMMUs generally support any IOVA we're + * going to actually try in practice. + */ + ret = vfio_get_iommu_info(container, &info); + + if (ret || !(info->flags & VFIO_IOMMU_INFO_PGSIZES)) { + /* Assume 4k IOVA page size */ + info->iova_pgsizes = 4096; + } + vfio_host_win_add(container, 0, (hwaddr)-1, info->iova_pgsizes); + container->pgsizes = info->iova_pgsizes; + + /* The default in the kernel ("dma_entry_limit") is 65535. */ + container->dma_max_mappings = 65535; + if (!ret) { + vfio_get_info_dma_avail(info, &container->dma_max_mappings); + vfio_get_iommu_info_migration(container, info); + } + g_free(info); + break; + } + case VFIO_SPAPR_TCE_v2_IOMMU: + case VFIO_SPAPR_TCE_IOMMU: + { + struct vfio_iommu_spapr_tce_info info; + bool v2 = container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU; + + /* + * The host kernel code implementing VFIO_IOMMU_DISABLE is called + * when container fd is closed so we do not call it explicitly + * in this file. 
+ */ + if (!v2) { + ret = ioctl(fd, VFIO_IOMMU_ENABLE); + if (ret) { + error_setg_errno(errp, errno, "failed to enable container"); + ret = -errno; + goto enable_discards_exit; + } + } else { + container->prereg_listener = vfio_prereg_listener; + + memory_listener_register(&container->prereg_listener, + &address_space_memory); + if (container->error) { + memory_listener_unregister(&container->prereg_listener); + ret = -1; + error_propagate_prepend(errp, container->error, + "RAM memory listener initialization failed: "); + goto enable_discards_exit; + } + } + + info.argsz = sizeof(info); + ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info); + if (ret) { + error_setg_errno(errp, errno, + "VFIO_IOMMU_SPAPR_TCE_GET_INFO failed"); + ret = -errno; + if (v2) { + memory_listener_unregister(&container->prereg_listener); + } + goto enable_discards_exit; + } + + if (v2) { + container->pgsizes = info.ddw.pgsizes; + /* + * There is a default window in just created container. + * To make region_add/del simpler, we better remove this + * window now and let those iommu_listener callbacks + * create/remove them when needed. 
+ */ + ret = vfio_spapr_remove_window(container, info.dma32_window_start); + if (ret) { + error_setg_errno(errp, -ret, + "failed to remove existing window"); + goto enable_discards_exit; + } + } else { + /* The default table uses 4K pages */ + container->pgsizes = 0x1000; + vfio_host_win_add(container, info.dma32_window_start, + info.dma32_window_start + + info.dma32_window_size - 1, + 0x1000); + } + } + } + + vfio_kvm_device_add_group(group); + + QLIST_INIT(&container->group_list); + QLIST_INSERT_HEAD(&space->containers, container, next); + + group->container = container; + QLIST_INSERT_HEAD(&container->group_list, group, container_next); + + container->listener = vfio_memory_listener; + + memory_listener_register(&container->listener, container->space->as); + + if (container->error) { + ret = -1; + error_propagate_prepend(errp, container->error, + "memory listener initialization failed: "); + goto listener_release_exit; + } + + container->initialized = true; + + return 0; +listener_release_exit: + QLIST_REMOVE(group, container_next); + QLIST_REMOVE(container, next); + vfio_kvm_device_del_group(group); + vfio_listener_release(container); + +enable_discards_exit: + vfio_ram_block_discard_disable(container, false); + +free_container_exit: + g_free(container); + +close_fd_exit: + close(fd); + +put_space_exit: + vfio_put_address_space(space); + + return ret; +} + +static void vfio_disconnect_container(VFIOGroup *group) +{ + VFIOContainer *container = group->container; + + QLIST_REMOVE(group, container_next); + group->container = NULL; + + /* + * Explicitly release the listener first before unset container, + * since unset may destroy the backend container if it's the last + * group. 
+ */ + if (QLIST_EMPTY(&container->group_list)) { + vfio_listener_release(container); + } + + if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) { + error_report("vfio: error disconnecting group %d from container", + group->groupid); + } + + if (QLIST_EMPTY(&container->group_list)) { + VFIOAddressSpace *space = container->space; + VFIOGuestIOMMU *giommu, *tmp; + VFIOHostDMAWindow *hostwin, *next; + + QLIST_REMOVE(container, next); + + QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) { + memory_region_unregister_iommu_notifier( + MEMORY_REGION(giommu->iommu_mr), &giommu->n); + QLIST_REMOVE(giommu, giommu_next); + g_free(giommu); + } + + QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next, + next) { + QLIST_REMOVE(hostwin, hostwin_next); + g_free(hostwin); + } + + trace_vfio_disconnect_container(container->fd); + close(container->fd); + g_free(container); + + vfio_put_address_space(space); + } +} + +VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp) +{ + VFIOGroup *group; + char path[32]; + struct vfio_group_status status = { .argsz = sizeof(status) }; + + QLIST_FOREACH(group, &vfio_group_list, next) { + if (group->groupid == groupid) { + /* Found it. Now is it already in the right context? 
*/ + if (group->container->space->as == as) { + return group; + } else { + error_setg(errp, "group %d used in multiple address spaces", + group->groupid); + return NULL; + } + } + } + + group = g_malloc0(sizeof(*group)); + + snprintf(path, sizeof(path), "/dev/vfio/%d", groupid); + group->fd = qemu_open_old(path, O_RDWR); + if (group->fd < 0) { + error_setg_errno(errp, errno, "failed to open %s", path); + goto free_group_exit; + } + + if (ioctl(group->fd, VFIO_GROUP_GET_STATUS, &status)) { + error_setg_errno(errp, errno, "failed to get group %d status", groupid); + goto close_fd_exit; + } + + if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) { + error_setg(errp, "group %d is not viable", groupid); + error_append_hint(errp, + "Please ensure all devices within the iommu_group " + "are bound to their vfio bus driver.\n"); + goto close_fd_exit; + } + + group->groupid = groupid; + QLIST_INIT(&group->device_list); + + if (vfio_connect_container(group, as, errp)) { + error_prepend(errp, "failed to setup container for group %d: ", + groupid); + goto close_fd_exit; + } + + if (QLIST_EMPTY(&vfio_group_list)) { + qemu_register_reset(vfio_reset_handler, NULL); + } + + QLIST_INSERT_HEAD(&vfio_group_list, group, next); + + return group; + +close_fd_exit: + close(group->fd); + +free_group_exit: + g_free(group); + + return NULL; +} + +void vfio_put_group(VFIOGroup *group) +{ + if (!group || !QLIST_EMPTY(&group->device_list)) { + return; + } + + if (!group->ram_block_discard_allowed) { + vfio_ram_block_discard_disable(group->container, false); + } + vfio_kvm_device_del_group(group); + vfio_disconnect_container(group); + QLIST_REMOVE(group, next); + trace_vfio_put_group(group->fd); + close(group->fd); + g_free(group); + + if (QLIST_EMPTY(&vfio_group_list)) { + qemu_unregister_reset(vfio_reset_handler, NULL); + } +} + +int vfio_get_device(VFIOGroup *group, const char *name, + VFIODevice *vbasedev, Error **errp) +{ + struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) }; + int 
ret, fd; + + fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name); + if (fd < 0) { + error_setg_errno(errp, errno, "error getting device from group %d", + group->groupid); + error_append_hint(errp, + "Verify all devices in group %d are bound to vfio- " + "or pci-stub and not already in use\n", group->groupid); + return fd; + } + + ret = ioctl(fd, VFIO_DEVICE_GET_INFO, &dev_info); + if (ret) { + error_setg_errno(errp, errno, "error getting device info"); + close(fd); + return ret; + } + + /* + * Set discarding of RAM as not broken for this group if the driver knows + * the device operates compatibly with discarding. Setting must be + * consistent per group, but since compatibility is really only possible + * with mdev currently, we expect singleton groups. + */ + if (vbasedev->ram_block_discard_allowed != + group->ram_block_discard_allowed) { + if (!QLIST_EMPTY(&group->device_list)) { + error_setg(errp, "Inconsistent setting of support for discarding " + "RAM (e.g., balloon) within group"); + close(fd); + return -1; + } + + if (!group->ram_block_discard_allowed) { + group->ram_block_discard_allowed = true; + vfio_ram_block_discard_disable(group->container, false); + } + } + + vbasedev->fd = fd; + vbasedev->group = group; + QLIST_INSERT_HEAD(&group->device_list, vbasedev, next); + + vbasedev->num_irqs = dev_info.num_irqs; + vbasedev->num_regions = dev_info.num_regions; + vbasedev->flags = dev_info.flags; + + trace_vfio_get_device(name, dev_info.flags, dev_info.num_regions, + dev_info.num_irqs); + + vbasedev->reset_works = !!(dev_info.flags & VFIO_DEVICE_FLAGS_RESET); + return 0; +} + +void vfio_put_base_device(VFIODevice *vbasedev) +{ + if (!vbasedev->group) { + return; + } + QLIST_REMOVE(vbasedev, next); + vbasedev->group = NULL; + trace_vfio_put_base_device(vbasedev->fd); + close(vbasedev->fd); +} + +/* FIXME: should below code be in common.c? 
*/ +/* + * Interfaces for IBM EEH (Enhanced Error Handling) + */ +static bool vfio_eeh_container_ok(VFIOContainer *container) +{ + /* + * As of 2016-03-04 (linux-4.5) the host kernel EEH/VFIO + * implementation is broken if there are multiple groups in a + * container. The hardware works in units of Partitionable + * Endpoints (== IOMMU groups) and the EEH operations naively + * iterate across all groups in the container, without any logic + * to make sure the groups have their state synchronized. For + * certain operations (ENABLE) that might be ok, until an error + * occurs, but for others (GET_STATE) it's clearly broken. + */ + + /* + * XXX Once fixed kernels exist, test for them here + */ + + if (QLIST_EMPTY(&container->group_list)) { + return false; + } + + if (QLIST_NEXT(QLIST_FIRST(&container->group_list), container_next)) { + return false; + } + + return true; +} + +static int vfio_eeh_container_op(VFIOContainer *container, uint32_t op) +{ + struct vfio_eeh_pe_op pe_op = { + .argsz = sizeof(pe_op), + .op = op, + }; + int ret; + + if (!vfio_eeh_container_ok(container)) { + error_report("vfio/eeh: EEH_PE_OP 0x%x: " + "kernel requires a container with exactly one group", op); + return -EPERM; + } + + ret = ioctl(container->fd, VFIO_EEH_PE_OP, &pe_op); + if (ret < 0) { + error_report("vfio/eeh: EEH_PE_OP 0x%x failed: %m", op); + return -errno; + } + + return ret; +} + +static VFIOContainer *vfio_eeh_as_container(AddressSpace *as) +{ + VFIOAddressSpace *space = vfio_get_address_space(as); + VFIOContainer *container = NULL; + + if (QLIST_EMPTY(&space->containers)) { + /* No containers to act on */ + goto out; + } + + container = QLIST_FIRST(&space->containers); + + if (QLIST_NEXT(container, next)) { + /* + * We don't yet have logic to synchronize EEH state across + * multiple containers. 
+         */
+        container = NULL;
+        goto out;
+    }
+
+out:
+    vfio_put_address_space(space);
+    return container;
+}
+
+bool vfio_eeh_as_ok(AddressSpace *as)
+{
+    VFIOContainer *container = vfio_eeh_as_container(as);
+
+    return (container != NULL) && vfio_eeh_container_ok(container);
+}
+
+int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
+{
+    VFIOContainer *container = vfio_eeh_as_container(as);
+
+    if (!container) {
+        return -ENODEV;
+    }
+    return vfio_eeh_container_op(container, op);
+}
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index da9af297a0..e3b6d6e2cb 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -1,6 +1,8 @@
 vfio_ss = ss.source_set()
 vfio_ss.add(files(
   'common.c',
+  'as.c',
+  'container.c',
   'spapr.c',
   'migration.c',
 ))
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index e573f5a9f1..03ff7944cb 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -33,6 +33,8 @@
 
 #define VFIO_MSG_PREFIX "vfio %s: "
 
+extern const MemoryListener vfio_memory_listener;
+
 enum {
     VFIO_DEVICE_TYPE_PCI = 0,
     VFIO_DEVICE_TYPE_PLATFORM = 1,
@@ -190,6 +192,32 @@ typedef struct VFIODisplay {
     } dmabuf;
 } VFIODisplay;
 
+void vfio_host_win_add(VFIOContainer *container,
+                       hwaddr min_iova, hwaddr max_iova,
+                       uint64_t iova_pgsizes);
+int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova,
+                      hwaddr max_iova);
+VFIOAddressSpace *vfio_get_address_space(AddressSpace *as);
+void vfio_put_address_space(VFIOAddressSpace *space);
+bool vfio_devices_all_running_and_saving(VFIOContainer *container);
+bool vfio_devices_all_dirty_tracking(VFIOContainer *container);
+
+/* container->fd */
+int vfio_dma_unmap(VFIOContainer *container,
+                   hwaddr iova, ram_addr_t size,
+                   IOMMUTLBEntry *iotlb);
+int vfio_dma_map(VFIOContainer *container, hwaddr iova,
+                 ram_addr_t size, void *vaddr, bool readonly);
+void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start);
+int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t
iova, + uint64_t size, ram_addr_t ram_addr); + +int vfio_container_add_section_window(VFIOContainer *container, + MemoryRegionSection *section, + Error **errp); +void vfio_container_del_section_window(VFIOContainer *container, + MemoryRegionSection *section); + void vfio_put_base_device(VFIODevice *vbasedev); void vfio_disable_irqindex(VFIODevice *vbasedev, int index); void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index); From patchwork Thu Apr 14 10:46:59 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yi Liu X-Patchwork-Id: 12813355 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id A93DCC433EF for ; Thu, 14 Apr 2022 10:58:40 +0000 (UTC) Received: from localhost ([::1]:44326 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nexBP-0004nT-NT for qemu-devel@archiver.kernel.org; Thu, 14 Apr 2022 06:58:39 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:55486) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nex0a-0002CR-45 for qemu-devel@nongnu.org; Thu, 14 Apr 2022 06:47:28 -0400 Received: from mga12.intel.com ([192.55.52.136]:34768) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nex0V-0005Ka-EF for qemu-devel@nongnu.org; Thu, 14 Apr 2022 06:47:27 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1649933243; x=1681469243; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=PgZexHJK77H62TwLn4DS+5IBrAOzu0/+IsI2BeEibz0=; 
From: Yi Liu
To: alex.williamson@redhat.com, cohuck@redhat.com, qemu-devel@nongnu.org
Subject: [RFC 07/18] vfio: Add base object for VFIOContainer
Date: Thu, 14 Apr 2022 03:46:59 -0700
Message-Id: <20220414104710.28534-8-yi.l.liu@intel.com>
In-Reply-To: <20220414104710.28534-1-yi.l.liu@intel.com>
References: <20220414104710.28534-1-yi.l.liu@intel.com>
Cc: akrowiak@linux.ibm.com, jjherne@linux.ibm.com, thuth@redhat.com, yi.l.liu@intel.com, kvm@vger.kernel.org,
mjrosato@linux.ibm.com, jasowang@redhat.com, farman@linux.ibm.com, peterx@redhat.com, pasic@linux.ibm.com, eric.auger@redhat.com, yi.y.sun@intel.com, chao.p.peng@intel.com, nicolinc@nvidia.com, kevin.tian@intel.com, jgg@nvidia.com, eric.auger.pro@gmail.com, david@gibson.dropbear.id.au
Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org
Sender: "Qemu-devel"

QOM-ify the VFIOContainer object so that it acts as a base class for
containers. The legacy VFIO container derives from this base class, as
will the new iommufd-based container later on.

The base class implements the generic code, such as the memory_listener
and address space management, whereas each derived class implements the
callbacks that depend on the kernel userspace interface in use.

'as.c' only manipulates the base class object, through wrapper functions
that call the right class functions. The existing 'container.c' code is
converted to implement the legacy container class functions.

The existing migration code only works with the legacy container, and
'spapr.c' is not BE (backend) agnostic either.

The resulting objects are shown below. The base object takes over the
name VFIOContainer; the old VFIOContainer is renamed to
VFIOLegacyContainer.

struct VFIOContainer {
    /* private */
    Object parent_obj;
    VFIOAddressSpace *space;
    MemoryListener listener;
    Error *error;
    bool initialized;
    bool dirty_pages_supported;
    uint64_t dirty_pgsizes;
    uint64_t max_dirty_bitmap_size;
    unsigned long pgsizes;
    unsigned int dma_max_mappings;
    QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
    QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
    QLIST_HEAD(, VFIORamDiscardListener) vrdl_list;
    QLIST_ENTRY(VFIOContainer) next;
};

struct VFIOLegacyContainer {
    VFIOContainer obj;
    int fd; /* /dev/vfio/vfio, empowered by the attached groups */
    MemoryListener prereg_listener;
    unsigned iommu_type;
    QLIST_HEAD(, VFIOGroup) group_list;
};

Co-authored-by: Eric Auger
Signed-off-by: Eric Auger
Signed-off-by: Yi Liu
---
 hw/vfio/as.c                         |  48 +++---
 hw/vfio/container-obj.c              | 195 +++++++++++++++++++++++
 hw/vfio/container.c                  | 224 ++++++++++++++++-----------
 hw/vfio/meson.build                  |   1 +
 hw/vfio/migration.c                  |   4 +-
 hw/vfio/pci.c                        |   4 +-
 hw/vfio/spapr.c                      |  22 +--
 include/hw/vfio/vfio-common.h        |  78 ++--------
 include/hw/vfio/vfio-container-obj.h | 154 ++++++++++++++++++
 9 files changed, 540 insertions(+), 190 deletions(-)
 create mode 100644 hw/vfio/container-obj.c
 create mode 100644 include/hw/vfio/vfio-container-obj.h

diff --git a/hw/vfio/as.c b/hw/vfio/as.c
index 4181182808..37423d2c89 100644
--- a/hw/vfio/as.c
+++ b/hw/vfio/as.c
@@ -215,9 +215,9 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
          * of vaddr will always be there, even if the memory object is
          * destroyed and its backing memory munmap-ed.
*/ - ret = vfio_dma_map(container, iova, - iotlb->addr_mask + 1, vaddr, - read_only); + ret = vfio_container_dma_map(container, iova, + iotlb->addr_mask + 1, vaddr, + read_only); if (ret) { error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", " "0x%"HWADDR_PRIx", %p) = %d (%m)", @@ -225,7 +225,8 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) iotlb->addr_mask + 1, vaddr, ret); } } else { - ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1, iotlb); + ret = vfio_container_dma_unmap(container, iova, + iotlb->addr_mask + 1, iotlb); if (ret) { error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", " "0x%"HWADDR_PRIx") = %d (%m)", @@ -242,12 +243,13 @@ static void vfio_ram_discard_notify_discard(RamDiscardListener *rdl, { VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener, listener); + VFIOContainer *container = vrdl->container; const hwaddr size = int128_get64(section->size); const hwaddr iova = section->offset_within_address_space; int ret; /* Unmap with a single call. 
*/ - ret = vfio_dma_unmap(vrdl->container, iova, size , NULL); + ret = vfio_container_dma_unmap(container, iova, size , NULL); if (ret) { error_report("%s: vfio_dma_unmap() failed: %s", __func__, strerror(-ret)); @@ -259,6 +261,7 @@ static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl, { VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener, listener); + VFIOContainer *container = vrdl->container; const hwaddr end = section->offset_within_region + int128_get64(section->size); hwaddr start, next, iova; @@ -277,8 +280,8 @@ static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl, section->offset_within_address_space; vaddr = memory_region_get_ram_ptr(section->mr) + start; - ret = vfio_dma_map(vrdl->container, iova, next - start, - vaddr, section->readonly); + ret = vfio_container_dma_map(container, iova, next - start, + vaddr, section->readonly); if (ret) { /* Rollback */ vfio_ram_discard_notify_discard(rdl, section); @@ -530,8 +533,8 @@ static void vfio_listener_region_add(MemoryListener *listener, } } - ret = vfio_dma_map(container, iova, int128_get64(llsize), - vaddr, section->readonly); + ret = vfio_container_dma_map(container, iova, int128_get64(llsize), + vaddr, section->readonly); if (ret) { error_setg(&err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", " "0x%"HWADDR_PRIx", %p) = %d (%m)", @@ -656,7 +659,8 @@ static void vfio_listener_region_del(MemoryListener *listener, if (int128_eq(llsize, int128_2_64())) { /* The unmap ioctl doesn't accept a full 64-bit span. 
*/ llsize = int128_rshift(llsize, 1); - ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL); + ret = vfio_container_dma_unmap(container, iova, + int128_get64(llsize), NULL); if (ret) { error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", " "0x%"HWADDR_PRIx") = %d (%m)", @@ -664,7 +668,8 @@ static void vfio_listener_region_del(MemoryListener *listener, } iova += int128_get64(llsize); } - ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL); + ret = vfio_container_dma_unmap(container, iova, + int128_get64(llsize), NULL); if (ret) { error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", " "0x%"HWADDR_PRIx") = %d (%m)", @@ -681,14 +686,14 @@ static void vfio_listener_log_global_start(MemoryListener *listener) { VFIOContainer *container = container_of(listener, VFIOContainer, listener); - vfio_set_dirty_page_tracking(container, true); + vfio_container_set_dirty_page_tracking(container, true); } static void vfio_listener_log_global_stop(MemoryListener *listener) { VFIOContainer *container = container_of(listener, VFIOContainer, listener); - vfio_set_dirty_page_tracking(container, false); + vfio_container_set_dirty_page_tracking(container, false); } typedef struct { @@ -717,8 +722,9 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL)) { int ret; - ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1, - translated_addr); + ret = vfio_container_get_dirty_bitmap(container, iova, + iotlb->addr_mask + 1, + translated_addr); if (ret) { error_report("vfio_iommu_map_dirty_notify(%p, 0x%"HWADDR_PRIx", " "0x%"HWADDR_PRIx") = %d (%m)", @@ -742,11 +748,13 @@ static int vfio_ram_discard_get_dirty_bitmap(MemoryRegionSection *section, * Sync the whole mapped region (spanning multiple individual mappings) * in one go. 
*/ - return vfio_get_dirty_bitmap(vrdl->container, iova, size, ram_addr); + return vfio_container_get_dirty_bitmap(vrdl->container, iova, + size, ram_addr); } -static int vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainer *container, - MemoryRegionSection *section) +static int +vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainer *container, + MemoryRegionSection *section) { RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr); VFIORamDiscardListener *vrdl = NULL; @@ -810,7 +818,7 @@ static int vfio_sync_dirty_bitmap(VFIOContainer *container, ram_addr = memory_region_get_ram_addr(section->mr) + section->offset_within_region; - return vfio_get_dirty_bitmap(container, + return vfio_container_get_dirty_bitmap(container, REAL_HOST_PAGE_ALIGN(section->offset_within_address_space), int128_get64(section->size), ram_addr); } @@ -825,7 +833,7 @@ static void vfio_listener_log_sync(MemoryListener *listener, return; } - if (vfio_devices_all_dirty_tracking(container)) { + if (vfio_container_devices_all_dirty_tracking(container)) { vfio_sync_dirty_bitmap(container, section); } } diff --git a/hw/vfio/container-obj.c b/hw/vfio/container-obj.c new file mode 100644 index 0000000000..40c1e2a2b5 --- /dev/null +++ b/hw/vfio/container-obj.c @@ -0,0 +1,195 @@ +/* + * VFIO CONTAINER BASE OBJECT + * + * Copyright (C) 2022 Intel Corporation. + * Copyright Red Hat, Inc. 2022 + * + * Authors: Yi Liu + * Eric Auger + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ + * You should have received a copy of the GNU General Public License along + * with this program; if not, see . + */ + +#include "qemu/osdep.h" +#include "qapi/error.h" +#include "qemu/error-report.h" +#include "qom/object.h" +#include "qapi/visitor.h" +#include "hw/vfio/vfio-container-obj.h" + +bool vfio_container_check_extension(VFIOContainer *container, + VFIOContainerFeature feat) +{ + VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container); + + if (!vccs->check_extension) { + return false; + } + + return vccs->check_extension(container, feat); +} + +int vfio_container_dma_map(VFIOContainer *container, + hwaddr iova, ram_addr_t size, + void *vaddr, bool readonly) +{ + VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container); + + if (!vccs->dma_map) { + return -EINVAL; + } + + return vccs->dma_map(container, iova, size, vaddr, readonly); +} + +int vfio_container_dma_unmap(VFIOContainer *container, + hwaddr iova, ram_addr_t size, + IOMMUTLBEntry *iotlb) +{ + VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container); + + vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container); + + if (!vccs->dma_unmap) { + return -EINVAL; + } + + return vccs->dma_unmap(container, iova, size, iotlb); +} + +void vfio_container_set_dirty_page_tracking(VFIOContainer *container, + bool start) +{ + VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container); + + if (!vccs->set_dirty_page_tracking) { + return; + } + + vccs->set_dirty_page_tracking(container, start); +} + +bool vfio_container_devices_all_dirty_tracking(VFIOContainer *container) +{ + VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container); + + if (!vccs->devices_all_dirty_tracking) { + return false; + } + + return vccs->devices_all_dirty_tracking(container); +} + +int vfio_container_get_dirty_bitmap(VFIOContainer *container, uint64_t iova, + uint64_t size, ram_addr_t ram_addr) +{ + VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container); + + if (!vccs->get_dirty_bitmap) 
{
+        return -EINVAL;
+    }
+
+    return vccs->get_dirty_bitmap(container, iova, size, ram_addr);
+}
+
+int vfio_container_add_section_window(VFIOContainer *container,
+                                      MemoryRegionSection *section,
+                                      Error **errp)
+{
+    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
+
+    if (!vccs->add_window) {
+        return 0;
+    }
+
+    return vccs->add_window(container, section, errp);
+}
+
+void vfio_container_del_section_window(VFIOContainer *container,
+                                       MemoryRegionSection *section)
+{
+    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
+
+    if (!vccs->del_window) {
+        return;
+    }
+
+    vccs->del_window(container, section);
+}
+
+void vfio_container_init(void *_container, size_t instance_size,
+                         const char *mrtypename,
+                         VFIOAddressSpace *space)
+{
+    VFIOContainer *container;
+
+    object_initialize(_container, instance_size, mrtypename);
+    container = VFIO_CONTAINER_OBJ(_container);
+
+    container->space = space;
+    container->error = NULL;
+    container->dirty_pages_supported = false;
+    container->dma_max_mappings = 0;
+    QLIST_INIT(&container->giommu_list);
+    QLIST_INIT(&container->hostwin_list);
+    QLIST_INIT(&container->vrdl_list);
+}
+
+void vfio_container_destroy(VFIOContainer *container)
+{
+    VFIORamDiscardListener *vrdl, *vrdl_tmp;
+    VFIOGuestIOMMU *giommu, *tmp;
+    VFIOHostDMAWindow *hostwin, *next;
+
+    QLIST_SAFE_REMOVE(container, next);
+
+    QLIST_FOREACH_SAFE(vrdl, &container->vrdl_list, next, vrdl_tmp) {
+        RamDiscardManager *rdm;
+
+        rdm = memory_region_get_ram_discard_manager(vrdl->mr);
+        ram_discard_manager_unregister_listener(rdm, &vrdl->listener);
+        QLIST_REMOVE(vrdl, next);
+        g_free(vrdl);
+    }
+
+    QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
+        memory_region_unregister_iommu_notifier(
+                MEMORY_REGION(giommu->iommu_mr), &giommu->n);
+        QLIST_REMOVE(giommu, giommu_next);
+        g_free(giommu);
+    }
+
+    QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next,
+                       next) {
+        QLIST_REMOVE(hostwin, hostwin_next);
+
g_free(hostwin); + } + + object_unref(&container->parent_obj); +} + +static const TypeInfo vfio_container_info = { + .parent = TYPE_OBJECT, + .name = TYPE_VFIO_CONTAINER_OBJ, + .class_size = sizeof(VFIOContainerClass), + .instance_size = sizeof(VFIOContainer), + .abstract = true, +}; + +static void vfio_container_register_types(void) +{ + type_register_static(&vfio_container_info); +} + +type_init(vfio_container_register_types) diff --git a/hw/vfio/container.c b/hw/vfio/container.c index 9c665c1720..79972064d3 100644 --- a/hw/vfio/container.c +++ b/hw/vfio/container.c @@ -50,6 +50,8 @@ static int vfio_kvm_device_fd = -1; #endif +#define TYPE_VFIO_LEGACY_CONTAINER "qemu:vfio-legacy-container" + VFIOGroupList vfio_group_list = QLIST_HEAD_INITIALIZER(vfio_group_list); @@ -76,8 +78,10 @@ bool vfio_mig_active(void) return true; } -bool vfio_devices_all_dirty_tracking(VFIOContainer *container) +static bool vfio_devices_all_dirty_tracking(VFIOContainer *bcontainer) { + VFIOLegacyContainer *container = container_of(bcontainer, + VFIOLegacyContainer, obj); VFIOGroup *group; VFIODevice *vbasedev; MigrationState *ms = migrate_get_current(); @@ -103,7 +107,7 @@ bool vfio_devices_all_dirty_tracking(VFIOContainer *container) return true; } -bool vfio_devices_all_running_and_saving(VFIOContainer *container) +static bool vfio_devices_all_running_and_saving(VFIOLegacyContainer *container) { VFIOGroup *group; VFIODevice *vbasedev; @@ -132,10 +136,11 @@ bool vfio_devices_all_running_and_saving(VFIOContainer *container) return true; } -static int vfio_dma_unmap_bitmap(VFIOContainer *container, +static int vfio_dma_unmap_bitmap(VFIOLegacyContainer *container, hwaddr iova, ram_addr_t size, IOMMUTLBEntry *iotlb) { + VFIOContainer *bcontainer = &container->obj; struct vfio_iommu_type1_dma_unmap *unmap; struct vfio_bitmap *bitmap; uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size; @@ -159,7 +164,7 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container, 
bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) / BITS_PER_BYTE; - if (bitmap->size > container->max_dirty_bitmap_size) { + if (bitmap->size > bcontainer->max_dirty_bitmap_size) { error_report("UNMAP: Size of bitmap too big 0x%"PRIx64, (uint64_t)bitmap->size); ret = -E2BIG; @@ -189,10 +194,12 @@ unmap_exit: /* * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86 */ -int vfio_dma_unmap(VFIOContainer *container, - hwaddr iova, ram_addr_t size, - IOMMUTLBEntry *iotlb) +static int vfio_dma_unmap(VFIOContainer *bcontainer, + hwaddr iova, ram_addr_t size, + IOMMUTLBEntry *iotlb) { + VFIOLegacyContainer *container = container_of(bcontainer, + VFIOLegacyContainer, obj); struct vfio_iommu_type1_dma_unmap unmap = { .argsz = sizeof(unmap), .flags = 0, @@ -200,7 +207,7 @@ int vfio_dma_unmap(VFIOContainer *container, .size = size, }; - if (iotlb && container->dirty_pages_supported && + if (iotlb && bcontainer->dirty_pages_supported && vfio_devices_all_running_and_saving(container)) { return vfio_dma_unmap_bitmap(container, iova, size, iotlb); } @@ -221,7 +228,7 @@ int vfio_dma_unmap(VFIOContainer *container, if (errno == EINVAL && unmap.size && !(unmap.iova + unmap.size) && container->iommu_type == VFIO_TYPE1v2_IOMMU) { trace_vfio_dma_unmap_overflow_workaround(); - unmap.size -= 1ULL << ctz64(container->pgsizes); + unmap.size -= 1ULL << ctz64(bcontainer->pgsizes); continue; } error_report("VFIO_UNMAP_DMA failed: %s", strerror(errno)); @@ -231,9 +238,22 @@ int vfio_dma_unmap(VFIOContainer *container, return 0; } -int vfio_dma_map(VFIOContainer *container, hwaddr iova, - ram_addr_t size, void *vaddr, bool readonly) +static bool vfio_legacy_container_check_extension(VFIOContainer *bcontainer, + VFIOContainerFeature feat) { + switch (feat) { + case VFIO_FEAT_LIVE_MIGRATION: + return true; + default: + return false; + }; +} + +static int vfio_dma_map(VFIOContainer *bcontainer, hwaddr iova, + ram_addr_t size, void *vaddr, bool readonly) +{ + 
VFIOLegacyContainer *container = container_of(bcontainer, + VFIOLegacyContainer, obj); struct vfio_iommu_type1_dma_map map = { .argsz = sizeof(map), .flags = VFIO_DMA_MAP_FLAG_READ, @@ -252,7 +272,7 @@ int vfio_dma_map(VFIOContainer *container, hwaddr iova, * the VGA ROM space. */ if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 || - (errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 && + (errno == EBUSY && vfio_dma_unmap(bcontainer, iova, size, NULL) == 0 && ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) { return 0; } @@ -261,8 +281,10 @@ int vfio_dma_map(VFIOContainer *container, hwaddr iova, return -errno; } -void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start) +static void vfio_set_dirty_page_tracking(VFIOContainer *bcontainer, bool start) { + VFIOLegacyContainer *container = container_of(bcontainer, + VFIOLegacyContainer, obj); int ret; struct vfio_iommu_type1_dirty_bitmap dirty = { .argsz = sizeof(dirty), @@ -281,9 +303,11 @@ void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start) } } -int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova, - uint64_t size, ram_addr_t ram_addr) +static int vfio_get_dirty_bitmap(VFIOContainer *bcontainer, uint64_t iova, + uint64_t size, ram_addr_t ram_addr) { + VFIOLegacyContainer *container = container_of(bcontainer, + VFIOLegacyContainer, obj); struct vfio_iommu_type1_dirty_bitmap *dbitmap; struct vfio_iommu_type1_dirty_bitmap_get *range; uint64_t pages; @@ -333,18 +357,23 @@ err_out: return ret; } -static void vfio_listener_release(VFIOContainer *container) +static void vfio_listener_release(VFIOLegacyContainer *container) { - memory_listener_unregister(&container->listener); + VFIOContainer *bcontainer = &container->obj; + + memory_listener_unregister(&bcontainer->listener); if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) { memory_listener_unregister(&container->prereg_listener); } } -int vfio_container_add_section_window(VFIOContainer 
*container, - MemoryRegionSection *section, - Error **errp) +static int +vfio_legacy_container_add_section_window(VFIOContainer *bcontainer, + MemoryRegionSection *section, + Error **errp) { + VFIOLegacyContainer *container = container_of(bcontainer, + VFIOLegacyContainer, obj); VFIOHostDMAWindow *hostwin; hwaddr pgsize = 0; int ret; @@ -354,7 +383,7 @@ int vfio_container_add_section_window(VFIOContainer *container, } /* For now intersections are not allowed, we may relax this later */ - QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) { + QLIST_FOREACH(hostwin, &bcontainer->hostwin_list, hostwin_next) { if (ranges_overlap(hostwin->min_iova, hostwin->max_iova - hostwin->min_iova + 1, section->offset_within_address_space, @@ -376,7 +405,7 @@ int vfio_container_add_section_window(VFIOContainer *container, return ret; } - vfio_host_win_add(container, section->offset_within_address_space, + vfio_host_win_add(bcontainer, section->offset_within_address_space, section->offset_within_address_space + int128_get64(section->size) - 1, pgsize); #ifdef CONFIG_KVM @@ -409,16 +438,20 @@ int vfio_container_add_section_window(VFIOContainer *container, return 0; } -void vfio_container_del_section_window(VFIOContainer *container, - MemoryRegionSection *section) +static void +vfio_legacy_container_del_section_window(VFIOContainer *bcontainer, + MemoryRegionSection *section) { + VFIOLegacyContainer *container = container_of(bcontainer, + VFIOLegacyContainer, obj); + if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) { return; } vfio_spapr_remove_window(container, section->offset_within_address_space); - if (vfio_host_win_del(container, + if (vfio_host_win_del(bcontainer, section->offset_within_address_space, section->offset_within_address_space + int128_get64(section->size) - 1) < 0) { @@ -505,7 +538,7 @@ static void vfio_kvm_device_del_group(VFIOGroup *group) /* * vfio_get_iommu_type - selects the richest iommu_type (v2 first) */ -static int 
vfio_get_iommu_type(VFIOContainer *container, +static int vfio_get_iommu_type(VFIOLegacyContainer *container, Error **errp) { int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU, @@ -521,7 +554,7 @@ static int vfio_get_iommu_type(VFIOContainer *container, return -EINVAL; } -static int vfio_init_container(VFIOContainer *container, int group_fd, +static int vfio_init_container(VFIOLegacyContainer *container, int group_fd, Error **errp) { int iommu_type, ret; @@ -556,7 +589,7 @@ static int vfio_init_container(VFIOContainer *container, int group_fd, return 0; } -static int vfio_get_iommu_info(VFIOContainer *container, +static int vfio_get_iommu_info(VFIOLegacyContainer *container, struct vfio_iommu_type1_info **info) { @@ -600,11 +633,12 @@ vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id) return NULL; } -static void vfio_get_iommu_info_migration(VFIOContainer *container, - struct vfio_iommu_type1_info *info) +static void vfio_get_iommu_info_migration(VFIOLegacyContainer *container, + struct vfio_iommu_type1_info *info) { struct vfio_info_cap_header *hdr; struct vfio_iommu_type1_info_cap_migration *cap_mig; + VFIOContainer *bcontainer = &container->obj; hdr = vfio_get_iommu_info_cap(info, VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION); if (!hdr) { @@ -619,13 +653,14 @@ static void vfio_get_iommu_info_migration(VFIOContainer *container, * qemu_real_host_page_size to mark those dirty. 
*/ if (cap_mig->pgsize_bitmap & qemu_real_host_page_size) { - container->dirty_pages_supported = true; - container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size; - container->dirty_pgsizes = cap_mig->pgsize_bitmap; + bcontainer->dirty_pages_supported = true; + bcontainer->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size; + bcontainer->dirty_pgsizes = cap_mig->pgsize_bitmap; } } -static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state) +static int +vfio_ram_block_discard_disable(VFIOLegacyContainer *container, bool state) { switch (container->iommu_type) { case VFIO_TYPE1v2_IOMMU: @@ -651,7 +686,8 @@ static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state) static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, Error **errp) { - VFIOContainer *container; + VFIOContainer *bcontainer; + VFIOLegacyContainer *container; int ret, fd; VFIOAddressSpace *space; @@ -688,7 +724,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, * details once we know which type of IOMMU we are using. 
*/ - QLIST_FOREACH(container, &space->containers, next) { + QLIST_FOREACH(bcontainer, &space->containers, next) { + container = container_of(bcontainer, VFIOLegacyContainer, obj); if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) { ret = vfio_ram_block_discard_disable(container, true); if (ret) { @@ -724,14 +761,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, } container = g_malloc0(sizeof(*container)); - container->space = space; container->fd = fd; - container->error = NULL; - container->dirty_pages_supported = false; - container->dma_max_mappings = 0; - QLIST_INIT(&container->giommu_list); - QLIST_INIT(&container->hostwin_list); - QLIST_INIT(&container->vrdl_list); + bcontainer = &container->obj; + vfio_container_init(bcontainer, sizeof(*bcontainer), + TYPE_VFIO_LEGACY_CONTAINER, space); ret = vfio_init_container(container, group->fd, errp); if (ret) { @@ -763,13 +796,13 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, /* Assume 4k IOVA page size */ info->iova_pgsizes = 4096; } - vfio_host_win_add(container, 0, (hwaddr)-1, info->iova_pgsizes); - container->pgsizes = info->iova_pgsizes; + vfio_host_win_add(bcontainer, 0, (hwaddr)-1, info->iova_pgsizes); + bcontainer->pgsizes = info->iova_pgsizes; /* The default in the kernel ("dma_entry_limit") is 65535. 
*/ - container->dma_max_mappings = 65535; + bcontainer->dma_max_mappings = 65535; if (!ret) { - vfio_get_info_dma_avail(info, &container->dma_max_mappings); + vfio_get_info_dma_avail(info, &bcontainer->dma_max_mappings); vfio_get_iommu_info_migration(container, info); } g_free(info); @@ -798,10 +831,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, memory_listener_register(&container->prereg_listener, &address_space_memory); - if (container->error) { + if (bcontainer->error) { memory_listener_unregister(&container->prereg_listener); ret = -1; - error_propagate_prepend(errp, container->error, + error_propagate_prepend(errp, bcontainer->error, "RAM memory listener initialization failed: "); goto enable_discards_exit; } @@ -820,7 +853,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, } if (v2) { - container->pgsizes = info.ddw.pgsizes; + bcontainer->pgsizes = info.ddw.pgsizes; /* * There is a default window in just created container. * To make region_add/del simpler, we better remove this @@ -835,8 +868,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, } } else { /* The default table uses 4K pages */ - container->pgsizes = 0x1000; - vfio_host_win_add(container, info.dma32_window_start, + bcontainer->pgsizes = 0x1000; + vfio_host_win_add(bcontainer, info.dma32_window_start, info.dma32_window_start + info.dma32_window_size - 1, 0x1000); @@ -847,28 +880,28 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, vfio_kvm_device_add_group(group); QLIST_INIT(&container->group_list); - QLIST_INSERT_HEAD(&space->containers, container, next); + QLIST_INSERT_HEAD(&space->containers, bcontainer, next); group->container = container; QLIST_INSERT_HEAD(&container->group_list, group, container_next); - container->listener = vfio_memory_listener; + bcontainer->listener = vfio_memory_listener; - memory_listener_register(&container->listener, container->space->as); + 
memory_listener_register(&bcontainer->listener, bcontainer->space->as); - if (container->error) { + if (bcontainer->error) { ret = -1; - error_propagate_prepend(errp, container->error, + error_propagate_prepend(errp, bcontainer->error, "memory listener initialization failed: "); goto listener_release_exit; } - container->initialized = true; + bcontainer->initialized = true; return 0; listener_release_exit: QLIST_REMOVE(group, container_next); - QLIST_REMOVE(container, next); + QLIST_REMOVE(bcontainer, next); vfio_kvm_device_del_group(group); vfio_listener_release(container); @@ -889,7 +922,8 @@ put_space_exit: static void vfio_disconnect_container(VFIOGroup *group) { - VFIOContainer *container = group->container; + VFIOLegacyContainer *container = group->container; + VFIOContainer *bcontainer = &container->obj; QLIST_REMOVE(group, container_next); group->container = NULL; @@ -909,25 +943,9 @@ static void vfio_disconnect_container(VFIOGroup *group) } if (QLIST_EMPTY(&container->group_list)) { - VFIOAddressSpace *space = container->space; - VFIOGuestIOMMU *giommu, *tmp; - VFIOHostDMAWindow *hostwin, *next; - - QLIST_REMOVE(container, next); - - QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) { - memory_region_unregister_iommu_notifier( - MEMORY_REGION(giommu->iommu_mr), &giommu->n); - QLIST_REMOVE(giommu, giommu_next); - g_free(giommu); - } - - QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next, - next) { - QLIST_REMOVE(hostwin, hostwin_next); - g_free(hostwin); - } + VFIOAddressSpace *space = bcontainer->space; + vfio_container_destroy(bcontainer); trace_vfio_disconnect_container(container->fd); close(container->fd); g_free(container); @@ -939,13 +957,15 @@ static void vfio_disconnect_container(VFIOGroup *group) VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp) { VFIOGroup *group; + VFIOContainer *bcontainer; char path[32]; struct vfio_group_status status = { .argsz = sizeof(status) }; QLIST_FOREACH(group, 
&vfio_group_list, next) { if (group->groupid == groupid) { /* Found it. Now is it already in the right context? */ - if (group->container->space->as == as) { + bcontainer = &group->container->obj; + if (bcontainer->space->as == as) { return group; } else { error_setg(errp, "group %d used in multiple address spaces", @@ -1098,7 +1118,7 @@ void vfio_put_base_device(VFIODevice *vbasedev) /* * Interfaces for IBM EEH (Enhanced Error Handling) */ -static bool vfio_eeh_container_ok(VFIOContainer *container) +static bool vfio_eeh_container_ok(VFIOLegacyContainer *container) { /* * As of 2016-03-04 (linux-4.5) the host kernel EEH/VFIO @@ -1126,7 +1146,7 @@ static bool vfio_eeh_container_ok(VFIOContainer *container) return true; } -static int vfio_eeh_container_op(VFIOContainer *container, uint32_t op) +static int vfio_eeh_container_op(VFIOLegacyContainer *container, uint32_t op) { struct vfio_eeh_pe_op pe_op = { .argsz = sizeof(pe_op), @@ -1149,19 +1169,21 @@ static int vfio_eeh_container_op(VFIOContainer *container, uint32_t op) return ret; } -static VFIOContainer *vfio_eeh_as_container(AddressSpace *as) +static VFIOLegacyContainer *vfio_eeh_as_container(AddressSpace *as) { VFIOAddressSpace *space = vfio_get_address_space(as); - VFIOContainer *container = NULL; + VFIOLegacyContainer *container = NULL; + VFIOContainer *bcontainer = NULL; if (QLIST_EMPTY(&space->containers)) { /* No containers to act on */ goto out; } - container = QLIST_FIRST(&space->containers); + bcontainer = QLIST_FIRST(&space->containers); + container = container_of(bcontainer, VFIOLegacyContainer, obj); - if (QLIST_NEXT(container, next)) { + if (QLIST_NEXT(bcontainer, next)) { /* * We don't yet have logic to synchronize EEH state across * multiple containers. 
@@ -1177,17 +1199,45 @@ out: bool vfio_eeh_as_ok(AddressSpace *as) { - VFIOContainer *container = vfio_eeh_as_container(as); + VFIOLegacyContainer *container = vfio_eeh_as_container(as); return (container != NULL) && vfio_eeh_container_ok(container); } int vfio_eeh_as_op(AddressSpace *as, uint32_t op) { - VFIOContainer *container = vfio_eeh_as_container(as); + VFIOLegacyContainer *container = vfio_eeh_as_container(as); if (!container) { return -ENODEV; } return vfio_eeh_container_op(container, op); } + +static void vfio_legacy_container_class_init(ObjectClass *klass, + void *data) +{ + VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_CLASS(klass); + + vccs->dma_map = vfio_dma_map; + vccs->dma_unmap = vfio_dma_unmap; + vccs->devices_all_dirty_tracking = vfio_devices_all_dirty_tracking; + vccs->set_dirty_page_tracking = vfio_set_dirty_page_tracking; + vccs->get_dirty_bitmap = vfio_get_dirty_bitmap; + vccs->add_window = vfio_legacy_container_add_section_window; + vccs->del_window = vfio_legacy_container_del_section_window; + vccs->check_extension = vfio_legacy_container_check_extension; +} + +static const TypeInfo vfio_legacy_container_info = { + .parent = TYPE_VFIO_CONTAINER_OBJ, + .name = TYPE_VFIO_LEGACY_CONTAINER, + .class_init = vfio_legacy_container_class_init, +}; + +static void vfio_register_types(void) +{ + type_register_static(&vfio_legacy_container_info); +} + +type_init(vfio_register_types) diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build index e3b6d6e2cb..df4fa2b695 100644 --- a/hw/vfio/meson.build +++ b/hw/vfio/meson.build @@ -2,6 +2,7 @@ vfio_ss = ss.source_set() vfio_ss.add(files( 'common.c', 'as.c', + 'container-obj.c', 'container.c', 'spapr.c', 'migration.c', diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index ff6b45de6b..cbbde177c3 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -856,11 +856,11 @@ int64_t vfio_mig_bytes_transferred(void) int vfio_migration_probe(VFIODevice *vbasedev, Error **errp) { - VFIOContainer 
*container = vbasedev->group->container; + VFIOLegacyContainer *container = vbasedev->group->container; struct vfio_region_info *info = NULL; int ret = -ENOTSUP; - if (!vbasedev->enable_migration || !container->dirty_pages_supported) { + if (!vbasedev->enable_migration || !container->obj.dirty_pages_supported) { goto add_blocker; } diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index e707329394..a00a485e46 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -3101,7 +3101,9 @@ static void vfio_realize(PCIDevice *pdev, Error **errp) } } - if (!pdev->failover_pair_id) { + if (!pdev->failover_pair_id && + vfio_container_check_extension(&vbasedev->group->container->obj, + VFIO_FEAT_LIVE_MIGRATION)) { ret = vfio_migration_probe(vbasedev, errp); if (ret) { error_report("%s: Migration disabled", vbasedev->name); diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c index 04c6e67f8f..cdcd9e05ba 100644 --- a/hw/vfio/spapr.c +++ b/hw/vfio/spapr.c @@ -39,8 +39,8 @@ static void *vfio_prereg_gpa_to_vaddr(MemoryRegionSection *section, hwaddr gpa) static void vfio_prereg_listener_region_add(MemoryListener *listener, MemoryRegionSection *section) { - VFIOContainer *container = container_of(listener, VFIOContainer, - prereg_listener); + VFIOLegacyContainer *container = container_of(listener, VFIOLegacyContainer, + prereg_listener); const hwaddr gpa = section->offset_within_address_space; hwaddr end; int ret; @@ -83,9 +83,9 @@ static void vfio_prereg_listener_region_add(MemoryListener *listener, * can gracefully fail. Runtime, there's not much we can do other * than throw a hardware error. 
*/ - if (!container->initialized) { - if (!container->error) { - error_setg_errno(&container->error, -ret, + if (!container->obj.initialized) { + if (!container->obj.error) { + error_setg_errno(&container->obj.error, -ret, "Memory registering failed"); } } else { @@ -97,8 +97,8 @@ static void vfio_prereg_listener_region_add(MemoryListener *listener, static void vfio_prereg_listener_region_del(MemoryListener *listener, MemoryRegionSection *section) { - VFIOContainer *container = container_of(listener, VFIOContainer, - prereg_listener); + VFIOLegacyContainer *container = container_of(listener, VFIOLegacyContainer, + prereg_listener); const hwaddr gpa = section->offset_within_address_space; hwaddr end; int ret; @@ -141,7 +141,7 @@ const MemoryListener vfio_prereg_listener = { .region_del = vfio_prereg_listener_region_del, }; -int vfio_spapr_create_window(VFIOContainer *container, +int vfio_spapr_create_window(VFIOLegacyContainer *container, MemoryRegionSection *section, hwaddr *pgsize) { @@ -159,13 +159,13 @@ int vfio_spapr_create_window(VFIOContainer *container, if (pagesize > rampagesize) { pagesize = rampagesize; } - pgmask = container->pgsizes & (pagesize | (pagesize - 1)); + pgmask = container->obj.pgsizes & (pagesize | (pagesize - 1)); pagesize = pgmask ? 
(1ULL << (63 - clz64(pgmask))) : 0; if (!pagesize) { error_report("Host doesn't support page size 0x%"PRIx64 ", the supported mask is 0x%lx", memory_region_iommu_get_min_page_size(iommu_mr), - container->pgsizes); + container->obj.pgsizes); return -EINVAL; } @@ -233,7 +233,7 @@ int vfio_spapr_create_window(VFIOContainer *container, return 0; } -int vfio_spapr_remove_window(VFIOContainer *container, +int vfio_spapr_remove_window(VFIOLegacyContainer *container, hwaddr offset_within_address_space) { struct vfio_iommu_spapr_tce_remove remove = { diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index 03ff7944cb..02a6f36a9e 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -30,6 +30,7 @@ #include #endif #include "sysemu/sysemu.h" +#include "hw/vfio/vfio-container-obj.h" #define VFIO_MSG_PREFIX "vfio %s: " @@ -70,58 +71,15 @@ typedef struct VFIOMigration { uint64_t pending_bytes; } VFIOMigration; -typedef struct VFIOAddressSpace { - AddressSpace *as; - QLIST_HEAD(, VFIOContainer) containers; - QLIST_ENTRY(VFIOAddressSpace) list; -} VFIOAddressSpace; - struct VFIOGroup; -typedef struct VFIOContainer { - VFIOAddressSpace *space; +typedef struct VFIOLegacyContainer { + VFIOContainer obj; int fd; /* /dev/vfio/vfio, empowered by the attached groups */ - MemoryListener listener; MemoryListener prereg_listener; unsigned iommu_type; - Error *error; - bool initialized; - bool dirty_pages_supported; - uint64_t dirty_pgsizes; - uint64_t max_dirty_bitmap_size; - unsigned long pgsizes; - unsigned int dma_max_mappings; - QLIST_HEAD(, VFIOGuestIOMMU) giommu_list; - QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list; QLIST_HEAD(, VFIOGroup) group_list; - QLIST_HEAD(, VFIORamDiscardListener) vrdl_list; - QLIST_ENTRY(VFIOContainer) next; -} VFIOContainer; - -typedef struct VFIOGuestIOMMU { - VFIOContainer *container; - IOMMUMemoryRegion *iommu_mr; - hwaddr iommu_offset; - IOMMUNotifier n; - QLIST_ENTRY(VFIOGuestIOMMU) giommu_next; -} 
VFIOGuestIOMMU; - -typedef struct VFIORamDiscardListener { - VFIOContainer *container; - MemoryRegion *mr; - hwaddr offset_within_address_space; - hwaddr size; - uint64_t granularity; - RamDiscardListener listener; - QLIST_ENTRY(VFIORamDiscardListener) next; -} VFIORamDiscardListener; - -typedef struct VFIOHostDMAWindow { - hwaddr min_iova; - hwaddr max_iova; - uint64_t iova_pgsizes; - QLIST_ENTRY(VFIOHostDMAWindow) hostwin_next; -} VFIOHostDMAWindow; +} VFIOLegacyContainer; typedef struct VFIODeviceOps VFIODeviceOps; @@ -159,7 +117,7 @@ struct VFIODeviceOps { typedef struct VFIOGroup { int fd; int groupid; - VFIOContainer *container; + VFIOLegacyContainer *container; QLIST_HEAD(, VFIODevice) device_list; QLIST_ENTRY(VFIOGroup) next; QLIST_ENTRY(VFIOGroup) container_next; @@ -192,31 +150,13 @@ typedef struct VFIODisplay { } dmabuf; } VFIODisplay; -void vfio_host_win_add(VFIOContainer *container, +void vfio_host_win_add(VFIOContainer *bcontainer, hwaddr min_iova, hwaddr max_iova, uint64_t iova_pgsizes); -int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova, +int vfio_host_win_del(VFIOContainer *bcontainer, hwaddr min_iova, hwaddr max_iova); VFIOAddressSpace *vfio_get_address_space(AddressSpace *as); void vfio_put_address_space(VFIOAddressSpace *space); -bool vfio_devices_all_running_and_saving(VFIOContainer *container); -bool vfio_devices_all_dirty_tracking(VFIOContainer *container); - -/* container->fd */ -int vfio_dma_unmap(VFIOContainer *container, - hwaddr iova, ram_addr_t size, - IOMMUTLBEntry *iotlb); -int vfio_dma_map(VFIOContainer *container, hwaddr iova, - ram_addr_t size, void *vaddr, bool readonly); -void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start); -int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova, - uint64_t size, ram_addr_t ram_addr); - -int vfio_container_add_section_window(VFIOContainer *container, - MemoryRegionSection *section, - Error **errp); -void vfio_container_del_section_window(VFIOContainer 
*container, - MemoryRegionSection *section); void vfio_put_base_device(VFIODevice *vbasedev); void vfio_disable_irqindex(VFIODevice *vbasedev, int index); @@ -263,10 +203,10 @@ vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id); #endif extern const MemoryListener vfio_prereg_listener; -int vfio_spapr_create_window(VFIOContainer *container, +int vfio_spapr_create_window(VFIOLegacyContainer *container, MemoryRegionSection *section, hwaddr *pgsize); -int vfio_spapr_remove_window(VFIOContainer *container, +int vfio_spapr_remove_window(VFIOLegacyContainer *container, hwaddr offset_within_address_space); int vfio_migration_probe(VFIODevice *vbasedev, Error **errp); diff --git a/include/hw/vfio/vfio-container-obj.h b/include/hw/vfio/vfio-container-obj.h new file mode 100644 index 0000000000..7ffbbb299f --- /dev/null +++ b/include/hw/vfio/vfio-container-obj.h @@ -0,0 +1,154 @@ +/* + * VFIO CONTAINER BASE OBJECT + * + * Copyright (C) 2022 Intel Corporation. + * Copyright Red Hat, Inc. 2022 + * + * Authors: Yi Liu + * Eric Auger + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + + * You should have received a copy of the GNU General Public License along + * with this program; if not, see . 
+ */ + +#ifndef HW_VFIO_VFIO_CONTAINER_OBJ_H +#define HW_VFIO_VFIO_CONTAINER_OBJ_H + +#include "qom/object.h" +#include "exec/memory.h" +#include "qemu/queue.h" +#include "qemu/thread.h" +#ifndef CONFIG_USER_ONLY +#include "exec/hwaddr.h" +#endif + +#define TYPE_VFIO_CONTAINER_OBJ "qemu:vfio-base-container-obj" +#define VFIO_CONTAINER_OBJ(obj) \ + OBJECT_CHECK(VFIOContainer, (obj), TYPE_VFIO_CONTAINER_OBJ) +#define VFIO_CONTAINER_OBJ_CLASS(klass) \ + OBJECT_CLASS_CHECK(VFIOContainerClass, (klass), \ + TYPE_VFIO_CONTAINER_OBJ) +#define VFIO_CONTAINER_OBJ_GET_CLASS(obj) \ + OBJECT_GET_CLASS(VFIOContainerClass, (obj), \ + TYPE_VFIO_CONTAINER_OBJ) + +typedef enum VFIOContainerFeature { + VFIO_FEAT_LIVE_MIGRATION, +} VFIOContainerFeature; + +typedef struct VFIOContainer VFIOContainer; + +typedef struct VFIOAddressSpace { + AddressSpace *as; + QLIST_HEAD(, VFIOContainer) containers; + QLIST_ENTRY(VFIOAddressSpace) list; +} VFIOAddressSpace; + +typedef struct VFIOGuestIOMMU { + VFIOContainer *container; + IOMMUMemoryRegion *iommu_mr; + hwaddr iommu_offset; + IOMMUNotifier n; + QLIST_ENTRY(VFIOGuestIOMMU) giommu_next; +} VFIOGuestIOMMU; + +typedef struct VFIORamDiscardListener { + VFIOContainer *container; + MemoryRegion *mr; + hwaddr offset_within_address_space; + hwaddr size; + uint64_t granularity; + RamDiscardListener listener; + QLIST_ENTRY(VFIORamDiscardListener) next; +} VFIORamDiscardListener; + +typedef struct VFIOHostDMAWindow { + hwaddr min_iova; + hwaddr max_iova; + uint64_t iova_pgsizes; + QLIST_ENTRY(VFIOHostDMAWindow) hostwin_next; +} VFIOHostDMAWindow; + +/* + * This is the base object for vfio container backends + */ +struct VFIOContainer { + /* private */ + Object parent_obj; + + VFIOAddressSpace *space; + MemoryListener listener; + Error *error; + bool initialized; + bool dirty_pages_supported; + uint64_t dirty_pgsizes; + uint64_t max_dirty_bitmap_size; + unsigned long pgsizes; + unsigned int dma_max_mappings; + QLIST_HEAD(, VFIOGuestIOMMU) giommu_list; 
+ QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list; + QLIST_HEAD(, VFIORamDiscardListener) vrdl_list; + QLIST_ENTRY(VFIOContainer) next; +}; + +typedef struct VFIOContainerClass { + /* private */ + ObjectClass parent_class; + + /* required */ + bool (*check_extension)(VFIOContainer *container, + VFIOContainerFeature feat); + int (*dma_map)(VFIOContainer *container, + hwaddr iova, ram_addr_t size, + void *vaddr, bool readonly); + int (*dma_unmap)(VFIOContainer *container, + hwaddr iova, ram_addr_t size, + IOMMUTLBEntry *iotlb); + /* migration feature */ + bool (*devices_all_dirty_tracking)(VFIOContainer *container); + void (*set_dirty_page_tracking)(VFIOContainer *container, bool start); + int (*get_dirty_bitmap)(VFIOContainer *container, uint64_t iova, + uint64_t size, ram_addr_t ram_addr); + + /* SPAPR specific */ + int (*add_window)(VFIOContainer *container, + MemoryRegionSection *section, + Error **errp); + void (*del_window)(VFIOContainer *container, + MemoryRegionSection *section); +} VFIOContainerClass; + +bool vfio_container_check_extension(VFIOContainer *container, + VFIOContainerFeature feat); +int vfio_container_dma_map(VFIOContainer *container, + hwaddr iova, ram_addr_t size, + void *vaddr, bool readonly); +int vfio_container_dma_unmap(VFIOContainer *container, + hwaddr iova, ram_addr_t size, + IOMMUTLBEntry *iotlb); +bool vfio_container_devices_all_dirty_tracking(VFIOContainer *container); +void vfio_container_set_dirty_page_tracking(VFIOContainer *container, + bool start); +int vfio_container_get_dirty_bitmap(VFIOContainer *container, uint64_t iova, + uint64_t size, ram_addr_t ram_addr); +int vfio_container_add_section_window(VFIOContainer *container, + MemoryRegionSection *section, + Error **errp); +void vfio_container_del_section_window(VFIOContainer *container, + MemoryRegionSection *section); + +void vfio_container_init(void *_container, size_t instance_size, + const char *mrtypename, + VFIOAddressSpace *space); +void 
vfio_container_destroy(VFIOContainer *container); +#endif /* HW_VFIO_VFIO_CONTAINER_OBJ_H */ From patchwork Thu Apr 14 10:47:00 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yi Liu X-Patchwork-Id: 12813345 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id AB78CC433F5 for ; Thu, 14 Apr 2022 10:54:25 +0000 (UTC) Received: from localhost ([::1]:35740 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nex7I-0007Md-NN for qemu-devel@archiver.kernel.org; Thu, 14 Apr 2022 06:54:24 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:55450) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nex0X-00024A-VH for qemu-devel@nongnu.org; Thu, 14 Apr 2022 06:47:25 -0400 Received: from mga12.intel.com ([192.55.52.136]:34772) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nex0V-0005Kn-Ks for qemu-devel@nongnu.org; Thu, 14 Apr 2022 06:47:25 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1649933243; x=1681469243; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=fzVg/TulejTLRtdp0hUfLMrxdcudhtYLzHSl9O78E70=; b=Cgferb4YefIWWyoLLhEaSwO/zd9sux+66dsIMPkhQHU3vBUAF+cTc8hM /ehLS6y2Vgyk/tHba1oVP5Oo0XdrALXuKoO49gXwSrRnxLp60uskU0SDy 5Ic911XA+yfhT1n0cHuXg9PHtcChT+Kq7gU3n1Od6xov5EXLaIiMI0hsb einLhuxKx4ph8DjiqPL5GiBkqpRlP3jijynnpttfrbJRlkA48Os8dr/Yq GYnOsSmxBVnkLfqml8P0lkGZat6Fk9e64VRAyqK2FW5jL3Q68v9RvsZn7 /tTZlGkbnneEb1CrpXJT4WasvQK1m8Iug3/YgyGeIRdTBNG9qXU9Trm4P g==; X-IronPort-AV: 
E=McAfee;i="6400,9594,10316"; a="242836493" X-IronPort-AV: E=Sophos;i="5.90,259,1643702400"; d="scan'208";a="242836493" Received: from fmsmga006.fm.intel.com ([10.253.24.20]) by fmsmga106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 Apr 2022 03:47:17 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,259,1643702400"; d="scan'208";a="803091214" Received: from 984fee00a4c6.jf.intel.com ([10.165.58.231]) by fmsmga006.fm.intel.com with ESMTP; 14 Apr 2022 03:47:16 -0700 From: Yi Liu To: alex.williamson@redhat.com, cohuck@redhat.com, qemu-devel@nongnu.org Subject: [RFC 08/18] vfio/container: Introduce vfio_[attach/detach]_device Date: Thu, 14 Apr 2022 03:47:00 -0700 Message-Id: <20220414104710.28534-9-yi.l.liu@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220414104710.28534-1-yi.l.liu@intel.com> References: <20220414104710.28534-1-yi.l.liu@intel.com> MIME-Version: 1.0 Received-SPF: pass client-ip=192.55.52.136; envelope-from=yi.l.liu@intel.com; helo=mga12.intel.com X-Spam_score_int: -44 X-Spam_score: -4.5 X-Spam_bar: ---- X-Spam_report: (-4.5 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: akrowiak@linux.ibm.com, jjherne@linux.ibm.com, thuth@redhat.com, yi.l.liu@intel.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com, jasowang@redhat.com, farman@linux.ibm.com, peterx@redhat.com, pasic@linux.ibm.com, eric.auger@redhat.com, yi.y.sun@intel.com, chao.p.peng@intel.com, nicolinc@nvidia.com, kevin.tian@intel.com, jgg@nvidia.com, eric.auger.pro@gmail.com, david@gibson.dropbear.id.au Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: 
"Qemu-devel" From: Eric Auger We want the VFIO devices to be able to use two different IOMMU backends, the legacy VFIO one and the new iommufd one. Introduce vfio_[attach/detach]_device, which aim to hide the underlying IOMMU backend (IOCTLs, datatypes, ...). Once vfio_attach_device completes, the device is attached to a security context and its fd can be used. Conversely, when vfio_detach_device completes, the device has been detached from the security context. In this patch, only the vfio-pci device gets converted to use the new API. Subsequent patches will handle other devices. Signed-off-by: Eric Auger Signed-off-by: Yi Liu --- hw/vfio/container.c | 65 +++++++++++++++++++++++++++++++++++ hw/vfio/pci.c | 50 +++------------------------ include/hw/vfio/vfio-common.h | 2 ++ 3 files changed, 72 insertions(+), 45 deletions(-) diff --git a/hw/vfio/container.c b/hw/vfio/container.c index 79972064d3..c74a3cd4ae 100644 --- a/hw/vfio/container.c +++ b/hw/vfio/container.c @@ -1214,6 +1214,71 @@ int vfio_eeh_as_op(AddressSpace *as, uint32_t op) return vfio_eeh_container_op(container, op); } +static int vfio_device_groupid(VFIODevice *vbasedev, Error **errp) +{ + char *tmp, group_path[PATH_MAX], *group_name; + int ret, groupid; + ssize_t len; + + tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev); + len = readlink(tmp, group_path, sizeof(group_path)); + g_free(tmp); + + if (len <= 0 || len >= sizeof(group_path)) { + ret = len < 0 ? 
-errno : -ENAMETOOLONG; + error_setg_errno(errp, -ret, "no iommu_group found"); + return ret; + } + + group_path[len] = 0; + + group_name = basename(group_path); + if (sscanf(group_name, "%d", &groupid) != 1) { + error_setg_errno(errp, errno, "failed to read %s", group_path); + return -errno; + } + return groupid; +} + +int vfio_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp) +{ + int groupid = vfio_device_groupid(vbasedev, errp); + VFIODevice *vbasedev_iter; + VFIOGroup *group; + int ret; + + if (groupid < 0) { + return groupid; + } + + trace_vfio_realize(vbasedev->name, groupid); + group = vfio_get_group(groupid, as, errp); + if (!group) { + return -1; + } + + QLIST_FOREACH(vbasedev_iter, &group->device_list, next) { + if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) { + error_setg(errp, "device is already attached"); + vfio_put_group(group); + return -1; + } + } + ret = vfio_get_device(group, vbasedev->name, vbasedev, errp); + if (ret) { + vfio_put_group(group); + return -1; + } + + return 0; +} + +void vfio_detach_device(VFIODevice *vbasedev) +{ + vfio_put_base_device(vbasedev); + vfio_put_group(vbasedev->group); +} + static void vfio_legacy_container_class_init(ObjectClass *klass, void *data) { diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index a00a485e46..0363f81017 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -2654,10 +2654,9 @@ static void vfio_populate_device(VFIOPCIDevice *vdev, Error **errp) static void vfio_put_device(VFIOPCIDevice *vdev) { - g_free(vdev->vbasedev.name); g_free(vdev->msix); - vfio_put_base_device(&vdev->vbasedev); + vfio_detach_device(&vdev->vbasedev); } static void vfio_err_notifier_handler(void *opaque) @@ -2804,13 +2803,9 @@ static void vfio_realize(PCIDevice *pdev, Error **errp) { VFIOPCIDevice *vdev = VFIO_PCI(pdev); VFIODevice *vbasedev = &vdev->vbasedev; - VFIODevice *vbasedev_iter; - VFIOGroup *group; - char *tmp, *subsys, group_path[PATH_MAX], *group_name; + char *tmp, *subsys; Error *err = NULL; - 
ssize_t len; struct stat st; - int groupid; int i, ret; bool is_mdev; @@ -2839,39 +2834,6 @@ static void vfio_realize(PCIDevice *pdev, Error **errp) vbasedev->type = VFIO_DEVICE_TYPE_PCI; vbasedev->dev = DEVICE(vdev); - tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev); - len = readlink(tmp, group_path, sizeof(group_path)); - g_free(tmp); - - if (len <= 0 || len >= sizeof(group_path)) { - error_setg_errno(errp, len < 0 ? errno : ENAMETOOLONG, - "no iommu_group found"); - goto error; - } - - group_path[len] = 0; - - group_name = basename(group_path); - if (sscanf(group_name, "%d", &groupid) != 1) { - error_setg_errno(errp, errno, "failed to read %s", group_path); - goto error; - } - - trace_vfio_realize(vbasedev->name, groupid); - - group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev), errp); - if (!group) { - goto error; - } - - QLIST_FOREACH(vbasedev_iter, &group->device_list, next) { - if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) { - error_setg(errp, "device is already attached"); - vfio_put_group(group); - goto error; - } - } - /* * Mediated devices *might* operate compatibly with discarding of RAM, but * we cannot know for certain, it depends on whether the mdev vendor driver @@ -2889,13 +2851,12 @@ static void vfio_realize(PCIDevice *pdev, Error **errp) if (vbasedev->ram_block_discard_allowed && !is_mdev) { error_setg(errp, "x-balloon-allowed only potentially compatible " "with mdev devices"); - vfio_put_group(group); goto error; } - ret = vfio_get_device(group, vbasedev->name, vbasedev, errp); + ret = vfio_attach_device(vbasedev, + pci_device_iommu_address_space(pdev), errp); if (ret) { - vfio_put_group(group); goto error; } @@ -3124,12 +3085,12 @@ out_teardown: vfio_bars_exit(vdev); error: error_prepend(errp, VFIO_MSG_PREFIX, vbasedev->name); + vfio_detach_device(vbasedev); } static void vfio_instance_finalize(Object *obj) { VFIOPCIDevice *vdev = VFIO_PCI(obj); - VFIOGroup *group = vdev->vbasedev.group; 
vfio_display_finalize(vdev); vfio_bars_finalize(vdev); @@ -3143,7 +3104,6 @@ static void vfio_instance_finalize(Object *obj) * g_free(vdev->igd_opregion); */ vfio_put_device(vdev); - vfio_put_group(group); } static void vfio_exitfn(PCIDevice *pdev) diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index 02a6f36a9e..978b2c2f6e 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -180,6 +180,8 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp); void vfio_put_group(VFIOGroup *group); int vfio_get_device(VFIOGroup *group, const char *name, VFIODevice *vbasedev, Error **errp); +int vfio_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp); +void vfio_detach_device(VFIODevice *vbasedev); extern const MemoryRegionOps vfio_region_ops; typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList; From patchwork Thu Apr 14 10:47:01 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yi Liu X-Patchwork-Id: 12813346 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id E385CC433F5 for ; Thu, 14 Apr 2022 10:54:29 +0000 (UTC) Received: from localhost ([::1]:36112 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nex7M-0007bs-T0 for qemu-devel@archiver.kernel.org; Thu, 14 Apr 2022 06:54:28 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:55484) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nex0a-0002CH-26 for qemu-devel@nongnu.org; Thu, 14 Apr 2022 06:47:28 -0400 Received: from mga12.intel.com ([192.55.52.136]:34772) by eggs.gnu.org with esmtps 
(TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nex0Y-0005Kn-8o for qemu-devel@nongnu.org; Thu, 14 Apr 2022 06:47:27 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1649933246; x=1681469246; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=gM0xGzwg7MLnTWpYcUR+4zIK9VWgzL3tJ/HAYadjIXo=; b=iJeO7BB0tpLBRDlEIBV9Vpeg8T363pv8DRsv6Y/PKzImOw6hHtFOvg2h t+JLvxkdAmj/Fyutdpr94xZUQSybsI9Uz/x7AzOKwyvznlcfnBSGSBNRQ z/5PFcGuxTSvDNcmRh2g7ErMUy0w1zuj1tr9SoW8SDFU8+pnw2rE7aA0U /nU7yqvcYua4ulITc/l+eGtCI7e2GLAsExSEVJFeHbrXXxosJVVJBR1Kp KJtEweazjuuhIRO5D+xfwluRPjXD7WuZvZ5A8k71WEswf+VICRmiL5WC8 WSymYnTZTeeFfm8r8cDENUysGhJkbYoZRmDdzfiih7dw5/z6H2m3dP1nV g==; X-IronPort-AV: E=McAfee;i="6400,9594,10316"; a="242836497" X-IronPort-AV: E=Sophos;i="5.90,259,1643702400"; d="scan'208";a="242836497" Received: from fmsmga006.fm.intel.com ([10.253.24.20]) by fmsmga106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 Apr 2022 03:47:18 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,259,1643702400"; d="scan'208";a="803091221" Received: from 984fee00a4c6.jf.intel.com ([10.165.58.231]) by fmsmga006.fm.intel.com with ESMTP; 14 Apr 2022 03:47:17 -0700 From: Yi Liu To: alex.williamson@redhat.com, cohuck@redhat.com, qemu-devel@nongnu.org Subject: [RFC 09/18] vfio/platform: Use vfio_[attach/detach]_device Date: Thu, 14 Apr 2022 03:47:01 -0700 Message-Id: <20220414104710.28534-10-yi.l.liu@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220414104710.28534-1-yi.l.liu@intel.com> References: <20220414104710.28534-1-yi.l.liu@intel.com> MIME-Version: 1.0 Received-SPF: pass client-ip=192.55.52.136; envelope-from=yi.l.liu@intel.com; helo=mga12.intel.com X-Spam_score_int: -44 X-Spam_score: -4.5 X-Spam_bar: ---- X-Spam_report: (-4.5 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, 
RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: akrowiak@linux.ibm.com, jjherne@linux.ibm.com, thuth@redhat.com, yi.l.liu@intel.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com, jasowang@redhat.com, farman@linux.ibm.com, peterx@redhat.com, pasic@linux.ibm.com, eric.auger@redhat.com, yi.y.sun@intel.com, chao.p.peng@intel.com, nicolinc@nvidia.com, kevin.tian@intel.com, jgg@nvidia.com, eric.auger.pro@gmail.com, david@gibson.dropbear.id.au Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" From: Eric Auger Let the vfio-platform device use vfio_attach_device() and vfio_detach_device(), hence hiding the details of the used IOMMU backend. Signed-off-by: Eric Auger Signed-off-by: Yi Liu --- hw/vfio/platform.c | 42 ++---------------------------------------- 1 file changed, 2 insertions(+), 40 deletions(-) diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c index 5af73f9287..3bcdc20667 100644 --- a/hw/vfio/platform.c +++ b/hw/vfio/platform.c @@ -529,12 +529,7 @@ static VFIODeviceOps vfio_platform_ops = { */ static int vfio_base_device_init(VFIODevice *vbasedev, Error **errp) { - VFIOGroup *group; - VFIODevice *vbasedev_iter; - char *tmp, group_path[PATH_MAX], *group_name; - ssize_t len; struct stat st; - int groupid; int ret; /* @sysfsdev takes precedence over @host */ @@ -557,47 +552,14 @@ static int vfio_base_device_init(VFIODevice *vbasedev, Error **errp) return -errno; } - tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev); - len = readlink(tmp, group_path, sizeof(group_path)); - g_free(tmp); - - if (len < 0 || len >= sizeof(group_path)) { - ret = len < 0 ? 
-errno : -ENAMETOOLONG; - error_setg_errno(errp, -ret, "no iommu_group found"); - return ret; - } - - group_path[len] = 0; - - group_name = basename(group_path); - if (sscanf(group_name, "%d", &groupid) != 1) { - error_setg_errno(errp, errno, "failed to read %s", group_path); - return -errno; - } - - trace_vfio_platform_base_device_init(vbasedev->name, groupid); - - group = vfio_get_group(groupid, &address_space_memory, errp); - if (!group) { - return -ENOENT; - } - - QLIST_FOREACH(vbasedev_iter, &group->device_list, next) { - if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) { - error_setg(errp, "device is already attached"); - vfio_put_group(group); - return -EBUSY; - } - } - ret = vfio_get_device(group, vbasedev->name, vbasedev, errp); + ret = vfio_attach_device(vbasedev, &address_space_memory, errp); if (ret) { - vfio_put_group(group); return ret; } ret = vfio_populate_device(vbasedev, errp); if (ret) { - vfio_put_group(group); + vfio_detach_device(vbasedev); } return ret; From patchwork Thu Apr 14 10:47:02 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yi Liu X-Patchwork-Id: 12813358 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 2011CC433EF for ; Thu, 14 Apr 2022 11:02:33 +0000 (UTC) Received: from localhost ([::1]:52874 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nexF9-0002Qf-U6 for qemu-devel@archiver.kernel.org; Thu, 14 Apr 2022 07:02:31 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:55490) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nex0a-0002EO-RZ for qemu-devel@nongnu.org; Thu, 14 Apr 
2022 06:47:28 -0400 Received: from mga12.intel.com ([192.55.52.136]:34770) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nex0Z-0005Ke-1E for qemu-devel@nongnu.org; Thu, 14 Apr 2022 06:47:28 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1649933247; x=1681469247; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=k4VcpDgjB2tGSrAgzMFAbDoDoLC4lC0u8qknfORhCBg=; b=Etr6cFdSePZvhRODnogb/D76AJxpdpepXa2otbjRxhFrkJ0VziqsX4us tnYcMoNhe83Tng5IRorW4NONflRz+dh5hDActy4EigdTyXdKYpil77eCg 9MYhMtz/7xzBqRNnjF2ctWdMVcq1aUj2tzQv9lCWjbFfuC26guWCu6KGh 2ZRK3CZ5S+dOg0cLIHcNUGd2lkjD9neUBb0WYy3LpSYpDPhIqtA/s5zXR NoaGDKsyRCPAi5JtnETjL3EoU418bEzTPfIgJAFfNxNo/R7CRwz4CXryj HVE8Al1pkK2cf2Z8OY9UehI37uN1vA3H5lPS1hnD34rBYS0CEUiTidaXS Q==; X-IronPort-AV: E=McAfee;i="6400,9594,10316"; a="242836500" X-IronPort-AV: E=Sophos;i="5.90,259,1643702400"; d="scan'208";a="242836500" Received: from fmsmga006.fm.intel.com ([10.253.24.20]) by fmsmga106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 Apr 2022 03:47:19 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,259,1643702400"; d="scan'208";a="803091226" Received: from 984fee00a4c6.jf.intel.com ([10.165.58.231]) by fmsmga006.fm.intel.com with ESMTP; 14 Apr 2022 03:47:18 -0700 From: Yi Liu To: alex.williamson@redhat.com, cohuck@redhat.com, qemu-devel@nongnu.org Subject: [RFC 10/18] vfio/ap: Use vfio_[attach/detach]_device Date: Thu, 14 Apr 2022 03:47:02 -0700 Message-Id: <20220414104710.28534-11-yi.l.liu@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220414104710.28534-1-yi.l.liu@intel.com> References: <20220414104710.28534-1-yi.l.liu@intel.com> MIME-Version: 1.0 Received-SPF: pass client-ip=192.55.52.136; envelope-from=yi.l.liu@intel.com; helo=mga12.intel.com X-Spam_score_int: -44 X-Spam_score: -4.5 X-Spam_bar: ---- X-Spam_report: (-4.5 / 5.0 requ) BAYES_00=-1.9, 
DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: akrowiak@linux.ibm.com, jjherne@linux.ibm.com, thuth@redhat.com, yi.l.liu@intel.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com, jasowang@redhat.com, farman@linux.ibm.com, peterx@redhat.com, pasic@linux.ibm.com, eric.auger@redhat.com, yi.y.sun@intel.com, chao.p.peng@intel.com, nicolinc@nvidia.com, kevin.tian@intel.com, jgg@nvidia.com, eric.auger.pro@gmail.com, david@gibson.dropbear.id.au Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" From: Eric Auger Let the vfio-ap device use vfio_attach_device() and vfio_detach_device(), hence hiding the details of the used IOMMU backend. 
Signed-off-by: Eric Auger Signed-off-by: Yi Liu --- hw/vfio/ap.c | 62 ++++++++-------------------------------------------- 1 file changed, 9 insertions(+), 53 deletions(-) diff --git a/hw/vfio/ap.c b/hw/vfio/ap.c index e0dd561e85..286ac638e5 100644 --- a/hw/vfio/ap.c +++ b/hw/vfio/ap.c @@ -50,58 +50,17 @@ struct VFIODeviceOps vfio_ap_ops = { .vfio_compute_needs_reset = vfio_ap_compute_needs_reset, }; -static void vfio_ap_put_device(VFIOAPDevice *vapdev) -{ - g_free(vapdev->vdev.name); - vfio_put_base_device(&vapdev->vdev); -} - -static VFIOGroup *vfio_ap_get_group(VFIOAPDevice *vapdev, Error **errp) -{ - GError *gerror = NULL; - char *symlink, *group_path; - int groupid; - - symlink = g_strdup_printf("%s/iommu_group", vapdev->vdev.sysfsdev); - group_path = g_file_read_link(symlink, &gerror); - g_free(symlink); - - if (!group_path) { - error_setg(errp, "%s: no iommu_group found for %s: %s", - TYPE_VFIO_AP_DEVICE, vapdev->vdev.sysfsdev, gerror->message); - g_error_free(gerror); - return NULL; - } - - if (sscanf(basename(group_path), "%d", &groupid) != 1) { - error_setg(errp, "vfio: failed to read %s", group_path); - g_free(group_path); - return NULL; - } - - g_free(group_path); - - return vfio_get_group(groupid, &address_space_memory, errp); -} - static void vfio_ap_realize(DeviceState *dev, Error **errp) { - int ret; - char *mdevid; - VFIOGroup *vfio_group; APDevice *apdev = AP_DEVICE(dev); VFIOAPDevice *vapdev = VFIO_AP_DEVICE(apdev); + VFIODevice *vbasedev = &vapdev->vdev; + int ret; - vfio_group = vfio_ap_get_group(vapdev, errp); - if (!vfio_group) { - return; - } - - vapdev->vdev.ops = &vfio_ap_ops; - vapdev->vdev.type = VFIO_DEVICE_TYPE_AP; - mdevid = basename(vapdev->vdev.sysfsdev); - vapdev->vdev.name = g_strdup_printf("%s", mdevid); - vapdev->vdev.dev = dev; + vbasedev->name = g_path_get_basename(vbasedev->sysfsdev); + vbasedev->ops = &vfio_ap_ops; + vbasedev->type = VFIO_DEVICE_TYPE_AP; + vbasedev->dev = dev; /* * vfio-ap devices operate in a way compatible 
with discarding of @@ -111,7 +70,7 @@ static void vfio_ap_realize(DeviceState *dev, Error **errp) */ vapdev->vdev.ram_block_discard_allowed = true; - ret = vfio_get_device(vfio_group, mdevid, &vapdev->vdev, errp); + ret = vfio_attach_device(vbasedev, &address_space_memory, errp); if (ret) { goto out_get_dev_err; } @@ -119,18 +78,15 @@ static void vfio_ap_realize(DeviceState *dev, Error **errp) return; out_get_dev_err: - vfio_ap_put_device(vapdev); - vfio_put_group(vfio_group); + vfio_detach_device(vbasedev); } static void vfio_ap_unrealize(DeviceState *dev) { APDevice *apdev = AP_DEVICE(dev); VFIOAPDevice *vapdev = VFIO_AP_DEVICE(apdev); - VFIOGroup *group = vapdev->vdev.group; - vfio_ap_put_device(vapdev); - vfio_put_group(group); + vfio_detach_device(&vapdev->vdev); } static Property vfio_ap_properties[] = { From patchwork Thu Apr 14 10:47:03 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yi Liu X-Patchwork-Id: 12813362 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 77693C433F5 for ; Thu, 14 Apr 2022 11:08:03 +0000 (UTC) Received: from localhost ([::1]:60186 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nexKU-0007sN-Ek for qemu-devel@archiver.kernel.org; Thu, 14 Apr 2022 07:08:02 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:55572) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nex0n-0002Vy-KF for qemu-devel@nongnu.org; Thu, 14 Apr 2022 06:47:41 -0400 Received: from mga12.intel.com ([192.55.52.136]:34772) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) 
id 1nex0k-0005Kn-BM for qemu-devel@nongnu.org; Thu, 14 Apr 2022 06:47:40 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1649933258; x=1681469258; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=KHydA018jKFqdkeNibVMAGbxhdJmQrL0qfYlDKZbHHU=; b=Ydwly7ceogGIuWi8G2rPbfh8lHxdGNCPbBAUTcDuKEdRLUS4FlmOufeh KP+0AOMIq79FUQ5Q+se1yP39K2kTAOz52PRmYAcbujPsuRWe0DAVFJa0E Z+8cAhlU2XvFA7suCi7gTTIDe3BJJ9zS5/pt317Tfc5fEU6+fsTPtwrRA GstYTbO9gO5fwl4EtgpcyTESvBf9Al21ngmr/N4O31V05ilfP9Tbp4sYU gMtiL2WUpU/xr2CS+JAgrc5bXXGoKWRkiGBU+FkK/nxsrZL10tUT7t0F4 NlWxbXz6pLLHzDIM6b1zaVkO3h+eljCUoLtrqdXizm5A2hv7QSndeJjCl A==; X-IronPort-AV: E=McAfee;i="6400,9594,10316"; a="242836505" X-IronPort-AV: E=Sophos;i="5.90,259,1643702400"; d="scan'208";a="242836505" Received: from fmsmga006.fm.intel.com ([10.253.24.20]) by fmsmga106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 Apr 2022 03:47:19 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,259,1643702400"; d="scan'208";a="803091230" Received: from 984fee00a4c6.jf.intel.com ([10.165.58.231]) by fmsmga006.fm.intel.com with ESMTP; 14 Apr 2022 03:47:19 -0700 From: Yi Liu To: alex.williamson@redhat.com, cohuck@redhat.com, qemu-devel@nongnu.org Subject: [RFC 11/18] vfio/ccw: Use vfio_[attach/detach]_device Date: Thu, 14 Apr 2022 03:47:03 -0700 Message-Id: <20220414104710.28534-12-yi.l.liu@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220414104710.28534-1-yi.l.liu@intel.com> References: <20220414104710.28534-1-yi.l.liu@intel.com> MIME-Version: 1.0 Received-SPF: pass client-ip=192.55.52.136; envelope-from=yi.l.liu@intel.com; helo=mga12.intel.com X-Spam_score_int: -44 X-Spam_score: -4.5 X-Spam_bar: ---- X-Spam_report: (-4.5 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001, 
T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: akrowiak@linux.ibm.com, jjherne@linux.ibm.com, thuth@redhat.com, yi.l.liu@intel.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com, jasowang@redhat.com, farman@linux.ibm.com, peterx@redhat.com, pasic@linux.ibm.com, eric.auger@redhat.com, yi.y.sun@intel.com, chao.p.peng@intel.com, nicolinc@nvidia.com, kevin.tian@intel.com, jgg@nvidia.com, eric.auger.pro@gmail.com, david@gibson.dropbear.id.au Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" From: Eric Auger Let the vfio-ccw device use vfio_attach_device() and vfio_detach_device(), hence hiding the details of the used IOMMU backend. Also, now that all the devices have been migrated to the new vfio_attach_device/vfio_detach_device API, turn the legacy functions into static functions local to container.c. 
Signed-off-by: Eric Auger Signed-off-by: Yi Liu --- hw/vfio/ccw.c | 118 ++++++++-------------------------- hw/vfio/container.c | 8 +-- include/hw/vfio/vfio-common.h | 4 -- 3 files changed, 32 insertions(+), 98 deletions(-) diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c index 0354737666..6fde7849cc 100644 --- a/hw/vfio/ccw.c +++ b/hw/vfio/ccw.c @@ -579,27 +579,32 @@ static void vfio_ccw_put_region(VFIOCCWDevice *vcdev) g_free(vcdev->io_region); } -static void vfio_ccw_put_device(VFIOCCWDevice *vcdev) -{ - g_free(vcdev->vdev.name); - vfio_put_base_device(&vcdev->vdev); -} - -static void vfio_ccw_get_device(VFIOGroup *group, VFIOCCWDevice *vcdev, - Error **errp) +static void vfio_ccw_realize(DeviceState *dev, Error **errp) { + CcwDevice *ccw_dev = DO_UPCAST(CcwDevice, parent_obj, dev); + S390CCWDevice *cdev = DO_UPCAST(S390CCWDevice, parent_obj, ccw_dev); + VFIOCCWDevice *vcdev = DO_UPCAST(VFIOCCWDevice, cdev, cdev); + S390CCWDeviceClass *cdc = S390_CCW_DEVICE_GET_CLASS(cdev); + VFIODevice *vbasedev = &vcdev->vdev; + Error *err = NULL; char *name = g_strdup_printf("%x.%x.%04x", vcdev->cdev.hostid.cssid, vcdev->cdev.hostid.ssid, vcdev->cdev.hostid.devid); - VFIODevice *vbasedev; + int ret; - QLIST_FOREACH(vbasedev, &group->device_list, next) { - if (strcmp(vbasedev->name, name) == 0) { - error_setg(errp, "vfio: subchannel %s has already been attached", - name); - goto out_err; + /* Call the class init function for subchannel. 
*/ + if (cdc->realize) { + cdc->realize(cdev, vcdev->vdev.sysfsdev, &err); + if (err) { + goto out_err_propagate; } } + vbasedev->sysfsdev = g_strdup_printf("/sys/bus/css/devices/%s/%s", + name, cdev->mdevid); + vbasedev->ops = &vfio_ccw_ops; + vbasedev->type = VFIO_DEVICE_TYPE_CCW; + vbasedev->name = name; + vbasedev->dev = &vcdev->cdev.parent_obj.parent_obj; /* * All vfio-ccw devices are believed to operate in a way compatible with @@ -609,80 +614,18 @@ static void vfio_ccw_get_device(VFIOGroup *group, VFIOCCWDevice *vcdev, * needs to be set before vfio_get_device() for vfio common to handle * ram_block_discard_disable(). */ - vcdev->vdev.ram_block_discard_allowed = true; - - if (vfio_get_device(group, vcdev->cdev.mdevid, &vcdev->vdev, errp)) { - goto out_err; - } - - vcdev->vdev.ops = &vfio_ccw_ops; - vcdev->vdev.type = VFIO_DEVICE_TYPE_CCW; - vcdev->vdev.name = name; - vcdev->vdev.dev = &vcdev->cdev.parent_obj.parent_obj; - - return; - -out_err: - g_free(name); -} - -static VFIOGroup *vfio_ccw_get_group(S390CCWDevice *cdev, Error **errp) -{ - char *tmp, group_path[PATH_MAX]; - ssize_t len; - int groupid; - tmp = g_strdup_printf("/sys/bus/css/devices/%x.%x.%04x/%s/iommu_group", - cdev->hostid.cssid, cdev->hostid.ssid, - cdev->hostid.devid, cdev->mdevid); - len = readlink(tmp, group_path, sizeof(group_path)); - g_free(tmp); + vbasedev->ram_block_discard_allowed = true; - if (len <= 0 || len >= sizeof(group_path)) { - error_setg(errp, "vfio: no iommu_group found"); - return NULL; - } - - group_path[len] = 0; - - if (sscanf(basename(group_path), "%d", &groupid) != 1) { - error_setg(errp, "vfio: failed to read %s", group_path); - return NULL; - } - - return vfio_get_group(groupid, &address_space_memory, errp); -} - -static void vfio_ccw_realize(DeviceState *dev, Error **errp) -{ - VFIOGroup *group; - CcwDevice *ccw_dev = DO_UPCAST(CcwDevice, parent_obj, dev); - S390CCWDevice *cdev = DO_UPCAST(S390CCWDevice, parent_obj, ccw_dev); - VFIOCCWDevice *vcdev = 
DO_UPCAST(VFIOCCWDevice, cdev, cdev); - S390CCWDeviceClass *cdc = S390_CCW_DEVICE_GET_CLASS(cdev); - Error *err = NULL; - - /* Call the class init function for subchannel. */ - if (cdc->realize) { - cdc->realize(cdev, vcdev->vdev.sysfsdev, &err); - if (err) { - goto out_err_propagate; - } - } - - group = vfio_ccw_get_group(cdev, &err); - if (!group) { - goto out_group_err; - } - - vfio_ccw_get_device(group, vcdev, &err); - if (err) { - goto out_device_err; + ret = vfio_attach_device(vbasedev, &address_space_memory, errp); + if (ret) { + g_free(vbasedev->name); + g_free(vbasedev->sysfsdev); } vfio_ccw_get_region(vcdev, &err); if (err) { - goto out_region_err; + goto out_get_dev_err; } vfio_ccw_register_irq_notifier(vcdev, VFIO_CCW_IO_IRQ_INDEX, &err); @@ -714,11 +657,8 @@ out_irq_notifier_err: vfio_ccw_unregister_irq_notifier(vcdev, VFIO_CCW_IO_IRQ_INDEX); out_io_notifier_err: vfio_ccw_put_region(vcdev); -out_region_err: - vfio_ccw_put_device(vcdev); -out_device_err: - vfio_put_group(group); -out_group_err: +out_get_dev_err: + vfio_detach_device(vbasedev); if (cdc->unrealize) { cdc->unrealize(cdev); } @@ -732,14 +672,12 @@ static void vfio_ccw_unrealize(DeviceState *dev) S390CCWDevice *cdev = DO_UPCAST(S390CCWDevice, parent_obj, ccw_dev); VFIOCCWDevice *vcdev = DO_UPCAST(VFIOCCWDevice, cdev, cdev); S390CCWDeviceClass *cdc = S390_CCW_DEVICE_GET_CLASS(cdev); - VFIOGroup *group = vcdev->vdev.group; vfio_ccw_unregister_irq_notifier(vcdev, VFIO_CCW_REQ_IRQ_INDEX); vfio_ccw_unregister_irq_notifier(vcdev, VFIO_CCW_CRW_IRQ_INDEX); vfio_ccw_unregister_irq_notifier(vcdev, VFIO_CCW_IO_IRQ_INDEX); vfio_ccw_put_region(vcdev); - vfio_ccw_put_device(vcdev); - vfio_put_group(group); + vfio_detach_device(&vcdev->vdev); if (cdc->unrealize) { cdc->unrealize(cdev); diff --git a/hw/vfio/container.c b/hw/vfio/container.c index c74a3cd4ae..5d73f8285e 100644 --- a/hw/vfio/container.c +++ b/hw/vfio/container.c @@ -954,7 +954,7 @@ static void vfio_disconnect_container(VFIOGroup *group) } } 
-VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp) +static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp) { VFIOGroup *group; VFIOContainer *bcontainer; @@ -1023,7 +1023,7 @@ free_group_exit: return NULL; } -void vfio_put_group(VFIOGroup *group) +static void vfio_put_group(VFIOGroup *group) { if (!group || !QLIST_EMPTY(&group->device_list)) { return; @@ -1044,8 +1044,8 @@ void vfio_put_group(VFIOGroup *group) } } -int vfio_get_device(VFIOGroup *group, const char *name, - VFIODevice *vbasedev, Error **errp) +static int vfio_get_device(VFIOGroup *group, const char *name, + VFIODevice *vbasedev, Error **errp) { struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) }; int ret, fd; diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index 978b2c2f6e..7d7898717e 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -176,10 +176,6 @@ void vfio_region_unmap(VFIORegion *region); void vfio_region_exit(VFIORegion *region); void vfio_region_finalize(VFIORegion *region); void vfio_reset_handler(void *opaque); -VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp); -void vfio_put_group(VFIOGroup *group); -int vfio_get_device(VFIOGroup *group, const char *name, - VFIODevice *vbasedev, Error **errp); int vfio_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp); void vfio_detach_device(VFIODevice *vbasedev); From patchwork Thu Apr 14 10:47:04 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yi Liu X-Patchwork-Id: 12813344 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 5231EC433FE for ; Thu, 14 Apr 
2022 10:53:04 +0000 (UTC)
From: Yi Liu
To: alex.williamson@redhat.com, cohuck@redhat.com, qemu-devel@nongnu.org
Subject: [RFC 12/18] vfio/container-obj: Introduce [attach/detach]_device container callbacks
Date: Thu, 14 Apr 2022 03:47:04 -0700
Message-Id: <20220414104710.28534-13-yi.l.liu@intel.com>
X-Mailer: git-send-email 2.27.0
In-Reply-To: <20220414104710.28534-1-yi.l.liu@intel.com>
References: <20220414104710.28534-1-yi.l.liu@intel.com>

From: Eric Auger

Let's turn attach/detach_device into container callbacks. That way, their implementation can be easily customized for a given backend. For the time being, only the legacy container is supported.
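To make the callback idea above concrete, here is a minimal standalone sketch of dispatching attach/detach through a per-backend table. It does not use QOM: `BackendType`, `ContainerOps`, `Device`, `device_attach` and `device_detach` are hypothetical stand-ins for illustration, not the names or types used in this series. One deliberate difference is that the generic wrapper records the table on attach, whereas the patch has the legacy callback set `vbasedev->container` itself.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the QEMU types; not the real definitions. */
typedef enum { BACKEND_LEGACY = 0, BACKEND_IOMMUFD = 1 } BackendType;

typedef struct Device Device;

typedef struct ContainerOps {
    int (*attach_device)(Device *dev);
    void (*detach_device)(Device *dev);
} ContainerOps;

struct Device {
    const ContainerOps *ops;   /* set on attach, cleared on detach */
    int attached;
};

static int legacy_attach(Device *dev) { dev->attached = 1; return 0; }
static void legacy_detach(Device *dev) { dev->attached = 0; }

static const ContainerOps legacy_ops = {
    .attach_device = legacy_attach,
    .detach_device = legacy_detach,
};

/* Map a backend type to its callback table; NULL if unsupported. */
static const ContainerOps *get_container_ops(BackendType be)
{
    switch (be) {
    case BACKEND_LEGACY:
        return &legacy_ops;
    default:
        return NULL;
    }
}

int device_attach(Device *dev, BackendType be)
{
    const ContainerOps *ops = get_container_ops(be);
    int ret;

    if (!ops) {
        return -ENOENT;        /* no backend registered for this type */
    }
    ret = ops->attach_device(dev);
    if (ret == 0) {
        dev->ops = ops;        /* remember which backend to detach from */
    }
    return ret;
}

void device_detach(Device *dev)
{
    if (!dev->ops) {
        return;                /* never attached */
    }
    dev->ops->detach_device(dev);
    dev->ops = NULL;
}
```

Adding a second backend in this sketch would only mean adding another table and another `case`; the callers of `device_attach()` and `device_detach()` stay unchanged.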
Signed-off-by: Eric Auger Signed-off-by: Yi Liu --- hw/vfio/as.c | 36 ++++++++++++++++++++++++++++ hw/vfio/container.c | 11 +++++---- hw/vfio/pci.c | 2 +- include/hw/vfio/vfio-common.h | 7 ++++++ include/hw/vfio/vfio-container-obj.h | 6 +++++ 5 files changed, 57 insertions(+), 5 deletions(-) diff --git a/hw/vfio/as.c b/hw/vfio/as.c index 37423d2c89..30e86f6833 100644 --- a/hw/vfio/as.c +++ b/hw/vfio/as.c @@ -874,3 +874,39 @@ void vfio_put_address_space(VFIOAddressSpace *space) g_free(space); } } + +static VFIOContainerClass * +vfio_get_container_class(VFIOIOMMUBackendType be) +{ + ObjectClass *klass; + + switch (be) { + case VFIO_IOMMU_BACKEND_TYPE_LEGACY: + klass = object_class_by_name(TYPE_VFIO_LEGACY_CONTAINER); + return VFIO_CONTAINER_OBJ_CLASS(klass); + default: + return NULL; + } +} + +int vfio_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp) +{ + VFIOContainerClass *vccs; + + vccs = vfio_get_container_class(VFIO_IOMMU_BACKEND_TYPE_LEGACY); + if (!vccs) { + return -ENOENT; + } + return vccs->attach_device(vbasedev, as, errp); +} + +void vfio_detach_device(VFIODevice *vbasedev) +{ + VFIOContainerClass *vccs; + + if (!vbasedev->container) { + return; + } + vccs = VFIO_CONTAINER_OBJ_GET_CLASS(vbasedev->container); + vccs->detach_device(vbasedev); +} diff --git a/hw/vfio/container.c b/hw/vfio/container.c index 5d73f8285e..74febc1567 100644 --- a/hw/vfio/container.c +++ b/hw/vfio/container.c @@ -50,8 +50,6 @@ static int vfio_kvm_device_fd = -1; #endif -#define TYPE_VFIO_LEGACY_CONTAINER "qemu:vfio-legacy-container" - VFIOGroupList vfio_group_list = QLIST_HEAD_INITIALIZER(vfio_group_list); @@ -1240,7 +1238,8 @@ static int vfio_device_groupid(VFIODevice *vbasedev, Error **errp) return groupid; } -int vfio_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp) +static int +legacy_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp) { int groupid = vfio_device_groupid(vbasedev, errp); VFIODevice *vbasedev_iter; @@ 
-1269,14 +1268,16 @@ int vfio_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp) vfio_put_group(group); return -1; } + vbasedev->container = &group->container->obj; return 0; } -void vfio_detach_device(VFIODevice *vbasedev) +static void legacy_detach_device(VFIODevice *vbasedev) { vfio_put_base_device(vbasedev); vfio_put_group(vbasedev->group); + vbasedev->container = NULL; } static void vfio_legacy_container_class_init(ObjectClass *klass, @@ -1292,6 +1293,8 @@ static void vfio_legacy_container_class_init(ObjectClass *klass, vccs->add_window = vfio_legacy_container_add_section_window; vccs->del_window = vfio_legacy_container_del_section_window; vccs->check_extension = vfio_legacy_container_check_extension; + vccs->attach_device = legacy_attach_device; + vccs->detach_device = legacy_detach_device; } static const TypeInfo vfio_legacy_container_info = { diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index 0363f81017..e1ab6d339d 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -3063,7 +3063,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp) } if (!pdev->failover_pair_id && - vfio_container_check_extension(&vbasedev->group->container->obj, + vfio_container_check_extension(vbasedev->container, VFIO_FEAT_LIVE_MIGRATION)) { ret = vfio_migration_probe(vbasedev, errp); if (ret) { diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index 7d7898717e..2040c27cda 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -83,9 +83,15 @@ typedef struct VFIOLegacyContainer { typedef struct VFIODeviceOps VFIODeviceOps; +typedef enum VFIOIOMMUBackendType { + VFIO_IOMMU_BACKEND_TYPE_LEGACY = 0, + VFIO_IOMMU_BACKEND_TYPE_IOMMUFD = 1, +} VFIOIOMMUBackendType; + typedef struct VFIODevice { QLIST_ENTRY(VFIODevice) next; struct VFIOGroup *group; + VFIOContainer *container; char *sysfsdev; char *name; DeviceState *dev; @@ -97,6 +103,7 @@ typedef struct VFIODevice { bool ram_block_discard_allowed; bool enable_migration; 
VFIODeviceOps *ops; + VFIOIOMMUBackendType be; unsigned int num_irqs; unsigned int num_regions; unsigned int flags; diff --git a/include/hw/vfio/vfio-container-obj.h b/include/hw/vfio/vfio-container-obj.h index 7ffbbb299f..ebc1340530 100644 --- a/include/hw/vfio/vfio-container-obj.h +++ b/include/hw/vfio/vfio-container-obj.h @@ -42,6 +42,8 @@ OBJECT_GET_CLASS(VFIOContainerClass, (obj), \ TYPE_VFIO_CONTAINER_OBJ) +#define TYPE_VFIO_LEGACY_CONTAINER "qemu:vfio-legacy-container" + typedef enum VFIOContainerFeature { VFIO_FEAT_LIVE_MIGRATION, } VFIOContainerFeature; @@ -101,6 +103,8 @@ struct VFIOContainer { QLIST_ENTRY(VFIOContainer) next; }; +typedef struct VFIODevice VFIODevice; + typedef struct VFIOContainerClass { /* private */ ObjectClass parent_class; @@ -126,6 +130,8 @@ typedef struct VFIOContainerClass { Error **errp); void (*del_window)(VFIOContainer *container, MemoryRegionSection *section); + int (*attach_device)(VFIODevice *vbasedev, AddressSpace *as, Error **errp); + void (*detach_device)(VFIODevice *vbasedev); } VFIOContainerClass; bool vfio_container_check_extension(VFIOContainer *container, From patchwork Thu Apr 14 10:47:05 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yi Liu X-Patchwork-Id: 12813330 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 85321C433F5 for ; Thu, 14 Apr 2022 10:49:33 +0000 (UTC) Received: from localhost ([::1]:50558 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nex2a-0006o7-I9 for qemu-devel@archiver.kernel.org; Thu, 14 Apr 2022 06:49:32 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:55576) by lists.gnu.org with esmtps 
(TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nex0p-0002Xz-JF for qemu-devel@nongnu.org; Thu, 14 Apr 2022 06:47:43 -0400
From: Yi Liu
To: alex.williamson@redhat.com, cohuck@redhat.com, qemu-devel@nongnu.org
Subject: [RFC 13/18] vfio/container-obj: Introduce VFIOContainer reset callback
Date: Thu, 14 Apr 2022 03:47:05 -0700
Message-Id: <20220414104710.28534-14-yi.l.liu@intel.com>
X-Mailer: git-send-email 2.27.0
In-Reply-To: <20220414104710.28534-1-yi.l.liu@intel.com>
References: <20220414104710.28534-1-yi.l.liu@intel.com>

From: Eric Auger

Reset implementation depends on the container backend. Let's introduce a VFIOContainer class function and register a generic reset handler that will be able to call the right reset function depending on the container type. Also, let's move the registration/unregistration to a place that is not backend-specific (first vfio address space created instead of the first group).
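The registration scheme described above can be sketched in isolation as follows. This is only an illustration under stated assumptions: plain singly linked lists replace the QLIST macros, the `handler_registered` flag stands in for `qemu_register_reset()` / `qemu_unregister_reset()`, and `Container`, `Space`, `demo_reset`, `space_add` and `space_remove` are hypothetical names that do not appear in the series.

```c
#include <errno.h>
#include <stddef.h>

typedef struct Container Container;
struct Container {
    int (*reset)(Container *c);   /* backend-specific, may be NULL */
    int resets;                   /* demo bookkeeping only */
    Container *next;
};

typedef struct Space {
    Container *containers;
    struct Space *next;
} Space;

Space *spaces;             /* all live address spaces */
int handler_registered;    /* stands in for qemu_register_reset() */

/* Example backend reset: just counts how often it was called. */
int demo_reset(Container *c)
{
    c->resets++;
    return 0;
}

/* Generic wrapper: dispatch to the backend, or report "not implemented". */
int container_reset(Container *c)
{
    return c->reset ? c->reset(c) : -ENOENT;
}

/* Backend-agnostic handler: visit every container of every space. */
void reset_handler(void *opaque)
{
    (void)opaque;
    for (Space *s = spaces; s; s = s->next) {
        for (Container *c = s->containers; c; c = c->next) {
            container_reset(c);
        }
    }
}

void space_add(Space *s)
{
    if (!spaces) {
        handler_registered = 1;   /* first space: register once */
    }
    s->next = spaces;
    spaces = s;
}

void space_remove(Space *s)
{
    for (Space **p = &spaces; *p; p = &(*p)->next) {
        if (*p == s) {
            *p = s->next;
            break;
        }
    }
    if (!spaces) {
        handler_registered = 0;   /* last space gone: unregister */
    }
}
```

Tying the registration to the address-space list rather than to a group list is what keeps the handler independent of any one backend in this sketch.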
Signed-off-by: Eric Auger Signed-off-by: Yi Liu --- hw/vfio/as.c | 18 ++++++++++++++++++ hw/vfio/container-obj.c | 13 +++++++++++++ hw/vfio/container.c | 26 ++++++++++++++------------ include/hw/vfio/vfio-container-obj.h | 2 ++ 4 files changed, 47 insertions(+), 12 deletions(-) diff --git a/hw/vfio/as.c b/hw/vfio/as.c index 30e86f6833..4abaa4068f 100644 --- a/hw/vfio/as.c +++ b/hw/vfio/as.c @@ -847,6 +847,18 @@ const MemoryListener vfio_memory_listener = { .log_sync = vfio_listener_log_sync, }; +void vfio_reset_handler(void *opaque) +{ + VFIOAddressSpace *space; + VFIOContainer *bcontainer; + + QLIST_FOREACH(space, &vfio_address_spaces, list) { + QLIST_FOREACH(bcontainer, &space->containers, next) { + vfio_container_reset(bcontainer); + } + } +} + VFIOAddressSpace *vfio_get_address_space(AddressSpace *as) { VFIOAddressSpace *space; @@ -862,6 +874,9 @@ VFIOAddressSpace *vfio_get_address_space(AddressSpace *as) space->as = as; QLIST_INIT(&space->containers); + if (QLIST_EMPTY(&vfio_address_spaces)) { + qemu_register_reset(vfio_reset_handler, NULL); + } QLIST_INSERT_HEAD(&vfio_address_spaces, space, list); return space; @@ -873,6 +888,9 @@ void vfio_put_address_space(VFIOAddressSpace *space) QLIST_REMOVE(space, list); g_free(space); } + if (QLIST_EMPTY(&vfio_address_spaces)) { + qemu_unregister_reset(vfio_reset_handler, NULL); + } } static VFIOContainerClass * diff --git a/hw/vfio/container-obj.c b/hw/vfio/container-obj.c index 40c1e2a2b5..c4220336af 100644 --- a/hw/vfio/container-obj.c +++ b/hw/vfio/container-obj.c @@ -68,6 +68,19 @@ int vfio_container_dma_unmap(VFIOContainer *container, return vccs->dma_unmap(container, iova, size, iotlb); } +int vfio_container_reset(VFIOContainer *container) +{ + VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container); + + vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container); + + if (!vccs->reset) { + return -ENOENT; + } + + return vccs->reset(container); +} + void vfio_container_set_dirty_page_tracking(VFIOContainer 
*container, bool start) { diff --git a/hw/vfio/container.c b/hw/vfio/container.c index 74febc1567..2f59422048 100644 --- a/hw/vfio/container.c +++ b/hw/vfio/container.c @@ -458,12 +458,15 @@ vfio_legacy_container_del_section_window(VFIOContainer *bcontainer, } } -void vfio_reset_handler(void *opaque) +static int vfio_legacy_container_reset(VFIOContainer *bcontainer) { + VFIOLegacyContainer *container = container_of(bcontainer, + VFIOLegacyContainer, obj); VFIOGroup *group; VFIODevice *vbasedev; + int ret, final_ret = 0; - QLIST_FOREACH(group, &vfio_group_list, next) { + QLIST_FOREACH(group, &container->group_list, container_next) { QLIST_FOREACH(vbasedev, &group->device_list, next) { if (vbasedev->dev->realized) { vbasedev->ops->vfio_compute_needs_reset(vbasedev); @@ -471,13 +474,19 @@ void vfio_reset_handler(void *opaque) } } - QLIST_FOREACH(group, &vfio_group_list, next) { + QLIST_FOREACH(group, &container->group_list, container_next) { QLIST_FOREACH(vbasedev, &group->device_list, next) { if (vbasedev->dev->realized && vbasedev->needs_reset) { - vbasedev->ops->vfio_hot_reset_multi(vbasedev); + ret = vbasedev->ops->vfio_hot_reset_multi(vbasedev); + if (ret) { + error_report("failed to reset %s (%d)", + vbasedev->name, ret); + final_ret = ret; + } } } } + return final_ret; } static void vfio_kvm_device_add_group(VFIOGroup *group) @@ -1004,10 +1013,6 @@ static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp) goto close_fd_exit; } - if (QLIST_EMPTY(&vfio_group_list)) { - qemu_register_reset(vfio_reset_handler, NULL); - } - QLIST_INSERT_HEAD(&vfio_group_list, group, next); return group; @@ -1036,10 +1041,6 @@ static void vfio_put_group(VFIOGroup *group) trace_vfio_put_group(group->fd); close(group->fd); g_free(group); - - if (QLIST_EMPTY(&vfio_group_list)) { - qemu_unregister_reset(vfio_reset_handler, NULL); - } } static int vfio_get_device(VFIOGroup *group, const char *name, @@ -1293,6 +1294,7 @@ static void vfio_legacy_container_class_init(ObjectClass
*klass, vccs->add_window = vfio_legacy_container_add_section_window; vccs->del_window = vfio_legacy_container_del_section_window; vccs->check_extension = vfio_legacy_container_check_extension; + vccs->reset = vfio_legacy_container_reset; vccs->attach_device = legacy_attach_device; vccs->detach_device = legacy_detach_device; } diff --git a/include/hw/vfio/vfio-container-obj.h b/include/hw/vfio/vfio-container-obj.h index ebc1340530..ffd8590ff8 100644 --- a/include/hw/vfio/vfio-container-obj.h +++ b/include/hw/vfio/vfio-container-obj.h @@ -118,6 +118,7 @@ typedef struct VFIOContainerClass { int (*dma_unmap)(VFIOContainer *container, hwaddr iova, ram_addr_t size, IOMMUTLBEntry *iotlb); + int (*reset)(VFIOContainer *container); /* migration feature */ bool (*devices_all_dirty_tracking)(VFIOContainer *container); void (*set_dirty_page_tracking)(VFIOContainer *container, bool start); @@ -142,6 +143,7 @@ int vfio_container_dma_map(VFIOContainer *container, int vfio_container_dma_unmap(VFIOContainer *container, hwaddr iova, ram_addr_t size, IOMMUTLBEntry *iotlb); +int vfio_container_reset(VFIOContainer *container); bool vfio_container_devices_all_dirty_tracking(VFIOContainer *container); void vfio_container_set_dirty_page_tracking(VFIOContainer *container, bool start); From patchwork Thu Apr 14 10:47:06 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yi Liu X-Patchwork-Id: 12813348 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 52422C433F5 for ; Thu, 14 Apr 2022 10:55:17 +0000 (UTC) Received: from localhost ([::1]:39038 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nex88-00019i-DU for 
qemu-devel@archiver.kernel.org; Thu, 14 Apr 2022 06:55:16 -0400
From: Yi Liu
To: alex.williamson@redhat.com, cohuck@redhat.com, qemu-devel@nongnu.org
Subject: [RFC 14/18] hw/iommufd: Creation
Date: Thu, 14 Apr 2022 03:47:06 -0700
Message-Id: <20220414104710.28534-15-yi.l.liu@intel.com>
X-Mailer: git-send-email 2.27.0
In-Reply-To: <20220414104710.28534-1-yi.l.liu@intel.com>
References: <20220414104710.28534-1-yi.l.liu@intel.com>

Introduce the iommufd utility library, which can be compiled out through the CONFIG_IOMMUFD configuration option. This code is meant to be called by several subsystems: vdpa and vfio.
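Because several subsystems are expected to share the library, one descriptor has to be opened once and released by its last user. The following is a minimal sketch of that get/put pattern, assuming a POSIX mutex in place of QemuMutex and a caller-supplied path in place of the fixed /dev/iommu. The names `shared_fd_get` and `shared_fd_put` are hypothetical, and unlike the patch this sketch returns `-errno` when the open fails.

```c
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <unistd.h>

static pthread_mutex_t fd_lock = PTHREAD_MUTEX_INITIALIZER;
static uint32_t fd_users;     /* number of outstanding get() calls */
static int shared_fd = -1;    /* -1 means "not open" */

/* Open the device on first use, otherwise just bump the user count. */
int shared_fd_get(const char *path)
{
    int ret;

    pthread_mutex_lock(&fd_lock);
    if (shared_fd < 0) {
        shared_fd = open(path, O_RDWR);
        if (shared_fd < 0) {
            ret = -errno;         /* open() left shared_fd at -1 */
        } else {
            fd_users = 1;
            ret = shared_fd;
        }
    } else if (fd_users == UINT32_MAX) {
        ret = -E2BIG;             /* refuse to let the counter wrap */
    } else {
        fd_users++;
        ret = shared_fd;
    }
    pthread_mutex_unlock(&fd_lock);
    return ret;
}

/* Drop one reference; the last user closes the descriptor. */
void shared_fd_put(void)
{
    pthread_mutex_lock(&fd_lock);
    if (fd_users > 0 && --fd_users == 0) {
        close(shared_fd);
        shared_fd = -1;
    }
    pthread_mutex_unlock(&fd_lock);
}
```

Checking the counter before incrementing it avoids having to undo the increment on overflow, which is a small simplification over a pre-increment test.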
Co-authored-by: Eric Auger Signed-off-by: Eric Auger Signed-off-by: Yi Liu --- MAINTAINERS | 7 ++ hw/Kconfig | 1 + hw/iommufd/Kconfig | 4 + hw/iommufd/iommufd.c | 209 +++++++++++++++++++++++++++++++++++ hw/iommufd/meson.build | 1 + hw/iommufd/trace-events | 11 ++ hw/iommufd/trace.h | 1 + hw/meson.build | 1 + include/hw/iommufd/iommufd.h | 37 +++++++ meson.build | 1 + 10 files changed, 273 insertions(+) create mode 100644 hw/iommufd/Kconfig create mode 100644 hw/iommufd/iommufd.c create mode 100644 hw/iommufd/meson.build create mode 100644 hw/iommufd/trace-events create mode 100644 hw/iommufd/trace.h create mode 100644 include/hw/iommufd/iommufd.h diff --git a/MAINTAINERS b/MAINTAINERS index 4ad2451e03..f6bcb25f7f 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1954,6 +1954,13 @@ F: hw/vfio/ap.c F: docs/system/s390x/vfio-ap.rst L: qemu-s390x@nongnu.org +iommufd +M: Yi Liu +M: Eric Auger +S: Supported +F: hw/iommufd/* +F: include/hw/iommufd/* + vhost M: Michael S. Tsirkin S: Supported diff --git a/hw/Kconfig b/hw/Kconfig index ad20cce0a9..d270d44760 100644 --- a/hw/Kconfig +++ b/hw/Kconfig @@ -63,6 +63,7 @@ source sparc/Kconfig source sparc64/Kconfig source tricore/Kconfig source xtensa/Kconfig +source iommufd/Kconfig # Symbols used by multiple targets config TEST_DEVICES diff --git a/hw/iommufd/Kconfig b/hw/iommufd/Kconfig new file mode 100644 index 0000000000..4b1b00e36b --- /dev/null +++ b/hw/iommufd/Kconfig @@ -0,0 +1,4 @@ +config IOMMUFD + bool + default y + depends on LINUX diff --git a/hw/iommufd/iommufd.c b/hw/iommufd/iommufd.c new file mode 100644 index 0000000000..4e8179d612 --- /dev/null +++ b/hw/iommufd/iommufd.c @@ -0,0 +1,209 @@ +/* + * QEMU IOMMUFD + * + * Copyright (C) 2022 Intel Corporation. + * Copyright Red Hat, Inc. 
2022 + * + * Authors: Yi Liu + * Eric Auger + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + + * You should have received a copy of the GNU General Public License along + * with this program; if not, see . + */ + +#include "qemu/osdep.h" +#include "qapi/error.h" +#include "qemu/error-report.h" +#include "qemu/thread.h" +#include "qemu/module.h" +#include +#include +#include "hw/iommufd/iommufd.h" +#include "trace.h" + +static QemuMutex iommufd_lock; +static uint32_t iommufd_users; +static int iommufd = -1; + +static int iommufd_get(void) +{ + qemu_mutex_lock(&iommufd_lock); + if (iommufd == -1) { + iommufd = qemu_open_old("/dev/iommu", O_RDWR); + if (iommufd < 0) { + error_report("Failed to open /dev/iommu!"); + } else { + iommufd_users = 1; + } + trace_iommufd_get(iommufd); + } else if (++iommufd_users == UINT32_MAX) { + error_report("Failed to get iommufd: %d, count overflow", iommufd); + iommufd_users--; + qemu_mutex_unlock(&iommufd_lock); + return -E2BIG; + } + qemu_mutex_unlock(&iommufd_lock); + return iommufd; +} + +static void iommufd_put(int fd) +{ + qemu_mutex_lock(&iommufd_lock); + if (--iommufd_users) { + qemu_mutex_unlock(&iommufd_lock); + return; + } + iommufd = -1; + trace_iommufd_put(fd); + close(fd); + qemu_mutex_unlock(&iommufd_lock); +} + +static int iommufd_alloc_ioas(int iommufd, uint32_t *ioas) +{ + int ret; + struct iommu_ioas_alloc alloc_data = { + .size = sizeof(alloc_data), + .flags = 0, + }; + + ret = ioctl(iommufd, IOMMU_IOAS_ALLOC, &alloc_data); + if (ret) { + error_report("Failed 
to allocate ioas %m"); + } + + *ioas = alloc_data.out_ioas_id; + trace_iommufd_alloc_ioas(iommufd, *ioas, ret); + + return ret; +} + +static void iommufd_free_ioas(int iommufd, uint32_t ioas) +{ + int ret; + struct iommu_destroy des = { + .size = sizeof(des), + .id = ioas, + }; + + ret = ioctl(iommufd, IOMMU_DESTROY, &des); + trace_iommufd_free_ioas(iommufd, ioas, ret); + if (ret) { + error_report("Failed to free ioas: %u %m", ioas); + } +} + +int iommufd_get_ioas(int *fd, uint32_t *ioas_id) +{ + int ret; + + *fd = iommufd_get(); + if (*fd < 0) { + return *fd; + } + + ret = iommufd_alloc_ioas(*fd, ioas_id); + trace_iommufd_get_ioas(*fd, *ioas_id, ret); + if (ret) { + iommufd_put(*fd); + } + return ret; +} + +void iommufd_put_ioas(int iommufd, uint32_t ioas) +{ + trace_iommufd_put_ioas(iommufd, ioas); + iommufd_free_ioas(iommufd, ioas); + iommufd_put(iommufd); +} + +int iommufd_unmap_dma(int iommufd, uint32_t ioas, + hwaddr iova, ram_addr_t size) +{ + int ret; + struct iommu_ioas_unmap unmap = { + .size = sizeof(unmap), + .ioas_id = ioas, + .iova = iova, + .length = size, + }; + + ret = ioctl(iommufd, IOMMU_IOAS_UNMAP, &unmap); + trace_iommufd_unmap_dma(iommufd, ioas, iova, size, ret); + if (ret) { + error_report("IOMMU_IOAS_UNMAP failed: %s", strerror(errno)); + } + return !ret ? 0 : -errno; +} + +int iommufd_map_dma(int iommufd, uint32_t ioas, hwaddr iova, + ram_addr_t size, void *vaddr, bool readonly) +{ + int ret; + struct iommu_ioas_map map = { + .size = sizeof(map), + .flags = IOMMU_IOAS_MAP_READABLE | + IOMMU_IOAS_MAP_FIXED_IOVA, + .ioas_id = ioas, + .__reserved = 0, + .user_va = (int64_t)vaddr, + .iova = iova, + .length = size, + }; + + if (!readonly) { + map.flags |= IOMMU_IOAS_MAP_WRITEABLE; + } + + ret = ioctl(iommufd, IOMMU_IOAS_MAP, &map); + trace_iommufd_map_dma(iommufd, ioas, iova, size, vaddr, readonly, ret); + if (ret) { + error_report("IOMMU_IOAS_MAP failed: %s", strerror(errno)); + } + return !ret ? 
0 : -errno; +} + +int iommufd_copy_dma(int iommufd, uint32_t src_ioas, uint32_t dst_ioas, + hwaddr iova, ram_addr_t size, bool readonly) +{ + int ret; + struct iommu_ioas_copy copy = { + .size = sizeof(copy), + .flags = IOMMU_IOAS_MAP_READABLE | + IOMMU_IOAS_MAP_FIXED_IOVA, + .dst_ioas_id = dst_ioas, + .src_ioas_id = src_ioas, + .length = size, + .dst_iova = iova, + .src_iova = iova, + }; + + if (!readonly) { + copy.flags |= IOMMU_IOAS_MAP_WRITEABLE; + } + + ret = ioctl(iommufd, IOMMU_IOAS_COPY, &copy); + trace_iommufd_copy_dma(iommufd, src_ioas, dst_ioas, + iova, size, readonly, ret); + if (ret) { + error_report("IOMMU_IOAS_COPY failed: %s", strerror(errno)); + } + return !ret ? 0 : -errno; +} + +static void iommufd_register_types(void) +{ + qemu_mutex_init(&iommufd_lock); +} + +type_init(iommufd_register_types) diff --git a/hw/iommufd/meson.build b/hw/iommufd/meson.build new file mode 100644 index 0000000000..515bc40cbe --- /dev/null +++ b/hw/iommufd/meson.build @@ -0,0 +1 @@ +specific_ss.add(when: 'CONFIG_IOMMUFD', if_true: files('iommufd.c')) diff --git a/hw/iommufd/trace-events b/hw/iommufd/trace-events new file mode 100644 index 0000000000..615d80cdf4 --- /dev/null +++ b/hw/iommufd/trace-events @@ -0,0 +1,11 @@ +# See docs/devel/tracing.rst for syntax documentation.
+ +iommufd_get(int iommufd) " iommufd=%d" +iommufd_put(int iommufd) " iommufd=%d" +iommufd_alloc_ioas(int iommufd, uint32_t ioas, int ret) " iommufd=%d ioas=%d (%d)" +iommufd_free_ioas(int iommufd, uint32_t ioas, int ret) " iommufd=%d ioas=%d (%d)" +iommufd_get_ioas(int iommufd, uint32_t ioas, int ret) " iommufd=%d ioas=%d (%d)" +iommufd_put_ioas(int iommufd, uint32_t ioas) " iommufd=%d ioas=%d" +iommufd_unmap_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)" +iommufd_map_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, void *vaddr, bool readonly, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" addr=%p readonly=%d (%d)" +iommufd_copy_dma(int iommufd, uint32_t src_ioas, uint32_t dst_ioas, uint64_t iova, uint64_t size, bool readonly, int ret) " iommufd=%d src_ioas=%d dst_ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" readonly=%d (%d)" diff --git a/hw/iommufd/trace.h b/hw/iommufd/trace.h new file mode 100644 index 0000000000..3fb40b0932 --- /dev/null +++ b/hw/iommufd/trace.h @@ -0,0 +1 @@ +#include "trace/trace-hw_iommufd.h" diff --git a/hw/meson.build b/hw/meson.build index b3366c888e..ffb5203265 100644 --- a/hw/meson.build +++ b/hw/meson.build @@ -38,6 +38,7 @@ subdir('timer') subdir('tpm') subdir('usb') subdir('vfio') +subdir('iommufd') subdir('virtio') subdir('watchdog') subdir('xen') diff --git a/include/hw/iommufd/iommufd.h b/include/hw/iommufd/iommufd.h new file mode 100644 index 0000000000..59835cddca --- /dev/null +++ b/include/hw/iommufd/iommufd.h @@ -0,0 +1,37 @@ +/* + * QEMU IOMMUFD + * + * Copyright (C) 2022 Intel Corporation. + * Copyright Red Hat, Inc. 
2022 + * + * Authors: Yi Liu + * Eric Auger + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + + * You should have received a copy of the GNU General Public License along + * with this program; if not, see . + */ + +#ifndef HW_IOMMUFD_IOMMUFD_H +#define HW_IOMMUFD_IOMMUFD_H +#include "exec/hwaddr.h" +#include "exec/cpu-common.h" + +int iommufd_get_ioas(int *fd, uint32_t *ioas_id); +void iommufd_put_ioas(int fd, uint32_t ioas_id); +int iommufd_unmap_dma(int iommufd, uint32_t ioas, hwaddr iova, ram_addr_t size); +int iommufd_map_dma(int iommufd, uint32_t ioas, hwaddr iova, + ram_addr_t size, void *vaddr, bool readonly); +int iommufd_copy_dma(int iommufd, uint32_t src_ioas, uint32_t dst_ioas, + hwaddr iova, ram_addr_t size, bool readonly); +bool iommufd_supported(void); +#endif /* HW_IOMMUFD_IOMMUFD_H */ diff --git a/meson.build b/meson.build index 861de93c4f..45caa53db6 100644 --- a/meson.build +++ b/meson.build @@ -2755,6 +2755,7 @@ if have_system 'hw/tpm', 'hw/usb', 'hw/vfio', + 'hw/iommufd', 'hw/virtio', 'hw/watchdog', 'hw/xen', From patchwork Thu Apr 14 10:47:07 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yi Liu X-Patchwork-Id: 12813363 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by 
smtp.lore.kernel.org (Postfix) with ESMTPS id 86780C433F5 for ; Thu, 14 Apr 2022 11:10:00 +0000 (UTC) Received: from localhost ([::1]:35282 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nexMN-00023j-JA for qemu-devel@archiver.kernel.org; Thu, 14 Apr 2022 07:09:59 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:55654) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nex13-0002y4-Cc for qemu-devel@nongnu.org; Thu, 14 Apr 2022 06:47:57 -0400 Received: from mga12.intel.com ([192.55.52.136]:34768) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nex0z-0005Ka-Tw for qemu-devel@nongnu.org; Thu, 14 Apr 2022 06:47:57 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1649933273; x=1681469273; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=q+vT2BEKQEHbHyLdWl79aLwAJYUV5G4FFgW26Q2IqvA=; b=OH34sCGsJej6a20nMaJm4lJ7yjlcKvWp/uCgK/V34f02gKo1X2wObHKj yG6g+1SKX411mhCJpUIJC+HDtG77e7JSChjnwPWIvKhHnUF1QHrCzLSXV HaH+5ivuh1cbnrKIMjQVN+aYHzB+B+0f3GiIkUSpv3tEW5EYKgurzthQF tb7TK3yX/6LBlevl+LANV/ZSIOLJLgfSy8ybOpthS8U2AcGYlLgds6PE/ 9UbF1MiD6YKGEpEMU55W4JQbiV8uBWbIAGW8TwYDZOPnRYXY/PyaMU1bI 3ndfXXYbTbHesa9YumdQ8I1jJ43zqIGLgPTdk9I3YKtnynNPylZkmw8OF w==; X-IronPort-AV: E=McAfee;i="6400,9594,10316"; a="242836516" X-IronPort-AV: E=Sophos;i="5.90,259,1643702400"; d="scan'208";a="242836516" Received: from fmsmga006.fm.intel.com ([10.253.24.20]) by fmsmga106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 Apr 2022 03:47:23 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,259,1643702400"; d="scan'208";a="803091252" Received: from 984fee00a4c6.jf.intel.com ([10.165.58.231]) by fmsmga006.fm.intel.com with ESMTP; 14 Apr 2022 03:47:22 -0700 From: Yi Liu To: alex.williamson@redhat.com, cohuck@redhat.com, 
qemu-devel@nongnu.org Subject: [RFC 15/18] vfio/iommufd: Implement iommufd backend Date: Thu, 14 Apr 2022 03:47:07 -0700 Message-Id: <20220414104710.28534-16-yi.l.liu@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220414104710.28534-1-yi.l.liu@intel.com> References: <20220414104710.28534-1-yi.l.liu@intel.com> MIME-Version: 1.0 Cc: akrowiak@linux.ibm.com, jjherne@linux.ibm.com, thuth@redhat.com, yi.l.liu@intel.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com, jasowang@redhat.com, farman@linux.ibm.com, peterx@redhat.com, pasic@linux.ibm.com, eric.auger@redhat.com, yi.y.sun@intel.com, chao.p.peng@intel.com, nicolinc@nvidia.com, kevin.tian@intel.com, jgg@nvidia.com, eric.auger.pro@gmail.com, david@gibson.dropbear.id.au

Add the iommufd backend. The IOMMUFD container class is implemented on top of the new /dev/iommu user API. This backend depends on CONFIG_IOMMUFD. The iommufd backend does not yet support live migration or cache coherency because the host kernel lacks the required support, so only a subset of the container class callbacks is implemented.
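As a rough illustration of what a partially implemented container class can look like, the standalone sketch below models a table of optional callbacks in which a missing callback is reported as unsupported instead of being called. All names here (SketchContainerOps, sketch_backend_map, sketch_dma_map, sketch_set_dirty_tracking) are hypothetical and are not the QEMU definitions; this is a simplified model under those assumptions, not the code in this series.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical, simplified stand-in for a container class: every
 * callback is optional and a backend may leave it NULL.
 */
typedef struct SketchContainerOps {
    int (*dma_map)(uint64_t iova, uint64_t size);
    int (*set_dirty_tracking)(bool start);   /* migration-related */
} SketchContainerOps;

/* Assumed backend that only knows how to map; rejects empty ranges. */
static int sketch_backend_map(uint64_t iova, uint64_t size)
{
    (void)iova;
    return size ? 0 : -EINVAL;
}

static const SketchContainerOps sketch_iommufd_like_ops = {
    .dma_map = sketch_backend_map,
    /* .set_dirty_tracking is intentionally left NULL */
};

/* Wrappers check for a missing callback before dispatching. */
static int sketch_dma_map(const SketchContainerOps *ops,
                          uint64_t iova, uint64_t size)
{
    assert(ops != NULL);
    if (!ops->dma_map) {
        return -ENOTSUP;
    }
    return ops->dma_map(iova, size);
}

static int sketch_set_dirty_tracking(const SketchContainerOps *ops,
                                     bool start)
{
    assert(ops != NULL);
    if (!ops->set_dirty_tracking) {
        return -ENOTSUP;
    }
    return ops->set_dirty_tracking(start);
}
```

With this shape, callers can probe a feature by calling the wrapper and treating -ENOTSUP as "not provided by this backend", which keeps the missing migration support from turning into a NULL dereference.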
Co-authored-by: Eric Auger Signed-off-by: Eric Auger Signed-off-by: Yi Liu --- hw/vfio/as.c | 2 +- hw/vfio/iommufd.c | 545 +++++++++++++++++++++++++++ hw/vfio/meson.build | 3 + hw/vfio/pci.c | 10 + hw/vfio/trace-events | 11 + include/hw/vfio/vfio-common.h | 18 + include/hw/vfio/vfio-container-obj.h | 1 + 7 files changed, 589 insertions(+), 1 deletion(-) create mode 100644 hw/vfio/iommufd.c diff --git a/hw/vfio/as.c b/hw/vfio/as.c index 4abaa4068f..94618efd1f 100644 --- a/hw/vfio/as.c +++ b/hw/vfio/as.c @@ -41,7 +41,7 @@ #include "qapi/error.h" #include "migration/migration.h" -static QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces = +VFIOAddressSpaceList vfio_address_spaces = QLIST_HEAD_INITIALIZER(vfio_address_spaces); void vfio_host_win_add(VFIOContainer *container, diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c new file mode 100644 index 0000000000..f8375f1672 --- /dev/null +++ b/hw/vfio/iommufd.c @@ -0,0 +1,545 @@ +/* + * iommufd container backend + * + * Copyright (C) 2022 Intel Corporation. + * Copyright Red Hat, Inc. 2022 + * + * Authors: Yi Liu + * Eric Auger + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + + * You should have received a copy of the GNU General Public License along + * with this program; if not, see . 
+ */ + +#include "qemu/osdep.h" +#include +#include + +#include "hw/vfio/vfio-common.h" +#include "qemu/error-report.h" +#include "trace.h" +#include "qapi/error.h" +#include "hw/iommufd/iommufd.h" +#include "hw/qdev-core.h" +#include "sysemu/reset.h" +#include "qemu/cutils.h" + +static bool iommufd_check_extension(VFIOContainer *bcontainer, + VFIOContainerFeature feat) +{ + switch (feat) { + default: + return false; + }; +} + +static int iommufd_map(VFIOContainer *bcontainer, hwaddr iova, + ram_addr_t size, void *vaddr, bool readonly) +{ + VFIOIOMMUFDContainer *container = container_of(bcontainer, + VFIOIOMMUFDContainer, obj); + + return iommufd_map_dma(container->iommufd, container->ioas_id, + iova, size, vaddr, readonly); +} + +static int iommufd_unmap(VFIOContainer *bcontainer, + hwaddr iova, ram_addr_t size, + IOMMUTLBEntry *iotlb) +{ + VFIOIOMMUFDContainer *container = container_of(bcontainer, + VFIOIOMMUFDContainer, obj); + + /* TODO: Handle dma_unmap_bitmap with iotlb args (migration) */ + return iommufd_unmap_dma(container->iommufd, + container->ioas_id, iova, size); +} + +static int vfio_get_devicefd(const char *sysfs_path, Error **errp) +{ + long int vfio_id = -1, ret = -ENOTTY; + char *path, *tmp = NULL; + DIR *dir; + struct dirent *dent; + struct stat st; + gchar *contents; + gsize length; + int major, minor; + dev_t vfio_devt; + + path = g_strdup_printf("%s/vfio-device", sysfs_path); + if (stat(path, &st) < 0) { + error_setg_errno(errp, errno, "no such host device"); + goto out; + } + + dir = opendir(path); + if (!dir) { + error_setg_errno(errp, errno, "couldn't open directory %s", path); + goto out; + } + + while ((dent = readdir(dir))) { + const char *end_name; + + if (!strncmp(dent->d_name, "vfio", 4)) { + ret = qemu_strtol(dent->d_name + 4, &end_name, 10, &vfio_id); + if (ret) { + error_setg(errp, "suspicious vfio* file in %s", path); + goto out; + } + break; + } + } + + /* check if the major:minor matches */ + tmp = g_strdup_printf("%s/%s/dev",
path, dent->d_name); + if (!g_file_get_contents(tmp, &contents, &length, NULL)) { + error_setg(errp, "failed to load \"%s\"", tmp); + goto out; + } + + if (sscanf(contents, "%d:%d", &major, &minor) != 2) { + error_setg(errp, "failed to parse \"%s\"", tmp); + goto out; + } + g_free(contents); + g_free(tmp); + + tmp = g_strdup_printf("/dev/vfio/devices/vfio%ld", vfio_id); + if (stat(tmp, &st) < 0) { + error_setg_errno(errp, errno, "no such vfio device"); + goto out; + } + vfio_devt = makedev(major, minor); + if (st.st_rdev != vfio_devt) { + error_setg(errp, "devt mismatch: %lu, %lu", vfio_devt, st.st_rdev); + goto out; + } + + ret = qemu_open_old(tmp, O_RDWR); + if (ret < 0) { + error_setg(errp, "Failed to open %s", tmp); + } + trace_vfio_iommufd_get_devicefd(tmp, ret); +out: + g_free(tmp); + + if (*errp) { + error_prepend(errp, VFIO_MSG_PREFIX, path); + } + g_free(path); + return ret; +} + +static VFIOIOASHwpt *vfio_container_get_hwpt(VFIOIOMMUFDContainer *container, + uint32_t hwpt_id) +{ + VFIOIOASHwpt *hwpt; + + QLIST_FOREACH(hwpt, &container->hwpt_list, next) { + if (hwpt->hwpt_id == hwpt_id) { + return hwpt; + } + } + + hwpt = g_malloc0(sizeof(*hwpt)); + + hwpt->hwpt_id = hwpt_id; + QLIST_INIT(&hwpt->device_list); + QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next); + + return hwpt; +} + +static void vfio_container_put_hwpt(VFIOIOASHwpt *hwpt) +{ + if (!QLIST_EMPTY(&hwpt->device_list)) { + g_assert_not_reached(); + } + QLIST_REMOVE(hwpt, next); + g_free(hwpt); +} + +static VFIOIOASHwpt *vfio_find_hwpt_for_dev(VFIOIOMMUFDContainer *container, + VFIODevice *vbasedev) +{ + VFIOIOASHwpt *hwpt; + VFIODevice *vbasedev_iter; + + QLIST_FOREACH(hwpt, &container->hwpt_list, next) { + QLIST_FOREACH(vbasedev_iter, &hwpt->device_list, hwpt_next) { + if (vbasedev_iter == vbasedev) { + return hwpt; + } + } + } + return NULL; +} + +static void +__vfio_device_detach_container(VFIODevice *vbasedev, + VFIOIOMMUFDContainer *container, Error **errp) +{ + struct
vfio_device_detach_ioas detach_data = { + .argsz = sizeof(detach_data), + .flags = 0, + .iommufd = container->iommufd, + .ioas_id = container->ioas_id, + }; + + if (ioctl(vbasedev->fd, VFIO_DEVICE_DETACH_IOAS, &detach_data)) { + error_setg_errno(errp, errno, "detach %s from ioas id=%d failed", + vbasedev->name, container->ioas_id); + } + trace_vfio_iommufd_detach_device(container->iommufd, vbasedev->name, + container->ioas_id); + + /* iommufd unbind is done per device fd close */ +} + +static void vfio_device_detach_container(VFIODevice *vbasedev, + VFIOIOMMUFDContainer *container, + Error **errp) +{ + VFIOIOASHwpt *hwpt; + + hwpt = vfio_find_hwpt_for_dev(container, vbasedev); + if (hwpt) { + QLIST_REMOVE(vbasedev, hwpt_next); + if (QLIST_EMPTY(&hwpt->device_list)) { + vfio_container_put_hwpt(hwpt); + } + } + + __vfio_device_detach_container(vbasedev, container, errp); +} + +static int vfio_device_attach_container(VFIODevice *vbasedev, + VFIOIOMMUFDContainer *container, + Error **errp) +{ + struct vfio_device_bind_iommufd bind = { + .argsz = sizeof(bind), + .flags = 0, + .iommufd = container->iommufd, + .dev_cookie = (uint64_t)vbasedev, + }; + struct vfio_device_attach_ioas attach_data = { + .argsz = sizeof(attach_data), + .flags = 0, + .iommufd = container->iommufd, + .ioas_id = container->ioas_id, + }; + VFIOIOASHwpt *hwpt; + int ret; + + /* Bind device to iommufd */ + ret = ioctl(vbasedev->fd, VFIO_DEVICE_BIND_IOMMUFD, &bind); + if (ret) { + error_setg_errno(errp, errno, "error bind device fd=%d to iommufd=%d", + vbasedev->fd, bind.iommufd); + return ret; + } + + vbasedev->devid = bind.out_devid; + trace_vfio_iommufd_bind_device(bind.iommufd, vbasedev->name, + vbasedev->fd, vbasedev->devid); + + /* Attach device to an ioas within iommufd */ + ret = ioctl(vbasedev->fd, VFIO_DEVICE_ATTACH_IOAS, &attach_data); + if (ret) { + error_setg_errno(errp, errno, + "[iommufd=%d] error attach %s (%d) to ioasid=%d", + container->iommufd, vbasedev->name, vbasedev->fd, + 
attach_data.ioas_id); + return ret; + + } + trace_vfio_iommufd_attach_device(bind.iommufd, vbasedev->name, + vbasedev->fd, container->ioas_id, + attach_data.out_hwpt_id); + + hwpt = vfio_container_get_hwpt(container, attach_data.out_hwpt_id); + + QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next); + return 0; +} + +static int vfio_device_reset(VFIODevice *vbasedev) +{ + if (vbasedev->dev->realized) { + vbasedev->ops->vfio_compute_needs_reset(vbasedev); + if (vbasedev->needs_reset) { + return vbasedev->ops->vfio_hot_reset_multi(vbasedev); + } + } + return 0; +} + +static int vfio_iommufd_container_reset(VFIOContainer *bcontainer) +{ + VFIOIOMMUFDContainer *container; + int ret, final_ret = 0; + VFIODevice *vbasedev; + VFIOIOASHwpt *hwpt; + + container = container_of(bcontainer, VFIOIOMMUFDContainer, obj); + + QLIST_FOREACH(hwpt, &container->hwpt_list, next) { + QLIST_FOREACH(vbasedev, &hwpt->device_list, hwpt_next) { + ret = vfio_device_reset(vbasedev); + if (ret) { + error_report("failed to reset %s (%d)", vbasedev->name, ret); + final_ret = ret; + } else { + trace_vfio_iommufd_container_reset(vbasedev->name); + } + } + } + return final_ret; +} + +static void vfio_iommufd_container_destroy(VFIOIOMMUFDContainer *container) +{ + vfio_container_destroy(&container->obj); + g_free(container); +} + +static int vfio_ram_block_discard_disable(bool state) +{ + /* + * We support coordinated discarding of RAM via the RamDiscardManager. 
+ */ + return ram_block_uncoordinated_discard_disable(state); +} + +static void iommufd_detach_device(VFIODevice *vbasedev); + +static int iommufd_attach_device(VFIODevice *vbasedev, AddressSpace *as, + Error **errp) +{ + VFIOContainer *bcontainer; + VFIOIOMMUFDContainer *container; + VFIOAddressSpace *space; + struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) }; + int ret, devfd, iommufd; + uint32_t ioas_id; + Error *err = NULL; + + devfd = vfio_get_devicefd(vbasedev->sysfsdev, errp); + if (devfd < 0) { + return devfd; + } + vbasedev->fd = devfd; + + space = vfio_get_address_space(as); + + /* try to attach to an existing container in this space */ + QLIST_FOREACH(bcontainer, &space->containers, next) { + if (!object_dynamic_cast(OBJECT(bcontainer), + TYPE_VFIO_IOMMUFD_CONTAINER)) { + continue; + } + container = container_of(bcontainer, VFIOIOMMUFDContainer, obj); + if (vfio_device_attach_container(vbasedev, container, &err)) { + const char *msg = error_get_pretty(err); + + trace_vfio_iommufd_fail_attach_existing_container(msg); + error_free(err); + err = NULL; + } else { + ret = vfio_ram_block_discard_disable(true); + if (ret) { + vfio_device_detach_container(vbasedev, container, &err); + error_propagate(errp, err); + vfio_put_address_space(space); + close(vbasedev->fd); + error_prepend(errp, + "Cannot set discarding of RAM broken (%d)", ret); + return ret; + } + goto out; + } + } + + /* Need to allocate a new dedicated container */ + ret = iommufd_get_ioas(&iommufd, &ioas_id); + if (ret < 0) { + vfio_put_address_space(space); + close(vbasedev->fd); + error_report("Failed to alloc ioas (%s)", strerror(errno)); + return ret; + } + + trace_vfio_iommufd_alloc_ioas(iommufd, ioas_id); + + container = g_malloc0(sizeof(*container)); + container->iommufd = iommufd; + container->ioas_id = ioas_id; + QLIST_INIT(&container->hwpt_list); + + bcontainer = &container->obj; + vfio_container_init(bcontainer, sizeof(*bcontainer), + TYPE_VFIO_IOMMUFD_CONTAINER, space); 
+ + ret = vfio_device_attach_container(vbasedev, container, &err); + if (ret) { + /* TODO: check whether anything else needs to be done */ + error_propagate(errp, err); + vfio_iommufd_container_destroy(container); + iommufd_put_ioas(iommufd, ioas_id); + vfio_put_address_space(space); + close(vbasedev->fd); + return ret; + } + + ret = vfio_ram_block_discard_disable(true); + if (ret) { + vfio_device_detach_container(vbasedev, container, &err); + error_propagate(errp, err); + error_prepend(errp, "Cannot set discarding of RAM broken (%d)", -ret); + vfio_iommufd_container_destroy(container); + iommufd_put_ioas(iommufd, ioas_id); + vfio_put_address_space(space); + close(vbasedev->fd); + return ret; + } + + /* + * TODO: for now iommufd BE is on par with vfio iommu type1, so it's + * fine to add the whole range as window. For SPAPR, below code + * should be updated. + */ + vfio_host_win_add(bcontainer, 0, (hwaddr)-1, 4096); + + /* + * TODO: kvmgroup, cannot be done until the protocol + * between iommufd and kvm is defined. + */ + + QLIST_INSERT_HEAD(&space->containers, bcontainer, next); + + bcontainer->listener = vfio_memory_listener; + + memory_listener_register(&bcontainer->listener, bcontainer->space->as); + + bcontainer->initialized = true; + +out: + vbasedev->container = bcontainer; + + /* + * TODO: examine RAM_BLOCK_DISCARD stuff, should we do group level + * for discarding incompatibility check as well? + */ + if (vbasedev->ram_block_discard_allowed) { + vfio_ram_block_discard_disable(false); + } + + ret = ioctl(devfd, VFIO_DEVICE_GET_INFO, &dev_info); + if (ret) { + error_setg_errno(errp, errno, "error getting device info"); + /* + * Need to use iommufd_detach_device() as this may fail after + * attaching a new device to an existing container.
+ */ + iommufd_detach_device(vbasedev); + close(vbasedev->fd); + return ret; + } + + vbasedev->group = 0; + vbasedev->num_irqs = dev_info.num_irqs; + vbasedev->num_regions = dev_info.num_regions; + vbasedev->flags = dev_info.flags; + vbasedev->reset_works = !!(dev_info.flags & VFIO_DEVICE_FLAGS_RESET); + + trace_vfio_iommufd_device_info(vbasedev->name, devfd, vbasedev->num_irqs, + vbasedev->num_regions, vbasedev->flags); + return 0; +} + +static void iommufd_detach_device(VFIODevice *vbasedev) +{ + VFIOContainer *bcontainer = vbasedev->container; + VFIOIOMMUFDContainer *container; + VFIODevice *vbasedev_iter; + VFIOIOASHwpt *hwpt; + Error *err = NULL; + + if (!bcontainer) { + goto out; + } + + if (!vbasedev->ram_block_discard_allowed) { + vfio_ram_block_discard_disable(false); + } + + container = container_of(bcontainer, VFIOIOMMUFDContainer, obj); + QLIST_FOREACH(hwpt, &container->hwpt_list, next) { + QLIST_FOREACH(vbasedev_iter, &hwpt->device_list, hwpt_next) { + if (vbasedev_iter == vbasedev) { + goto found; + } + } + } + g_assert_not_reached(); +found: + QLIST_REMOVE(vbasedev, hwpt_next); + if (QLIST_EMPTY(&hwpt->device_list)) { + vfio_container_put_hwpt(hwpt); + } + + __vfio_device_detach_container(vbasedev, container, &err); + if (err) { + error_report_err(err); + } + if (QLIST_EMPTY(&container->hwpt_list)) { + VFIOAddressSpace *space = bcontainer->space; + + iommufd_put_ioas(container->iommufd, container->ioas_id); + vfio_iommufd_container_destroy(container); + vfio_put_address_space(space); + } + vbasedev->container = NULL; +out: + close(vbasedev->fd); + g_free(vbasedev->name); +} + +static void vfio_iommufd_class_init(ObjectClass *klass, + void *data) +{ + VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_CLASS(klass); + + vccs->check_extension = iommufd_check_extension; + vccs->dma_map = iommufd_map; + vccs->dma_unmap = iommufd_unmap; + vccs->attach_device = iommufd_attach_device; + vccs->detach_device = iommufd_detach_device; + vccs->reset =
vfio_iommufd_container_reset; +} + +static const TypeInfo vfio_iommufd_info = { + .parent = TYPE_VFIO_CONTAINER_OBJ, + .name = TYPE_VFIO_IOMMUFD_CONTAINER, + .class_init = vfio_iommufd_class_init, +}; + +static void vfio_iommufd_register_types(void) +{ + type_register_static(&vfio_iommufd_info); +} + +type_init(vfio_iommufd_register_types) diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build index df4fa2b695..3c53c87200 100644 --- a/hw/vfio/meson.build +++ b/hw/vfio/meson.build @@ -7,6 +7,9 @@ vfio_ss.add(files( 'spapr.c', 'migration.c', )) +vfio_ss.add(when: 'CONFIG_IOMMUFD', if_true: files( + 'iommufd.c', +)) vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files( 'display.c', 'pci-quirks.c', diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index e1ab6d339d..cf5703f94b 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -3148,6 +3148,16 @@ static void vfio_pci_reset(DeviceState *dev) goto post_reset; } + /* + * This is a temporary check, long term iommufd should + * support hot reset as well + */ + if (vdev->vbasedev.be == VFIO_IOMMU_BACKEND_TYPE_IOMMUFD) { + error_report("Dangerous: iommufd BE doesn't support hot " + "reset, please stop the VM"); + goto post_reset; + } + /* See if we can do our own bus reset */ if (!vfio_pci_hot_reset_one(vdev)) { goto post_reset; diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events index 0ef1b5f4a6..51f04b0b80 100644 --- a/hw/vfio/trace-events +++ b/hw/vfio/trace-events @@ -165,3 +165,14 @@ vfio_load_state_device_data(const char *name, uint64_t data_offset, uint64_t dat vfio_load_cleanup(const char *name) " (%s)" vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64 vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64 + +#iommufd.c + +vfio_iommufd_get_devicefd(const char *dev, int devfd) " %s (fd=%d)" +vfio_iommufd_bind_device(int iommufd, 
const char *name, int devfd, int devid) " [iommufd=%d] Successfully bound device %s (fd=%d): output devid=%d" +vfio_iommufd_attach_device(int iommufd, const char *name, int devfd, int ioasid, int hwptid) " [iommufd=%d] Successfully attached device %s (%d) to ioasid=%d: output hwptid=%d" +vfio_iommufd_detach_device(int iommufd, const char *name, int ioasid) " [iommufd=%d] Detached %s from ioasid=%d" +vfio_iommufd_alloc_ioas(int iommufd, int ioas_id) " [iommufd=%d] new IOMMUFD container with ioasid=%d" +vfio_iommufd_device_info(char *name, int devfd, int num_irqs, int num_regions, int flags) " %s (%d) num_irqs=%d num_regions=%d flags=%d" +vfio_iommufd_fail_attach_existing_container(const char *msg) " %s" +vfio_iommufd_container_reset(char *name) " Successfully reset %s" diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index 2040c27cda..19731ea685 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -81,6 +81,22 @@ typedef struct VFIOLegacyContainer { QLIST_HEAD(, VFIOGroup) group_list; } VFIOLegacyContainer; +typedef struct VFIOIOASHwpt { + uint32_t hwpt_id; + QLIST_HEAD(, VFIODevice) device_list; + QLIST_ENTRY(VFIOIOASHwpt) next; +} VFIOIOASHwpt; + +typedef struct VFIOIOMMUFDContainer { + VFIOContainer obj; + int iommufd; /* /dev/vfio/vfio, empowered by the attached device */ + uint32_t ioas_id; + QLIST_HEAD(, VFIOIOASHwpt) hwpt_list; +} VFIOIOMMUFDContainer; + +typedef QLIST_HEAD(VFIOAddressSpaceList, VFIOAddressSpace) VFIOAddressSpaceList; +extern VFIOAddressSpaceList vfio_address_spaces; + typedef struct VFIODeviceOps VFIODeviceOps; typedef enum VFIOIOMMUBackendType { @@ -90,6 +106,7 @@ typedef enum VFIOIOMMUBackendType { typedef struct VFIODevice { QLIST_ENTRY(VFIODevice) next; + QLIST_ENTRY(VFIODevice) hwpt_next; struct VFIOGroup *group; VFIOContainer *container; char *sysfsdev; @@ -97,6 +114,7 @@ typedef struct VFIODevice { DeviceState *dev; int fd; int type; + int devid; bool reset_works; bool needs_reset;
bool no_mmap; diff --git a/include/hw/vfio/vfio-container-obj.h b/include/hw/vfio/vfio-container-obj.h index ffd8590ff8..b5ef2160d8 100644 --- a/include/hw/vfio/vfio-container-obj.h +++ b/include/hw/vfio/vfio-container-obj.h @@ -43,6 +43,7 @@ TYPE_VFIO_CONTAINER_OBJ) #define TYPE_VFIO_LEGACY_CONTAINER "qemu:vfio-legacy-container" +#define TYPE_VFIO_IOMMUFD_CONTAINER "qemu:vfio-iommufd-container" typedef enum VFIOContainerFeature { VFIO_FEAT_LIVE_MIGRATION, From patchwork Thu Apr 14 10:47:08 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yi Liu X-Patchwork-Id: 12813354 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 43698C433F5 for ; Thu, 14 Apr 2022 10:58:28 +0000 (UTC) Received: from localhost ([::1]:43892 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nexBD-0004Q4-Bx for qemu-devel@archiver.kernel.org; Thu, 14 Apr 2022 06:58:27 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:55656) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nex14-000306-3j for qemu-devel@nongnu.org; Thu, 14 Apr 2022 06:47:58 -0400 Received: from mga12.intel.com ([192.55.52.136]:34770) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nex0z-0005Ke-U6 for qemu-devel@nongnu.org; Thu, 14 Apr 2022 06:47:57 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1649933273; x=1681469273; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=hqCVHBvMLiO5PUml3lYQKrW7UAw+dj0Yypw69s/N8c0=; 
b=Iw9D1CIEKzX3synERese6bMfRCugnIsp2QqPhCei265qOwYM7GUGS5Ql QHyUimTZjBgtR6/JopWN7+VvudwtAA02oU5k5njwMAT+nw0dcLdaj5m6x lJPB/3cD+Eg6+YUixJ25kE7f4IJriSy8I+eYc/gBj7PXFcwWwRbtL+Hn+ G+WDnNDiByTenKy9KE7ewZ4gM8mS147fSsmP96GxP+Qql6/7697pQMCDZ Ru5whAX+QdYYPdqRAloKtp63q+pZRmaI/Y3HQNWyNsz+TVEVoapIdzwPm odEgSCJmuiBfd4fkzZ6E453UwzIMVOfsKd0O/pr2AqicSX5TjbwjVo8RI w==; X-IronPort-AV: E=McAfee;i="6400,9594,10316"; a="242836520" X-IronPort-AV: E=Sophos;i="5.90,259,1643702400"; d="scan'208";a="242836520" Received: from fmsmga006.fm.intel.com ([10.253.24.20]) by fmsmga106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 Apr 2022 03:47:24 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,259,1643702400"; d="scan'208";a="803091258" Received: from 984fee00a4c6.jf.intel.com ([10.165.58.231]) by fmsmga006.fm.intel.com with ESMTP; 14 Apr 2022 03:47:23 -0700 From: Yi Liu To: alex.williamson@redhat.com, cohuck@redhat.com, qemu-devel@nongnu.org Subject: [RFC 16/18] vfio/iommufd: Add IOAS_COPY_DMA support Date: Thu, 14 Apr 2022 03:47:08 -0700 Message-Id: <20220414104710.28534-17-yi.l.liu@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220414104710.28534-1-yi.l.liu@intel.com> References: <20220414104710.28534-1-yi.l.liu@intel.com> MIME-Version: 1.0 Received-SPF: pass client-ip=192.55.52.136; envelope-from=yi.l.liu@intel.com; helo=mga12.intel.com X-Spam_score_int: -44 X-Spam_score: -4.5 X-Spam_bar: ---- X-Spam_report: (-4.5 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: akrowiak@linux.ibm.com, jjherne@linux.ibm.com, thuth@redhat.com, yi.l.liu@intel.com, kvm@vger.kernel.org, 
mjrosato@linux.ibm.com, jasowang@redhat.com, farman@linux.ibm.com, peterx@redhat.com, pasic@linux.ibm.com, eric.auger@redhat.com, yi.y.sun@intel.com, chao.p.peng@intel.com, nicolinc@nvidia.com, kevin.tian@intel.com, jgg@nvidia.com, eric.auger.pro@gmail.com, david@gibson.dropbear.id.au

Compared with the legacy vfio container backend, one benefit of iommufd is that IOAS_COPY_DMA avoids redundant page pinning on the kernel side. For iommufd containers within the same address space, IOVA mappings can be copied from a source container to a destination container. To achieve this, make the vfio_memory_listener per address space. Each memory listener callback loops over all the containers within the address space. For iommufd containers, QEMU uses IOAS_MAP_DMA on the first one and then uses IOAS_COPY_DMA to copy the IOVA mappings from the first iommufd container to the other iommufd containers within the address space. For legacy containers, IOVA mapping is done by VFIO_IOMMU_MAP_DMA.
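The map-once-then-copy flow described above can be sketched in isolation as follows. This is a simplified model, not the code in this patch: SketchContainer, sketch_map, sketch_copy and sketch_region_add are hypothetical names, the map and copy operations only bump counters instead of issuing ioctls, and the copy_fails field is a knob added purely to exercise the fallback path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a container; not the QEMU VFIOContainer. */
typedef struct SketchContainer {
    bool copy_supported;   /* true for iommufd-like backends */
    bool copy_fails;       /* knob: force copies into this one to fail */
    int maps;              /* number of full map operations performed */
    int copies;            /* number of copy operations received */
} SketchContainer;

/* Stands in for a full map (page pinning happens here in reality). */
static int sketch_map(SketchContainer *c, uint64_t iova, uint64_t size)
{
    (void)iova;
    (void)size;
    c->maps++;
    return 0;
}

/* Stands in for copying an existing mapping from src into dst. */
static int sketch_copy(SketchContainer *src, SketchContainer *dst,
                       uint64_t iova, uint64_t size)
{
    (void)iova;
    (void)size;
    assert(src != dst);
    if (dst->copy_fails) {
        return -1;
    }
    dst->copies++;
    return 0;
}

/*
 * Add one region to every container of an address space. The first
 * copy-capable container is mapped and remembered as the source; later
 * copy-capable containers copy from it and fall back to a full map if
 * the copy fails. Containers without copy support are always mapped.
 */
static int sketch_region_add(SketchContainer *containers, size_t n,
                             uint64_t iova, uint64_t size)
{
    SketchContainer *src = NULL;

    for (size_t i = 0; i < n; i++) {
        SketchContainer *c = &containers[i];
        int ret;

        if (c->copy_supported && src &&
            sketch_copy(src, c, iova, size) == 0) {
            continue;
        }

        ret = sketch_map(c, iova, size);
        if (ret) {
            return ret;
        }
        if (c->copy_supported) {
            src = c;
        }
    }
    return 0;
}
```

In this model only the first copy-capable container pays for a full map; the others reuse it, which is the effect the per-address-space listener is meant to enable.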
Signed-off-by: Yi Liu --- hw/vfio/as.c | 117 +++++++++++++++++++++++---- hw/vfio/container-obj.c | 17 +++- hw/vfio/container.c | 19 ++--- hw/vfio/iommufd.c | 43 +++++++--- include/hw/vfio/vfio-common.h | 6 +- include/hw/vfio/vfio-container-obj.h | 8 +- 6 files changed, 167 insertions(+), 43 deletions(-) diff --git a/hw/vfio/as.c b/hw/vfio/as.c index 94618efd1f..13a6653a0d 100644 --- a/hw/vfio/as.c +++ b/hw/vfio/as.c @@ -388,16 +388,16 @@ static void vfio_unregister_ram_discard_listener(VFIOContainer *container, g_free(vrdl); } -static void vfio_listener_region_add(MemoryListener *listener, - MemoryRegionSection *section) +static void vfio_container_region_add(VFIOContainer *container, + VFIOContainer **src_container, + MemoryRegionSection *section) { - VFIOContainer *container = container_of(listener, VFIOContainer, listener); hwaddr iova, end; Int128 llend, llsize; void *vaddr; int ret; VFIOHostDMAWindow *hostwin; - bool hostwin_found; + bool hostwin_found, copy_dma_supported = false; Error *err = NULL; if (vfio_listener_skipped_section(section)) { @@ -533,12 +533,25 @@ static void vfio_listener_region_add(MemoryListener *listener, } } + copy_dma_supported = vfio_container_check_extension(container, + VFIO_FEAT_DMA_COPY); + + if (copy_dma_supported && *src_container) { + if (!vfio_container_dma_copy(*src_container, container, + iova, int128_get64(llsize), + section->readonly)) { + return; + } else { + info_report("IOAS copy failed try map for container: %p", container); + } + } + ret = vfio_container_dma_map(container, iova, int128_get64(llsize), vaddr, section->readonly); if (ret) { - error_setg(&err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", " - "0x%"HWADDR_PRIx", %p) = %d (%m)", - container, iova, int128_get64(llsize), vaddr, ret); + error_setg(&err, "vfio_container_dma_map(%p, 0x%"HWADDR_PRIx", " + "0x%"HWADDR_PRIx", %p) = %d (%m)", container, iova, + int128_get64(llsize), vaddr, ret); if (memory_region_is_ram_device(section->mr)) { /* Allow unexpected mappings not 
to be fatal for RAM devices */ error_report_err(err); @@ -547,6 +560,9 @@ static void vfio_listener_region_add(MemoryListener *listener, goto fail; } + if (copy_dma_supported) { + *src_container = container; + } return; fail: @@ -573,10 +589,22 @@ fail: } } -static void vfio_listener_region_del(MemoryListener *listener, +static void vfio_listener_region_add(MemoryListener *listener, MemoryRegionSection *section) { - VFIOContainer *container = container_of(listener, VFIOContainer, listener); + VFIOAddressSpace *space = container_of(listener, + VFIOAddressSpace, listener); + VFIOContainer *container, *src_container; + + src_container = NULL; + QLIST_FOREACH(container, &space->containers, next) { + vfio_container_region_add(container, &src_container, section); + } +} + +static void vfio_container_region_del(VFIOContainer *container, + MemoryRegionSection *section) +{ hwaddr iova, end; Int128 llend, llsize; int ret; @@ -682,18 +710,38 @@ static void vfio_listener_region_del(MemoryListener *listener, vfio_container_del_section_window(container, section); } +static void vfio_listener_region_del(MemoryListener *listener, + MemoryRegionSection *section) +{ + VFIOAddressSpace *space = container_of(listener, + VFIOAddressSpace, listener); + VFIOContainer *container; + + QLIST_FOREACH(container, &space->containers, next) { + vfio_container_region_del(container, section); + } +} + static void vfio_listener_log_global_start(MemoryListener *listener) { - VFIOContainer *container = container_of(listener, VFIOContainer, listener); + VFIOAddressSpace *space = container_of(listener, + VFIOAddressSpace, listener); + VFIOContainer *container; - vfio_container_set_dirty_page_tracking(container, true); + QLIST_FOREACH(container, &space->containers, next) { + vfio_container_set_dirty_page_tracking(container, true); + } } static void vfio_listener_log_global_stop(MemoryListener *listener) { - VFIOContainer *container = container_of(listener, VFIOContainer, listener); + VFIOAddressSpace 
*space = container_of(listener, + VFIOAddressSpace, listener); + VFIOContainer *container; - vfio_container_set_dirty_page_tracking(container, false); + QLIST_FOREACH(container, &space->containers, next) { + vfio_container_set_dirty_page_tracking(container, false); + } } typedef struct { @@ -823,11 +871,9 @@ static int vfio_sync_dirty_bitmap(VFIOContainer *container, int128_get64(section->size), ram_addr); } -static void vfio_listener_log_sync(MemoryListener *listener, - MemoryRegionSection *section) +static void vfio_container_log_sync(VFIOContainer *container, + MemoryRegionSection *section) { - VFIOContainer *container = container_of(listener, VFIOContainer, listener); - if (vfio_listener_skipped_section(section) || !container->dirty_pages_supported) { return; @@ -838,6 +884,18 @@ static void vfio_listener_log_sync(MemoryListener *listener, } } +static void vfio_listener_log_sync(MemoryListener *listener, + MemoryRegionSection *section) +{ + VFIOAddressSpace *space = container_of(listener, + VFIOAddressSpace, listener); + VFIOContainer *container; + + QLIST_FOREACH(container, &space->containers, next) { + vfio_container_log_sync(container, section); + } +} + const MemoryListener vfio_memory_listener = { .name = "vfio", .region_add = vfio_listener_region_add, @@ -882,6 +940,31 @@ VFIOAddressSpace *vfio_get_address_space(AddressSpace *as) return space; } +void vfio_as_add_container(VFIOAddressSpace *space, + VFIOContainer *container) +{ + if (space->listener_initialized) { + memory_listener_unregister(&space->listener); + } + + QLIST_INSERT_HEAD(&space->containers, container, next); + + /* Unregistration happen in vfio_as_del_container() */ + space->listener = vfio_memory_listener; + memory_listener_register(&space->listener, space->as); + space->listener_initialized = true; +} + +void vfio_as_del_container(VFIOAddressSpace *space, + VFIOContainer *container) +{ + QLIST_SAFE_REMOVE(container, next); + + if (QLIST_EMPTY(&space->containers)) { + 
memory_listener_unregister(&space->listener); + } +} + void vfio_put_address_space(VFIOAddressSpace *space) { if (QLIST_EMPTY(&space->containers)) { diff --git a/hw/vfio/container-obj.c b/hw/vfio/container-obj.c index c4220336af..2c79089364 100644 --- a/hw/vfio/container-obj.c +++ b/hw/vfio/container-obj.c @@ -27,6 +27,7 @@ #include "qom/object.h" #include "qapi/visitor.h" #include "hw/vfio/vfio-container-obj.h" +#include "exec/memory.h" bool vfio_container_check_extension(VFIOContainer *container, VFIOContainerFeature feat) @@ -53,6 +54,20 @@ int vfio_container_dma_map(VFIOContainer *container, return vccs->dma_map(container, iova, size, vaddr, readonly); } +int vfio_container_dma_copy(VFIOContainer *src, VFIOContainer *dst, + hwaddr iova, ram_addr_t size, bool readonly) +{ + VFIOContainerClass *vccs1 = VFIO_CONTAINER_OBJ_GET_CLASS(src); + VFIOContainerClass *vccs2 = VFIO_CONTAINER_OBJ_GET_CLASS(dst); + + if (!vccs1->dma_copy || vccs1->dma_copy != vccs2->dma_copy) { + error_report("Incompatible containers: unable to copy DMA mappings"); + return -EINVAL; + } + + return vccs1->dma_copy(src, dst, iova, size, readonly); +} + int vfio_container_dma_unmap(VFIOContainer *container, hwaddr iova, ram_addr_t size, IOMMUTLBEntry *iotlb) @@ -165,8 +180,6 @@ void vfio_container_destroy(VFIOContainer *container) VFIOGuestIOMMU *giommu, *tmp; VFIOHostDMAWindow *hostwin, *next; - QLIST_SAFE_REMOVE(container, next); - QLIST_FOREACH_SAFE(vrdl, &container->vrdl_list, next, vrdl_tmp) { RamDiscardManager *rdm; diff --git a/hw/vfio/container.c b/hw/vfio/container.c index 2f59422048..6bc1b8763f 100644 --- a/hw/vfio/container.c +++ b/hw/vfio/container.c @@ -357,9 +357,6 @@ err_out: static void vfio_listener_release(VFIOLegacyContainer *container) { - VFIOContainer *bcontainer = &container->obj; - - memory_listener_unregister(&bcontainer->listener); if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) { memory_listener_unregister(&container->prereg_listener); } @@ -887,14 +884,11 @@ static int
vfio_connect_container(VFIOGroup *group, AddressSpace *as, vfio_kvm_device_add_group(group); QLIST_INIT(&container->group_list); - QLIST_INSERT_HEAD(&space->containers, bcontainer, next); group->container = container; QLIST_INSERT_HEAD(&container->group_list, group, container_next); - bcontainer->listener = vfio_memory_listener; - - memory_listener_register(&bcontainer->listener, bcontainer->space->as); + vfio_as_add_container(space, bcontainer); if (bcontainer->error) { ret = -1; @@ -907,8 +901,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, return 0; listener_release_exit: + vfio_as_del_container(space, bcontainer); QLIST_REMOVE(group, container_next); - QLIST_REMOVE(bcontainer, next); vfio_kvm_device_del_group(group); vfio_listener_release(container); @@ -931,6 +925,7 @@ static void vfio_disconnect_container(VFIOGroup *group) { VFIOLegacyContainer *container = group->container; VFIOContainer *bcontainer = &container->obj; + VFIOAddressSpace *space = bcontainer->space; QLIST_REMOVE(group, container_next); group->container = NULL; @@ -938,10 +933,12 @@ static void vfio_disconnect_container(VFIOGroup *group) /* * Explicitly release the listener first before unset container, * since unset may destroy the backend container if it's the last - * group. + * group. Removing the container from the list disconnects it + * from the address space memory listener. 
*/ if (QLIST_EMPTY(&container->group_list)) { vfio_listener_release(container); + vfio_as_del_container(space, bcontainer); } if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) { @@ -950,10 +947,8 @@ static void vfio_disconnect_container(VFIOGroup *group) } if (QLIST_EMPTY(&container->group_list)) { - VFIOAddressSpace *space = bcontainer->space; - - vfio_container_destroy(bcontainer); trace_vfio_disconnect_container(container->fd); + vfio_container_destroy(bcontainer); close(container->fd); g_free(container); diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c index f8375f1672..8ff5988b07 100644 --- a/hw/vfio/iommufd.c +++ b/hw/vfio/iommufd.c @@ -38,6 +38,8 @@ static bool iommufd_check_extension(VFIOContainer *bcontainer, VFIOContainerFeature feat) { switch (feat) { + case VFIO_FEAT_DMA_COPY: + return true; default: return false; }; @@ -49,10 +51,25 @@ static int iommufd_map(VFIOContainer *bcontainer, hwaddr iova, VFIOIOMMUFDContainer *container = container_of(bcontainer, VFIOIOMMUFDContainer, obj); - return iommufd_map_dma(container->iommufd, container->ioas_id, + return iommufd_map_dma(container->iommufd, + container->ioas_id, iova, size, vaddr, readonly); } +static int iommufd_copy(VFIOContainer *src, VFIOContainer *dst, + hwaddr iova, ram_addr_t size, bool readonly) +{ + VFIOIOMMUFDContainer *container_src = container_of(src, + VFIOIOMMUFDContainer, obj); + VFIOIOMMUFDContainer *container_dst = container_of(dst, + VFIOIOMMUFDContainer, obj); + + assert(container_src->iommufd == container_dst->iommufd); + + return iommufd_copy_dma(container_src->iommufd, container_src->ioas_id, + container_dst->ioas_id, iova, size, readonly); +} + static int iommufd_unmap(VFIOContainer *bcontainer, hwaddr iova, ram_addr_t size, IOMMUTLBEntry *iotlb) @@ -428,12 +445,7 @@ static int iommufd_attach_device(VFIODevice *vbasedev, AddressSpace *as, * between iommufd and kvm. 
*/ - QLIST_INSERT_HEAD(&space->containers, bcontainer, next); - - bcontainer->listener = vfio_memory_listener; - - memory_listener_register(&bcontainer->listener, bcontainer->space->as); - + vfio_as_add_container(space, bcontainer); bcontainer->initialized = true; out: @@ -476,6 +488,7 @@ static void iommufd_detach_device(VFIODevice *vbasedev) VFIOIOMMUFDContainer *container; VFIODevice *vbasedev_iter; VFIOIOASHwpt *hwpt; + VFIOAddressSpace *space; Error *err; if (!bcontainer) { @@ -501,15 +514,26 @@ found: vfio_container_put_hwpt(hwpt); } + space = bcontainer->space; + /* + * The bcontainer must be removed from the space->containers list before + * detaching the container; otherwise, the detach may destroy the + * container if this is the last device. Removing bcontainer from the + * list disconnects it from the address space memory listener. + */ + if (QLIST_EMPTY(&container->hwpt_list)) { + vfio_as_del_container(space, bcontainer); + } __vfio_device_detach_container(vbasedev, container, &err); if (err) { error_report_err(err); } if (QLIST_EMPTY(&container->hwpt_list)) { - VFIOAddressSpace *space = bcontainer->space; + int iommufd = container->iommufd; + uint32_t ioas_id = container->ioas_id; - iommufd_put_ioas(container->iommufd, container->ioas_id); vfio_iommufd_container_destroy(container); + iommufd_put_ioas(iommufd, ioas_id); vfio_put_address_space(space); } vbasedev->container = NULL; @@ -525,6 +549,7 @@ static void vfio_iommufd_class_init(ObjectClass *klass, vccs->check_extension = iommufd_check_extension; vccs->dma_map = iommufd_map; + vccs->dma_copy = iommufd_copy; vccs->dma_unmap = iommufd_unmap; vccs->attach_device = iommufd_attach_device; vccs->detach_device = iommufd_detach_device; diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index 19731ea685..bef48ddfaf 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -34,8 +34,6 @@ #define VFIO_MSG_PREFIX "vfio %s: " -extern const MemoryListener 
vfio_memory_listener; - enum { VFIO_DEVICE_TYPE_PCI = 0, VFIO_DEVICE_TYPE_PLATFORM = 1, @@ -181,6 +179,10 @@ void vfio_host_win_add(VFIOContainer *bcontainer, int vfio_host_win_del(VFIOContainer *bcontainer, hwaddr min_iova, hwaddr max_iova); VFIOAddressSpace *vfio_get_address_space(AddressSpace *as); +void vfio_as_add_container(VFIOAddressSpace *space, + VFIOContainer *bcontainer); +void vfio_as_del_container(VFIOAddressSpace *space, + VFIOContainer *container); void vfio_put_address_space(VFIOAddressSpace *space); void vfio_put_base_device(VFIODevice *vbasedev); diff --git a/include/hw/vfio/vfio-container-obj.h b/include/hw/vfio/vfio-container-obj.h index b5ef2160d8..b65f827bc1 100644 --- a/include/hw/vfio/vfio-container-obj.h +++ b/include/hw/vfio/vfio-container-obj.h @@ -47,12 +47,15 @@ typedef enum VFIOContainerFeature { VFIO_FEAT_LIVE_MIGRATION, + VFIO_FEAT_DMA_COPY, } VFIOContainerFeature; typedef struct VFIOContainer VFIOContainer; typedef struct VFIOAddressSpace { AddressSpace *as; + MemoryListener listener; + bool listener_initialized; QLIST_HEAD(, VFIOContainer) containers; QLIST_ENTRY(VFIOAddressSpace) list; } VFIOAddressSpace; @@ -90,7 +93,6 @@ struct VFIOContainer { Object parent_obj; VFIOAddressSpace *space; - MemoryListener listener; Error *error; bool initialized; bool dirty_pages_supported; @@ -116,6 +118,8 @@ typedef struct VFIOContainerClass { int (*dma_map)(VFIOContainer *container, hwaddr iova, ram_addr_t size, void *vaddr, bool readonly); + int (*dma_copy)(VFIOContainer *src, VFIOContainer *dst, + hwaddr iova, ram_addr_t size, bool readonly); int (*dma_unmap)(VFIOContainer *container, hwaddr iova, ram_addr_t size, IOMMUTLBEntry *iotlb); @@ -141,6 +145,8 @@ bool vfio_container_check_extension(VFIOContainer *container, int vfio_container_dma_map(VFIOContainer *container, hwaddr iova, ram_addr_t size, void *vaddr, bool readonly); +int vfio_container_dma_copy(VFIOContainer *src, VFIOContainer *dst, + hwaddr iova, ram_addr_t size, bool readonly); 
int vfio_container_dma_unmap(VFIOContainer *container, hwaddr iova, ram_addr_t size, IOMMUTLBEntry *iotlb); From patchwork Thu Apr 14 10:47:09 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yi Liu X-Patchwork-Id: 12813356 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 9C61EC433EF for ; Thu, 14 Apr 2022 10:59:28 +0000 (UTC) Received: from localhost ([::1]:47060 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nexCB-0006gl-KM for qemu-devel@archiver.kernel.org; Thu, 14 Apr 2022 06:59:27 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:55710) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nex1D-0003V9-VS for qemu-devel@nongnu.org; Thu, 14 Apr 2022 06:48:08 -0400 Received: from mga12.intel.com ([192.55.52.136]:34772) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nex1C-0005Kn-6w for qemu-devel@nongnu.org; Thu, 14 Apr 2022 06:48:07 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1649933286; x=1681469286; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=F9ZmIqeNr50Fo8Vzy6kBuBQvKFH22JABuQLYrlJNHNM=; b=jtEitNVNd3vngRwFx9kHVTCZlXeBx6wnN+TcBJ3XWcxxmZoL5R5Arqcz 6j9hHr45tCH9mPSBJAveh3ps6+b5e6IaY4x4R8BDLLxjkpX1w59MPgx5+ kbU/CFjHrjXjazrJ9RBPU4YQM/uQ9RMfkTGFWJ4nuJ0ul8o10I56neO9S 1Rhn2MOwW7AFY9B7lF3CnqYLFpgst53FAqr1KxZE97jVNJ/BO1XIXStmM GMv73IGPJ9WZcxXBgoxiHfxxaE6dCZpF5P+JHcRTl/wLXZKRg3Y5ilgvy Kh4CID/tOlq05PUSWulzBe6fve2fuQSolzyKOXaegpCKG8NYUdWelEunL Q==; 
X-IronPort-AV: E=McAfee;i="6400,9594,10316"; a="242836523" X-IronPort-AV: E=Sophos;i="5.90,259,1643702400"; d="scan'208";a="242836523" Received: from fmsmga006.fm.intel.com ([10.253.24.20]) by fmsmga106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 Apr 2022 03:47:25 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,259,1643702400"; d="scan'208";a="803091263" Received: from 984fee00a4c6.jf.intel.com ([10.165.58.231]) by fmsmga006.fm.intel.com with ESMTP; 14 Apr 2022 03:47:24 -0700 From: Yi Liu To: alex.williamson@redhat.com, cohuck@redhat.com, qemu-devel@nongnu.org Subject: [RFC 17/18] vfio/as: Allow the selection of a given iommu backend Date: Thu, 14 Apr 2022 03:47:09 -0700 Message-Id: <20220414104710.28534-18-yi.l.liu@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220414104710.28534-1-yi.l.liu@intel.com> References: <20220414104710.28534-1-yi.l.liu@intel.com> MIME-Version: 1.0 Received-SPF: pass client-ip=192.55.52.136; envelope-from=yi.l.liu@intel.com; helo=mga12.intel.com X-Spam_score_int: -44 X-Spam_score: -4.5 X-Spam_bar: ---- X-Spam_report: (-4.5 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: akrowiak@linux.ibm.com, jjherne@linux.ibm.com, thuth@redhat.com, yi.l.liu@intel.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com, jasowang@redhat.com, farman@linux.ibm.com, peterx@redhat.com, pasic@linux.ibm.com, eric.auger@redhat.com, yi.y.sun@intel.com, chao.p.peng@intel.com, nicolinc@nvidia.com, kevin.tian@intel.com, jgg@nvidia.com, eric.auger.pro@gmail.com, david@gibson.dropbear.id.au Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org 
Sender: "Qemu-devel" From: Eric Auger Now that we support two types of iommu backends, add the capability to select one of them. The selection is based on a VFIODevice auto/on/off iommufd_be field. This field is likely to be forced to a given value or set by a device option. Signed-off-by: Eric Auger Signed-off-by: Yi Liu --- hw/vfio/as.c | 31 ++++++++++++++++++++++++++++++- include/hw/vfio/vfio-common.h | 1 + 2 files changed, 31 insertions(+), 1 deletion(-) diff --git a/hw/vfio/as.c b/hw/vfio/as.c index 13a6653a0d..fce7a088e9 100644 --- a/hw/vfio/as.c +++ b/hw/vfio/as.c @@ -985,16 +985,45 @@ vfio_get_container_class(VFIOIOMMUBackendType be) case VFIO_IOMMU_BACKEND_TYPE_LEGACY: klass = object_class_by_name(TYPE_VFIO_LEGACY_CONTAINER); return VFIO_CONTAINER_OBJ_CLASS(klass); + case VFIO_IOMMU_BACKEND_TYPE_IOMMUFD: + klass = object_class_by_name(TYPE_VFIO_IOMMUFD_CONTAINER); + return VFIO_CONTAINER_OBJ_CLASS(klass); default: return NULL; } } +static VFIOContainerClass * +select_iommu_backend(OnOffAuto value, Error **errp) +{ + VFIOContainerClass *vccs = NULL; + + if (value == ON_OFF_AUTO_OFF) { + return vfio_get_container_class(VFIO_IOMMU_BACKEND_TYPE_LEGACY); + } else { + int iommufd = qemu_open_old("/dev/iommu", O_RDWR); + + vccs = vfio_get_container_class(VFIO_IOMMU_BACKEND_TYPE_IOMMUFD); + if (iommufd < 0 || !vccs) { + if (value == ON_OFF_AUTO_AUTO) { + vccs = vfio_get_container_class(VFIO_IOMMU_BACKEND_TYPE_LEGACY); + } else { /* ON */ + error_setg(errp, "iommufd backend is not supported by %s", + iommufd < 0 ? 
"the host" : "QEMU"); + error_append_hint(errp, "set iommufd=off\n"); + vccs = NULL; + } + } + close(iommufd); + } + return vccs; +} + int vfio_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp) { VFIOContainerClass *vccs; - vccs = vfio_get_container_class(VFIO_IOMMU_BACKEND_TYPE_LEGACY); + vccs = select_iommu_backend(vbasedev->iommufd_be, errp); if (!vccs) { return -ENOENT; } diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index bef48ddfaf..2d941aae70 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -126,6 +126,7 @@ typedef struct VFIODevice { VFIOMigration *migration; Error *migration_blocker; OnOffAuto pre_copy_dirty_page_tracking; + OnOffAuto iommufd_be; } VFIODevice; struct VFIODeviceOps { From patchwork Thu Apr 14 10:47:10 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yi Liu X-Patchwork-Id: 12813357 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id E8358C433F5 for ; Thu, 14 Apr 2022 11:01:51 +0000 (UTC) Received: from localhost ([::1]:51896 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nexEU-0001gz-Cu for qemu-devel@archiver.kernel.org; Thu, 14 Apr 2022 07:01:51 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:55738) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nex1F-0003aO-Kh for qemu-devel@nongnu.org; Thu, 14 Apr 2022 06:48:09 -0400 Received: from mga12.intel.com ([192.55.52.136]:34768) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nex1D-0005Ka-Le for 
qemu-devel@nongnu.org; Thu, 14 Apr 2022 06:48:09 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1649933287; x=1681469287; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=0P3OalrBHnBhU01QF07hMbsmlYmh/io4cNMn4gf855E=; b=G3hZ/nXbWaRokI6YYdLJeRk3u+Hs7cPyhTzOmPIaBOUn5+JC9LBiN2xa iA/6tYQyHALAFQra5nfN+BI03KQzC8C4xmqqdCD7skYVI3J5cnUGcojeg htF0On3uYn8E+LiL/nB+p46QVukLn2hQ+WGK9Og1gN1j2TUlGFlYVXFsL YYK28UrIqYipGN+WKVMih2shAGdKl/a0LhBf9H20hexauqFSaptMVtOmu K3ygXK5i1ySO+8Q6ydQRuhM64htaGr393PRP4t2amSaO/5qTI8ufi1zOc GC3dt6mscDhT63IDcDTk94zfmgVgmpi6AGB8UMpP0+6NVBQ46I/gOfRWx A==; X-IronPort-AV: E=McAfee;i="6400,9594,10316"; a="242836525" X-IronPort-AV: E=Sophos;i="5.90,259,1643702400"; d="scan'208";a="242836525" Received: from fmsmga006.fm.intel.com ([10.253.24.20]) by fmsmga106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 14 Apr 2022 03:47:26 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,259,1643702400"; d="scan'208";a="803091268" Received: from 984fee00a4c6.jf.intel.com ([10.165.58.231]) by fmsmga006.fm.intel.com with ESMTP; 14 Apr 2022 03:47:25 -0700 From: Yi Liu To: alex.williamson@redhat.com, cohuck@redhat.com, qemu-devel@nongnu.org Subject: [RFC 18/18] vfio/pci: Add an iommufd option Date: Thu, 14 Apr 2022 03:47:10 -0700 Message-Id: <20220414104710.28534-19-yi.l.liu@intel.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220414104710.28534-1-yi.l.liu@intel.com> References: <20220414104710.28534-1-yi.l.liu@intel.com> MIME-Version: 1.0 Received-SPF: pass client-ip=192.55.52.136; envelope-from=yi.l.liu@intel.com; helo=mga12.intel.com X-Spam_score_int: -44 X-Spam_score: -4.5 X-Spam_bar: ---- X-Spam_report: (-4.5 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham 
autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: akrowiak@linux.ibm.com, jjherne@linux.ibm.com, thuth@redhat.com, yi.l.liu@intel.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com, jasowang@redhat.com, farman@linux.ibm.com, peterx@redhat.com, pasic@linux.ibm.com, eric.auger@redhat.com, yi.y.sun@intel.com, chao.p.peng@intel.com, nicolinc@nvidia.com, kevin.tian@intel.com, jgg@nvidia.com, eric.auger.pro@gmail.com, david@gibson.dropbear.id.au Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" From: Eric Auger This auto/on/off option allows the user to force the selection of the iommu BE (iommufd or legacy). Signed-off-by: Eric Auger Signed-off-by: Yi Liu --- hw/vfio/pci.c | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+) diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index cf5703f94b..70a4c2b0a8 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -42,6 +42,8 @@ #include "qapi/error.h" #include "migration/blocker.h" #include "migration/qemu-file.h" +#include "qapi/visitor.h" +#include "qapi/qapi-visit-common.h" #define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug" @@ -3246,6 +3248,26 @@ static Property vfio_pci_dev_properties[] = { DEFINE_PROP_END_OF_LIST(), }; +static void get_iommu_be(Object *obj, Visitor *v, const char *name, + void *opaque, Error **errp) +{ + VFIOPCIDevice *vdev = VFIO_PCI(obj); + VFIODevice *vbasedev = &vdev->vbasedev; + OnOffAuto iommufd_be = vbasedev->iommufd_be; + + visit_type_OnOffAuto(v, name, &iommufd_be, errp); +} + +static void set_iommu_be(Object *obj, Visitor *v, const char *name, + void *opaque, Error **errp) +{ + VFIOPCIDevice *vdev = VFIO_PCI(obj); + VFIODevice *vbasedev = &vdev->vbasedev; + + visit_type_OnOffAuto(v, name, &vbasedev->iommufd_be, errp); +} + + static void vfio_pci_dev_class_init(ObjectClass *klass, void *data) { 
DeviceClass *dc = DEVICE_CLASS(klass); @@ -3253,6 +3275,10 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data) dc->reset = vfio_pci_reset; device_class_set_props(dc, vfio_pci_dev_properties); + object_class_property_add(klass, "iommufd", "OnOffAuto", + get_iommu_be, set_iommu_be, NULL, NULL); + object_class_property_set_description(klass, "iommufd", + "Enable iommufd backend"); dc->desc = "VFIO-based PCI device assignment"; set_bit(DEVICE_CATEGORY_MISC, dc->categories); pdc->realize = vfio_realize;