From patchwork Thu Oct 24 09:34:44 2024
X-Patchwork-Submitter: Qinyun Tan
X-Patchwork-Id: 13848632
From: Qinyun Tan
To: Andrew Morton, Alex Williamson
Cc: linux-mm@kvack.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
 Qinyun Tan, Guanghui Feng, Xunlei Pang
Subject: [PATCH v1: vfio: avoid unnecessary pin memory when dma map io address space 2/2] vfio: avoid unnecessary pin memory when dma map io address space
Date: Thu, 24 Oct 2024 17:34:44 +0800
Message-ID: <15b38c90ef1eb0825b7492d633d46d901e428f7d.1729760996.git.qinyuntan@linux.alibaba.com>
X-Mailer: git-send-email 2.46.0

When a user application calls ioctl(VFIO_IOMMU_MAP_DMA) to map a DMA
address, the general handler 'vfio_pin_map_dma' attempts to pin the
memory and then create the mapping in the IOMMU.
However, some mappings aren't backed by a struct page, for example an
mmap'd MMIO range for our own or another device. In this scenario, where
the vma has the flags VM_IO | VM_PFNMAP, the pin operation will fail.
Moreover, the pin operation incurs a large overhead, which results in a
longer startup time for the VM. We don't actually need a pin in this
scenario.

To address this issue, we introduce a new DMA map flag,
'VFIO_DMA_MAP_FLAG_MMIO_DONT_PIN', to skip the 'vfio_pin_pages_remote'
operation in the DMA map process for MMIO memory. Additionally, we set
the 'VM_PGOFF_IS_PFN' flag on the vma created by vfio_pci_core_mmap, so
that the pfn can be obtained directly from vma->vm_pgoff.

This approach avoids unnecessary memory pinning operations, which would
otherwise introduce additional overhead during DMA mapping.

In my tests, using vfio to pass through an 8-card AMD GPU with a large
BAR size (128GB*8), the time to map the 192GB*8 BAR was reduced from
about 50.79s to 1.57s.

Signed-off-by: Qinyun Tan
Signed-off-by: Guanghui Feng
Reviewed-by: Xunlei Pang
---
 drivers/vfio/pci/vfio_pci_core.c |  2 +-
 drivers/vfio/vfio_iommu_type1.c  | 64 +++++++++++++++++++++++++-------
 include/uapi/linux/vfio.h        | 11 ++++++
 3 files changed, 62 insertions(+), 15 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 1ab58da9f38a6..9e8743429e490 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1802,7 +1802,7 @@ int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma
 	 * the VMA flags.
 	 */
 	vm_flags_set(vma, VM_ALLOW_ANY_UNCACHED | VM_IO | VM_PFNMAP |
-			VM_DONTEXPAND | VM_DONTDUMP);
+			VM_DONTEXPAND | VM_DONTDUMP | VM_PGOFF_IS_PFN);
 	vma->vm_ops = &vfio_pci_mmap_ops;
 
 	return 0;
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index bf391b40e576f..156e668de117d 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1439,7 +1439,7 @@ static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
 }
 
 static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
-			    size_t map_size)
+			    size_t map_size, unsigned int map_flags)
 {
 	dma_addr_t iova = dma->iova;
 	unsigned long vaddr = dma->vaddr;
@@ -1448,27 +1448,61 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
 	long npage;
 	unsigned long pfn, limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 	int ret = 0;
+	struct mm_struct *mm = current->mm;
+	bool mmio_dont_pin = map_flags & VFIO_DMA_MAP_FLAG_MMIO_DONT_PIN;
+
+	/* This code path is only user initiated */
+	if (!mm) {
+		ret = -ENODEV;
+		goto out;
+	}
 
 	vfio_batch_init(&batch);
 
 	while (size) {
-		/* Pin a contiguous chunk of memory */
-		npage = vfio_pin_pages_remote(dma, vaddr + dma->size,
-					      size >> PAGE_SHIFT, &pfn, limit,
-					      &batch);
-		if (npage <= 0) {
-			WARN_ON(!npage);
-			ret = (int)npage;
-			break;
+		struct vm_area_struct *vma;
+		unsigned long start = vaddr + dma->size;
+		bool do_pin_pages = true;
+
+		if (mmio_dont_pin) {
+			mmap_read_lock(mm);
+
+			vma = find_vma_intersection(mm, start, start+1);
+
+			/*
+			 * If this dma address range belongs to the IO address space with VMA flags
+			 * VM_IO | VM_PFNMAP | VM_PGOFF_IS_PFN, it doesn't need to be pinned.
+			 * Simply skip the pin operation to avoid unnecessary overhead.
+			 */
+			if (vma && (vma->vm_flags & VM_PFNMAP) && (vma->vm_flags & VM_IO)
+				&& (vma->vm_flags & VM_PGOFF_IS_PFN)) {
+				pfn = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
+				npage = min_t(long, (vma->vm_end - start), size) >> PAGE_SHIFT;
+				do_pin_pages = false;
+			}
+			mmap_read_unlock(mm);
+		}
+
+		if (do_pin_pages) {
+			/* Pin a contiguous chunk of memory */
+			npage = vfio_pin_pages_remote(dma, start, size >> PAGE_SHIFT, &pfn,
+						      limit, &batch);
+			if (npage <= 0) {
+				WARN_ON(!npage);
+				ret = (int)npage;
+				break;
+			}
 		}
 
 		/* Map it! */
 		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage,
 				     dma->prot);
 		if (ret) {
-			vfio_unpin_pages_remote(dma, iova + dma->size, pfn,
-						npage, true);
-			vfio_batch_unpin(&batch, dma);
+			if (do_pin_pages) {
+				vfio_unpin_pages_remote(dma, iova + dma->size, pfn,
+							npage, true);
+				vfio_batch_unpin(&batch, dma);
+			}
 			break;
 		}
 
@@ -1479,6 +1513,7 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
 	vfio_batch_fini(&batch);
 	dma->iommu_mapped = true;
 
+out:
 	if (ret)
 		vfio_remove_dma(iommu, dma);
 
@@ -1645,7 +1680,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	if (list_empty(&iommu->domain_list))
 		dma->size = size;
 	else
-		ret = vfio_pin_map_dma(iommu, dma, size);
+		ret = vfio_pin_map_dma(iommu, dma, size, map->flags);
 
 	if (!ret && iommu->dirty_page_tracking) {
 		ret = vfio_dma_bitmap_alloc(dma, pgsize);
@@ -2639,6 +2674,7 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
 	case VFIO_TYPE1_IOMMU:
 	case VFIO_TYPE1v2_IOMMU:
 	case VFIO_TYPE1_NESTING_IOMMU:
+	case VFIO_DMA_MAP_MMIO_DONT_PIN:
 	case VFIO_UNMAP_ALL:
 		return 1;
 	case VFIO_UPDATE_VADDR:
@@ -2811,7 +2847,7 @@ static int vfio_iommu_type1_map_dma(struct vfio_iommu *iommu,
 	struct vfio_iommu_type1_dma_map map;
 	unsigned long minsz;
 	uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE |
-			VFIO_DMA_MAP_FLAG_VADDR;
+			VFIO_DMA_MAP_FLAG_VADDR | VFIO_DMA_MAP_FLAG_MMIO_DONT_PIN;
 
 	minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
 
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 2b68e6cdf1902..ca391ec41b3c3 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -56,6 +56,16 @@
  */
 #define VFIO_UPDATE_VADDR		10
 
+/*
+ * Support VFIO_DMA_MAP_FLAG_MMIO_DONT_PIN for DMA mapping. For MMIO addresses,
+ * we do not need to pin the pages or establish address mapping in the MMU
+ * at this stage. We only need to establish the address mapping in the IOMMU for
+ * the ioctl(VFIO_IOMMU_MAP_DMA). The page table mapping in the MMU will be
+ * dynamically established through the page fault mechanism when the page
+ * is accessed in the future.
+ */
+#define VFIO_DMA_MAP_MMIO_DONT_PIN	11
+
 /*
  * The IOCTL interface is designed for extensibility by embedding the
  * structure length (argsz) and flags into structures passed between
@@ -1560,6 +1570,7 @@ struct vfio_iommu_type1_dma_map {
 #define VFIO_DMA_MAP_FLAG_READ (1 << 0)		/* readable from device */
 #define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)	/* writable from device */
 #define VFIO_DMA_MAP_FLAG_VADDR (1 << 2)
+#define VFIO_DMA_MAP_FLAG_MMIO_DONT_PIN (1 << 3)	/* MMIO doesn't need pin page */
 	__u64	vaddr;				/* Process virtual address */
 	__u64	iova;				/* IO virtual address */
 	__u64	size;				/* Size of mapping (bytes) */
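
For illustration only, here is a minimal user-space sketch of the
arithmetic the MMIO fast path in vfio_pin_map_dma performs when a vma
carries VM_IO | VM_PFNMAP | VM_PGOFF_IS_PFN. It is not kernel code and
not part of the patch: sketch_vma, sketch_chunk, sketch_mmio_chunk and
SKETCH_PAGE_SHIFT are hypothetical names, and 4 KiB pages are assumed.
It uses unsigned 64-bit values where the patch uses min_t(long, ...),
which gives the same result for values that fit in a long.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages; the kernel uses PAGE_SHIFT for this. */
#define SKETCH_PAGE_SHIFT 12

/* Hypothetical stand-in for the three vm_area_struct fields the patch reads. */
struct sketch_vma {
	uint64_t vm_start;	/* first virtual address covered by the vma */
	uint64_t vm_end;	/* first virtual address past the vma */
	uint64_t vm_pgoff;	/* with VM_PGOFF_IS_PFN: pfn backing vm_start */
};

/* Result of one loop iteration: the chunk handed to the IOMMU mapping step. */
struct sketch_chunk {
	uint64_t pfn;		/* first page frame number of the chunk */
	uint64_t npage;		/* number of pages in the chunk */
};

/*
 * Compute the chunk starting at virtual address 'start' with 'size' bytes
 * still to map, without pinning: the pfn comes straight from vm_pgoff and
 * the chunk is clamped to the end of the vma.
 */
static struct sketch_chunk sketch_mmio_chunk(const struct sketch_vma *vma,
					     uint64_t start, uint64_t size)
{
	struct sketch_chunk c;
	uint64_t left_in_vma;

	/* The caller looked the vma up by 'start', so it must contain it. */
	assert(start >= vma->vm_start && start < vma->vm_end);

	left_in_vma = vma->vm_end - start;
	c.pfn = vma->vm_pgoff + ((start - vma->vm_start) >> SKETCH_PAGE_SHIFT);
	c.npage = (left_in_vma < size ? left_in_vma : size) >> SKETCH_PAGE_SHIFT;
	return c;
}
```

For a 16-page vma whose vm_pgoff is 0x100000, a request that starts
4 pages into the vma and asks for 64 pages yields pfn 0x100004 and
npage 12: the chunk stops at the end of the vma, and the enclosing
while (size) loop would look up the next vma for the remainder.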