From patchwork Wed Feb 5 14:40:20 2025
X-Patchwork-Submitter: Leon Romanovsky
X-Patchwork-Id: 13961179
From: Leon Romanovsky
To: Christoph Hellwig, Jason Gunthorpe, Robin Murphy
Cc: Leon Romanovsky, Jens Axboe, Joerg Roedel, Will Deacon,
 Sagi Grimberg, Keith Busch, Bjorn Helgaas, Logan Gunthorpe,
 Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
 Marek Szyprowski, Jérôme Glisse, Andrew Morton, Jonathan Corbet,
 linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-block@vger.kernel.org, linux-rdma@vger.kernel.org,
 iommu@lists.linux.dev, linux-nvme@lists.infradead.org,
 linux-pci@vger.kernel.org, kvm@vger.kernel.org, linux-mm@kvack.org,
 Randy Dunlap
Subject: [PATCH v7 00/17] Provide a new two step DMA mapping API
Date: Wed, 5 Feb 2025 16:40:20 +0200
X-Mailer: git-send-email 2.48.1

From: Leon Romanovsky

Changelog:
v7:
 * Rebased to v6.14-rc1
v6: https://lore.kernel.org/all/cover.1737106761.git.leon@kernel.org
 * Changed internal __size variable to u64 to properly set private flag
   in most significant bit.
 * Added comment about why we check DMA_IOVA_USE_SWIOTLB
 * Break unlink loop if phys is NULL, a condition which we shouldn't
   get.
v5: https://lore.kernel.org/all/cover.1734436840.git.leon@kernel.org
 * Trimmed long lines in all patches.
 * Squashed "dma-mapping: Add check if IOVA can be used" into
   "dma: Provide an interface to allow allocate IOVA" patch.
 * Added tags from Christoph and Will.
 * Fixed spelling/grammar errors.
 * Changed title from "dma: Provide an ..." to
   "dma-mapping: Provide an ...".
 * Slightly changed hmm patch to set sticky flags in one place.
v4: https://lore.kernel.org/all/cover.1733398913.git.leon@kernel.org
 * Added extra patch to add kernel-doc for iommu_unmap and
   iommu_unmap_fast
 * Rebased to v6.13-rc1
 * Added Will's tags
v3: https://lore.kernel.org/all/cover.1731244445.git.leon@kernel.org
 * Added DMA_ATTR_SKIP_CPU_SYNC to p2p pages in HMM.
 * Fixed error unwind if dma_iova_sync fails in HMM.
 * Clear all PFN flags which were set in map to make the code cleaner;
   the callers cleaned them anyway.
 * Generalized sticky PFN flags logic in HMM.
 * Removed unneeded #ifdef-#endif section.
v2: https://lore.kernel.org/all/cover.1730892663.git.leon@kernel.org
 * Fixed docs file as Randy suggested
 * Fixed releases of memory in HMM path. It was allocated with kv..
   variants but released with kfree instead of kvfree.
 * Slightly changed commit message in VFIO patch.
v1: https://lore.kernel.org/all/cover.1730298502.git.leon@kernel.org
 * Squashed two VFIO patches into one
 * Added Acked-by/Reviewed-by tags
 * Fixed docs spelling errors
 * Simplified dma_iova_sync() API
 * Added extra check of mapped size in dma_iova_destroy() to make the
   code clearer
 * Fixed checkpatch warnings in p2p patch
 * Changed implementation of VFIO mlx5 mlx5vf_add_migration_pages() to
   be more general
 * Reduced the number of changes in VFIO patch
v0: https://lore.kernel.org/all/cover.1730037276.git.leon@kernel.org

----------------------------------------------------------------------------
No changes in checks, documentation and naming as no suggestions were
given. Everything like this can be improved in followup patches.
----------------------------------------------------------------------------
LWN coverage:
Dancing the DMA two-step - https://lwn.net/Articles/997563/
----------------------------------------------------------------------------

Currently the only efficient way to map a complex memory description
through the DMA API is by using the scatterlist APIs. The SG APIs are
unique in that they efficiently combine the two fundamental operations
of sizing and allocating a large IOVA window from the IOMMU and
processing all the per-address swiotlb/flushing/p2p/map details.

This uniqueness has been a long standing pain point as the scatterlist
API is mandatory, but expensive to use. It prevents any kind of
optimization or feature improvement (such as avoiding struct page for
P2P) due to the impossibility of improving the scatterlist.

Several approaches have been explored to expand the DMA API with
additional scatterlist-like structures (BIO, rlist). Instead, this
series splits up the DMA API to allow callers to bring their own data
structure.

The API is split up into parts:
 - Allocate IOVA space:
    To do any pre-allocation required. This is done based on the caller
    supplying some details about how much IOMMU address space it would
    need in the worst case.
 - Map and unmap relevant structures to pre-allocated IOVA space:
    Perform the actual mapping into the pre-allocated IOVA. This is very
    similar to dma_map_page().

In this and the next series [1], examples of three different users are
converted to the new API to show its benefits and versatility. Each
user has a unique flow:
 1. RDMA ODP is an example of "SVA mirroring" using HMM that needs to
    dynamically map/unmap large numbers of single pages. This becomes
    significantly faster in the IOMMU case as the map/unmap is now just
    a page table walk; the IOVA allocation is pre-computed once.
    Significant amounts of memory are saved as there is no longer a need
    to store the dma_addr_t of each page.
 2. VFIO PCI live migration code is building a very large "page list"
    for the device. Instead of allocating a scatter list entry per
    allocated page it can just allocate an array of 'struct page *',
    saving a large amount of memory.
 3. NVMe PCI demonstrates how a BIO can be converted to a HW scatter
    list without having to allocate then populate an intermediate SG
    table.

To make the use of the new API easier, HMM and block subsystems are
extended to hide the optimization details from the caller. Among these
optimizations:
 * Memory reduction, as in most real use cases there is no need to store
   mapped DMA addresses in order to unmap them.
 * Reduced function call overhead, by using direct calls instead of
   calls through function pointers.

This is the first step along a path to provide alternatives to
scatterlist and solve some of the abuses and design mistakes.
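
To illustrate the split described above, here is a minimal userspace
sketch of the two-step idea: reserve a worst-case IOVA window once,
then link and unlink individual pages at offsets inside it. It is a toy
model only. Every name in it (toy_iova_state, toy_iova_alloc,
toy_iova_link, toy_iova_addr, toy_iova_unlink) is hypothetical and is
not the kernel API added by this series; the array of physical
addresses merely stands in for IOMMU page table entries.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace toy model of the two-step flow. All names here are
 * hypothetical and are NOT the kernel API added by this series.
 */
#define TOY_PAGE_SIZE 4096u
#define TOY_MAX_PAGES 16u

struct toy_iova_state {
	uint64_t base;                /* start of the reserved IOVA window */
	size_t size;                  /* bytes reserved in step 1 */
	uint64_t phys[TOY_MAX_PAGES]; /* stand-in for IOMMU page table entries */
};

/* Step 1: reserve the worst-case IOVA window once. Returns 0 on success. */
static int toy_iova_alloc(struct toy_iova_state *st, uint64_t base,
			  size_t size)
{
	if (size == 0 || size % TOY_PAGE_SIZE ||
	    size / TOY_PAGE_SIZE > TOY_MAX_PAGES)
		return -1;
	st->base = base;
	st->size = size;
	for (size_t i = 0; i < TOY_MAX_PAGES; i++)
		st->phys[i] = 0; /* 0 means "nothing linked" in this model */
	return 0;
}

/* Step 2: link one page at a page-aligned offset inside the window. */
static int toy_iova_link(struct toy_iova_state *st, uint64_t phys,
			 size_t offset)
{
	if (offset % TOY_PAGE_SIZE || offset >= st->size || phys == 0)
		return -1;
	st->phys[offset / TOY_PAGE_SIZE] = phys;
	return 0;
}

/* The device address is derived from base + offset, not stored per page. */
static uint64_t toy_iova_addr(const struct toy_iova_state *st, size_t offset)
{
	return st->base + offset;
}

/* Undo step 2 for one page; the window itself stays reserved. */
static void toy_iova_unlink(struct toy_iova_state *st, size_t offset)
{
	if (offset % TOY_PAGE_SIZE == 0 && offset < st->size)
		st->phys[offset / TOY_PAGE_SIZE] = 0;
}
```

In this model a caller would reserve, say, four pages with
toy_iova_alloc() and then call toy_iova_link() and toy_iova_unlink()
repeatedly as pages come and go, without ever repeating step 1. Because
the device address is always base + offset, the caller only has to keep
the single state structure, which is the kind of memory saving the
RDMA ODP example above refers to.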

Thanks

Christoph Hellwig (6):
  PCI/P2PDMA: Refactor the p2pdma mapping helpers
  dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h
  iommu: generalize the batched sync after map interface
  iommu/dma: Factor out a iommu_dma_map_swiotlb helper
  dma-mapping: add a dma_need_unmap helper
  docs: core-api: document the IOVA-based API

Leon Romanovsky (11):
  iommu: add kernel-doc for iommu_unmap and iommu_unmap_fast
  dma-mapping: Provide an interface to allow allocate IOVA
  dma-mapping: Implement link/unlink ranges API
  mm/hmm: let users to tag specific PFN with DMA mapped bit
  mm/hmm: provide generic DMA managing logic
  RDMA/umem: Store ODP access mask information in PFN
  RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page linkage
  RDMA/umem: Separate implicit ODP initialization from explicit ODP
  vfio/mlx5: Explicitly use number of pages instead of allocated length
  vfio/mlx5: Rewrite create mkey flow to allow better code reuse
  vfio/mlx5: Enable the DMA link API

 Documentation/core-api/dma-api.rst   |  70 ++++
 drivers/infiniband/core/umem_odp.c   | 250 +++++---------
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  12 +-
 drivers/infiniband/hw/mlx5/odp.c     |  65 ++--
 drivers/infiniband/hw/mlx5/umr.c     |  12 +-
 drivers/iommu/dma-iommu.c            | 468 +++++++++++++++++++++++----
 drivers/iommu/iommu.c                |  84 ++---
 drivers/pci/p2pdma.c                 |  38 +--
 drivers/vfio/pci/mlx5/cmd.c          | 375 +++++++++++----------
 drivers/vfio/pci/mlx5/cmd.h          |  35 +-
 drivers/vfio/pci/mlx5/main.c         |  87 +++--
 include/linux/dma-map-ops.h          |  54 ----
 include/linux/dma-mapping.h          |  85 +++++
 include/linux/hmm-dma.h              |  33 ++
 include/linux/hmm.h                  |  21 ++
 include/linux/iommu.h                |   4 +
 include/linux/pci-p2pdma.h           |  84 +++++
 include/rdma/ib_umem_odp.h           |  25 +-
 kernel/dma/direct.c                  |  44 +--
 kernel/dma/mapping.c                 |  18 ++
 mm/hmm.c                             | 264 +++++++++++++--
 21 files changed, 1435 insertions(+), 693 deletions(-)
 create mode 100644 include/linux/hmm-dma.h