[v3,00/17] Provide a new two step DMA mapping API

Message ID: cover.1731244445.git.leon@kernel.org

Leon Romanovsky Nov. 10, 2024, 1:46 p.m. UTC
Changelog:
v3:
 * Added DMA_ATTR_SKIP_CPU_SYNC to p2p pages in HMM.
 * Fixed error unwind if dma_iova_sync fails in HMM.
 * Clear all PFN flags which were set in map to make the code
   cleaner; the callers cleaned them anyway.
 * Generalize sticky PFN flags logic in HMM.
 * Removed unneeded #ifdef-#endif section.
v2: https://lore.kernel.org/all/cover.1730892663.git.leon@kernel.org
 * Fixed docs file as Randy suggested
 * Fixed releases of memory in HMM path. It was allocated with kv..
   variants but released with kfree instead of kvfree.
 * Slightly changed commit message in VFIO patch.
v1: https://lore.kernel.org/all/cover.1730298502.git.leon@kernel.org
 * Squashed two VFIO patches into one
 * Added Acked-by/Reviewed-by tags
 * Fixed docs spelling errors
 * Simplified dma_iova_sync() API
 * Added an extra check of the mapped size in dma_iova_destroy() to make
   the code clearer
 * Fixed checkpatch warnings in p2p patch
 * Changed implementation of VFIO mlx5 mlx5vf_add_migration_pages() to
   be more general
 * Reduced the number of changes in VFIO patch
v0: https://lore.kernel.org/all/cover.1730037276.git.leon@kernel.org

----------------------------------------------------------------------------
The code can be downloaded from:
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git tag:dma-split-nov-09

Christoph,

Can you please take this series through your DMA mapping tree, so it
can soak long enough in linux-next before the PR is sent to Linus?

Thanks
----------------------------------------------------------------------------
Currently the only efficient way to map a complex memory description through
the DMA API is by using the scatterlist APIs. The SG APIs are unique in that
they efficiently combine the two fundamental operations of sizing and allocating
a large IOVA window from the IOMMU and processing all the per-address
swiotlb/flushing/p2p/map details.

This uniqueness has been a long-standing pain point, as the scatterlist API
is mandatory but expensive to use. It prevents any kind of optimization or
feature improvement (such as avoiding struct page for P2P) because the
scatterlist itself cannot be improved.

Several approaches have been explored to expand the DMA API with additional
scatterlist-like structures (BIO, rlist). Instead, this series splits up the
DMA API to allow callers to bring their own data structure.

The API is split up into parts:
 - Allocate IOVA space:
    To do any pre-allocation required. This is done based on the caller
    supplying some details about how much IOMMU address space it would need
    in the worst case.
 - Map and unmap relevant structures to pre-allocated IOVA space:
    Perform the actual mapping into the pre-allocated IOVA. This is very
    similar to dma_map_page().
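
To illustrate the shape of the split, here is a minimal userspace sketch, not
the code from this series. Every name in it (struct sketch_iova_state,
sketch_iova_alloc(), sketch_iova_link(), sketch_iova_unlink()) is hypothetical,
and the "page table" is a plain array standing in for the IOMMU. It only shows
that once the window is reserved in step 1, the device address in step 2 is
just base + offset, so the caller does not need to store one per page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_MAX_PAGES 64u

/* Hypothetical per-mapping state; NOT the kernel's real structure. */
struct sketch_iova_state {
	uint64_t base;                   /* start of the reserved IOVA window */
	size_t size;                     /* bytes reserved in step 1 */
	uint64_t phys[SKETCH_MAX_PAGES]; /* toy page table, 0 means "not linked" */
};

/* Step 1: reserve IOVA space once, sized for the caller's worst case. */
static int sketch_iova_alloc(struct sketch_iova_state *st, uint64_t base,
			     size_t worst_case_bytes)
{
	size_t pages = (worst_case_bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
	size_t i;

	assert(st != NULL);
	if (pages == 0 || pages > SKETCH_MAX_PAGES)
		return -1;
	st->base = base;
	st->size = pages * SKETCH_PAGE_SIZE;
	for (i = 0; i < SKETCH_MAX_PAGES; i++)
		st->phys[i] = 0;
	return 0;
}

/* Step 2a: link one page into the pre-allocated window at a given offset. */
static int sketch_iova_link(struct sketch_iova_state *st, size_t offset,
			    uint64_t phys, uint64_t *dev_addr)
{
	assert(st != NULL && dev_addr != NULL);
	if (offset % SKETCH_PAGE_SIZE != 0 || offset >= st->size || phys == 0)
		return -1;
	st->phys[offset / SKETCH_PAGE_SIZE] = phys;
	*dev_addr = st->base + offset; /* derived, so no per-page storage needed */
	return 0;
}

/* Step 2b: unlink a page; the window itself stays reserved for reuse. */
static int sketch_iova_unlink(struct sketch_iova_state *st, size_t offset)
{
	assert(st != NULL);
	if (offset % SKETCH_PAGE_SIZE != 0 || offset >= st->size)
		return -1;
	st->phys[offset / SKETCH_PAGE_SIZE] = 0;
	return 0;
}
```

A caller would invoke sketch_iova_alloc() once for the whole range and then
call sketch_iova_link()/sketch_iova_unlink() per page as the working set
changes, keeping its pages in whatever data structure suits it.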

In this and the next series [1], three different users are converted to the
new API as examples, to show its benefits and versatility. Each user has a
unique flow:
 1. RDMA ODP is an example of "SVA mirroring" using HMM that needs to
    dynamically map/unmap large numbers of single pages. This becomes
    significantly faster in the IOMMU case as the map/unmap is now just
    a page table walk; the IOVA allocation is pre-computed once. Significant
    amounts of memory are saved as there is no longer a need to store the
    dma_addr_t of each page.
 2. VFIO PCI live migration code is building a very large "page list"
    for the device. Instead of allocating a scatter list entry per allocated
    page it can just allocate an array of 'struct page *', saving a large
    amount of memory.
 3. NVMe PCI demonstrates how a BIO can be converted to a HW scatter
    list without having to allocate and then populate an intermediate SG table.

To make the use of the new API easier, the HMM and block subsystems are extended
to hide the optimization details from the caller. Among these optimizations:
 * Memory reduction as in most real use cases there is no need to store mapped
   DMA addresses and unmap them.
 * Reducing the function call overhead by using direct calls instead of
   calls through function pointers.

This is the first step along a path to provide alternatives to scatterlist and
solve some of the abuses and design mistakes, for instance in DMABUF's P2P
support.

Thanks

[1] This still points to v0, as the change is just around handling dma_iova_sync()
and the extra attribute flag provided to map/unmap:
https://lore.kernel.org/all/cover.1730037261.git.leon@kernel.org


Christoph Hellwig (6):
  PCI/P2PDMA: Refactor the p2pdma mapping helpers
  dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h
  iommu: generalize the batched sync after map interface
  iommu/dma: Factor out a iommu_dma_map_swiotlb helper
  dma-mapping: add a dma_need_unmap helper
  docs: core-api: document the IOVA-based API

Leon Romanovsky (11):
  dma-mapping: Add check if IOVA can be used
  dma: Provide an interface to allow allocate IOVA
  dma-mapping: Implement link/unlink ranges API
  mm/hmm: let users to tag specific PFN with DMA mapped bit
  mm/hmm: provide generic DMA managing logic
  RDMA/umem: Store ODP access mask information in PFN
  RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page
    linkage
  RDMA/umem: Separate implicit ODP initialization from explicit ODP
  vfio/mlx5: Explicitly use number of pages instead of allocated length
  vfio/mlx5: Rewrite create mkey flow to allow better code reuse
  vfio/mlx5: Enable the DMA link API

 Documentation/core-api/dma-api.rst   |  70 ++++
 drivers/infiniband/core/umem_odp.c   | 250 +++++----------
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  12 +-
 drivers/infiniband/hw/mlx5/odp.c     |  65 ++--
 drivers/infiniband/hw/mlx5/umr.c     |  12 +-
 drivers/iommu/dma-iommu.c            | 459 +++++++++++++++++++++++----
 drivers/iommu/iommu.c                |  65 ++--
 drivers/pci/p2pdma.c                 |  38 +--
 drivers/vfio/pci/mlx5/cmd.c          | 373 +++++++++++-----------
 drivers/vfio/pci/mlx5/cmd.h          |  35 +-
 drivers/vfio/pci/mlx5/main.c         |  87 +++--
 include/linux/dma-map-ops.h          |  54 ----
 include/linux/dma-mapping.h          |  85 +++++
 include/linux/hmm-dma.h              |  32 ++
 include/linux/hmm.h                  |  21 ++
 include/linux/iommu.h                |   4 +
 include/linux/pci-p2pdma.h           |  84 +++++
 include/rdma/ib_umem_odp.h           |  25 +-
 kernel/dma/direct.c                  |  44 +--
 kernel/dma/mapping.c                 |  18 ++
 mm/hmm.c                             | 250 ++++++++++++++-
 21 files changed, 1399 insertions(+), 684 deletions(-)
 create mode 100644 include/linux/hmm-dma.h

Comments

Leon Romanovsky Nov. 10, 2024, 3:02 p.m. UTC | #1
On Sun, Nov 10, 2024 at 03:46:47PM +0200, Leon Romanovsky wrote:
> Changelog:
> v3:

<...>

> Christoph Hellwig (6):
>   PCI/P2PDMA: Refactor the p2pdma mapping helpers
>   dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h
>   iommu: generalize the batched sync after map interface
>   iommu/dma: Factor out a iommu_dma_map_swiotlb helper
>   dma-mapping: add a dma_need_unmap helper
>   docs: core-api: document the IOVA-based API
> 
> Leon Romanovsky (11):
>   dma-mapping: Add check if IOVA can be used
>   dma: Provide an interface to allow allocate IOVA
>   dma-mapping: Implement link/unlink ranges API
>   mm/hmm: let users to tag specific PFN with DMA mapped bit
>   mm/hmm: provide generic DMA managing logic

This patch got detached from the thread and can be found here:
https://lore.kernel.org/all/a42f5a1ede10d4181c5cab3c88ed43a04be79565.1731245995.git.leon@kernel.org

Thanks

>   RDMA/umem: Store ODP access mask information in PFN
>   RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page
>     linkage
>   RDMA/umem: Separate implicit ODP initialization from explicit ODP
>   vfio/mlx5: Explicitly use number of pages instead of allocated length
>   vfio/mlx5: Rewrite create mkey flow to allow better code reuse
>   vfio/mlx5: Enable the DMA link API
Leon Romanovsky Nov. 12, 2024, 7:20 a.m. UTC | #2
On Sun, Nov 10, 2024 at 03:46:47PM +0200, Leon Romanovsky wrote:

<...>

> ----------------------------------------------------------------------------
> The code can be downloaded from:
> https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git tag:dma-split-nov-09

<...>

> 
> Christoph Hellwig (6):
>   PCI/P2PDMA: Refactor the p2pdma mapping helpers
>   dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h
>   iommu: generalize the batched sync after map interface
>   iommu/dma: Factor out a iommu_dma_map_swiotlb helper
>   dma-mapping: add a dma_need_unmap helper
>   docs: core-api: document the IOVA-based API
> 
> Leon Romanovsky (11):
>   dma-mapping: Add check if IOVA can be used
>   dma: Provide an interface to allow allocate IOVA
>   dma-mapping: Implement link/unlink ranges API
>   mm/hmm: let users to tag specific PFN with DMA mapped bit
>   mm/hmm: provide generic DMA managing logic
>   RDMA/umem: Store ODP access mask information in PFN
>   RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page
>     linkage
>   RDMA/umem: Separate implicit ODP initialization from explicit ODP
>   vfio/mlx5: Explicitly use number of pages instead of allocated length
>   vfio/mlx5: Rewrite create mkey flow to allow better code reuse
>   vfio/mlx5: Enable the DMA link API

Robin,

All technical concerns were handled and this series is ready to be merged.

Robin, can you please Ack the dma-iommu patches?

Thanks
Leon Romanovsky Nov. 14, 2024, 1:30 p.m. UTC | #3
On Tue, Nov 12, 2024 at 09:20:40AM +0200, Leon Romanovsky wrote:
> On Sun, Nov 10, 2024 at 03:46:47PM +0200, Leon Romanovsky wrote:
> 
> <...>
> 
> > ----------------------------------------------------------------------------
> > The code can be downloaded from:
> > https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git tag:dma-split-nov-09
> 
> <...>
> 
> > 
> > Christoph Hellwig (6):
> >   PCI/P2PDMA: Refactor the p2pdma mapping helpers
> >   dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h
> >   iommu: generalize the batched sync after map interface
> >   iommu/dma: Factor out a iommu_dma_map_swiotlb helper
> >   dma-mapping: add a dma_need_unmap helper
> >   docs: core-api: document the IOVA-based API
> > 
> > Leon Romanovsky (11):
> >   dma-mapping: Add check if IOVA can be used
> >   dma: Provide an interface to allow allocate IOVA
> >   dma-mapping: Implement link/unlink ranges API
> >   mm/hmm: let users to tag specific PFN with DMA mapped bit
> >   mm/hmm: provide generic DMA managing logic
> >   RDMA/umem: Store ODP access mask information in PFN
> >   RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page
> >     linkage
> >   RDMA/umem: Separate implicit ODP initialization from explicit ODP
> >   vfio/mlx5: Explicitly use number of pages instead of allocated length
> >   vfio/mlx5: Rewrite create mkey flow to allow better code reuse
> >   vfio/mlx5: Enable the DMA link API
> 
> Robin,
> 
> All technical concerns were handled and this series is ready to be merged.
> 
> Robin, can you please Ack the dma-iommu patches?

I don't see any response, so my assumption is that this series is ready
to be merged. Let's do it this cycle and save us the burden of
having dependencies between subsystems.

Thanks

> 
> Thanks
>
Christoph Hellwig Nov. 14, 2024, 4:36 p.m. UTC | #4
On Thu, Nov 14, 2024 at 03:30:11PM +0200, Leon Romanovsky wrote:
> > All technical concerns were handled and this series is ready to be merged.
> > 
> > Robin, can you please Ack the dma-iommu patches?
> 
> I don't see any response, so my assumption is that this series is ready
> to be merged. Let's do it this cycle and save us the burden of
> having dependencies between subsystems.

Slow down, please.  Nothing of this complexity is going to get merged
half a week before a release.

No changes to dma-iommu.c are going to get merged without an explicit
ACK from Robin, and while there is no 100% for the core dma-mapping
code I will also not merge anything that hasn't been resolved with
my most trusted reviewer first, not even code I wrote myself.
Leon Romanovsky Nov. 14, 2024, 4:48 p.m. UTC | #5
On Thu, Nov 14, 2024 at 05:36:22PM +0100, Christoph Hellwig wrote:
> On Thu, Nov 14, 2024 at 03:30:11PM +0200, Leon Romanovsky wrote:
> > > All technical concerns were handled and this series is ready to be merged.
> > > 
> > > Robin, can you please Ack the dma-iommu patches?
> > 
> > I don't see any response, so my assumption is that this series is ready
> > to be merged. Let's do it this cycle and save us the burden of
> > having dependencies between subsystems.
> 
> Slow down, please.  Nothing of this complexity is going to get merged
> half a week before a release.

It is fine, but as a bare minimum, I would expect some sort of response,
so I won't sit and wait until some unknown date for this API to be
acknowledged/NACKed.

I would like to start working on the next step, e.g. removing SG from RDMA,
and I need to know if this foundation is stable enough to do so.

> 
> No changes to dma-iommu.c are going to get merged without an explicit
> ACK from Robin, and while there is no 100% for the core dma-mapping
> code I will also not merge anything that hasn't been resolved with
> my most trusted reviewer first, not even code I wrote myself.

And do you know what is not resolved here? I don't.
All technical questions/issues were handled.

Thanks
Christoph Hellwig Nov. 14, 2024, 5:02 p.m. UTC | #6
On Thu, Nov 14, 2024 at 06:48:23PM +0200, Leon Romanovsky wrote:
> It is fine, but as a bare minimum, I would expect some sort of response,
> so I won't sit and wait until some unknown date for this API to be
> acknowledged/NACKed.
> 
> I would like to start working on the next step, e.g. removing SG from RDMA,
> and I need to know if this foundation is stable enough to do so.
> 
> > 
> > No changes to dma-iommu.c are going to get merged without an explicit
> > ACK from Robin, and while there is no 100% for the core dma-mapping
> > code I will also not merge anything that hasn't been resolved with
> > my most trusted reviewer first, not even code I wrote myself.
> 
> And do you know what is not resolved here? I don't.
> All technical questions/issues were handled.

Let's just wait a little bit; we're all overworked and can't instantly
context switch.