
[05/18] dma: Provide an interface to allow allocate IOVA

Message ID: 844f3dcf9c341b8178bfbc90909ef13d11dd2193.1730037276.git.leon@kernel.org
State: Superseded
Series: Provide a new two step DMA mapping API

Commit Message

Leon Romanovsky Oct. 27, 2024, 2:21 p.m. UTC
From: Leon Romanovsky <leonro@nvidia.com>

The existing .map_page() callback provides both IOVA allocation and
linking of DMA pages. That combination works well for most callers,
which use it in control paths, but it is less effective in fast paths
where there may be multiple calls to map_page().

These advanced callers already manage their data in some sort of
database and can perform the IOVA allocation in advance, leaving only
the range linkage operation in the fast path.

Provide an interface to allocate/deallocate an IOVA; the next patch
adds the link/unlink of DMA ranges to that specific IOVA.

The API is exported from dma-iommu as it is the only supported
implementation, and the namespace is clearly different from the
iommu_* functions, which callers are not allowed to use. This code
layout allows us to save a function call per API call in the datapath,
as well as a lot of boilerplate code.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/iommu/dma-iommu.c   | 79 +++++++++++++++++++++++++++++++++++++
 include/linux/dma-mapping.h | 15 +++++++
 2 files changed, 94 insertions(+)
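
For illustration, here is a minimal sketch of how a caller could split the
work between control path and fast path (hypothetical driver code; the
function names are made up, and dma_iova_link()/dma_iova_unlink() only
appear in the next patch, so they are referenced in comments only):

	#include <linux/dma-mapping.h>

	/* Control path: reserve IOVA space once for the whole transfer. */
	static int example_reserve(struct device *dev,
				   struct dma_iova_state *state,
				   phys_addr_t first_phys, size_t total_len)
	{
		if (dma_iova_try_alloc(dev, state, first_phys, total_len))
			return 0;	/* fast path attaches ranges via dma_iova_link() */

		/*
		 * The IOVA-based API is not available for this device; a real
		 * caller would fall back to dma_map_page()/dma_map_sg() here.
		 */
		return -EOPNOTSUPP;
	}

	/* Control path teardown, after all linked ranges have been unlinked. */
	static void example_release(struct device *dev,
				    struct dma_iova_state *state)
	{
		if (dma_use_iova(state))
			dma_iova_free(dev, state);
	}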

Comments

Baolu Lu Oct. 28, 2024, 1:24 a.m. UTC | #1
On 2024/10/27 22:21, Leon Romanovsky wrote:
> From: Leon Romanovsky<leonro@nvidia.com>
> 
> The existing .map_page() callback provides both allocating of IOVA
> and linking DMA pages. That combination works great for most of the
> callers who use it in control paths, but is less effective in fast
> paths where there may be multiple calls to map_page().
> 
> These advanced callers already manage their data in some sort of
> database and can perform IOVA allocation in advance, leaving range
> linkage operation to be in fast path.
> 
> Provide an interface to allocate/deallocate IOVA and next patch
> link/unlink DMA ranges to that specific IOVA.
> 
> The API is exported from dma-iommu as it is the only implementation
> supported, the namespace is clearly different from iommu_* functions
> which are not allowed to be used. This code layout allows us to save
> function call per API call used in datapath as well as a lot of boilerplate
> code.
> 
> Signed-off-by: Leon Romanovsky<leonro@nvidia.com>
> ---
>   drivers/iommu/dma-iommu.c   | 79 +++++++++++++++++++++++++++++++++++++
>   include/linux/dma-mapping.h | 15 +++++++
>   2 files changed, 94 insertions(+)
> 
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index c422e36c0d66..0644152c5aad 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -1745,6 +1745,85 @@ size_t iommu_dma_max_mapping_size(struct device *dev)
>   	return SIZE_MAX;
>   }
>   
> +static bool iommu_dma_iova_alloc(struct device *dev,
> +		struct dma_iova_state *state, phys_addr_t phys, size_t size)
> +{
> +	struct iommu_domain *domain = iommu_get_dma_domain(dev);
> +	struct iommu_dma_cookie *cookie = domain->iova_cookie;
> +	struct iova_domain *iovad = &cookie->iovad;
> +	size_t iova_off = iova_offset(iovad, phys);
> +	dma_addr_t addr;
> +
> +	if (WARN_ON_ONCE(!size))
> +		return false;
> +	if (WARN_ON_ONCE(size & DMA_IOVA_USE_SWIOTLB))
> +		return false;
> +
> +	addr = iommu_dma_alloc_iova(domain,
> +			iova_align(iovad, size + iova_off),
> +			dma_get_mask(dev), dev);
> +	if (!addr)
> +		return false;
> +
> +	state->addr = addr + iova_off;
> +	state->__size = size;
> +	return true;
> +}
> +
> +/**
> + * dma_iova_try_alloc - Try to allocate an IOVA space
> + * @dev: Device to allocate the IOVA space for
> + * @state: IOVA state
> + * @phys: physical address
I'm curious to know why a physical address is necessary for IOVA space
allocation. Could you please elaborate?

> + * @size: IOVA size
> + *
> + * Check if @dev supports the IOVA-based DMA API, and if yes allocate IOVA space
> + * for the given base address and size.
> + *
> + * Note: @phys is only used to calculate the IOVA alignment. Callers that always
> + * do PAGE_SIZE aligned transfers can safely pass 0 here.
> + *
> + * Returns %true if the IOVA-based DMA API can be used and IOVA space has been
> + * allocated, or %false if the regular DMA API should be used.
> + */
> +bool dma_iova_try_alloc(struct device *dev, struct dma_iova_state *state,
> +		phys_addr_t phys, size_t size)
> +{
> +	memset(state, 0, sizeof(*state));
> +	if (!use_dma_iommu(dev))
> +		return false;
> +	if (static_branch_unlikely(&iommu_deferred_attach_enabled) &&
> +	    iommu_deferred_attach(dev, iommu_get_domain_for_dev(dev)))
> +		return false;
> +	return iommu_dma_iova_alloc(dev, state, phys, size);
> +}
> +EXPORT_SYMBOL_GPL(dma_iova_try_alloc);

Thanks,
baolu
Srinivasulu Thanneeru Oct. 28, 2024, 4:24 a.m. UTC | #2
On Sun, Oct 27, 2024 at 10:23 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> From: Leon Romanovsky <leonro@nvidia.com>
>
> The existing .map_page() callback provides both allocating of IOVA
> and linking DMA pages. That combination works great for most of the
> callers who use it in control paths, but is less effective in fast
> paths where there may be multiple calls to map_page().

Can you please share perf stats for the fast path with this patch, if available?

> These advanced callers already manage their data in some sort of
> database and can perform IOVA allocation in advance, leaving range
> linkage operation to be in fast path.
>
> Provide an interface to allocate/deallocate IOVA and next patch
> link/unlink DMA ranges to that specific IOVA.
>
> The API is exported from dma-iommu as it is the only implementation
> supported, the namespace is clearly different from iommu_* functions
> which are not allowed to be used. This code layout allows us to save
> function call per API call used in datapath as well as a lot of boilerplate
> code.
>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  drivers/iommu/dma-iommu.c   | 79 +++++++++++++++++++++++++++++++++++++
>  include/linux/dma-mapping.h | 15 +++++++
>  2 files changed, 94 insertions(+)
>
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index c422e36c0d66..0644152c5aad 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -1745,6 +1745,85 @@ size_t iommu_dma_max_mapping_size(struct device *dev)
>         return SIZE_MAX;
>  }
>
> +static bool iommu_dma_iova_alloc(struct device *dev,
> +               struct dma_iova_state *state, phys_addr_t phys, size_t size)
> +{
> +       struct iommu_domain *domain = iommu_get_dma_domain(dev);
> +       struct iommu_dma_cookie *cookie = domain->iova_cookie;
> +       struct iova_domain *iovad = &cookie->iovad;
> +       size_t iova_off = iova_offset(iovad, phys);
> +       dma_addr_t addr;
> +
> +       if (WARN_ON_ONCE(!size))
> +               return false;
> +       if (WARN_ON_ONCE(size & DMA_IOVA_USE_SWIOTLB))
> +               return false;
> +
> +       addr = iommu_dma_alloc_iova(domain,
> +                       iova_align(iovad, size + iova_off),
> +                       dma_get_mask(dev), dev);
> +       if (!addr)
> +               return false;
> +
> +       state->addr = addr + iova_off;
> +       state->__size = size;
> +       return true;
> +}
> +
> +/**
> + * dma_iova_try_alloc - Try to allocate an IOVA space
> + * @dev: Device to allocate the IOVA space for
> + * @state: IOVA state
> + * @phys: physical address
> + * @size: IOVA size
> + *
> + * Check if @dev supports the IOVA-based DMA API, and if yes allocate IOVA space
> + * for the given base address and size.
> + *
> + * Note: @phys is only used to calculate the IOVA alignment. Callers that always
> + * do PAGE_SIZE aligned transfers can safely pass 0 here.
> + *
> + * Returns %true if the IOVA-based DMA API can be used and IOVA space has been
> + * allocated, or %false if the regular DMA API should be used.
> + */
> +bool dma_iova_try_alloc(struct device *dev, struct dma_iova_state *state,
> +               phys_addr_t phys, size_t size)
> +{
> +       memset(state, 0, sizeof(*state));
> +       if (!use_dma_iommu(dev))
> +               return false;
> +       if (static_branch_unlikely(&iommu_deferred_attach_enabled) &&
> +           iommu_deferred_attach(dev, iommu_get_domain_for_dev(dev)))
> +               return false;
> +       return iommu_dma_iova_alloc(dev, state, phys, size);
> +}
> +EXPORT_SYMBOL_GPL(dma_iova_try_alloc);
> +
> +/**
> + * dma_iova_free - Free an IOVA space
> + * @dev: Device to free the IOVA space for
> + * @state: IOVA state
> + *
> + * Undoes a successful dma_try_iova_alloc().
> + *
> + * Note that all dma_iova_link() calls need to be undone first.  For callers
> + * that never call dma_iova_unlink(), dma_iova_destroy() can be used instead
> + * which unlinks all ranges and frees the IOVA space in a single efficient
> + * operation.
> + */
> +void dma_iova_free(struct device *dev, struct dma_iova_state *state)
> +{
> +       struct iommu_domain *domain = iommu_get_dma_domain(dev);
> +       struct iommu_dma_cookie *cookie = domain->iova_cookie;
> +       struct iova_domain *iovad = &cookie->iovad;
> +       size_t iova_start_pad = iova_offset(iovad, state->addr);
> +       size_t size = dma_iova_size(state);
> +
> +       iommu_dma_free_iova(cookie, state->addr - iova_start_pad,
> +                       iova_align(iovad, size + iova_start_pad), NULL);
> +}
> +EXPORT_SYMBOL_GPL(dma_iova_free);
> +
>  void iommu_setup_dma_ops(struct device *dev)
>  {
>         struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index 6075e0708deb..817f11bce7bc 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -11,6 +11,7 @@
>  #include <linux/scatterlist.h>
>  #include <linux/bug.h>
>  #include <linux/mem_encrypt.h>
> +#include <linux/iommu.h>
>
>  /**
>   * List of possible attributes associated with a DMA mapping. The semantics
> @@ -77,6 +78,7 @@
>  #define DMA_BIT_MASK(n)        (((n) == 64) ? ~0ULL : ((1ULL<<(n))-1))
>
>  struct dma_iova_state {
> +       dma_addr_t addr;
>         size_t __size;
>  };
>
> @@ -307,11 +309,24 @@ static inline bool dma_use_iova(struct dma_iova_state *state)
>  {
>         return state->__size != 0;
>  }
> +
> +bool dma_iova_try_alloc(struct device *dev, struct dma_iova_state *state,
> +               phys_addr_t phys, size_t size);
> +void dma_iova_free(struct device *dev, struct dma_iova_state *state);
>  #else /* CONFIG_IOMMU_DMA */
>  static inline bool dma_use_iova(struct dma_iova_state *state)
>  {
>         return false;
>  }
> +static inline bool dma_iova_try_alloc(struct device *dev,
> +               struct dma_iova_state *state, phys_addr_t phys, size_t size)
> +{
> +       return false;
> +}
> +static inline void dma_iova_free(struct device *dev,
> +               struct dma_iova_state *state)
> +{
> +}
>  #endif /* CONFIG_IOMMU_DMA */
>
>  #if defined(CONFIG_HAS_DMA) && defined(CONFIG_DMA_NEED_SYNC)
> --
> 2.46.2
>
>
Leon Romanovsky Oct. 28, 2024, 6:37 a.m. UTC | #3
On Mon, Oct 28, 2024 at 09:24:08AM +0800, Baolu Lu wrote:
> On 2024/10/27 22:21, Leon Romanovsky wrote:
> > From: Leon Romanovsky<leonro@nvidia.com>
> > 
> > The existing .map_page() callback provides both allocating of IOVA
> > and linking DMA pages. That combination works great for most of the
> > callers who use it in control paths, but is less effective in fast
> > paths where there may be multiple calls to map_page().
> > 
> > These advanced callers already manage their data in some sort of
> > database and can perform IOVA allocation in advance, leaving range
> > linkage operation to be in fast path.
> > 
> > Provide an interface to allocate/deallocate IOVA and next patch
> > link/unlink DMA ranges to that specific IOVA.
> > 
> > The API is exported from dma-iommu as it is the only implementation
> > supported, the namespace is clearly different from iommu_* functions
> > which are not allowed to be used. This code layout allows us to save
> > function call per API call used in datapath as well as a lot of boilerplate
> > code.
> > 
> > Signed-off-by: Leon Romanovsky<leonro@nvidia.com>
> > ---
> >   drivers/iommu/dma-iommu.c   | 79 +++++++++++++++++++++++++++++++++++++
> >   include/linux/dma-mapping.h | 15 +++++++
> >   2 files changed, 94 insertions(+)
> > 
> > diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> > index c422e36c0d66..0644152c5aad 100644
> > --- a/drivers/iommu/dma-iommu.c
> > +++ b/drivers/iommu/dma-iommu.c
> > @@ -1745,6 +1745,85 @@ size_t iommu_dma_max_mapping_size(struct device *dev)
> >   	return SIZE_MAX;
> >   }
> > +static bool iommu_dma_iova_alloc(struct device *dev,
> > +		struct dma_iova_state *state, phys_addr_t phys, size_t size)
> > +{
> > +	struct iommu_domain *domain = iommu_get_dma_domain(dev);
> > +	struct iommu_dma_cookie *cookie = domain->iova_cookie;
> > +	struct iova_domain *iovad = &cookie->iovad;
> > +	size_t iova_off = iova_offset(iovad, phys);
> > +	dma_addr_t addr;
> > +
> > +	if (WARN_ON_ONCE(!size))
> > +		return false;
> > +	if (WARN_ON_ONCE(size & DMA_IOVA_USE_SWIOTLB))
> > +		return false;
> > +
> > +	addr = iommu_dma_alloc_iova(domain,
> > +			iova_align(iovad, size + iova_off),
> > +			dma_get_mask(dev), dev);
> > +	if (!addr)
> > +		return false;
> > +
> > +	state->addr = addr + iova_off;
> > +	state->__size = size;
> > +	return true;
> > +}
> > +
> > +/**
> > + * dma_iova_try_alloc - Try to allocate an IOVA space
> > + * @dev: Device to allocate the IOVA space for
> > + * @state: IOVA state
> > + * @phys: physical address
> I'm curious to know why a physical address is necessary for IOVA space
> allocation. Could you please elaborate?

The proposed API is not only split to allow batching of DMA
operations without the need for scatter-gather, but it also allows
users that don't have a "struct page" to use it.

In the IOMMU and DMA layers all operations are performed on physical
addresses, and the requirement to provide a "struct page" to
dma_map_sg()/dma_map_page() is not truly needed.

In this specific case, the physical address is used to calculate the
IOVA offset, see the "size_t iova_off = iova_offset(iovad, phys);"
line. This is needed for the NVMe PCI/block layer, as the first
address can be unaligned, and the IOVA allocation then needs that
offset to properly calculate the size.

HMM and VFIO operate at page granularity and in the simple case they
don't need the alignment. In more advanced scenarios they will benefit
from this offset anyway, as it reduces the amount of IOVA space used.
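
For illustration, the offset math above with hypothetical numbers,
assuming a 4 KiB IOVA granule (all values are made up for the example):

	phys     = 0x12345600;	/* unaligned first byte of the transfer */
	size     = 0x2000;	/* 8 KiB payload */
	iova_off = iova_offset(iovad, phys);		/* 0x600 */
	len      = iova_align(iovad, size + iova_off);	/* 0x2600 -> 0x3000 */
	/*
	 * iommu_dma_alloc_iova() then returns a granule-aligned base, and
	 * state->addr is set to base + 0x600, so the low bits of the DMA
	 * address match the low bits of the physical address.
	 */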

Thanks
Leon Romanovsky Oct. 28, 2024, 6:46 a.m. UTC | #4
On Mon, Oct 28, 2024 at 09:54:49AM +0530, Srinivasulu Thanneeru wrote:
> On Sun, Oct 27, 2024 at 10:23 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > From: Leon Romanovsky <leonro@nvidia.com>
> >
> > The existing .map_page() callback provides both allocating of IOVA
> > and linking DMA pages. That combination works great for most of the
> > callers who use it in control paths, but is less effective in fast
> > paths where there may be multiple calls to map_page().
> 
> Can you please share perf stats with this patch in fast path, if available?

I don't have this data for HMM and VFIO, as they gain benefits from this
series other than performance. For NVMe, I don't have the data yet, but it
will come in https://lore.kernel.org/all/cover.1730037261.git.leon@kernel.org/,
as it is the main performance-sensitive user of this API.

This is the main reason why the NVMe series is still marked as RFC.

Thanks
Christoph Hellwig Oct. 29, 2024, 7:46 a.m. UTC | #5
On Mon, Oct 28, 2024 at 08:37:40AM +0200, Leon Romanovsky wrote:
> In this specific case, the physical address is used to calculate
> IOVA offset, see "size_t iova_off = iova_offset(iovad, phys);" line,
> which is needed for NVMe PCI/block layer, as they can have first
> address to be unaligned and IOVA allocation will need an offset to
> properly calculate size.

And that is also very explicitly spelled out in the kerneldoc comments,
including the note that the physical address is optional if the
transfer is granule aligned (actually it says PAGE_SIZE, which should
be fixed).

Any suggestions to further improve it are of course welcome.

Patch

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index c422e36c0d66..0644152c5aad 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1745,6 +1745,85 @@  size_t iommu_dma_max_mapping_size(struct device *dev)
 	return SIZE_MAX;
 }
 
+static bool iommu_dma_iova_alloc(struct device *dev,
+		struct dma_iova_state *state, phys_addr_t phys, size_t size)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iommu_dma_cookie *cookie = domain->iova_cookie;
+	struct iova_domain *iovad = &cookie->iovad;
+	size_t iova_off = iova_offset(iovad, phys);
+	dma_addr_t addr;
+
+	if (WARN_ON_ONCE(!size))
+		return false;
+	if (WARN_ON_ONCE(size & DMA_IOVA_USE_SWIOTLB))
+		return false;
+
+	addr = iommu_dma_alloc_iova(domain,
+			iova_align(iovad, size + iova_off),
+			dma_get_mask(dev), dev);
+	if (!addr)
+		return false;
+
+	state->addr = addr + iova_off;
+	state->__size = size;
+	return true;
+}
+
+/**
+ * dma_iova_try_alloc - Try to allocate an IOVA space
+ * @dev: Device to allocate the IOVA space for
+ * @state: IOVA state
+ * @phys: physical address
+ * @size: IOVA size
+ *
+ * Check if @dev supports the IOVA-based DMA API, and if yes allocate IOVA space
+ * for the given base address and size.
+ *
+ * Note: @phys is only used to calculate the IOVA alignment. Callers that always
+ * do PAGE_SIZE aligned transfers can safely pass 0 here.
+ *
+ * Returns %true if the IOVA-based DMA API can be used and IOVA space has been
+ * allocated, or %false if the regular DMA API should be used.
+ */
+bool dma_iova_try_alloc(struct device *dev, struct dma_iova_state *state,
+		phys_addr_t phys, size_t size)
+{
+	memset(state, 0, sizeof(*state));
+	if (!use_dma_iommu(dev))
+		return false;
+	if (static_branch_unlikely(&iommu_deferred_attach_enabled) &&
+	    iommu_deferred_attach(dev, iommu_get_domain_for_dev(dev)))
+		return false;
+	return iommu_dma_iova_alloc(dev, state, phys, size);
+}
+EXPORT_SYMBOL_GPL(dma_iova_try_alloc);
+
+/**
+ * dma_iova_free - Free an IOVA space
+ * @dev: Device to free the IOVA space for
+ * @state: IOVA state
+ *
+ * Undoes a successful dma_iova_try_alloc().
+ *
+ * Note that all dma_iova_link() calls need to be undone first.  For callers
+ * that never call dma_iova_unlink(), dma_iova_destroy() can be used instead
+ * which unlinks all ranges and frees the IOVA space in a single efficient
+ * operation.
+ */
+void dma_iova_free(struct device *dev, struct dma_iova_state *state)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iommu_dma_cookie *cookie = domain->iova_cookie;
+	struct iova_domain *iovad = &cookie->iovad;
+	size_t iova_start_pad = iova_offset(iovad, state->addr);
+	size_t size = dma_iova_size(state);
+
+	iommu_dma_free_iova(cookie, state->addr - iova_start_pad,
+			iova_align(iovad, size + iova_start_pad), NULL);
+}
+EXPORT_SYMBOL_GPL(dma_iova_free);
+
 void iommu_setup_dma_ops(struct device *dev)
 {
 	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 6075e0708deb..817f11bce7bc 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -11,6 +11,7 @@ 
 #include <linux/scatterlist.h>
 #include <linux/bug.h>
 #include <linux/mem_encrypt.h>
+#include <linux/iommu.h>
 
 /**
  * List of possible attributes associated with a DMA mapping. The semantics
@@ -77,6 +78,7 @@ 
 #define DMA_BIT_MASK(n)	(((n) == 64) ? ~0ULL : ((1ULL<<(n))-1))
 
 struct dma_iova_state {
+	dma_addr_t addr;
 	size_t __size;
 };
 
@@ -307,11 +309,24 @@  static inline bool dma_use_iova(struct dma_iova_state *state)
 {
 	return state->__size != 0;
 }
+
+bool dma_iova_try_alloc(struct device *dev, struct dma_iova_state *state,
+		phys_addr_t phys, size_t size);
+void dma_iova_free(struct device *dev, struct dma_iova_state *state);
 #else /* CONFIG_IOMMU_DMA */
 static inline bool dma_use_iova(struct dma_iova_state *state)
 {
 	return false;
 }
+static inline bool dma_iova_try_alloc(struct device *dev,
+		struct dma_iova_state *state, phys_addr_t phys, size_t size)
+{
+	return false;
+}
+static inline void dma_iova_free(struct device *dev,
+		struct dma_iova_state *state)
+{
+}
 #endif /* CONFIG_IOMMU_DMA */
 
 #if defined(CONFIG_HAS_DMA) && defined(CONFIG_DMA_NEED_SYNC)
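
A note on the !CONFIG_IOMMU_DMA stubs above: since dma_iova_try_alloc()
then always returns false, callers can keep a single code path that falls
back to the regular DMA API. A minimal sketch (hypothetical caller; the
dma_map_page() parameters are only illustrative):

	struct dma_iova_state state;
	dma_addr_t dma;

	if (dma_iova_try_alloc(dev, &state, page_to_phys(page), len)) {
		/* IOVA path: ranges get attached later via dma_iova_link() (next patch) */
		dma = state.addr;
	} else {
		/* Regular DMA API path, also taken when CONFIG_IOMMU_DMA=n */
		dma = dma_map_page(dev, page, 0, len, DMA_TO_DEVICE);
		if (dma_mapping_error(dev, dma))
			return -ENOMEM;
	}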