
[RFC,01/12] dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI

Message ID 20250107142719.179636-2-yilun.xu@linux.intel.com (mailing list archive)
State New
Series Private MMIO support for private assigned dev

Commit Message

Xu Yilun Jan. 7, 2025, 2:27 p.m. UTC
Introduce a new API for dma-buf importers, and add a corresponding dma_buf_ops
callback for dma-buf exporters. This API is for subsystem importers that
map the dma-buf to some user-defined address space, e.g. for IOMMUFD to
map the dma-buf to userspace IOVA via the IOMMU page table, or for KVM to
map the dma-buf to GPA via the KVM MMU (e.g. EPT).

Currently dma-buf is only used to get DMA addresses for a device's default
domain by using the kernel DMA APIs. For these new use cases, importers
only need the PFN of the dma-buf resource to build their own mapping
tables, so the map_dma_buf() callback is no longer mandatory for
exporters. Importers may also choose not to provide a
struct device *dev on dma_buf_attach() if they don't call
dma_buf_map_attachment().

As with dma_buf_map_attachment(), the importer should first call
dma_buf_attach()/dma_buf_dynamic_attach() and then call
dma_buf_get_pfn_unlocked(). If the importer chooses dynamic attachment,
it must also handle dma-buf move notifications.

Only the unlocked version of dma_buf_get_pfn() is implemented for now,
simply because no user of a locked version exists yet.

Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>

---
IIUC, only get_pfn() is needed, with no put_pfn() counterpart. The whole
dma-buf is referenced/dereferenced at dma-buf attach/detach time.

Specifically, for static attachment the exporter should always make the
memory resource available/pinned on the first dma_buf_attach(), and
release/unpin it on the last dma_buf_detach(). For dynamic attachment
the exporter may populate and invalidate the memory resource at any
time, which is fine as long as importers follow dma-buf move
notifications. So no pinning is needed for get_pfn() and no put_pfn()
is needed.
---
 drivers/dma-buf/dma-buf.c | 90 +++++++++++++++++++++++++++++++--------
 include/linux/dma-buf.h   | 13 ++++++
 2 files changed, 86 insertions(+), 17 deletions(-)

Comments

Christian König Jan. 8, 2025, 8:01 a.m. UTC | #1
Am 07.01.25 um 15:27 schrieb Xu Yilun:
> Introduce a new API for dma-buf importer, also add a dma_buf_ops
> callback for dma-buf exporter. This API is for subsystem importers who
> map the dma-buf to some user defined address space, e.g. for IOMMUFD to
> map the dma-buf to userspace IOVA via IOMMU page table, or for KVM to
> map the dma-buf to GPA via KVM MMU (e.g. EPT).
>
> Currently dma-buf is only used to get DMA address for device's default
> domain by using kernel DMA APIs. But for these new use-cases, importers
> only need the pfn of the dma-buf resource to build their own mapping
> tables.

As far as I can see I have to fundamentally reject this whole approach.

It's intentional DMA-buf design that we don't expose struct pages nor 
PFNs to the importer. Essentially DMA-buf only transports DMA addresses.

In other words the mapping is done by the exporter and *not* the importer.

What we certainly can do is annotate those DMA addresses to better 
specify the domain in which they are applicable, e.g. whether they are 
PCIe bus addresses or some inter-device bus addresses, etc...

But moving the functionality to map the pages/PFNs to DMA addresses into 
the importer is an absolutely clear NO-GO.

Regards,
Christian.

>   So the map_dma_buf() callback is not mandatory for exporters
> anymore. Also the importers could choose not to provide
> struct device *dev on dma_buf_attach() if they don't call
> dma_buf_map_attachment().
>
> Like dma_buf_map_attachment(), the importer should firstly call
> dma_buf_attach/dynamic_attach() then call dma_buf_get_pfn_unlocked().
> If the importer choose to do dynamic attach, it also should handle the
> dma-buf move notification.
>
> Only the unlocked version of dma_buf_get_pfn is implemented for now,
> just because no locked version is used for now.
>
> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
>
> ---
> IIUC, Only get_pfn() is needed but no put_pfn(). The whole dma-buf is
> de/referenced at dma-buf attach/detach time.
>
> Specifically, for static attachment, the exporter should always make
> memory resource available/pinned on first dma_buf_attach(), and
> release/unpin memory resource on last dma_buf_detach(). For dynamic
> attachment, the exporter could populate & invalidate the memory
> resource at any time, it's OK as long as the importers follow dma-buf
> move notification. So no pinning is needed for get_pfn() and no
> put_pfn() is needed.
> ---
>   drivers/dma-buf/dma-buf.c | 90 +++++++++++++++++++++++++++++++--------
>   include/linux/dma-buf.h   | 13 ++++++
>   2 files changed, 86 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 7eeee3a38202..83d1448b6dcc 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -630,10 +630,10 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
>   	size_t alloc_size = sizeof(struct dma_buf);
>   	int ret;
>   
> -	if (WARN_ON(!exp_info->priv || !exp_info->ops
> -		    || !exp_info->ops->map_dma_buf
> -		    || !exp_info->ops->unmap_dma_buf
> -		    || !exp_info->ops->release))
> +	if (WARN_ON(!exp_info->priv || !exp_info->ops ||
> +		    (!!exp_info->ops->map_dma_buf != !!exp_info->ops->unmap_dma_buf) ||
> +		    (!exp_info->ops->map_dma_buf && !exp_info->ops->get_pfn) ||
> +		    !exp_info->ops->release))
>   		return ERR_PTR(-EINVAL);
>   
>   	if (WARN_ON(exp_info->ops->cache_sgt_mapping &&
> @@ -909,7 +909,10 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
>   	struct dma_buf_attachment *attach;
>   	int ret;
>   
> -	if (WARN_ON(!dmabuf || !dev))
> +	if (WARN_ON(!dmabuf))
> +		return ERR_PTR(-EINVAL);
> +
> +	if (WARN_ON(dmabuf->ops->map_dma_buf && !dev))
>   		return ERR_PTR(-EINVAL);
>   
>   	if (WARN_ON(importer_ops && !importer_ops->move_notify))
> @@ -941,7 +944,7 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
>   	 */
>   	if (dma_buf_attachment_is_dynamic(attach) !=
>   	    dma_buf_is_dynamic(dmabuf)) {
> -		struct sg_table *sgt;
> +		struct sg_table *sgt = NULL;
>   
>   		dma_resv_lock(attach->dmabuf->resv, NULL);
>   		if (dma_buf_is_dynamic(attach->dmabuf)) {
> @@ -950,13 +953,16 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
>   				goto err_unlock;
>   		}
>   
> -		sgt = __map_dma_buf(attach, DMA_BIDIRECTIONAL);
> -		if (!sgt)
> -			sgt = ERR_PTR(-ENOMEM);
> -		if (IS_ERR(sgt)) {
> -			ret = PTR_ERR(sgt);
> -			goto err_unpin;
> +		if (dmabuf->ops->map_dma_buf) {
> +			sgt = __map_dma_buf(attach, DMA_BIDIRECTIONAL);
> +			if (!sgt)
> +				sgt = ERR_PTR(-ENOMEM);
> +			if (IS_ERR(sgt)) {
> +				ret = PTR_ERR(sgt);
> +				goto err_unpin;
> +			}
>   		}
> +
>   		dma_resv_unlock(attach->dmabuf->resv);
>   		attach->sgt = sgt;
>   		attach->dir = DMA_BIDIRECTIONAL;
> @@ -1119,7 +1125,8 @@ struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *attach,
>   
>   	might_sleep();
>   
> -	if (WARN_ON(!attach || !attach->dmabuf))
> +	if (WARN_ON(!attach || !attach->dmabuf ||
> +		    !attach->dmabuf->ops->map_dma_buf))
>   		return ERR_PTR(-EINVAL);
>   
>   	dma_resv_assert_held(attach->dmabuf->resv);
> @@ -1195,7 +1202,8 @@ dma_buf_map_attachment_unlocked(struct dma_buf_attachment *attach,
>   
>   	might_sleep();
>   
> -	if (WARN_ON(!attach || !attach->dmabuf))
> +	if (WARN_ON(!attach || !attach->dmabuf ||
> +		    !attach->dmabuf->ops->map_dma_buf))
>   		return ERR_PTR(-EINVAL);
>   
>   	dma_resv_lock(attach->dmabuf->resv, NULL);
> @@ -1222,7 +1230,8 @@ void dma_buf_unmap_attachment(struct dma_buf_attachment *attach,
>   {
>   	might_sleep();
>   
> -	if (WARN_ON(!attach || !attach->dmabuf || !sg_table))
> +	if (WARN_ON(!attach || !attach->dmabuf ||
> +		    !attach->dmabuf->ops->unmap_dma_buf || !sg_table))
>   		return;
>   
>   	dma_resv_assert_held(attach->dmabuf->resv);
> @@ -1254,7 +1263,8 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
>   {
>   	might_sleep();
>   
> -	if (WARN_ON(!attach || !attach->dmabuf || !sg_table))
> +	if (WARN_ON(!attach || !attach->dmabuf ||
> +		    !attach->dmabuf->ops->unmap_dma_buf || !sg_table))
>   		return;
>   
>   	dma_resv_lock(attach->dmabuf->resv, NULL);
> @@ -1263,6 +1273,52 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
>   }
>   EXPORT_SYMBOL_NS_GPL(dma_buf_unmap_attachment_unlocked, "DMA_BUF");
>   
> +/**
> + * dma_buf_get_pfn_unlocked -
> + * @attach:	[in]	attachment to get pfn from
> + * @pgoff:	[in]	page offset of the buffer against the start of dma_buf
> + * @pfn:	[out]	returns the pfn of the buffer
> + * @max_order	[out]	returns the max mapping order of the buffer
> + */
> +int dma_buf_get_pfn_unlocked(struct dma_buf_attachment *attach,
> +			     pgoff_t pgoff, u64 *pfn, int *max_order)
> +{
> +	struct dma_buf *dmabuf = attach->dmabuf;
> +	int ret;
> +
> +	if (WARN_ON(!attach || !attach->dmabuf ||
> +		    !attach->dmabuf->ops->get_pfn))
> +		return -EINVAL;
> +
> +	/*
> +	 * Open:
> +	 *
> +	 * When dma_buf is dynamic but dma_buf move is disabled, the buffer
> +	 * should be pinned before use, See dma_buf_map_attachment() for
> +	 * reference.
> +	 *
> +	 * But for now no pin is intended inside dma_buf_get_pfn(), otherwise
> +	 * need another API to unpin the dma_buf. So just fail out this case.
> +	 */
> +	if (dma_buf_is_dynamic(attach->dmabuf) &&
> +	    !IS_ENABLED(CONFIG_DMABUF_MOVE_NOTIFY))
> +		return -ENOENT;
> +
> +	dma_resv_lock(attach->dmabuf->resv, NULL);
> +	ret = dmabuf->ops->get_pfn(attach, pgoff, pfn, max_order);
> +	/*
> +	 * Open:
> +	 *
> +	 * Is dma_resv_wait_timeout() needed? I assume no. The DMA buffer
> +	 * content synchronization could be done when the buffer is to be
> +	 * mapped by importer.
> +	 */
> +	dma_resv_unlock(attach->dmabuf->resv);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_NS_GPL(dma_buf_get_pfn_unlocked, "DMA_BUF");
> +
>   /**
>    * dma_buf_move_notify - notify attachments that DMA-buf is moving
>    *
> @@ -1662,7 +1718,7 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused)
>   		attach_count = 0;
>   
>   		list_for_each_entry(attach_obj, &buf_obj->attachments, node) {
> -			seq_printf(s, "\t%s\n", dev_name(attach_obj->dev));
> +			seq_printf(s, "\t%s\n", attach_obj->dev ? dev_name(attach_obj->dev) : NULL);
>   			attach_count++;
>   		}
>   		dma_resv_unlock(buf_obj->resv);
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index 36216d28d8bd..b16183edfb3a 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -194,6 +194,17 @@ struct dma_buf_ops {
>   	 * if the call would block.
>   	 */
>   
> +	/**
> +	 * @get_pfn:
> +	 *
> +	 * This is called by dma_buf_get_pfn(). It is used to get the pfn
> +	 * of the buffer positioned by the page offset against the start of
> +	 * the dma_buf. It can only be called if @attach has been called
> +	 * successfully.
> +	 */
> +	int (*get_pfn)(struct dma_buf_attachment *attach, pgoff_t pgoff,
> +		       u64 *pfn, int *max_order);
> +
>   	/**
>   	 * @release:
>   	 *
> @@ -629,6 +640,8 @@ dma_buf_map_attachment_unlocked(struct dma_buf_attachment *attach,
>   void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
>   				       struct sg_table *sg_table,
>   				       enum dma_data_direction direction);
> +int dma_buf_get_pfn_unlocked(struct dma_buf_attachment *attach,
> +			     pgoff_t pgoff, u64 *pfn, int *max_order);
>   
>   int dma_buf_mmap(struct dma_buf *, struct vm_area_struct *,
>   		 unsigned long);
Jason Gunthorpe Jan. 8, 2025, 1:23 p.m. UTC | #2
On Wed, Jan 08, 2025 at 09:01:46AM +0100, Christian König wrote:
> Am 07.01.25 um 15:27 schrieb Xu Yilun:
> > Introduce a new API for dma-buf importer, also add a dma_buf_ops
> > callback for dma-buf exporter. This API is for subsystem importers who
> > map the dma-buf to some user defined address space, e.g. for IOMMUFD to
> > map the dma-buf to userspace IOVA via IOMMU page table, or for KVM to
> > map the dma-buf to GPA via KVM MMU (e.g. EPT).
> > 
> > Currently dma-buf is only used to get DMA address for device's default
> > domain by using kernel DMA APIs. But for these new use-cases, importers
> > only need the pfn of the dma-buf resource to build their own mapping
> > tables.
> 
> As far as I can see I have to fundamentally reject this whole approach.
> 
> It's intentional DMA-buf design that we don't expose struct pages nor PFNs
> to the importer. Essentially DMA-buf only transports DMA addresses.
> 
> In other words the mapping is done by the exporter and *not* the importer.
> 
> What we certainly can do is to annotate those DMA addresses to a better
> specify in which domain they are applicable, e.g. if they are PCIe bus
> addresses or some inter device bus addresses etc...
> 
> But moving the functionality to map the pages/PFNs to DMA addresses into the
> importer is an absolutely clear NO-GO.

Oh?

Having the importer do the mapping is the correct way to operate the
DMA API, and the new API that Leon has built to fix the scatterlist
abuse in dmabuf relies on importer mapping as part of its
construction.

Why on earth do you want the exporter to map? That is completely
backwards and unworkable in many cases. The dysfunctional P2P support
in dmabuf is like that principally because of this.

That said, I don't think get_pfn() is an especially good interface,
but we will need to come up with something that passes the physical PFN
out.

Jason
Christian König Jan. 8, 2025, 1:44 p.m. UTC | #3
Am 08.01.25 um 14:23 schrieb Jason Gunthorpe:
> On Wed, Jan 08, 2025 at 09:01:46AM +0100, Christian König wrote:
>> Am 07.01.25 um 15:27 schrieb Xu Yilun:
>>> Introduce a new API for dma-buf importer, also add a dma_buf_ops
>>> callback for dma-buf exporter. This API is for subsystem importers who
>>> map the dma-buf to some user defined address space, e.g. for IOMMUFD to
>>> map the dma-buf to userspace IOVA via IOMMU page table, or for KVM to
>>> map the dma-buf to GPA via KVM MMU (e.g. EPT).
>>>
>>> Currently dma-buf is only used to get DMA address for device's default
>>> domain by using kernel DMA APIs. But for these new use-cases, importers
>>> only need the pfn of the dma-buf resource to build their own mapping
>>> tables.
>> As far as I can see I have to fundamentally reject this whole approach.
>>
>> It's intentional DMA-buf design that we don't expose struct pages nor PFNs
>> to the importer. Essentially DMA-buf only transports DMA addresses.
>>
>> In other words the mapping is done by the exporter and *not* the importer.
>>
>> What we certainly can do is to annotate those DMA addresses to a better
>> specify in which domain they are applicable, e.g. if they are PCIe bus
>> addresses or some inter device bus addresses etc...
>>
>> But moving the functionality to map the pages/PFNs to DMA addresses into the
>> importer is an absolutely clear NO-GO.
> Oh?
>
> Having the importer do the mapping is the correct way to operate the
> DMA API and the new API that Leon has built to fix the scatterlist
> abuse in dmabuf relies on importer mapping as part of it's
> construction.

Exactly on that I strongly disagree on.

DMA-buf works by providing DMA addresses the importer can work with and 
*NOT* the underlying location of the buffer.

> Why on earth do you want the exporter to map?

Because the exporter owns the exported buffer and only the exporter 
knows how to correctly access it.

> That is completely backwards and unworkable in many cases. The disfunctional P2P support
> in dmabuf is like that principally because of this.

No, that is exactly what we need.

Using the scatterlist to transport the DMA addresses was clearly a 
mistake, but providing the DMA addresses from the exporter has proved 
many times to be the right approach.

Keep in mind that the exported buffer is not necessarily memory; it can 
also be MMIO, or something which is only accessible through address-space 
windows for which you can't create a PFN or struct page.

> That said, I don't think get_pfn() is an especially good interface,
> but we will need to come with something that passes the physical pfn
> out.

No, a physical PFN is absolutely not a good way of passing the location of 
data around, because it is limited to what the CPU sees as address space.

We have use cases where DMA-buf transports the location of CPU invisible 
data which only the involved devices can see.

Regards,
Christian.

>
> Jason
Jason Gunthorpe Jan. 8, 2025, 2:58 p.m. UTC | #4
On Wed, Jan 08, 2025 at 02:44:26PM +0100, Christian König wrote:

> > Having the importer do the mapping is the correct way to operate the
> > DMA API and the new API that Leon has built to fix the scatterlist
> > abuse in dmabuf relies on importer mapping as part of it's
> > construction.
> 
> Exactly on that I strongly disagree on.
> 
> DMA-buf works by providing DMA addresses the importer can work with and
> *NOT* the underlying location of the buffer.

The expectation is that the DMA API will be used to DMA map (most)
things, and the DMA API always works with a phys_addr_t/PFN
argument. Basically, everything that is not a private address space
should be supported by improving the DMA API. We are on course for
finally getting all the common cases like P2P and MMIO solved
here. That alone will take care of a lot.

For P2P cases we are going toward (PFN + P2P source information) as
input to the DMA API. The additional "P2P source information" provides
a good way for co-operating drivers to represent private address
spaces as well. Both importer and exporter can have full understanding
what is being mapped and do the correct things, safely.

So, no, we don't lose private address space support when moving to
importer mapping; in fact it works better, because the importer gets
more information about what is going on.

I have imagined a staged approach where DMABUF gets a new API that
works with the new DMA API to do importer mapping with "P2P source
information", plus a gradual conversion.

Exporter mapping falls down in too many cases already:

1) Private address spaces don't work fully well, because many devices
need some indication of what address space is being used and a scatterlist
can't really properly convey that. If the DMABUF has a mixture of CPU
and private it becomes a PITA

2) Multi-path PCI can require the importer to make mapping decisions
unique to the device and program device specific information for the
multi-path. We are doing this in mlx5 today and have hacks because
DMABUF is destroying the information the importer needs to choose the
correct PCI path.

3) Importing devices need to know if they are working with PCI P2P
addresses during mapping because they need to do things like turn on
ATS on their DMA. As for multi-path we have the same hacks inside mlx5
today that assume DMABUFs are always P2P because we cannot determine
if things are P2P or not after being DMA mapped.

4) TPH bits needs to be programmed into the importer device but are
derived based on the NUMA topology of the DMA target. The importer has
no idea what the DMA target actually was because the exporter mapping
destroyed that information.

5) iommufd and kvm are both using CPU addresses without DMA. No
exporter mapping is possible

Jason
Christian König Jan. 8, 2025, 3:25 p.m. UTC | #5
Am 08.01.25 um 15:58 schrieb Jason Gunthorpe:
> On Wed, Jan 08, 2025 at 02:44:26PM +0100, Christian König wrote:
>
>>> Having the importer do the mapping is the correct way to operate the
>>> DMA API and the new API that Leon has built to fix the scatterlist
>>> abuse in dmabuf relies on importer mapping as part of it's
>>> construction.
>> Exactly on that I strongly disagree on.
>>
>> DMA-buf works by providing DMA addresses the importer can work with and
>> *NOT* the underlying location of the buffer.
> The expectation is that the DMA API will be used to DMA map (most)
> things, and the DMA API always works with a physaddr_t/pfn
> argument. Basically, everything that is not a private address space
> should be supported by improving the DMA API. We are on course for
> finally getting all the common cases like P2P and MMIO solved
> here. That alone will take care of alot.

Well, from experience the DMA API has failed more often than it actually 
worked in the way required by drivers.

In particular, the fact that we tried to hide architectural complexity in 
there instead of properly exposing limitations to drivers is not something 
I consider a good design approach.

So I am extremely critical of putting even more into it.

> For P2P cases we are going toward (PFN + P2P source information) as
> input to the DMA API. The additional "P2P source information" provides
> a good way for co-operating drivers to represent private address
> spaces as well. Both importer and exporter can have full understanding
> what is being mapped and do the correct things, safely.

I can say from experience that this is clearly not going to work for all 
use cases.

It would mean that we have to pull a massive amount of driver specific 
functionality into the DMA API.

Things like programming access windows for PCI BARs is completely driver 
specific and as far as I can see can't be part of the DMA API without 
things like callbacks.

With that in mind the DMA API would become a mid layer between different 
drivers, and that is really not something you are suggesting, is it?

> So, no, we don't loose private address space support when moving to
> importer mapping, in fact it works better because the importer gets
> more information about what is going on.

Well, sounds like I wasn't able to voice my concern. Let me try again:

We should not give importers information they don't need. Especially not 
information about the backing store of buffers.

So importers getting more information about what's going on is a bad thing.

> I have imagined a staged approach were DMABUF gets a new API that
> works with the new DMA API to do importer mapping with "P2P source
> information" and a gradual conversion.

To make it clear as maintainer of that subsystem I would reject such a 
step with all I have.

We have already gone down that road, and it didn't work at all and was 
a really big pain to pull people back from.

> Exporter mapping falls down in too many cases already:
>
> 1) Private addresses spaces don't work fully well because many devices
> need some indication what address space is being used and scatter list
> can't really properly convey that. If the DMABUF has a mixture of CPU
> and private it becomes a PITA

Correct, yes. That's why I said that scatterlist was a bad choice for 
the interface.

But exposing the backing store to importers and then let them do 
whatever they want with it sounds like an even worse idea.

> 2) Multi-path PCI can require the importer to make mapping decisions
> unique to the device and program device specific information for the
> multi-path. We are doing this in mlx5 today and have hacks because
> DMABUF is destroying the information the importer needs to choose the
> correct PCI path.

That's why the exporter gets the struct device of the importer so that 
it can plan how those accesses are made. Where exactly is the problem 
with that?

When you have a use case which is not covered by the existing DMA-buf 
interfaces, then please voice that to me and the other maintainers instead 
of implementing some hack.

> 3) Importing devices need to know if they are working with PCI P2P
> addresses during mapping because they need to do things like turn on
> ATS on their DMA. As for multi-path we have the same hacks inside mlx5
> today that assume DMABUFs are always P2P because we cannot determine
> if things are P2P or not after being DMA mapped.

Why would you need ATS on PCI P2P and not for system memory accesses?

> 4) TPH bits needs to be programmed into the importer device but are
> derived based on the NUMA topology of the DMA target. The importer has
> no idea what the DMA target actually was because the exporter mapping
> destroyed that information.

Yeah, but again that is completely intentional.

I assume you mean TLP processing hints when you say TPH and those should 
be part of the DMA addresses provided by the exporter.

An importer trying to look behind the curtain and determine the 
NUMA placement and topology itself is clearly a no-go from the 
design perspective.

> 5) iommufd and kvm are both using CPU addresses without DMA. No
> exporter mapping is possible

We have customers using both KVM and XEN with DMA-buf, so I can clearly 
confirm that this isn't true.

Regards,
Christian.

>
> Jason
Jason Gunthorpe Jan. 8, 2025, 4:22 p.m. UTC | #6
On Wed, Jan 08, 2025 at 04:25:54PM +0100, Christian König wrote:
> Am 08.01.25 um 15:58 schrieb Jason Gunthorpe:
> > On Wed, Jan 08, 2025 at 02:44:26PM +0100, Christian König wrote:
> > 
> > > > Having the importer do the mapping is the correct way to operate the
> > > > DMA API and the new API that Leon has built to fix the scatterlist
> > > > abuse in dmabuf relies on importer mapping as part of it's
> > > > construction.
> > > Exactly on that I strongly disagree on.
> > > 
> > > DMA-buf works by providing DMA addresses the importer can work with and
> > > *NOT* the underlying location of the buffer.
> > The expectation is that the DMA API will be used to DMA map (most)
> > things, and the DMA API always works with a physaddr_t/pfn
> > argument. Basically, everything that is not a private address space
> > should be supported by improving the DMA API. We are on course for
> > finally getting all the common cases like P2P and MMIO solved
> > here. That alone will take care of alot.
> 
> Well, from experience the DMA API has failed more often than it actually
> worked in the way required by drivers.

The DMA API has been static and very hard to change in these ways for
a long time. I think Leon's new API will break through this, and we
will finally be able to address these issues.

> > For P2P cases we are going toward (PFN + P2P source information) as
> > input to the DMA API. The additional "P2P source information" provides
> > a good way for co-operating drivers to represent private address
> > spaces as well. Both importer and exporter can have full understanding
> > what is being mapped and do the correct things, safely.
> 
> I can say from experience that this is clearly not going to work for all use
> cases.
> 
> It would mean that we have to pull a massive amount of driver specific
> functionality into the DMA API.

That isn't what I mean. There are two distinct parts: the means to
describe the source (PFN + P2P source information) that is compatible
with the DMA API, and the DMA API itself, which works with a few general
P2P source information types.

Private source information would be detected by co-operating drivers
and go down driver private paths. It would be rejected by other
drivers. This broadly follows how the new API is working.

So here I mean you can use the same PFN + source API between importer
and exporter, and the importer can simply detect the special source and
do the private stuff. It is not shifting things under the DMA API; it
is building alongside it using compatible design approaches. You
would match the source information, cast it to a driver structure, do
whatever driver math is needed to compute the local DMA address and
then write it to the device. Nothing is hard or "not going to work"
here.

> > So, no, we don't loose private address space support when moving to
> > importer mapping, in fact it works better because the importer gets
> > more information about what is going on.
> 
> Well, sounds like I wasn't able to voice my concern. Let me try again:
> 
> We should not give importers information they don't need. Especially not
> information about the backing store of buffers.
> 
> So that importers get more information about what's going on is a bad thing.

I strongly disagree because we are suffering today in mlx5 because of
this viewpoint. You cannot predict in advance what importers are going
to need. I already listed many examples where it does not work today
as is.

> > I have imagined a staged approach were DMABUF gets a new API that
> > works with the new DMA API to do importer mapping with "P2P source
> > information" and a gradual conversion.
> 
> To make it clear as maintainer of that subsystem I would reject such a step
> with all I have.

This is unexpected, so you want to just leave dmabuf broken? Do you
have any plan to fix it, to fix the misuse of the DMA API, and all
the problems I listed below? This is a big deal, it is causing real
problems today.

If it going to be like this I think we will stop trying to use dmabuf
and do something simpler for vfio/kvm/iommufd :(

> We have already gone down that road and it didn't worked at all and
> was a really big pain to pull people back from it.

Nobody has really seriously tried to improve the DMA API before, so I
don't think this is true at all.

> > Exporter mapping falls down in too many cases already:
> > 
> > 1) Private addresses spaces don't work fully well because many devices
> > need some indication what address space is being used and scatter list
> > can't really properly convey that. If the DMABUF has a mixture of CPU
> > and private it becomes a PITA
> 
> Correct, yes. That's why I said that scatterlist was a bad choice for the
> interface.
> 
> But exposing the backing store to importers and then let them do whatever
> they want with it sounds like an even worse idea.

You keep saying this without real justification. To me it is a nanny
style of API design. But I also don't see how you can possibly fix the
above without telling the importer a lot more information.

> > 2) Multi-path PCI can require the importer to make mapping decisions
> > unique to the device and program device specific information for the
> > multi-path. We are doing this in mlx5 today and have hacks because
> > DMABUF is destroying the information the importer needs to choose the
> > correct PCI path.
> 
> That's why the exporter gets the struct device of the importer so that it
> can plan how those accesses are made. Where exactly is the problem with
> that?

A single struct device does not convey the multipath options. We have
multiple struct devices (and multiple PCI endpoints) doing DMA
concurrently under one driver.

Multipath always needs additional meta information in the importer
side to tell the device which path to select. A naked dma address is
not sufficient.

Today we guess that DMABUF will be using P2P, and hack around it by
choosing a P2P struct device to pass to the exporter. We need to know
what is in the dmabuf before we can choose which of the multiple struct
devices the driver has to use for DMA mapping.

But even for simple CPU-centric cases we will eventually want to select
the proper NUMA-local PCI channel, i.e. the matching struct device, for
CPU-only buffers.

> When you have an use case which is not covered by the existing DMA-buf
> interfaces then please voice that to me and other maintainers instead of
> implementing some hack.

Do you have any suggestion for any of this then? We have a good plan
to fix this stuff and more. Many experts in their fields have agreed
on the different parts now. We haven't got to dmabuf because I had no
idea there would be an objection like this.

> > 3) Importing devices need to know if they are working with PCI P2P
> > addresses during mapping because they need to do things like turn on
> > ATS on their DMA. As for multi-path we have the same hacks inside mlx5
> > today that assume DMABUFs are always P2P because we cannot determine
> > if things are P2P or not after being DMA mapped.
> 
> Why would you need ATS on PCI P2P and not for system memory accesses?

ATS has a significant performance cost. It is mandatory for PCI P2P,
but ideally should be avoided for CPU memory.

> > 4) TPH bits needs to be programmed into the importer device but are
> > derived based on the NUMA topology of the DMA target. The importer has
> > no idea what the DMA target actually was because the exporter mapping
> > destroyed that information.
> 
> Yeah, but again that is completely intentional.
> 
> I assume you mean TLP processing hints when you say TPH and those should be
> part of the DMA addresses provided by the exporter.

Yes, but it is not part of the DMA addresses.

> That an importer tries to look behind the curtain and determines the NUMA
> placement and topology themselves is clearly a no-go from the design
> perspective.

I strongly disagree, this is important. Drivers need this information
in a future TPH/UIO/multipath PCI world.

> > 5) iommufd and kvm are both using CPU addresses without DMA. No
> > exporter mapping is possible
> 
> We have customers using both KVM and XEN with DMA-buf, so I can clearly
> confirm that this isn't true.

Today they are mmaping the dma-buf into a VMA and then using KVM's
follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable
dma-buf must have a CPU PFN.

Here Xu implements basically the same path, except without the VMA
indirection, and suddenly it is not OK? Illogical.
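[Editor's note: the two paths contrasted above can be sketched as importer-side code. This is an editorial sketch, not part of the posted patch; follow_pfn() is the historical helper the KVM flow was built on, and pfn_via_dmabuf() uses the API proposed in this series.]

```c
/* Editorial sketch, not part of the posted patch. */

/* Path 1 (today): mmap the dma-buf into a VMA, then walk the PTE.
 * This is roughly what KVM's follow_pfn() flow does, and it needs
 * mmu_notifier protection because the mapping can move. */
static int pfn_via_vma(struct vm_area_struct *vma, unsigned long addr,
		       unsigned long *pfn)
{
	return follow_pfn(vma, addr, pfn);	/* historical helper */
}

/* Path 2 (this series): ask the exporter directly, no VMA involved.
 * Same pfn, minus the VMA indirection. */
static int pfn_via_dmabuf(struct dma_buf_attachment *attach, pgoff_t pgoff,
			  u64 *pfn)
{
	int max_order;

	return dma_buf_get_pfn_unlocked(attach, pgoff, pfn, &max_order);
}
```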

Jason
Simona Vetter Jan. 8, 2025, 6:44 p.m. UTC | #7
On Wed, Jan 08, 2025 at 12:22:27PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 08, 2025 at 04:25:54PM +0100, Christian König wrote:
> > Am 08.01.25 um 15:58 schrieb Jason Gunthorpe:
> > > I have imagined a staged approach were DMABUF gets a new API that
> > > works with the new DMA API to do importer mapping with "P2P source
> > > information" and a gradual conversion.
> > 
> > To make it clear as maintainer of that subsystem I would reject such a step
> > with all I have.
> 
> This is unexpected, so you want to just leave dmabuf broken? Do you
> have any plan to fix it, to fix the misuse of the DMA API, and all
> the problems I listed below? This is a big deal, it is causing real
> problems today.
> 
> If it is going to be like this I think we will stop trying to use dmabuf
> and do something simpler for vfio/kvm/iommufd :(

As the gal who helped edit the original dma-buf spec 13 years ago, I think
adding pfn isn't a terrible idea. By design, dma-buf is the "everything is
optional" interface. And in the beginning, even consistent locking was
optional, but we've managed to fix that by now :-/

Where I do agree with Christian is that stuffing pfn support into the
dma_buf_attachment interfaces feels rather wrong.

> > We have already gone down that road and it didn't work at all and
> > was a really big pain to pull people back from it.
> 
> Nobody has really seriously tried to improve the DMA API before, so I
> don't think this is true at all.

Aside, I really hope this finally happens!

> > > 3) Importing devices need to know if they are working with PCI P2P
> > > addresses during mapping because they need to do things like turn on
> > > ATS on their DMA. As for multi-path we have the same hacks inside mlx5
> > > today that assume DMABUFs are always P2P because we cannot determine
> > > if things are P2P or not after being DMA mapped.
> > 
> > Why would you need ATS on PCI P2P and not for system memory accesses?
> 
> ATS has a significant performance cost. It is mandatory for PCI P2P,
> but ideally should be avoided for CPU memory.

Huh, I didn't know that. And yeah kinda means we've butchered the pci p2p
stuff a bit I guess ...

> > > 5) iommufd and kvm are both using CPU addresses without DMA. No
> > > exporter mapping is possible
> > 
> > We have customers using both KVM and XEN with DMA-buf, so I can clearly
> > confirm that this isn't true.
> 
> Today they are mmaping the dma-buf into a VMA and then using KVM's
> follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable
> dma-buf must have a CPU PFN.
> 
> Here Xu implements basically the same path, except without the VMA
> indirection, and suddenly it is not OK? Illogical.

So the big difference is that for follow_pfn() you need mmu_notifier since
the mmap might move around, whereas with pfn smashed into
dma_buf_attachment you need dma_resv_lock rules, and the move_notify
callback if you go dynamic.

So I guess my first question is, which locking rules do you want here for
pfn importers?

If mmu notifiers is fine, then I think the current approach of follow_pfn
should be ok. But if you instead want dma_resv_lock rules (or the cpu mmap
is somehow an issue itself), then I think the clean design is to create a
new separate access mechanism just for that. It would be the 5th or so (kernel
vmap, userspace mmap, dma_buf_attach and driver private stuff like
virtio_dma_buf.c where you access your buffer with a uuid), so really not
a big deal.

And for non-contrived exporters we might be able to implement the other
access methods in terms of the pfn method generically, so this wouldn't
even be a terrible maintenance burden going forward. And meanwhile all the
contrived exporters just keep working as-is.

The other part is that cpu mmap is optional, and there are plenty of
strange exporters who don't implement it. But you can dma map the
attachment into plenty of devices. This tends mostly to be a thing on SoC
devices with some
very funky memory. But I guess you don't care about these use-case, so
should be ok.

I couldn't come up with a good name for these pfn users, maybe
dma_buf_pfn_attachment? This does _not_ have a struct device, but maybe
carries some of these new p2p source specifiers (or a list of those which
are allowed; no idea how this would need to fit into the new dma api).
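[Editor's note: under the dma_resv_lock option described above, a dynamic pfn importer would look something like the following editorial sketch. my_move_notify and my_importer_zap_mappings are hypothetical names; dma_buf_attach_ops and move_notify are the existing dynamic-importer interface.]

```c
/* Editorial sketch of a dynamic pfn importer following move_notify. */
static void my_move_notify(struct dma_buf_attachment *attach)
{
	struct my_importer *imp = attach->importer_priv;

	dma_resv_assert_held(attach->dmabuf->resv);
	/* The exporter may move the buffer: drop our cached pfns and
	 * re-fetch them via get_pfn() on the next access. */
	my_importer_zap_mappings(imp);
}

static const struct dma_buf_attach_ops my_attach_ops = {
	.allow_peer2peer = true,
	.move_notify = my_move_notify,	/* mandatory for dynamic attach */
};
```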

Cheers, Sima
Patch

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 7eeee3a38202..83d1448b6dcc 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -630,10 +630,10 @@  struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
 	size_t alloc_size = sizeof(struct dma_buf);
 	int ret;
 
-	if (WARN_ON(!exp_info->priv || !exp_info->ops
-		    || !exp_info->ops->map_dma_buf
-		    || !exp_info->ops->unmap_dma_buf
-		    || !exp_info->ops->release))
+	if (WARN_ON(!exp_info->priv || !exp_info->ops ||
+		    (!!exp_info->ops->map_dma_buf != !!exp_info->ops->unmap_dma_buf) ||
+		    (!exp_info->ops->map_dma_buf && !exp_info->ops->get_pfn) ||
+		    !exp_info->ops->release))
 		return ERR_PTR(-EINVAL);
 
 	if (WARN_ON(exp_info->ops->cache_sgt_mapping &&
@@ -909,7 +909,10 @@  dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
 	struct dma_buf_attachment *attach;
 	int ret;
 
-	if (WARN_ON(!dmabuf || !dev))
+	if (WARN_ON(!dmabuf))
+		return ERR_PTR(-EINVAL);
+
+	if (WARN_ON(dmabuf->ops->map_dma_buf && !dev))
 		return ERR_PTR(-EINVAL);
 
 	if (WARN_ON(importer_ops && !importer_ops->move_notify))
@@ -941,7 +944,7 @@  dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
 	 */
 	if (dma_buf_attachment_is_dynamic(attach) !=
 	    dma_buf_is_dynamic(dmabuf)) {
-		struct sg_table *sgt;
+		struct sg_table *sgt = NULL;
 
 		dma_resv_lock(attach->dmabuf->resv, NULL);
 		if (dma_buf_is_dynamic(attach->dmabuf)) {
@@ -950,13 +953,16 @@  dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
 				goto err_unlock;
 		}
 
-		sgt = __map_dma_buf(attach, DMA_BIDIRECTIONAL);
-		if (!sgt)
-			sgt = ERR_PTR(-ENOMEM);
-		if (IS_ERR(sgt)) {
-			ret = PTR_ERR(sgt);
-			goto err_unpin;
+		if (dmabuf->ops->map_dma_buf) {
+			sgt = __map_dma_buf(attach, DMA_BIDIRECTIONAL);
+			if (!sgt)
+				sgt = ERR_PTR(-ENOMEM);
+			if (IS_ERR(sgt)) {
+				ret = PTR_ERR(sgt);
+				goto err_unpin;
+			}
 		}
+
 		dma_resv_unlock(attach->dmabuf->resv);
 		attach->sgt = sgt;
 		attach->dir = DMA_BIDIRECTIONAL;
@@ -1119,7 +1125,8 @@  struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *attach,
 
 	might_sleep();
 
-	if (WARN_ON(!attach || !attach->dmabuf))
+	if (WARN_ON(!attach || !attach->dmabuf ||
+		    !attach->dmabuf->ops->map_dma_buf))
 		return ERR_PTR(-EINVAL);
 
 	dma_resv_assert_held(attach->dmabuf->resv);
@@ -1195,7 +1202,8 @@  dma_buf_map_attachment_unlocked(struct dma_buf_attachment *attach,
 
 	might_sleep();
 
-	if (WARN_ON(!attach || !attach->dmabuf))
+	if (WARN_ON(!attach || !attach->dmabuf ||
+		    !attach->dmabuf->ops->map_dma_buf))
 		return ERR_PTR(-EINVAL);
 
 	dma_resv_lock(attach->dmabuf->resv, NULL);
@@ -1222,7 +1230,8 @@  void dma_buf_unmap_attachment(struct dma_buf_attachment *attach,
 {
 	might_sleep();
 
-	if (WARN_ON(!attach || !attach->dmabuf || !sg_table))
+	if (WARN_ON(!attach || !attach->dmabuf ||
+		    !attach->dmabuf->ops->unmap_dma_buf || !sg_table))
 		return;
 
 	dma_resv_assert_held(attach->dmabuf->resv);
@@ -1254,7 +1263,8 @@  void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
 {
 	might_sleep();
 
-	if (WARN_ON(!attach || !attach->dmabuf || !sg_table))
+	if (WARN_ON(!attach || !attach->dmabuf ||
+		    !attach->dmabuf->ops->unmap_dma_buf || !sg_table))
 		return;
 
 	dma_resv_lock(attach->dmabuf->resv, NULL);
@@ -1263,6 +1273,52 @@  void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
 }
 EXPORT_SYMBOL_NS_GPL(dma_buf_unmap_attachment_unlocked, "DMA_BUF");
 
+/**
+ * dma_buf_get_pfn_unlocked - get the pfn of a page in the buffer
+ * @attach:	[in]	attachment to get pfn from
+ * @pgoff:	[in]	page offset of the buffer against the start of dma_buf
+ * @pfn:	[out]	returns the pfn of the buffer
+ * @max_order:	[out]	returns the max mapping order of the buffer
+ */
+int dma_buf_get_pfn_unlocked(struct dma_buf_attachment *attach,
+			     pgoff_t pgoff, u64 *pfn, int *max_order)
+{
+	struct dma_buf *dmabuf;
+	int ret;
+
+	if (WARN_ON(!attach || !attach->dmabuf || !attach->dmabuf->ops->get_pfn))
+		return -EINVAL;
+	dmabuf = attach->dmabuf;
+
+	/*
+	 * Open:
+	 *
+	 * When dma_buf is dynamic but dma_buf move is disabled, the buffer
+	 * should be pinned before use, See dma_buf_map_attachment() for
+	 * reference.
+	 *
+	 * But for now no pin is intended inside dma_buf_get_pfn(), otherwise
+	 * need another API to unpin the dma_buf. So just fail out this case.
+	 */
+	if (dma_buf_is_dynamic(attach->dmabuf) &&
+	    !IS_ENABLED(CONFIG_DMABUF_MOVE_NOTIFY))
+		return -ENOENT;
+
+	dma_resv_lock(attach->dmabuf->resv, NULL);
+	ret = dmabuf->ops->get_pfn(attach, pgoff, pfn, max_order);
+	/*
+	 * Open:
+	 *
+	 * Is dma_resv_wait_timeout() needed? I assume no. The DMA buffer
+	 * content synchronization could be done when the buffer is to be
+	 * mapped by importer.
+	 */
+	dma_resv_unlock(attach->dmabuf->resv);
+
+	return ret;
+}
+EXPORT_SYMBOL_NS_GPL(dma_buf_get_pfn_unlocked, "DMA_BUF");
+
 /**
  * dma_buf_move_notify - notify attachments that DMA-buf is moving
  *
@@ -1662,7 +1718,7 @@  static int dma_buf_debug_show(struct seq_file *s, void *unused)
 		attach_count = 0;
 
 		list_for_each_entry(attach_obj, &buf_obj->attachments, node) {
-			seq_printf(s, "\t%s\n", dev_name(attach_obj->dev));
+			seq_printf(s, "\t%s\n", attach_obj->dev ? dev_name(attach_obj->dev) : "(null)");
 			attach_count++;
 		}
 		dma_resv_unlock(buf_obj->resv);
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 36216d28d8bd..b16183edfb3a 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -194,6 +194,17 @@  struct dma_buf_ops {
 	 * if the call would block.
 	 */
 
+	/**
+	 * @get_pfn:
+	 *
+	 * This is called by dma_buf_get_pfn_unlocked(). It is used to get
+	 * the pfn of the page at the given page offset from the start of
+	 * the dma_buf. It can only be called after dma_buf_attach() has
+	 * completed successfully.
+	 */
+	int (*get_pfn)(struct dma_buf_attachment *attach, pgoff_t pgoff,
+		       u64 *pfn, int *max_order);
+
 	/**
 	 * @release:
 	 *
@@ -629,6 +640,8 @@  dma_buf_map_attachment_unlocked(struct dma_buf_attachment *attach,
 void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
 				       struct sg_table *sg_table,
 				       enum dma_data_direction direction);
+int dma_buf_get_pfn_unlocked(struct dma_buf_attachment *attach,
+			     pgoff_t pgoff, u64 *pfn, int *max_order);
 
 int dma_buf_mmap(struct dma_buf *, struct vm_area_struct *,
 		 unsigned long);
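
[Editor's note: for reference, a minimal importer flow against this RFC might look like the sketch below. This is editorial and hypothetical; note that no struct device is needed when the importer never calls dma_buf_map_attachment(), per the cover letter.]

```c
/* Editorial sketch: minimal static-attach importer of the RFC API. */
static int import_pfn_example(struct dma_buf *dmabuf)
{
	struct dma_buf_attachment *attach;
	u64 pfn;
	int max_order, ret;

	/* dev may be NULL: this importer never DMA-maps the buffer. */
	attach = dma_buf_attach(dmabuf, NULL);
	if (IS_ERR(attach))
		return PTR_ERR(attach);

	ret = dma_buf_get_pfn_unlocked(attach, 0, &pfn, &max_order);
	if (!ret)
		pr_info("pfn 0x%llx, max mapping order %d\n", pfn, max_order);

	dma_buf_detach(dmabuf, attach);
	return ret;
}
```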