diff mbox series

[RFC,01/12] dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI

Message ID 20250107142719.179636-2-yilun.xu@linux.intel.com (mailing list archive)
State New
Headers show
Series Private MMIO support for private assigned dev | expand

Commit Message

Xu Yilun Jan. 7, 2025, 2:27 p.m. UTC
Introduce a new API for dma-buf importer, also add a dma_buf_ops
callback for dma-buf exporter. This API is for subsystem importers who
map the dma-buf to some user defined address space, e.g. for IOMMUFD to
map the dma-buf to userspace IOVA via IOMMU page table, or for KVM to
map the dma-buf to GPA via KVM MMU (e.g. EPT).

Currently dma-buf is only used to get DMA address for device's default
domain by using kernel DMA APIs. But for these new use-cases, importers
only need the pfn of the dma-buf resource to build their own mapping
tables. So the map_dma_buf() callback is not mandatory for exporters
anymore. Also the importers could choose not to provide
struct device *dev on dma_buf_attach() if they don't call
dma_buf_map_attachment().

Like dma_buf_map_attachment(), the importer should firstly call
dma_buf_attach/dynamic_attach() then call dma_buf_get_pfn_unlocked().
If the importer choose to do dynamic attach, it also should handle the
dma-buf move notification.

Only the unlocked version of dma_buf_get_pfn is implemented for now,
just because no locked version is used for now.

Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>

---
IIUC, Only get_pfn() is needed but no put_pfn(). The whole dma-buf is
de/referenced at dma-buf attach/detach time.

Specifically, for static attachment, the exporter should always make
memory resource available/pinned on first dma_buf_attach(), and
release/unpin memory resource on last dma_buf_detach(). For dynamic
attachment, the exporter could populate & invalidate the memory
resource at any time, it's OK as long as the importers follow dma-buf
move notification. So no pinning is needed for get_pfn() and no
put_pfn() is needed.
---
 drivers/dma-buf/dma-buf.c | 90 +++++++++++++++++++++++++++++++--------
 include/linux/dma-buf.h   | 13 ++++++
 2 files changed, 86 insertions(+), 17 deletions(-)

Comments

Christian König Jan. 8, 2025, 8:01 a.m. UTC | #1
Am 07.01.25 um 15:27 schrieb Xu Yilun:
> Introduce a new API for dma-buf importer, also add a dma_buf_ops
> callback for dma-buf exporter. This API is for subsystem importers who
> map the dma-buf to some user defined address space, e.g. for IOMMUFD to
> map the dma-buf to userspace IOVA via IOMMU page table, or for KVM to
> map the dma-buf to GPA via KVM MMU (e.g. EPT).
>
> Currently dma-buf is only used to get DMA address for device's default
> domain by using kernel DMA APIs. But for these new use-cases, importers
> only need the pfn of the dma-buf resource to build their own mapping
> tables.

As far as I can see I have to fundamentally reject this whole approach.

It's intentional DMA-buf design that we don't expose struct pages nor 
PFNs to the importer. Essentially DMA-buf only transports DMA addresses.

In other words the mapping is done by the exporter and *not* the importer.

What we certainly can do is to annotate those DMA addresses to a better 
specify in which domain they are applicable, e.g. if they are PCIe bus 
addresses or some inter device bus addresses etc...

But moving the functionality to map the pages/PFNs to DMA addresses into 
the importer is an absolutely clear NO-GO.

Regards,
Christian.

>   So the map_dma_buf() callback is not mandatory for exporters
> anymore. Also the importers could choose not to provide
> struct device *dev on dma_buf_attach() if they don't call
> dma_buf_map_attachment().
>
> Like dma_buf_map_attachment(), the importer should firstly call
> dma_buf_attach/dynamic_attach() then call dma_buf_get_pfn_unlocked().
> If the importer choose to do dynamic attach, it also should handle the
> dma-buf move notification.
>
> Only the unlocked version of dma_buf_get_pfn is implemented for now,
> just because no locked version is used for now.
>
> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
>
> ---
> IIUC, Only get_pfn() is needed but no put_pfn(). The whole dma-buf is
> de/referenced at dma-buf attach/detach time.
>
> Specifically, for static attachment, the exporter should always make
> memory resource available/pinned on first dma_buf_attach(), and
> release/unpin memory resource on last dma_buf_detach(). For dynamic
> attachment, the exporter could populate & invalidate the memory
> resource at any time, it's OK as long as the importers follow dma-buf
> move notification. So no pinning is needed for get_pfn() and no
> put_pfn() is needed.
> ---
>   drivers/dma-buf/dma-buf.c | 90 +++++++++++++++++++++++++++++++--------
>   include/linux/dma-buf.h   | 13 ++++++
>   2 files changed, 86 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 7eeee3a38202..83d1448b6dcc 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -630,10 +630,10 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
>   	size_t alloc_size = sizeof(struct dma_buf);
>   	int ret;
>   
> -	if (WARN_ON(!exp_info->priv || !exp_info->ops
> -		    || !exp_info->ops->map_dma_buf
> -		    || !exp_info->ops->unmap_dma_buf
> -		    || !exp_info->ops->release))
> +	if (WARN_ON(!exp_info->priv || !exp_info->ops ||
> +		    (!!exp_info->ops->map_dma_buf != !!exp_info->ops->unmap_dma_buf) ||
> +		    (!exp_info->ops->map_dma_buf && !exp_info->ops->get_pfn) ||
> +		    !exp_info->ops->release))
>   		return ERR_PTR(-EINVAL);
>   
>   	if (WARN_ON(exp_info->ops->cache_sgt_mapping &&
> @@ -909,7 +909,10 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
>   	struct dma_buf_attachment *attach;
>   	int ret;
>   
> -	if (WARN_ON(!dmabuf || !dev))
> +	if (WARN_ON(!dmabuf))
> +		return ERR_PTR(-EINVAL);
> +
> +	if (WARN_ON(dmabuf->ops->map_dma_buf && !dev))
>   		return ERR_PTR(-EINVAL);
>   
>   	if (WARN_ON(importer_ops && !importer_ops->move_notify))
> @@ -941,7 +944,7 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
>   	 */
>   	if (dma_buf_attachment_is_dynamic(attach) !=
>   	    dma_buf_is_dynamic(dmabuf)) {
> -		struct sg_table *sgt;
> +		struct sg_table *sgt = NULL;
>   
>   		dma_resv_lock(attach->dmabuf->resv, NULL);
>   		if (dma_buf_is_dynamic(attach->dmabuf)) {
> @@ -950,13 +953,16 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
>   				goto err_unlock;
>   		}
>   
> -		sgt = __map_dma_buf(attach, DMA_BIDIRECTIONAL);
> -		if (!sgt)
> -			sgt = ERR_PTR(-ENOMEM);
> -		if (IS_ERR(sgt)) {
> -			ret = PTR_ERR(sgt);
> -			goto err_unpin;
> +		if (dmabuf->ops->map_dma_buf) {
> +			sgt = __map_dma_buf(attach, DMA_BIDIRECTIONAL);
> +			if (!sgt)
> +				sgt = ERR_PTR(-ENOMEM);
> +			if (IS_ERR(sgt)) {
> +				ret = PTR_ERR(sgt);
> +				goto err_unpin;
> +			}
>   		}
> +
>   		dma_resv_unlock(attach->dmabuf->resv);
>   		attach->sgt = sgt;
>   		attach->dir = DMA_BIDIRECTIONAL;
> @@ -1119,7 +1125,8 @@ struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *attach,
>   
>   	might_sleep();
>   
> -	if (WARN_ON(!attach || !attach->dmabuf))
> +	if (WARN_ON(!attach || !attach->dmabuf ||
> +		    !attach->dmabuf->ops->map_dma_buf))
>   		return ERR_PTR(-EINVAL);
>   
>   	dma_resv_assert_held(attach->dmabuf->resv);
> @@ -1195,7 +1202,8 @@ dma_buf_map_attachment_unlocked(struct dma_buf_attachment *attach,
>   
>   	might_sleep();
>   
> -	if (WARN_ON(!attach || !attach->dmabuf))
> +	if (WARN_ON(!attach || !attach->dmabuf ||
> +		    !attach->dmabuf->ops->map_dma_buf))
>   		return ERR_PTR(-EINVAL);
>   
>   	dma_resv_lock(attach->dmabuf->resv, NULL);
> @@ -1222,7 +1230,8 @@ void dma_buf_unmap_attachment(struct dma_buf_attachment *attach,
>   {
>   	might_sleep();
>   
> -	if (WARN_ON(!attach || !attach->dmabuf || !sg_table))
> +	if (WARN_ON(!attach || !attach->dmabuf ||
> +		    !attach->dmabuf->ops->unmap_dma_buf || !sg_table))
>   		return;
>   
>   	dma_resv_assert_held(attach->dmabuf->resv);
> @@ -1254,7 +1263,8 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
>   {
>   	might_sleep();
>   
> -	if (WARN_ON(!attach || !attach->dmabuf || !sg_table))
> +	if (WARN_ON(!attach || !attach->dmabuf ||
> +		    !attach->dmabuf->ops->unmap_dma_buf || !sg_table))
>   		return;
>   
>   	dma_resv_lock(attach->dmabuf->resv, NULL);
> @@ -1263,6 +1273,52 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
>   }
>   EXPORT_SYMBOL_NS_GPL(dma_buf_unmap_attachment_unlocked, "DMA_BUF");
>   
> +/**
> + * dma_buf_get_pfn_unlocked -
> + * @attach:	[in]	attachment to get pfn from
> + * @pgoff:	[in]	page offset of the buffer against the start of dma_buf
> + * @pfn:	[out]	returns the pfn of the buffer
> + * @max_order	[out]	returns the max mapping order of the buffer
> + */
> +int dma_buf_get_pfn_unlocked(struct dma_buf_attachment *attach,
> +			     pgoff_t pgoff, u64 *pfn, int *max_order)
> +{
> +	struct dma_buf *dmabuf = attach->dmabuf;
> +	int ret;
> +
> +	if (WARN_ON(!attach || !attach->dmabuf ||
> +		    !attach->dmabuf->ops->get_pfn))
> +		return -EINVAL;
> +
> +	/*
> +	 * Open:
> +	 *
> +	 * When dma_buf is dynamic but dma_buf move is disabled, the buffer
> +	 * should be pinned before use, See dma_buf_map_attachment() for
> +	 * reference.
> +	 *
> +	 * But for now no pin is intended inside dma_buf_get_pfn(), otherwise
> +	 * need another API to unpin the dma_buf. So just fail out this case.
> +	 */
> +	if (dma_buf_is_dynamic(attach->dmabuf) &&
> +	    !IS_ENABLED(CONFIG_DMABUF_MOVE_NOTIFY))
> +		return -ENOENT;
> +
> +	dma_resv_lock(attach->dmabuf->resv, NULL);
> +	ret = dmabuf->ops->get_pfn(attach, pgoff, pfn, max_order);
> +	/*
> +	 * Open:
> +	 *
> +	 * Is dma_resv_wait_timeout() needed? I assume no. The DMA buffer
> +	 * content synchronization could be done when the buffer is to be
> +	 * mapped by importer.
> +	 */
> +	dma_resv_unlock(attach->dmabuf->resv);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_NS_GPL(dma_buf_get_pfn_unlocked, "DMA_BUF");
> +
>   /**
>    * dma_buf_move_notify - notify attachments that DMA-buf is moving
>    *
> @@ -1662,7 +1718,7 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused)
>   		attach_count = 0;
>   
>   		list_for_each_entry(attach_obj, &buf_obj->attachments, node) {
> -			seq_printf(s, "\t%s\n", dev_name(attach_obj->dev));
> +			seq_printf(s, "\t%s\n", attach_obj->dev ? dev_name(attach_obj->dev) : NULL);
>   			attach_count++;
>   		}
>   		dma_resv_unlock(buf_obj->resv);
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index 36216d28d8bd..b16183edfb3a 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -194,6 +194,17 @@ struct dma_buf_ops {
>   	 * if the call would block.
>   	 */
>   
> +	/**
> +	 * @get_pfn:
> +	 *
> +	 * This is called by dma_buf_get_pfn(). It is used to get the pfn
> +	 * of the buffer positioned by the page offset against the start of
> +	 * the dma_buf. It can only be called if @attach has been called
> +	 * successfully.
> +	 */
> +	int (*get_pfn)(struct dma_buf_attachment *attach, pgoff_t pgoff,
> +		       u64 *pfn, int *max_order);
> +
>   	/**
>   	 * @release:
>   	 *
> @@ -629,6 +640,8 @@ dma_buf_map_attachment_unlocked(struct dma_buf_attachment *attach,
>   void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
>   				       struct sg_table *sg_table,
>   				       enum dma_data_direction direction);
> +int dma_buf_get_pfn_unlocked(struct dma_buf_attachment *attach,
> +			     pgoff_t pgoff, u64 *pfn, int *max_order);
>   
>   int dma_buf_mmap(struct dma_buf *, struct vm_area_struct *,
>   		 unsigned long);
Jason Gunthorpe Jan. 8, 2025, 1:23 p.m. UTC | #2
On Wed, Jan 08, 2025 at 09:01:46AM +0100, Christian König wrote:
> Am 07.01.25 um 15:27 schrieb Xu Yilun:
> > Introduce a new API for dma-buf importer, also add a dma_buf_ops
> > callback for dma-buf exporter. This API is for subsystem importers who
> > map the dma-buf to some user defined address space, e.g. for IOMMUFD to
> > map the dma-buf to userspace IOVA via IOMMU page table, or for KVM to
> > map the dma-buf to GPA via KVM MMU (e.g. EPT).
> > 
> > Currently dma-buf is only used to get DMA address for device's default
> > domain by using kernel DMA APIs. But for these new use-cases, importers
> > only need the pfn of the dma-buf resource to build their own mapping
> > tables.
> 
> As far as I can see I have to fundamentally reject this whole approach.
> 
> It's intentional DMA-buf design that we don't expose struct pages nor PFNs
> to the importer. Essentially DMA-buf only transports DMA addresses.
> 
> In other words the mapping is done by the exporter and *not* the importer.
> 
> What we certainly can do is to annotate those DMA addresses to a better
> specify in which domain they are applicable, e.g. if they are PCIe bus
> addresses or some inter device bus addresses etc...
> 
> But moving the functionality to map the pages/PFNs to DMA addresses into the
> importer is an absolutely clear NO-GO.

Oh?

Having the importer do the mapping is the correct way to operate the
DMA API and the new API that Leon has built to fix the scatterlist
abuse in dmabuf relies on importer mapping as part of it's
construction.

Why on earth do you want the exporter to map? That is completely
backwards and unworkable in many cases. The disfunctional P2P support
in dmabuf is like that principally because of this.

That said, I don't think get_pfn() is an especially good interface,
but we will need to come with something that passes the physical pfn
out.

Jason
Christian König Jan. 8, 2025, 1:44 p.m. UTC | #3
Am 08.01.25 um 14:23 schrieb Jason Gunthorpe:
> On Wed, Jan 08, 2025 at 09:01:46AM +0100, Christian König wrote:
>> Am 07.01.25 um 15:27 schrieb Xu Yilun:
>>> Introduce a new API for dma-buf importer, also add a dma_buf_ops
>>> callback for dma-buf exporter. This API is for subsystem importers who
>>> map the dma-buf to some user defined address space, e.g. for IOMMUFD to
>>> map the dma-buf to userspace IOVA via IOMMU page table, or for KVM to
>>> map the dma-buf to GPA via KVM MMU (e.g. EPT).
>>>
>>> Currently dma-buf is only used to get DMA address for device's default
>>> domain by using kernel DMA APIs. But for these new use-cases, importers
>>> only need the pfn of the dma-buf resource to build their own mapping
>>> tables.
>> As far as I can see I have to fundamentally reject this whole approach.
>>
>> It's intentional DMA-buf design that we don't expose struct pages nor PFNs
>> to the importer. Essentially DMA-buf only transports DMA addresses.
>>
>> In other words the mapping is done by the exporter and *not* the importer.
>>
>> What we certainly can do is to annotate those DMA addresses to a better
>> specify in which domain they are applicable, e.g. if they are PCIe bus
>> addresses or some inter device bus addresses etc...
>>
>> But moving the functionality to map the pages/PFNs to DMA addresses into the
>> importer is an absolutely clear NO-GO.
> Oh?
>
> Having the importer do the mapping is the correct way to operate the
> DMA API and the new API that Leon has built to fix the scatterlist
> abuse in dmabuf relies on importer mapping as part of it's
> construction.

Exactly on that I strongly disagree on.

DMA-buf works by providing DMA addresses the importer can work with and 
*NOT* the underlying location of the buffer.

> Why on earth do you want the exporter to map?

Because the exporter owns the exported buffer and only the exporter 
knows to how correctly access it.

> That is completely backwards and unworkable in many cases. The disfunctional P2P support
> in dmabuf is like that principally because of this.

No, that is exactly what we need.

Using the scatterlist to transport the DMA addresses was clearly a 
mistake, but providing the DMA addresses by the exporter has proved many 
times to be the right approach.

Keep in mind that the exported buffer is not necessary memory, but can 
also be MMIO or stuff which is only accessible through address space 
windows where you can't create a PFN nor struct page for.

> That said, I don't think get_pfn() is an especially good interface,
> but we will need to come with something that passes the physical pfn
> out.

No, physical pfn is absolutely not a good way of passing the location of 
data around because it is limited to what the CPU sees as address space.

We have use cases where DMA-buf transports the location of CPU invisible 
data which only the involved devices can see.

Regards,
Christian.

>
> Jason
Jason Gunthorpe Jan. 8, 2025, 2:58 p.m. UTC | #4
On Wed, Jan 08, 2025 at 02:44:26PM +0100, Christian König wrote:

> > Having the importer do the mapping is the correct way to operate the
> > DMA API and the new API that Leon has built to fix the scatterlist
> > abuse in dmabuf relies on importer mapping as part of it's
> > construction.
> 
> Exactly on that I strongly disagree on.
> 
> DMA-buf works by providing DMA addresses the importer can work with and
> *NOT* the underlying location of the buffer.

The expectation is that the DMA API will be used to DMA map (most)
things, and the DMA API always works with a physaddr_t/pfn
argument. Basically, everything that is not a private address space
should be supported by improving the DMA API. We are on course for
finally getting all the common cases like P2P and MMIO solved
here. That alone will take care of alot.

For P2P cases we are going toward (PFN + P2P source information) as
input to the DMA API. The additional "P2P source information" provides
a good way for co-operating drivers to represent private address
spaces as well. Both importer and exporter can have full understanding
what is being mapped and do the correct things, safely.

So, no, we don't loose private address space support when moving to
importer mapping, in fact it works better because the importer gets
more information about what is going on.

I have imagined a staged approach were DMABUF gets a new API that
works with the new DMA API to do importer mapping with "P2P source
information" and a gradual conversion.

Exporter mapping falls down in too many cases already:

1) Private addresses spaces don't work fully well because many devices
need some indication what address space is being used and scatter list
can't really properly convey that. If the DMABUF has a mixture of CPU
and private it becomes a PITA

2) Multi-path PCI can require the importer to make mapping decisions
unique to the device and program device specific information for the
multi-path. We are doing this in mlx5 today and have hacks because
DMABUF is destroying the information the importer needs to choose the
correct PCI path.

3) Importing devices need to know if they are working with PCI P2P
addresses during mapping because they need to do things like turn on
ATS on their DMA. As for multi-path we have the same hacks inside mlx5
today that assume DMABUFs are always P2P because we cannot determine
if things are P2P or not after being DMA mapped.

4) TPH bits needs to be programmed into the importer device but are
derived based on the NUMA topology of the DMA target. The importer has
no idea what the DMA target actually was because the exporter mapping
destroyed that information.

5) iommufd and kvm are both using CPU addresses without DMA. No
exporter mapping is possible

Jason
Christian König Jan. 8, 2025, 3:25 p.m. UTC | #5
Am 08.01.25 um 15:58 schrieb Jason Gunthorpe:
> On Wed, Jan 08, 2025 at 02:44:26PM +0100, Christian König wrote:
>
>>> Having the importer do the mapping is the correct way to operate the
>>> DMA API and the new API that Leon has built to fix the scatterlist
>>> abuse in dmabuf relies on importer mapping as part of it's
>>> construction.
>> Exactly on that I strongly disagree on.
>>
>> DMA-buf works by providing DMA addresses the importer can work with and
>> *NOT* the underlying location of the buffer.
> The expectation is that the DMA API will be used to DMA map (most)
> things, and the DMA API always works with a physaddr_t/pfn
> argument. Basically, everything that is not a private address space
> should be supported by improving the DMA API. We are on course for
> finally getting all the common cases like P2P and MMIO solved
> here. That alone will take care of alot.

Well, from experience the DMA API has failed more often than it actually 
worked in the way required by drivers.

Especially that we tried to hide architectural complexity in there 
instead of properly expose limitations to drivers is not something I 
consider a good design approach.

So I see putting even more into that extremely critical.

> For P2P cases we are going toward (PFN + P2P source information) as
> input to the DMA API. The additional "P2P source information" provides
> a good way for co-operating drivers to represent private address
> spaces as well. Both importer and exporter can have full understanding
> what is being mapped and do the correct things, safely.

I can say from experience that this is clearly not going to work for all 
use cases.

It would mean that we have to pull a massive amount of driver specific 
functionality into the DMA API.

Things like programming access windows for PCI BARs is completely driver 
specific and as far as I can see can't be part of the DMA API without 
things like callbacks.

With that in mind the DMA API would become a mid layer between different 
drivers and that is really not something you are suggesting, isn't it?

> So, no, we don't loose private address space support when moving to
> importer mapping, in fact it works better because the importer gets
> more information about what is going on.

Well, sounds like I wasn't able to voice my concern. Let me try again:

We should not give importers information they don't need. Especially not 
information about the backing store of buffers.

So that importers get more information about what's going on is a bad thing.

> I have imagined a staged approach were DMABUF gets a new API that
> works with the new DMA API to do importer mapping with "P2P source
> information" and a gradual conversion.

To make it clear as maintainer of that subsystem I would reject such a 
step with all I have.

We have already gone down that road and it didn't worked at all and was 
a really big pain to pull people back from it.

> Exporter mapping falls down in too many cases already:
>
> 1) Private addresses spaces don't work fully well because many devices
> need some indication what address space is being used and scatter list
> can't really properly convey that. If the DMABUF has a mixture of CPU
> and private it becomes a PITA

Correct, yes. That's why I said that scatterlist was a bad choice for 
the interface.

But exposing the backing store to importers and then let them do 
whatever they want with it sounds like an even worse idea.

> 2) Multi-path PCI can require the importer to make mapping decisions
> unique to the device and program device specific information for the
> multi-path. We are doing this in mlx5 today and have hacks because
> DMABUF is destroying the information the importer needs to choose the
> correct PCI path.

That's why the exporter gets the struct device of the importer so that 
it can plan how those accesses are made. Where exactly is the problem 
with that?

When you have an use case which is not covered by the existing DMA-buf 
interfaces then please voice that to me and other maintainers instead of 
implementing some hack.

> 3) Importing devices need to know if they are working with PCI P2P
> addresses during mapping because they need to do things like turn on
> ATS on their DMA. As for multi-path we have the same hacks inside mlx5
> today that assume DMABUFs are always P2P because we cannot determine
> if things are P2P or not after being DMA mapped.

Why would you need ATS on PCI P2P and not for system memory accesses?

> 4) TPH bits needs to be programmed into the importer device but are
> derived based on the NUMA topology of the DMA target. The importer has
> no idea what the DMA target actually was because the exporter mapping
> destroyed that information.

Yeah, but again that is completely intentional.

I assume you mean TLP processing hints when you say TPH and those should 
be part of the DMA addresses provided by the exporter.

That an importer tries to look behind the curtain and determines the 
NUMA placement and topology themselves is clearly a no-go from the 
design perspective.

> 5) iommufd and kvm are both using CPU addresses without DMA. No
> exporter mapping is possible

We have customers using both KVM and XEN with DMA-buf, so I can clearly 
confirm that this isn't true.

Regards,
Christian.

>
> Jason
Jason Gunthorpe Jan. 8, 2025, 4:22 p.m. UTC | #6
On Wed, Jan 08, 2025 at 04:25:54PM +0100, Christian König wrote:
> Am 08.01.25 um 15:58 schrieb Jason Gunthorpe:
> > On Wed, Jan 08, 2025 at 02:44:26PM +0100, Christian König wrote:
> > 
> > > > Having the importer do the mapping is the correct way to operate the
> > > > DMA API and the new API that Leon has built to fix the scatterlist
> > > > abuse in dmabuf relies on importer mapping as part of it's
> > > > construction.
> > > Exactly on that I strongly disagree on.
> > > 
> > > DMA-buf works by providing DMA addresses the importer can work with and
> > > *NOT* the underlying location of the buffer.
> > The expectation is that the DMA API will be used to DMA map (most)
> > things, and the DMA API always works with a physaddr_t/pfn
> > argument. Basically, everything that is not a private address space
> > should be supported by improving the DMA API. We are on course for
> > finally getting all the common cases like P2P and MMIO solved
> > here. That alone will take care of alot.
> 
> Well, from experience the DMA API has failed more often than it actually
> worked in the way required by drivers.

The DMA API has been static and very hard to change in these ways for
a long time. I think Leon's new API will break through this and we
will be able finally address these issues.

> > For P2P cases we are going toward (PFN + P2P source information) as
> > input to the DMA API. The additional "P2P source information" provides
> > a good way for co-operating drivers to represent private address
> > spaces as well. Both importer and exporter can have full understanding
> > what is being mapped and do the correct things, safely.
> 
> I can say from experience that this is clearly not going to work for all use
> cases.
> 
> It would mean that we have to pull a massive amount of driver specific
> functionality into the DMA API.

That isn't what I mean. There are two distinct parts, the means to
describe the source (PFN + P2P source information) that is compatible
with the DMA API, and the DMA API itself that works with a few general
P2P source information types.

Private source information would be detected by co-operating drivers
and go down driver private paths. It would be rejected by other
drivers. This broadly follows how the new API is working.

So here I mean you can use the same PFN + Source API between importer
and exporter and the importer can simply detect the special source and
do the private stuff. It is not shifting things under the DMA API, it
is building along side it using compatible design approaches. You
would match the source information, cast it to a driver structure, do
whatever driver math is needed to compute the local DMA address and
then write it to the device. Nothing is hard or "not going to work"
here.

> > So, no, we don't loose private address space support when moving to
> > importer mapping, in fact it works better because the importer gets
> > more information about what is going on.
> 
> Well, sounds like I wasn't able to voice my concern. Let me try again:
> 
> We should not give importers information they don't need. Especially not
> information about the backing store of buffers.
> 
> So that importers get more information about what's going on is a bad thing.

I strongly disagree because we are suffering today in mlx5 because of
this viewpoint. You cannot predict in advance what importers are going
to need. I already listed many examples where it does not work today
as is.

> > I have imagined a staged approach were DMABUF gets a new API that
> > works with the new DMA API to do importer mapping with "P2P source
> > information" and a gradual conversion.
> 
> To make it clear as maintainer of that subsystem I would reject such a step
> with all I have.

This is unexpected, so you want to just leave dmabuf broken? Do you
have any plan to fix it, to fix the misuse of the DMA API, and all
the problems I listed below? This is a big deal, it is causing real
problems today.

If it going to be like this I think we will stop trying to use dmabuf
and do something simpler for vfio/kvm/iommufd :(

> We have already gone down that road and it didn't worked at all and
> was a really big pain to pull people back from it.

Nobody has really seriously tried to improve the DMA API before, so I
don't think this is true at all.

> > Exporter mapping falls down in too many cases already:
> > 
> > 1) Private addresses spaces don't work fully well because many devices
> > need some indication what address space is being used and scatter list
> > can't really properly convey that. If the DMABUF has a mixture of CPU
> > and private it becomes a PITA
> 
> Correct, yes. That's why I said that scatterlist was a bad choice for the
> interface.
> 
> But exposing the backing store to importers and then let them do whatever
> they want with it sounds like an even worse idea.

You keep saying this without real justification. To me it is a nanny
style of API design. But also I don't see how you can possibly fix the
above without telling the importer alot more information.

> > 2) Multi-path PCI can require the importer to make mapping decisions
> > unique to the device and program device specific information for the
> > multi-path. We are doing this in mlx5 today and have hacks because
> > DMABUF is destroying the information the importer needs to choose the
> > correct PCI path.
> 
> That's why the exporter gets the struct device of the importer so that it
> can plan how those accesses are made. Where exactly is the problem with
> that?

A single struct device does not convey the multipath options. We have
multiple struct devices (and multiple PCI endpoints) doing DMA
concurrently under one driver.

Multipath always needs additional meta information in the importer
side to tell the device which path to select. A naked dma address is
not sufficient.

Today we guess that DMABUF will be using P2P and hack to choose a P2P
struct device to pass the exporter. We need to know what is in the
dmabuf before we can choose which of the multiple struct devices the
driver has to use for DMA mapping.

But even simple CPU centric cases we will eventually want to select
the proper NUMA local PCI channel matching struct device for CPU only
buffers.

> When you have an use case which is not covered by the existing DMA-buf
> interfaces then please voice that to me and other maintainers instead of
> implementing some hack.

Do you have any suggestion for any of this then? We have a good plan
to fix this stuff and more. Many experts in their fields have agreed
on the different parts now. We haven't got to dmabuf because I had no
idea there would be an objection like this.

> > 3) Importing devices need to know if they are working with PCI P2P
> > addresses during mapping because they need to do things like turn on
> > ATS on their DMA. As for multi-path we have the same hacks inside mlx5
> > today that assume DMABUFs are always P2P because we cannot determine
> > if things are P2P or not after being DMA mapped.
> 
> Why would you need ATS on PCI P2P and not for system memory accesses?

ATS has a significant performance cost. It is mandatory for PCI P2P,
but ideally should be avoided for CPU memory.

> > 4) TPH bits needs to be programmed into the importer device but are
> > derived based on the NUMA topology of the DMA target. The importer has
> > no idea what the DMA target actually was because the exporter mapping
> > destroyed that information.
> 
> Yeah, but again that is completely intentional.
> 
> I assume you mean TLP processing hints when you say TPH and those should be
> part of the DMA addresses provided by the exporter.

Yes, but is not part of the DMA addresses.

> That an importer tries to look behind the curtain and determines the NUMA
> placement and topology themselves is clearly a no-go from the design
> perspective.

I strongly disagree, this is important. Drivers need this information
in a future TPH/UIO/multipath PCI world.

> > 5) iommufd and kvm are both using CPU addresses without DMA. No
> > exporter mapping is possible
> 
> We have customers using both KVM and XEN with DMA-buf, so I can clearly
> confirm that this isn't true.

Today they are mmaping the dma-buf into a VMA and then using KVM's
follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable
dma-buf must have a CPU PFN.

Here Xu implements basically the same path, except without the VMA
indirection, and it suddenly not OK? Illogical.

Jason
Xu Yilun Jan. 8, 2025, 5:56 p.m. UTC | #7
> > > 5) iommufd and kvm are both using CPU addresses without DMA. No
> > > exporter mapping is possible
> > 
> > We have customers using both KVM and XEN with DMA-buf, so I can clearly
> > confirm that this isn't true.
> 
> Today they are mmaping the dma-buf into a VMA and then using KVM's
> follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable
> dma-buf must have a CPU PFN.

Yes, the final target for KVM is still the CPU PFN, just with the help
of CPU mapping table.

I also found the xen gntdev-dmabuf is calculating pfn from mapped
sgt.

From Christion's point, I assume only sgl->dma_address should be
used by importers but in fact not. More importers are 'abusing' sg dma
helpers.

That said there are existing needs for importers to know more about the
real buffer resource, for mapping, or even more than mapping,
e.g. dmabuf_imp_grant_foreign_access()

Thanks,
Yilun
Simona Vetter Jan. 8, 2025, 6:44 p.m. UTC | #8
On Wed, Jan 08, 2025 at 12:22:27PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 08, 2025 at 04:25:54PM +0100, Christian König wrote:
> > Am 08.01.25 um 15:58 schrieb Jason Gunthorpe:
> > > I have imagined a staged approach were DMABUF gets a new API that
> > > works with the new DMA API to do importer mapping with "P2P source
> > > information" and a gradual conversion.
> > 
> > To make it clear as maintainer of that subsystem I would reject such a step
> > with all I have.
> 
> This is unexpected, so you want to just leave dmabuf broken? Do you
> have any plan to fix it, to fix the misuse of the DMA API, and all
> the problems I listed below? This is a big deal, it is causing real
> problems today.
> 
> If it going to be like this I think we will stop trying to use dmabuf
> and do something simpler for vfio/kvm/iommufd :(

As the gal who help edit the og dma-buf spec 13 years ago, I think adding
pfn isn't a terrible idea. By design, dma-buf is the "everything is
optional" interface. And in the beginning, even consistent locking was
optional, but we've managed to fix that by now :-/

Where I do agree with Christian is that stuffing pfn support into the
dma_buf_attachment interfaces feels a bit much wrong.

> > We have already gone down that road and it didn't worked at all and
> > was a really big pain to pull people back from it.
> 
> Nobody has really seriously tried to improve the DMA API before, so I
> don't think this is true at all.

Aside, I really hope this finally happens!

> > > 3) Importing devices need to know if they are working with PCI P2P
> > > addresses during mapping because they need to do things like turn on
> > > ATS on their DMA. As for multi-path we have the same hacks inside mlx5
> > > today that assume DMABUFs are always P2P because we cannot determine
> > > if things are P2P or not after being DMA mapped.
> > 
> > Why would you need ATS on PCI P2P and not for system memory accesses?
> 
> ATS has a significant performance cost. It is mandatory for PCI P2P,
> but ideally should be avoided for CPU memory.

Huh, I didn't know that. And yeah kinda means we've butchered the pci p2p
stuff a bit I guess ...

> > > 5) iommufd and kvm are both using CPU addresses without DMA. No
> > > exporter mapping is possible
> > 
> > We have customers using both KVM and XEN with DMA-buf, so I can clearly
> > confirm that this isn't true.
> 
> Today they are mmaping the dma-buf into a VMA and then using KVM's
> follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable
> dma-buf must have a CPU PFN.
> 
> Here Xu implements basically the same path, except without the VMA
> indirection, and it suddenly not OK? Illogical.

So the big difference is that for follow_pfn() you need mmu_notifier since
the mmap might move around, whereas with pfn smashed into
dma_buf_attachment you need dma_resv_lock rules, and the move_notify
callback if you go dynamic.

So I guess my first question is, which locking rules do you want here for
pfn importers?

If mmu notifiers is fine, then I think the current approach of follow_pfn
should be ok. But if you instead dma_resv_lock rules (or the cpu mmap
somehow is an issue itself), then I think the clean design is create a new
separate access mechanism just for that. It would be the 5th or so (kernel
vmap, userspace mmap, dma_buf_attach and driver private stuff like
virtio_dma_buf.c where you access your buffer with a uuid), so really not
a big deal.

And for non-contrived exporters we might be able to implement the other
access methods in terms of the pfn method generically, so this wouldn't
even be a terrible maintenance burden going forward. And meanwhile all the
contrived exporters just keep working as-is.

The other part is that cpu mmap is optional, and there's plenty of strange
exporters who don't implement. But you can dma map the attachment into
plenty devices. This tends to mostly be a thing on SoC devices with some
very funky memory. But I guess you don't care about these use-case, so
should be ok.

I couldn't come up with a good name for these pfn users, maybe
dma_buf_pfn_attachment? This does _not_ have a struct device, but maybe
some of these new p2p source specifiers (or a list of those which are
allowed, no idea how this would need to fit into the new dma api).

Cheers, Sima
Xu Yilun Jan. 8, 2025, 7:22 p.m. UTC | #9
On Wed, Jan 08, 2025 at 07:44:54PM +0100, Simona Vetter wrote:
> On Wed, Jan 08, 2025 at 12:22:27PM -0400, Jason Gunthorpe wrote:
> > On Wed, Jan 08, 2025 at 04:25:54PM +0100, Christian König wrote:
> > > Am 08.01.25 um 15:58 schrieb Jason Gunthorpe:
> > > > I have imagined a staged approach were DMABUF gets a new API that
> > > > works with the new DMA API to do importer mapping with "P2P source
> > > > information" and a gradual conversion.
> > > 
> > > To make it clear as maintainer of that subsystem I would reject such a step
> > > with all I have.
> > 
> > This is unexpected, so you want to just leave dmabuf broken? Do you
> > have any plan to fix it, to fix the misuse of the DMA API, and all
> > the problems I listed below? This is a big deal, it is causing real
> > problems today.
> > 
> > If it going to be like this I think we will stop trying to use dmabuf
> > and do something simpler for vfio/kvm/iommufd :(
> 
> As the gal who help edit the og dma-buf spec 13 years ago, I think adding
> pfn isn't a terrible idea. By design, dma-buf is the "everything is
> optional" interface. And in the beginning, even consistent locking was
> optional, but we've managed to fix that by now :-/
> 
> Where I do agree with Christian is that stuffing pfn support into the
> dma_buf_attachment interfaces feels a bit much wrong.

So it could a dmabuf interface like mmap/vmap()? I was also wondering
about that. But finally I start to use dma_buf_attachment interface
because of leveraging existing buffer pin and move_notify.

> 
> > > We have already gone down that road and it didn't worked at all and
> > > was a really big pain to pull people back from it.
> > 
> > Nobody has really seriously tried to improve the DMA API before, so I
> > don't think this is true at all.
> 
> Aside, I really hope this finally happens!
> 
> > > > 3) Importing devices need to know if they are working with PCI P2P
> > > > addresses during mapping because they need to do things like turn on
> > > > ATS on their DMA. As for multi-path we have the same hacks inside mlx5
> > > > today that assume DMABUFs are always P2P because we cannot determine
> > > > if things are P2P or not after being DMA mapped.
> > > 
> > > Why would you need ATS on PCI P2P and not for system memory accesses?
> > 
> > ATS has a significant performance cost. It is mandatory for PCI P2P,
> > but ideally should be avoided for CPU memory.
> 
> Huh, I didn't know that. And yeah kinda means we've butchered the pci p2p
> stuff a bit I guess ...
> 
> > > > 5) iommufd and kvm are both using CPU addresses without DMA. No
> > > > exporter mapping is possible
> > > 
> > > We have customers using both KVM and XEN with DMA-buf, so I can clearly
> > > confirm that this isn't true.
> > 
> > Today they are mmaping the dma-buf into a VMA and then using KVM's
> > follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable
> > dma-buf must have a CPU PFN.
> > 
> > Here Xu implements basically the same path, except without the VMA
> > indirection, and it suddenly not OK? Illogical.
> 
> So the big difference is that for follow_pfn() you need mmu_notifier since
> the mmap might move around, whereas with pfn smashed into
> dma_buf_attachment you need dma_resv_lock rules, and the move_notify
> callback if you go dynamic.
> 
> So I guess my first question is, which locking rules do you want here for
> pfn importers?

follow_pfn() is unwanted for private MMIO, so dma_resv_lock.

> 
> If mmu notifiers is fine, then I think the current approach of follow_pfn
> should be ok. But if you instead dma_resv_lock rules (or the cpu mmap
> somehow is an issue itself), then I think the clean design is create a new

cpu mmap() is an issue, this series is aimed to eliminate userspace
mapping for private MMIO resources.

> separate access mechanism just for that. It would be the 5th or so (kernel
> vmap, userspace mmap, dma_buf_attach and driver private stuff like
> virtio_dma_buf.c where you access your buffer with a uuid), so really not
> a big deal.

OK, will think more about that.

Thanks,
Yilun

> 
> And for non-contrived exporters we might be able to implement the other
> access methods in terms of the pfn method generically, so this wouldn't
> even be a terrible maintenance burden going forward. And meanwhile all the
> contrived exporters just keep working as-is.
> 
> The other part is that cpu mmap is optional, and there's plenty of strange
> exporters who don't implement. But you can dma map the attachment into
> plenty devices. This tends to mostly be a thing on SoC devices with some
> very funky memory. But I guess you don't care about these use-case, so
> should be ok.
> 
> I couldn't come up with a good name for these pfn users, maybe
> dma_buf_pfn_attachment? This does _not_ have a struct device, but maybe
> some of these new p2p source specifiers (or a list of those which are
> allowed, no idea how this would need to fit into the new dma api).
> 
> Cheers, Sima
> -- 
> Simona Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
Xu Yilun Jan. 8, 2025, 11:06 p.m. UTC | #10
>  So I guess my first question is, which locking rules do you want here for
>  pfn importers?
> 
>  follow_pfn() is unwanted for private MMIO, so dma_resv_lock.
> 
>    As Sima explained you either have follow_pfn() and mmu_notifier or you
>    have DMA addresses and dma_resv lock / dma_fence.
> 
>    Just giving out PFNs without some lifetime associated with them is one of
>    the major problems we faced before and really not something you can do.

I'm trying to make exporter give out PFN with lifetime control via
move_notify() in this series. May not be conceptually correct but seems
possible.

> 
> 
>  If mmu notifiers is fine, then I think the current approach of follow_pfn
>  should be ok. But if you instead dma_resv_lock rules (or the cpu mmap
>  somehow is an issue itself), then I think the clean design is create a new
> 
>  cpu mmap() is an issue, this series is aimed to eliminate userspace
>  mapping for private MMIO resources.
> 
>    Why?

OK, I can start from here.

It is about the Secure guest, or CoCo VM. The memory and MMIOs assigned
to this kind of guest is unaccessable to host itself, by leveraging HW
encryption & access control technology (e.g. Intel TDX, AMD SEV-SNP ...).
This is to protect the tenant data being stolen by CSP itself.

The impact is when host accesses the encrypted region, bad things
happen to system, e.g. memory corruption, MCE. Kernel is trying to
mitigate most of the impact by alloc and assign user unmappable memory
resources (private memory) to guest, which prevents userspace
accidents. guest_memfd is the private memory provider that only allows
for KVM to position the page/pfn by fd + offset and create secondary
page table (EPT, NPT...), no host mapping, no VMA, no mmu_notifier. But
the lifecycle of the private memory is still controlled by guest_memfd.
When fallocate(fd, PUNCH_HOLE), the memory resource is revoked and KVM
is notified to unmap corresponding EPT.

The further thought is guest_memfd is also suitable for normal guest.
It makes no sense VMM must build host mapping table before guest access.

Now I'm trying to seek a similar way for private MMIO. A MMIO resource
provider that is exported as an fd. It controls the lifecycle of the
MMIO resource and notify KVM when revoked. dma-buf seems to be a good
provider which have done most of the work, only need to extend the
memory resource seeking by fd + offset.

> 
>  separate access mechanism just for that. It would be the 5th or so (kernel
>  vmap, userspace mmap, dma_buf_attach and driver private stuff like
>  virtio_dma_buf.c where you access your buffer with a uuid), so really not
>  a big deal.
> 
>  OK, will think more about that.
> 
>    Please note that we have follow_pfn() + mmu_notifier working for KVM/XEN

Folow_pfn & mmu_notifier won't work here, cause no VMA, no host mapping
table.

Thanks,
Yilun
>    with MMIO mappings and P2P. And that required exactly zero DMA-buf changes
>    :)
> 
>    I don't fully understand your use case, but I think it's quite likely that
>    we already have that working.
> 
>    Regards,
>    Christian.
Christian König Jan. 9, 2025, 8:04 a.m. UTC | #11
Am 08.01.25 um 20:22 schrieb Xu Yilun:
> On Wed, Jan 08, 2025 at 07:44:54PM +0100, Simona Vetter wrote:
>> On Wed, Jan 08, 2025 at 12:22:27PM -0400, Jason Gunthorpe wrote:
>>> On Wed, Jan 08, 2025 at 04:25:54PM +0100, Christian König wrote:
>>>> Am 08.01.25 um 15:58 schrieb Jason Gunthorpe:
>>>>> I have imagined a staged approach were DMABUF gets a new API that
>>>>> works with the new DMA API to do importer mapping with "P2P source
>>>>> information" and a gradual conversion.
>>>> To make it clear as maintainer of that subsystem I would reject such a step
>>>> with all I have.
>>> This is unexpected, so you want to just leave dmabuf broken? Do you
>>> have any plan to fix it, to fix the misuse of the DMA API, and all
>>> the problems I listed below? This is a big deal, it is causing real
>>> problems today.
>>>
>>> If it going to be like this I think we will stop trying to use dmabuf
>>> and do something simpler for vfio/kvm/iommufd :(
>> As the gal who help edit the og dma-buf spec 13 years ago, I think adding
>> pfn isn't a terrible idea. By design, dma-buf is the "everything is
>> optional" interface. And in the beginning, even consistent locking was
>> optional, but we've managed to fix that by now :-/

Well you were also the person who mangled the struct page pointers in 
the scatterlist because people were abusing this and getting a bloody 
nose :)

>> Where I do agree with Christian is that stuffing pfn support into the
>> dma_buf_attachment interfaces feels a bit much wrong.
> So it could a dmabuf interface like mmap/vmap()? I was also wondering
> about that. But finally I start to use dma_buf_attachment interface
> because of leveraging existing buffer pin and move_notify.

Exactly that's the point, sharing pfn doesn't work with the pin and 
move_notify interfaces because of the MMU notifier approach Sima mentioned.

>>>> We have already gone down that road and it didn't worked at all and
>>>> was a really big pain to pull people back from it.
>>> Nobody has really seriously tried to improve the DMA API before, so I
>>> don't think this is true at all.
>> Aside, I really hope this finally happens!

Sorry my fault. I was not talking about the DMA API, but rather that 
people tried to look behind the curtain of DMA-buf backing stores.

In other words all the fun we had with scatterlists and that people try 
to modify the struct pages inside of them.

Improving the DMA API is something I really really hope for as well.

>>>>> 3) Importing devices need to know if they are working with PCI P2P
>>>>> addresses during mapping because they need to do things like turn on
>>>>> ATS on their DMA. As for multi-path we have the same hacks inside mlx5
>>>>> today that assume DMABUFs are always P2P because we cannot determine
>>>>> if things are P2P or not after being DMA mapped.
>>>> Why would you need ATS on PCI P2P and not for system memory accesses?
>>> ATS has a significant performance cost. It is mandatory for PCI P2P,
>>> but ideally should be avoided for CPU memory.
>> Huh, I didn't know that. And yeah kinda means we've butchered the pci p2p
>> stuff a bit I guess ...

Hui? Why should ATS be mandatory for PCI P2P?

We have tons of production systems using PCI P2P without ATS. And it's 
the first time I hear that.

>>>>> 5) iommufd and kvm are both using CPU addresses without DMA. No
>>>>> exporter mapping is possible
>>>> We have customers using both KVM and XEN with DMA-buf, so I can clearly
>>>> confirm that this isn't true.
>>> Today they are mmaping the dma-buf into a VMA and then using KVM's
>>> follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable
>>> dma-buf must have a CPU PFN.
>>>
>>> Here Xu implements basically the same path, except without the VMA
>>> indirection, and it suddenly not OK? Illogical.
>> So the big difference is that for follow_pfn() you need mmu_notifier since
>> the mmap might move around, whereas with pfn smashed into
>> dma_buf_attachment you need dma_resv_lock rules, and the move_notify
>> callback if you go dynamic.
>>
>> So I guess my first question is, which locking rules do you want here for
>> pfn importers?
> follow_pfn() is unwanted for private MMIO, so dma_resv_lock.

As Sima explained you either have follow_pfn() and mmu_notifier or you 
have DMA addresses and dma_resv lock / dma_fence.

Just giving out PFNs without some lifetime associated with them is one 
of the major problems we faced before and really not something you can do.

>> If mmu notifiers is fine, then I think the current approach of follow_pfn
>> should be ok. But if you instead dma_resv_lock rules (or the cpu mmap
>> somehow is an issue itself), then I think the clean design is create a new
> cpu mmap() is an issue, this series is aimed to eliminate userspace
> mapping for private MMIO resources.

Why?

>> separate access mechanism just for that. It would be the 5th or so (kernel
>> vmap, userspace mmap, dma_buf_attach and driver private stuff like
>> virtio_dma_buf.c where you access your buffer with a uuid), so really not
>> a big deal.
> OK, will think more about that.

Please note that we have follow_pfn() + mmu_notifier working for KVM/XEN 
with MMIO mappings and P2P. And that required exactly zero DMA-buf 
changes :)

I don't fully understand your use case, but I think it's quite likely 
that we already have that working.

Regards,
Christian.

>
> Thanks,
> Yilun
>
>> And for non-contrived exporters we might be able to implement the other
>> access methods in terms of the pfn method generically, so this wouldn't
>> even be a terrible maintenance burden going forward. And meanwhile all the
>> contrived exporters just keep working as-is.
>>
>> The other part is that cpu mmap is optional, and there's plenty of strange
>> exporters who don't implement. But you can dma map the attachment into
>> plenty devices. This tends to mostly be a thing on SoC devices with some
>> very funky memory. But I guess you don't care about these use-case, so
>> should be ok.
>>
>> I couldn't come up with a good name for these pfn users, maybe
>> dma_buf_pfn_attachment? This does _not_ have a struct device, but maybe
>> some of these new p2p source specifiers (or a list of those which are
>> allowed, no idea how this would need to fit into the new dma api).
>>
>> Cheers, Sima
>> -- 
>> Simona Vetter
>> Software Engineer, Intel Corporation
>> http://blog.ffwll.ch
Christian König Jan. 9, 2025, 8:09 a.m. UTC | #12
Answering on my reply once more as pure text mail.

I love AMDs mail servers.

Cheers,
Christian.

Am 09.01.25 um 09:04 schrieb Christian König:
> Am 08.01.25 um 20:22 schrieb Xu Yilun:
>> On Wed, Jan 08, 2025 at 07:44:54PM +0100, Simona Vetter wrote:
>>> On Wed, Jan 08, 2025 at 12:22:27PM -0400, Jason Gunthorpe wrote:
>>>> On Wed, Jan 08, 2025 at 04:25:54PM +0100, Christian König wrote:
>>>>> Am 08.01.25 um 15:58 schrieb Jason Gunthorpe:
>>>>>> I have imagined a staged approach were DMABUF gets a new API that
>>>>>> works with the new DMA API to do importer mapping with "P2P source
>>>>>> information" and a gradual conversion.
>>>>> To make it clear as maintainer of that subsystem I would reject such a step
>>>>> with all I have.
>>>> This is unexpected, so you want to just leave dmabuf broken? Do you
>>>> have any plan to fix it, to fix the misuse of the DMA API, and all
>>>> the problems I listed below? This is a big deal, it is causing real
>>>> problems today.
>>>>
>>>> If it going to be like this I think we will stop trying to use dmabuf
>>>> and do something simpler for vfio/kvm/iommufd :(
>>> As the gal who help edit the og dma-buf spec 13 years ago, I think adding
>>> pfn isn't a terrible idea. By design, dma-buf is the "everything is
>>> optional" interface. And in the beginning, even consistent locking was
>>> optional, but we've managed to fix that by now :-/
>
> Well you were also the person who mangled the struct page pointers in 
> the scatterlist because people were abusing this and getting a bloody 
> nose :)
>
>>> Where I do agree with Christian is that stuffing pfn support into the
>>> dma_buf_attachment interfaces feels a bit much wrong.
>> So it could a dmabuf interface like mmap/vmap()? I was also wondering
>> about that. But finally I start to use dma_buf_attachment interface
>> because of leveraging existing buffer pin and move_notify.
>
> Exactly that's the point, sharing pfn doesn't work with the pin and 
> move_notify interfaces because of the MMU notifier approach Sima 
> mentioned.
>
>>>>> We have already gone down that road and it didn't worked at all and
>>>>> was a really big pain to pull people back from it.
>>>> Nobody has really seriously tried to improve the DMA API before, so I
>>>> don't think this is true at all.
>>> Aside, I really hope this finally happens!
>
> Sorry my fault. I was not talking about the DMA API, but rather that 
> people tried to look behind the curtain of DMA-buf backing stores.
>
> In other words all the fun we had with scatterlists and that people 
> try to modify the struct pages inside of them.
>
> Improving the DMA API is something I really really hope for as well.
>
>>>>>> 3) Importing devices need to know if they are working with PCI P2P
>>>>>> addresses during mapping because they need to do things like turn on
>>>>>> ATS on their DMA. As for multi-path we have the same hacks inside mlx5
>>>>>> today that assume DMABUFs are always P2P because we cannot determine
>>>>>> if things are P2P or not after being DMA mapped.
>>>>> Why would you need ATS on PCI P2P and not for system memory accesses?
>>>> ATS has a significant performance cost. It is mandatory for PCI P2P,
>>>> but ideally should be avoided for CPU memory.
>>> Huh, I didn't know that. And yeah kinda means we've butchered the pci p2p
>>> stuff a bit I guess ...
>
> Hui? Why should ATS be mandatory for PCI P2P?
>
> We have tons of production systems using PCI P2P without ATS. And it's 
> the first time I hear that.
>
>>>>>> 5) iommufd and kvm are both using CPU addresses without DMA. No
>>>>>> exporter mapping is possible
>>>>> We have customers using both KVM and XEN with DMA-buf, so I can clearly
>>>>> confirm that this isn't true.
>>>> Today they are mmaping the dma-buf into a VMA and then using KVM's
>>>> follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable
>>>> dma-buf must have a CPU PFN.
>>>>
>>>> Here Xu implements basically the same path, except without the VMA
>>>> indirection, and it suddenly not OK? Illogical.
>>> So the big difference is that for follow_pfn() you need mmu_notifier since
>>> the mmap might move around, whereas with pfn smashed into
>>> dma_buf_attachment you need dma_resv_lock rules, and the move_notify
>>> callback if you go dynamic.
>>>
>>> So I guess my first question is, which locking rules do you want here for
>>> pfn importers?
>> follow_pfn() is unwanted for private MMIO, so dma_resv_lock.
>
> As Sima explained you either have follow_pfn() and mmu_notifier or you 
> have DMA addresses and dma_resv lock / dma_fence.
>
> Just giving out PFNs without some lifetime associated with them is one 
> of the major problems we faced before and really not something you can do.
>
>>> If mmu notifiers is fine, then I think the current approach of follow_pfn
>>> should be ok. But if you instead dma_resv_lock rules (or the cpu mmap
>>> somehow is an issue itself), then I think the clean design is create a new
>> cpu mmap() is an issue, this series is aimed to eliminate userspace
>> mapping for private MMIO resources.
>
> Why?
>
>>> separate access mechanism just for that. It would be the 5th or so (kernel
>>> vmap, userspace mmap, dma_buf_attach and driver private stuff like
>>> virtio_dma_buf.c where you access your buffer with a uuid), so really not
>>> a big deal.
>> OK, will think more about that.
>
> Please note that we have follow_pfn() + mmu_notifier working for 
> KVM/XEN with MMIO mappings and P2P. And that required exactly zero 
> DMA-buf changes :)
>
> I don't fully understand your use case, but I think it's quite likely 
> that we already have that working.
>
> Regards,
> Christian.
>
>> Thanks,
>> Yilun
>>
>>> And for non-contrived exporters we might be able to implement the other
>>> access methods in terms of the pfn method generically, so this wouldn't
>>> even be a terrible maintenance burden going forward. And meanwhile all the
>>> contrived exporters just keep working as-is.
>>>
>>> The other part is that cpu mmap is optional, and there's plenty of strange
>>> exporters who don't implement. But you can dma map the attachment into
>>> plenty devices. This tends to mostly be a thing on SoC devices with some
>>> very funky memory. But I guess you don't care about these use-case, so
>>> should be ok.
>>>
>>> I couldn't come up with a good name for these pfn users, maybe
>>> dma_buf_pfn_attachment? This does _not_ have a struct device, but maybe
>>> some of these new p2p source specifiers (or a list of those which are
>>> allowed, no idea how this would need to fit into the new dma api).
>>>
>>> Cheers, Sima
>>> -- 
>>> Simona Vetter
>>> Software Engineer, Intel Corporation
>>> http://blog.ffwll.ch
>
Christian König Jan. 9, 2025, 9:10 a.m. UTC | #13
Am 08.01.25 um 17:22 schrieb Jason Gunthorpe:
> [SNIP]
>>> For P2P cases we are going toward (PFN + P2P source information) as
>>> input to the DMA API. The additional "P2P source information" provides
>>> a good way for co-operating drivers to represent private address
>>> spaces as well. Both importer and exporter can have full understanding
>>> what is being mapped and do the correct things, safely.
>> I can say from experience that this is clearly not going to work for all use
>> cases.
>>
>> It would mean that we have to pull a massive amount of driver specific
>> functionality into the DMA API.
> That isn't what I mean. There are two distinct parts, the means to
> describe the source (PFN + P2P source information) that is compatible
> with the DMA API, and the DMA API itself that works with a few general
> P2P source information types.
>
> Private source information would be detected by co-operating drivers
> and go down driver private paths. It would be rejected by other
> drivers. This broadly follows how the new API is working.
>
> So here I mean you can use the same PFN + Source API between importer
> and exporter and the importer can simply detect the special source and
> do the private stuff. It is not shifting things under the DMA API, it
> is building along side it using compatible design approaches. You
> would match the source information, cast it to a driver structure, do
> whatever driver math is needed to compute the local DMA address and
> then write it to the device. Nothing is hard or "not going to work"
> here.

Well to be honest that sounds like an absolutely horrible design.

You are moving all responsibilities for inter driver handling into the 
drivers themselves without any supervision by the core OS.

Drivers are notoriously buggy and should absolutely not do things like 
that on their own.

Do you have pointers to this new API?

>>> So, no, we don't loose private address space support when moving to
>>> importer mapping, in fact it works better because the importer gets
>>> more information about what is going on.
>> Well, sounds like I wasn't able to voice my concern. Let me try again:
>>
>> We should not give importers information they don't need. Especially not
>> information about the backing store of buffers.
>>
>> So that importers get more information about what's going on is a bad thing.
> I strongly disagree because we are suffering today in mlx5 because of
> this viewpoint. You cannot predict in advance what importers are going
> to need. I already listed many examples where it does not work today
> as is.

Well there is no need to predict in advance what importers are going to 
do. We just gather the requirements of the importers and see how we can 
fulfill them.

And yes, it is absolutely intentional that maintainers reject some of 
those requirements because the they find that the importer tries to do 
something totally nasty which shouldn't happen in the first place.

>>> I have imagined a staged approach were DMABUF gets a new API that
>>> works with the new DMA API to do importer mapping with "P2P source
>>> information" and a gradual conversion.
>> To make it clear as maintainer of that subsystem I would reject such a step
>> with all I have.
> This is unexpected, so you want to just leave dmabuf broken? Do you
> have any plan to fix it, to fix the misuse of the DMA API, and all
> the problems I listed below? This is a big deal, it is causing real
> problems today.

Well as maintainer of DMA-buf I don't think it is broken. It's just 
missing requirements.

When you have new requirements we can certainly talk about them, but 
giving drivers a green card to do whatever nasty thing they want to do 
is not going to fly at all.

>>> Exporter mapping falls down in too many cases already:
>>>
>>> 1) Private addresses spaces don't work fully well because many devices
>>> need some indication what address space is being used and scatter list
>>> can't really properly convey that. If the DMABUF has a mixture of CPU
>>> and private it becomes a PITA
>> Correct, yes. That's why I said that scatterlist was a bad choice for the
>> interface.
>>
>> But exposing the backing store to importers and then let them do whatever
>> they want with it sounds like an even worse idea.
> You keep saying this without real justification.

Well I have really strong justification for that. Please look at the 
code here and how it cam to be: 
https://elixir.bootlin.com/linux/v6.12.6/source/drivers/dma-buf/dma-buf.c#L776

We intentionally mangle the struct page pointers in the scatterlist to 
prevent importers from trying to access it.

> To me it is a nanny style of API design.

Yes, absolutely and that is really necessary.

Inter driver APIs are hard work because the framework not only needs to 
define rules but also makes sure to properly enforce them.

That's why we not only added tons of documentation how to implement 
things, but also annotation which gives big warnings whenever somebody 
tries to do something nasty.

> But also I don't see how you can possibly fix the
> above without telling the importer alot more information.

Giving additional information to the DMA addresses is perfectly ok. It's 
just that those additional information should be the output of the DMA 
API or DMA-buf and not something drivers try to figure out on their own.

>>> 2) Multi-path PCI can require the importer to make mapping decisions
>>> unique to the device and program device specific information for the
>>> multi-path. We are doing this in mlx5 today and have hacks because
>>> DMABUF is destroying the information the importer needs to choose the
>>> correct PCI path.
>> That's why the exporter gets the struct device of the importer so that it
>> can plan how those accesses are made. Where exactly is the problem with
>> that?
> A single struct device does not convey the multipath options. We have
> multiple struct devices (and multiple PCI endpoints) doing DMA
> concurrently under one driver.
>
> Multipath always needs additional meta information in the importer
> side to tell the device which path to select. A naked dma address is
> not sufficient.
>
> Today we guess that DMABUF will be using P2P and hack to choose a P2P
> struct device to pass the exporter. We need to know what is in the
> dmabuf before we can choose which of the multiple struct devices the
> driver has to use for DMA mapping.
>
> But even simple CPU centric cases we will eventually want to select
> the proper NUMA local PCI channel matching struct device for CPU only
> buffers.

So basically you want something which tells you if device A or device B 
should be used to access something?

That was discussed before as well, but so far the resolution was that 
userspace should decide that stuff and not the kernel.

Would it be sufficient to have a query API where you can ask the DMA-buf 
exporter for information about the buffer? E.g. NUMA distance etc...

>> When you have an use case which is not covered by the existing DMA-buf
>> interfaces then please voice that to me and other maintainers instead of
>> implementing some hack.
> Do you have any suggestion for any of this then?

Yeah, compared to other requirements that sounds trivial what you have 
at hand here.

>>> 5) iommufd and kvm are both using CPU addresses without DMA. No
>>> exporter mapping is possible
>> We have customers using both KVM and XEN with DMA-buf, so I can clearly
>> confirm that this isn't true.
> Today they are mmaping the dma-buf into a VMA and then using KVM's
> follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable
> dma-buf must have a CPU PFN.

Yeah and I'm the one who forced the KVM guys to do it like that.

> Here Xu implements basically the same path, except without the VMA
> indirection, and it suddenly not OK? Illogical.

Well I'm the designer of this KVM solution and it's not illogical to 
reject that when you are missing quite a bunch of stuff.

Sima correctly pointed out that you need an mmu notifier if you want to 
work with PFNs, but that is just the tip of the iceberg here.

If you want to go down the CPU access path you also need to cover things 
like cacheing flags, lifetime, dynamic changing backing etc...

Regards,
Christian.


>
> Jason
Leon Romanovsky Jan. 9, 2025, 9:28 a.m. UTC | #14
On Thu, Jan 09, 2025 at 10:10:01AM +0100, Christian König wrote:
> Am 08.01.25 um 17:22 schrieb Jason Gunthorpe:
> > [SNIP]
> > > > For P2P cases we are going toward (PFN + P2P source information) as
> > > > input to the DMA API. The additional "P2P source information" provides
> > > > a good way for co-operating drivers to represent private address
> > > > spaces as well. Both importer and exporter can have full understanding
> > > > what is being mapped and do the correct things, safely.
> > > I can say from experience that this is clearly not going to work for all use
> > > cases.
> > > 
> > > It would mean that we have to pull a massive amount of driver specific
> > > functionality into the DMA API.
> > That isn't what I mean. There are two distinct parts, the means to
> > describe the source (PFN + P2P source information) that is compatible
> > with the DMA API, and the DMA API itself that works with a few general
> > P2P source information types.
> > 
> > Private source information would be detected by co-operating drivers
> > and go down driver private paths. It would be rejected by other
> > drivers. This broadly follows how the new API is working.
> > 
> > So here I mean you can use the same PFN + Source API between importer
> > and exporter and the importer can simply detect the special source and
> > do the private stuff. It is not shifting things under the DMA API, it
> > is building along side it using compatible design approaches. You
> > would match the source information, cast it to a driver structure, do
> > whatever driver math is needed to compute the local DMA address and
> > then write it to the device. Nothing is hard or "not going to work"
> > here.
> 
> Well to be honest that sounds like an absolutely horrible design.
> 
> You are moving all responsibilities for inter driver handling into the
> drivers themselves without any supervision by the core OS.
> 
> Drivers are notoriously buggy and should absolutely not do things like that
> on their own.

IMHO, you and Jason give different meaning to word "driver" in this
discussion. It is upto to the subsystems to decide how to provide new
API to the end drivers. Worth to read this LWN article first.

Dancing the DMA two-step - https://lwn.net/Articles/997563/

> 
> Do you have pointers to this new API?

Latest version is here - https://lore.kernel.org/all/cover.1734436840.git.leon@kernel.org/
Unfortunately, I forgot to copy/paste cover letter but it can be seen in
previous version https://lore.kernel.org/all/cover.1733398913.git.leon@kernel.org/.

The most complex example is block layer implementation which hides DMA API from
block drivers. https://lore.kernel.org/all/cover.1730037261.git.leon@kernel.org/

Thanks
diff mbox series

Patch

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 7eeee3a38202..83d1448b6dcc 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -630,10 +630,10 @@  struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
 	size_t alloc_size = sizeof(struct dma_buf);
 	int ret;
 
-	if (WARN_ON(!exp_info->priv || !exp_info->ops
-		    || !exp_info->ops->map_dma_buf
-		    || !exp_info->ops->unmap_dma_buf
-		    || !exp_info->ops->release))
+	if (WARN_ON(!exp_info->priv || !exp_info->ops ||
+		    (!!exp_info->ops->map_dma_buf != !!exp_info->ops->unmap_dma_buf) ||
+		    (!exp_info->ops->map_dma_buf && !exp_info->ops->get_pfn) ||
+		    !exp_info->ops->release))
 		return ERR_PTR(-EINVAL);
 
 	if (WARN_ON(exp_info->ops->cache_sgt_mapping &&
@@ -909,7 +909,10 @@  dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
 	struct dma_buf_attachment *attach;
 	int ret;
 
-	if (WARN_ON(!dmabuf || !dev))
+	if (WARN_ON(!dmabuf))
+		return ERR_PTR(-EINVAL);
+
+	if (WARN_ON(dmabuf->ops->map_dma_buf && !dev))
 		return ERR_PTR(-EINVAL);
 
 	if (WARN_ON(importer_ops && !importer_ops->move_notify))
@@ -941,7 +944,7 @@  dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
 	 */
 	if (dma_buf_attachment_is_dynamic(attach) !=
 	    dma_buf_is_dynamic(dmabuf)) {
-		struct sg_table *sgt;
+		struct sg_table *sgt = NULL;
 
 		dma_resv_lock(attach->dmabuf->resv, NULL);
 		if (dma_buf_is_dynamic(attach->dmabuf)) {
@@ -950,13 +953,16 @@  dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
 				goto err_unlock;
 		}
 
-		sgt = __map_dma_buf(attach, DMA_BIDIRECTIONAL);
-		if (!sgt)
-			sgt = ERR_PTR(-ENOMEM);
-		if (IS_ERR(sgt)) {
-			ret = PTR_ERR(sgt);
-			goto err_unpin;
+		if (dmabuf->ops->map_dma_buf) {
+			sgt = __map_dma_buf(attach, DMA_BIDIRECTIONAL);
+			if (!sgt)
+				sgt = ERR_PTR(-ENOMEM);
+			if (IS_ERR(sgt)) {
+				ret = PTR_ERR(sgt);
+				goto err_unpin;
+			}
 		}
+
 		dma_resv_unlock(attach->dmabuf->resv);
 		attach->sgt = sgt;
 		attach->dir = DMA_BIDIRECTIONAL;
@@ -1119,7 +1125,8 @@  struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *attach,
 
 	might_sleep();
 
-	if (WARN_ON(!attach || !attach->dmabuf))
+	if (WARN_ON(!attach || !attach->dmabuf ||
+		    !attach->dmabuf->ops->map_dma_buf))
 		return ERR_PTR(-EINVAL);
 
 	dma_resv_assert_held(attach->dmabuf->resv);
@@ -1195,7 +1202,8 @@  dma_buf_map_attachment_unlocked(struct dma_buf_attachment *attach,
 
 	might_sleep();
 
-	if (WARN_ON(!attach || !attach->dmabuf))
+	if (WARN_ON(!attach || !attach->dmabuf ||
+		    !attach->dmabuf->ops->map_dma_buf))
 		return ERR_PTR(-EINVAL);
 
 	dma_resv_lock(attach->dmabuf->resv, NULL);
@@ -1222,7 +1230,8 @@  void dma_buf_unmap_attachment(struct dma_buf_attachment *attach,
 {
 	might_sleep();
 
-	if (WARN_ON(!attach || !attach->dmabuf || !sg_table))
+	if (WARN_ON(!attach || !attach->dmabuf ||
+		    !attach->dmabuf->ops->unmap_dma_buf || !sg_table))
 		return;
 
 	dma_resv_assert_held(attach->dmabuf->resv);
@@ -1254,7 +1263,8 @@  void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
 {
 	might_sleep();
 
-	if (WARN_ON(!attach || !attach->dmabuf || !sg_table))
+	if (WARN_ON(!attach || !attach->dmabuf ||
+		    !attach->dmabuf->ops->unmap_dma_buf || !sg_table))
 		return;
 
 	dma_resv_lock(attach->dmabuf->resv, NULL);
@@ -1263,6 +1273,52 @@  void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
 }
 EXPORT_SYMBOL_NS_GPL(dma_buf_unmap_attachment_unlocked, "DMA_BUF");
 
+/**
+ * dma_buf_get_pfn_unlocked -
+ * @attach:	[in]	attachment to get pfn from
+ * @pgoff:	[in]	page offset of the buffer against the start of dma_buf
+ * @pfn:	[out]	returns the pfn of the buffer
+ * @max_order	[out]	returns the max mapping order of the buffer
+ */
+int dma_buf_get_pfn_unlocked(struct dma_buf_attachment *attach,
+			     pgoff_t pgoff, u64 *pfn, int *max_order)
+{
+	struct dma_buf *dmabuf = attach->dmabuf;
+	int ret;
+
+	if (WARN_ON(!attach || !attach->dmabuf ||
+		    !attach->dmabuf->ops->get_pfn))
+		return -EINVAL;
+
+	/*
+	 * Open:
+	 *
+	 * When dma_buf is dynamic but dma_buf move is disabled, the buffer
+	 * should be pinned before use, See dma_buf_map_attachment() for
+	 * reference.
+	 *
+	 * But for now no pin is intended inside dma_buf_get_pfn(), otherwise
+	 * need another API to unpin the dma_buf. So just fail out this case.
+	 */
+	if (dma_buf_is_dynamic(attach->dmabuf) &&
+	    !IS_ENABLED(CONFIG_DMABUF_MOVE_NOTIFY))
+		return -ENOENT;
+
+	dma_resv_lock(attach->dmabuf->resv, NULL);
+	ret = dmabuf->ops->get_pfn(attach, pgoff, pfn, max_order);
+	/*
+	 * Open:
+	 *
+	 * Is dma_resv_wait_timeout() needed? I assume no. The DMA buffer
+	 * content synchronization could be done when the buffer is to be
+	 * mapped by importer.
+	 */
+	dma_resv_unlock(attach->dmabuf->resv);
+
+	return ret;
+}
+EXPORT_SYMBOL_NS_GPL(dma_buf_get_pfn_unlocked, "DMA_BUF");
+
 /**
  * dma_buf_move_notify - notify attachments that DMA-buf is moving
  *
@@ -1662,7 +1718,7 @@  static int dma_buf_debug_show(struct seq_file *s, void *unused)
 		attach_count = 0;
 
 		list_for_each_entry(attach_obj, &buf_obj->attachments, node) {
-			seq_printf(s, "\t%s\n", dev_name(attach_obj->dev));
+			seq_printf(s, "\t%s\n", attach_obj->dev ? dev_name(attach_obj->dev) : NULL);
 			attach_count++;
 		}
 		dma_resv_unlock(buf_obj->resv);
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 36216d28d8bd..b16183edfb3a 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -194,6 +194,17 @@  struct dma_buf_ops {
 	 * if the call would block.
 	 */
 
+	/**
+	 * @get_pfn:
+	 *
+	 * This is called by dma_buf_get_pfn(). It is used to get the pfn
+	 * of the buffer positioned by the page offset against the start of
+	 * the dma_buf. It can only be called if @attach has been called
+	 * successfully.
+	 */
+	int (*get_pfn)(struct dma_buf_attachment *attach, pgoff_t pgoff,
+		       u64 *pfn, int *max_order);
+
 	/**
 	 * @release:
 	 *
@@ -629,6 +640,8 @@  dma_buf_map_attachment_unlocked(struct dma_buf_attachment *attach,
 void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
 				       struct sg_table *sg_table,
 				       enum dma_data_direction direction);
+int dma_buf_get_pfn_unlocked(struct dma_buf_attachment *attach,
+			     pgoff_t pgoff, u64 *pfn, int *max_order);
 
 int dma_buf_mmap(struct dma_buf *, struct vm_area_struct *,
 		 unsigned long);