[RFC,01/12] dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI

Message ID 20250107142719.179636-2-yilun.xu@linux.intel.com (mailing list archive)
State New, archived
Series Private MMIO support for private assigned dev

Commit Message

Xu Yilun Jan. 7, 2025, 2:27 p.m. UTC
Introduce a new API for dma-buf importers and add a dma_buf_ops
callback for dma-buf exporters. This API is for subsystem importers that
map the dma-buf to some user-defined address space, e.g. for IOMMUFD to
map the dma-buf to userspace IOVA via the IOMMU page table, or for KVM to
map the dma-buf to GPA via the KVM MMU (e.g. EPT).

Currently dma-buf is only used to get DMA addresses for a device's default
domain by using the kernel DMA APIs. But for these new use cases, importers
only need the pfn of the dma-buf resource to build their own mapping
tables. So the map_dma_buf() callback is no longer mandatory for
exporters. Importers may also choose not to provide a
struct device *dev on dma_buf_attach() if they don't call
dma_buf_map_attachment().

As with dma_buf_map_attachment(), the importer should first call
dma_buf_attach()/dma_buf_dynamic_attach() and then call
dma_buf_get_pfn_unlocked(). If the importer chooses dynamic attach, it
should also handle the dma-buf move notification.

Only the unlocked version of dma_buf_get_pfn() is implemented for now,
simply because no locked version is needed yet.

Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>

---
IIUC, only get_pfn() is needed and no put_pfn(). The whole dma-buf is
referenced/dereferenced at dma-buf attach/detach time.

Specifically, for static attachment the exporter should always make the
memory resource available/pinned on the first dma_buf_attach(), and
release/unpin the memory resource on the last dma_buf_detach(). For dynamic
attachment the exporter may populate & invalidate the memory resource at
any time; that is fine as long as the importers follow the dma-buf move
notification. So no pinning is needed for get_pfn() and no put_pfn() is
needed.
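
A minimal importer-side sketch of that flow, using the helpers added by this
patch (error handling is trimmed; importer_install_pfn() and
importer_invalidate() are hypothetical placeholders for the importer's own
table update and teardown):

#include <linux/dma-buf.h>

static void importer_move_notify(struct dma_buf_attachment *attach)
{
	/* called with the dma-buf reservation lock held by the exporter */
	importer_invalidate(attach->importer_priv);	/* hypothetical teardown */
}

static const struct dma_buf_attach_ops importer_attach_ops = {
	.allow_peer2peer = true,
	.move_notify = importer_move_notify,
};

static int importer_use_pfn(struct dma_buf *dmabuf, pgoff_t pgoff, void *priv)
{
	struct dma_buf_attachment *attach;
	int max_order, ret;
	u64 pfn;

	/* dev may be NULL because this exporter implements no map_dma_buf() */
	attach = dma_buf_dynamic_attach(dmabuf, NULL, &importer_attach_ops, priv);
	if (IS_ERR(attach))
		return PTR_ERR(attach);

	ret = dma_buf_get_pfn_unlocked(attach, pgoff, &pfn, &max_order);
	if (!ret)
		importer_install_pfn(priv, pfn, max_order);	/* hypothetical */

	/* teardown (normally much later): drop the importer mapping, then detach */
	dma_buf_detach(dmabuf, attach);
	return ret;
}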
---
 drivers/dma-buf/dma-buf.c | 90 +++++++++++++++++++++++++++++++--------
 include/linux/dma-buf.h   | 13 ++++++
 2 files changed, 86 insertions(+), 17 deletions(-)

Comments

Xu Yilun June 19, 2024, 11:39 p.m. UTC | #1
On Thu, Jan 16, 2025 at 06:33:48AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 15, 2025 at 09:34:19AM -0400, Jason Gunthorpe wrote:
> > > Or do you mean some that don't have pages associated with them, and
> > > thus have pfn_valid fail on them?  They still have a PFN, just not
> > > one that is valid to use in most of the Linux MM.
> > 
> > He is talking about private interconnect hidden inside clusters of
> > devices.
> > 
> > Ie the system may have many GPUs and those GPUs have their own private
> > interconnect between them. It is not PCI, and packets don't transit
> > through the CPU SOC at all, so the IOMMU is not involved.
> > 
> > DMA can happen on that private interconnect, but from a Linux
> > perspective it is not DMA API DMA, and the addresses used to describe
> > it are not part of the CPU address space. The initiating device will
> > have a way to choose which path the DMA goes through when setting up
> > the DMA.
> 
> So how is this in any way relevant to dma_buf which operates on
> a dma_addr_t right now and thus by definition can't be used for
> these?

One part of dma-buf is the fd-based mechanism for exporters to share
buffers across devices & subsystems, while still having the buffer's lifetime
controlled by the exporter.  Another part is the way the importers use
the buffer (i.e. the dma_addr_t based kAPI).

The fd-based exporting mechanism is what we want to reuse for buffer
sharing. But we are pursuing some extended buffer sharing and DMA usage
which doesn't fit into the existing DMA API mapping, e.g. GPA/userspace
IOVA accesses, or IIUC other side-channel DMA, multi-channel DMA ...

Thanks,
Yilun
Xu Yilun June 20, 2024, 10:02 p.m. UTC | #2
On Thu, Jan 16, 2025 at 04:13:13PM +0100, Christian König wrote:
>    Am 15.01.25 um 18:09 schrieb Jason Gunthorpe:
> 
>  On Wed, Jan 15, 2025 at 05:34:23PM +0100, Christian König wrote:
> 
>     Granted, let me try to improve this.
>     Here is a real world example of one of the issues we ran into and why
>     CPU mappings of importers are redirected to the exporter.
>     We have a good bunch of different exporters who track the CPU mappings
>     of their backing store using address_space objects in one way or
>     another and then uses unmap_mapping_range() to invalidate those CPU
>     mappings.
>     But when importers get the PFNs of the backing store they can look
>     behind the curtain and directly insert this PFN into the CPU page
>     tables.
>     We had literally tons of cases like this where drivers developers cause
>     access after free issues because the importer created a CPU mappings on
>     their own without the exporter knowing about it.
>     This is just one example of what we ran into. Additional to that
>     basically the whole synchronization between drivers was overhauled as
>     well because we found that we can't trust importers to always do the
>     right thing.
> 
>  But this, fundamentally, is importers creating attachments and then
>  *ignoring the lifetime rules of DMABUF*. If you created an attachment,
>  got a move and *ignored the move* because you put the PFN in your own
>  VMA, then you are not following the attachment lifetime rules!
> 
>    Move notify is solely for informing the importer that they need to
>    re-fresh their DMA mappings and eventually block for ongoing DMA to end.
> 
>    This semantics doesn't work well for CPU mappings because you need to hold
>    the reservation lock to make sure that the information stay valid and you
>    can't hold a lock while returning from a page fault.

Dealing with CPU mapping and resource invalidation is a little hard, but it is
resolvable by using other types of locks. And I guess for now dma-buf
exporters should always handle this CPU mapping vs. invalidation contention if
they support mmap().

Since it is resolvable, with some invalidation notify a decent importer could
also handle the contention well.

IIUC the only concern now is that importer device drivers are more likely to do
something wrong, so CPU mapping is moved to the exporter. But most exporters
are also device drivers, so why would they be smarter?

And there are increasing mapping needs: today exporters help handle the CPU
primary mapping; tomorrow should they also help with all other mappings?
Clearly that is not feasible. So maybe conditionally give trust to some
importers.

Thanks,
Yilun
Christian König Jan. 8, 2025, 8:01 a.m. UTC | #3
Am 07.01.25 um 15:27 schrieb Xu Yilun:
> Introduce a new API for dma-buf importer, also add a dma_buf_ops
> callback for dma-buf exporter. This API is for subsystem importers who
> map the dma-buf to some user defined address space, e.g. for IOMMUFD to
> map the dma-buf to userspace IOVA via IOMMU page table, or for KVM to
> map the dma-buf to GPA via KVM MMU (e.g. EPT).
>
> Currently dma-buf is only used to get DMA address for device's default
> domain by using kernel DMA APIs. But for these new use-cases, importers
> only need the pfn of the dma-buf resource to build their own mapping
> tables.

As far as I can see I have to fundamentally reject this whole approach.

It's intentional DMA-buf design that we don't expose struct pages nor 
PFNs to the importer. Essentially DMA-buf only transports DMA addresses.

In other words the mapping is done by the exporter and *not* the importer.

What we certainly can do is annotate those DMA addresses to better
specify in which domain they are applicable, e.g. whether they are PCIe bus
addresses or some inter-device bus addresses etc...

But moving the functionality to map the pages/PFNs to DMA addresses into 
the importer is an absolutely clear NO-GO.

Regards,
Christian.

>   So the map_dma_buf() callback is not mandatory for exporters
> anymore. Also the importers could choose not to provide
> struct device *dev on dma_buf_attach() if they don't call
> dma_buf_map_attachment().
>
> Like dma_buf_map_attachment(), the importer should firstly call
> dma_buf_attach/dynamic_attach() then call dma_buf_get_pfn_unlocked().
> If the importer choose to do dynamic attach, it also should handle the
> dma-buf move notification.
>
> Only the unlocked version of dma_buf_get_pfn is implemented for now,
> just because no locked version is used for now.
>
> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
>
> ---
> IIUC, Only get_pfn() is needed but no put_pfn(). The whole dma-buf is
> de/referenced at dma-buf attach/detach time.
>
> Specifically, for static attachment, the exporter should always make
> memory resource available/pinned on first dma_buf_attach(), and
> release/unpin memory resource on last dma_buf_detach(). For dynamic
> attachment, the exporter could populate & invalidate the memory
> resource at any time, it's OK as long as the importers follow dma-buf
> move notification. So no pinning is needed for get_pfn() and no
> put_pfn() is needed.
> ---
>   drivers/dma-buf/dma-buf.c | 90 +++++++++++++++++++++++++++++++--------
>   include/linux/dma-buf.h   | 13 ++++++
>   2 files changed, 86 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 7eeee3a38202..83d1448b6dcc 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -630,10 +630,10 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
>   	size_t alloc_size = sizeof(struct dma_buf);
>   	int ret;
>   
> -	if (WARN_ON(!exp_info->priv || !exp_info->ops
> -		    || !exp_info->ops->map_dma_buf
> -		    || !exp_info->ops->unmap_dma_buf
> -		    || !exp_info->ops->release))
> +	if (WARN_ON(!exp_info->priv || !exp_info->ops ||
> +		    (!!exp_info->ops->map_dma_buf != !!exp_info->ops->unmap_dma_buf) ||
> +		    (!exp_info->ops->map_dma_buf && !exp_info->ops->get_pfn) ||
> +		    !exp_info->ops->release))
>   		return ERR_PTR(-EINVAL);
>   
>   	if (WARN_ON(exp_info->ops->cache_sgt_mapping &&
> @@ -909,7 +909,10 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
>   	struct dma_buf_attachment *attach;
>   	int ret;
>   
> -	if (WARN_ON(!dmabuf || !dev))
> +	if (WARN_ON(!dmabuf))
> +		return ERR_PTR(-EINVAL);
> +
> +	if (WARN_ON(dmabuf->ops->map_dma_buf && !dev))
>   		return ERR_PTR(-EINVAL);
>   
>   	if (WARN_ON(importer_ops && !importer_ops->move_notify))
> @@ -941,7 +944,7 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
>   	 */
>   	if (dma_buf_attachment_is_dynamic(attach) !=
>   	    dma_buf_is_dynamic(dmabuf)) {
> -		struct sg_table *sgt;
> +		struct sg_table *sgt = NULL;
>   
>   		dma_resv_lock(attach->dmabuf->resv, NULL);
>   		if (dma_buf_is_dynamic(attach->dmabuf)) {
> @@ -950,13 +953,16 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
>   				goto err_unlock;
>   		}
>   
> -		sgt = __map_dma_buf(attach, DMA_BIDIRECTIONAL);
> -		if (!sgt)
> -			sgt = ERR_PTR(-ENOMEM);
> -		if (IS_ERR(sgt)) {
> -			ret = PTR_ERR(sgt);
> -			goto err_unpin;
> +		if (dmabuf->ops->map_dma_buf) {
> +			sgt = __map_dma_buf(attach, DMA_BIDIRECTIONAL);
> +			if (!sgt)
> +				sgt = ERR_PTR(-ENOMEM);
> +			if (IS_ERR(sgt)) {
> +				ret = PTR_ERR(sgt);
> +				goto err_unpin;
> +			}
>   		}
> +
>   		dma_resv_unlock(attach->dmabuf->resv);
>   		attach->sgt = sgt;
>   		attach->dir = DMA_BIDIRECTIONAL;
> @@ -1119,7 +1125,8 @@ struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *attach,
>   
>   	might_sleep();
>   
> -	if (WARN_ON(!attach || !attach->dmabuf))
> +	if (WARN_ON(!attach || !attach->dmabuf ||
> +		    !attach->dmabuf->ops->map_dma_buf))
>   		return ERR_PTR(-EINVAL);
>   
>   	dma_resv_assert_held(attach->dmabuf->resv);
> @@ -1195,7 +1202,8 @@ dma_buf_map_attachment_unlocked(struct dma_buf_attachment *attach,
>   
>   	might_sleep();
>   
> -	if (WARN_ON(!attach || !attach->dmabuf))
> +	if (WARN_ON(!attach || !attach->dmabuf ||
> +		    !attach->dmabuf->ops->map_dma_buf))
>   		return ERR_PTR(-EINVAL);
>   
>   	dma_resv_lock(attach->dmabuf->resv, NULL);
> @@ -1222,7 +1230,8 @@ void dma_buf_unmap_attachment(struct dma_buf_attachment *attach,
>   {
>   	might_sleep();
>   
> -	if (WARN_ON(!attach || !attach->dmabuf || !sg_table))
> +	if (WARN_ON(!attach || !attach->dmabuf ||
> +		    !attach->dmabuf->ops->unmap_dma_buf || !sg_table))
>   		return;
>   
>   	dma_resv_assert_held(attach->dmabuf->resv);
> @@ -1254,7 +1263,8 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
>   {
>   	might_sleep();
>   
> -	if (WARN_ON(!attach || !attach->dmabuf || !sg_table))
> +	if (WARN_ON(!attach || !attach->dmabuf ||
> +		    !attach->dmabuf->ops->unmap_dma_buf || !sg_table))
>   		return;
>   
>   	dma_resv_lock(attach->dmabuf->resv, NULL);
> @@ -1263,6 +1273,52 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
>   }
>   EXPORT_SYMBOL_NS_GPL(dma_buf_unmap_attachment_unlocked, "DMA_BUF");
>   
> +/**
> + * dma_buf_get_pfn_unlocked -
> + * @attach:	[in]	attachment to get pfn from
> + * @pgoff:	[in]	page offset of the buffer against the start of dma_buf
> + * @pfn:	[out]	returns the pfn of the buffer
> + * @max_order	[out]	returns the max mapping order of the buffer
> + */
> +int dma_buf_get_pfn_unlocked(struct dma_buf_attachment *attach,
> +			     pgoff_t pgoff, u64 *pfn, int *max_order)
> +{
> +	struct dma_buf *dmabuf = attach->dmabuf;
> +	int ret;
> +
> +	if (WARN_ON(!attach || !attach->dmabuf ||
> +		    !attach->dmabuf->ops->get_pfn))
> +		return -EINVAL;
> +
> +	/*
> +	 * Open:
> +	 *
> +	 * When dma_buf is dynamic but dma_buf move is disabled, the buffer
> +	 * should be pinned before use, See dma_buf_map_attachment() for
> +	 * reference.
> +	 *
> +	 * But for now no pin is intended inside dma_buf_get_pfn(), otherwise
> +	 * need another API to unpin the dma_buf. So just fail out this case.
> +	 */
> +	if (dma_buf_is_dynamic(attach->dmabuf) &&
> +	    !IS_ENABLED(CONFIG_DMABUF_MOVE_NOTIFY))
> +		return -ENOENT;
> +
> +	dma_resv_lock(attach->dmabuf->resv, NULL);
> +	ret = dmabuf->ops->get_pfn(attach, pgoff, pfn, max_order);
> +	/*
> +	 * Open:
> +	 *
> +	 * Is dma_resv_wait_timeout() needed? I assume no. The DMA buffer
> +	 * content synchronization could be done when the buffer is to be
> +	 * mapped by importer.
> +	 */
> +	dma_resv_unlock(attach->dmabuf->resv);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_NS_GPL(dma_buf_get_pfn_unlocked, "DMA_BUF");
> +
>   /**
>    * dma_buf_move_notify - notify attachments that DMA-buf is moving
>    *
> @@ -1662,7 +1718,7 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused)
>   		attach_count = 0;
>   
>   		list_for_each_entry(attach_obj, &buf_obj->attachments, node) {
> -			seq_printf(s, "\t%s\n", dev_name(attach_obj->dev));
> +			seq_printf(s, "\t%s\n", attach_obj->dev ? dev_name(attach_obj->dev) : NULL);
>   			attach_count++;
>   		}
>   		dma_resv_unlock(buf_obj->resv);
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index 36216d28d8bd..b16183edfb3a 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -194,6 +194,17 @@ struct dma_buf_ops {
>   	 * if the call would block.
>   	 */
>   
> +	/**
> +	 * @get_pfn:
> +	 *
> +	 * This is called by dma_buf_get_pfn(). It is used to get the pfn
> +	 * of the buffer positioned by the page offset against the start of
> +	 * the dma_buf. It can only be called if @attach has been called
> +	 * successfully.
> +	 */
> +	int (*get_pfn)(struct dma_buf_attachment *attach, pgoff_t pgoff,
> +		       u64 *pfn, int *max_order);
> +
>   	/**
>   	 * @release:
>   	 *
> @@ -629,6 +640,8 @@ dma_buf_map_attachment_unlocked(struct dma_buf_attachment *attach,
>   void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
>   				       struct sg_table *sg_table,
>   				       enum dma_data_direction direction);
> +int dma_buf_get_pfn_unlocked(struct dma_buf_attachment *attach,
> +			     pgoff_t pgoff, u64 *pfn, int *max_order);
>   
>   int dma_buf_mmap(struct dma_buf *, struct vm_area_struct *,
>   		 unsigned long);
Jason Gunthorpe Jan. 8, 2025, 1:23 p.m. UTC | #4
On Wed, Jan 08, 2025 at 09:01:46AM +0100, Christian König wrote:
> Am 07.01.25 um 15:27 schrieb Xu Yilun:
> > Introduce a new API for dma-buf importer, also add a dma_buf_ops
> > callback for dma-buf exporter. This API is for subsystem importers who
> > map the dma-buf to some user defined address space, e.g. for IOMMUFD to
> > map the dma-buf to userspace IOVA via IOMMU page table, or for KVM to
> > map the dma-buf to GPA via KVM MMU (e.g. EPT).
> > 
> > Currently dma-buf is only used to get DMA address for device's default
> > domain by using kernel DMA APIs. But for these new use-cases, importers
> > only need the pfn of the dma-buf resource to build their own mapping
> > tables.
> 
> As far as I can see I have to fundamentally reject this whole approach.
> 
> It's intentional DMA-buf design that we don't expose struct pages nor PFNs
> to the importer. Essentially DMA-buf only transports DMA addresses.
> 
> In other words the mapping is done by the exporter and *not* the importer.
> 
> What we certainly can do is to annotate those DMA addresses to a better
> specify in which domain they are applicable, e.g. if they are PCIe bus
> addresses or some inter device bus addresses etc...
> 
> But moving the functionality to map the pages/PFNs to DMA addresses into the
> importer is an absolutely clear NO-GO.

Oh?

Having the importer do the mapping is the correct way to operate the
DMA API, and the new API that Leon has built to fix the scatterlist
abuse in dmabuf relies on importer mapping as part of its
construction.

Why on earth do you want the exporter to map? That is completely
backwards and unworkable in many cases. The dysfunctional P2P support
in dmabuf is like that principally because of this.

That said, I don't think get_pfn() is an especially good interface,
but we will need to come up with something that passes the physical pfn
out.

Jason
Christian König Jan. 8, 2025, 1:44 p.m. UTC | #5
Am 08.01.25 um 14:23 schrieb Jason Gunthorpe:
> On Wed, Jan 08, 2025 at 09:01:46AM +0100, Christian König wrote:
>> Am 07.01.25 um 15:27 schrieb Xu Yilun:
>>> Introduce a new API for dma-buf importer, also add a dma_buf_ops
>>> callback for dma-buf exporter. This API is for subsystem importers who
>>> map the dma-buf to some user defined address space, e.g. for IOMMUFD to
>>> map the dma-buf to userspace IOVA via IOMMU page table, or for KVM to
>>> map the dma-buf to GPA via KVM MMU (e.g. EPT).
>>>
>>> Currently dma-buf is only used to get DMA address for device's default
>>> domain by using kernel DMA APIs. But for these new use-cases, importers
>>> only need the pfn of the dma-buf resource to build their own mapping
>>> tables.
>> As far as I can see I have to fundamentally reject this whole approach.
>>
>> It's intentional DMA-buf design that we don't expose struct pages nor PFNs
>> to the importer. Essentially DMA-buf only transports DMA addresses.
>>
>> In other words the mapping is done by the exporter and *not* the importer.
>>
>> What we certainly can do is to annotate those DMA addresses to a better
>> specify in which domain they are applicable, e.g. if they are PCIe bus
>> addresses or some inter device bus addresses etc...
>>
>> But moving the functionality to map the pages/PFNs to DMA addresses into the
>> importer is an absolutely clear NO-GO.
> Oh?
>
> Having the importer do the mapping is the correct way to operate the
> DMA API and the new API that Leon has built to fix the scatterlist
> abuse in dmabuf relies on importer mapping as part of it's
> construction.

Exactly on that I strongly disagree.

DMA-buf works by providing DMA addresses the importer can work with and 
*NOT* the underlying location of the buffer.

> Why on earth do you want the exporter to map?

Because the exporter owns the exported buffer and only the exporter 
knows how to correctly access it.

> That is completely backwards and unworkable in many cases. The disfunctional P2P support
> in dmabuf is like that principally because of this.

No, that is exactly what we need.

Using the scatterlist to transport the DMA addresses was clearly a 
mistake, but providing the DMA addresses by the exporter has proved many 
times to be the right approach.

Keep in mind that the exported buffer is not necessarily memory, but can 
also be MMIO or stuff which is only accessible through address-space 
windows, for which you can't create a PFN or struct page.

> That said, I don't think get_pfn() is an especially good interface,
> but we will need to come with something that passes the physical pfn
> out.

No, physical pfn is absolutely not a good way of passing the location of 
data around because it is limited to what the CPU sees as address space.

We have use cases where DMA-buf transports the location of CPU invisible 
data which only the involved devices can see.

Regards,
Christian.

>
> Jason
Jason Gunthorpe Jan. 8, 2025, 2:58 p.m. UTC | #6
On Wed, Jan 08, 2025 at 02:44:26PM +0100, Christian König wrote:

> > Having the importer do the mapping is the correct way to operate the
> > DMA API and the new API that Leon has built to fix the scatterlist
> > abuse in dmabuf relies on importer mapping as part of it's
> > construction.
> 
> Exactly on that I strongly disagree on.
> 
> DMA-buf works by providing DMA addresses the importer can work with and
> *NOT* the underlying location of the buffer.

The expectation is that the DMA API will be used to DMA map (most)
things, and the DMA API always works with a phys_addr_t/pfn
argument. Basically, everything that is not a private address space
should be supported by improving the DMA API. We are on course for
finally getting all the common cases like P2P and MMIO solved
here. That alone will take care of a lot.

For P2P cases we are going toward (PFN + P2P source information) as
input to the DMA API. The additional "P2P source information" provides
a good way for co-operating drivers to represent private address
spaces as well. Both importer and exporter can have full understanding
what is being mapped and do the correct things, safely.

So, no, we don't lose private address space support when moving to
importer mapping; in fact it works better because the importer gets
more information about what is going on.

I have imagined a staged approach where DMABUF gets a new API that
works with the new DMA API to do importer mapping with "P2P source
information" and a gradual conversion.

Exporter mapping falls down in too many cases already:

1) Private address spaces don't work fully well because many devices
need some indication of what address space is being used and a scatterlist
can't really properly convey that. If the DMABUF has a mixture of CPU
and private it becomes a PITA

2) Multi-path PCI can require the importer to make mapping decisions
unique to the device and program device specific information for the
multi-path. We are doing this in mlx5 today and have hacks because
DMABUF is destroying the information the importer needs to choose the
correct PCI path.

3) Importing devices need to know if they are working with PCI P2P
addresses during mapping because they need to do things like turn on
ATS on their DMA. As for multi-path we have the same hacks inside mlx5
today that assume DMABUFs are always P2P because we cannot determine
if things are P2P or not after being DMA mapped.

4) TPH bits need to be programmed into the importer device but are
derived based on the NUMA topology of the DMA target. The importer has
no idea what the DMA target actually was because the exporter mapping
destroyed that information.

5) iommufd and kvm are both using CPU addresses without DMA. No
exporter mapping is possible

Jason
Christian König Jan. 8, 2025, 3:25 p.m. UTC | #7
Am 08.01.25 um 15:58 schrieb Jason Gunthorpe:
> On Wed, Jan 08, 2025 at 02:44:26PM +0100, Christian König wrote:
>
>>> Having the importer do the mapping is the correct way to operate the
>>> DMA API and the new API that Leon has built to fix the scatterlist
>>> abuse in dmabuf relies on importer mapping as part of it's
>>> construction.
>> Exactly on that I strongly disagree on.
>>
>> DMA-buf works by providing DMA addresses the importer can work with and
>> *NOT* the underlying location of the buffer.
> The expectation is that the DMA API will be used to DMA map (most)
> things, and the DMA API always works with a physaddr_t/pfn
> argument. Basically, everything that is not a private address space
> should be supported by improving the DMA API. We are on course for
> finally getting all the common cases like P2P and MMIO solved
> here. That alone will take care of alot.

Well, from experience the DMA API has failed more often than it actually 
worked in the way required by drivers.

Especially the fact that we tried to hide architectural complexity in there 
instead of properly exposing limitations to drivers is not something I 
consider a good design approach.

So I view putting even more into it very critically.

> For P2P cases we are going toward (PFN + P2P source information) as
> input to the DMA API. The additional "P2P source information" provides
> a good way for co-operating drivers to represent private address
> spaces as well. Both importer and exporter can have full understanding
> what is being mapped and do the correct things, safely.

I can say from experience that this is clearly not going to work for all 
use cases.

It would mean that we have to pull a massive amount of driver-specific 
functionality into the DMA API.

Things like programming access windows for PCI BARs are completely driver 
specific and as far as I can see can't be part of the DMA API without 
things like callbacks.

With that in mind the DMA API would become a mid-layer between different 
drivers, and that is really not something you are suggesting, is it?

> So, no, we don't loose private address space support when moving to
> importer mapping, in fact it works better because the importer gets
> more information about what is going on.

Well, sounds like I wasn't able to voice my concern. Let me try again:

We should not give importers information they don't need. Especially not 
information about the backing store of buffers.

So importers getting more information about what's going on is a bad thing.

> I have imagined a staged approach were DMABUF gets a new API that
> works with the new DMA API to do importer mapping with "P2P source
> information" and a gradual conversion.

To make it clear, as maintainer of that subsystem I would reject such a 
step with all I have.

We have already gone down that road and it didn't work at all and was 
a really big pain to pull people back from it.

> Exporter mapping falls down in too many cases already:
>
> 1) Private addresses spaces don't work fully well because many devices
> need some indication what address space is being used and scatter list
> can't really properly convey that. If the DMABUF has a mixture of CPU
> and private it becomes a PITA

Correct, yes. That's why I said that scatterlist was a bad choice for 
the interface.

But exposing the backing store to importers and then letting them do 
whatever they want with it sounds like an even worse idea.

> 2) Multi-path PCI can require the importer to make mapping decisions
> unique to the device and program device specific information for the
> multi-path. We are doing this in mlx5 today and have hacks because
> DMABUF is destroying the information the importer needs to choose the
> correct PCI path.

That's why the exporter gets the struct device of the importer so that 
it can plan how those accesses are made. Where exactly is the problem 
with that?

When you have a use case which is not covered by the existing DMA-buf 
interfaces then please voice that to me and other maintainers instead of 
implementing some hack.

> 3) Importing devices need to know if they are working with PCI P2P
> addresses during mapping because they need to do things like turn on
> ATS on their DMA. As for multi-path we have the same hacks inside mlx5
> today that assume DMABUFs are always P2P because we cannot determine
> if things are P2P or not after being DMA mapped.

Why would you need ATS on PCI P2P and not for system memory accesses?

> 4) TPH bits needs to be programmed into the importer device but are
> derived based on the NUMA topology of the DMA target. The importer has
> no idea what the DMA target actually was because the exporter mapping
> destroyed that information.

Yeah, but again that is completely intentional.

I assume you mean TLP Processing Hints when you say TPH, and those should 
be part of the DMA addresses provided by the exporter.

That an importer tries to look behind the curtain and determine the 
NUMA placement and topology itself is clearly a no-go from the 
design perspective.

> 5) iommufd and kvm are both using CPU addresses without DMA. No
> exporter mapping is possible

We have customers using both KVM and XEN with DMA-buf, so I can clearly 
confirm that this isn't true.

Regards,
Christian.

>
> Jason
Jason Gunthorpe Jan. 8, 2025, 4:22 p.m. UTC | #8
On Wed, Jan 08, 2025 at 04:25:54PM +0100, Christian König wrote:
> Am 08.01.25 um 15:58 schrieb Jason Gunthorpe:
> > On Wed, Jan 08, 2025 at 02:44:26PM +0100, Christian König wrote:
> > 
> > > > Having the importer do the mapping is the correct way to operate the
> > > > DMA API and the new API that Leon has built to fix the scatterlist
> > > > abuse in dmabuf relies on importer mapping as part of it's
> > > > construction.
> > > Exactly on that I strongly disagree on.
> > > 
> > > DMA-buf works by providing DMA addresses the importer can work with and
> > > *NOT* the underlying location of the buffer.
> > The expectation is that the DMA API will be used to DMA map (most)
> > things, and the DMA API always works with a physaddr_t/pfn
> > argument. Basically, everything that is not a private address space
> > should be supported by improving the DMA API. We are on course for
> > finally getting all the common cases like P2P and MMIO solved
> > here. That alone will take care of alot.
> 
> Well, from experience the DMA API has failed more often than it actually
> worked in the way required by drivers.

The DMA API has been static and very hard to change in these ways for
a long time. I think Leon's new API will break through this and we
will finally be able to address these issues.

> > For P2P cases we are going toward (PFN + P2P source information) as
> > input to the DMA API. The additional "P2P source information" provides
> > a good way for co-operating drivers to represent private address
> > spaces as well. Both importer and exporter can have full understanding
> > what is being mapped and do the correct things, safely.
> 
> I can say from experience that this is clearly not going to work for all use
> cases.
> 
> It would mean that we have to pull a massive amount of driver specific
> functionality into the DMA API.

That isn't what I mean. There are two distinct parts, the means to
describe the source (PFN + P2P source information) that is compatible
with the DMA API, and the DMA API itself that works with a few general
P2P source information types.

Private source information would be detected by co-operating drivers
and go down driver private paths. It would be rejected by other
drivers. This broadly follows how the new API is working.

So here I mean you can use the same PFN + Source API between importer
and exporter and the importer can simply detect the special source and
do the private stuff. It is not shifting things under the DMA API, it
is building alongside it using compatible design approaches. You
would match the source information, cast it to a driver structure, do
whatever driver math is needed to compute the local DMA address and
then write it to the device. Nothing is hard or "not going to work"
here.
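
As a purely illustrative sketch of that pattern (none of these names exist
today; dma_buf_phys_vec, DMA_SOURCE_* and my_fabric_source are hypothetical
stand-ins for whatever the eventual PFN + source representation ends up being):

struct dma_buf_phys_vec {			/* hypothetical */
	unsigned long pfn;
	unsigned int source_type;
	void *source;				/* exporter-provided source info */
};

static int my_dev_map_entry(struct my_dev *mdev,
			    const struct dma_buf_phys_vec *vec,
			    u64 *dev_addr)
{
	switch (vec->source_type) {
	case DMA_SOURCE_CPU:
		/* plain memory: go through the regular DMA API mapping */
		return my_dev_dma_map_pfn(mdev, vec->pfn, dev_addr);
	case DMA_SOURCE_MY_FABRIC: {
		/* private interconnect: driver-private math, no IOMMU involved */
		struct my_fabric_source *src = vec->source;

		*dev_addr = src->fabric_base +
			    ((u64)(vec->pfn - src->first_pfn) << PAGE_SHIFT);
		return 0;
	}
	default:
		return -EOPNOTSUPP;		/* unknown source: reject */
	}
}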

> > So, no, we don't loose private address space support when moving to
> > importer mapping, in fact it works better because the importer gets
> > more information about what is going on.
> 
> Well, sounds like I wasn't able to voice my concern. Let me try again:
> 
> We should not give importers information they don't need. Especially not
> information about the backing store of buffers.
> 
> So that importers get more information about what's going on is a bad thing.

I strongly disagree because we are suffering today in mlx5 because of
this viewpoint. You cannot predict in advance what importers are going
to need. I already listed many examples where it does not work today
as is.

> > I have imagined a staged approach were DMABUF gets a new API that
> > works with the new DMA API to do importer mapping with "P2P source
> > information" and a gradual conversion.
> 
> To make it clear as maintainer of that subsystem I would reject such a step
> with all I have.

This is unexpected, so you want to just leave dmabuf broken? Do you
have any plan to fix it, to fix the misuse of the DMA API, and all
the problems I listed below? This is a big deal, it is causing real
problems today.

If it is going to be like this I think we will stop trying to use dmabuf
and do something simpler for vfio/kvm/iommufd :(

> We have already gone down that road and it didn't worked at all and
> was a really big pain to pull people back from it.

Nobody has really seriously tried to improve the DMA API before, so I
don't think this is true at all.

> > Exporter mapping falls down in too many cases already:
> > 
> > 1) Private addresses spaces don't work fully well because many devices
> > need some indication what address space is being used and scatter list
> > can't really properly convey that. If the DMABUF has a mixture of CPU
> > and private it becomes a PITA
> 
> Correct, yes. That's why I said that scatterlist was a bad choice for the
> interface.
> 
> But exposing the backing store to importers and then let them do whatever
> they want with it sounds like an even worse idea.

You keep saying this without real justification. To me it is a nanny
style of API design. But also I don't see how you can possibly fix the
above without telling the importer a lot more information.

> > 2) Multi-path PCI can require the importer to make mapping decisions
> > unique to the device and program device specific information for the
> > multi-path. We are doing this in mlx5 today and have hacks because
> > DMABUF is destroying the information the importer needs to choose the
> > correct PCI path.
> 
> That's why the exporter gets the struct device of the importer so that it
> can plan how those accesses are made. Where exactly is the problem with
> that?

A single struct device does not convey the multipath options. We have
multiple struct devices (and multiple PCI endpoints) doing DMA
concurrently under one driver.

Multipath always needs additional meta information in the importer
side to tell the device which path to select. A naked dma address is
not sufficient.

Today we guess that DMABUF will be using P2P and hack to choose a P2P
struct device to pass to the exporter. We need to know what is in the
dmabuf before we can choose which of the multiple struct devices the
driver has to use for DMA mapping.

But even simple CPU centric cases we will eventually want to select
the proper NUMA local PCI channel matching struct device for CPU only
buffers.

> When you have an use case which is not covered by the existing DMA-buf
> interfaces then please voice that to me and other maintainers instead of
> implementing some hack.

Do you have any suggestion for any of this then? We have a good plan
to fix this stuff and more. Many experts in their fields have agreed
on the different parts now. We haven't got to dmabuf because I had no
idea there would be an objection like this.

> > 3) Importing devices need to know if they are working with PCI P2P
> > addresses during mapping because they need to do things like turn on
> > ATS on their DMA. As for multi-path we have the same hacks inside mlx5
> > today that assume DMABUFs are always P2P because we cannot determine
> > if things are P2P or not after being DMA mapped.
> 
> Why would you need ATS on PCI P2P and not for system memory accesses?

ATS has a significant performance cost. It is mandatory for PCI P2P,
but ideally should be avoided for CPU memory.

> > 4) TPH bits needs to be programmed into the importer device but are
> > derived based on the NUMA topology of the DMA target. The importer has
> > no idea what the DMA target actually was because the exporter mapping
> > destroyed that information.
> 
> Yeah, but again that is completely intentional.
> 
> I assume you mean TLP processing hints when you say TPH and those should be
> part of the DMA addresses provided by the exporter.

Yes, but it is not part of the DMA addresses.

> That an importer tries to look behind the curtain and determines the NUMA
> placement and topology themselves is clearly a no-go from the design
> perspective.

I strongly disagree, this is important. Drivers need this information
in a future TPH/UIO/multipath PCI world.

> > 5) iommufd and kvm are both using CPU addresses without DMA. No
> > exporter mapping is possible
> 
> We have customers using both KVM and XEN with DMA-buf, so I can clearly
> confirm that this isn't true.

Today they are mmaping the dma-buf into a VMA and then using KVM's
follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable
dma-buf must have a CPU PFN.

Here Xu implements basically the same path, except without the VMA
indirection, and it is suddenly not OK? Illogical.

Jason
Xu Yilun Jan. 8, 2025, 5:56 p.m. UTC | #9
> > > 5) iommufd and kvm are both using CPU addresses without DMA. No
> > > exporter mapping is possible
> > 
> > We have customers using both KVM and XEN with DMA-buf, so I can clearly
> > confirm that this isn't true.
> 
> Today they are mmaping the dma-buf into a VMA and then using KVM's
> follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable
> dma-buf must have a CPU PFN.

Yes, the final target for KVM is still the CPU PFN, just with the help
of the CPU mapping table.

I also found that the Xen gntdev-dmabuf is calculating the pfn from the
mapped sgt.

From Christian's point of view, I assume only sgl->dma_address should be
used by importers, but in fact that is not the case. More importers are
'abusing' the sg dma helpers.

That said, there are existing needs for importers to know more about the
real buffer resource, for mapping, or even for more than mapping,
e.g. dmabuf_imp_grant_foreign_access()

Thanks,
Yilun
Simona Vetter Jan. 8, 2025, 6:44 p.m. UTC | #10
On Wed, Jan 08, 2025 at 12:22:27PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 08, 2025 at 04:25:54PM +0100, Christian König wrote:
> > Am 08.01.25 um 15:58 schrieb Jason Gunthorpe:
> > > I have imagined a staged approach were DMABUF gets a new API that
> > > works with the new DMA API to do importer mapping with "P2P source
> > > information" and a gradual conversion.
> > 
> > To make it clear as maintainer of that subsystem I would reject such a step
> > with all I have.
> 
> This is unexpected, so you want to just leave dmabuf broken? Do you
> have any plan to fix it, to fix the misuse of the DMA API, and all
> the problems I listed below? This is a big deal, it is causing real
> problems today.
> 
> If it going to be like this I think we will stop trying to use dmabuf
> and do something simpler for vfio/kvm/iommufd :(

As the gal who helped edit the og dma-buf spec 13 years ago, I think adding
pfn isn't a terrible idea. By design, dma-buf is the "everything is
optional" interface. And in the beginning, even consistent locking was
optional, but we've managed to fix that by now :-/

Where I do agree with Christian is that stuffing pfn support into the
dma_buf_attachment interfaces feels a bit wrong.

> > We have already gone down that road and it didn't worked at all and
> > was a really big pain to pull people back from it.
> 
> Nobody has really seriously tried to improve the DMA API before, so I
> don't think this is true at all.

Aside, I really hope this finally happens!

> > > 3) Importing devices need to know if they are working with PCI P2P
> > > addresses during mapping because they need to do things like turn on
> > > ATS on their DMA. As for multi-path we have the same hacks inside mlx5
> > > today that assume DMABUFs are always P2P because we cannot determine
> > > if things are P2P or not after being DMA mapped.
> > 
> > Why would you need ATS on PCI P2P and not for system memory accesses?
> 
> ATS has a significant performance cost. It is mandatory for PCI P2P,
> but ideally should be avoided for CPU memory.

Huh, I didn't know that. And yeah kinda means we've butchered the pci p2p
stuff a bit I guess ...

> > > 5) iommufd and kvm are both using CPU addresses without DMA. No
> > > exporter mapping is possible
> > 
> > We have customers using both KVM and XEN with DMA-buf, so I can clearly
> > confirm that this isn't true.
> 
> Today they are mmaping the dma-buf into a VMA and then using KVM's
> follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable
> dma-buf must have a CPU PFN.
> 
> Here Xu implements basically the same path, except without the VMA
> indirection, and it suddenly not OK? Illogical.

So the big difference is that for follow_pfn() you need mmu_notifier since
the mmap might move around, whereas with pfn smashed into
dma_buf_attachment you need dma_resv_lock rules, and the move_notify
callback if you go dynamic.

So I guess my first question is, which locking rules do you want here for
pfn importers?

If mmu notifiers are fine, then I think the current approach of follow_pfn
should be ok. But if you instead want dma_resv_lock rules (or the cpu mmap
somehow is an issue itself), then I think the clean design is to create a new
separate access mechanism just for that. It would be the 5th or so (kernel
vmap, userspace mmap, dma_buf_attach and driver private stuff like
virtio_dma_buf.c where you access your buffer with a uuid), so really not
a big deal.

And for non-contrived exporters we might be able to implement the other
access methods in terms of the pfn method generically, so this wouldn't
even be a terrible maintenance burden going forward. And meanwhile all the
contrived exporters just keep working as-is.

The other part is that cpu mmap is optional, and there's plenty of strange
exporters who don't implement it. But you can dma map the attachment into
plenty of devices. This tends to mostly be a thing on SoC devices with some
very funky memory. But I guess you don't care about these use-case, so
should be ok.

I couldn't come up with a good name for these pfn users, maybe
dma_buf_pfn_attachment? This does _not_ have a struct device, but maybe
some of these new p2p source specifiers (or a list of those which are
allowed, no idea how this would need to fit into the new dma api).

Cheers, Sima
Xu Yilun Jan. 8, 2025, 7:22 p.m. UTC | #11
On Wed, Jan 08, 2025 at 07:44:54PM +0100, Simona Vetter wrote:
> On Wed, Jan 08, 2025 at 12:22:27PM -0400, Jason Gunthorpe wrote:
> > On Wed, Jan 08, 2025 at 04:25:54PM +0100, Christian König wrote:
> > > Am 08.01.25 um 15:58 schrieb Jason Gunthorpe:
> > > > I have imagined a staged approach were DMABUF gets a new API that
> > > > works with the new DMA API to do importer mapping with "P2P source
> > > > information" and a gradual conversion.
> > > 
> > > To make it clear as maintainer of that subsystem I would reject such a step
> > > with all I have.
> > 
> > This is unexpected, so you want to just leave dmabuf broken? Do you
> > have any plan to fix it, to fix the misuse of the DMA API, and all
> > the problems I listed below? This is a big deal, it is causing real
> > problems today.
> > 
> > If it going to be like this I think we will stop trying to use dmabuf
> > and do something simpler for vfio/kvm/iommufd :(
> 
> As the gal who help edit the og dma-buf spec 13 years ago, I think adding
> pfn isn't a terrible idea. By design, dma-buf is the "everything is
> optional" interface. And in the beginning, even consistent locking was
> optional, but we've managed to fix that by now :-/
> 
> Where I do agree with Christian is that stuffing pfn support into the
> dma_buf_attachment interfaces feels a bit much wrong.

So it could be a dmabuf interface like mmap/vmap()? I was also wondering
about that. But I finally chose the dma_buf_attachment interface
to leverage the existing buffer pinning and move_notify.

> 
> > > We have already gone down that road and it didn't worked at all and
> > > was a really big pain to pull people back from it.
> > 
> > Nobody has really seriously tried to improve the DMA API before, so I
> > don't think this is true at all.
> 
> Aside, I really hope this finally happens!
> 
> > > > 3) Importing devices need to know if they are working with PCI P2P
> > > > addresses during mapping because they need to do things like turn on
> > > > ATS on their DMA. As for multi-path we have the same hacks inside mlx5
> > > > today that assume DMABUFs are always P2P because we cannot determine
> > > > if things are P2P or not after being DMA mapped.
> > > 
> > > Why would you need ATS on PCI P2P and not for system memory accesses?
> > 
> > ATS has a significant performance cost. It is mandatory for PCI P2P,
> > but ideally should be avoided for CPU memory.
> 
> Huh, I didn't know that. And yeah kinda means we've butchered the pci p2p
> stuff a bit I guess ...
> 
> > > > 5) iommufd and kvm are both using CPU addresses without DMA. No
> > > > exporter mapping is possible
> > > 
> > > We have customers using both KVM and XEN with DMA-buf, so I can clearly
> > > confirm that this isn't true.
> > 
> > Today they are mmaping the dma-buf into a VMA and then using KVM's
> > follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable
> > dma-buf must have a CPU PFN.
> > 
> > Here Xu implements basically the same path, except without the VMA
> > indirection, and it suddenly not OK? Illogical.
> 
> So the big difference is that for follow_pfn() you need mmu_notifier since
> the mmap might move around, whereas with pfn smashed into
> dma_buf_attachment you need dma_resv_lock rules, and the move_notify
> callback if you go dynamic.
> 
> So I guess my first question is, which locking rules do you want here for
> pfn importers?

follow_pfn() is unwanted for private MMIO, so dma_resv_lock.

> 
> If mmu notifiers is fine, then I think the current approach of follow_pfn
> should be ok. But if you instead dma_resv_lock rules (or the cpu mmap
> somehow is an issue itself), then I think the clean design is create a new

cpu mmap() is an issue; this series aims to eliminate userspace
mapping for private MMIO resources.

> separate access mechanism just for that. It would be the 5th or so (kernel
> vmap, userspace mmap, dma_buf_attach and driver private stuff like
> virtio_dma_buf.c where you access your buffer with a uuid), so really not
> a big deal.

OK, will think more about that.

Thanks,
Yilun

> 
> And for non-contrived exporters we might be able to implement the other
> access methods in terms of the pfn method generically, so this wouldn't
> even be a terrible maintenance burden going forward. And meanwhile all the
> contrived exporters just keep working as-is.
> 
> The other part is that cpu mmap is optional, and there's plenty of strange
> exporters who don't implement. But you can dma map the attachment into
> plenty devices. This tends to mostly be a thing on SoC devices with some
> very funky memory. But I guess you don't care about these use-case, so
> should be ok.
> 
> I couldn't come up with a good name for these pfn users, maybe
> dma_buf_pfn_attachment? This does _not_ have a struct device, but maybe
> some of these new p2p source specifiers (or a list of those which are
> allowed, no idea how this would need to fit into the new dma api).
> 
> Cheers, Sima
> -- 
> Simona Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
Xu Yilun Jan. 8, 2025, 11:06 p.m. UTC | #12
>  So I guess my first question is, which locking rules do you want here for
>  pfn importers?
> 
>  follow_pfn() is unwanted for private MMIO, so dma_resv_lock.
> 
>    As Sima explained you either have follow_pfn() and mmu_notifier or you
>    have DMA addresses and dma_resv lock / dma_fence.
> 
>    Just giving out PFNs without some lifetime associated with them is one of
>    the major problems we faced before and really not something you can do.

In this series I'm trying to make the exporter give out PFNs with lifetime
control via move_notify(). It may not be conceptually correct, but it seems
possible.
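
A rough sketch of what that could look like on the exporter side, using the
existing move_notify machinery (revoke_backing() is a hypothetical placeholder
for the exporter-specific teardown):

static void exporter_revoke(struct dma_buf *dmabuf)
{
	dma_resv_lock(dmabuf->resv, NULL);
	/* all dynamic importers get move_notify() and must drop their pfns */
	dma_buf_move_notify(dmabuf);
	revoke_backing(dmabuf->priv);	/* hypothetical: tear down the resource */
	dma_resv_unlock(dmabuf->resv);
}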

> 
> 
>  If mmu notifiers is fine, then I think the current approach of follow_pfn
>  should be ok. But if you instead dma_resv_lock rules (or the cpu mmap
>  somehow is an issue itself), then I think the clean design is create a new
> 
>  cpu mmap() is an issue, this series is aimed to eliminate userspace
>  mapping for private MMIO resources.
> 
>    Why?

OK, I can start from here.

It is about secure guests, or CoCo VMs. The memory and MMIO assigned
to this kind of guest are inaccessible to the host itself, by leveraging HW
encryption & access control technology (e.g. Intel TDX, AMD SEV-SNP ...).
This is to protect the tenant's data from being stolen by the CSP itself.

The impact is that when the host accesses the encrypted region, bad things
happen to the system, e.g. memory corruption, MCE. The kernel tries to
mitigate most of the impact by allocating and assigning user-unmappable
memory resources (private memory) to the guest, which prevents userspace
accidents. guest_memfd is the private memory provider that only allows
KVM to locate the page/pfn by fd + offset and create the secondary
page table (EPT, NPT ...); no host mapping, no VMA, no mmu_notifier. But
the lifecycle of the private memory is still controlled by guest_memfd.
On fallocate(fd, PUNCH_HOLE), the memory resource is revoked and KVM
is notified to unmap the corresponding EPT.

The further thought is that guest_memfd is also suitable for normal guests.
It makes no sense that the VMM must build a host mapping table before guest
access.

Now I'm trying to seek a similar way for private MMIO: an MMIO resource
provider that is exported as an fd. It controls the lifecycle of the
MMIO resource and notifies KVM when it is revoked. dma-buf seems to be a good
provider which has done most of the work; only the memory resource lookup by
fd + offset needs to be added.
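
Just to illustrate the shape of that fd + offset lookup on the KVM side
(kvm_private_mmio_fault() and its arguments are hypothetical;
dma_buf_get_pfn_unlocked() is the helper added by this series):

/* resolve a guest MMIO access to a host pfn: no VMA, no mmu_notifier */
static int kvm_private_mmio_fault(struct dma_buf_attachment *attach,
				  gfn_t gfn, gfn_t slot_base_gfn, u64 *pfn)
{
	pgoff_t pgoff = gfn - slot_base_gfn;	/* page offset into the exported region */
	int max_order;

	/* the pfn comes straight from the exporter; revocation arrives via move_notify() */
	return dma_buf_get_pfn_unlocked(attach, pgoff, pfn, &max_order);
}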

> 
>  separate access mechanism just for that. It would be the 5th or so (kernel
>  vmap, userspace mmap, dma_buf_attach and driver private stuff like
>  virtio_dma_buf.c where you access your buffer with a uuid), so really not
>  a big deal.
> 
>  OK, will think more about that.
> 
>    Please note that we have follow_pfn() + mmu_notifier working for KVM/XEN

follow_pfn() & mmu_notifier won't work here, because there is no VMA and no
host mapping table.

Thanks,
Yilun
>    with MMIO mappings and P2P. And that required exactly zero DMA-buf changes
>    :)
> 
>    I don't fully understand your use case, but I think it's quite likely that
>    we already have that working.
> 
>    Regards,
>    Christian.
Christian König Jan. 9, 2025, 8:09 a.m. UTC | #13
Answering to my own reply once more as a pure text mail.

I love AMD's mail servers.

Cheers,
Christian.

Am 09.01.25 um 09:04 schrieb Christian König:
> Am 08.01.25 um 20:22 schrieb Xu Yilun:
>> On Wed, Jan 08, 2025 at 07:44:54PM +0100, Simona Vetter wrote:
>>> On Wed, Jan 08, 2025 at 12:22:27PM -0400, Jason Gunthorpe wrote:
>>>> On Wed, Jan 08, 2025 at 04:25:54PM +0100, Christian König wrote:
>>>>> Am 08.01.25 um 15:58 schrieb Jason Gunthorpe:
>>>>>> I have imagined a staged approach were DMABUF gets a new API that
>>>>>> works with the new DMA API to do importer mapping with "P2P source
>>>>>> information" and a gradual conversion.
>>>>> To make it clear as maintainer of that subsystem I would reject such a step
>>>>> with all I have.
>>>> This is unexpected, so you want to just leave dmabuf broken? Do you
>>>> have any plan to fix it, to fix the misuse of the DMA API, and all
>>>> the problems I listed below? This is a big deal, it is causing real
>>>> problems today.
>>>>
>>>> If it going to be like this I think we will stop trying to use dmabuf
>>>> and do something simpler for vfio/kvm/iommufd :(
>>> As the gal who help edit the og dma-buf spec 13 years ago, I think adding
>>> pfn isn't a terrible idea. By design, dma-buf is the "everything is
>>> optional" interface. And in the beginning, even consistent locking was
>>> optional, but we've managed to fix that by now :-/
>
> Well you were also the person who mangled the struct page pointers in 
> the scatterlist because people were abusing this and getting a bloody 
> nose :)
>
>>> Where I do agree with Christian is that stuffing pfn support into the
>>> dma_buf_attachment interfaces feels a bit much wrong.
>> So it could a dmabuf interface like mmap/vmap()? I was also wondering
>> about that. But finally I start to use dma_buf_attachment interface
>> because of leveraging existing buffer pin and move_notify.
>
> Exactly that's the point, sharing pfn doesn't work with the pin and 
> move_notify interfaces because of the MMU notifier approach Sima 
> mentioned.
>
>>>>> We have already gone down that road and it didn't worked at all and
>>>>> was a really big pain to pull people back from it.
>>>> Nobody has really seriously tried to improve the DMA API before, so I
>>>> don't think this is true at all.
>>> Aside, I really hope this finally happens!
>
> Sorry my fault. I was not talking about the DMA API, but rather that 
> people tried to look behind the curtain of DMA-buf backing stores.
>
> In other words all the fun we had with scatterlists and that people 
> try to modify the struct pages inside of them.
>
> Improving the DMA API is something I really really hope for as well.
>
>>>>>> 3) Importing devices need to know if they are working with PCI P2P
>>>>>> addresses during mapping because they need to do things like turn on
>>>>>> ATS on their DMA. As for multi-path we have the same hacks inside mlx5
>>>>>> today that assume DMABUFs are always P2P because we cannot determine
>>>>>> if things are P2P or not after being DMA mapped.
>>>>> Why would you need ATS on PCI P2P and not for system memory accesses?
>>>> ATS has a significant performance cost. It is mandatory for PCI P2P,
>>>> but ideally should be avoided for CPU memory.
>>> Huh, I didn't know that. And yeah kinda means we've butchered the pci p2p
>>> stuff a bit I guess ...
>
> Hui? Why should ATS be mandatory for PCI P2P?
>
> We have tons of production systems using PCI P2P without ATS. And it's 
> the first time I hear that.
>
>>>>>> 5) iommufd and kvm are both using CPU addresses without DMA. No
>>>>>> exporter mapping is possible
>>>>> We have customers using both KVM and XEN with DMA-buf, so I can clearly
>>>>> confirm that this isn't true.
>>>> Today they are mmaping the dma-buf into a VMA and then using KVM's
>>>> follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable
>>>> dma-buf must have a CPU PFN.
>>>>
>>>> Here Xu implements basically the same path, except without the VMA
>>>> indirection, and it suddenly not OK? Illogical.
>>> So the big difference is that for follow_pfn() you need mmu_notifier since
>>> the mmap might move around, whereas with pfn smashed into
>>> dma_buf_attachment you need dma_resv_lock rules, and the move_notify
>>> callback if you go dynamic.
>>>
>>> So I guess my first question is, which locking rules do you want here for
>>> pfn importers?
>> follow_pfn() is unwanted for private MMIO, so dma_resv_lock.
>
> As Sima explained you either have follow_pfn() and mmu_notifier or you 
> have DMA addresses and dma_resv lock / dma_fence.
>
> Just giving out PFNs without some lifetime associated with them is one 
> of the major problems we faced before and really not something you can do.
>
>>> If mmu notifiers is fine, then I think the current approach of follow_pfn
>>> should be ok. But if you instead dma_resv_lock rules (or the cpu mmap
>>> somehow is an issue itself), then I think the clean design is create a new
>> cpu mmap() is an issue, this series is aimed to eliminate userspace
>> mapping for private MMIO resources.
>
> Why?
>
>>> separate access mechanism just for that. It would be the 5th or so (kernel
>>> vmap, userspace mmap, dma_buf_attach and driver private stuff like
>>> virtio_dma_buf.c where you access your buffer with a uuid), so really not
>>> a big deal.
>> OK, will think more about that.
>
> Please note that we have follow_pfn() + mmu_notifier working for 
> KVM/XEN with MMIO mappings and P2P. And that required exactly zero 
> DMA-buf changes :)
>
> I don't fully understand your use case, but I think it's quite likely 
> that we already have that working.
>
> Regards,
> Christian.
>
>> Thanks,
>> Yilun
>>
>>> And for non-contrived exporters we might be able to implement the other
>>> access methods in terms of the pfn method generically, so this wouldn't
>>> even be a terrible maintenance burden going forward. And meanwhile all the
>>> contrived exporters just keep working as-is.
>>>
>>> The other part is that cpu mmap is optional, and there's plenty of strange
>>> exporters who don't implement. But you can dma map the attachment into
>>> plenty devices. This tends to mostly be a thing on SoC devices with some
>>> very funky memory. But I guess you don't care about these use-case, so
>>> should be ok.
>>>
>>> I couldn't come up with a good name for these pfn users, maybe
>>> dma_buf_pfn_attachment? This does _not_ have a struct device, but maybe
>>> some of these new p2p source specifiers (or a list of those which are
>>> allowed, no idea how this would need to fit into the new dma api).
>>>
>>> Cheers, Sima
>>> -- 
>>> Simona Vetter
>>> Software Engineer, Intel Corporation
>>> http://blog.ffwll.ch
>
Leon Romanovsky Jan. 9, 2025, 9:28 a.m. UTC | #14
On Thu, Jan 09, 2025 at 10:10:01AM +0100, Christian König wrote:
> Am 08.01.25 um 17:22 schrieb Jason Gunthorpe:
> > [SNIP]
> > > > For P2P cases we are going toward (PFN + P2P source information) as
> > > > input to the DMA API. The additional "P2P source information" provides
> > > > a good way for co-operating drivers to represent private address
> > > > spaces as well. Both importer and exporter can have full understanding
> > > > what is being mapped and do the correct things, safely.
> > > I can say from experience that this is clearly not going to work for all use
> > > cases.
> > > 
> > > It would mean that we have to pull a massive amount of driver specific
> > > functionality into the DMA API.
> > That isn't what I mean. There are two distinct parts, the means to
> > describe the source (PFN + P2P source information) that is compatible
> > with the DMA API, and the DMA API itself that works with a few general
> > P2P source information types.
> > 
> > Private source information would be detected by co-operating drivers
> > and go down driver private paths. It would be rejected by other
> > drivers. This broadly follows how the new API is working.
> > 
> > So here I mean you can use the same PFN + Source API between importer
> > and exporter and the importer can simply detect the special source and
> > do the private stuff. It is not shifting things under the DMA API, it
> > is building along side it using compatible design approaches. You
> > would match the source information, cast it to a driver structure, do
> > whatever driver math is needed to compute the local DMA address and
> > then write it to the device. Nothing is hard or "not going to work"
> > here.
> 
> Well to be honest that sounds like an absolutely horrible design.
> 
> You are moving all responsibilities for inter driver handling into the
> drivers themselves without any supervision by the core OS.
> 
> Drivers are notoriously buggy and should absolutely not do things like that
> on their own.

IMHO, you and Jason give different meanings to the word "driver" in this
discussion. It is up to the subsystems to decide how to provide the new
API to the end drivers. It's worth reading this LWN article first.

Dancing the DMA two-step - https://lwn.net/Articles/997563/

> 
> Do you have pointers to this new API?

Latest version is here - https://lore.kernel.org/all/cover.1734436840.git.leon@kernel.org/
Unfortunately, I forgot to copy/paste the cover letter, but it can be seen in
the previous version https://lore.kernel.org/all/cover.1733398913.git.leon@kernel.org/.

The most complex example is the block layer implementation, which hides the
DMA API from block drivers. https://lore.kernel.org/all/cover.1730037261.git.leon@kernel.org/

Thanks
Simona Vetter Jan. 10, 2025, 7:24 p.m. UTC | #15
On Thu, Jan 09, 2025 at 01:56:02AM +0800, Xu Yilun wrote:
> > > > 5) iommufd and kvm are both using CPU addresses without DMA. No
> > > > exporter mapping is possible
> > > 
> > > We have customers using both KVM and XEN with DMA-buf, so I can clearly
> > > confirm that this isn't true.
> > 
> > Today they are mmaping the dma-buf into a VMA and then using KVM's
> > follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable
> > dma-buf must have a CPU PFN.
> 
> Yes, the final target for KVM is still the CPU PFN, just with the help
> of CPU mapping table.
> 
> I also found the xen gntdev-dmabuf is calculating pfn from mapped
> sgt.

See the comment, it's ok because it's a fake device with fake iommu and
the xen code has special knowledge to peek behind the curtain.
-Sima
 
> From Christian's point, I assume only sgl->dma_address should be
> used by importers, but in fact that's not the case. More importers are
> 'abusing' the sg dma helpers.
> 
> That said there are existing needs for importers to know more about the
> real buffer resource, for mapping, or even more than mapping,
> e.g. dmabuf_imp_grant_foreign_access()
Simona Vetter Jan. 10, 2025, 7:34 p.m. UTC | #16
On Thu, Jan 09, 2025 at 07:06:48AM +0800, Xu Yilun wrote:
> >  So I guess my first question is, which locking rules do you want here for
> >  pfn importers?
> > 
> >  follow_pfn() is unwanted for private MMIO, so dma_resv_lock.
> > 
> >    As Sima explained you either have follow_pfn() and mmu_notifier or you
> >    have DMA addresses and dma_resv lock / dma_fence.
> > 
> >    Just giving out PFNs without some lifetime associated with them is one of
> >    the major problems we faced before and really not something you can do.
> 
> I'm trying to make the exporter give out PFNs with lifetime control via
> move_notify() in this series. It may not be conceptually correct but it
> seems possible.
> 
> > 
> > 
> >  If mmu notifiers is fine, then I think the current approach of follow_pfn
> >  should be ok. But if you instead dma_resv_lock rules (or the cpu mmap
> >  somehow is an issue itself), then I think the clean design is create a new
> > 
> >  cpu mmap() is an issue, this series aims to eliminate userspace
> >  mapping for private MMIO resources.
> > 
> >    Why?
> 
> OK, I can start from here.
> 
> It is about the secure guest, or CoCo VM. The memory and MMIOs assigned
> to this kind of guest are inaccessible to the host itself, by leveraging HW
> encryption & access control technology (e.g. Intel TDX, AMD SEV-SNP ...).
> This is to protect the tenant's data from being stolen by the CSP itself.
>
> The impact is that when the host accesses the encrypted region, bad things
> happen to the system, e.g. memory corruption, MCE. The kernel tries to
> mitigate most of the impact by allocating and assigning user-unmappable
> memory resources (private memory) to the guest, which prevents userspace
> accidents. guest_memfd is the private memory provider that only allows
> KVM to locate the page/pfn by fd + offset and create the secondary
> page table (EPT, NPT...): no host mapping, no VMA, no mmu_notifier. But
> the lifecycle of the private memory is still controlled by guest_memfd.
> On fallocate(fd, PUNCH_HOLE), the memory resource is revoked and KVM
> is notified to unmap the corresponding EPT.
>
> The further thought is that guest_memfd is also suitable for a normal
> guest. It makes no sense that the VMM must build a host mapping table
> before the guest can access the memory.
>
> Now I'm trying to seek a similar way for private MMIO: an MMIO resource
> provider that is exported as an fd. It controls the lifecycle of the
> MMIO resource and notifies KVM when it is revoked. dma-buf seems to be a
> good provider which has done most of the work; we only need to extend it
> with memory resource lookup by fd + offset.

So if I'm getting this right, what you need from a functional pov is a
dma_buf_tdx_mmap? Because due to tdx restrictions, the normal dma_buf_mmap
is not going to work I guess?

Also another thing that's a bit tricky is that kvm kinda has a 3rd dma-buf
memory model:
- permanently pinned dma-buf, they never move
- dynamic dma-buf, they move through ->move_notify and importers can remap
- revocable dma-buf, which thus far only exist for pci mmio resources

Since we're leaning even more on that 3rd model I'm wondering whether we
should make it something official. Because the existing dynamic importers
do very much assume that re-acquiring the memory after move_notify will
work. But for the revocable use-case the entire point is that it will
never work.

I feel like that's a concept we need to make explicit, so that dynamic
importers can reject such memory if necessary.

So yeah there's a bunch of tricky lifetime questions that need to be
sorted out with proper design I think, and the current "let's just use pfn
directly" proposal hides them all under the rug. I agree with Christian
that we need a bit more care here.
-Sima

> 
> > 
> >  separate access mechanism just for that. It would be the 5th or so (kernel
> >  vmap, userspace mmap, dma_buf_attach and driver private stuff like
> >  virtio_dma_buf.c where you access your buffer with a uuid), so really not
> >  a big deal.
> > 
> >  OK, will think more about that.
> > 
> >    Please note that we have follow_pfn() + mmu_notifier working for KVM/XEN
> 
> follow_pfn & mmu_notifier won't work here, because there is no VMA and
> no host mapping table.
> 
> Thanks,
> Yilun
> >    with MMIO mappings and P2P. And that required exactly zero DMA-buf changes
> >    :)
> > 
> >    I don't fully understand your use case, but I think it's quite likely that
> >    we already have that working.
> > 
> >    Regards,
> >    Christian.
Jason Gunthorpe Jan. 10, 2025, 8:16 p.m. UTC | #17
On Fri, Jan 10, 2025 at 08:24:22PM +0100, Simona Vetter wrote:
> On Thu, Jan 09, 2025 at 01:56:02AM +0800, Xu Yilun wrote:
> > > > > 5) iommufd and kvm are both using CPU addresses without DMA. No
> > > > > exporter mapping is possible
> > > > 
> > > > We have customers using both KVM and XEN with DMA-buf, so I can clearly
> > > > confirm that this isn't true.
> > > 
> > > Today they are mmaping the dma-buf into a VMA and then using KVM's
> > > follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable
> > > dma-buf must have a CPU PFN.
> > 
> > Yes, the final target for KVM is still the CPU PFN, just with the help
> > of CPU mapping table.
> > 
> > I also found the xen gntdev-dmabuf is calculating pfn from mapped
> > sgt.
> 
> See the comment, it's ok because it's a fake device with fake iommu and
> the xen code has special knowledge to peek behind the curtain.

        /*
         * Now convert sgt to array of gfns without accessing underlying pages.
         * It is not allowed to access the underlying struct page of an sg table
         * exported by DMA-buf, but since we deal with special Xen dma device here
         * (not a normal physical one) look at the dma addresses in the sg table
         * and then calculate gfns directly from them.
         */
        for_each_sgtable_dma_page(sgt, &sg_iter, 0) {
                dma_addr_t addr = sg_page_iter_dma_address(&sg_iter);
                unsigned long pfn = bfn_to_pfn(XEN_PFN_DOWN(dma_to_phys(dev, addr)));

*barf*

Can we please all agree that this is a horrible abuse of the DMA API and
let's not point to it as some acceptable "solution"? KVM and iommufd do
not have fake struct devices with fake iommus.

Jason
Jason Gunthorpe Jan. 10, 2025, 8:38 p.m. UTC | #18
On Fri, Jan 10, 2025 at 08:34:55PM +0100, Simona Vetter wrote:

> So if I'm getting this right, what you need from a functional pov is a
> dma_buf_tdx_mmap? Because due to tdx restrictions, the normal dma_buf_mmap
> is not going to work I guess?

Don't want something TDX specific!

There is a general desire, and CC is one, but there are other
motivations like performance, to stop using VMAs and mmaps as a way to
exchange memory between two entities. Instead we want to use FDs.

We now have memfd and guestmemfd that are usable with
memfd_pin_folios() - this covers pinnable CPU memory.

And for a long time we had DMABUF which is for all the other wild
stuff, and it supports movable memory too.

So, the normal DMABUF semantics with reservation locking and move
notifiers seem workable to me here. They are broadly similar enough to
the mmu notifier locking that they can serve the same job of updating
page tables.
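
For reference, a minimal sketch of that importer side with today's
dynamic-attachment API; struct my_importer, my_importer_zap_mappings and the
rest of the importer glue are made-up placeholders, only the dma-buf calls
themselves are the existing ones:

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>
#include <linux/dma-direction.h>
#include <linux/err.h>

struct my_importer {
        struct device *dev;
        /* ... importer-side page-table state ... */
};

/* Placeholder: tear down whatever mappings the importer built. */
static void my_importer_zap_mappings(struct my_importer *imp)
{
}

/* Called by the exporter, with the reservation lock held, when the
 * backing storage is about to move or be revoked. */
static void my_move_notify(struct dma_buf_attachment *attach)
{
        struct my_importer *imp = attach->importer_priv;

        dma_resv_assert_held(attach->dmabuf->resv);
        my_importer_zap_mappings(imp);
}

static const struct dma_buf_attach_ops my_attach_ops = {
        .allow_peer2peer = true,        /* only if the importer handles P2P */
        .move_notify = my_move_notify,
};

static int my_importer_map(struct my_importer *imp, struct dma_buf *dmabuf)
{
        struct dma_buf_attachment *attach;
        struct sg_table *sgt;

        attach = dma_buf_dynamic_attach(dmabuf, imp->dev, &my_attach_ops, imp);
        if (IS_ERR(attach))
                return PTR_ERR(attach);

        dma_resv_lock(dmabuf->resv, NULL);
        sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
        if (IS_ERR(sgt)) {
                dma_resv_unlock(dmabuf->resv);
                dma_buf_detach(dmabuf, attach);
                return PTR_ERR(sgt);
        }
        /* ... program the device page tables from sgt, still under the
         * reservation lock, then drop it ... */
        dma_resv_unlock(dmabuf->resv);
        return 0;
}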

> Also another thing that's a bit tricky is that kvm kinda has a 3rd dma-buf
> memory model:
> - permanently pinned dma-buf, they never move
> - dynamic dma-buf, they move through ->move_notify and importers can remap
> - revocable dma-buf, which thus far only exist for pci mmio resources

I would like to see the importers be able to discover which one is
going to be used, because we have RDMA cases where we can support 1
and 3 but not 2.

revocable doesn't require page faulting as it is a terminal condition.

> Since we're leaning even more on that 3rd model I'm wondering whether we
> should make it something official. Because the existing dynamic importers
> do very much assume that re-acquiring the memory after move_notify will
> work. But for the revocable use-case the entire point is that it will
> never work.

> I feel like that's a concept we need to make explicit, so that dynamic
> importers can reject such memory if necessary.

It strikes me as strange that HW can do page faulting, so it can
support #2, but it can't handle a non-present fault?

> So yeah there's a bunch of tricky lifetime questions that need to be
> sorted out with proper design I think, and the current "let's just use pfn
> directly" proposal hides them all under the rug. 

I don't think these two things are connected. The lifetime model that
KVM needs to work with the EPT, and that VFIO needs for its MMIO,
definitely should be reviewed and evaluated.

But it is completely orthogonal to allowing iommufd and kvm to access
the CPU PFN to use in their mapping flows, instead of the
dma_addr_t.

What I want to get to is a replacement for scatter list in DMABUF that
is an array of arrays, roughly like:

  struct memory_chunks {
      struct memory_p2p_provider *provider;
      struct bio_vec addrs[];
  };
  int (*dmabuf_get_memory)(struct memory_chunks **chunks, size_t *num_chunks);

This can represent all forms of memory: P2P, private, CPU, etc and
would be efficient with the new DMA API.

This is similar to the structure BIO has, and it composes nicely with
a future pin_user_pages() and memfd_pin_folios().
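
A rough sketch of how an importer might walk such a description;
memory_chunks and memory_p2p_provider are only the hypothetical names from
above (plus an explicit nr_addrs count that the sketch leaves implicit),
nothing of this exists today:

#include <linux/bvec.h>
#include <linux/errno.h>
#include <linux/printk.h>

struct memory_p2p_provider;

struct memory_chunks {
        struct memory_p2p_provider *provider;
        size_t nr_addrs;
        struct bio_vec addrs[];
};

static int my_import_chunks(struct memory_chunks **chunks, size_t num_chunks,
                            const struct memory_p2p_provider *mine)
{
        size_t i, j;

        for (i = 0; i < num_chunks; i++) {
                struct memory_chunks *c = chunks[i];

                /* An importer that only does plain DMA would cleanly reject
                 * providers it does not recognise, e.g. private interconnect
                 * or confidential MMIO. */
                if (c->provider != mine)
                        return -EOPNOTSUPP;

                for (j = 0; j < c->nr_addrs; j++) {
                        /* Each bio_vec names one physically contiguous range;
                         * translate it to a local DMA/interconnect address as
                         * the provider dictates and program the device. */
                        pr_debug("chunk %zu[%zu]: %u bytes\n",
                                 i, j, c->addrs[j].bv_len);
                }
        }
        return 0;
}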

Jason
Jason Gunthorpe Jan. 10, 2025, 8:54 p.m. UTC | #19
On Thu, Jan 09, 2025 at 09:09:46AM +0100, Christian König wrote:
> Answering on my reply once more as pure text mail.

It is hard to do anything with your HTML mails :\

> > Well you were also the person who mangled the struct page pointers in
> > the scatterlist because people were abusing this and getting a bloody
> > nose :)

But a lot of this is because scatterlist is too limited; you actually
can't correctly describe anything except struct page backed CPU memory
in a scatterlist.

As soon as we can correctly describe everything in a data structure
these issues go away - or at least turn into a compatibility
exchange problem.

> > > > Where I do agree with Christian is that stuffing pfn support into the
> > > > dma_buf_attachment interfaces feels a bit much wrong.
> > > So it could a dmabuf interface like mmap/vmap()? I was also wondering
> > > about that. But finally I start to use dma_buf_attachment interface
> > > because of leveraging existing buffer pin and move_notify.
> > 
> > Exactly that's the point, sharing pfn doesn't work with the pin and
> > move_notify interfaces because of the MMU notifier approach Sima
> > mentioned.

Huh? 

mmu notifiers are for tracking changes to VMAs

pin/move_notify are for tracking changes to the underlying memory of
a DMABUF.

How does sharing the PFN vs the DMA address affect the pin/move_notify
lifetime rules at all?

> > > > > > > 3) Importing devices need to know if they are working with PCI P2P
> > > > > > > addresses during mapping because they need to do things like turn on
> > > > > > > ATS on their DMA. As for multi-path we have the same hacks inside mlx5
> > > > > > > today that assume DMABUFs are always P2P because we cannot determine
> > > > > > > if things are P2P or not after being DMA mapped.
> > > > > > Why would you need ATS on PCI P2P and not for system memory accesses?
> > > > > ATS has a significant performance cost. It is mandatory for PCI P2P,
> > > > > but ideally should be avoided for CPU memory.
> > > > Huh, I didn't know that. And yeah kinda means we've butchered the pci p2p
> > > > stuff a bit I guess ...
> > 
> > Hui? Why should ATS be mandatory for PCI P2P?

I should say "mandatory on some configurations"

If you need the iommu turned on, and you have a PCI switch in your
path, then ATS allows you to have full P2P bandwidth and retain full
IOMMU security.

> > We have tons of production systems using PCI P2P without ATS. And it's
> > the first time I hear that.

It is situational and topologically dependent. We have a very large
number of deployed systems now that rely on ATS for PCI P2P.

> > As Sima explained you either have follow_pfn() and mmu_notifier or you
> > have DMA addresses and dma_resv lock / dma_fence.
> > 
> > Just giving out PFNs without some lifetime associated with them is one
> > of the major problems we faced before and really not something you can
> > do.

Certainly I never imagined there would be no lifetime; I expect
anything coming out of the dmabuf interface to use the dma_resv lock,
fence and move_notify for lifetime management, regardless of how the
target memory is described.

> > > > separate access mechanism just for that. It would be the 5th or so (kernel
> > > > vmap, userspace mmap, dma_buf_attach and driver private stuff like
> > > > virtio_dma_buf.c where you access your buffer with a uuid), so really not
> > > > a big deal.
> > > OK, will think more about that.
> > 
> > Please note that we have follow_pfn() + mmu_notifier working for KVM/XEN
> > with MMIO mappings and P2P. And that required exactly zero DMA-buf
> > changes :)

> > I don't fully understand your use case, but I think it's quite likely
> > that we already have that working.

In Intel CC systems you cannot mmap secure memory or the system will
take a machine check.

You have to convey secure memory inside a FD entirely within the
kernel so that only an importer that understands how to handle secure
memory (such as KVM) is using it to avoid machine checking.

The patch series here should be thought of as the first part of this,
allowing PFNs to flow without VMAs. IMHO the second part of preventing
machine checks is not complete.

In the approach I have been talking about the secure memory would be
represented by a p2p_provider structure that is incompatible with
everything else. For instance importers that can only do DMA would
simply cleanly fail when presented with this memory.

Jason
Xu Yilun Jan. 12, 2025, 10:10 p.m. UTC | #20
On Fri, Jan 10, 2025 at 04:38:38PM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 10, 2025 at 08:34:55PM +0100, Simona Vetter wrote:
> 
> > So if I'm getting this right, what you need from a functional pov is a
> > dma_buf_tdx_mmap? Because due to tdx restrictions, the normal dma_buf_mmap

I'm not sure the word 'mmap' is the right one here.

It is kind of like the mapping from (FD + offset) to backend memory,
which is provided directly by the memory provider, rather than via a VMA
and the CPU page table.  Basically the VMA & CPU page table are for the
host to access the memory, but the VMM/host doesn't access most of the
guest memory, so why build them at all?

> > is not going to work I guess?
> 
> Don't want something TDX specific!
> 
> There is a general desire, and CC is one, but there are other
> motivations like performance, to stop using VMAs and mmaps as a way to
> exchange memory between two entities. Instead we want to use FDs.

Exactly.

> 
> We now have memfd and guestmemfd that are usable with
> memfd_pin_folios() - this covers pinnable CPU memory.
> 
> And for a long time we had DMABUF which is for all the other wild
> stuff, and it supports movable memory too.
> 
> So, the normal DMABUF semantics with reservation locking and move
> notifiers seem workable to me here. They are broadly similar enough to
> the mmu notifier locking that they can serve the same job of updating
> page tables.

Yes. With this new sharing model, the lifecycle of the shared memory/pfn/page
is directly controlled by the dma-buf exporter, not by the CPU mapping. So I
also think the reservation lock & move_notify work well for lifecycle control;
there is no conflict (nothing to do) with follow_pfn() & mmu_notifier.

> 
> > Also another thing that's a bit tricky is that kvm kinda has a 3rd dma-buf
> > memory model:
> > - permanently pinned dma-buf, they never move
> > - dynamic dma-buf, they move through ->move_notify and importers can remap
> > - revocable dma-buf, which thus far only exist for pci mmio resources
> 
> I would like to see the importers be able to discover which one is
> going to be used, because we have RDMA cases where we can support 1
> and 3 but not 2.
> 
> revocable doesn't require page faulting as it is a terminal condition.
> 
> > Since we're leaning even more on that 3rd model I'm wondering whether we
> > should make it something official. Because the existing dynamic importers
> > do very much assume that re-acquiring the memory after move_notify will
> > work. But for the revocable use-case the entire point is that it will
> > never work.
> 
> > I feel like that's a concept we need to make explicit, so that dynamic
> > importers can reject such memory if necessary.
> 
> It strikes me as strange that HW can do page faulting, so it can
> support #2, but it can't handle a non-present fault?
> 
> > So yeah there's a bunch of tricky lifetime questions that need to be
> > sorted out with proper design I think, and the current "let's just use pfn
> > directly" proposal hides them all under the rug. 
> 
> I don't think these two things are connected. The lifetime model that
> KVM needs to work with the EPT, and that VFIO needs for its MMIO,
> definitely should be reviewed and evaluated.
> 
> But it is completely orthogonal to allowing iommufd and kvm to access
> the CPU PFN to use in their mapping flows, instead of the
> dma_addr_t.
> 
> What I want to get to is a replacement for scatter list in DMABUF that
> is an array of arrays, roughly like:
> 
>   struct memory_chunks {
>       struct memory_p2p_provider *provider;
>       struct bio_vec addrs[];
>   };
>   int (*dmabuf_get_memory)(struct memory_chunks **chunks, size_t *num_chunks);

Maybe we need to specify which object the API is operating on,
struct dma_buf, or struct dma_buf_attachment, or a new attachment.

I think:

  int (*dmabuf_get_memory)(struct dma_buf_attachment *attach, struct memory_chunks **chunks, size_t *num_chunks);

works, but maybe a new attachment is conceptually clearer to
importers and harder to abuse?
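
To make the 'new attachment' option a little more concrete, a purely
hypothetical sketch; none of these names exist today, it only illustrates
'no struct device, no map_dma_buf(), resv-locked invalidate':

#include <linux/dma-buf.h>

struct dma_buf_pfn_attachment;
struct memory_chunks;                   /* hypothetical, see earlier sketch */

/* Hypothetical only: a pfn/chunk-oriented attachment that never goes
 * through map_dma_buf(), so it carries no struct device at all. */
struct dma_buf_pfn_attach_ops {
        /* Same role as move_notify: called under the dma_resv lock when
         * the backing memory is moved or revoked. */
        void (*invalidate)(struct dma_buf_pfn_attachment *attach);
};

struct dma_buf_pfn_attachment {
        struct dma_buf *dmabuf;
        const struct dma_buf_pfn_attach_ops *ops;
        void *importer_priv;
};

/* The getter would then naturally hang off this attachment type. */
int dma_buf_pfn_get_memory(struct dma_buf_pfn_attachment *attach,
                           struct memory_chunks **chunks, size_t *num_chunks);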

Thanks,
Yilun

> 
> This can represent all forms of memory: P2P, private, CPU, etc and
> would be efficient with the new DMA API.
> 
> This is similar to the structure BIO has, and it composes nicely with
> a future pin_user_pages() and memfd_pin_folios().
> 
> Jason
Simona Vetter Jan. 14, 2025, 2:44 p.m. UTC | #21
On Fri, Jan 10, 2025 at 04:38:38PM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 10, 2025 at 08:34:55PM +0100, Simona Vetter wrote:
> 
> > So if I'm getting this right, what you need from a functional pov is a
> > dma_buf_tdx_mmap? Because due to tdx restrictions, the normal dma_buf_mmap
> > is not going to work I guess?
> 
> Don't want something TDX specific!
> 
> There is a general desire, and CC is one, but there are other
> motivations like performance, to stop using VMAs and mmaps as a way to
> exchange memory between two entities. Instead we want to use FDs.
> 
> We now have memfd and guestmemfd that are usable with
> memfd_pin_folios() - this covers pinnable CPU memory.
> 
> And for a long time we had DMABUF which is for all the other wild
> stuff, and it supports movable memory too.
> 
> So, the normal DMABUF semantics with reservation locking and move
> notifiers seem workable to me here. They are broadly similar enough to
> the mmu notifier locking that they can serve the same job of updating
> page tables.

Yeah raw pfn is fine with me too. It might come with more "might not work
on this dma-buf" restrictions, but I can't think of a practical one right
now.

> > Also another thing that's a bit tricky is that kvm kinda has a 3rd dma-buf
> > memory model:
> > - permanently pinned dma-buf, they never move
> > - dynamic dma-buf, they move through ->move_notify and importers can remap
> > - revocable dma-buf, which thus far only exist for pci mmio resources
> 
> I would like to see the importers be able to discover which one is
> going to be used, because we have RDMA cases where we can support 1
> and 3 but not 2.
> 
> revocable doesn't require page faulting as it is a terminal condition.

Yeah this is why I think we should separate the dynamic from the revocable
use-cases clearly, because mixing them is going to result in issues.

> > Since we're leaning even more on that 3rd model I'm wondering whether we
> > should make it something official. Because the existing dynamic importers
> > do very much assume that re-acquiring the memory after move_notify will
> > work. But for the revocable use-case the entire point is that it will
> > never work.
> 
> > I feel like that's a concept we need to make explicit, so that dynamic
> > importers can reject such memory if necessary.
> 
> It strikes me as strange that HW can do page faulting, so it can
> support #2, but it can't handle a non-present fault?

I guess it's not a kernel issue, but userspace might want to know whether
this dma-buf could potentially nuke the entire gpu context. Because that's
what you get when we can't patch up a fault, which is the difference
between a revocable dma-buf and a dynamic dma-buf.

E.g. if a compositor gets a dma-buf it assumes that by just binding that
it will not risk gpu context destruction (unless you're out of memory and
everything is on fire anyway, and it's ok to die). But if a nasty client
app supplies a revocable dma-buf, then it can shoot down the higher
privileged compositor gpu workload with precision. Which is not great, so
maybe existing dynamic gpu importers should reject revocable dma-buf.
That's at least what I had in mind as a potential issue.

> > So yeah there's a bunch of tricky lifetime questions that need to be
> > sorted out with proper design I think, and the current "let's just use pfn
> > directly" proposal hides them all under the rug. 
> 
> I don't think these two things are connected. The lifetime model that
> KVM needs to work with the EPT, and that VFIO needs for its MMIO,
> definitely should be reviewed and evaluated.
> 
> But it is completely orthogonal to allowing iommufd and kvm to access
> the CPU PFN to use in their mapping flows, instead of the
> dma_addr_t.
> 
> What I want to get to is a replacement for scatter list in DMABUF that
> is an array of arrays, roughly like:
> 
>   struct memory_chunks {
>       struct memory_p2p_provider *provider;
>       struct bio_vec addrs[];
>   };
>   int (*dmabuf_get_memory)(struct memory_chunks **chunks, size_t *num_chunks);
> 
> This can represent all forms of memory: P2P, private, CPU, etc and
> would be efficient with the new DMA API.
> 
> This is similar to the structure BIO has, and it composes nicely with
> a future pin_user_pages() and memfd_pin_folios().

Since you mention pin here, I think that's another aspect of the revocable
vs dynamic question. Dynamic buffers are expected to sometimes just move
around for no reason, and importers must be able to cope.

For revocable exporters/importers I'd expect that movement is not
happening, meaning it's pinned until the single terminal revocation. And
maybe I read the kvm stuff wrong, but it reads more like the latter to me
when crawling through the pfn code.

Once we have the lifetime rules nailed then there's the other issue of how
to describe the memory, and my take for that is that once the dma-api has
a clear answer we'll just blindly adopt that one and done.

And currently with either dynamic attachments and dma_addr_t or through
fishing the pfn from the cpu pagetables there's some very clearly defined
lifetime and locking rules (which kvm might get wrong, I've seen some
discussions fly by where it wasn't doing a perfect job with reflecting pte
changes, but that was about access attributes iirc). If we add something
new, we need clear rules and not just "here's the kvm code that uses it".
That's how we've done dma-buf at first, and it was a terrible mess of
mismatched expectations.
-Sima
Jason Gunthorpe Jan. 14, 2025, 5:31 p.m. UTC | #22
On Tue, Jan 14, 2025 at 03:44:04PM +0100, Simona Vetter wrote:

> E.g. if a compositor gets a dma-buf it assumes that by just binding that
> it will not risk gpu context destruction (unless you're out of memory and
> everything is on fire anyway, and it's ok to die). But if a nasty client
> app supplies a revocable dma-buf, then it can shoot down the higher
> privileged compositor gpu workload with precision. Which is not great, so
> maybe existing dynamic gpu importers should reject revocable dma-buf.
> That's at least what I had in mind as a potential issue.

I see, so it is not that they can't handle a non-present fault it is
just that the non-present effectively turns into a crash of the
context and you want to avoid the crash. It makes sense to me to
negotiate this as part of the API.

> > This is similar to the structure BIO has, and it composes nicely with
> > a future pin_user_pages() and memfd_pin_folios().
> 
> Since you mention pin here, I think that's another aspect of the revocable
> vs dynamic question. Dynamic buffers are expected to sometimes just move
> around for no reason, and importers must be able to cope.

Yes, and we have importers that can tolerate dynamic and those that
can't. Though those that can't tolerate it can often implement revoke.

I view your list as a cascade:
 1) Fully pinned can never be changed so long as the attach is present
 2) Fully pinned, but can be revoked. Revoked is a fatal condition and
    the importer is allowed to experience an error
 3) Fully dynamic and always present. Support for move, and
    restartable fault, is required

Today in RDMA we ask the exporter if it is 1 or 3 and allow different
things. I've seen the GPU side start to offer 1 more often as it has
significant performance wins.

> For revocable exporters/importers I'd expect that movement is not
> happening, meaning it's pinned until the single terminal revocation. And
> maybe I read the kvm stuff wrong, but it reads more like the latter to me
> when crawling through the pfn code.

kvm should be fully faultable and it should be able to handle move. It
handles move today using the mmu notifiers after all.

kvm would need to interact with the dmabuf reservations on its page
fault path.

iommufd cannot be faultable and it would only support revoke. For VFIO
revoke would not be fully terminal as VFIO can unrevoke too
(sigh).  If we make revoke special I'd like to eventually include
unrevoke for this reason.

> Once we have the lifetime rules nailed then there's the other issue of how
> to describe the memory, and my take for that is that once the dma-api has
> a clear answer we'll just blindly adopt that one and done.

This is what I hope, we are not there yet, first Leon's series needs
to get merged then we can start on making the DMA API P2P safe without
any struct page. From there it should be clear what direction things
go in.

DMABUF would return pfns annotated with whatever matches the DMA API,
and the importer would be able to inspect the PFNs to learn
information like their P2Pness, CPU mappability or whatever.

I'm pushing for the extra struct, and Christoph has been thinking
about searching a maple tree on the PFN. We need to see what works best.
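
For the maple tree variant, a minimal sketch of what registration and lookup
could look like; the registry and memory_p2p_provider are made up, while
mtree_store_range()/mtree_load() are the existing maple tree helpers:

#include <linux/maple_tree.h>
#include <linux/gfp.h>

struct memory_p2p_provider;             /* hypothetical, as discussed above */

/* Hypothetical registry mapping PFN ranges to their provider. */
static DEFINE_MTREE(p2p_provider_tree);

static int p2p_provider_register(struct memory_p2p_provider *provider,
                                 unsigned long first_pfn,
                                 unsigned long last_pfn)
{
        return mtree_store_range(&p2p_provider_tree, first_pfn, last_pfn,
                                 provider, GFP_KERNEL);
}

/* Importer side: given a PFN handed out by DMABUF, look up what it is
 * (P2P, private interconnect, plain CPU memory, ...). */
static struct memory_p2p_provider *p2p_provider_lookup(unsigned long pfn)
{
        return mtree_load(&p2p_provider_tree, pfn);
}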

> And currently with either dynamic attachments and dma_addr_t or through
> fishing the pfn from the cpu pagetables there's some very clearly defined
> lifetime and locking rules (which kvm might get wrong, I've seen some
> discussions fly by where it wasn't doing a perfect job with reflecting pte
> changes, but that was about access attributes iirc). 

Wouldn't surprise me, mmu notifiers are very complex all around. We've
had bugs already where the mm doesn't signal the notifiers at the
right points.

> If we add something
> new, we need clear rules and not just "here's the kvm code that uses it".
> That's how we've done dma-buf at first, and it was a terrible mess of
> > mismatched expectations.

Yes, that would be wrong. It should be self-defined within dmabuf and
kvm should adapt to it, move semantics and all.

My general desire is to move all of RDMA's MR process away from
scatterlist and work using only the new DMA API. This will save *huge*
amounts of memory in common workloads and be the basis for non-struct
page DMA support, including P2P.

Jason
Simona Vetter Jan. 15, 2025, 8:55 a.m. UTC | #23
On Tue, Jan 14, 2025 at 01:31:03PM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 14, 2025 at 03:44:04PM +0100, Simona Vetter wrote:
> 
> > E.g. if a compositor gets a dma-buf it assumes that by just binding that
> > it will not risk gpu context destruction (unless you're out of memory and
> > everything is on fire anyway, and it's ok to die). But if a nasty client
> > app supplies a revocable dma-buf, then it can shoot down the higher
> > privileged compositor gpu workload with precision. Which is not great, so
> > maybe existing dynamic gpu importers should reject revocable dma-buf.
> > That's at least what I had in mind as a potential issue.
> 
> I see, so it is not that they can't handle a non-present fault it is
> just that the non-present effectively turns into a crash of the
> context and you want to avoid the crash. It makes sense to me to
> negotiate this as part of the API.

Yup.

> > > This is similar to the structure BIO has, and it composes nicely with
> > > a future pin_user_pages() and memfd_pin_folios().
> > 
> > Since you mention pin here, I think that's another aspect of the revocable
> > vs dynamic question. Dynamic buffers are expected to sometimes just move
> > around for no reason, and importers must be able to cope.
> 
> Yes, and we have importers that can tolerate dynamic and those that
> can't. Though those that can't tolerate it can often implement revoke.
> 
> I view your list as a cascade:
>  1) Fully pinned can never be changed so long as the attach is present
>  2) Fully pinned, but can be revoked. Revoked is a fatal condition and
>     the importer is allowed to experience an error
>  3) Fully dynamic and always present. Support for move, and
>     restartable fault, is required
> 
> Today in RDMA we ask the exporter if it is 1 or 3 and allow different
> things. I've seen the GPU side start to offer 1 more often as it has
> significant performance wins.

I'm not entirely sure a cascade is the right thing or whether we should
have importers just specify a bitmask of what is acceptable to them. But
that little detail aside, this sounds like what I was thinking of.
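
Either way it probably boils down to something this small at attach time;
all the names below are hypothetical, it only illustrates the negotiation:

#include <linux/bits.h>
#include <linux/types.h>

/* Hypothetical lifetime classes, matching the cascade above. */
#define DMA_BUF_LIFETIME_PINNED         BIT(0)  /* never moves, never revoked */
#define DMA_BUF_LIFETIME_REVOCABLE      BIT(1)  /* pinned until terminal revoke */
#define DMA_BUF_LIFETIME_DYNAMIC        BIT(2)  /* may move, importer re-faults */

/* The importer declares what it can cope with, the exporter picks one
 * or fails the attach. */
static bool lifetime_compatible(unsigned int exporter_lifetime,
                                unsigned int importer_accepts)
{
        /* e.g. RDMA without ODP: PINNED | REVOCABLE but not DYNAMIC;
         * a compositor's GPU import: PINNED | DYNAMIC but not REVOCABLE. */
        return exporter_lifetime & importer_accepts;
}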

> > For revocable exporters/importers I'd expect that movement is not
> > happening, meaning it's pinned until the single terminal revocation. And
> > maybe I read the kvm stuff wrong, but it reads more like the latter to me
> > when crawling through the pfn code.
> 
> kvm should be fully faultable and it should be able to handle move. It
> handles move today using the mmu notifiers after all.
> 
> kvm would need to interact with the dmabuf reservations on its page
> fault path.
> 
> iommufd cannot be faultable and it would only support revoke. For VFIO
> revoke would not be fully terminal as VFIO can unrevoke too
> (sigh).  If we make revoke special I'd like to eventually include
> unrevoke for this reason.
> 
> > Once we have the lifetime rules nailed then there's the other issue of how
> > to describe the memory, and my take for that is that once the dma-api has
> > a clear answer we'll just blindly adopt that one and done.
> 
> This is what I hope, we are not there yet, first Leon's series needs
> to get merged then we can start on making the DMA API P2P safe without
> any struct page. From there it should be clear what direction things
> go in.
> 
> DMABUF would return pfns annotated with whatever matches the DMA API,
> and the importer would be able to inspect the PFNs to learn
> information like their P2Pness, CPU mappability or whatever.

I think for 90% of exporters pfn would fit, but there's some really funny
ones where you cannot get a cpu pfn by design. So we need to keep the
pfn-less interfaces around. But ideally for the pfn-capable exporters we'd
have helpers/common code that just implements all the other interfaces.

I'm not worried about the resulting fragmentation, because dma-buf is the
"everythig is optional" api. We just need to make sure we have really
clear semantic api contracts, like with the revocable vs dynamic/moveable
distinction. Otherwise things go boom when an importer gets an unexpected
dma-buf fd, and we pass this across security boundaries a lot.

> I'm pushing for the extra struct, and Christoph has been thinking
> about searching a maple tree on the PFN. We need to see what works best.
> 
> > And currently with either dynamic attachments and dma_addr_t or through
> > fishing the pfn from the cpu pagetables there's some very clearly defined
> > lifetime and locking rules (which kvm might get wrong, I've seen some
> > discussions fly by where it wasn't doing a perfect job with reflecting pte
> > changes, but that was about access attributes iirc). 
> 
> Wouldn't surprise me, mmu notifiers are very complex all around. We've
> had bugs already where the mm doesn't signal the notifiers at the
> right points.
> 
> > If we add something
> > new, we need clear rules and not just "here's the kvm code that uses it".
> > That's how we've done dma-buf at first, and it was a terrible mess of
> > mismatched expectations.
> 
> Yes, that would be wrong. It should be self-defined within dmabuf and
> kvm should adapt to it, move semantics and all.

Ack.

I feel like we have a plan here. Summary from my side:

- Sort out pin vs revocable vs dynamic/moveable semantics, make sure
  importers have no surprises.

- Adopt whatever new dma-api datastructures pops out of the dma-api
  reworks.

- Add pfn based memory access as yet another optional access method, with
  helpers so that exporters who support this get all the others for free.

I don't see a strict ordering between these, imo should be driven by
actual users of the dma-buf api.

Already done:

- dmem cgroup so that we can resource control device pinnings just landed
  in drm-next for next merge window. So that part is imo sorted and we can
  charge ahead with pinning into device memory without all the concerns
  we've had years ago when discussing that for p2p dma-buf support.

  But there might be some work so that we can p2p pin without requiring
  dynamic attachments only, I haven't checked whether that needs
  adjustment in dma-buf.c code or just in exporters.

Anything missing?

I feel like this is small enough that m-l archives is good enough. For
some of the bigger projects we do in graphics we sometimes create entries
in our kerneldoc with wip design consensus and things like that. But
feels like overkill here.

> My general desire is to move all of RDMA's MR process away from
> scatterlist and work using only the new DMA API. This will save *huge*
> amounts of memory in common workloads and be the basis for non-struct
> page DMA support, including P2P.

Yeah a more memory-efficient structure than the scatterlist would be
really nice. That would even benefit the very special dma-buf exporters
where you cannot get a pfn, only the dma_addr_t, although most of those
(all maybe even?) have contig buffers, so your scatterlist has only one
entry. But it would definitely be nice from a design pov.

Aside: A way to more efficiently create compressed scatterlists would be
neat too, because a lot of drivers hand-roll that and it's a bit brittle
and kinda silly to duplicate. With compressed I mean just a single entry
for a contig range, in practice thanks to huge pages/folios and allocators
trying to hand out contig ranges if there's plenty of memory that saves a
lot of memory too. But currently it's a bit a pain to construct these
efficiently, mostly it's just a two-pass approach and then trying to free
surplus memory or krealloc to fit. Also I don't have good ideas here, but
dma-api folks might have some from looking at too many things that create
scatterlists.
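
For illustration, the hand-rolled pattern is usually some variant of this
two-pass coalescing; a plain sketch not lifted from any driver, and counting
exactly in the first pass avoids the krealloc/shrink dance:

#include <linux/mm.h>
#include <linux/slab.h>

struct contig_range {
        unsigned long pfn;
        unsigned long nr_pages;
};

/* Pass 1 counts runs of physically contiguous pages, pass 2 fills one
 * entry per run, so huge folios / lucky allocations collapse into a
 * single entry instead of one entry per page. */
static struct contig_range *coalesce_pages(struct page **pages, size_t nr,
                                           size_t *nr_ranges)
{
        struct contig_range *ranges;
        size_t i, count = 0;

        for (i = 0; i < nr; i++)
                if (!i || page_to_pfn(pages[i]) != page_to_pfn(pages[i - 1]) + 1)
                        count++;

        ranges = kcalloc(count, sizeof(*ranges), GFP_KERNEL);
        if (!ranges)
                return NULL;

        *nr_ranges = 0;
        for (i = 0; i < nr; i++) {
                if (i && page_to_pfn(pages[i]) == page_to_pfn(pages[i - 1]) + 1) {
                        ranges[*nr_ranges - 1].nr_pages++;
                } else {
                        ranges[*nr_ranges].pfn = page_to_pfn(pages[i]);
                        ranges[*nr_ranges].nr_pages = 1;
                        (*nr_ranges)++;
                }
        }
        return ranges;
}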
-Sima
Christoph Hellwig Jan. 15, 2025, 9:32 a.m. UTC | #24
On Wed, Jan 15, 2025 at 09:55:29AM +0100, Simona Vetter wrote:
> I think for 90% of exporters pfn would fit, but there's some really funny
> ones where you cannot get a cpu pfn by design. So we need to keep the
> pfn-less interfaces around. But ideally for the pfn-capable exporters we'd
> have helpers/common code that just implements all the other interfaces.

There is no way to have dma address without a PFN in Linux right now.
How would you generate them?  That implies you have an IOMMU that can
generate IOVAs for something that doesn't have a physical address at
all.

Or do you mean some that don't have pages associated with them, and
thus have pfn_valid fail on them?  They still have a PFN, just not
one that is valid to use in most of the Linux MM.
Christian König Jan. 15, 2025, 9:38 a.m. UTC | #25
Am 10.01.25 um 21:54 schrieb Jason Gunthorpe:
> [SNIP]
>>> I don't fully understand your use case, but I think it's quite likely
>>> that we already have that working.
> In Intel CC systems you cannot mmap secure memory or the system will
> take a machine check.
>
> You have to convey secure memory inside a FD entirely within the
> kernel so that only an importer that understands how to handle secure
> memory (such as KVM) is using it to avoid machine checking.
>
> The patch series here should be thought of as the first part of this,
> allowing PFNs to flow without VMAs. IMHO the second part of preventing
> machine checks is not complete.
>
> In the approach I have been talking about the secure memory would be
> represented by a p2p_provider structure that is incompatible with
> everything else. For instance importers that can only do DMA would
> simply cleanly fail when presented with this memory.

That's a rather interesting use case, but not something I consider 
fitting for the DMA-buf interface.

See, DMA-buf is meant to be used between drivers to allow DMA access on
shared buffers.

What you try to do here instead is to give memory in the form of a file 
descriptor to a client VM to do things like CPU mapping and giving it to 
drivers to do DMA etc...

As far as I can see this memory is secured by some kind of MMU which 
makes sure that even the host CPU can't access it without causing a 
machine check exception.

That sounds more something for the TEE driver instead of anything 
DMA-buf should be dealing with.

Regards,
Christian.

>
> Jason
Jason Gunthorpe Jan. 15, 2025, 1:34 p.m. UTC | #26
On Wed, Jan 15, 2025 at 10:32:34AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 15, 2025 at 09:55:29AM +0100, Simona Vetter wrote:
> > I think for 90% of exporters pfn would fit, but there's some really funny
> > ones where you cannot get a cpu pfn by design. So we need to keep the
> > pfn-less interfaces around. But ideally for the pfn-capable exporters we'd
> > have helpers/common code that just implements all the other interfaces.
> 
> There is no way to have dma address without a PFN in Linux right now.
> How would you generate them?  That implies you have an IOMMU that can
> generate IOVAs for something that doesn't have a physical address at
> all.
> 
> Or do you mean some that don't have pages associated with them, and
> thus have pfn_valid fail on them?  They still have a PFN, just not
> one that is valid to use in most of the Linux MM.

He is talking about private interconnect hidden inside clusters of
devices.

Ie the system may have many GPUs and those GPUs have their own private
interconnect between them. It is not PCI, and packets don't transit
through the CPU SOC at all, so the IOMMU is not involved.

DMA can happen on that private interconnect, but from a Linux
perspective it is not DMA API DMA, and the addresses used to describe
it are not part of the CPU address space. The initiating device will
have a way to choose which path the DMA goes through when setting up
the DMA.

Effectively if you look at one of these complex GPU systems you will
have a physical bit of memory, say HBM memory located on the GPU. Then
from an OS perspective we have a whole bunch of different
representations/addresses of that very same memory. A Grace/Hopper
system would have at least three different addresses (ZONE_MOVABLE, a
PCI MMIO aperture, and a global NVLink address). Each different
address effectively represents a different physical interconnect
multipath, and an initiator may have three different routes/addresses
available to reach the same physical target memory.

Part of what DMABUF needs to do is pick which multi-path will be used
between exporter/importer.

So, the hack today has the DMABUF exporter GPU driver understand the
importer is part of the private interconnect and then generate a
scatterlist with a NULL sg_page, but a sg_dma_addr that encodes the
private global address on the hidden interconnect. Somehow the
importer knows this has happened and programs its HW to use the
private path.

Jason
Jason Gunthorpe Jan. 15, 2025, 1:38 p.m. UTC | #27
On Wed, Jan 15, 2025 at 10:38:00AM +0100, Christian König wrote:
> Am 10.01.25 um 21:54 schrieb Jason Gunthorpe:
> > [SNIP]
> > > > I don't fully understand your use case, but I think it's quite likely
> > > > that we already have that working.
> > In Intel CC systems you cannot mmap secure memory or the system will
> > take a machine check.
> > 
> > You have to convey secure memory inside a FD entirely within the
> > kernel so that only an importer that understands how to handle secure
> > memory (such as KVM) is using it to avoid machine checking.
> > 
> > The patch series here should be thought of as the first part of this,
> > allowing PFNs to flow without VMAs. IMHO the second part of preventing
> > machine checks is not complete.
> > 
> > In the approach I have been talking about the secure memory would be
> > represented by a p2p_provider structure that is incompatible with
> > everything else. For instance importers that can only do DMA would
> > simply cleanly fail when presented with this memory.
> 
> That's a rather interesting use case, but not something I consider fitting
> for the DMA-buf interface.

To recast the problem statement, it is basically the same as your
device private interconnects. There are certain devices that
understand how to use this memory, and if they work together they can
access it.
 
> See, DMA-buf is meant to be used between drivers to allow DMA access on
> shared buffers.

They are shared, just not with everyone :)
 
> What you try to do here instead is to give memory in the form of a file
> descriptor to a client VM to do things like CPU mapping and giving it to
> drivers to do DMA etc...

How is this paragraph different from the first? It is a shared buffer
that we want real DMA and CPU "DMA" access to. It is "private" so
things that don't understand the interconnect rules cannot access it.

> That sounds more something for the TEE driver instead of anything DMA-buf
> should be dealing with.

Has nothing to do with TEE.

Jason
Christian König Jan. 15, 2025, 1:46 p.m. UTC | #28
Explicitly replying as text mail once more.

I just love the AMD mail servers :(

Christian.

Am 15.01.25 um 14:45 schrieb Christian König:
> Am 15.01.25 um 14:38 schrieb Jason Gunthorpe:
>> On Wed, Jan 15, 2025 at 10:38:00AM +0100, Christian König wrote:
>>> Am 10.01.25 um 21:54 schrieb Jason Gunthorpe:
>>>> [SNIP]
>>>>>> I don't fully understand your use case, but I think it's quite likely
>>>>>> that we already have that working.
>>>> In Intel CC systems you cannot mmap secure memory or the system will
>>>> take a machine check.
>>>>
>>>> You have to convey secure memory inside a FD entirely within the
>>>> kernel so that only an importer that understands how to handle secure
>>>> memory (such as KVM) is using it to avoid machine checking.
>>>>
>>>> The patch series here should be thought of as the first part of this,
>>>> allowing PFNs to flow without VMAs. IMHO the second part of preventing
>>>> machine checks is not complete.
>>>>
>>>> In the approach I have been talking about the secure memory would be
>>>> represented by a p2p_provider structure that is incompatible with
>>>> everything else. For instance importers that can only do DMA would
>>>> simply cleanly fail when presented with this memory.
>>> That's a rather interesting use case, but not something I consider fitting
>>> for the DMA-buf interface.
>> To recast the problem statement, it is basically the same as your
>> device private interconnects. There are certain devices that
>> understand how to use this memory, and if they work together they can
>> access it.
>>   
>>> See, DMA-buf is meant to be used between drivers to allow DMA access on
>>> shared buffers.
>> They are shared, just not with everyone :)
>>   
>>> What you try to do here instead is to give memory in the form of a file
>>> descriptor to a client VM to do things like CPU mapping and giving it to
>>> drivers to do DMA etc...
>> How is this paragraph different from the first? It is a shared buffer
>> that we want real DMA and CPU "DMA" access to. It is "private" so
>> things that don't understand the interconnect rules cannot access it.
>
> Yeah, but it's private to the exporter. And a very fundamental rule of 
> DMA-buf is that the exporter is the one in control of things.
>
> So for example it is illegal for an importer to setup CPU mappings to 
> a buffer. That's why we have dma_buf_mmap() which redirects mmap() 
> requests from the importer to the exporter.
>
> In your use case here the importer wants to be in control and do 
> things like both CPU as well as DMA mappings.
>
> As far as I can see that is really not an use case which fits DMA-buf 
> in any way.
>
>>> That sounds more something for the TEE driver instead of anything DMA-buf
>>> should be dealing with.
>> Has nothing to do with TEE.
>
> Why?
>
> Regards,
> Christian.
>
>> Jason
>
Jason Gunthorpe Jan. 15, 2025, 2:14 p.m. UTC | #29
On Wed, Jan 15, 2025 at 02:46:56PM +0100, Christian König wrote:
> Explicitly replying as text mail once more.
> 
> I just love the AMD mail servers :(

:( This is hard

> > Yeah, but it's private to the exporter. And a very fundamental rule of
> > DMA-buf is that the exporter is the one in control of things.

I've said a few times now, I don't think we can build the kind of
buffer sharing framework we need to solve all the problems with this
philosophy. It is also inefficient with the new DMA API.

I think it is backwards looking and we need to move forwards with
fixing the fundamental API issues which motivated that design.

> > So for example it is illegal for an importer to setup CPU mappings to a
> > buffer. That's why we have dma_buf_mmap() which redirects mmap()
> > requests from the importer to the exporter.

Like this, in a future no-scatter list world I would want to make this
safe. The importer will have enough information to know if CPU
mappings exist and are safe to use under what conditions.

There is no reason the importer should not be able to CPU access
memory that is HW permitted to be CPU accessible.

If the importer needs CPU access and the exporter cannot provide it
then the attachment simply fails.

Saying CPU access is banned 100% of the time is not a helpful position
when we have use cases that need it.

> > As far as I can see that is really not an use case which fits DMA-buf in
> > any way.

I really don't want to make a dmabuf2 - everyone would have to
implement it, including all the GPU drivers if they want to work with
RDMA. I don't think this makes any sense compared to incrementally
evolving dmabuf with more optional capabilities.

> > > > That sounds more something for the TEE driver instead of anything DMA-buf
> > > > should be dealing with.
> > > Has nothing to do with TEE.
> > 
> > Why?

The Linux TEE framework is not used as part of confidential compute.

CC already has guest memfd for holding its private CPU memory.

This is about confidential MMIO memory.

This is also not just about the KVM side, the VM side also has issues
with DMABUF and CC - only co-operating devices can interact with the
VM side "encrypted" memory and there needs to be a negotiation as part
of all buffer setup what the mutual capability is. :\ swiotlb hides
some of this some times, but confidential P2P is currently unsolved.

Jason
Christian König Jan. 15, 2025, 2:30 p.m. UTC | #30
Sending it as text mail to the mailing lists once more :(

Christian.

Am 15.01.25 um 15:29 schrieb Christian König:
> Am 15.01.25 um 15:14 schrieb Jason Gunthorpe:
>> On Wed, Jan 15, 2025 at 02:46:56PM +0100, Christian König wrote:
>> [SNIP]
>>>> Yeah, but it's private to the exporter. And a very fundamental rule of
>>>> DMA-buf is that the exporter is the one in control of things.
>> I've said a few times now, I don't think we can build the kind of
>> buffer sharing framework we need to solve all the problems with this
>> philosophy. It is also inefficient with the new DMA API.
>>
>> I think it is backwards looking and we need to move forwards with
>> fixing the fundamental API issues which motivated that design.
>
> And that's what I clearly see completely differently.
>
> Those rules are not something we came up with because of some
> limitation of the DMA-API, but rather from experience working with
> different device drivers and especially their developers.
>
> Applying and enforcing those restrictions is an absolute must-have
> for extending DMA-buf.
>
>>>> So for example it is illegal for an importer to setup CPU mappings to a
>>>> buffer. That's why we have dma_buf_mmap() which redirects mmap()
>>>> requests from the importer to the exporter.
>> Like this, in a future no-scatter list world I would want to make this
>> safe. The importer will have enough information to know if CPU
>> mappings exist and are safe to use under what conditions.
>>
>> There is no reason the importer should not be able to CPU access
>> memory that is HW permitted to be CPU accessible.
>>
>> If the importer needs CPU access and the exporter cannot provide it
>> then the attachment simply fails.
>>
>> Saying CPU access is banned 100% of the time is not a helpful position
>> when we have use cases that need it.
>
> That approach is an absolutely no-go from my side.
>
> We have fully intentionally implemented the restriction that importers 
> can't CPU access DMA-buf for both kernel and userspace without going 
> through the exporter because of design requirements and a lot of 
> negative experience with exactly this approach.
>
> This is not something which is discuss-able in any way possible.
>
>>>> As far as I can see that is really not an use case which fits DMA-buf in
>>>> any way.
>> I really don't want to make a dmabuf2 - everyone would have to
>> implement it, including all the GPU drivers if they want to work with
>> RDMA. I don't think this makes any sense compared to incrementally
>> evolving dmabuf with more optional capabilities.
>
> The point is that a dmabuf2 would most likely be rejected as well or 
> otherwise run into the same issues we have seen before.
>
>>>>>> That sounds more something for the TEE driver instead of anything DMA-buf
>>>>>> should be dealing with.
>>>>> Has nothing to do with TEE.
>>>> Why?
>> The Linux TEE framework is not used as part of confidential compute.
>>
>> CC already has guest memfd for holding its private CPU memory.
>
> Where is that coming from and how is it used?
>
>> This is about confidential MMIO memory.
>
> Who is the exporter and who is the importer of the DMA-buf in this use 
> case?
>
>> This is also not just about the KVM side, the VM side also has issues
>> with DMABUF and CC - only co-operating devices can interact with the
>> VM side "encrypted" memory and there needs to be a negotiation as part
>> of all buffer setup what the mutual capability is. :\ swiotlb hides
>> some of this some times, but confidential P2P is currently unsolved.
>
> Yes and it is documented by now how that is supposed to happen with 
> DMA-buf.
>
> As far as I can see there is not much new approach here.
>
> Regards,
> Christian.
>
>> Jason
>
Jason Gunthorpe Jan. 15, 2025, 3:10 p.m. UTC | #31
On Wed, Jan 15, 2025 at 03:30:47PM +0100, Christian König wrote:

> > Those rules are not something we came up with because of some limitation
> > of the DMA-API, but rather from experience working with different device
> > drivers and especially their developers.

I would say it stems from the use of scatter list. You do not have
enough information exchanged between exporter and importer to
implement something sane and correct. At that point being restrictive
is a reasonable path.

Because of scatterlist, developers don't have APIs that correctly solve
the problems they want to solve, so of course things get into a mess.

> > Applying and enforcing those restrictions is an absolutely mandatory
> > must-have for extending DMA-buf.

You said to come to the maintainers with the problems, here are the
problems. Your answer is don't use dmabuf.

That doesn't make the problems go away :(

> > > I really don't want to make a dmabuf2 - everyone would have to
> > > implement it, including all the GPU drivers if they want to work with
> > > RDMA. I don't think this makes any sense compared to incrementally
> > > evolving dmabuf with more optional capabilities.
> > 
> > The point is that a dmabuf2 would most likely be rejected as well or
> > otherwise run into the same issues we have seen before.

You'd need to be much more concrete and technical in your objections
to cause a rejection. "We tried something else before and it didn't
work" won't cut it.

There is a very simple problem statement here, we need a FD handle for
various kinds of memory, with a lifetime model that fits a couple of
different use cases. The exporter and importer need to understand what
type of memory it is and what rules apply to working with it. The
required importers are more general than just simple PCI DMA.

I feel like this is already exactly DMABUF's mission.

Besides, you have been saying to go do this in TEE or whatever, how is
that any different from dmabuf2?

> > > > > > > That sounds more something for the TEE driver instead of anything DMA-buf
> > > > > > > should be dealing with.
> > > > > > Has nothing to do with TEE.
> > > > > Why?
> > > The Linux TEE framework is not used as part of confidential compute.
> > > 
> > > CC already has guest memfd for holding its private CPU memory.
> > 
> > Where is that coming from and how is it used?

What do you mean? guest memfd is the result of years of negotiation in
the mm and x86 arch subsystems :( It is used like a normal memfd, and
we now have APIs in KVM and iommufd to directly intake and map from a
memfd. I expect guestmemfd will soon grow some more generic
dmabuf-like lifetime callbacks to avoid pinning - it already has some
KVM specific APIs IIRC.

But it is 100% exclusively focused on CPU memory and nothing else.

> > > This is about confidential MMIO memory.
> > 
> > Who is the exporter and who is the importer of the DMA-buf in this use
> > case?

In this case Xu is exporting MMIO from VFIO and importing to KVM and
iommufd.

> > This is also not just about the KVM side, the VM side also has issues
> > with DMABUF and CC - only co-operating devices can interact with the
> > VM side "encrypted" memory and there needs to be a negotiation as part
> > of all buffer setup what the mutual capability is. :\ swiotlb hides
> > some of this some times, but confidential P2P is currently unsolved.
> 
> Yes and it is documented by now how that is supposed to happen with
> DMA-buf.

I doubt that. It is complex and not fully solved in the core code
today. Many scenarios do not work correctly, devices don't even exist
yet that can exercise the hard paths. This is a future problem :(

Jason
Jason Gunthorpe Jan. 15, 2025, 5:09 p.m. UTC | #32
On Wed, Jan 15, 2025 at 05:34:23PM +0100, Christian König wrote:
>    Granted, let me try to improve this.
>    Here is a real world example of one of the issues we ran into and why
>    CPU mappings of importers are redirected to the exporter.
>    We have a good bunch of different exporters who track the CPU mappings
>    of their backing store using address_space objects in one way or
>    another and then uses unmap_mapping_range() to invalidate those CPU
>    mappings.
>    But when importers get the PFNs of the backing store they can look
>    behind the curtain and directly insert this PFN into the CPU page
>    tables.
>    We had literally tons of cases like this where driver developers caused
>    access-after-free issues because the importer created CPU mappings on
>    their own without the exporter knowing about it.
>    This is just one example of what we ran into. In addition to that
>    basically the whole synchronization between drivers was overhauled as
>    well because we found that we can't trust importers to always do the
>    right thing.

But this, fundamentally, is importers creating attachments and then
*ignoring the lifetime rules of DMABUF*. If you created an attachment,
got a move and *ignored the move* because you put the PFN in your own
VMA, then you are not following the attachment lifetime rules!

To implement this safely the driver would need to use
unmap_mapping_range() on the driver VMA inside the move callback, and
hook into the VMA fault callback to re-attach the dmabuf.
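
Something like this, as a very rough sketch - every my_importer_* name
here is made up, it is not what this series does, just an illustration
that the existing locking rules are enough:

static void my_importer_move_notify(struct dma_buf_attachment *attach)
{
	struct my_importer *imp = attach->importer_priv;

	/* the exporter calls move_notify with the reservation lock held */
	dma_resv_assert_held(attach->dmabuf->resv);

	/* tear down our own CPU mappings so nobody keeps using a stale PFN */
	unmap_mapping_range(imp->mapping, 0, imp->size, 1);

	/* plus invalidate any DMA/HW mappings, wait for fences, etc. */
	my_importer_invalidate_hw(imp);
}

static vm_fault_t my_importer_fault(struct vm_fault *vmf)
{
	struct my_importer *imp = vmf->vma->vm_private_data;
	unsigned long pfn;
	vm_fault_t ret;

	/* re-validate the backing store under the reservation lock */
	dma_resv_lock(imp->attach->dmabuf->resv, NULL);
	pfn = my_importer_revalidate_pfn(imp, vmf->pgoff); /* made up */
	ret = vmf_insert_pfn(vmf->vma, vmf->address, pfn);
	dma_resv_unlock(imp->attach->dmabuf->resv);

	return ret;
}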

This is where I get into trouble with your argument. It is not that
the API has an issue, or that the rules of the API are not logical and
functional.

You are arguing that even a logical and functional API will be
mis-used by some people and that reviewers will not catch
it.

Honestly, I don't think that is consistent with the kernel philosophy.

We should do our best to make APIs that are hard to mis-use, but if we
can't achieve that it doesn't mean we stop and give up on problems;
we go into the world of APIs that can be mis-used and rely on the
reviewer system to catch it.

>    You can already turn both a TEE allocated buffer as well as a memfd
>    into a DMA-buf. So basically TEE and memfd already provides different
>    interfaces which go beyond what DMA-buf does and allows.

>    In other words if you want to do things like direct I/O to block or
>    network devices you can mmap() your memfd and do this while at the same
>    time send your memfd as DMA-buf to your GPU, V4L or neural accelerator.
>    Would this be a way you could work with as well? 

I guess, but this still requires creating a dmabuf2 type thing with
very similar semantics and then shimming dmabuf2 to 1 for DRM consumers.

I don't see how it addresses your fundamental concern that the
semantics we want are an API that is too easy for drivers to abuse.

And being more functional and efficient we'd just see people wanting
to use dmabuf2 directly instead of bothering with 1.

>    separate file descriptor representing the private MMIO which iommufd
>    and KVM uses but you can turn it into a DMA-buf whenever you need to
>    give it to a DMA-buf importer?

Well, it would end up just being used everywhere. I think one person
wanted to use this with DRM drivers for some reason, but RDMA would
use the dmabuf2 directly because it will be much more efficient than
using scatterlist.

Honestly, I'd much rather extend dmabuf and see DRM institute some
rule that DRM drivers may not use XYZ parts of the improvement. Like
maybe we could use some symbol namespaces to really enforce it
eg. MODULE_IMPORT_NS(DMABUF_NOT_FOR_DRM_USAGE)
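
Roughly, as a sketch - the symbol and namespace names here are
invented, only the mechanism is real:

/* dma-buf core: the raw-address attach API goes into its own namespace;
 * dma_buf_attach_phys() is a placeholder name */
EXPORT_SYMBOL_NS_GPL(dma_buf_attach_phys, DMABUF_PHYS_IMPORTER);

/* only the importers we allow (VFIO/RDMA/KVM/iommufd) opt in explicitly */
MODULE_IMPORT_NS(DMABUF_PHYS_IMPORTER);

modpost then complains about any module that imports the symbol
without declaring the namespace.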

Some of the improvements we want like the revoke rules for lifetime
seem to be agreeable.

Block the API that gives you the non-scatterlist attachment. Only
VFIO/RDMA/kvm/iommufd will get to implement it.

> In this case Xu is exporting MMIO from VFIO and importing to KVM and
> iommufd.
> 
>    So basically a portion of a PCIe BAR is imported into iommufd?

And KVM. We need to get the CPU address into KVM and IOMMU page
tables. It must go through a private FD path and not a VMA because of
the CC rules about machine check I mentioned earlier. The private FD
must have a lifetime model to ensure we don't UAF the PCIe BAR memory.

Someone else had some use case where they wanted to put the VFIO MMIO
PCIe BAR into a DMABUF and ship it into a GPU driver for
somethingsomething virtualization but I didn't understand it.

>    Let's just say that both the ARM guys as well as the GPU people already
>    have some pretty "interesting" ways of doing digital rights management
>    and content protection.

Well, that is TEE stuff, TEE and CC are not the same thing, though
they have some high level conceptual overlap.

In a certain sense CC is a TEE that is built using KVM instead of the
TEE subsystem. Using KVM and integrating with the MM brings a whole
set of unique challenges that TEE got to avoid.. 

Jason
Christoph Hellwig Jan. 16, 2025, 5:33 a.m. UTC | #33
On Wed, Jan 15, 2025 at 09:34:19AM -0400, Jason Gunthorpe wrote:
> > Or do you mean some that don't have pages associated with them, and
> > thus have pfn_valid fail on them?  They still have a PFN, just not
> > one that is valid to use in most of the Linux MM.
> 
> He is talking about private interconnect hidden inside clusters of
> devices.
> 
> Ie the system may have many GPUs and those GPUs have their own private
> interconnect between them. It is not PCI, and packets don't transit
> through the CPU SOC at all, so the IOMMU is not involved.
> 
> DMA can happen on that private interconnect, but from a Linux
> perspective it is not DMA API DMA, and the addresses used to describe
> it are not part of the CPU address space. The initiating device will
> have a way to choose which path the DMA goes through when setting up
> the DMA.

So how is this in any way relevant to dma_buf which operates on
a dma_addr_t right now and thus by definition can't be used for
these?
Jason Gunthorpe Jan. 16, 2025, 1:28 p.m. UTC | #34
On Thu, Jan 16, 2025 at 06:33:48AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 15, 2025 at 09:34:19AM -0400, Jason Gunthorpe wrote:
> > > Or do you mean some that don't have pages associated with them, and
> > > thus have pfn_valid fail on them?  They still have a PFN, just not
> > > one that is valid to use in most of the Linux MM.
> > 
> > He is talking about private interconnect hidden inside clusters of
> > devices.
> > 
> > Ie the system may have many GPUs and those GPUs have their own private
> > interconnect between them. It is not PCI, and packets don't transit
> > through the CPU SOC at all, so the IOMMU is not involved.
> > 
> > DMA can happen on that private interconnect, but from a Linux
> > perspective it is not DMA API DMA, and the addresses used to describe
> > it are not part of the CPU address space. The initiating device will
> > have a way to choose which path the DMA goes through when setting up
> > the DMA.
> 
> So how is this in any way relevant to dma_buf which operates on
> a dma_addr_t right now and thus by definition can't be used for
> these?

Oh, well since this private stuff exists the DRM folks implemented it
and used dmabuf to hook it together tough the uAPI. To make it work it
abuses scatterlist and dma_addr_t to carry this other information.

Thus the pushback in this thread: we can't naively fix up dmabuf because
this non-dma_addr_t abuse exists and is uAPI. So it also needs some
improved architecture to move forward :\

Basically, the scatterlist in dmabuf API does not follow any of the
normal rules scatterlist should follow. It is not using the semantics
of dma_addr_t even though that is the type. It is really just an array
of addr/len pairs - we can't reason about it in the normal way :(

Jason
Jason Gunthorpe Jan. 16, 2025, 4:07 p.m. UTC | #35
On Thu, Jan 16, 2025 at 04:13:13PM +0100, Christian König wrote:
>> But this, fundamentally, is importers creating attachments and then
>> *ignoring the lifetime rules of DMABUF*. If you created an attachment,
>> got a move and *ignored the move* because you put the PFN in your own
>> VMA, then you are not following the attachment lifetime rules!
> 
>    Move notify is solely for informing the importer that they need to
>    refresh their DMA mappings and eventually block for ongoing DMA to
>    end.

I feel that it is a bit pedantic to say DMA and CPU are somehow
different. The DMABUF API gives you a scatterlist, it is reasonable to
say that move invalidates the entire scatterlist, CPU and DMA equally.

>    This semantics doesn't work well for CPU mappings because you need to
>    hold the reservation lock to make sure that the information stays valid
>    and you can't hold a lock while returning from a page fault.

Sure, I imagine hooking up a VMA is hard - but that doesn't change my
point. The semantics can be reasonable and well defined.

>    Yeah and exactly that is something we don't want to allow because it
>    means that every importer needs to get things right to prevent exporters
>    from running into problems.

You can make the same argument about the DMA address. We should just
get rid of DMABUF entirely because people are going to mis-use it and
wrongly implement the invalidation callback.

I have no idea why GPU drivers want to implement mmap of dmabuf, that
seems to be a uniquely GPU thing. We are not going to be doing stuff
like that in KVM and other places. And we can implement the
invalidation callback with correct locking. Why should we all be
punished because DRM drivers seem to have this weird historical mmap
problem?

I don't think that is a reasonable way to approach building a general
purpose linux kernel API.
 
>    Well it's not mis-used, it's just a very bad design decision to let
>    every importer implement functionality which actually belongs in a
>    single point in the exporter.

Well, this is the problem. Sure it may be that importers should not
implement mmap - but using the PFN side address is needed for more
than just mmap!

DMA mapping belongs in the importer, and the new DMA API makes this
even more explicit by allowing the importer a lot of options to
optimize the process of building the HW datastructures. Scatterlist
and the enforced representation of the DMA list are very inefficient
and we are working to get rid of them. They aren't going to be replaced
by any sort of list of DMA addresses though.

If you really disagree you can try to convince the NVMe people to give
up their optimizations the new DMA API allows so DRM can prevent this
code-review problem.

I also want the same optimizations in RDMA, and I am also not
convinced giving them up is a worthwhile tradeoff.

>    Why would you want to do a dmabuf2 here?

Because I need the same kind of common framework. I need to hook VFIO
to RDMA as well. I need to fix RDMA to have working P2P in all
cases. I need to hook KVM virtual device stuff to iommufd. Someone
else needs VFIO to hook into DRM.

How many different times do I need to implement a buffer sharing
lifetime model? No, we should not make a VFIO specific thing, we need
a general tool to do this properly and cover all the different use
cases. That's "dmabuf2" or whatever you want to call it. There are
more than enough use cases to justify doing this. I think splitting
this in two is a bad idea; we do not need two things, we should have
dmabuf handle all the use cases people have, not just DRM's.

>    I don't mind improving the scatterlist approach in any way possible.
>    I'm just rejecting things which we already tried and turned out to be a
>    bad idea.
>    If you make an interface which gives DMA addresses plus additional
>    information like address space, access hints etc.. to importers that
>    would be really welcomed.

This is not welcomed, having lists of DMA addresses is inefficient and
does not match the direction of the DMA API. We are trying very hard
to completely remove the lists of DMA addresses in common fast paths.

>    But exposing PFNs and letting the importers create their DMA mappings
>    themselves and making CPU mappings themselves is an absolutely clear
>    no-go.

Again, this is what we must have to support the new DMA API, the KVM
and IOMMUFD use cases I mentioned.

>> In this case Xu is exporting MMIO from VFIO and importing to KVM and
>> iommufd.
> 
>    So basically a portion of a PCIe BAR is imported into iommufd?

Yeah, and KVM. And RDMA.

>    Then create an interface between VFIO and KVM/iommufd which allows to
>    pass data between these two.
>    We already do this between DMA-buf exporters/importers all the time.
>    Just don't make it general DMA-buf API.

I have no idea what this means. We'd need a new API linked to DMABUF
that would be optional and used by this part of the world. As I said
above we could protect it with some module namespace so you can keep
it out of DRM. If you can agree to that then it seems fine..

> > Someone else had some use case where they wanted to put the VFIO MMIO
> > PCIe BAR into a DMABUF and ship it into a GPU driver for
> > somethingsomething virtualization but I didn't understand it.
> 
>    Yeah, that is already perfectly supported.

No, it isn't. Christoph is blocking DMABUF in VFIO because he does not
want the scatterlist abuses that dmabuf is doing to proliferate.  We
already have some ARM systems where the naive way typical DMABUF
implementations are setting up P2P does not work. Those systems have
PCI offset.

Getting this to be "perfectly supported" is why we are working on all
these aspects to improve the DMA API and remove the scatterlist
abuses.

>> In a certain sense CC is a TEE that is built using KVM instead of the
>> TEE subsystem. Using KVM and integrating with the MM brings a whole
>> set of unique challenges that TEE got to avoid..
> 
>    Please go over those challenges in more detail. I need to get a better
>    understanding of what's going on here.
>    E.g. who manages encryption keys, who raises the machine check on
>    violations etc...

TEE broadly has Linux launch a secure world that does some private
work. The secure worlds tend to be very limited, they are not really
VMs and they don't run full Linux inside

CC broadly has the secure world exist at boot and launch Linux and
provide services to Linux. The secure world enforces memory isolation
on Linux and generates faults on violations. KVM is the gateway to
launch new secure worlds and the secure worlds are full VMs with all
the device emulation and more.

CC is much more like Xen with its hypervisor and Dom0 concepts.

From this perspective, the only thing that matters is that CC secure
memory is different and special - it is very much like your private
memory concept. Only special places that understand it and have the
right HW capability can use it. All the consumers need a CPU address
to program their HW because of how the secure world security works.

Jason
Simona Vetter Jan. 17, 2025, 2:37 p.m. UTC | #36
On Thu, Jan 16, 2025 at 12:07:47PM -0400, Jason Gunthorpe wrote:
> On Thu, Jan 16, 2025 at 04:13:13PM +0100, Christian König wrote:
> >> But this, fundamentally, is importers creating attachments and then
> >> *ignoring the lifetime rules of DMABUF*. If you created an attachment,
> >> got a move and *ignored the move* because you put the PFN in your own
> >> VMA, then you are not following the attachment lifetime rules!
> > 
> >    Move notify is solely for informing the importer that they need to
> >    re-fresh their DMA mappings and eventually block for ongoing DMA to
> >    end.
> 
> I feel that it is a bit pedantic to say DMA and CPU are somehow
> different. The DMABUF API gives you a scatterlist, it is reasonable to
> say that move invalidates the entire scatterlist, CPU and DMA equally.

dma-buf doesn't even give you a scatterlist; there are dma-bufs which truly
are just handles that are 100% device specific.

It really is the "everything is optional" interface. And yes we've had
endless amounts of fun where importers tried to peek behind the curtain,
largely because scatterlists didn't cleanly separate the dma_addr_t from
the struct page side of things. So I understand Christian's concerns here.

But that doesn't mean we cannot add another interface that does allow
importers to peek behind the curtain. As long as we don't make a mess and
confuse importers, and ideally have compat functions so that existing
importers can deal with pfn exporters too it's all fine with me.

And I also don't see an issue with reasonable pfn semantics that would
prevent a generic mmap implementation on top of that, so that
dma_buf_mmap() just works for those. The mmu notifier pain is when you
consume mmaps/vma in a generic way, but implementing mmap with the current
dma_resv locking is dead easy. Unless I've missed something nonobvious and
just made a big fool of myself :-)
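
Roughly what I have in mind, as a sketch - dma_buf_generic_fault() and
dma_buf_pfn_of() are made-up names, assuming a pfn-capable exporter:

static vm_fault_t dma_buf_generic_fault(struct vm_fault *vmf)
{
	/* for a dma-buf file, private_data is the struct dma_buf */
	struct dma_buf *dmabuf = vmf->vma->vm_file->private_data;
	unsigned long pfn;
	vm_fault_t ret;

	dma_resv_lock(dmabuf->resv, NULL);
	/* made-up helper: look up the pfn currently backing this offset */
	pfn = dma_buf_pfn_of(dmabuf, vmf->pgoff);
	ret = vmf_insert_pfn(vmf->vma, vmf->address, pfn);
	dma_resv_unlock(dmabuf->resv);

	return ret;
}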

What we cannot demand is that all existing dma-buf exporters switch over
to the pfn interfaces. But hey it's the everything optional interface, the
only things it guarantees are
- fixed length
- no copying
- refcounting, and (yay, it only took 10 years) consistent locking rules

So I really don't understand Christian's fundamental opposition here.

Cheers, Sima

Simona Vetter Jan. 17, 2025, 2:42 p.m. UTC | #37
On Wed, Jan 15, 2025 at 11:06:53AM +0100, Christian König wrote:
> On 15.01.25 at 09:55, Simona Vetter wrote:
> > > > If we add something
> > > > new, we need clear rules and not just "here's the kvm code that uses it".
> > > > That's how we've done dma-buf at first, and it was a terrible mess of
> > > > mismatched expecations.
> > > Yes, that would be wrong. It should be self defined within dmabuf and
> > > kvm should adopt to it, move semantics and all.
> > Ack.
> > 
> > I feel like we have a plan here.
> 
> I think I have to object a bit on that.
> 
> >   Summary from my side:
> > 
> > - Sort out pin vs revocable vs dynamic/moveable semantics, make sure
> >    importers have no surprises.
> > 
> > - Adopt whatever new dma-api datastructures pops out of the dma-api
> >    reworks.
> > 
> > - Add pfn based memory access as yet another optional access method, with
> >    helpers so that exporters who support this get all the others for free.
> > 
> > I don't see a strict ordering between these, imo should be driven by
> > actual users of the dma-buf api.
> > 
> > Already done:
> > 
> > - dmem cgroup so that we can resource control device pinnings just landed
> >    in drm-next for next merge window. So that part is imo sorted and we can
> >    charge ahead with pinning into device memory without all the concerns
> >    we've had years ago when discussing that for p2p dma-buf support.
> > 
> >    But there might be some work so that we can p2p pin without requiring
> >    dynamic attachments only, I haven't checked whether that needs
> >    adjustment in dma-buf.c code or just in exporters.
> > 
> > Anything missing?
> 
> Well as far as I can see this use case is not a good fit for the DMA-buf
> interfaces in the first place. DMA-buf deals with devices and buffer
> exchange.
> 
> What's necessary here instead is to give an importing VM full access on some
> memory for their specific use case.
> 
> That full access includes CPU and DMA mappings, modifying caching
> attributes, potentially setting encryption keys for specific ranges etc....
> etc...
> 
> In other words we have a lot of things the importer here should be able to
> do which we don't want most of the DMA-buf importers to do.

This proposal isn't about forcing existing exporters to allow importers to
do new stuff. That stays as-is, because it would break things.

It's about adding yet another interface to get at the underlying data, and
we have tons of those already. The only difference is that if we don't
butcher the design, we'll be able to implement all the existing dma-buf
interfaces on top of this new pfn interface, for some neat maximal
compatibility.

But fundamentally there's never been an expectation that you can take any
arbitrary dma-buf and pass it to any arbitrary importer, and that it must
work. The fundamental promise is that if it _does_ work, then
- it's zero copy
- and fast, or as fast as we can make it

I don't see this any differently than all the much more specific proposals
and existing code, where a subset of importers/exporters have special
rules so that e.g. gpu interconnect or vfio uuid based sharing works.
pfn-based sharing is just yet another flavor that exists to get the max
amount of speed out of interconnects.

Cheers, Sima

> 
> The semantics for things like pin vs revocable vs dynamic/moveable seems
> similar, but that's basically it.
> 
> As far as I know the TEE subsystem also represents their allocations as file
> descriptors. If I'm not completely mistaken this use case most likely fit's
> better there.
> 
> > I feel like this is small enough that m-l archives is good enough. For
> > some of the bigger projects we do in graphics we sometimes create entries
> > in our kerneldoc with wip design consensus and things like that. But
> > feels like overkill here.
> > 
> > > My general desire is to move all of RDMA's MR process away from
> > > scatterlist and work using only the new DMA API. This will save *huge*
> > > amounts of memory in common workloads and be the basis for non-struct
> > > page DMA support, including P2P.
> > Yeah a more memory efficient structure than the scatterlist would be
> > really nice. That would even benefit the very special dma-buf exporters
> > where you cannot get a pfn and only the dma_addr_t, although most of those
> > (all maybe even?) have contig buffers, so your scatterlist has only one
> > entry. But it would definitely be nice from a design pov.
> 
> Completely agree on that part.
> 
> Scatterlists have some design flaws, especially mixing the input and output
> parameters of the DMA API into the same structure.
> 
> In addition to that, DMA addresses are basically missing which bus they belong
> to and details of how the access should be made (e.g. snoop vs no-snoop
> etc...).
> 
> > Aside: A way to more efficiently create compressed scatterlists would be
> > neat too, because a lot of drivers hand-roll that and it's a bit brittle
> > and kinda silly to duplicate. With compressed I mean just a single entry
> > for a contig range, in practice thanks to huge pages/folios and allocators
> > trying to hand out contig ranges if there's plenty of memory that saves a
> > lot of memory too. But currently it's a bit a pain to construct these
> > efficiently, mostly it's just a two-pass approach and then trying to free
> > surplus memory or krealloc to fit. Also I don't have good ideas here, but
> > dma-api folks might have some from looking at too many things that create
> > scatterlists.
> 
> I mailed with Christoph about that a while back as well and we both agreed
> that it would probably be a good idea to start defining a data structure to
> better encapsulate DMA addresses.
> 
> It's just that nobody had time for that yet and/or I wasn't looped in in the
> final discussion about it.
> 
> Regards,
> Christian.
> 
> > -Sima
Christian König Jan. 20, 2025, 12:14 p.m. UTC | #38
On 17.01.25 at 15:42, Simona Vetter wrote:
> On Wed, Jan 15, 2025 at 11:06:53AM +0100, Christian König wrote:
>> [SNIP]
>>> Anything missing?
>> Well as far as I can see this use case is not a good fit for the DMA-buf
>> interfaces in the first place. DMA-buf deals with devices and buffer
>> exchange.
>>
>> What's necessary here instead is to give an importing VM full access on some
>> memory for their specific use case.
>>
>> That full access includes CPU and DMA mappings, modifying caching
>> attributes, potentially setting encryption keys for specific ranges etc....
>> etc...
>>
>> In other words we have a lot of things the importer here should be able to
>> do which we don't want most of the DMA-buf importers to do.
> This proposal isn't about forcing existing exporters to allow importers to
> do new stuff. That stays as-is, because it would break things.
>
> It's about adding yet another interface to get at the underlying data, and
> we have tons of those already. The only difference is that if we don't
> butcher the design, we'll be able to implement all the existing dma-buf
> interfaces on top of this new pfn interface, for some neat maximal
> compatibility.

That sounds like you missed my concern.

When an exporter and an importer agree that they want to exchange PFNs 
instead of DMA addresses then that is perfectly fine.

The problems start when you define the semantics of how those PFNs, DMA 
addresses, private bus addresses, whatever, are exchanged differently from 
what we have documented for DMA-buf.

These semantics are very well defined for DMA-buf now, because that is 
really important; otherwise things usually seem to work under testing 
(e.g. without memory pressure) and then totally fall apart in production 
environments.

In other words we have defined what lock you need to hold when calling 
functions, what a DMA fence is, when exchanged addresses are valid etc...

> But fundamentally there's never been an expectation that you can take any
> arbitrary dma-buf and pass it to any arbitrary importer, and that it must
> work. The fundamental promise is that if it _does_ work, then
> - it's zero copy
> - and fast, or as fast as we can make it
>
> I don't see this any differently than all the much more specific proposals
> and existing code, where a subset of importers/exporters have special
> rules so that e.g. gpu interconnect or vfio uuid based sharing works.
> pfn-based sharing is just yet another flavor that exists to get the max
> amount of speed out of interconnects.

Please take another look at what is proposed here. The function is 
called dma_buf_get_pfn_*unlocked* !

This is not following DMA-buf semantics for exchanging addresses and 
keeping them valid, but rather something more like userptrs.

Inserting PFNs into CPU (or probably also IOMMU) page tables has 
different semantics than what DMA-buf usually does, because as soon as 
the address is written into the page table it is made public. So you 
need some kind of mechanism to make sure that this address you made public 
stays valid as long as it is public.

The usual I/O operations we encapsulate with DMA-fences have 
fundamentally different semantics because we have the lock which 
enforces that stuff stays valid, and then a DMA-fence which notes 
how long the stuff needs to stay valid for an operation to complete.

Regards,
Christian.

Christian König Jan. 20, 2025, 1:44 p.m. UTC | #39
On 21.06.24 at 00:02, Xu Yilun wrote:
> On Thu, Jan 16, 2025 at 04:13:13PM +0100, Christian König wrote:
>>     On 15.01.25 at 18:09, Jason Gunthorpe wrote:
>>
>>   On Wed, Jan 15, 2025 at 05:34:23PM +0100, Christian König wrote:
>>
>>      Granted, let me try to improve this.
>>      Here is a real world example of one of the issues we ran into and why
>>      CPU mappings of importers are redirected to the exporter.
>>      We have a good bunch of different exporters who track the CPU mappings
>>      of their backing store using address_space objects in one way or
>>      another and then uses unmap_mapping_range() to invalidate those CPU
>>      mappings.
>>      But when importers get the PFNs of the backing store they can look
>>      behind the curtain and directly insert this PFN into the CPU page
>>      tables.
>>      We had literally tons of cases like this where drivers developers cause
>>      access after free issues because the importer created a CPU mappings on
>>      their own without the exporter knowing about it.
>>      This is just one example of what we ran into. Additional to that
>>      basically the whole synchronization between drivers was overhauled as
>>      well because we found that we can't trust importers to always do the
>>      right thing.
>>
>>   But this, fundamentally, is importers creating attachments and then
>>   *ignoring the lifetime rules of DMABUF*. If you created an attachment,
>>   got a move and *ignored the move* because you put the PFN in your own
>>   VMA, then you are not following the attachment lifetime rules!
>>
>>     Move notify is solely for informing the importer that they need to
>>     re-fresh their DMA mappings and eventually block for ongoing DMA to end.
>>
>>     This semantics doesn't work well for CPU mappings because you need to hold
>>     the reservation lock to make sure that the information stay valid and you
>>     can't hold a lock while returning from a page fault.
> Dealing with CPU mapping and resource invalidation is a little hard, but is
> resolvable by using other types of locks. And I guess for now dma-buf
> exporters should always handle this CPU mapping vs. invalidation contention if
> they support mmap().
>
> Since it is resolvable, with some invalidation notification a decent importer
> could also handle the contention well.

That doesn't work like this.

See, page table updates under DMA-buf work by using the same locking 
approach for both the validation and invalidation side. In other words 
we hold the same lock while inserting and removing entries into/from the 
page tables.

That this is supposed to be an unlocked API means that you can only use it 
with pre-allocated and hard-pinned memory, without any chance to invalidate 
it while running. Otherwise you can never be sure of the validity of the 
address information you got from the exporter.

> IIUC now the only concern is that importer device drivers are more likely to do
> something wrong, so CPU mapping things are moved to the exporter. But most of the
> exporters are also device drivers, so why are they smarter?

Exporters always use their invalidation code path no matter if they are 
exporting their buffers for others to use or if they are standalone.

If you do the invalidation on the importer side you always need both 
exporter and importer around to test it.

In addition to that, we have many more importers than exporters. E.g. a 
lot of simple drivers only import DMA-heap buffers and never export 
anything.

> And there are increasing mapping needs: today exporters help handle the primary CPU
> mapping, tomorrow should they also help with all other mappings? Clearly that is
> not feasible. So maybe conditionally give trust to some importers.

Why should that be necessary? Exporters *must* know what somebody does 
with their buffers.

If you have a use case the exporter doesn't support in its mapping 
operation, then that use case most likely doesn't work in the first place.

For example, direct I/O is enabled/disabled by exporters on their CPU 
mappings based on whether that works correctly for them. And an importer 
simply doesn't know if it should use vm_insert_pfn() or vm_insert_page().

We could of course implement that logic in each importer to choose 
between the different approaches, but then each importer gains logic it 
only exercises with a specific exporter. And that doesn't seem to be a 
good idea at all.
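
As a rough illustration of the kind of decision that today lives in the 
exporter's fault handler - all names and fields here are made up:

static vm_fault_t my_exporter_fault(struct vm_fault *vmf)
{
	struct my_buffer *buf = vmf->vma->vm_private_data;

	if (buf->backed_by_pages)
		/* page backed: refcounted mapping, direct I/O can work */
		return vmf_insert_page(vmf->vma, vmf->address,
				       buf->pages[vmf->pgoff]);

	/* MMIO or carved-out memory: raw PFN, VM_PFNMAP semantics */
	return vmf_insert_pfn(vmf->vma, vmf->address,
			      buf->base_pfn + vmf->pgoff);
}

The exporter also picks the matching vma flags (VM_MIXEDMAP vs 
VM_PFNMAP) and pgprot at mmap time; the importer never sees any of this.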

Regards,
Christian.

Jason Gunthorpe Jan. 20, 2025, 5:59 p.m. UTC | #40
On Mon, Jan 20, 2025 at 01:14:12PM +0100, Christian König wrote:

What is going wrong with your email? You replied to Simona, but
Simona Vetter <simona.vetter@ffwll.ch> is dropped from the To/CC
list??? I added the address back, but seems like a weird thing to
happen.

> Please take another look at what is proposed here. The function is called
> dma_buf_get_pfn_*unlocked* !

I don't think Simona and I are defending the implementation in this
series. This series needs work.

We have been talking about what the implementation should be. I think
we've all been clear on the idea that the DMA buf locking rules should
apply to any description of the memory, regardless of if the address
are CPU, DMA, or private.

I agree that the idea of any "get unlocked" concept seems nonsensical
and wrong within dmabuf.

> Inserting PFNs into CPU (or probably also IOMMU) page tables have a
> different semantics than what DMA-buf usually does, because as soon as the
> address is written into the page table it is made public.

Not really.

The KVM/CPU is fully compatible with move semantics, it has
restartable page faults and can implement dmabuf's move locking
scheme. It can use the resv lock, the fences, move_notify and so on to
implement it. It is a bug if this series isn't doing that.

The iommu cannot support move semantics. It would need the existing
pin semantics (ie we call dma_buf_pin() and don't support
move_notify). To work with VFIO we would need to formalize the revoke
semantics that Simona was discussing.

We already implement both of these modalities in rdma, the locking API
is fine and workable with CPU pfns just as well.
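
For the pin flavour, roughly - sketch only, the my_* name is invented:

static int my_iommu_importer_setup(struct dma_buf_attachment *attach)
{
	int ret;

	dma_resv_lock(attach->dmabuf->resv, NULL);

	/* no fault/move support here, so ask the exporter to keep the
	 * backing store resident instead of implementing move_notify */
	ret = dma_buf_pin(attach);
	if (!ret) {
		/* build the IOMMU mappings from whatever address
		 * description the attachment provides */
	}

	dma_resv_unlock(attach->dmabuf->resv);
	return ret;
}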

I've imagined a staged flow here:

 1) The new DMA API lands
 2) We improve the new DMA API to be fully struct page free, including
    setting up P2P
 3) VFIO provides a dmbuf exporter using the new DMA API's P2P
    support. We'd have to continue with the scatterlist hacks for now.
    VFIO would be a move_notify exporter. This should work with RDMA
 4) RDMA works to drop scatterlist from its internal flows and use the
    new DMA API instead.
 5) VFIO/RDMA implement a new non-scatterlist DMABUF op to
    demonstrate the post-scatterlist world and deprecate the scatterlist
    hacks.
 6) We add revoke semantics to dmabuf, and VFIO/RDMA implements them
 7) iommufd can import a pinnable revokable dmabuf using CPU pfns
    through the non-scatterlist op.
 8) Relevant GPU drivers implement the non-scatterlist op and RDMA
    removes support for the deprecated scatterlist hacks.

Xu's series has jumped ahead a bit and is missing infrastructure to
build it properly.

Jason
Simona Vetter Jan. 20, 2025, 6:50 p.m. UTC | #41
On Mon, Jan 20, 2025 at 01:59:01PM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 20, 2025 at 01:14:12PM +0100, Christian König wrote:
> What is going wrong with your email? You replied to Simona, but
> Simona Vetter <simona.vetter@ffwll.ch> is dropped from the To/CC
> list??? I added the address back, but seems like a weird thing to
> happen.

Might also be funny mailing list stuff, depending how you get these. I
read mails over lore and pretty much ignore cc (unless it's not also on
any list, since those tend to be security issues) because I get cc'ed on
way too much stuff for that to be a useful signal.

> > Please take another look at what is proposed here. The function is called
> > dma_buf_get_pfn_*unlocked* !
> 
> I don't think Simona and I are defending the implementation in this
> series. This series needs work.

Yeah this current series is good for kicking off the discussions, it's
defo not close to anything we can merge.

> We have been talking about what the implementation should be. I think
> we've all been clear on the idea that the DMA buf locking rules should
> apply to any description of the memory, regardless of if the address
> are CPU, DMA, or private.
> 
> I agree that the idea of any "get unlocked" concept seems nonsensical
> and wrong within dmabuf.
> 
> > Inserting PFNs into CPU (or probably also IOMMU) page tables have a
> > different semantics than what DMA-buf usually does, because as soon as the
> > address is written into the page table it is made public.
> 
> Not really.
> 
> The KVM/CPU is fully compatible with move semantics, it has
> restartable page faults and can implement dmabuf's move locking
> scheme. It can use the resv lock, the fences, move_notify and so on to
> implement it. It is a bug if this series isn't doing that.

Yeah I'm not worried about cpu mmap locking semantics. drm/ttm is a pretty
clear example that you can implement dma-buf mmap with the rules we have,
except the unmap_mapping_range might need a bit fudging with a separate
address_space.

For cpu mmaps I'm more worried about the arch bits in the pte, stuff like
caching mode or encrypted memory bits and things like that. There's
vma->vm_pgprot, but it's a mess. But maybe this all is an incentive to
clean up that mess a bit.

> The iommu cannot support move semantics. It would need the existing
> pin semantics (ie we call dma_buf_pin() and don't support
> move_notify). To work with VFIO we would need to formalize the revoke
> semantics that Simona was discussing.

I thought iommuv2 (or whatever linux calls these) has full fault support
and could support current move semantics. But yeah for iommu without
fault support we need some kind of pin or a newly formalized revoke model.

> We already implement both of these modalities in rdma, the locking API
> is fine and workable with CPU pfns just as well.
> 
> I've imagined a staged flow here:
> 
>  1) The new DMA API lands
>  2) We improve the new DMA API to be fully struct page free, including
>     setting up P2P
>  3) VFIO provides a dmbuf exporter using the new DMA API's P2P
>     support. We'd have to continue with the scatterlist hacks for now.
>     VFIO would be a move_notify exporter. This should work with RDMA
>  4) RDMA works to drop scatterlist from its internal flows and use the
>     new DMA API instead.
>  5) VFIO/RDMA implement a new non-scatterlist DMABUF op to
>     demonstrate the post-scatterlist world and deprecate the scatterlist
>     hacks.
>  6) We add revoke semantics to dmabuf, and VFIO/RDMA implements them
>  7) iommufd can import a pinnable revokable dmabuf using CPU pfns
>     through the non-scatterlist op.
>  8) Relevant GPU drivers implement the non-scatterlist op and RDMA
>     removes support for the deprecated scatterlist hacks.

Sounds pretty reasonable as a first sketch of a proper plan. Of course
fully expecting that no plan ever survives implementation intact :-)

Cheers, Sima

Jason Gunthorpe Jan. 20, 2025, 7:48 p.m. UTC | #42
On Mon, Jan 20, 2025 at 07:50:23PM +0100, Simona Vetter wrote:
> On Mon, Jan 20, 2025 at 01:59:01PM -0400, Jason Gunthorpe wrote:
> > On Mon, Jan 20, 2025 at 01:14:12PM +0100, Christian König wrote:
> > What is going wrong with your email? You replied to Simona, but
> > Simona Vetter <simona.vetter@ffwll.ch> is dropped from the To/CC
> > list??? I added the address back, but seems like a weird thing to
> > happen.
> 
> Might also be funny mailing list stuff, depending how you get these. I
> read mails over lore and pretty much ignore cc (unless it's not also on
> any list, since those tend to be security issues) because I get cc'ed on
> way too much stuff for that to be a useful signal.

Oh I see, you are sending a Mail-followup-to header that excludes your
address, so you don't get any emails at all.. My mutt is dropping you
as well.

> Yeah I'm not worried about cpu mmap locking semantics. drm/ttm is a pretty
> clear example that you can implement dma-buf mmap with the rules we have,
> except the unmap_mapping_range might need a bit fudging with a separate
> address_space.

From my perspective the mmap thing is a bit of a side/DRM-only thing
as nothing I'm interested in wants to mmap dmabuf into a VMA.

However, I think if you have locking rules that can fit into a VMA
fault path and link move_notify to unmap_mapping_range() then you've
got a pretty usable API.

> For cpu mmaps I'm more worried about the arch bits in the pte, stuff like
> caching mode or encrypted memory bits and things like that. There's
> vma->vm_pgprot, but it's a mess. But maybe this all is an incentive to
> clean up that mess a bit.

I'm convinced we need meta-data along with pfns, there is too much
stuff that needs more information than just the address. Cachability,
CC encryption, exporting device, etc. This is a topic to partially
cross when we talk about how to fully remove struct page requirements
from the new DMA API.

I'm hoping we can get to something where we describe not just how the
pfns should be DMA mapped, but also can describe how they should be
CPU mapped. For instance that this PFN space is always mapped
uncachable, in CPU and in IOMMU.

We also have current bugs in the iommu/vfio side where we are fudging
CC stuff, like assuming CPU memory is encrypted (not always true) and
that MMIO is non-encrypted (not always true)

> I thought iommuv2 (or whatever linux calls these) has full fault support
> and could support current move semantics. But yeah for iommu without
> fault support we need some kind of pin or a newly formalized revoke model.

No, this is HW dependent, including PCI device, and I'm aware of no HW
that fully implements this in a way that could be useful to implement
arbitrary move semantics for VFIO..

Jason
Simona Vetter Jan. 21, 2025, 4:11 p.m. UTC | #43
On Mon, Jan 20, 2025 at 03:48:04PM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 20, 2025 at 07:50:23PM +0100, Simona Vetter wrote:
> > On Mon, Jan 20, 2025 at 01:59:01PM -0400, Jason Gunthorpe wrote:
> > > On Mon, Jan 20, 2025 at 01:14:12PM +0100, Christian König wrote:
> > > What is going wrong with your email? You replied to Simona, but
> > > Simona Vetter <simona.vetter@ffwll.ch> is dropped from the To/CC
> > > list??? I added the address back, but seems like a weird thing to
> > > happen.
> > 
> > Might also be funny mailing list stuff, depending how you get these. I
> > read mails over lore and pretty much ignore cc (unless it's not also on
> > any list, since those tend to be security issues) because I get cc'ed on
> > way too much stuff for that to be a useful signal.
> 
> Oh I see, you are sending a Mail-followup-to header that excludes your
> address, so you don't get any emails at all.. My mutt is dropping you
> as well.
> 
> > Yeah I'm not worried about cpu mmap locking semantics. drm/ttm is a pretty
> > clear example that you can implement dma-buf mmap with the rules we have,
> > except the unmap_mapping_range might need a bit fudging with a separate
> > address_space.
> 
> From my perspective the mmap thing is a bit of a side/DRM-only thing
> as nothing I'm interested in wants to mmap dmabuf into a VMA.

I guess we could just skip mmap on these pfn exporters, but it also means
a bit more boilerplate. At least the device mapping / dma_buf_attachment
side should be doable with just the pfn and the new dma-api?

> However, I think if you have locking rules that can fit into a VMA
> fault path and link move_notify to unmap_mapping_range() then you've
> got a pretty usuable API.
> 
> > For cpu mmaps I'm more worried about the arch bits in the pte, stuff like
> > caching mode or encrypted memory bits and things like that. There's
> > vma->vm_pgprot, but it's a mess. But maybe this all is an incentive to
> > clean up that mess a bit.
> 
> I'm convinced we need meta-data along with pfns, there is too much
> stuff that needs more information than just the address. Cachability,
> CC encryption, exporting device, etc. This is a topic to partially
> cross when we talk about how to fully remove struct page requirements
> from the new DMA API.
> 
> I'm hoping we can get to something where we describe not just how the
> pfns should be DMA mapped, but also can describe how they should be
> CPU mapped. For instance that this PFN space is always mapped
> uncachable, in CPU and in IOMMU.

I was pondering whether dma_mmap and friends would be a good place to
prototype this and go for a fully generic implementation. But then even
those have _wc/_uncached variants.

If you go into arch specific stuff, then x86 does have wc/uc/... tracking,
but only for memory (set_page_wc and friends iirc). And you can bypass it
if you know what you're doing.

> We also have current bugs in the iommu/vfio side where we are fudging
> CC stuff, like assuming CPU memory is encrypted (not always true) and
> that MMIO is non-encrypted (not always true)

tbf CC pte flags I just don't grok at all. I once tried to understand
what current exporters and gpu drivers do and just gave up. But that's
also a bit why I'm worried here because it's an enigma to me.

> > I thought iommuv2 (or whatever linux calls these) has full fault support
> > and could support current move semantics. But yeah for iommu without
> > fault support we need some kind of pin or a newly formalized revoke model.
> 
> No, this is HW dependent, including PCI device, and I'm aware of no HW
> that fully implements this in a way that could be useful to implement
> arbitary move semantics for VFIO..

Hm, I thought we had at least prototypes of device fault repair floating
around, but I guess that only works with ATS/pasid stuff and not general
iommu traffic from devices. It definitely needs some device cooperation,
since the timeouts of a full fault are almost endless.
-Sima
Jason Gunthorpe Jan. 21, 2025, 5:36 p.m. UTC | #44
On Tue, Jan 21, 2025 at 05:11:32PM +0100, Simona Vetter wrote:
> On Mon, Jan 20, 2025 at 03:48:04PM -0400, Jason Gunthorpe wrote:
> > On Mon, Jan 20, 2025 at 07:50:23PM +0100, Simona Vetter wrote:
> > > On Mon, Jan 20, 2025 at 01:59:01PM -0400, Jason Gunthorpe wrote:
> > > > On Mon, Jan 20, 2025 at 01:14:12PM +0100, Christian König wrote:
> > > > What is going wrong with your email? You replied to Simona, but
> > > > Simona Vetter <simona.vetter@ffwll.ch> is dropped from the To/CC
> > > > list??? I added the address back, but seems like a weird thing to
> > > > happen.
> > > 
> > > Might also be funny mailing list stuff, depending how you get these. I
> > > read mails over lore and pretty much ignore cc (unless it's not also on
> > > any list, since those tend to be security issues) because I get cc'ed on
> > > way too much stuff for that to be a useful signal.
> > 
> > Oh I see, you are sending a Mail-followup-to header that excludes your
> > address, so you don't get any emails at all.. My mutt is dropping you
> > as well.
> > 
> > > Yeah I'm not worried about cpu mmap locking semantics. drm/ttm is a pretty
> > > clear example that you can implement dma-buf mmap with the rules we have,
> > > except the unmap_mapping_range might need a bit fudging with a separate
> > > address_space.
> > 
> > From my perspective the mmap thing is a bit of a side/DRM-only thing
> > as nothing I'm interested in wants to mmap dmabuf into a VMA.
>
> I guess we could just skip mmap on these pfn exporters, but it also means
> a bit more boilerplate. 

I have been assuming that dmabuf mmap remains unchanged, that
exporters will continue to implement that mmap() callback as today.

My main interest has been what data structure is produced in the
attach APIs.

Eg today we have a struct dma_buf_attachment that returns a sg_table.

I'm expecting some kind of new data structure, let's call it "physical
list", that is some efficient coding of meta/addr/len tuples that works
well with the new DMA API. Matthew has been calling this thing phyr.

So, I imagine, struct dma_buf_attachment gaining an optional
feature negotiation and then we have in dma_buf_attachment:

        union {
              struct sg_table *sgt;
	      struct physical_list *phyr;
	};

That's basically it, an alternative to scatterlist that has a clean
architecture.
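
To make that a bit more concrete, here is a minimal sketch of what one
entry of such a "physical list" could look like. All of the names below
(struct phys_list_entry, the PHYS_ENTRY_* flags) are invented purely for
illustration, nothing like this exists today:

    /* hypothetical encoding of one meta/addr/len tuple */
    struct phys_list_entry {
            phys_addr_t     addr;   /* start of the physical range */
            size_t          len;    /* length of the range in bytes */
            unsigned long   flags;  /* meta information, e.g.: */
    #define PHYS_ENTRY_UNCACHED     (1UL << 0) /* must be mapped uncachable */
    #define PHYS_ENTRY_ENCRYPTED    (1UL << 1) /* CC 'encrypted' view */
    #define PHYS_ENTRY_PRIVATE      (1UL << 2) /* private interconnect addr */
    };

    struct physical_list {
            unsigned int            nr_entries;
            struct phys_list_entry  entries[];
    };

Common code walking this would take a private path (or cleanly fail) for
entries carrying PHYS_ENTRY_PRIVATE.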

Now, if you are asking whether the current dmabuf mmap callback can be
improved with the above? Maybe? phyr should have the necessary
information inside it to populate a VMA - eventually even fully
correctly with all the right cacheable/encrypted/forbidden/etc flags.

So, you could imagine that exporters could just have one routine to
generate the phyr list and that goes into the attachment, goes into
some common code to fill VMA PTEs, and some other common code that
will convert it into the DMABUF scatterlist. If performance is not a
concern with these data structure conversions it could be an appealing
simplification.

And yes, I could imagine the meta information being descriptive enough
to support the private interconnect cases, the common code could
detect private meta information and just cleanly fail.

> At least the device mapping / dma_buf_attachment
> side should be doable with just the pfn and the new dma-api?

Yes, that would be my first goal post. Figure out some meta
information and a container data structure that allows struct
page-less P2P mapping through the new DMA API.

> > I'm hoping we can get to something where we describe not just how the
> > pfns should be DMA mapped, but also can describe how they should be
> > CPU mapped. For instance that this PFN space is always mapped
> > uncachable, in CPU and in IOMMU.
> 
> I was pondering whether dma_mmap and friends would be a good place to
> prototype this and go for a fully generic implementation. But then even
> those have _wc/_uncached variants.

Given that the inability to correctly DMA map P2P MMIO without struct
page is a current pain point and current source of hacks in dmabuf
exporters, I wanted to make resolving that a priority.

However, if you mean what I described above for "fully generic [dmabuf
mmap] implementation", then we'd have the phyr datastructure as a
dependency to attempt that work.

phyr, and particularly the meta information, has a number of
stakeholders. I was thinking of going first with rdma's memory
registration flow because we are now pretty close to being able to do
such a big change, and it can demonstrate most of the requirements.

But that doesn't mean mmap couldn't go concurrently on the same agreed
datastructure if people are interested.

> > We also have current bugs in the iommu/vfio side where we are fudging
> > CC stuff, like assuming CPU memory is encrypted (not always true) and
> > that MMIO is non-encrypted (not always true)
> 
> tbf CC pte flags I just don't grok at all. I've once tried to understand
> what current exporters and gpu drivers do and just gave up. But that's
> also a bit why I'm worried here because it's an enigma to me.

For CC, inside the secure world, is some information if each PFN
inside the VM is 'encrypted' or not. Any VM PTE (including the IOPTEs)
pointing at the PFN must match the secure world's view of
'encrypted'. The VM can ask the secure world to change its view at
runtime.

The way CC has been bolted onto the kernel so far largely hides this
from drivers, so it is hard to tell in driver code if the PFN you
have is 'encrypted' or not. Right now the general rule (that is not
always true) is that struct page CPU memory is encrypted and
everything else is decrypted.

So right now, you can mostly ignore it and the above assumption
largely happens for you transparently.

However, soon we will have encrypted P2P MMIO which will stress this
hiding strategy.

> > > I thought iommuv2 (or whatever linux calls these) has full fault support
> > > and could support current move semantics. But yeah for iommu without
> > > fault support we need some kind of pin or a newly formalized revoke model.
> > 
> > No, this is HW dependent, including PCI device, and I'm aware of no HW
> > that fully implements this in a way that could be useful to implement
> > arbitary move semantics for VFIO..
> 
> Hm I thought we've had at least prototypes floating around of device fault
> repair, but I guess that only works with ATS/pasid stuff and not general
> iommu traffic from devices. Definitely needs some device cooperation since
> the timeouts of a full fault are almost endless.

Yes, exactly. What all real devices I'm aware of have done is make a
subset of their traffic work with ATS and PRI, but not all their
traffic. Without *all* traffic you can't make any generic assumption
in the iommu that a transient non-present won't be fatal to the
device.

Stuff like dmabuf move semantics rely on transient non-present being
non-disruptive...

Jason
Xu Yilun Jan. 22, 2025, 4:16 a.m. UTC | #45
On Mon, Jan 20, 2025 at 02:44:13PM +0100, Christian König wrote:
> Am 21.06.24 um 00:02 schrieb Xu Yilun:
> > On Thu, Jan 16, 2025 at 04:13:13PM +0100, Christian König wrote:
> > >     Am 15.01.25 um 18:09 schrieb Jason Gunthorpe:
> > > 
> > >   On Wed, Jan 15, 2025 at 05:34:23PM +0100, Christian König wrote:
> > > 
> > >      Granted, let me try to improve this.
> > >      Here is a real world example of one of the issues we ran into and why
> > >      CPU mappings of importers are redirected to the exporter.
> > >      We have a good bunch of different exporters who track the CPU mappings
> > >      of their backing store using address_space objects in one way or
> > >      another and then uses unmap_mapping_range() to invalidate those CPU
> > >      mappings.
> > >      But when importers get the PFNs of the backing store they can look
> > >      behind the curtain and directly insert this PFN into the CPU page
> > >      tables.
> > >      We had literally tons of cases like this where drivers developers cause
> > >      access after free issues because the importer created a CPU mappings on
> > >      their own without the exporter knowing about it.
> > >      This is just one example of what we ran into. Additional to that
> > >      basically the whole synchronization between drivers was overhauled as
> > >      well because we found that we can't trust importers to always do the
> > >      right thing.
> > > 
> > >   But this, fundamentally, is importers creating attachments and then
> > >   *ignoring the lifetime rules of DMABUF*. If you created an attachment,
> > >   got a move and *ignored the move* because you put the PFN in your own
> > >   VMA, then you are not following the attachment lifetime rules!
> > > 
> > >     Move notify is solely for informing the importer that they need to
> > >     re-fresh their DMA mappings and eventually block for ongoing DMA to end.
> > > 
> > >     This semantics doesn't work well for CPU mappings because you need to hold
> > >     the reservation lock to make sure that the information stay valid and you
> > >     can't hold a lock while returning from a page fault.
> > Dealing with CPU mapping and resource invalidation is a little hard, but is
> > resolvable, by using other types of locks. And I guess for now dma-buf
> > exporters should always handle this CPU mapping VS. invalidation contention if
> > they support mmap().
> > 
> > It is resolvable so with some invalidation notify, a decent importers could
> > also handle the contention well.
> 
> That doesn't work like this.
> 
> See page tables updates under DMA-buf works by using the same locking
> approach for both the validation and invalidation side. In other words we
> hold the same lock while inserting and removing entries into/from the page
> tables.

I'm not sure what issue this causes; maybe I don't get why "you can't
hold a lock while returning from a page fault".

> 
> That this here should be an unlocked API means that can only use it with
> pre-allocated and hard pinned memory without any chance to invalidate it
> while running. Otherwise you can never be sure of the validity of the

Then importers can use a locked version to get the pfn, and take the
dma_resv lock only to ensure the PFN stays valid during page table setup.
Importers could learn via move notify that the PFN is about to become
invalid and remove the page table entries, then look up the new PFN the
next time a page fault happens.

IIUC, Simona mentions drm/ttm is already doing it this way.
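
For clarity, a rough sketch of the importer-side flow I have in mind is
below. dma_buf_get_pfn_locked(), install_importer_pte() and
zap_importer_ptes() are made-up names (this series only adds the unlocked
get_pfn), so please treat it as pseudo-code:

    /* importer page fault path (e.g. KVM MMU), heavily simplified */
    static int importer_handle_fault(struct dma_buf_attachment *attach,
                                     u64 offset, u64 gpa)
    {
            unsigned long pfn;
            int ret;

            dma_resv_lock(attach->dmabuf->resv, NULL);
            ret = dma_buf_get_pfn_locked(attach, offset, &pfn); /* hypothetical */
            if (!ret)
                    ret = install_importer_pte(gpa, pfn);       /* hypothetical */
            dma_resv_unlock(attach->dmabuf->resv);
            return ret;
    }

    /* importer's move_notify callback, dma_resv lock held by the exporter */
    static void importer_move_notify(struct dma_buf_attachment *attach)
    {
            zap_importer_ptes(attach->importer_priv);           /* hypothetical */
            /* the next fault calls get_pfn again and sees the new PFN */
    }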

I'm not trying to change the CPU mmap behavior for existing drivers, just
to ensure importer mapping is possible with a faultable MMU. I want the KVM
MMU (also faultable) to work in this importer-mapping way.

Thanks,
Yilun
Simona Vetter Jan. 22, 2025, 11:04 a.m. UTC | #46
On Tue, Jan 21, 2025 at 01:36:33PM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 21, 2025 at 05:11:32PM +0100, Simona Vetter wrote:
> > On Mon, Jan 20, 2025 at 03:48:04PM -0400, Jason Gunthorpe wrote:
> > > On Mon, Jan 20, 2025 at 07:50:23PM +0100, Simona Vetter wrote:
> > > > On Mon, Jan 20, 2025 at 01:59:01PM -0400, Jason Gunthorpe wrote:
> > > > > On Mon, Jan 20, 2025 at 01:14:12PM +0100, Christian König wrote:
> > > > > What is going wrong with your email? You replied to Simona, but
> > > > > Simona Vetter <simona.vetter@ffwll.ch> is dropped from the To/CC
> > > > > list??? I added the address back, but seems like a weird thing to
> > > > > happen.
> > > > 
> > > > Might also be funny mailing list stuff, depending how you get these. I
> > > > read mails over lore and pretty much ignore cc (unless it's not also on
> > > > any list, since those tend to be security issues) because I get cc'ed on
> > > > way too much stuff for that to be a useful signal.
> > > 
> > > Oh I see, you are sending a Mail-followup-to header that excludes your
> > > address, so you don't get any emails at all.. My mutt is dropping you
> > > as well.
> > > 
> > > > Yeah I'm not worried about cpu mmap locking semantics. drm/ttm is a pretty
> > > > clear example that you can implement dma-buf mmap with the rules we have,
> > > > except the unmap_mapping_range might need a bit fudging with a separate
> > > > address_space.
> > > 
> > > From my perspective the mmap thing is a bit of a side/DRM-only thing
> > > as nothing I'm interested in wants to mmap dmabuf into a VMA.
> >
> > I guess we could just skip mmap on these pfn exporters, but it also means
> > a bit more boilerplate. 
> 
> I have been assuming that dmabuf mmap remains unchanged, that
> exporters will continue to implement that mmap() callback as today.
> 
> My main interest has been what data structure is produced in the
> attach APIs.
> 
> Eg today we have a struct dma_buf_attachment that returns a sg_table.
> 
> I'm expecting some kind of new data structure, lets call it "physical
> list" that is some efficient coding of meta/addr/len tuples that works
> well with the new DMA API. Matthew has been calling this thing phyr..
> 
> So, I imagine, struct dma_buf_attachment gaining an optional
> feature negotiation and then we have in dma_buf_attachment:
> 
>         union {
>               struct sg_table *sgt;
> 	      struct physical_list *phyr;
> 	};
> 
> That's basicaly it, an alternative to scatterlist that has a clean
> architecture.
> 
> Now, if you are asking if the current dmabuf mmap callback can be
> improved with the above? Maybe? phyr should have the neccessary
> information inside it to populate a VMA - eventually even fully
> correctly with all the right cachable/encrypted/forbidden/etc flags.
> 
> So, you could imagine that exporters could just have one routine to
> generate the phyr list and that goes into the attachment, goes into
> some common code to fill VMA PTEs, and some other common code that
> will convert it into the DMABUF scatterlist. If performance is not a
> concern with these data structure conversions it could be an appealing
> simplification.
> 
> And yes, I could imagine the meta information being descriptive enough
> to support the private interconnect cases, the common code could
> detect private meta information and just cleanly fail.

I'm kinda leaning towards entirely separate dma-buf interfaces for the new
phyr stuff, because I fear that adding that to the existing ones will only
make the chaos worse. But that aside it all sounds reasonable, and even that
could just be too much worry on my side, and mixing phyr into existing
attachments (with a pile of importer/exporter flags probably) is fine.

For the existing dma-buf importers/exporters I'm kinda hoping for a pure
dma_addr_t based list eventually. Going all the way to a phyr based
approach for everyone might be too much churn; there's some real bad cruft
there. It's not going to work for every case, but it covers a lot of them
and might be less pain for existing importers.

But in theory it should be possible to use phyr everywhere eventually, as
long as there's no obviously api-rules-breaking way to go from a phyr back to
a struct page even when that exists.

> > At least the device mapping / dma_buf_attachment
> > side should be doable with just the pfn and the new dma-api?
> 
> Yes, that would be my first goal post. Figure out some meta
> information and a container data structure that allows struct
> page-less P2P mapping through the new DMA API.
> 
> > > I'm hoping we can get to something where we describe not just how the
> > > pfns should be DMA mapped, but also can describe how they should be
> > > CPU mapped. For instance that this PFN space is always mapped
> > > uncachable, in CPU and in IOMMU.
> > 
> > I was pondering whether dma_mmap and friends would be a good place to
> > prototype this and go for a fully generic implementation. But then even
> > those have _wc/_uncached variants.
> 
> Given that the inability to correctly DMA map P2P MMIO without struct
> page is a current pain point and current source of hacks in dmabuf
> exporters, I wanted to make resolving that a priority.
> 
> However, if you mean what I described above for "fully generic [dmabuf
> mmap] implementation", then we'd have the phyr datastructure as a
> dependency to attempt that work.
> 
> phyr, and particularly the meta information, has a number of
> stakeholders. I was thinking of going first with rdma's memory
> registration flow because we are now pretty close to being able to do
> such a big change, and it can demonstrate most of the requirements.
> 
> But that doesn't mean mmap couldn't go concurrently on the same agreed
> datastructure if people are interested.

Yeah, cpu mmap needs a lot more; going with a very limited p2p use-case
first makes sense.

> > > We also have current bugs in the iommu/vfio side where we are fudging
> > > CC stuff, like assuming CPU memory is encrypted (not always true) and
> > > that MMIO is non-encrypted (not always true)
> > 
> > tbf CC pte flags I just don't grok at all. I've once tried to understand
> > what current exporters and gpu drivers do and just gave up. But that's
> > also a bit why I'm worried here because it's an enigma to me.
> 
> For CC, inside the secure world, is some information if each PFN
> inside the VM is 'encrypted' or not. Any VM PTE (including the IOPTEs)
> pointing at the PFN must match the secure world's view of
> 'encrypted'. The VM can ask the secure world to change its view at
> runtime.
> 
> The way CC has been bolted on to the kernel so far laregly hides this
> from drivers, so it is troubled to tell in driver code if the PFN you
> have is 'encrypted' or not. Right now the general rule (that is not
> always true) is that struct page CPU memory is encrypted and
> everything else is decrypted.
> 
> So right now, you can mostly ignore it and the above assumption
> largely happens for you transparently.
> 
> However, soon we will have encrypted P2P MMIO which will stress this
> hiding strategy.

It's already breaking with stuff like virtual gpu drivers; vmwgfx is
fiddling around with these bits (at least as of the last time I tried to
understand this all), and I think a few others do too.

> > > > I thought iommuv2 (or whatever linux calls these) has full fault support
> > > > and could support current move semantics. But yeah for iommu without
> > > > fault support we need some kind of pin or a newly formalized revoke model.
> > > 
> > > No, this is HW dependent, including PCI device, and I'm aware of no HW
> > > that fully implements this in a way that could be useful to implement
> > > arbitary move semantics for VFIO..
> > 
> > Hm I thought we've had at least prototypes floating around of device fault
> > repair, but I guess that only works with ATS/pasid stuff and not general
> > iommu traffic from devices. Definitely needs some device cooperation since
> > the timeouts of a full fault are almost endless.
> 
> Yes, exactly. What all real devices I'm aware have done is make a
> subset of their traffic work with ATS and PRI, but not all their
> traffic. Without *all* traffic you can't make any generic assumption
> in the iommu that a transient non-present won't be fatal to the
> device.
> 
> Stuff like dmabuf move semantics rely on transient non-present being
> non-disruptive...

Ah, now I get it: at the iommu level you have to pessimistically assume
whether a device can handle a fault, and none can for all traffic. I was
thinking too much about the driver level, where on many devices the dma-bufs
you import are generally only used for the subset of device functions that
can cope with faults.

Cheers, Sima
Jason Gunthorpe Jan. 22, 2025, 1:28 p.m. UTC | #47
On Wed, Jan 22, 2025 at 12:04:19PM +0100, Simona Vetter wrote:

> I'm kinda leaning towards entirely separate dma-buf interfaces for the new
> phyr stuff, because I fear that adding that to the existing ones will only
> make the chaos worse. 

Let's see when some patches come up; if importers have to deal with
several formats, a single attach interface would probably be simpler
for them.

> For the existing dma-buf importers/exporters I'm kinda hoping for a pure
> dma_addr_t based list eventually. 

IMHO the least churn would be to have the old importers call a helper
(perhaps implicitly in the core code) to convert phyr into a dma
mapped scatterlist and then just keep using their existing code
exactly as is.

Performance improvement would come from importers switching to use the
new DMA API internally and avoiding the scatterlist allocation, but
even the extra allocation is not so bad.

Then, and I really hesitate to say this, but perhaps to ease
the migration we could store a dma-mapped list in a phyr using the
meta information. That would allow a dmabuf scatterlist exporter to be
converted to a phyr with a helper. The importer would have to have a
special path to detect the dma mapped mode and skip the normal
mapping.
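
A sketch of what the compatibility helper mentioned above could look like,
assuming a phyr that is just an array of addr/len entries and using
dma_map_resource() per range (struct physical_list and its fields are
assumptions, and a real helper would also need the CPU-memory case via
dma_map_page()):

    static int phyr_to_mapped_sgt(struct device *dev, struct physical_list *phyr,
                                  struct sg_table *sgt, enum dma_data_direction dir)
    {
            struct scatterlist *sg;
            unsigned int i;
            int ret;

            ret = sg_alloc_table(sgt, phyr->nr_entries, GFP_KERNEL);
            if (ret)
                    return ret;

            for_each_sgtable_sg(sgt, sg, i) {
                    struct phys_list_entry *e = &phyr->entries[i];

                    /* MMIO ranges only in this sketch */
                    sg_dma_address(sg) = dma_map_resource(dev, e->addr, e->len,
                                                          dir, 0);
                    sg_dma_len(sg) = e->len;
                    if (dma_mapping_error(dev, sg_dma_address(sg))) {
                            /* unwind already mapped entries here ... */
                            sg_free_table(sgt);
                            return -EIO;
                    }
            }
            return 0;
    }

Legacy importers would keep consuming the resulting sg_table unchanged.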

The point would be to let maintained drivers use the new data
structures and flows easily and still interoperate with older
drivers. Otherwise we have to somehow build parallel scatterlist and
phyr code flows into importers :\

> Going all the way to a phyr based
> approach for everyone might be too much churn, there's some real bad cruft
> there. It's not going to work for every case, but it covers a lot of them
> and might be less pain for existing importers.

I'd hope we can reach every case, but I don't know what kind of
horrors you guys have :)

> But in theory it should be possible to use phyr everywhere eventually, as
> long as there's no obviously api-rules-breaking way to go from a phyr back to
> a struct page even when that exists.

I'd say getting a struct page is a perfectly safe operation, from a
phyr API perspective.

The dmabuf issue seems to be entirely about following the locking
rules, an importer is not allowed to get a struct page and switch from
reservation locking to page refcount locking.

I get the appeal for DRM of blocking struct page use because that
directly prevents the above. But in RDMA I want to re-use the same
phyr data structure for tracking pin_user_pages() CPU memory, and I
must get the struct page back so I can put_user_page() it when done.

Perhaps we can find some compromise where the phyr data structure has
some kind of flag 'disable struct page' and then the
phyr_entry_to_page() API will WARN_ON() or something like that.
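
Something along those lines, purely as an illustration (the accessor, the
flag and struct phys_list_entry are all hypothetical):

    static inline struct page *phyr_entry_to_page(struct phys_list_entry *e)
    {
            unsigned long pfn = PHYS_PFN(e->addr);

            /* exporter opted out of struct page access, e.g. DRM */
            if (WARN_ON(e->flags & PHYS_ENTRY_NO_STRUCT_PAGE))
                    return NULL;
            if (!pfn_valid(pfn))
                    return NULL;    /* MMIO etc. has no struct page */
            return pfn_to_page(pfn);
    }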

> > However, soon we will have encrypted P2P MMIO which will stress this
> > hiding strategy.
> 
> It's already breaking with stuff like virtual gpu drivers, vmwgfx is
> fiddling around with these bits (at least last I tried to understand this
> all) and I think a few others do too.

IMHO, most places I've seen touching this outside of arch code feel very
hacky. :\

Jason
Christian König Jan. 22, 2025, 1:29 p.m. UTC | #48
Am 22.01.25 um 12:04 schrieb Simona Vetter:
> On Tue, Jan 21, 2025 at 01:36:33PM -0400, Jason Gunthorpe wrote:
>> On Tue, Jan 21, 2025 at 05:11:32PM +0100, Simona Vetter wrote:
>>> On Mon, Jan 20, 2025 at 03:48:04PM -0400, Jason Gunthorpe wrote:
>>>> On Mon, Jan 20, 2025 at 07:50:23PM +0100, Simona Vetter wrote:
>>>>> On Mon, Jan 20, 2025 at 01:59:01PM -0400, Jason Gunthorpe wrote:
>>>>>> On Mon, Jan 20, 2025 at 01:14:12PM +0100, Christian König wrote:
>>>>>> What is going wrong with your email? You replied to Simona, but
>>>>>> Simona Vetter <simona.vetter@ffwll.ch> is dropped from the To/CC
>>>>>> list??? I added the address back, but seems like a weird thing to
>>>>>> happen.
>>>>> Might also be funny mailing list stuff, depending how you get these. I
>>>>> read mails over lore and pretty much ignore cc (unless it's not also on
>>>>> any list, since those tend to be security issues) because I get cc'ed on
>>>>> way too much stuff for that to be a useful signal.
>>>> Oh I see, you are sending a Mail-followup-to header that excludes your
>>>> address, so you don't get any emails at all.. My mutt is dropping you
>>>> as well.

I'm having all kinds of funny phenomena with AMD's mail servers since
coming back from xmas vacation.

From the news it looks like Outlook on Windows has a new major security
issue where just viewing a mail can compromise the system, and my
educated guess is that our IT guys went into panic mode because of this
and have changed something.

>> [SNIP]
>> I have been assuming that dmabuf mmap remains unchanged, that
>> exporters will continue to implement that mmap() callback as today.

That sounds really really good to me because that was my major concern 
when you noted that you want to have PFNs to build up KVM page tables.

But you don't want to handle mmap() on your own; you basically don't want
to have a VMA for this stuff at all, correct?

>> My main interest has been what data structure is produced in the
>> attach APIs.
>>
>> Eg today we have a struct dma_buf_attachment that returns a sg_table.
>>
>> I'm expecting some kind of new data structure, lets call it "physical
>> list" that is some efficient coding of meta/addr/len tuples that works
>> well with the new DMA API. Matthew has been calling this thing phyr..

I would not use a data structure at all. Instead we should have 
something like an iterator/cursor based approach similar to what the new 
DMA API is doing.

>> So, I imagine, struct dma_buf_attachment gaining an optional
>> feature negotiation and then we have in dma_buf_attachment:
>>
>>          union {
>>                struct sg_table *sgt;
>> 	      struct physical_list *phyr;
>> 	};
>>
>> That's basicaly it, an alternative to scatterlist that has a clean
>> architecture.

I would rather suggest something like dma_buf_attachment() gets offset 
and size to map and returns a cursor object you can use to get your 
address, length and access attributes.

And then you can iterate over this cursor and fill in your importer data 
structure with the necessary information.

This way neither the exporter nor the importer need to convert their 
data back and forth between their specific representations of the 
information.
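
A rough sketch of such a cursor-style interface, with all names invented
just for the sake of discussion:

    /* hypothetical cursor interface, not existing dma-buf API */
    struct dma_buf_map_cursor {
            u64             addr;   /* current address, meaning depends on the addr space */
            size_t          len;    /* length of the current contiguous range */
            unsigned long   attrs;  /* caching, encryption, ... */
            /* private iteration state owned by the exporter */
    };

    int dma_buf_attachment_begin_map(struct dma_buf_attachment *attach,
                                     u64 offset, size_t size,
                                     struct dma_buf_map_cursor *cur);
    bool dma_buf_map_cursor_next(struct dma_buf_map_cursor *cur);

    /* importer side */
    struct dma_buf_map_cursor cur;

    if (!dma_buf_attachment_begin_map(attach, 0, size, &cur))
            while (dma_buf_map_cursor_next(&cur))
                    program_importer_hw(cur.addr, cur.len, cur.attrs);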

>> Now, if you are asking if the current dmabuf mmap callback can be
>> improved with the above? Maybe? phyr should have the neccessary
>> information inside it to populate a VMA - eventually even fully
>> correctly with all the right cachable/encrypted/forbidden/etc flags.

That won't work like this.

See, the exporter needs to be informed about page faults on the VMA so it
can eventually wait for operations to end and sync caches.

Otherwise we either potentially allow access to freed up or re-used 
memory or run into issues with device cache coherency.

>> So, you could imagine that exporters could just have one routine to
>> generate the phyr list and that goes into the attachment, goes into
>> some common code to fill VMA PTEs, and some other common code that
>> will convert it into the DMABUF scatterlist. If performance is not a
>> concern with these data structure conversions it could be an appealing
>> simplification.
>>
>> And yes, I could imagine the meta information being descriptive enough
>> to support the private interconnect cases, the common code could
>> detect private meta information and just cleanly fail.
> I'm kinda leaning towards entirely separate dma-buf interfaces for the new
> phyr stuff, because I fear that adding that to the existing ones will only
> make the chaos worse. But that aside sounds all reasonable, and even that
> could just be too much worry on my side and mixing phyr into existing
> attachments (with a pile of importer/exporter flags probably) is fine.

I lean in the other direction.

Dmitry and Thomas have done a really good job at cleaning up all the 
interaction between dynamic and static exporters / importers.

Especially now that we have consistent locking for map_dma_buf() and
unmap_dma_buf(), that transition should be rather straightforward.

> For the existing dma-buf importers/exporters I'm kinda hoping for a pure
> dma_addr_t based list eventually. Going all the way to a phyr based
> approach for everyone might be too much churn, there's some real bad cruft
> there. It's not going to work for every case, but it covers a lot of them
> and might be less pain for existing importers.

The point is we have use cases that won't work without exchanging DMA 
addresses any more.

For example, we have cases where multiple devices are in the same IOMMU
domain and re-use their DMA address mappings.

> But in theory it should be possible to use phyr everywhere eventually, as
> long as there's no obviously api-rules-breaking way to go from a phyr back to
> a struct page even when that exists.

I would rather say we should stick to DMA addresses as much as possible.

What we can do is add an address space description to the addresses,
e.g. whether it's a PCIe bus addr in IOMMU domain X, a device-private
bus addr, or - in the case of sharing with iommufd and KVM - a PFN.
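
For illustration, such an address space description could be as simple as
the following (names are made up, not a proposal of the exact API):

    enum dma_buf_addr_space {
            DMA_BUF_ADDR_DMA_API,           /* dma_addr_t for one specific device */
            DMA_BUF_ADDR_PCI_BUS,           /* PCIe bus address in IOMMU domain X */
            DMA_BUF_ADDR_DEVICE_PRIVATE,    /* private interconnect address */
            DMA_BUF_ADDR_PFN,               /* raw PFN, e.g. for iommufd/KVM */
    };

    struct dma_buf_address {
            u64                     addr;
            enum dma_buf_addr_space space;
    };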

Regards,
Christian.

>>> At least the device mapping / dma_buf_attachment
>>> side should be doable with just the pfn and the new dma-api?
>> Yes, that would be my first goal post. Figure out some meta
>> information and a container data structure that allows struct
>> page-less P2P mapping through the new DMA API.
>>
>>>> I'm hoping we can get to something where we describe not just how the
>>>> pfns should be DMA mapped, but also can describe how they should be
>>>> CPU mapped. For instance that this PFN space is always mapped
>>>> uncachable, in CPU and in IOMMU.
>>> I was pondering whether dma_mmap and friends would be a good place to
>>> prototype this and go for a fully generic implementation. But then even
>>> those have _wc/_uncached variants.
>> Given that the inability to correctly DMA map P2P MMIO without struct
>> page is a current pain point and current source of hacks in dmabuf
>> exporters, I wanted to make resolving that a priority.
>>
>> However, if you mean what I described above for "fully generic [dmabuf
>> mmap] implementation", then we'd have the phyr datastructure as a
>> dependency to attempt that work.
>>
>> phyr, and particularly the meta information, has a number of
>> stakeholders. I was thinking of going first with rdma's memory
>> registration flow because we are now pretty close to being able to do
>> such a big change, and it can demonstrate most of the requirements.
>>
>> But that doesn't mean mmap couldn't go concurrently on the same agreed
>> datastructure if people are interested.
> Yeah cpu mmap needs a lot more, going with a very limited p2p use-case
> first only makes sense.
>
>>>> We also have current bugs in the iommu/vfio side where we are fudging
>>>> CC stuff, like assuming CPU memory is encrypted (not always true) and
>>>> that MMIO is non-encrypted (not always true)
>>> tbf CC pte flags I just don't grok at all. I've once tried to understand
>>> what current exporters and gpu drivers do and just gave up. But that's
>>> also a bit why I'm worried here because it's an enigma to me.
>> For CC, inside the secure world, is some information if each PFN
>> inside the VM is 'encrypted' or not. Any VM PTE (including the IOPTEs)
>> pointing at the PFN must match the secure world's view of
>> 'encrypted'. The VM can ask the secure world to change its view at
>> runtime.
>>
>> The way CC has been bolted on to the kernel so far laregly hides this
>> from drivers, so it is troubled to tell in driver code if the PFN you
>> have is 'encrypted' or not. Right now the general rule (that is not
>> always true) is that struct page CPU memory is encrypted and
>> everything else is decrypted.
>>
>> So right now, you can mostly ignore it and the above assumption
>> largely happens for you transparently.
>>
>> However, soon we will have encrypted P2P MMIO which will stress this
>> hiding strategy.
> It's already breaking with stuff like virtual gpu drivers, vmwgfx is
> fiddling around with these bits (at least last I tried to understand this
> all) and I think a few others do too.
>
>>>>> I thought iommuv2 (or whatever linux calls these) has full fault support
>>>>> and could support current move semantics. But yeah for iommu without
>>>>> fault support we need some kind of pin or a newly formalized revoke model.
>>>> No, this is HW dependent, including PCI device, and I'm aware of no HW
>>>> that fully implements this in a way that could be useful to implement
>>>> arbitary move semantics for VFIO..
>>> Hm I thought we've had at least prototypes floating around of device fault
>>> repair, but I guess that only works with ATS/pasid stuff and not general
>>> iommu traffic from devices. Definitely needs some device cooperation since
>>> the timeouts of a full fault are almost endless.
>> Yes, exactly. What all real devices I'm aware have done is make a
>> subset of their traffic work with ATS and PRI, but not all their
>> traffic. Without *all* traffic you can't make any generic assumption
>> in the iommu that a transient non-present won't be fatal to the
>> device.
>>
>> Stuff like dmabuf move semantics rely on transient non-present being
>> non-disruptive...
> Ah now I get it, at the iommu level you have to pessimistically assume
> whether a device can handle a fault, and none can for all traffic. I was
> thinking too much about the driver level where generally the dma-buf you
> importer are only used for the subset of device functions that can cope
> with faults on many devices.
>
> Cheers, Sima
Jason Gunthorpe Jan. 22, 2025, 2:37 p.m. UTC | #49
On Wed, Jan 22, 2025 at 02:29:09PM +0100, Christian König wrote:

> I'm having all kind of funny phenomena with AMDs mail servers since coming
> back from xmas vacation.

:(

A few years back our IT fully migrated our email into the Office 365
cloud and gave up all the crazy half on-prem stuff they were
doing. The mail started working perfectly after that, as long as
you use MS's servers directly :\

> But you don't want to handle mmap() on your own, you basically don't want to
> have a VMA for this stuff at all, correct?

Right, we have no interest in mmap, VMAs or struct page in
rdma/kvm/iommu.
 
> > > My main interest has been what data structure is produced in the
> > > attach APIs.
> > > 
> > > Eg today we have a struct dma_buf_attachment that returns a sg_table.
> > > 
> > > I'm expecting some kind of new data structure, lets call it "physical
> > > list" that is some efficient coding of meta/addr/len tuples that works
> > > well with the new DMA API. Matthew has been calling this thing phyr..
> 
> I would not use a data structure at all. Instead we should have something
> like an iterator/cursor based approach similar to what the new DMA API is
> doing.

I'm certainly open to this idea. There may be some technical
challenges; it is a big change from scatterlist today, and
function-pointer-per-page sounds like bad performance if there are
a lot of pages.

RDMA would probably have to stuff this immediately into something like
a phyr anyhow, because it needs the full extent of the thing being mapped
to figure out what the HW page size and geometry should be - that
would be trivial though, and an RDMA problem.

> > > Now, if you are asking if the current dmabuf mmap callback can be
> > > improved with the above? Maybe? phyr should have the neccessary
> > > information inside it to populate a VMA - eventually even fully
> > > correctly with all the right cachable/encrypted/forbidden/etc flags.
> 
> That won't work like this.

Note I said "populate a VMA", ie a helper to build the VMA PTEs only.

> See the exporter needs to be informed about page faults on the VMA to
> eventually wait for operations to end and sync caches.

All of this would still have to be provided outside in the same way as
today.
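
Just to illustrate the scope I mean, a minimal sketch of such a
PTE-filling helper, assuming the hypothetical phyr carries enough meta
information to pick the pgprot (everything named phys_list/PHYS_ENTRY_*
here is an assumption):

    static int phyr_remap_vma(struct vm_area_struct *vma, struct physical_list *phyr)
    {
            unsigned long uaddr = vma->vm_start;
            unsigned int i;
            int ret;

            for (i = 0; i < phyr->nr_entries; i++) {
                    struct phys_list_entry *e = &phyr->entries[i];
                    pgprot_t prot = (e->flags & PHYS_ENTRY_UNCACHED) ?
                                    pgprot_noncached(vma->vm_page_prot) :
                                    vma->vm_page_prot;

                    ret = remap_pfn_range(vma, uaddr, PHYS_PFN(e->addr),
                                          e->len, prot);
                    if (ret)
                            return ret;
                    uaddr += e->len;
            }
            return 0;
    }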

> For example we have cases with multiple devices are in the same IOMMU domain
> and re-using their DMA address mappings.

IMHO this is just another flavour of "private" address flow between
two cooperating drivers.

It is not a "dma address" in the sense of a dma_addr_t that was output
from the DMA API. I think that subtle distinction is very
important. When I say pfn/dma address I'm really only talking about
standard DMA API flows, used by generic drivers.

IMHO, DMABUF needs a private address "escape hatch", and cooperating
drivers should do whatever they want when using that flow. The address
is *fully private*, so the co-operating drivers can do whatever they
want. iommu_map in exporter and pass an IOVA? Fine! pass a PFN and
iommu_map in the importer? Also fine! Private is private.

> > But in theory it should be possible to use phyr everywhere eventually, as
> > long as there's no obviously api-rules-breaking way to go from a phyr back to
> > a struct page even when that exists.
> 
> I would rather say we should stick to DMA addresses as much as possible.

I remain skeptical of this, aside from all the technical reasons I
already outlined.

I think it is too much work to have the exporters conditionally build
all sorts of different representations of the same thing depending on
the importer. Having a lot of DRM drivers generate both a PFN and a
DMA-mapped list in their export code doesn't sound very appealing to
me at all.

It makes sense that a driver would be able to conditionally generate
private and generic based on negotiation, but IMHO, not more than one
flavour of generic..

Jason
Christian König Jan. 22, 2025, 2:59 p.m. UTC | #50
Am 22.01.25 um 15:37 schrieb Jason Gunthorpe:
>>>> My main interest has been what data structure is produced in the
>>>> attach APIs.
>>>>
>>>> Eg today we have a struct dma_buf_attachment that returns a sg_table.
>>>>
>>>> I'm expecting some kind of new data structure, lets call it "physical
>>>> list" that is some efficient coding of meta/addr/len tuples that works
>>>> well with the new DMA API. Matthew has been calling this thing phyr..
>> I would not use a data structure at all. Instead we should have something
>> like an iterator/cursor based approach similar to what the new DMA API is
>> doing.
> I'm certainly open to this idea. There may be some technical
> challenges, it is a big change from scatterlist today, and
> function-pointer-per-page sounds like bad performance if there are
> alot of pages..
>
> RDMA would probably have to stuff this immediately into something like
> a phyr anyhow because it needs to fully extent the thing being mapped
> to figure out what the HW page size and geometry should be - that
> would be trivial though, and a RDMA problem.
>
>>>> Now, if you are asking if the current dmabuf mmap callback can be
>>>> improved with the above? Maybe? phyr should have the neccessary
>>>> information inside it to populate a VMA - eventually even fully
>>>> correctly with all the right cachable/encrypted/forbidden/etc flags.
>> That won't work like this.
> Note I said "populate a VMA", ie a helper to build the VMA PTEs only.
>
>> See the exporter needs to be informed about page faults on the VMA to
>> eventually wait for operations to end and sync caches.
> All of this would still have to be provided outside in the same way as
> today.
>
>> For example we have cases with multiple devices are in the same IOMMU domain
>> and re-using their DMA address mappings.
> IMHO this is just another flavour of "private" address flow between
> two cooperating drivers.

Well that's the point. The importer is not cooperating here.

The importer doesn't have the slightest idea that it is sharing its DMA
addresses with the exporter.

All the importer gets is: when you want to access this information, use
this address here.

> It is not a "dma address" in the sense of a dma_addr_t that was output
> from the DMA API. I think that subtle distinction is very
> important. When I say pfn/dma address I'm really only talking about
> standard DMA API flows, used by generic drivers.
>
> IMHO, DMABUF needs a private address "escape hatch", and cooperating
> drivers should do whatever they want when using that flow. The address
> is *fully private*, so the co-operating drivers can do whatever they
> want. iommu_map in exporter and pass an IOVA? Fine! pass a PFN and
> iommu_map in the importer? Also fine! Private is private.
>
>>> But in theory it should be possible to use phyr everywhere eventually, as
>>> long as there's no obviously api-rules-breaking way to go from a phyr back to
>>> a struct page even when that exists.
>> I would rather say we should stick to DMA addresses as much as possible.
> I remain skeptical of this.. Aside from all the technical reasons I
> already outlined..
>
> I think it is too much work to have the exporters conditionally build
> all sorts of different representations of the same thing depending on
> the importer. Like having alot of DRM drivers generate both a PFN and
> DMA mapped list in their export code doesn't sound very appealing to
> me at all.

Well, from experience I can say that it is actually the other way around.

We have a very limited number of exporters and a lot of different
importers. So having complexity in the exporter instead of the importer
is absolutely beneficial.

PFN is the special case; in other words, it is the private address being
passed around. And I will push hard to not support that in the DRM
drivers or in any DMA-buf heap.

> It makes sense that a driver would be able to conditionally generate
> private and generic based on negotiation, but IMHO, not more than one
> flavour of generic..

I still strongly think that the exporter should talk with the DMA API to
set up the access path for the importer and *not* the importer directly.

Regards,
Christian.

>
> Jason
Jason Gunthorpe Jan. 23, 2025, 1:59 p.m. UTC | #51
On Wed, Jan 22, 2025 at 03:59:11PM +0100, Christian König wrote:
> > > For example we have cases with multiple devices are in the same IOMMU domain
> > > and re-using their DMA address mappings.
> > IMHO this is just another flavour of "private" address flow between
> > two cooperating drivers.
> 
> Well that's the point. The inporter is not cooperating here.

If the private address relies on a shared iommu_domain controlled by
the driver, then yes, the importer MUST be cooperating. For instance,
if you send the same private address into RDMA it will explode because
it doesn't have any notion of shared iommu_domain mappings, and it
certainly doesn't set up any such shared domains.

> The importer doesn't have the slightest idea that he is sharing it's DMA
> addresses with the exporter.

Of course it does. The importer driver would have had to explicitly
set this up! The normal kernel behavior is that all drivers get
private iommu_domains controlled by the DMA API. If your driver is
doing something else *it did it deliberately*.

Some of that mess in tegra host1x around this area is not well
structured; it should not be implicitly setting up domains for
drivers. It is old code that hasn't been updated to use the new iommu
subsystem approach for driver-controlled non-DMA API domains.

The new iommu architecture has the probing driver disable the DMA API,
and it can then manipulate its iommu domain however it likes, safely. Ie
the probing driver is aware of and participating in disabling the DMA
API.

Again, either you are using the DMA API and you work in generic ways
with generic devices or it is "private" and only co-operating drivers
can interwork with private addresses. A private address must not ever
be sent to a DMA API using driver and vice versa.

IMHO this is an important architecture point and why Christoph was
frowning on abusing dma_addr_t to represent things that did NOT come
out of the DMA API.

> We have a very limited number of exporters and a lot of different importers.
> So having complexity in the exporter instead of the importer is absolutely
> beneficial.

Isn't every DRM driver both an importer and exporter? That is what I
was expecting at least.

> I still strongly think that the exporter should talk with the DMA API to
> setup the access path for the importer and *not* the importer directly.

It is contrary to the design of the new API which wants to co-optimize
mapping and HW setup together as one unit.

For instance in RDMA we want to hint and control the way the IOMMU
mapping works in the DMA API to optimize the RDMA HW side. I can't do
those optimizations if I'm not in control of the mapping.

The same is probably true on the GPU side too, you want IOVAs that
have tidy alignment with your PTE structure, but only the importer
understands its own HW to make the correct hints to the DMA API.

Jason
Christian König Jan. 23, 2025, 2:35 p.m. UTC | #52
Sending it as text mail once more.

Am 23.01.25 um 15:32 schrieb Christian König:
> Am 23.01.25 um 14:59 schrieb Jason Gunthorpe:
>> On Wed, Jan 22, 2025 at 03:59:11PM +0100, Christian König wrote:
>>>>> For example we have cases with multiple devices are in the same IOMMU domain
>>>>> and re-using their DMA address mappings.
>>>> IMHO this is just another flavour of "private" address flow between
>>>> two cooperating drivers.
>>> Well that's the point. The inporter is not cooperating here.
>> If the private address relies on a shared iommu_domain controlled by
>> the driver, then yes, the importer MUST be cooperating. For instance,
>> if you send the same private address into RDMA it will explode because
>> it doesn't have any notion of shared iommu_domain mappings, and it
>> certainly doesn't setup any such shared domains.
>
> Hui? Why the heck should a driver own its iommu domain?
>
> The domain is owned and assigned by the PCI subsystem under Linux.
>
>>> The importer doesn't have the slightest idea that he is sharing it's DMA
>>> addresses with the exporter.
>> Of course it does. The importer driver would have had to explicitly
>> set this up! The normal kernel behavior is that all drivers get
>> private iommu_domains controled by the DMA API. If your driver is
>> doing something else *it did it deliberately*.
>
> As far as I know that is simply not correct. Currently IOMMU 
> domains/groups are usually shared between devices.
>
> Especially multi function devices get only a single IOMMU domain.
>
>> Some of that mess in tegra host1x around this area is not well
>> structured, it should not be implicitly setting up domains for
>> drivers. It is old code that hasn't been updated to use the new iommu
>> subsystem approach for driver controled non-DMA API domains.
>>
>> The new iommu architecture has the probing driver disable the DMA API
>> and can then manipulate its iommu domain however it likes, safely. Ie
>> the probing driver is aware and particiapting in disabling the DMA
>> API.
>
> Why the heck should we do this?
>
> That drivers manage all of that on their own sounds like a massive 
> step in the wrong direction.
>
>> Again, either you are using the DMA API and you work in generic ways
>> with generic devices or it is "private" and only co-operating drivers
>> can interwork with private addresses. A private address must not ever
>> be sent to a DMA API using driver and vice versa.
>>
>> IMHO this is an important architecture point and why Christoph was
>> frowning on abusing dma_addr_t to represent things that did NOT come
>> out of the DMA API.
>>
>>> We have a very limited number of exporters and a lot of different importers.
>>> So having complexity in the exporter instead of the importer is absolutely
>>> beneficial.
>> Isn't every DRM driver both an importer and exporter? That is what I
>> was expecting at least..
>>
>>> I still strongly think that the exporter should talk with the DMA API to
>>> setup the access path for the importer and *not* the importer directly.
>> It is contrary to the design of the new API which wants to co-optimize
>> mapping and HW setup together as one unit.
>
> Yeah, and I'm really questioning this design goal. That sounds like 
> totally going in the wrong direction just because of the RDMA drivers.
>
>> For instance in RDMA we want to hint and control the way the IOMMU
>> mapping works in the DMA API to optimize the RDMA HW side. I can't do
>> those optimizations if I'm not in control of the mapping.
>
> Why? What is the technical background here?
>
>> The same is probably true on the GPU side too, you want IOVAs that
>> have tidy alignment with your PTE structure, but only the importer
>> understands its own HW to make the correct hints to the DMA API.
>
> Yeah but then express those as requirements to the DMA API and not 
> move all the important decisions into the driver where they are 
> implemented over and over again and potentially broken half the time.
>
> See drivers are supposed to be simple, small and stupid. They should 
> be controlled by the core OS and not allowed to do whatever they want.
>
> Driver developers cannot be trusted to always get everything right if 
> you make it as complicated as this.
>
> Regards,
> Christian.
>
>> Jason
>
Jason Gunthorpe Jan. 23, 2025, 3:02 p.m. UTC | #53
On Thu, Jan 23, 2025 at 03:35:21PM +0100, Christian König wrote:
> Sending it as text mail once more.
> 
> Am 23.01.25 um 15:32 schrieb Christian König:
> > Am 23.01.25 um 14:59 schrieb Jason Gunthorpe:
> > > On Wed, Jan 22, 2025 at 03:59:11PM +0100, Christian König wrote:
> > > > > > For example we have cases with multiple devices are in the same IOMMU domain
> > > > > > and re-using their DMA address mappings.
> > > > > IMHO this is just another flavour of "private" address flow between
> > > > > two cooperating drivers.
> > > > Well that's the point. The inporter is not cooperating here.
> > > If the private address relies on a shared iommu_domain controlled by
> > > the driver, then yes, the importer MUST be cooperating. For instance,
> > > if you send the same private address into RDMA it will explode because
> > > it doesn't have any notion of shared iommu_domain mappings, and it
> > > certainly doesn't setup any such shared domains.
> > 
> > Hui? Why the heck should a driver own it's iommu domain?

I don't know; you are the one saying the drivers have special shared
iommu_domains, so DMA BUF needs some special design to accommodate it.

I'm aware that DRM drivers do directly call into the iommu subsystem
and do directly manage their own IOVA. I assumed this is what you were
talking about. See below.

> > The domain is owned and assigned by the PCI subsystem under Linux.

That domain is *exclusively* owned by the DMA API and is only accessed
via maps created by DMA API calls.

If you are using the DMA API correctly then all of this is abstracted
and none of it matters to you. There is no concept of "shared domains"
in the DMA API. 

You call the DMA API, you get a dma_addr_t that is valid for a
*single* device, you program it in HW. That is all. There is no reason
to dig deeper than this.
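
Spelled out with the existing streaming DMA API (nothing new here, just
the normal contract):

    dma_addr_t handle = dma_map_single(dev, buf, size, DMA_TO_DEVICE);

    if (dma_mapping_error(dev, handle))
            return -ENOMEM;
    /* 'handle' may only be programmed into dev's HW, never another device's */
    dma_unmap_single(dev, handle, size, DMA_TO_DEVICE);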
 
> > > > The importer doesn't have the slightest idea that he is sharing it's DMA
> > > > addresses with the exporter.
> > > Of course it does. The importer driver would have had to explicitly
> > > set this up! The normal kernel behavior is that all drivers get
> > > private iommu_domains controled by the DMA API. If your driver is
> > > doing something else *it did it deliberately*.
> > 
> > As far as I know that is simply not correct. Currently IOMMU
> > domains/groups are usually shared between devices.

No, the opposite. The iommu subsystem tries to maximally isolate
devices up to the HW limit.

On server platforms every device is expected to get its own iommu domain.

> > Especially multi function devices get only a single IOMMU domain.

Only if the PCI HW doesn't support ACS.

This is all DMA API internal details you shouldn't even be talking
about at the DMA BUF level. It is all hidden and simply does not
matter to DMA BUF at all.

> > > The new iommu architecture has the probing driver disable the DMA API
> > > and can then manipulate its iommu domain however it likes, safely. Ie
> > > the probing driver is aware and particiapting in disabling the DMA
> > > API.
> > 
> > Why the heck should we do this?
> > 
> > That drivers manage all of that on their own sounds like a massive step
> > in the wrong direction.

I am talking about DRM drivers that HAVE to manage their own iommu domains
for some reason I don't know, e.g.:

drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c:             tdev->iommu.domain = iommu_domain_alloc(&platform_bus_type);
drivers/gpu/drm/msm/msm_iommu.c:        domain = iommu_paging_domain_alloc(dev);
drivers/gpu/drm/rockchip/rockchip_drm_drv.c:    private->domain = iommu_paging_domain_alloc(private->iommu_dev);
drivers/gpu/drm/tegra/drm.c:            tegra->domain = iommu_paging_domain_alloc(dma_dev);
drivers/gpu/host1x/dev.c:               host->domain = iommu_paging_domain_alloc(host->dev);

Normal simple drivers should never be calling these functions!

If you are calling these functions you are not using the DMA API, and,
yes, some cases like tegra n1x are actively sharing these special
domains across multiple devices and drivers.

If you want to pass an IOVA in one of these special driver-created
domains then it would be some private address in DMABUF that only
works on drivers that have understood they attached to these manually
created domains. No DMA API involvement here.

> > > > I still strongly think that the exporter should talk with the DMA API to
> > > > setup the access path for the importer and *not* the importer directly.
> > > It is contrary to the design of the new API which wants to co-optimize
> > > mapping and HW setup together as one unit.
> > 
> > Yeah and I'm really questioning this design goal. That sounds like
> > totally going into the wrong direction just because of the RDMA
> > drivers.

Actually it is storage that motivates this. It is just pointless to
allocate a dma_addr_t list in the fast path when you don't need
it. You can stream the dma_addr_t directly into HW structures that are
necessary and already allocated.

> > > For instance in RDMA we want to hint and control the way the IOMMU
> > > mapping works in the DMA API to optimize the RDMA HW side. I can't do
> > > those optimizations if I'm not in control of the mapping.
> > 
> > Why? What is the technical background here?

dma-iommu.c chooses an IOVA alignment based on its own reasoning that
is not always compatible with the HW. The HW can optimize if the IOVA
alignment meets certain restrictions. Much like page tables in a GPU.

> > > The same is probably true on the GPU side too, you want IOVAs that
> > > have tidy alignment with your PTE structure, but only the importer
> > > understands its own HW to make the correct hints to the DMA API.
> > 
> > Yeah but then express those as requirements to the DMA API and not move
> > all the important decisions into the driver where they are implemented
> > over and over again and potentially broken halve the time.

It would be in the DMA API, just the per-mapping portion of the API.

Same as the multipath, the ATS, and more. These are all per-mapping
decisions of the executing HW, not global decisions or something
like that.

Jason
Jason Gunthorpe Jan. 23, 2025, 4:08 p.m. UTC | #54
On Thu, Jan 23, 2025 at 04:48:29PM +0100, Christian König wrote:
>    No, no there are much more cases where drivers simply assume that they
>    are in the same iommu domain for different devices. 

This is an illegal assumption and an invalid way to use the DMA API. Do
not do that; do not architect things in DMABUF to permit that.

The dma_addr_t out of the DMA API is only usable by the device passed
in, period full stop. If you want to use it with two devices then call
the DMA API twice.

>    E.g. that different
>    PCI endpoints can use the same dma_addr_t.

>    For example those classic sound devices for HDMI audio on graphics
>    cards work like this. 
>    In other words if the device handled by the generic ALSA driver and the
>    GPU are not in the same iommu domain you run into trouble.

Yes, I recall this weird AMD issue as well. IIRC the solution is not
clean or "correct". :( I vaguely recall it was caused by a HW bug...

>    Well it might never been documented but I know of quite a bunch of
>    different cases that assume that a DMA addr will just ultimately work
>    for some other device/driver as well.

Again, illegal assumption, breaks the abstraction.

>> This is all DMA API internal details you shouldn't even be talking
>> about at the DMA BUF level. It is all hidden and simply does not
>> matter to DMA BUF at all.
> 
>    Well we somehow need to support the existing use cases with the new
>    API.

Call the DMA API multiple times, once per device. That is the only
correct way to handle this today. DMABUF is already architected like
this: each and every attach should do a DMA mapping and generate a
scatterlist for its unique importing device.
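
A hedged importer-side sketch of that architecture (the helper name is
hypothetical, error unwinding trimmed): every importing device creates
its own attachment and gets its own scatterlist.

	static struct sg_table *import_for_device(struct dma_buf *dmabuf,
						  struct device *dev,
						  struct dma_buf_attachment **a)
	{
		struct dma_buf_attachment *attach;
		struct sg_table *sgt;

		attach = dma_buf_attach(dmabuf, dev);
		if (IS_ERR(attach))
			return ERR_CAST(attach);

		sgt = dma_buf_map_attachment_unlocked(attach, DMA_BIDIRECTIONAL);
		if (IS_ERR(sgt)) {
			dma_buf_detach(dmabuf, attach);
			return sgt;
		}

		*a = attach;
		return sgt;	/* addresses in here are valid for @dev only */
	}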

Improving it to somehow avoid the redundant DMA API map would require
new DMA API work.

Do NOT randomly assume that devices share a dma_addr_t; there is no
architected way to ever discover this, and it is a complete violation
of all the API abstractions.

>> If you want to pass an IOVA in one of these special driver-created
>> domains then it would be some private address in DMABUF that only
>> works on drivers that have understood they attached to these manually
>> created domains. No DMA API involvement here.
> 
>    That won't fly like this. That would break at least the ALSA use case
>    and potentially quite a bunch of others.

Your AMD ALSA weirdness is not using custom iommu_domains (nor should
it); it is a different problem.

> dma-iommu.c chooses an IOVA alignment based on its own reasoning that
> is not always compatible with the HW. The HW can optimize if the IOVA
> alignment meets certain restrictions. Much like page tables in a GPU.
> 
>    Yeah, but why can't we tell the DMA API those restrictions instead of
>    letting the driver manage the address space itself?

How do you propose to do this per-mapping operation without having the
HW driver actually call the mapping operation?

> > Same as the multipath, the ATS, and more. It is all per-mapping
> > decisions of the executing HW, not global decisions or something
> > like that.
> 
>    So the DMA API has some structure or similar to describe the necessary
>    per-mapping properties?

Not fully yet (though some multipath is supported), but I want to
slowly move in this direction to solve all of these problems we
have :(

Jason
diff mbox series

Patch

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 7eeee3a38202..83d1448b6dcc 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -630,10 +630,10 @@  struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
 	size_t alloc_size = sizeof(struct dma_buf);
 	int ret;
 
-	if (WARN_ON(!exp_info->priv || !exp_info->ops
-		    || !exp_info->ops->map_dma_buf
-		    || !exp_info->ops->unmap_dma_buf
-		    || !exp_info->ops->release))
+	if (WARN_ON(!exp_info->priv || !exp_info->ops ||
+		    (!!exp_info->ops->map_dma_buf != !!exp_info->ops->unmap_dma_buf) ||
+		    (!exp_info->ops->map_dma_buf && !exp_info->ops->get_pfn) ||
+		    !exp_info->ops->release))
 		return ERR_PTR(-EINVAL);
 
 	if (WARN_ON(exp_info->ops->cache_sgt_mapping &&
@@ -909,7 +909,10 @@  dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
 	struct dma_buf_attachment *attach;
 	int ret;
 
-	if (WARN_ON(!dmabuf || !dev))
+	if (WARN_ON(!dmabuf))
+		return ERR_PTR(-EINVAL);
+
+	if (WARN_ON(dmabuf->ops->map_dma_buf && !dev))
 		return ERR_PTR(-EINVAL);
 
 	if (WARN_ON(importer_ops && !importer_ops->move_notify))
@@ -941,7 +944,7 @@  dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
 	 */
 	if (dma_buf_attachment_is_dynamic(attach) !=
 	    dma_buf_is_dynamic(dmabuf)) {
-		struct sg_table *sgt;
+		struct sg_table *sgt = NULL;
 
 		dma_resv_lock(attach->dmabuf->resv, NULL);
 		if (dma_buf_is_dynamic(attach->dmabuf)) {
@@ -950,13 +953,16 @@  dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
 				goto err_unlock;
 		}
 
-		sgt = __map_dma_buf(attach, DMA_BIDIRECTIONAL);
-		if (!sgt)
-			sgt = ERR_PTR(-ENOMEM);
-		if (IS_ERR(sgt)) {
-			ret = PTR_ERR(sgt);
-			goto err_unpin;
+		if (dmabuf->ops->map_dma_buf) {
+			sgt = __map_dma_buf(attach, DMA_BIDIRECTIONAL);
+			if (!sgt)
+				sgt = ERR_PTR(-ENOMEM);
+			if (IS_ERR(sgt)) {
+				ret = PTR_ERR(sgt);
+				goto err_unpin;
+			}
 		}
+
 		dma_resv_unlock(attach->dmabuf->resv);
 		attach->sgt = sgt;
 		attach->dir = DMA_BIDIRECTIONAL;
@@ -1119,7 +1125,8 @@  struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *attach,
 
 	might_sleep();
 
-	if (WARN_ON(!attach || !attach->dmabuf))
+	if (WARN_ON(!attach || !attach->dmabuf ||
+		    !attach->dmabuf->ops->map_dma_buf))
 		return ERR_PTR(-EINVAL);
 
 	dma_resv_assert_held(attach->dmabuf->resv);
@@ -1195,7 +1202,8 @@  dma_buf_map_attachment_unlocked(struct dma_buf_attachment *attach,
 
 	might_sleep();
 
-	if (WARN_ON(!attach || !attach->dmabuf))
+	if (WARN_ON(!attach || !attach->dmabuf ||
+		    !attach->dmabuf->ops->map_dma_buf))
 		return ERR_PTR(-EINVAL);
 
 	dma_resv_lock(attach->dmabuf->resv, NULL);
@@ -1222,7 +1230,8 @@  void dma_buf_unmap_attachment(struct dma_buf_attachment *attach,
 {
 	might_sleep();
 
-	if (WARN_ON(!attach || !attach->dmabuf || !sg_table))
+	if (WARN_ON(!attach || !attach->dmabuf ||
+		    !attach->dmabuf->ops->unmap_dma_buf || !sg_table))
 		return;
 
 	dma_resv_assert_held(attach->dmabuf->resv);
@@ -1254,7 +1263,8 @@  void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
 {
 	might_sleep();
 
-	if (WARN_ON(!attach || !attach->dmabuf || !sg_table))
+	if (WARN_ON(!attach || !attach->dmabuf ||
+		    !attach->dmabuf->ops->unmap_dma_buf || !sg_table))
 		return;
 
 	dma_resv_lock(attach->dmabuf->resv, NULL);
@@ -1263,6 +1273,52 @@  void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
 }
 EXPORT_SYMBOL_NS_GPL(dma_buf_unmap_attachment_unlocked, "DMA_BUF");
 
+/**
+ * dma_buf_get_pfn_unlocked -
+ * @attach:	[in]	attachment to get pfn from
+ * @pgoff:	[in]	page offset of the buffer against the start of dma_buf
+ * @pfn:	[out]	returns the pfn of the buffer
+ * @max_order	[out]	returns the max mapping order of the buffer
+ */
+int dma_buf_get_pfn_unlocked(struct dma_buf_attachment *attach,
+			     pgoff_t pgoff, u64 *pfn, int *max_order)
+{
+	struct dma_buf *dmabuf = attach->dmabuf;
+	int ret;
+
+	if (WARN_ON(!attach || !attach->dmabuf ||
+		    !attach->dmabuf->ops->get_pfn))
+		return -EINVAL;
+
+	/*
+	 * Open:
+	 *
+	 * When dma_buf is dynamic but dma_buf move is disabled, the buffer
+	 * should be pinned before use, See dma_buf_map_attachment() for
+	 * reference.
+	 *
+	 * But for now no pin is intended inside dma_buf_get_pfn(), otherwise
+	 * need another API to unpin the dma_buf. So just fail out this case.
+	 */
+	if (dma_buf_is_dynamic(attach->dmabuf) &&
+	    !IS_ENABLED(CONFIG_DMABUF_MOVE_NOTIFY))
+		return -ENOENT;
+
+	dma_resv_lock(attach->dmabuf->resv, NULL);
+	ret = dmabuf->ops->get_pfn(attach, pgoff, pfn, max_order);
+	/*
+	 * Open:
+	 *
+	 * Is dma_resv_wait_timeout() needed? I assume no. The DMA buffer
+	 * content synchronization could be done when the buffer is to be
+	 * mapped by importer.
+	 */
+	dma_resv_unlock(attach->dmabuf->resv);
+
+	return ret;
+}
+EXPORT_SYMBOL_NS_GPL(dma_buf_get_pfn_unlocked, "DMA_BUF");
+
 /**
  * dma_buf_move_notify - notify attachments that DMA-buf is moving
  *
@@ -1662,7 +1718,7 @@  static int dma_buf_debug_show(struct seq_file *s, void *unused)
 		attach_count = 0;
 
 		list_for_each_entry(attach_obj, &buf_obj->attachments, node) {
-			seq_printf(s, "\t%s\n", dev_name(attach_obj->dev));
+			seq_printf(s, "\t%s\n", attach_obj->dev ? dev_name(attach_obj->dev) : NULL);
 			attach_count++;
 		}
 		dma_resv_unlock(buf_obj->resv);
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 36216d28d8bd..b16183edfb3a 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -194,6 +194,17 @@  struct dma_buf_ops {
 	 * if the call would block.
 	 */
 
+	/**
+	 * @get_pfn:
+	 *
+	 * This is called by dma_buf_get_pfn(). It is used to get the pfn
+	 * of the buffer positioned by the page offset against the start of
+	 * the dma_buf. It can only be called if @attach has been called
+	 * successfully.
+	 */
+	int (*get_pfn)(struct dma_buf_attachment *attach, pgoff_t pgoff,
+		       u64 *pfn, int *max_order);
+
 	/**
 	 * @release:
 	 *
@@ -629,6 +640,8 @@  dma_buf_map_attachment_unlocked(struct dma_buf_attachment *attach,
 void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
 				       struct sg_table *sg_table,
 				       enum dma_data_direction direction);
+int dma_buf_get_pfn_unlocked(struct dma_buf_attachment *attach,
+			     pgoff_t pgoff, u64 *pfn, int *max_order);
 
 int dma_buf_mmap(struct dma_buf *, struct vm_area_struct *,
 		 unsigned long);
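
For readers following the series, a hedged sketch of how an importer
might consume the new callback (install_into_private_table() is a
hypothetical placeholder for the importer's own mapping code, e.g. an
IOMMU page table or KVM MMU update; move notification handling is
omitted):

	static int importer_map_one_page(struct dma_buf_attachment *attach,
					 pgoff_t pgoff)
	{
		u64 pfn;
		int max_order;
		int ret;

		/* attach was set up earlier via dma_buf_dynamic_attach() */
		ret = dma_buf_get_pfn_unlocked(attach, pgoff, &pfn, &max_order);
		if (ret)
			return ret;

		/*
		 * The importer installs the pfn into its own address space,
		 * optionally using max_order to pick a larger mapping size.
		 */
		return install_into_private_table(pfn, max_order);
	}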