Message ID | 20250107142719.179636-2-yilun.xu@linux.intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | Private MMIO support for private assigned dev |
On Thu, Jan 16, 2025 at 06:33:48AM +0100, Christoph Hellwig wrote: > On Wed, Jan 15, 2025 at 09:34:19AM -0400, Jason Gunthorpe wrote: > > > Or do you mean some that don't have pages associated with them, and > > > thus have pfn_valid fail on them? They still have a PFN, just not > > > one that is valid to use in most of the Linux MM. > > > > He is talking about private interconnect hidden inside clusters of > > devices. > > > > Ie the system may have many GPUs and those GPUs have their own private > > interconnect between them. It is not PCI, and packets don't transit > > through the CPU SOC at all, so the IOMMU is not involved. > > > > DMA can happen on that private interconnect, but from a Linux > > perspective it is not DMA API DMA, and the addresses used to describe > > it are not part of the CPU address space. The initiating device will > > have a way to choose which path the DMA goes through when setting up > > the DMA. > > So how is this in any way relevant to dma_buf which operates on > a dma_addr_t right now and thus by definition can't be used for > these? One part of dma-buf is the fd-based mechanism for exporters to share buffers across devices & subsystems, while still having the buffer's lifetime controlled by the exporter. Another part is the way the importers use the buffer (i.e. the dma_addr_t based kAPI). The fd-based exporting mechanism is what we want to reuse for buffer sharing. But we are pursuing some extended buffer sharing and DMA usage which doesn't fit into the existing DMA API mapping, e.g. GPA/userspace IOVA access, or IIUC other side-channel DMA, multi-channel DMA ... Thanks, Yilun
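For reference, the two halves Yilun distinguishes can be sketched against the existing upstream dma-buf API. This is only a sketch: struct my_buffer, my_exporter_ops and the function names are hypothetical, and a real exporter also has to implement the dma_buf_ops callbacks that are assumed here.

#include <linux/dma-buf.h>
#include <linux/dma-direction.h>
#include <linux/err.h>
#include <linux/fcntl.h>

struct my_buffer {
        size_t size;
        /* hypothetical backing store, not shown */
};

/* hypothetical dma_buf_ops (release, map_dma_buf, ...), defined elsewhere */
extern const struct dma_buf_ops my_exporter_ops;

/* Exporter side: wrap a driver buffer in a dma-buf and hand out an fd.
 * The exporter keeps controlling the buffer's lifetime through its ops. */
static int my_export_as_fd(struct my_buffer *buf)
{
        DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
        struct dma_buf *dmabuf;

        exp_info.ops = &my_exporter_ops;
        exp_info.size = buf->size;
        exp_info.flags = O_RDWR;
        exp_info.priv = buf;

        dmabuf = dma_buf_export(&exp_info);
        if (IS_ERR(dmabuf))
                return PTR_ERR(dmabuf);

        return dma_buf_fd(dmabuf, O_CLOEXEC);
}

/* Importer side, classic dma_addr_t kAPI: attach with a struct device and
 * let the exporter hand back DMA addresses in an sg_table. */
static struct sg_table *my_import(int fd, struct device *dev,
                                  struct dma_buf_attachment **out_attach)
{
        struct dma_buf *dmabuf = dma_buf_get(fd);
        struct dma_buf_attachment *attach;
        struct sg_table *sgt;

        if (IS_ERR(dmabuf))
                return ERR_CAST(dmabuf);

        attach = dma_buf_attach(dmabuf, dev);
        if (IS_ERR(attach)) {
                dma_buf_put(dmabuf);
                return ERR_CAST(attach);
        }

        sgt = dma_buf_map_attachment_unlocked(attach, DMA_BIDIRECTIONAL);
        if (IS_ERR(sgt)) {
                dma_buf_detach(dmabuf, attach);
                dma_buf_put(dmabuf);
                return sgt;
        }

        *out_attach = attach;
        /* the device is programmed with sg_dma_address()/sg_dma_len() */
        return sgt;
}

The first half (export + fd, lifetime owned by the exporter) is what this series wants to reuse; the second half is the dma_addr_t based import path that the new GPA/userspace-IOVA use cases do not fit into.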
On Thu, Jan 16, 2025 at 04:13:13PM +0100, Christian König wrote: > Am 15.01.25 um 18:09 schrieb Jason Gunthorpe: > > On Wed, Jan 15, 2025 at 05:34:23PM +0100, Christian König wrote: > > Granted, let me try to improve this. > Here is a real world example of one of the issues we ran into and why > CPU mappings of importers are redirected to the exporter. > We have a good bunch of different exporters who track the CPU mappings > of their backing store using address_space objects in one way or > another and then use unmap_mapping_range() to invalidate those CPU > mappings. > But when importers get the PFNs of the backing store they can look > behind the curtain and directly insert this PFN into the CPU page > tables. > We had literally tons of cases like this where driver developers caused > access-after-free issues because the importer created CPU mappings on > their own without the exporter knowing about it. > This is just one example of what we ran into. In addition to that, > basically the whole synchronization between drivers was overhauled as > well because we found that we can't trust importers to always do the > right thing. > > But this, fundamentally, is importers creating attachments and then > *ignoring the lifetime rules of DMABUF*. If you created an attachment, > got a move and *ignored the move* because you put the PFN in your own > VMA, then you are not following the attachment lifetime rules! > > Move notify is solely for informing the importer that they need to > refresh their DMA mappings and eventually block for ongoing DMA to end. > > These semantics don't work well for CPU mappings because you need to hold > the reservation lock to make sure that the information stays valid and you > can't hold a lock while returning from a page fault. Dealing with CPU mapping and resource invalidation is a little hard, but it is resolvable by using other types of locks. And I guess for now dma-buf exporters should always handle this CPU mapping vs. invalidation contention if they support mmap(). Since it is resolvable, with some invalidation notification a decent importer could also handle the contention well. IIUC the only concern now is that importer device drivers are more likely to do something wrong, so CPU mapping is moved to the exporter. But most of the exporters are also device drivers, so why would they be smarter? And there are increasing mapping needs; today exporters help handle the primary CPU mapping, tomorrow should they also help with all other mappings? Clearly that is not feasible. So maybe conditionally give trust to some importers. Thanks, Yilun
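The invalidation pattern Christian describes boils down to the exporter zapping the CPU page tables itself. A minimal sketch, assuming a hypothetical struct my_buf that caches the buffer's address_space and size (unmap_mapping_range() is the existing kernel primitive):

#include <linux/fs.h>
#include <linux/mm.h>

struct my_buf {
        struct address_space *mapping;  /* where the exporter's mmap() lives */
        loff_t size;
};

/* Exporter-side invalidation: zap every CPU PTE that maps this buffer.
 * Later CPU faults come back through the exporter's own fault handler,
 * which can re-resolve (or refuse) them against the new backing store. */
static void my_exporter_zap_cpu_mappings(struct my_buf *buf)
{
        /* even_cows = 1: also drop private COW copies of the pages */
        unmap_mapping_range(buf->mapping, 0, buf->size, 1);
}

Because only the exporter knows about this address_space, an importer that inserts the PFN into its own VMA bypasses exactly this teardown path, which is the access-after-free scenario described above.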
Am 07.01.25 um 15:27 schrieb Xu Yilun: > Introduce a new API for dma-buf importer, also add a dma_buf_ops > callback for dma-buf exporter. This API is for subsystem importers who > map the dma-buf to some user defined address space, e.g. for IOMMUFD to > map the dma-buf to userspace IOVA via IOMMU page table, or for KVM to > map the dma-buf to GPA via KVM MMU (e.g. EPT). > > Currently dma-buf is only used to get DMA address for device's default > domain by using kernel DMA APIs. But for these new use-cases, importers > only need the pfn of the dma-buf resource to build their own mapping > tables. As far as I can see I have to fundamentally reject this whole approach. It's intentional DMA-buf design that we don't expose struct pages nor PFNs to the importer. Essentially DMA-buf only transports DMA addresses. In other words the mapping is done by the exporter and *not* the importer. What we certainly can do is to annotate those DMA addresses to better specify in which domain they are applicable, e.g. if they are PCIe bus addresses or some inter device bus addresses etc... But moving the functionality to map the pages/PFNs to DMA addresses into the importer is an absolutely clear NO-GO. Regards, Christian. > So the map_dma_buf() callback is not mandatory for exporters > anymore. Also the importers could choose not to provide > struct device *dev on dma_buf_attach() if they don't call > dma_buf_map_attachment(). > > Like dma_buf_map_attachment(), the importer should first call > dma_buf_attach/dynamic_attach() then call dma_buf_get_pfn_unlocked(). > If the importer chooses to do dynamic attach, it also should handle the > dma-buf move notification. > > Only the unlocked version of dma_buf_get_pfn is implemented for now, > just because no locked version is used for now. > > Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com> > > --- > IIUC, Only get_pfn() is needed but no put_pfn(). The whole dma-buf is > de/referenced at dma-buf attach/detach time. > > Specifically, for static attachment, the exporter should always make > memory resource available/pinned on first dma_buf_attach(), and > release/unpin memory resource on last dma_buf_detach(). For dynamic > attachment, the exporter could populate & invalidate the memory > resource at any time, it's OK as long as the importers follow dma-buf > move notification. So no pinning is needed for get_pfn() and no > put_pfn() is needed.
> --- > drivers/dma-buf/dma-buf.c | 90 +++++++++++++++++++++++++++++++-------- > include/linux/dma-buf.h | 13 ++++++ > 2 files changed, 86 insertions(+), 17 deletions(-) > > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c > index 7eeee3a38202..83d1448b6dcc 100644 > --- a/drivers/dma-buf/dma-buf.c > +++ b/drivers/dma-buf/dma-buf.c > @@ -630,10 +630,10 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info) > size_t alloc_size = sizeof(struct dma_buf); > int ret; > > - if (WARN_ON(!exp_info->priv || !exp_info->ops > - || !exp_info->ops->map_dma_buf > - || !exp_info->ops->unmap_dma_buf > - || !exp_info->ops->release)) > + if (WARN_ON(!exp_info->priv || !exp_info->ops || > + (!!exp_info->ops->map_dma_buf != !!exp_info->ops->unmap_dma_buf) || > + (!exp_info->ops->map_dma_buf && !exp_info->ops->get_pfn) || > + !exp_info->ops->release)) > return ERR_PTR(-EINVAL); > > if (WARN_ON(exp_info->ops->cache_sgt_mapping && > @@ -909,7 +909,10 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev, > struct dma_buf_attachment *attach; > int ret; > > - if (WARN_ON(!dmabuf || !dev)) > + if (WARN_ON(!dmabuf)) > + return ERR_PTR(-EINVAL); > + > + if (WARN_ON(dmabuf->ops->map_dma_buf && !dev)) > return ERR_PTR(-EINVAL); > > if (WARN_ON(importer_ops && !importer_ops->move_notify)) > @@ -941,7 +944,7 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev, > */ > if (dma_buf_attachment_is_dynamic(attach) != > dma_buf_is_dynamic(dmabuf)) { > - struct sg_table *sgt; > + struct sg_table *sgt = NULL; > > dma_resv_lock(attach->dmabuf->resv, NULL); > if (dma_buf_is_dynamic(attach->dmabuf)) { > @@ -950,13 +953,16 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev, > goto err_unlock; > } > > - sgt = __map_dma_buf(attach, DMA_BIDIRECTIONAL); > - if (!sgt) > - sgt = ERR_PTR(-ENOMEM); > - if (IS_ERR(sgt)) { > - ret = PTR_ERR(sgt); > - goto err_unpin; > + if (dmabuf->ops->map_dma_buf) { > + sgt = __map_dma_buf(attach, DMA_BIDIRECTIONAL); > + if (!sgt) > + sgt = ERR_PTR(-ENOMEM); > + if (IS_ERR(sgt)) { > + ret = PTR_ERR(sgt); > + goto err_unpin; > + } > } > + > dma_resv_unlock(attach->dmabuf->resv); > attach->sgt = sgt; > attach->dir = DMA_BIDIRECTIONAL; > @@ -1119,7 +1125,8 @@ struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *attach, > > might_sleep(); > > - if (WARN_ON(!attach || !attach->dmabuf)) > + if (WARN_ON(!attach || !attach->dmabuf || > + !attach->dmabuf->ops->map_dma_buf)) > return ERR_PTR(-EINVAL); > > dma_resv_assert_held(attach->dmabuf->resv); > @@ -1195,7 +1202,8 @@ dma_buf_map_attachment_unlocked(struct dma_buf_attachment *attach, > > might_sleep(); > > - if (WARN_ON(!attach || !attach->dmabuf)) > + if (WARN_ON(!attach || !attach->dmabuf || > + !attach->dmabuf->ops->map_dma_buf)) > return ERR_PTR(-EINVAL); > > dma_resv_lock(attach->dmabuf->resv, NULL); > @@ -1222,7 +1230,8 @@ void dma_buf_unmap_attachment(struct dma_buf_attachment *attach, > { > might_sleep(); > > - if (WARN_ON(!attach || !attach->dmabuf || !sg_table)) > + if (WARN_ON(!attach || !attach->dmabuf || > + !attach->dmabuf->ops->unmap_dma_buf || !sg_table)) > return; > > dma_resv_assert_held(attach->dmabuf->resv); > @@ -1254,7 +1263,8 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach, > { > might_sleep(); > > - if (WARN_ON(!attach || !attach->dmabuf || !sg_table)) > + if (WARN_ON(!attach || !attach->dmabuf || > + !attach->dmabuf->ops->unmap_dma_buf || !sg_table)) > return; > > dma_resv_lock(attach->dmabuf->resv, NULL); > 
@@ -1263,6 +1273,52 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach, > } > EXPORT_SYMBOL_NS_GPL(dma_buf_unmap_attachment_unlocked, "DMA_BUF"); > > +/** > + * dma_buf_get_pfn_unlocked - > + * @attach: [in] attachment to get pfn from > + * @pgoff: [in] page offset of the buffer against the start of dma_buf > + * @pfn: [out] returns the pfn of the buffer > + * @max_order [out] returns the max mapping order of the buffer > + */ > +int dma_buf_get_pfn_unlocked(struct dma_buf_attachment *attach, > + pgoff_t pgoff, u64 *pfn, int *max_order) > +{ > + struct dma_buf *dmabuf = attach->dmabuf; > + int ret; > + > + if (WARN_ON(!attach || !attach->dmabuf || > + !attach->dmabuf->ops->get_pfn)) > + return -EINVAL; > + > + /* > + * Open: > + * > + * When dma_buf is dynamic but dma_buf move is disabled, the buffer > + * should be pinned before use, See dma_buf_map_attachment() for > + * reference. > + * > + * But for now no pin is intended inside dma_buf_get_pfn(), otherwise > + * need another API to unpin the dma_buf. So just fail out this case. > + */ > + if (dma_buf_is_dynamic(attach->dmabuf) && > + !IS_ENABLED(CONFIG_DMABUF_MOVE_NOTIFY)) > + return -ENOENT; > + > + dma_resv_lock(attach->dmabuf->resv, NULL); > + ret = dmabuf->ops->get_pfn(attach, pgoff, pfn, max_order); > + /* > + * Open: > + * > + * Is dma_resv_wait_timeout() needed? I assume no. The DMA buffer > + * content synchronization could be done when the buffer is to be > + * mapped by importer. > + */ > + dma_resv_unlock(attach->dmabuf->resv); > + > + return ret; > +} > +EXPORT_SYMBOL_NS_GPL(dma_buf_get_pfn_unlocked, "DMA_BUF"); > + > /** > * dma_buf_move_notify - notify attachments that DMA-buf is moving > * > @@ -1662,7 +1718,7 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused) > attach_count = 0; > > list_for_each_entry(attach_obj, &buf_obj->attachments, node) { > - seq_printf(s, "\t%s\n", dev_name(attach_obj->dev)); > + seq_printf(s, "\t%s\n", attach_obj->dev ? dev_name(attach_obj->dev) : NULL); > attach_count++; > } > dma_resv_unlock(buf_obj->resv); > diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h > index 36216d28d8bd..b16183edfb3a 100644 > --- a/include/linux/dma-buf.h > +++ b/include/linux/dma-buf.h > @@ -194,6 +194,17 @@ struct dma_buf_ops { > * if the call would block. > */ > > + /** > + * @get_pfn: > + * > + * This is called by dma_buf_get_pfn(). It is used to get the pfn > + * of the buffer positioned by the page offset against the start of > + * the dma_buf. It can only be called if @attach has been called > + * successfully. > + */ > + int (*get_pfn)(struct dma_buf_attachment *attach, pgoff_t pgoff, > + u64 *pfn, int *max_order); > + > /** > * @release: > * > @@ -629,6 +640,8 @@ dma_buf_map_attachment_unlocked(struct dma_buf_attachment *attach, > void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach, > struct sg_table *sg_table, > enum dma_data_direction direction); > +int dma_buf_get_pfn_unlocked(struct dma_buf_attachment *attach, > + pgoff_t pgoff, u64 *pfn, int *max_order); > > int dma_buf_mmap(struct dma_buf *, struct vm_area_struct *, > unsigned long);
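For clarity, this is roughly how an importer would consume the proposed interface; dma_buf_get_pfn_unlocked() and the get_pfn op only exist with this patch applied, and the my_importer_* names are hypothetical:

#include <linux/dma-buf.h>
#include <linux/err.h>

/* Resolve one page of the buffer to a PFN.  Per this patch, @dev may be
 * NULL only when the exporter does not implement map_dma_buf.  A dynamic
 * importer must additionally pass importer_ops with a move_notify
 * callback to dma_buf_dynamic_attach() and re-fetch PFNs after a move
 * (not shown here). */
static int my_importer_lookup_pfn(struct dma_buf *dmabuf, pgoff_t pgoff,
                                  u64 *pfn, int *max_order,
                                  struct dma_buf_attachment **out_attach)
{
        struct dma_buf_attachment *attach;
        int ret;

        attach = dma_buf_attach(dmabuf, NULL);
        if (IS_ERR(attach))
                return PTR_ERR(attach);

        ret = dma_buf_get_pfn_unlocked(attach, pgoff, pfn, max_order);
        if (ret) {
                dma_buf_detach(dmabuf, attach);
                return ret;
        }

        /* The attachment is what holds the buffer alive (see the cover
         * text above: no put_pfn(), the dma-buf is referenced at
         * attach/detach time), so keep it for as long as the PFN is in
         * use, then dma_buf_detach(). */
        *out_attach = attach;
        return 0;
}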
On Wed, Jan 08, 2025 at 09:01:46AM +0100, Christian König wrote: > Am 07.01.25 um 15:27 schrieb Xu Yilun: > > Introduce a new API for dma-buf importer, also add a dma_buf_ops > > callback for dma-buf exporter. This API is for subsystem importers who > > map the dma-buf to some user defined address space, e.g. for IOMMUFD to > > map the dma-buf to userspace IOVA via IOMMU page table, or for KVM to > > map the dma-buf to GPA via KVM MMU (e.g. EPT). > > > > Currently dma-buf is only used to get DMA address for device's default > > domain by using kernel DMA APIs. But for these new use-cases, importers > > only need the pfn of the dma-buf resource to build their own mapping > > tables. > > As far as I can see I have to fundamentally reject this whole approach. > > It's intentional DMA-buf design that we don't expose struct pages nor PFNs > to the importer. Essentially DMA-buf only transports DMA addresses. > > In other words the mapping is done by the exporter and *not* the importer. > > What we certainly can do is to annotate those DMA addresses to better > specify in which domain they are applicable, e.g. if they are PCIe bus > addresses or some inter device bus addresses etc... > > But moving the functionality to map the pages/PFNs to DMA addresses into the > importer is an absolutely clear NO-GO. Oh? Having the importer do the mapping is the correct way to operate the DMA API and the new API that Leon has built to fix the scatterlist abuse in dmabuf relies on importer mapping as part of its construction. Why on earth do you want the exporter to map? That is completely backwards and unworkable in many cases. The dysfunctional P2P support in dmabuf is like that principally because of this. That said, I don't think get_pfn() is an especially good interface, but we will need to come up with something that passes the physical pfn out. Jason
Am 08.01.25 um 14:23 schrieb Jason Gunthorpe: > On Wed, Jan 08, 2025 at 09:01:46AM +0100, Christian König wrote: >> Am 07.01.25 um 15:27 schrieb Xu Yilun: >>> Introduce a new API for dma-buf importer, also add a dma_buf_ops >>> callback for dma-buf exporter. This API is for subsystem importers who >>> map the dma-buf to some user defined address space, e.g. for IOMMUFD to >>> map the dma-buf to userspace IOVA via IOMMU page table, or for KVM to >>> map the dma-buf to GPA via KVM MMU (e.g. EPT). >>> >>> Currently dma-buf is only used to get DMA address for device's default >>> domain by using kernel DMA APIs. But for these new use-cases, importers >>> only need the pfn of the dma-buf resource to build their own mapping >>> tables. >> As far as I can see I have to fundamentally reject this whole approach. >> >> It's intentional DMA-buf design that we don't expose struct pages nor PFNs >> to the importer. Essentially DMA-buf only transports DMA addresses. >> >> In other words the mapping is done by the exporter and *not* the importer. >> >> What we certainly can do is to annotate those DMA addresses to better >> specify in which domain they are applicable, e.g. if they are PCIe bus >> addresses or some inter device bus addresses etc... >> >> But moving the functionality to map the pages/PFNs to DMA addresses into the >> importer is an absolutely clear NO-GO. > Oh? > > Having the importer do the mapping is the correct way to operate the > DMA API and the new API that Leon has built to fix the scatterlist > abuse in dmabuf relies on importer mapping as part of its > construction. Exactly on that I strongly disagree. DMA-buf works by providing DMA addresses the importer can work with and *NOT* the underlying location of the buffer. > Why on earth do you want the exporter to map? Because the exporter owns the exported buffer and only the exporter knows how to correctly access it. > That is completely backwards and unworkable in many cases. The dysfunctional P2P support > in dmabuf is like that principally because of this. No, that is exactly what we need. Using the scatterlist to transport the DMA addresses was clearly a mistake, but providing the DMA addresses by the exporter has proved many times to be the right approach. Keep in mind that the exported buffer is not necessarily memory, but can also be MMIO or stuff which is only accessible through address space windows, for which you can't create a PFN nor a struct page. > That said, I don't think get_pfn() is an especially good interface, > but we will need to come up with something that passes the physical pfn > out. No, a physical pfn is absolutely not a good way of passing the location of data around because it is limited to what the CPU sees as address space. We have use cases where DMA-buf transports the location of CPU invisible data which only the involved devices can see. Regards, Christian. > > Jason
On Wed, Jan 08, 2025 at 02:44:26PM +0100, Christian König wrote: > > Having the importer do the mapping is the correct way to operate the > > DMA API and the new API that Leon has built to fix the scatterlist > > abuse in dmabuf relies on importer mapping as part of its > > construction. > > Exactly on that I strongly disagree. > > DMA-buf works by providing DMA addresses the importer can work with and > *NOT* the underlying location of the buffer. The expectation is that the DMA API will be used to DMA map (most) things, and the DMA API always works with a physaddr_t/pfn argument. Basically, everything that is not a private address space should be supported by improving the DMA API. We are on course for finally getting all the common cases like P2P and MMIO solved here. That alone will take care of a lot. For P2P cases we are going toward (PFN + P2P source information) as input to the DMA API. The additional "P2P source information" provides a good way for co-operating drivers to represent private address spaces as well. Both importer and exporter can have full understanding what is being mapped and do the correct things, safely. So, no, we don't lose private address space support when moving to importer mapping, in fact it works better because the importer gets more information about what is going on. I have imagined a staged approach where DMABUF gets a new API that works with the new DMA API to do importer mapping with "P2P source information" and a gradual conversion. Exporter mapping falls down in too many cases already: 1) Private address spaces don't work fully well because many devices need some indication what address space is being used and scatter list can't really properly convey that. If the DMABUF has a mixture of CPU and private it becomes a PITA 2) Multi-path PCI can require the importer to make mapping decisions unique to the device and program device specific information for the multi-path. We are doing this in mlx5 today and have hacks because DMABUF is destroying the information the importer needs to choose the correct PCI path. 3) Importing devices need to know if they are working with PCI P2P addresses during mapping because they need to do things like turn on ATS on their DMA. As for multi-path we have the same hacks inside mlx5 today that assume DMABUFs are always P2P because we cannot determine if things are P2P or not after being DMA mapped. 4) TPH bits need to be programmed into the importer device but are derived based on the NUMA topology of the DMA target. The importer has no idea what the DMA target actually was because the exporter mapping destroyed that information. 5) iommufd and kvm are both using CPU addresses without DMA. No exporter mapping is possible Jason
Am 08.01.25 um 15:58 schrieb Jason Gunthorpe: > On Wed, Jan 08, 2025 at 02:44:26PM +0100, Christian König wrote: > >>> Having the importer do the mapping is the correct way to operate the >>> DMA API and the new API that Leon has built to fix the scatterlist >>> abuse in dmabuf relies on importer mapping as part of its >>> construction. >> Exactly on that I strongly disagree. >> >> DMA-buf works by providing DMA addresses the importer can work with and >> *NOT* the underlying location of the buffer. > The expectation is that the DMA API will be used to DMA map (most) > things, and the DMA API always works with a physaddr_t/pfn > argument. Basically, everything that is not a private address space > should be supported by improving the DMA API. We are on course for > finally getting all the common cases like P2P and MMIO solved > here. That alone will take care of a lot. Well, from experience the DMA API has failed more often than it actually worked in the way required by drivers. Especially that we tried to hide architectural complexity in there instead of properly exposing limitations to drivers is not something I consider a good design approach. So I see putting even more into that as extremely critical. > For P2P cases we are going toward (PFN + P2P source information) as > input to the DMA API. The additional "P2P source information" provides > a good way for co-operating drivers to represent private address > spaces as well. Both importer and exporter can have full understanding > what is being mapped and do the correct things, safely. I can say from experience that this is clearly not going to work for all use cases. It would mean that we have to pull a massive amount of driver specific functionality into the DMA API. Things like programming access windows for PCI BARs are completely driver specific and as far as I can see can't be part of the DMA API without things like callbacks. With that in mind the DMA API would become a mid layer between different drivers and that is really not something you are suggesting, is it? > So, no, we don't lose private address space support when moving to > importer mapping, in fact it works better because the importer gets > more information about what is going on. Well, sounds like I wasn't able to voice my concern. Let me try again: We should not give importers information they don't need. Especially not information about the backing store of buffers. So that importers get more information about what's going on is a bad thing. > I have imagined a staged approach where DMABUF gets a new API that > works with the new DMA API to do importer mapping with "P2P source > information" and a gradual conversion. To make it clear as maintainer of that subsystem I would reject such a step with all I have. We have already gone down that road and it didn't work at all and was a really big pain to pull people back from it. > Exporter mapping falls down in too many cases already: > > 1) Private address spaces don't work fully well because many devices > need some indication what address space is being used and scatter list > can't really properly convey that. If the DMABUF has a mixture of CPU > and private it becomes a PITA Correct, yes. That's why I said that scatterlist was a bad choice for the interface. But exposing the backing store to importers and then letting them do whatever they want with it sounds like an even worse idea.
> 2) Multi-path PCI can require the importer to make mapping decisions > unique to the device and program device specific information for the > multi-path. We are doing this in mlx5 today and have hacks because > DMABUF is destroying the information the importer needs to choose the > correct PCI path. That's why the exporter gets the struct device of the importer so that it can plan how those accesses are made. Where exactly is the problem with that? When you have an use case which is not covered by the existing DMA-buf interfaces then please voice that to me and other maintainers instead of implementing some hack. > 3) Importing devices need to know if they are working with PCI P2P > addresses during mapping because they need to do things like turn on > ATS on their DMA. As for multi-path we have the same hacks inside mlx5 > today that assume DMABUFs are always P2P because we cannot determine > if things are P2P or not after being DMA mapped. Why would you need ATS on PCI P2P and not for system memory accesses? > 4) TPH bits needs to be programmed into the importer device but are > derived based on the NUMA topology of the DMA target. The importer has > no idea what the DMA target actually was because the exporter mapping > destroyed that information. Yeah, but again that is completely intentional. I assume you mean TLP processing hints when you say TPH and those should be part of the DMA addresses provided by the exporter. That an importer tries to look behind the curtain and determines the NUMA placement and topology themselves is clearly a no-go from the design perspective. > 5) iommufd and kvm are both using CPU addresses without DMA. No > exporter mapping is possible We have customers using both KVM and XEN with DMA-buf, so I can clearly confirm that this isn't true. Regards, Christian. > > Jason
On Wed, Jan 08, 2025 at 04:25:54PM +0100, Christian König wrote: > Am 08.01.25 um 15:58 schrieb Jason Gunthorpe: > > On Wed, Jan 08, 2025 at 02:44:26PM +0100, Christian König wrote: > > > > > > Having the importer do the mapping is the correct way to operate the > > > > DMA API and the new API that Leon has built to fix the scatterlist > > > > abuse in dmabuf relies on importer mapping as part of its > > > > construction. > > > Exactly on that I strongly disagree. > > > > > > DMA-buf works by providing DMA addresses the importer can work with and > > > *NOT* the underlying location of the buffer. > > The expectation is that the DMA API will be used to DMA map (most) > > things, and the DMA API always works with a physaddr_t/pfn > > argument. Basically, everything that is not a private address space > > should be supported by improving the DMA API. We are on course for > > finally getting all the common cases like P2P and MMIO solved > > here. That alone will take care of a lot. > > Well, from experience the DMA API has failed more often than it actually > worked in the way required by drivers. The DMA API has been static and very hard to change in these ways for a long time. I think Leon's new API will break through this and we will finally be able to address these issues. > > For P2P cases we are going toward (PFN + P2P source information) as > > input to the DMA API. The additional "P2P source information" provides > > a good way for co-operating drivers to represent private address > > spaces as well. Both importer and exporter can have full understanding > > what is being mapped and do the correct things, safely. > > I can say from experience that this is clearly not going to work for all use > cases. > > It would mean that we have to pull a massive amount of driver specific > functionality into the DMA API. That isn't what I mean. There are two distinct parts, the means to describe the source (PFN + P2P source information) that is compatible with the DMA API, and the DMA API itself that works with a few general P2P source information types. Private source information would be detected by co-operating drivers and go down driver private paths. It would be rejected by other drivers. This broadly follows how the new API is working. So here I mean you can use the same PFN + Source API between importer and exporter and the importer can simply detect the special source and do the private stuff. It is not shifting things under the DMA API, it is building alongside it using compatible design approaches. You would match the source information, cast it to a driver structure, do whatever driver math is needed to compute the local DMA address and then write it to the device. Nothing is hard or "not going to work" here. > > So, no, we don't lose private address space support when moving to > > importer mapping, in fact it works better because the importer gets > > more information about what is going on. > > Well, sounds like I wasn't able to voice my concern. Let me try again: > > We should not give importers information they don't need. Especially not > information about the backing store of buffers. > > So that importers get more information about what's going on is a bad thing. I strongly disagree because we are suffering today in mlx5 because of this viewpoint. You cannot predict in advance what importers are going to need. I already listed many examples where it does not work today as is.
> > I have imagined a staged approach where DMABUF gets a new API that > > works with the new DMA API to do importer mapping with "P2P source > > information" and a gradual conversion. > > To make it clear as maintainer of that subsystem I would reject such a step > with all I have. This is unexpected, so you want to just leave dmabuf broken? Do you have any plan to fix it, to fix the misuse of the DMA API, and all the problems I listed below? This is a big deal, it is causing real problems today. If it is going to be like this I think we will stop trying to use dmabuf and do something simpler for vfio/kvm/iommufd :( > We have already gone down that road and it didn't work at all and > was a really big pain to pull people back from it. Nobody has really seriously tried to improve the DMA API before, so I don't think this is true at all. > > Exporter mapping falls down in too many cases already: > > > > 1) Private address spaces don't work fully well because many devices > > need some indication what address space is being used and scatter list > > can't really properly convey that. If the DMABUF has a mixture of CPU > > and private it becomes a PITA > > Correct, yes. That's why I said that scatterlist was a bad choice for the > interface. > > But exposing the backing store to importers and then letting them do whatever > they want with it sounds like an even worse idea. You keep saying this without real justification. To me it is a nanny style of API design. But also I don't see how you can possibly fix the above without telling the importer a lot more information. > > 2) Multi-path PCI can require the importer to make mapping decisions > > unique to the device and program device specific information for the > > multi-path. We are doing this in mlx5 today and have hacks because > > DMABUF is destroying the information the importer needs to choose the > > correct PCI path. > > That's why the exporter gets the struct device of the importer so that it > can plan how those accesses are made. Where exactly is the problem with > that? A single struct device does not convey the multipath options. We have multiple struct devices (and multiple PCI endpoints) doing DMA concurrently under one driver. Multipath always needs additional meta information in the importer side to tell the device which path to select. A naked dma address is not sufficient. Today we guess that DMABUF will be using P2P and hack to choose a P2P struct device to pass to the exporter. We need to know what is in the dmabuf before we can choose which of the multiple struct devices the driver has to use for DMA mapping. But even for simple CPU centric cases we will eventually want to select the proper NUMA local PCI channel matching struct device for CPU only buffers. > When you have a use case which is not covered by the existing DMA-buf > interfaces then please voice that to me and other maintainers instead of > implementing some hack. Do you have any suggestion for any of this then? We have a good plan to fix this stuff and more. Many experts in their fields have agreed on the different parts now. We haven't got to dmabuf because I had no idea there would be an objection like this. > > 3) Importing devices need to know if they are working with PCI P2P > > addresses during mapping because they need to do things like turn on > > ATS on their DMA. As for multi-path we have the same hacks inside mlx5 > > today that assume DMABUFs are always P2P because we cannot determine > > if things are P2P or not after being DMA mapped.
> > Why would you need ATS on PCI P2P and not for system memory accesses? ATS has a significant performance cost. It is mandatory for PCI P2P, but ideally should be avoided for CPU memory. > > 4) TPH bits needs to be programmed into the importer device but are > > derived based on the NUMA topology of the DMA target. The importer has > > no idea what the DMA target actually was because the exporter mapping > > destroyed that information. > > Yeah, but again that is completely intentional. > > I assume you mean TLP processing hints when you say TPH and those should be > part of the DMA addresses provided by the exporter. Yes, but is not part of the DMA addresses. > That an importer tries to look behind the curtain and determines the NUMA > placement and topology themselves is clearly a no-go from the design > perspective. I strongly disagree, this is important. Drivers need this information in a future TPH/UIO/multipath PCI world. > > 5) iommufd and kvm are both using CPU addresses without DMA. No > > exporter mapping is possible > > We have customers using both KVM and XEN with DMA-buf, so I can clearly > confirm that this isn't true. Today they are mmaping the dma-buf into a VMA and then using KVM's follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable dma-buf must have a CPU PFN. Here Xu implements basically the same path, except without the VMA indirection, and it suddenly not OK? Illogical. Jason
> > > 5) iommufd and kvm are both using CPU addresses without DMA. No > > > exporter mapping is possible > > > > We have customers using both KVM and XEN with DMA-buf, so I can clearly > > confirm that this isn't true. > > Today they are mmaping the dma-buf into a VMA and then using KVM's > follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable > dma-buf must have a CPU PFN. Yes, the final target for KVM is still the CPU PFN, just with the help of the CPU mapping table. I also found the xen gntdev-dmabuf is calculating the pfn from the mapped sgt. From Christian's point of view, I assume only sgl->dma_address should be used by importers, but in fact that's not the case. More importers are 'abusing' sg dma helpers. That said there are existing needs for importers to know more about the real buffer resource, for mapping, or even more than mapping, e.g. dmabuf_imp_grant_foreign_access() Thanks, Yilun
On Wed, Jan 08, 2025 at 12:22:27PM -0400, Jason Gunthorpe wrote: > On Wed, Jan 08, 2025 at 04:25:54PM +0100, Christian König wrote: > > Am 08.01.25 um 15:58 schrieb Jason Gunthorpe: > > > I have imagined a staged approach were DMABUF gets a new API that > > > works with the new DMA API to do importer mapping with "P2P source > > > information" and a gradual conversion. > > > > To make it clear as maintainer of that subsystem I would reject such a step > > with all I have. > > This is unexpected, so you want to just leave dmabuf broken? Do you > have any plan to fix it, to fix the misuse of the DMA API, and all > the problems I listed below? This is a big deal, it is causing real > problems today. > > If it going to be like this I think we will stop trying to use dmabuf > and do something simpler for vfio/kvm/iommufd :( As the gal who help edit the og dma-buf spec 13 years ago, I think adding pfn isn't a terrible idea. By design, dma-buf is the "everything is optional" interface. And in the beginning, even consistent locking was optional, but we've managed to fix that by now :-/ Where I do agree with Christian is that stuffing pfn support into the dma_buf_attachment interfaces feels a bit much wrong. > > We have already gone down that road and it didn't worked at all and > > was a really big pain to pull people back from it. > > Nobody has really seriously tried to improve the DMA API before, so I > don't think this is true at all. Aside, I really hope this finally happens! > > > 3) Importing devices need to know if they are working with PCI P2P > > > addresses during mapping because they need to do things like turn on > > > ATS on their DMA. As for multi-path we have the same hacks inside mlx5 > > > today that assume DMABUFs are always P2P because we cannot determine > > > if things are P2P or not after being DMA mapped. > > > > Why would you need ATS on PCI P2P and not for system memory accesses? > > ATS has a significant performance cost. It is mandatory for PCI P2P, > but ideally should be avoided for CPU memory. Huh, I didn't know that. And yeah kinda means we've butchered the pci p2p stuff a bit I guess ... > > > 5) iommufd and kvm are both using CPU addresses without DMA. No > > > exporter mapping is possible > > > > We have customers using both KVM and XEN with DMA-buf, so I can clearly > > confirm that this isn't true. > > Today they are mmaping the dma-buf into a VMA and then using KVM's > follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable > dma-buf must have a CPU PFN. > > Here Xu implements basically the same path, except without the VMA > indirection, and it suddenly not OK? Illogical. So the big difference is that for follow_pfn() you need mmu_notifier since the mmap might move around, whereas with pfn smashed into dma_buf_attachment you need dma_resv_lock rules, and the move_notify callback if you go dynamic. So I guess my first question is, which locking rules do you want here for pfn importers? If mmu notifiers is fine, then I think the current approach of follow_pfn should be ok. But if you instead dma_resv_lock rules (or the cpu mmap somehow is an issue itself), then I think the clean design is create a new separate access mechanism just for that. It would be the 5th or so (kernel vmap, userspace mmap, dma_buf_attach and driver private stuff like virtio_dma_buf.c where you access your buffer with a uuid), so really not a big deal. 
And for non-contrived exporters we might be able to implement the other access methods in terms of the pfn method generically, so this wouldn't even be a terrible maintenance burden going forward. And meanwhile all the contrived exporters just keep working as-is. The other part is that cpu mmap is optional, and there's plenty of strange exporters who don't implement. But you can dma map the attachment into plenty devices. This tends to mostly be a thing on SoC devices with some very funky memory. But I guess you don't care about these use-case, so should be ok. I couldn't come up with a good name for these pfn users, maybe dma_buf_pfn_attachment? This does _not_ have a struct device, but maybe some of these new p2p source specifiers (or a list of those which are allowed, no idea how this would need to fit into the new dma api). Cheers, Sima
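For the dma_resv_lock flavour Sima describes, a pfn importer would look roughly like a dynamic importer today: register move_notify, drop stale PFNs when it fires, and re-fetch them under the reservation lock before the next use. A hedged sketch with hypothetical my_importer_* names (the attach-ops structure and dma_buf_dynamic_attach() are the existing API):

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>
#include <linux/err.h>

struct my_importer {
        struct dma_buf_attachment *attach;
        /* importer-private mapping state, e.g. a secondary page table */
};

/* hypothetical: tear down whatever mappings were built from stale PFNs */
static void my_importer_zap_mappings(struct my_importer *imp);

/* Called by the exporter with the reservation lock held whenever the
 * backing store moves (or, for a revocable buffer, goes away). */
static void my_move_notify(struct dma_buf_attachment *attach)
{
        struct my_importer *imp = attach->importer_priv;

        dma_resv_assert_held(attach->dmabuf->resv);
        my_importer_zap_mappings(imp);
}

static const struct dma_buf_attach_ops my_importer_ops = {
        .allow_peer2peer = true,
        .move_notify = my_move_notify,
};

/* Attach dynamically so move_notify is delivered; addresses/PFNs are then
 * re-acquired lazily under dma_resv_lock() before the next use. */
static int my_importer_bind(struct my_importer *imp, struct dma_buf *dmabuf,
                            struct device *dev)
{
        imp->attach = dma_buf_dynamic_attach(dmabuf, dev, &my_importer_ops, imp);
        return PTR_ERR_OR_ZERO(imp->attach);
}

Whether this machinery hangs off dma_buf_attachment or off a separate pfn-specific attachment type, as suggested above, is the open design question.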
On Wed, Jan 08, 2025 at 07:44:54PM +0100, Simona Vetter wrote: > On Wed, Jan 08, 2025 at 12:22:27PM -0400, Jason Gunthorpe wrote: > > On Wed, Jan 08, 2025 at 04:25:54PM +0100, Christian König wrote: > > > Am 08.01.25 um 15:58 schrieb Jason Gunthorpe: > > > > I have imagined a staged approach where DMABUF gets a new API that > > > > works with the new DMA API to do importer mapping with "P2P source > > > > information" and a gradual conversion. > > > > > > To make it clear as maintainer of that subsystem I would reject such a step > > > with all I have. > > > > This is unexpected, so you want to just leave dmabuf broken? Do you > > have any plan to fix it, to fix the misuse of the DMA API, and all > > the problems I listed below? This is a big deal, it is causing real > > problems today. > > > > If it is going to be like this I think we will stop trying to use dmabuf > > and do something simpler for vfio/kvm/iommufd :( > > As the gal who helped edit the og dma-buf spec 13 years ago, I think adding > pfn isn't a terrible idea. By design, dma-buf is the "everything is > optional" interface. And in the beginning, even consistent locking was > optional, but we've managed to fix that by now :-/ > > Where I do agree with Christian is that stuffing pfn support into the > dma_buf_attachment interfaces feels a bit much wrong. So could it be a dmabuf interface like mmap()/vmap()? I was also wondering about that. But I finally chose the dma_buf_attachment interface to leverage the existing buffer pinning and move_notify. > > > > We have already gone down that road and it didn't work at all and > > > was a really big pain to pull people back from it. > > > > Nobody has really seriously tried to improve the DMA API before, so I > > don't think this is true at all. > > Aside, I really hope this finally happens! > > > > > 3) Importing devices need to know if they are working with PCI P2P > > > > addresses during mapping because they need to do things like turn on > > > > ATS on their DMA. As for multi-path we have the same hacks inside mlx5 > > > > today that assume DMABUFs are always P2P because we cannot determine > > > > if things are P2P or not after being DMA mapped. > > > > > > Why would you need ATS on PCI P2P and not for system memory accesses? > > > > ATS has a significant performance cost. It is mandatory for PCI P2P, > > but ideally should be avoided for CPU memory. > > Huh, I didn't know that. And yeah kinda means we've butchered the pci p2p > stuff a bit I guess ... > > > > > 5) iommufd and kvm are both using CPU addresses without DMA. No > > > > exporter mapping is possible > > > > > > We have customers using both KVM and XEN with DMA-buf, so I can clearly > > > confirm that this isn't true. > > > > Today they are mmaping the dma-buf into a VMA and then using KVM's > > follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable > > dma-buf must have a CPU PFN. > > > > Here Xu implements basically the same path, except without the VMA > > indirection, and it is suddenly not OK? Illogical. > > So the big difference is that for follow_pfn() you need mmu_notifier since > the mmap might move around, whereas with pfn smashed into > dma_buf_attachment you need dma_resv_lock rules, and the move_notify > callback if you go dynamic. > > So I guess my first question is, which locking rules do you want here for > pfn importers? follow_pfn() is unwanted for private MMIO, so dma_resv_lock. > > If mmu notifiers is fine, then I think the current approach of follow_pfn > should be ok.
But if you instead dma_resv_lock rules (or the cpu mmap > somehow is an issue itself), then I think the clean design is create a new cpu mmap() is an issue, this series is aimed to eliminate userspace mapping for private MMIO resources. > separate access mechanism just for that. It would be the 5th or so (kernel > vmap, userspace mmap, dma_buf_attach and driver private stuff like > virtio_dma_buf.c where you access your buffer with a uuid), so really not > a big deal. OK, will think more about that. Thanks, Yilun > > And for non-contrived exporters we might be able to implement the other > access methods in terms of the pfn method generically, so this wouldn't > even be a terrible maintenance burden going forward. And meanwhile all the > contrived exporters just keep working as-is. > > The other part is that cpu mmap is optional, and there's plenty of strange > exporters who don't implement. But you can dma map the attachment into > plenty devices. This tends to mostly be a thing on SoC devices with some > very funky memory. But I guess you don't care about these use-case, so > should be ok. > > I couldn't come up with a good name for these pfn users, maybe > dma_buf_pfn_attachment? This does _not_ have a struct device, but maybe > some of these new p2p source specifiers (or a list of those which are > allowed, no idea how this would need to fit into the new dma api). > > Cheers, Sima > -- > Simona Vetter > Software Engineer, Intel Corporation > http://blog.ffwll.ch
> So I guess my first question is, which locking rules do you want here for > pfn importers? > > follow_pfn() is unwanted for private MMIO, so dma_resv_lock. > > As Sima explained you either have follow_pfn() and mmu_notifier or you > have DMA addresses and dma_resv lock / dma_fence. > > Just giving out PFNs without some lifetime associated with them is one of > the major problems we faced before and really not something you can do. I'm trying to make the exporter give out PFNs with lifetime control via move_notify() in this series. It may not be conceptually correct but it seems possible. > > > If mmu notifiers is fine, then I think the current approach of follow_pfn > should be ok. But if you instead dma_resv_lock rules (or the cpu mmap > somehow is an issue itself), then I think the clean design is create a new > > cpu mmap() is an issue, this series is aimed to eliminate userspace > mapping for private MMIO resources. > > Why? OK, I can start from here. It is about the secure guest, or CoCo VM. The memory and MMIOs assigned to this kind of guest are inaccessible to the host itself, by leveraging HW encryption & access control technology (e.g. Intel TDX, AMD SEV-SNP ...). This is to protect tenant data from being stolen by the CSP itself. The impact is that when the host accesses the encrypted region, bad things happen to the system, e.g. memory corruption, MCE. The kernel tries to mitigate most of the impact by allocating and assigning user-unmappable memory resources (private memory) to the guest, which prevents userspace accidents. guest_memfd is the private memory provider that only allows KVM to locate the page/pfn by fd + offset and create the secondary page table (EPT, NPT ...); no host mapping, no VMA, no mmu_notifier. But the lifecycle of the private memory is still controlled by guest_memfd. On fallocate(fd, PUNCH_HOLE), the memory resource is revoked and KVM is notified to unmap the corresponding EPT. The further thought is that guest_memfd is also suitable for a normal guest. It makes no sense that the VMM must build a host mapping table before guest access. Now I'm trying to seek a similar way for private MMIO: an MMIO resource provider that is exported as an fd. It controls the lifecycle of the MMIO resource and notifies KVM when it is revoked. dma-buf seems to be a good provider which has done most of the work; we only need to extend it to look up the memory resource by fd + offset. > > separate access mechanism just for that. It would be the 5th or so (kernel > vmap, userspace mmap, dma_buf_attach and driver private stuff like > virtio_dma_buf.c where you access your buffer with a uuid), so really not > a big deal. > > OK, will think more about that. > > Please note that we have follow_pfn() + mmu_notifier working for KVM/XEN follow_pfn() & mmu_notifier won't work here because there is no VMA and no host mapping table. Thanks, Yilun > with MMIO mappings and P2P. And that required exactly zero DMA-buf changes > :) > > I don't fully understand your use case, but I think it's quite likely that > we already have that working. > > Regards, > Christian.
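For concreteness, the guest_memfd flow referenced above looks roughly like this from the VMM side, assuming the upstream KVM guest_memfd UAPI (KVM_CREATE_GUEST_MEMFD plus fallocate() hole punching); the memslot binding step is elided:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/falloc.h>
#include <linux/kvm.h>

/* Create private guest memory that the VMM never mmap()s; KVM resolves
 * guest accesses by fd + offset and builds the EPT/NPT on its own. */
static int create_private_mem(int vm_fd, uint64_t size)
{
        struct kvm_create_guest_memfd gmem = { .size = size };

        /* the returned fd is later bound to a guest memslot */
        return ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
}

/* Revocation: punching a hole frees the backing pages and KVM unmaps the
 * corresponding EPT range; no VMA and no mmu_notifier are involved. */
static int revoke_private_range(int gmem_fd, uint64_t offset, uint64_t len)
{
        return fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                         offset, len);
}

The proposal above is to give private MMIO the same shape: an fd-based provider (dma-buf) that hands KVM the pfn by fd + offset and revokes it via move notification instead of hole punching.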
Answering on my reply once more as pure text mail. I love AMDs mail servers. Cheers, Christian. Am 09.01.25 um 09:04 schrieb Christian König: > Am 08.01.25 um 20:22 schrieb Xu Yilun: >> On Wed, Jan 08, 2025 at 07:44:54PM +0100, Simona Vetter wrote: >>> On Wed, Jan 08, 2025 at 12:22:27PM -0400, Jason Gunthorpe wrote: >>>> On Wed, Jan 08, 2025 at 04:25:54PM +0100, Christian König wrote: >>>>> Am 08.01.25 um 15:58 schrieb Jason Gunthorpe: >>>>>> I have imagined a staged approach were DMABUF gets a new API that >>>>>> works with the new DMA API to do importer mapping with "P2P source >>>>>> information" and a gradual conversion. >>>>> To make it clear as maintainer of that subsystem I would reject such a step >>>>> with all I have. >>>> This is unexpected, so you want to just leave dmabuf broken? Do you >>>> have any plan to fix it, to fix the misuse of the DMA API, and all >>>> the problems I listed below? This is a big deal, it is causing real >>>> problems today. >>>> >>>> If it going to be like this I think we will stop trying to use dmabuf >>>> and do something simpler for vfio/kvm/iommufd :( >>> As the gal who help edit the og dma-buf spec 13 years ago, I think adding >>> pfn isn't a terrible idea. By design, dma-buf is the "everything is >>> optional" interface. And in the beginning, even consistent locking was >>> optional, but we've managed to fix that by now :-/ > > Well you were also the person who mangled the struct page pointers in > the scatterlist because people were abusing this and getting a bloody > nose :) > >>> Where I do agree with Christian is that stuffing pfn support into the >>> dma_buf_attachment interfaces feels a bit much wrong. >> So it could a dmabuf interface like mmap/vmap()? I was also wondering >> about that. But finally I start to use dma_buf_attachment interface >> because of leveraging existing buffer pin and move_notify. > > Exactly that's the point, sharing pfn doesn't work with the pin and > move_notify interfaces because of the MMU notifier approach Sima > mentioned. > >>>>> We have already gone down that road and it didn't worked at all and >>>>> was a really big pain to pull people back from it. >>>> Nobody has really seriously tried to improve the DMA API before, so I >>>> don't think this is true at all. >>> Aside, I really hope this finally happens! > > Sorry my fault. I was not talking about the DMA API, but rather that > people tried to look behind the curtain of DMA-buf backing stores. > > In other words all the fun we had with scatterlists and that people > try to modify the struct pages inside of them. > > Improving the DMA API is something I really really hope for as well. > >>>>>> 3) Importing devices need to know if they are working with PCI P2P >>>>>> addresses during mapping because they need to do things like turn on >>>>>> ATS on their DMA. As for multi-path we have the same hacks inside mlx5 >>>>>> today that assume DMABUFs are always P2P because we cannot determine >>>>>> if things are P2P or not after being DMA mapped. >>>>> Why would you need ATS on PCI P2P and not for system memory accesses? >>>> ATS has a significant performance cost. It is mandatory for PCI P2P, >>>> but ideally should be avoided for CPU memory. >>> Huh, I didn't know that. And yeah kinda means we've butchered the pci p2p >>> stuff a bit I guess ... > > Hui? Why should ATS be mandatory for PCI P2P? > > We have tons of production systems using PCI P2P without ATS. And it's > the first time I hear that. 
> >>>>>> 5) iommufd and kvm are both using CPU addresses without DMA. No >>>>>> exporter mapping is possible >>>>> We have customers using both KVM and XEN with DMA-buf, so I can clearly >>>>> confirm that this isn't true. >>>> Today they are mmaping the dma-buf into a VMA and then using KVM's >>>> follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable >>>> dma-buf must have a CPU PFN. >>>> >>>> Here Xu implements basically the same path, except without the VMA >>>> indirection, and it suddenly not OK? Illogical. >>> So the big difference is that for follow_pfn() you need mmu_notifier since >>> the mmap might move around, whereas with pfn smashed into >>> dma_buf_attachment you need dma_resv_lock rules, and the move_notify >>> callback if you go dynamic. >>> >>> So I guess my first question is, which locking rules do you want here for >>> pfn importers? >> follow_pfn() is unwanted for private MMIO, so dma_resv_lock. > > As Sima explained you either have follow_pfn() and mmu_notifier or you > have DMA addresses and dma_resv lock / dma_fence. > > Just giving out PFNs without some lifetime associated with them is one > of the major problems we faced before and really not something you can do. > >>> If mmu notifiers is fine, then I think the current approach of follow_pfn >>> should be ok. But if you instead dma_resv_lock rules (or the cpu mmap >>> somehow is an issue itself), then I think the clean design is create a new >> cpu mmap() is an issue, this series is aimed to eliminate userspace >> mapping for private MMIO resources. > > Why? > >>> separate access mechanism just for that. It would be the 5th or so (kernel >>> vmap, userspace mmap, dma_buf_attach and driver private stuff like >>> virtio_dma_buf.c where you access your buffer with a uuid), so really not >>> a big deal. >> OK, will think more about that. > > Please note that we have follow_pfn() + mmu_notifier working for > KVM/XEN with MMIO mappings and P2P. And that required exactly zero > DMA-buf changes :) > > I don't fully understand your use case, but I think it's quite likely > that we already have that working. > > Regards, > Christian. > >> Thanks, >> Yilun >> >>> And for non-contrived exporters we might be able to implement the other >>> access methods in terms of the pfn method generically, so this wouldn't >>> even be a terrible maintenance burden going forward. And meanwhile all the >>> contrived exporters just keep working as-is. >>> >>> The other part is that cpu mmap is optional, and there's plenty of strange >>> exporters who don't implement. But you can dma map the attachment into >>> plenty devices. This tends to mostly be a thing on SoC devices with some >>> very funky memory. But I guess you don't care about these use-case, so >>> should be ok. >>> >>> I couldn't come up with a good name for these pfn users, maybe >>> dma_buf_pfn_attachment? This does _not_ have a struct device, but maybe >>> some of these new p2p source specifiers (or a list of those which are >>> allowed, no idea how this would need to fit into the new dma api). >>> >>> Cheers, Sima >>> -- >>> Simona Vetter >>> Software Engineer, Intel Corporation >>> http://blog.ffwll.ch >
On Thu, Jan 09, 2025 at 10:10:01AM +0100, Christian König wrote: > Am 08.01.25 um 17:22 schrieb Jason Gunthorpe: > > [SNIP] > > > > For P2P cases we are going toward (PFN + P2P source information) as > > > > input to the DMA API. The additional "P2P source information" provides > > > > a good way for co-operating drivers to represent private address > > > > spaces as well. Both importer and exporter can have full understanding > > > > what is being mapped and do the correct things, safely. > > > I can say from experience that this is clearly not going to work for all use > > > cases. > > > > > > It would mean that we have to pull a massive amount of driver specific > > > functionality into the DMA API. > > That isn't what I mean. There are two distinct parts, the means to > > describe the source (PFN + P2P source information) that is compatible > > with the DMA API, and the DMA API itself that works with a few general > > P2P source information types. > > > > Private source information would be detected by co-operating drivers > > and go down driver private paths. It would be rejected by other > > drivers. This broadly follows how the new API is working. > > > > So here I mean you can use the same PFN + Source API between importer > > and exporter and the importer can simply detect the special source and > > do the private stuff. It is not shifting things under the DMA API, it > > is building alongside it using compatible design approaches. You > > would match the source information, cast it to a driver structure, do > > whatever driver math is needed to compute the local DMA address and > > then write it to the device. Nothing is hard or "not going to work" > > here. > > Well to be honest that sounds like an absolutely horrible design. > > You are moving all responsibilities for inter driver handling into the > drivers themselves without any supervision by the core OS. > > Drivers are notoriously buggy and should absolutely not do things like that > on their own. IMHO, you and Jason give different meanings to the word "driver" in this discussion. It is up to the subsystems to decide how to provide a new API to the end drivers. It is worth reading this LWN article first. Dancing the DMA two-step - https://lwn.net/Articles/997563/ > > Do you have pointers to this new API? Latest version is here - https://lore.kernel.org/all/cover.1734436840.git.leon@kernel.org/ Unfortunately, I forgot to copy/paste the cover letter, but it can be seen in the previous version https://lore.kernel.org/all/cover.1733398913.git.leon@kernel.org/. The most complex example is the block layer implementation, which hides the DMA API from block drivers. https://lore.kernel.org/all/cover.1730037261.git.leon@kernel.org/ Thanks
On Thu, Jan 09, 2025 at 01:56:02AM +0800, Xu Yilun wrote: > > > > 5) iommufd and kvm are both using CPU addresses without DMA. No > > > > exporter mapping is possible > > > > > > We have customers using both KVM and XEN with DMA-buf, so I can clearly > > > confirm that this isn't true. > > > > Today they are mmaping the dma-buf into a VMA and then using KVM's > > follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable > > dma-buf must have a CPU PFN. > > Yes, the final target for KVM is still the CPU PFN, just with the help > of CPU mapping table. > > I also found the xen gntdev-dmabuf is calculating pfn from mapped > sgt. See the comment, it's ok because it's a fake device with a fake iommu and the xen code has special knowledge to peek behind the curtain. -Sima > From Christian's point, I assume only sgl->dma_address should be > used by importers but in fact not. More importers are 'abusing' sg dma > helpers. > > That said there are existing needs for importers to know more about the > real buffer resource, for mapping, or even more than mapping, > e.g. dmabuf_imp_grant_foreign_access()
On Thu, Jan 09, 2025 at 07:06:48AM +0800, Xu Yilun wrote: > > So I guess my first question is, which locking rules do you want here for > > pfn importers? > > > > follow_pfn() is unwanted for private MMIO, so dma_resv_lock. > > > > As Sima explained you either have follow_pfn() and mmu_notifier or you > > have DMA addresses and dma_resv lock / dma_fence. > > > > Just giving out PFNs without some lifetime associated with them is one of > > the major problems we faced before and really not something you can do. > > I'm trying to make exporter give out PFN with lifetime control via > move_notify() in this series. May not be conceptually correct but seems > possible. > > > > > > > If mmu notifiers is fine, then I think the current approach of follow_pfn > > should be ok. But if you instead dma_resv_lock rules (or the cpu mmap > > somehow is an issue itself), then I think the clean design is create a new > > > > cpu mmap() is an issue, this series is aimed to eliminate userspace > > mapping for private MMIO resources. > > > > Why? > > OK, I can start from here. > > It is about the Secure guest, or CoCo VM. The memory and MMIOs assigned > to this kind of guest is unaccessable to host itself, by leveraging HW > encryption & access control technology (e.g. Intel TDX, AMD SEV-SNP ...). > This is to protect the tenant data being stolen by CSP itself. > > The impact is when host accesses the encrypted region, bad things > happen to system, e.g. memory corruption, MCE. Kernel is trying to > mitigate most of the impact by alloc and assign user unmappable memory > resources (private memory) to guest, which prevents userspace > accidents. guest_memfd is the private memory provider that only allows > for KVM to position the page/pfn by fd + offset and create secondary > page table (EPT, NPT...), no host mapping, no VMA, no mmu_notifier. But > the lifecycle of the private memory is still controlled by guest_memfd. > When fallocate(fd, PUNCH_HOLE), the memory resource is revoked and KVM > is notified to unmap corresponding EPT. > > The further thought is guest_memfd is also suitable for normal guest. > It makes no sense VMM must build host mapping table before guest access. > > Now I'm trying to seek a similar way for private MMIO. A MMIO resource > provider that is exported as an fd. It controls the lifecycle of the > MMIO resource and notify KVM when revoked. dma-buf seems to be a good > provider which have done most of the work, only need to extend the > memory resource seeking by fd + offset. So if I'm getting this right, what you need from a functional pov is a dma_buf_tdx_mmap? Because due to tdx restrictions, the normal dma_buf_mmap is not going to work I guess? Also another thing that's a bit tricky is that kvm kinda has a 3rd dma-buf memory model: - permanently pinned dma-buf, they never move - dynamic dma-buf, they move through ->move_notify and importers can remap - revocable dma-buf, which thus far only exist for pci mmio resources Since we're leaning even more on that 3rd model I'm wondering whether we should make it something official. Because the existing dynamic importers do very much assume that re-acquiring the memory after move_notify will work. But for the revocable use-case the entire point is that it will never work. I feel like that's a concept we need to make explicit, so that dynamic importers can reject such memory if necessary. 
So yeah there's a bunch of tricky lifetime questions that need to be sorted out with proper design I think, and the current "let's just use pfn directly" proposal hides them all under the rug. I agree with Christian that we need a bit more care here. -Sima > > > > > separate access mechanism just for that. It would be the 5th or so (kernel > > vmap, userspace mmap, dma_buf_attach and driver private stuff like > > virtio_dma_buf.c where you access your buffer with a uuid), so really not > > a big deal. > > > > OK, will think more about that. > > > > Please note that we have follow_pfn() + mmu_notifier working for KVM/XEN > > follow_pfn() & mmu_notifier won't work here, because there is no VMA and no host mapping > table. > > Thanks, > Yilun > > with MMIO mappings and P2P. And that required exactly zero DMA-buf changes > > :) > > > > I don't fully understand your use case, but I think it's quite likely that > > we already have that working. > > > > Regards, > > Christian.
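As a concrete anchor for the guest_memfd revocation flow described above: the provider is an fd, KVM resolves guest mappings by fd + offset, and revocation is driven by hole punching. A minimal userspace-side sketch, assuming the fd is a guest_memfd created through KVM:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>

/*
 * Revoke a range of private guest memory: guest_memfd frees the backing
 * pages and notifies KVM to zap the corresponding secondary page table
 * (EPT/NPT) entries.  No VMA, no mmu_notifier -- the fd carries the
 * lifetime.
 */
static int revoke_private_range(int gmem_fd, off_t offset, off_t len)
{
	return fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 offset, len);
}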
On Fri, Jan 10, 2025 at 08:24:22PM +0100, Simona Vetter wrote: > On Thu, Jan 09, 2025 at 01:56:02AM +0800, Xu Yilun wrote: > > > > > 5) iommufd and kvm are both using CPU addresses without DMA. No > > > > > exporter mapping is possible > > > > > > > > We have customers using both KVM and XEN with DMA-buf, so I can clearly > > > > confirm that this isn't true. > > > > > > Today they are mmaping the dma-buf into a VMA and then using KVM's > > > follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable > > > dma-buf must have a CPU PFN. > > > > Yes, the final target for KVM is still the CPU PFN, just with the help > > of CPU mapping table. > > > > I also found the xen gntdev-dmabuf is calculating pfn from mapped > > sgt. > > See the comment, it's ok because it's a fake device with fake iommu and > the xen code has special knowledge to peek behind the curtain. /* * Now convert sgt to array of gfns without accessing underlying pages. * It is not allowed to access the underlying struct page of an sg table * exported by DMA-buf, but since we deal with special Xen dma device here * (not a normal physical one) look at the dma addresses in the sg table * and then calculate gfns directly from them. */ for_each_sgtable_dma_page(sgt, &sg_iter, 0) { dma_addr_t addr = sg_page_iter_dma_address(&sg_iter); unsigned long pfn = bfn_to_pfn(XEN_PFN_DOWN(dma_to_phys(dev, addr))); *barf* Can we please all agree that is a horrible abuse of the DMA API and lets not point it as some acceptable "solution"? KVM and iommufd do not have fake struct devices with fake iommus. Jason
On Fri, Jan 10, 2025 at 08:34:55PM +0100, Simona Vetter wrote: > So if I'm getting this right, what you need from a functional pov is a > dma_buf_tdx_mmap? Because due to tdx restrictions, the normal dma_buf_mmap > is not going to work I guess? Don't want something TDX specific! There is a general desire, and CC is one, but there are other motivations like performance, to stop using VMAs and mmaps as a way to exchanage memory between two entities. Instead we want to use FDs. We now have memfd and guestmemfd that are usable with memfd_pin_folios() - this covers pinnable CPU memory. And for a long time we had DMABUF which is for all the other wild stuff, and it supports movable memory too. So, the normal DMABUF semantics with reservation locking and move notifiers seem workable to me here. They are broadly similar enough to the mmu notifier locking that they can serve the same job of updating page tables. > Also another thing that's a bit tricky is that kvm kinda has a 3rd dma-buf > memory model: > - permanently pinned dma-buf, they never move > - dynamic dma-buf, they move through ->move_notify and importers can remap > - revocable dma-buf, which thus far only exist for pci mmio resources I would like to see the importers be able to discover which one is going to be used, because we have RDMA cases where we can support 1 and 3 but not 2. revocable doesn't require page faulting as it is a terminal condition. > Since we're leaning even more on that 3rd model I'm wondering whether we > should make it something official. Because the existing dynamic importers > do very much assume that re-acquiring the memory after move_notify will > work. But for the revocable use-case the entire point is that it will > never work. > I feel like that's a concept we need to make explicit, so that dynamic > importers can reject such memory if necessary. It strikes me as strange that HW can do page faulting, so it can support #2, but it can't handle a non-present fault? > So yeah there's a bunch of tricky lifetime questions that need to be > sorted out with proper design I think, and the current "let's just use pfn > directly" proposal hides them all under the rug. I don't think these two things are connected. The lifetime model that KVM needs to work with the EPT, and that VFIO needs for it's MMIO, definately should be reviewed and evaluated. But it is completely orthogonal to allowing iommufd and kvm to access the CPU PFN to use in their mapping flows, instead of the dma_addr_t. What I want to get to is a replacement for scatter list in DMABUF that is an array of arrays, roughly like: struct memory_chunks { struct memory_p2p_provider *provider; struct bio_vec addrs[]; }; int (*dmabuf_get_memory)(struct memory_chunks **chunks, size_t *num_chunks); This can represent all forms of memory: P2P, private, CPU, etc and would be efficient with the new DMA API. This is similar to the structure BIO has, and it composes nicely with a future pin_user_pages() and memfd_pin_folios(). Jason
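To make the shape of that proposal concrete, a hedged sketch of an importer walking such a chunk array. memory_chunks, memory_p2p_provider and dmabuf_get_memory are taken from the proposal above and do not exist yet; the per-chunk nr_addrs count and the my_dev_* helpers are additional assumptions standing in for driver-private logic.

/* Sketch only: consuming the proposed array-of-arrays on the importer side. */
static int my_import_chunks(struct my_dev *mdev,
			    const struct memory_chunks *chunks,
			    size_t num_chunks)
{
	size_t i, j;

	for (i = 0; i < num_chunks; i++) {
		/* unknown providers must be rejected cleanly */
		if (!my_dev_supports_provider(mdev, chunks[i].provider))
			return -EOPNOTSUPP;

		for (j = 0; j < chunks[i].nr_addrs; j++) {
			const struct bio_vec *bv = &chunks[i].addrs[j];

			/*
			 * Driver-private math: translate the provider's
			 * address into whatever the HW is programmed with
			 * (IOVA, bus address, private interconnect address).
			 */
			my_dev_program_dma(mdev, chunks[i].provider, bv);
		}
	}
	return 0;
}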
On Thu, Jan 09, 2025 at 09:09:46AM +0100, Christian König wrote: > Answering on my reply once more as pure text mail. It is hard to do anything with your HTML mails :\ > > Well you were also the person who mangled the struct page pointers in > > the scatterlist because people were abusing this and getting a bloody > > nose :) But a lot of this is because scatterlist is too limited; you actually can't correctly describe anything except struct page backed CPU memory in a scatterlist. As soon as we can correctly describe everything in a datastructure these issues go away - or at least turn into a compatibility exchange problem. > > > > Where I do agree with Christian is that stuffing pfn support into the > > > dma_buf_attachment interfaces feels a bit much wrong. > > So it could a dmabuf interface like mmap/vmap()? I was also wondering > > about that. But finally I start to use dma_buf_attachment interface > > because of leveraging existing buffer pin and move_notify. > > Exactly that's the point, sharing pfn doesn't work with the pin and > move_notify interfaces because of the MMU notifier approach Sima > mentioned. Huh? mmu notifiers are for tracking changes to VMAs; pin/move_notify are for tracking changes to the underlying memory of a DMABUF. How does sharing the PFN vs the DMA address affect the pin/move_notify lifetime rules at all? > > > > > > > 3) Importing devices need to know if they are working with PCI P2P > > > > > > > addresses during mapping because they need to do things like turn on > > > > > > > ATS on their DMA. As for multi-path we have the same hacks inside mlx5 > > > > > > > today that assume DMABUFs are always P2P because we cannot determine > > > > > > > if things are P2P or not after being DMA mapped. > > > > > > Why would you need ATS on PCI P2P and not for system memory accesses? > > > > > ATS has a significant performance cost. It is mandatory for PCI P2P, > > > > > but ideally should be avoided for CPU memory. > > > > Huh, I didn't know that. And yeah kinda means we've butchered the pci p2p > > > > stuff a bit I guess ... > > > > Hui? Why should ATS be mandatory for PCI P2P? I should say "mandatory on some configurations". If you need the iommu turned on, and you have a PCI switch in your path, then ATS allows you to have full P2P bandwidth and retain full IOMMU security. > > We have tons of production systems using PCI P2P without ATS. And it's > > the first time I hear that. It is situational and topologically dependent. We have a very large number of deployed systems now that rely on ATS for PCI P2P. > > As Sima explained you either have follow_pfn() and mmu_notifier or you > > have DMA addresses and dma_resv lock / dma_fence. > > > > Just giving out PFNs without some lifetime associated with them is one > > of the major problems we faced before and really not something you can > > do. Certainly I never imagined there would be no lifetime; I expect anything coming out of the dmabuf interface to use the dma_resv lock, fence and move_notify for lifetime management, regardless of how the target memory is described. > > > > separate access mechanism just for that. It would be the 5th or so (kernel > > > > vmap, userspace mmap, dma_buf_attach and driver private stuff like > > > > virtio_dma_buf.c where you access your buffer with a uuid), so really not > > > > a big deal. > > > OK, will think more about that. > > > > Please note that we have follow_pfn() + mmu_notifier working for KVM/XEN > > with MMIO mappings and P2P. 
And that required exactly zero DMA-buf > > changes :) > > I don't fully understand your use case, but I think it's quite likely > > that we already have that working. In Intel CC systems you cannot mmap secure memory or the system will take a machine check. You have to convey secure memory inside a FD entirely within the kernel so that only an importer that understands how to handle secure memory (such as KVM) is using it to avoid machine checking. The patch series here should be thought of as the first part of this, allowing PFNs to flow without VMAs. IMHO the second part of preventing machine checks is not complete. In the approach I have been talking about the secure memory would be represented by a p2p_provider structure that is incompatible with everything else. For instance importers that can only do DMA would simply cleanly fail when presented with this memory. Jason
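For reference, the lifetime contract being invoked here is the existing dynamic-importer pattern: mappings are made and used only under the buffer's reservation lock, and move_notify invalidates them. dma_buf_dynamic_attach(), dma_buf_map_attachment() and struct dma_buf_attach_ops are the real API; the my_importer pieces are made up, and the same rules would apply whether the attachment hands back a scatterlist or, eventually, PFNs.

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>

static void my_move_notify(struct dma_buf_attachment *attach)
{
	struct my_importer *imp = attach->importer_priv;

	/*
	 * Called with the dma_resv lock held: tear down our device mapping.
	 * An importer that exposed PFNs through its own VMA would also have
	 * to unmap_mapping_range() that VMA here and re-fault later.
	 */
	my_importer_invalidate(imp);
}

static const struct dma_buf_attach_ops my_attach_ops = {
	.allow_peer2peer = true,
	.move_notify = my_move_notify,
};

static int my_importer_map(struct my_importer *imp, struct dma_buf *dmabuf,
			   struct device *dev)
{
	struct dma_buf_attachment *attach;
	struct sg_table *sgt;

	attach = dma_buf_dynamic_attach(dmabuf, dev, &my_attach_ops, imp);
	if (IS_ERR(attach))
		return PTR_ERR(attach);

	dma_resv_lock(dmabuf->resv, NULL);
	sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
	dma_resv_unlock(dmabuf->resv);

	if (IS_ERR(sgt)) {
		dma_buf_detach(dmabuf, attach);
		return PTR_ERR(sgt);
	}

	imp->attach = attach;	/* hypothetical bookkeeping */
	imp->sgt = sgt;
	return 0;
}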
On Fri, Jan 10, 2025 at 04:38:38PM -0400, Jason Gunthorpe wrote: > On Fri, Jan 10, 2025 at 08:34:55PM +0100, Simona Vetter wrote: > > > So if I'm getting this right, what you need from a functional pov is a > > dma_buf_tdx_mmap? Because due to tdx restrictions, the normal dma_buf_mmap I'm not sure if the word 'mmap' is proper here. It is kind of like the mapping from (FD+offset) to backend memory, which is directly provided by memory provider, rather than via VMA and cpu page table. Basically VMA & cpu page table are for host to access the memory, but VMM/host doesn't access most of the guest memory, so why must build them. > > is not going to work I guess? > > Don't want something TDX specific! > > There is a general desire, and CC is one, but there are other > motivations like performance, to stop using VMAs and mmaps as a way to > exchanage memory between two entities. Instead we want to use FDs. Exactly. > > We now have memfd and guestmemfd that are usable with > memfd_pin_folios() - this covers pinnable CPU memory. > > And for a long time we had DMABUF which is for all the other wild > stuff, and it supports movable memory too. > > So, the normal DMABUF semantics with reservation locking and move > notifiers seem workable to me here. They are broadly similar enough to > the mmu notifier locking that they can serve the same job of updating > page tables. Yes. With this new sharing model, the lifecycle of the shared memory/pfn/Page is directly controlled by dma-buf exporter, not by CPU mapping. So I also think reservation lock & move_notify works well for lifecycle control, no conflict (nothing to do) with follow_pfn() & mmu_notifier. > > > Also another thing that's a bit tricky is that kvm kinda has a 3rd dma-buf > > memory model: > > - permanently pinned dma-buf, they never move > > - dynamic dma-buf, they move through ->move_notify and importers can remap > > - revocable dma-buf, which thus far only exist for pci mmio resources > > I would like to see the importers be able to discover which one is > going to be used, because we have RDMA cases where we can support 1 > and 3 but not 2. > > revocable doesn't require page faulting as it is a terminal condition. > > > Since we're leaning even more on that 3rd model I'm wondering whether we > > should make it something official. Because the existing dynamic importers > > do very much assume that re-acquiring the memory after move_notify will > > work. But for the revocable use-case the entire point is that it will > > never work. > > > I feel like that's a concept we need to make explicit, so that dynamic > > importers can reject such memory if necessary. > > It strikes me as strange that HW can do page faulting, so it can > support #2, but it can't handle a non-present fault? > > > So yeah there's a bunch of tricky lifetime questions that need to be > > sorted out with proper design I think, and the current "let's just use pfn > > directly" proposal hides them all under the rug. > > I don't think these two things are connected. The lifetime model that > KVM needs to work with the EPT, and that VFIO needs for it's MMIO, > definately should be reviewed and evaluated. > > But it is completely orthogonal to allowing iommufd and kvm to access > the CPU PFN to use in their mapping flows, instead of the > dma_addr_t. 
> > What I want to get to is a replacement for scatter list in DMABUF that > is an array of arrays, roughly like: > > struct memory_chunks { > struct memory_p2p_provider *provider; > struct bio_vec addrs[]; > }; > int (*dmabuf_get_memory)(struct memory_chunks **chunks, size_t *num_chunks); Maybe we need to specify which object the API is operating on, struct dma_buf, or struct dma_buf_attachment, or a new attachment. I think: int (*dmabuf_get_memory)(struct dma_buf_attachment *attach, struct memory_chunks **chunks, size_t *num_chunks); works, but maybe a new attachment is conceptually more clear to importers and harder to abuse? Thanks, Yilun > > This can represent all forms of memory: P2P, private, CPU, etc and > would be efficient with the new DMA API. > > This is similar to the structure BIO has, and it composes nicely with > a future pin_user_pages() and memfd_pin_folios(). > > Jason
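Purely to illustrate the two placements being weighed in this sub-thread (nothing below exists today): the getter could hang off the existing attachment as written above, or off a dedicated attachment type so that the opt-in by pfn-capable importers is explicit and easier to audit. All names are hypothetical.

/* (a) Keyed off the existing attachment, as proposed above: */
int dmabuf_get_memory(struct dma_buf_attachment *attach,
		      struct memory_chunks **chunks, size_t *num_chunks);

/* (b) Via a dedicated attachment object that only opted-in importers create: */
struct dma_buf_mem_attachment;

struct dma_buf_mem_attachment *
dma_buf_mem_attach(struct dma_buf *dmabuf,
		   void (*move_notify)(struct dma_buf_mem_attachment *attach),
		   void *importer_priv);

int dma_buf_mem_get_chunks(struct dma_buf_mem_attachment *attach,
			   struct memory_chunks **chunks, size_t *num_chunks);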
On Fri, Jan 10, 2025 at 04:38:38PM -0400, Jason Gunthorpe wrote: > On Fri, Jan 10, 2025 at 08:34:55PM +0100, Simona Vetter wrote: > > > So if I'm getting this right, what you need from a functional pov is a > > dma_buf_tdx_mmap? Because due to tdx restrictions, the normal dma_buf_mmap > > is not going to work I guess? > > Don't want something TDX specific! > > There is a general desire, and CC is one, but there are other > motivations like performance, to stop using VMAs and mmaps as a way to > exchanage memory between two entities. Instead we want to use FDs. > > We now have memfd and guestmemfd that are usable with > memfd_pin_folios() - this covers pinnable CPU memory. > > And for a long time we had DMABUF which is for all the other wild > stuff, and it supports movable memory too. > > So, the normal DMABUF semantics with reservation locking and move > notifiers seem workable to me here. They are broadly similar enough to > the mmu notifier locking that they can serve the same job of updating > page tables. Yeah raw pfn is fine with me too. It might come with more "might not work on this dma-buf" restrictions, but I can't think of a practical one right now. > > Also another thing that's a bit tricky is that kvm kinda has a 3rd dma-buf > > memory model: > > - permanently pinned dma-buf, they never move > > - dynamic dma-buf, they move through ->move_notify and importers can remap > > - revocable dma-buf, which thus far only exist for pci mmio resources > > I would like to see the importers be able to discover which one is > going to be used, because we have RDMA cases where we can support 1 > and 3 but not 2. > > revocable doesn't require page faulting as it is a terminal condition. Yeah this is why I think we should separate the dynamic from the revocable use-cases clearly, because mixing them is going to result in issues. > > Since we're leaning even more on that 3rd model I'm wondering whether we > > should make it something official. Because the existing dynamic importers > > do very much assume that re-acquiring the memory after move_notify will > > work. But for the revocable use-case the entire point is that it will > > never work. > > > I feel like that's a concept we need to make explicit, so that dynamic > > importers can reject such memory if necessary. > > It strikes me as strange that HW can do page faulting, so it can > support #2, but it can't handle a non-present fault? I guess it's not a kernel issue, but userspace might want to know whether this dma-buf could potentially nuke the entire gpu context. Because that's what you get when we can't patch up a fault, which is the difference between a revocable dma-buf and a dynamic dma-buf. E.g. if a compositor gets a dma-buf it assumes that by just binding that it will not risk gpu context destruction (unless you're out of memory and everything is on fire anyway, and it's ok to die). But if a nasty client app supplies a revocable dma-buf, then it can shoot down the higher-privileged compositor gpu workload with precision. Which is not great, so maybe existing dynamic gpu importers should reject revocable dma-buf. That's at least what I had in mind as a potential issue. > > So yeah there's a bunch of tricky lifetime questions that need to be > > sorted out with proper design I think, and the current "let's just use pfn > > directly" proposal hides them all under the rug. > > I don't think these two things are connected. 
The lifetime model that > KVM needs to work with the EPT, and that VFIO needs for it's MMIO, > definately should be reviewed and evaluated. > > But it is completely orthogonal to allowing iommufd and kvm to access > the CPU PFN to use in their mapping flows, instead of the > dma_addr_t. > > What I want to get to is a replacement for scatter list in DMABUF that > is an array of arrays, roughly like: > > struct memory_chunks { > struct memory_p2p_provider *provider; > struct bio_vec addrs[]; > }; > int (*dmabuf_get_memory)(struct memory_chunks **chunks, size_t *num_chunks); > > This can represent all forms of memory: P2P, private, CPU, etc and > would be efficient with the new DMA API. > > This is similar to the structure BIO has, and it composes nicely with > a future pin_user_pages() and memfd_pin_folios(). Since you mention pin here, I think that's another aspect of the revocable vs dynamic question. Dynamic buffers are expected to sometimes just move around for no reason, and importers must be able to cope. For recovable exporters/importers I'd expect that movement is not happening, meaning it's pinned until the single terminal revocation. And maybe I read the kvm stuff wrong, but it reads more like the latter to me when crawling through the pfn code. Once we have the lifetime rules nailed then there's the other issue of how to describe the memory, and my take for that is that once the dma-api has a clear answer we'll just blindly adopt that one and done. And currently with either dynamic attachments and dma_addr_t or through fishing the pfn from the cpu pagetables there's some very clearly defined lifetime and locking rules (which kvm might get wrong, I've seen some discussions fly by where it wasn't doing a perfect job with reflecting pte changes, but that was about access attributes iirc). If we add something new, we need clear rules and not just "here's the kvm code that uses it". That's how we've done dma-buf at first, and it was a terrible mess of mismatched expecations. -Sima
On Tue, Jan 14, 2025 at 03:44:04PM +0100, Simona Vetter wrote: > E.g. if a compositor gets a dma-buf it assumes that by just binding that > it will not risk gpu context destruction (unless you're out of memory and > everything is on fire anyway, and it's ok to die). But if a nasty client > app supplies a revocable dma-buf, then it can shot down the higher > priviledged compositor gpu workload with precision. Which is not great, so > maybe existing dynamic gpu importers should reject revocable dma-buf. > That's at least what I had in mind as a potential issue. I see, so it is not that they can't handle a non-present fault it is just that the non-present effectively turns into a crash of the context and you want to avoid the crash. It makes sense to me to negotiate this as part of the API. > > This is similar to the structure BIO has, and it composes nicely with > > a future pin_user_pages() and memfd_pin_folios(). > > Since you mention pin here, I think that's another aspect of the revocable > vs dynamic question. Dynamic buffers are expected to sometimes just move > around for no reason, and importers must be able to cope. Yes, and we have importers that can tolerate dynamic and those that can't. Though those that can't tolerate it can often implement revoke. I view your list as a cascade: 1) Fully pinned can never be changed so long as the attach is present 2) Fully pinned, but can be revoked. Revoked is a fatal condition and the importer is allowed to experience an error 3) Fully dynamic and always present. Support for move, and restartable fault, is required Today in RDMA we ask the exporter if it is 1 or 3 and allow different things. I've seen the GPU side start to offer 1 more often as it has significant performance wins. > For recovable exporters/importers I'd expect that movement is not > happening, meaning it's pinned until the single terminal revocation. And > maybe I read the kvm stuff wrong, but it reads more like the latter to me > when crawling through the pfn code. kvm should be fully faultable and it should be able handle move. It handles move today using the mmu notifiers after all. kvm would need to interact with the dmabuf reservations on its page fault path. iommufd cannot be faultable and it would only support revoke. For VFIO revoke would not be fully terminal as VFIO can unrevoke too (sigh). If we make revoke special I'd like to eventually include unrevoke for this reason. > Once we have the lifetime rules nailed then there's the other issue of how > to describe the memory, and my take for that is that once the dma-api has > a clear answer we'll just blindly adopt that one and done. This is what I hope, we are not there yet, first Leon's series needs to get merged then we can start on making the DMA API P2P safe without any struct page. From there it should be clear what direction things go in. DMABUF would return pfns annotated with whatever matches the DMA API, and the importer would be able to inspect the PFNs to learn information like their P2Pness, CPU mappability or whatever. I'm pushing for the extra struct, and Christoph has been thinking about searching a maple tree on the PFN. We need to see what works best. > And currently with either dynamic attachments and dma_addr_t or through > fishing the pfn from the cpu pagetables there's some very clearly defined > lifetime and locking rules (which kvm might get wrong, I've seen some > discussions fly by where it wasn't doing a perfect job with reflecting pte > changes, but that was about access attributes iirc). 
Wouldn't surprise me, mmu notifiers are very complex all around. We've had bugs already where the mm doesn't signal the notifiers at the right points. > If we add something > new, we need clear rules and not just "here's the kvm code that uses it". > That's how we've done dma-buf at first, and it was a terrible mess of > mismatched expecations. Yes, that would be wrong. It should be self-defined within dmabuf and kvm should adapt to it, move semantics and all. My general desire is to move all of RDMA's MR process away from scatterlist and work using only the new DMA API. This will save *huge* amounts of memory in common workloads and be the basis for non-struct page DMA support, including P2P. Jason
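A hedged sketch of how the cascade above (or, as suggested in the reply that follows, a bitmask) could be negotiated at attach time. The flag names and the idea of passing them when attaching are illustrative only, not an existing dma-buf API.

#include <linux/bits.h>

/* Hypothetical lifetime capabilities, negotiated when attaching. */
enum dma_buf_lifetime_caps {
	DMA_BUF_CAP_PINNED	= BIT(0), /* never moves while attached */
	DMA_BUF_CAP_REVOCABLE	= BIT(1), /* may be revoked; revoke is terminal */
	DMA_BUF_CAP_MOVABLE	= BIT(2), /* moves via move_notify, re-mappable */
};

/* e.g. RDMA without on-demand paging: copes with pinned or revocable, not moves */
static const unsigned int rdma_accepted_caps =
	DMA_BUF_CAP_PINNED | DMA_BUF_CAP_REVOCABLE;

/* e.g. a compositor import: reject revocable buffers to avoid context loss */
static const unsigned int compositor_accepted_caps =
	DMA_BUF_CAP_PINNED | DMA_BUF_CAP_MOVABLE;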
On Tue, Jan 14, 2025 at 01:31:03PM -0400, Jason Gunthorpe wrote: > On Tue, Jan 14, 2025 at 03:44:04PM +0100, Simona Vetter wrote: > > > E.g. if a compositor gets a dma-buf it assumes that by just binding that > > it will not risk gpu context destruction (unless you're out of memory and > > everything is on fire anyway, and it's ok to die). But if a nasty client > > app supplies a revocable dma-buf, then it can shot down the higher > > priviledged compositor gpu workload with precision. Which is not great, so > > maybe existing dynamic gpu importers should reject revocable dma-buf. > > That's at least what I had in mind as a potential issue. > > I see, so it is not that they can't handle a non-present fault it is > just that the non-present effectively turns into a crash of the > context and you want to avoid the crash. It makes sense to me to > negotiate this as part of the API. Yup. > > > This is similar to the structure BIO has, and it composes nicely with > > > a future pin_user_pages() and memfd_pin_folios(). > > > > Since you mention pin here, I think that's another aspect of the revocable > > vs dynamic question. Dynamic buffers are expected to sometimes just move > > around for no reason, and importers must be able to cope. > > Yes, and we have importers that can tolerate dynamic and those that > can't. Though those that can't tolerate it can often implement revoke. > > I view your list as a cascade: > 1) Fully pinned can never be changed so long as the attach is present > 2) Fully pinned, but can be revoked. Revoked is a fatal condition and > the importer is allowed to experience an error > 3) Fully dynamic and always present. Support for move, and > restartable fault, is required > > Today in RDMA we ask the exporter if it is 1 or 3 and allow different > things. I've seen the GPU side start to offer 1 more often as it has > significant performance wins. I'm not entirely sure a cascade is the right thing or whether we should have importers just specify bitmask of what is acceptable to them. But that little detail aside this sounds like what I was thinking of. > > For recovable exporters/importers I'd expect that movement is not > > happening, meaning it's pinned until the single terminal revocation. And > > maybe I read the kvm stuff wrong, but it reads more like the latter to me > > when crawling through the pfn code. > > kvm should be fully faultable and it should be able handle move. It > handles move today using the mmu notifiers after all. > > kvm would need to interact with the dmabuf reservations on its page > fault path. > > iommufd cannot be faultable and it would only support revoke. For VFIO > revoke would not be fully terminal as VFIO can unrevoke too > (sigh). If we make revoke special I'd like to eventually include > unrevoke for this reason. > > > Once we have the lifetime rules nailed then there's the other issue of how > > to describe the memory, and my take for that is that once the dma-api has > > a clear answer we'll just blindly adopt that one and done. > > This is what I hope, we are not there yet, first Leon's series needs > to get merged then we can start on making the DMA API P2P safe without > any struct page. From there it should be clear what direction things > go in. > > DMABUF would return pfns annotated with whatever matches the DMA API, > and the importer would be able to inspect the PFNs to learn > information like their P2Pness, CPU mappability or whatever. 
I think for 90% of exporters pfn would fit, but there's some really funny ones where you cannot get a cpu pfn by design. So we need to keep the pfn-less interfaces around. But ideally for the pfn-capable exporters we'd have helpers/common code that just implements all the other interfaces. I'm not worried about the resulting fragmentation, because dma-buf is the "everything is optional" api. We just need to make sure we have really clear semantic api contracts, like with the revocable vs dynamic/moveable distinction. Otherwise things go boom when an importer gets an unexpected dma-buf fd, and we pass this across security boundaries a lot. > I'm pushing for the extra struct, and Christoph has been thinking > about searching a maple tree on the PFN. We need to see what works best. > > > And currently with either dynamic attachments and dma_addr_t or through > > fishing the pfn from the cpu pagetables there's some very clearly defined > > lifetime and locking rules (which kvm might get wrong, I've seen some > > discussions fly by where it wasn't doing a perfect job with reflecting pte > > changes, but that was about access attributes iirc). > > Wouldn't surprise me, mmu notifiers are very complex all around. We've > had bugs already where the mm doesn't signal the notifiers at the > right points. > > > If we add something > > new, we need clear rules and not just "here's the kvm code that uses it". > > That's how we've done dma-buf at first, and it was a terrible mess of > > mismatched expecations. > > Yes, that would be wrong. It should be self-defined within dmabuf and > kvm should adapt to it, move semantics and all. Ack. I feel like we have a plan here. Summary from my side: - Sort out pin vs revocable vs dynamic/moveable semantics, make sure importers have no surprises. - Adopt whatever new dma-api datastructures pop out of the dma-api reworks. - Add pfn based memory access as yet another optional access method, with helpers so that exporters who support this get all the others for free. I don't see a strict ordering between these, imo should be driven by actual users of the dma-buf api. Already done: - dmem cgroup so that we can resource control device pinnings just landed in drm-next for next merge window. So that part is imo sorted and we can charge ahead with pinning into device memory without all the concerns we've had years ago when discussing that for p2p dma-buf support. But there might be some work so that we can p2p pin without requiring dynamic attachments only, I haven't checked whether that needs adjustment in dma-buf.c code or just in exporters. Anything missing? I feel like this is small enough that m-l archives is good enough. For some of the bigger projects we do in graphics we sometimes create entries in our kerneldoc with wip design consensus and things like that. But feels like overkill here. > My general desire is to move all of RDMA's MR process away from > scatterlist and work using only the new DMA API. This will save *huge* > amounts of memory in common workloads and be the basis for non-struct > page DMA support, including P2P. Yeah a more memory efficient structure than the scatterlist would be really nice. That would even benefit the very special dma-buf exporters where you cannot get a pfn and only the dma_addr_t, although most of those (all maybe even?) have contig buffers, so your scatterlist has only one entry. But it would definitely be nice from a design pov. 
Aside: A way to more efficiently create compressed scatterlists would be neat too, because a lot of drivers hand-roll that and it's a bit brittle and kinda silly to duplicate. With compressed I mean just a single entry for a contig range, in practice thanks to huge pages/folios and allocators trying to hand out contig ranges if there's plenty of memory that saves a lot of memory too. But currently it's a bit a pain to construct these efficiently, mostly it's just a two-pass approach and then trying to free surplus memory or krealloc to fit. Also I don't have good ideas here, but dma-api folks might have some from looking at too many things that create scatterlists. -Sima
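For the aside about hand-rolled compressed scatterlists, a minimal sketch of the usual pattern: coalesce physically contiguous pages into one range, typically run twice (once with a NULL output to size the array, once to fill it), which is exactly the duplication a common helper could hide. struct phys_range and the helper name are made up.

#include <linux/mm.h>
#include <linux/io.h>

struct phys_range {
	phys_addr_t addr;
	size_t len;
};

/* Pass out == NULL first to learn how many ranges are needed. */
static size_t compress_pages(struct page **pages, size_t nr_pages,
			     struct phys_range *out)
{
	phys_addr_t prev_end = 0;
	size_t nr_ranges = 0;
	size_t i;

	for (i = 0; i < nr_pages; i++) {
		phys_addr_t phys = page_to_phys(pages[i]);

		if (nr_ranges && phys == prev_end) {
			/* extends the previous range */
			if (out)
				out[nr_ranges - 1].len += PAGE_SIZE;
		} else {
			if (out) {
				out[nr_ranges].addr = phys;
				out[nr_ranges].len = PAGE_SIZE;
			}
			nr_ranges++;
		}
		prev_end = phys + PAGE_SIZE;
	}
	return nr_ranges;
}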
On Wed, Jan 15, 2025 at 09:55:29AM +0100, Simona Vetter wrote: > I think for 90% of exporters pfn would fit, but there's some really funny > ones where you cannot get a cpu pfn by design. So we need to keep the > pfn-less interfaces around. But ideally for the pfn-capable exporters we'd > have helpers/common code that just implements all the other interfaces. There is no way to have dma address without a PFN in Linux right now. How would you generate them? That implies you have an IOMMU that can generate IOVAs for something that doesn't have a physical address at all. Or do you mean some that don't have pages associated with them, and thus have pfn_valid fail on them? They still have a PFN, just not one that is valid to use in most of the Linux MM.
Am 10.01.25 um 21:54 schrieb Jason Gunthorpe: > [SNIP] >>> I don't fully understand your use case, but I think it's quite likely >>> that we already have that working. > In Intel CC systems you cannot mmap secure memory or the system will > take a machine check. > > You have to convey secure memory inside a FD entirely within the > kernel so that only an importer that understands how to handle secure > memory (such as KVM) is using it to avoid machine checking. > > The patch series here should be thought of as the first part of this, > allowing PFNs to flow without VMAs. IMHO the second part of preventing > machine checks is not complete. > > In the approach I have been talking about the secure memory would be > represented by a p2p_provider structure that is incompatible with > everything else. For instance importers that can only do DMA would > simply cleanly fail when presented with this memory. That's a rather interesting use case, but not something I consider fitting for the DMA-buf interface. See, DMA-buf is meant to be used between drivers to allow DMA access on shared buffers. What you are trying to do here instead is to give memory in the form of a file descriptor to a client VM to do things like CPU mapping and giving it to drivers to do DMA etc... As far as I can see this memory is secured by some kind of MMU which makes sure that even the host CPU can't access it without causing a machine check exception. That sounds more like something for the TEE driver instead of anything DMA-buf should be dealing with. Regards, Christian. > > Jason
On Wed, Jan 15, 2025 at 10:32:34AM +0100, Christoph Hellwig wrote: > On Wed, Jan 15, 2025 at 09:55:29AM +0100, Simona Vetter wrote: > > I think for 90% of exporters pfn would fit, but there's some really funny > > ones where you cannot get a cpu pfn by design. So we need to keep the > > pfn-less interfaces around. But ideally for the pfn-capable exporters we'd > > have helpers/common code that just implements all the other interfaces. > > There is no way to have dma address without a PFN in Linux right now. > How would you generate them? That implies you have an IOMMU that can > generate IOVAs for something that doesn't have a physical address at > all. > > Or do you mean some that don't have pages associated with them, and > thus have pfn_valid fail on them? They still have a PFN, just not > one that is valid to use in most of the Linux MM. He is talking about private interconnect hidden inside clusters of devices. Ie the system may have many GPUs and those GPUs have their own private interconnect between them. It is not PCI, and packets don't transit through the CPU SOC at all, so the IOMMU is not involved. DMA can happen on that private interconnect, but from a Linux perspective it is not DMA API DMA, and the addresses used to describe it are not part of the CPU address space. The initiating device will have a way to choose which path the DMA goes through when setting up the DMA. Effectively if you look at one of these complex GPU systems you will have a physical bit of memory, say HBM memory located on the GPU. Then from an OS perspective we have a whole bunch of different representations/addresses of that very same memory. A Grace/Hopper system would have at least three different addresses (ZONE_MOVABLE, a PCI MMIO aperture, and a global NVLink address). Each different address effectively represents a different physical interconnect multipath, and an initiator may have three different routes/addresses available to reach the same physical target memory. Part of what DMABUF needs to do is pick which multi-path will be used between exporter/importer. So, the hack today has the DMABUF exporter GPU driver understand the importer is part of the private interconnect and then generate a scatterlist with a NULL sg_page, but an sg_dma_addr that encodes the private global address on the hidden interconnect. Somehow the importer knows this has happened and programs its HW to use the private path. Jason
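For readers unfamiliar with it, the hack described above looks roughly like this on the exporter side (the my_gpu_bo naming and chunk layout are made up); it only works because the specific importer knows these sg_dma_address() values are private interconnect addresses rather than IOVAs, which is precisely the abuse being discussed.

#include <linux/scatterlist.h>

/* Illustration of the existing hack, not a recommendation. */
static int export_private_interconnect(struct my_gpu_bo *bo,
				       struct sg_table *sgt)
{
	struct scatterlist *sg;
	unsigned int i;
	int ret;

	ret = sg_alloc_table(sgt, bo->nr_chunks, GFP_KERNEL);
	if (ret)
		return ret;

	for_each_sgtable_sg(sgt, sg, i) {
		/* no struct page behind this memory */
		sg_set_page(sg, NULL, bo->chunks[i].size, 0);
		/* not an IOVA: a private global interconnect address */
		sg_dma_address(sg) = bo->chunks[i].interconnect_addr;
		sg_dma_len(sg) = bo->chunks[i].size;
	}
	return 0;
}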
On Wed, Jan 15, 2025 at 10:38:00AM +0100, Christian König wrote: > Am 10.01.25 um 21:54 schrieb Jason Gunthorpe: > > [SNIP] > > > > I don't fully understand your use case, but I think it's quite likely > > > > that we already have that working. > > In Intel CC systems you cannot mmap secure memory or the system will > > take a machine check. > > > > You have to convey secure memory inside a FD entirely within the > > kernel so that only an importer that understands how to handle secure > > memory (such as KVM) is using it to avoid machine checking. > > > > The patch series here should be thought of as the first part of this, > > allowing PFNs to flow without VMAs. IMHO the second part of preventing > > machine checks is not complete. > > > > In the approach I have been talking about the secure memory would be > > represented by a p2p_provider structure that is incompatible with > > everything else. For instance importers that can only do DMA would > > simply cleanly fail when presented with this memory. > > That's a rather interesting use case, but not something I consider fitting > for the DMA-buf interface. To recast the problem statement, it is basically the same as your device private interconnects. There are certain devices that understand how to use this memory, and if they work together they can access it. > See DMA-buf in meant to be used between drivers to allow DMA access on > shared buffers. They are shared, just not with everyone :) > What you try to do here instead is to give memory in the form of a file > descriptor to a client VM to do things like CPU mapping and giving it to > drivers to do DMA etc... How is this paragraph different from the first? It is a shared buffer that we want real DMA and CPU "DMA" access to. It is "private" so things that don't understand the interconnect rules cannot access it. > That sounds more something for the TEE driver instead of anything DMA-buf > should be dealing with. Has nothing to do with TEE. Jason
Explicitly replying as text mail once more. I just love the AMD mails servers :( Christian. Am 15.01.25 um 14:45 schrieb Christian König: > Am 15.01.25 um 14:38 schrieb Jason Gunthorpe: >> On Wed, Jan 15, 2025 at 10:38:00AM +0100, Christian König wrote: >>> Am 10.01.25 um 21:54 schrieb Jason Gunthorpe: >>>> [SNIP] >>>>>> I don't fully understand your use case, but I think it's quite likely >>>>>> that we already have that working. >>>> In Intel CC systems you cannot mmap secure memory or the system will >>>> take a machine check. >>>> >>>> You have to convey secure memory inside a FD entirely within the >>>> kernel so that only an importer that understands how to handle secure >>>> memory (such as KVM) is using it to avoid machine checking. >>>> >>>> The patch series here should be thought of as the first part of this, >>>> allowing PFNs to flow without VMAs. IMHO the second part of preventing >>>> machine checks is not complete. >>>> >>>> In the approach I have been talking about the secure memory would be >>>> represented by a p2p_provider structure that is incompatible with >>>> everything else. For instance importers that can only do DMA would >>>> simply cleanly fail when presented with this memory. >>> That's a rather interesting use case, but not something I consider fitting >>> for the DMA-buf interface. >> To recast the problem statement, it is basically the same as your >> device private interconnects. There are certain devices that >> understand how to use this memory, and if they work together they can >> access it. >> >>> See DMA-buf in meant to be used between drivers to allow DMA access on >>> shared buffers. >> They are shared, just not with everyone :) >> >>> What you try to do here instead is to give memory in the form of a file >>> descriptor to a client VM to do things like CPU mapping and giving it to >>> drivers to do DMA etc... >> How is this paragraph different from the first? It is a shared buffer >> that we want real DMA and CPU "DMA" access to. It is "private" so >> things that don't understand the interconnect rules cannot access it. > > Yeah, but it's private to the exporter. And a very fundamental rule of > DMA-buf is that the exporter is the one in control of things. > > So for example it is illegal for an importer to setup CPU mappings to > a buffer. That's why we have dma_buf_mmap() which redirects mmap() > requests from the importer to the exporter. > > In your use case here the importer wants to be in control and do > things like both CPU as well as DMA mappings. > > As far as I can see that is really not an use case which fits DMA-buf > in any way. > >>> That sounds more something for the TEE driver instead of anything DMA-buf >>> should be dealing with. >> Has nothing to do with TEE. > > Why? > > Regards, > Christian. > >> Jason >
On Wed, Jan 15, 2025 at 02:46:56PM +0100, Christian König wrote: > Explicitly replying as text mail once more. > > I just love the AMD mails servers :( :( This is hard > > Yeah, but it's private to the exporter. And a very fundamental rule of > > DMA-buf is that the exporter is the one in control of things. I've said a few times now, I don't think we can build the kind of buffer sharing framework we need to solve all the problems with this philosophy. It is also inefficient with the new DMA API. I think it is backwards looking and we need to move forwards with fixing the fundamental API issues which motivated that design. > > So for example it is illegal for an importer to setup CPU mappings to a > > buffer. That's why we have dma_buf_mmap() which redirects mmap() > > requests from the importer to the exporter. Like this, in a future no-scatter list world I would want to make this safe. The importer will have enough information to know if CPU mappings exist and are safe to use under what conditions. There is no reason the importer should not be able to CPU access memory that is HW permitted to be CPU accessible. If the importer needs CPU access and the exporter cannot provide it then the attachment simply fails. Saying CPU access is banned 100% of the time is not a helpful position when we have use cases that need it. > > As far as I can see that is really not an use case which fits DMA-buf in > > any way. I really don't want to make a dmabuf2 - everyone would have to implement it, including all the GPU drivers if they want to work with RDMA. I don't think this makes any sense compared to incrementally evolving dmabuf with more optional capabilities. > > > > That sounds more something for the TEE driver instead of anything DMA-buf > > > > should be dealing with. > > > Has nothing to do with TEE. > > > > Why? The Linux TEE framework is not used as part of confidential compute. CC already has guest memfd for holding it's private CPU memory. This is about confidential MMIO memory. This is also not just about the KVM side, the VM side also has issues with DMABUF and CC - only co-operating devices can interact with the VM side "encrypted" memory and there needs to be a negotiation as part of all buffer setup what the mutual capability is. :\ swiotlb hides some of this some times, but confidential P2P is currently unsolved. Jason
Sending it as text mail to the mailing lists once more :( Christian. Am 15.01.25 um 15:29 schrieb Christian König: > Am 15.01.25 um 15:14 schrieb Jason Gunthorpe: >> On Wed, Jan 15, 2025 at 02:46:56PM +0100, Christian König wrote: >> [SNIP] >>>> Yeah, but it's private to the exporter. And a very fundamental rule of >>>> DMA-buf is that the exporter is the one in control of things. >> I've said a few times now, I don't think we can build the kind of >> buffer sharing framework we need to solve all the problems with this >> philosophy. It is also inefficient with the new DMA API. >> >> I think it is backwards looking and we need to move forwards with >> fixing the fundamental API issues which motivated that design. > > And that's what I clearly see completely different. > > Those rules are not something we cam up with because of some > limitation of the DMA-API, but rather from experience working with > different device driver and especially their developers. > > Applying and enforcing those restrictions is absolutely mandatory must > have for extending DMA-buf. > >>>> So for example it is illegal for an importer to setup CPU mappings to a >>>> buffer. That's why we have dma_buf_mmap() which redirects mmap() >>>> requests from the importer to the exporter. >> Like this, in a future no-scatter list world I would want to make this >> safe. The importer will have enough information to know if CPU >> mappings exist and are safe to use under what conditions. >> >> There is no reason the importer should not be able to CPU access >> memory that is HW permitted to be CPU accessible. >> >> If the importer needs CPU access and the exporter cannot provide it >> then the attachment simply fails. >> >> Saying CPU access is banned 100% of the time is not a helpful position >> when we have use cases that need it. > > That approach is an absolutely no-go from my side. > > We have fully intentionally implemented the restriction that importers > can't CPU access DMA-buf for both kernel and userspace without going > through the exporter because of design requirements and a lot of > negative experience with exactly this approach. > > This is not something which is discuss-able in any way possible. > >>>> As far as I can see that is really not an use case which fits DMA-buf in >>>> any way. >> I really don't want to make a dmabuf2 - everyone would have to >> implement it, including all the GPU drivers if they want to work with >> RDMA. I don't think this makes any sense compared to incrementally >> evolving dmabuf with more optional capabilities. > > The point is that a dmabuf2 would most likely be rejected as well or > otherwise run into the same issues we have seen before. > >>>>>> That sounds more something for the TEE driver instead of anything DMA-buf >>>>>> should be dealing with. >>>>> Has nothing to do with TEE. >>>> Why? >> The Linux TEE framework is not used as part of confidential compute. >> >> CC already has guest memfd for holding it's private CPU memory. > > Where is that coming from and how it is used? > >> This is about confidential MMIO memory. > > Who is the exporter and who is the importer of the DMA-buf in this use > case? > >> This is also not just about the KVM side, the VM side also has issues >> with DMABUF and CC - only co-operating devices can interact with the >> VM side "encrypted" memory and there needs to be a negotiation as part >> of all buffer setup what the mutual capability is. :\ swiotlb hides >> some of this some times, but confidential P2P is currently unsolved. 
> > Yes and it is documented by now how that is supposed to happen with > DMA-buf. > > As far as I can see there is not much new approach here. > > Regards, > Christian. > >> Jason >
On Wed, Jan 15, 2025 at 03:30:47PM +0100, Christian König wrote: > > Those rules are not something we cam up with because of some limitation > > of the DMA-API, but rather from experience working with different device > > driver and especially their developers. I would say it stems from the use of scatter list. You do not have enough information exchanged between exporter and importer to implement something sane and correct. At that point being restrictive is a reasonable path. Because of scatterlist developers don't have APIs that correctly solve the problems they want to solve, so of course things get into a mess. > > Applying and enforcing those restrictions is absolutely mandatory must > > have for extending DMA-buf. You said to come to the maintainers with the problems, here are the problems. Your answer is don't use dmabuf. That doesn't make the problems go away :( > > > I really don't want to make a dmabuf2 - everyone would have to > > > implement it, including all the GPU drivers if they want to work with > > > RDMA. I don't think this makes any sense compared to incrementally > > > evolving dmabuf with more optional capabilities. > > > > The point is that a dmabuf2 would most likely be rejected as well or > > otherwise run into the same issues we have seen before. You'd need to be much more concrete and technical in your objections to cause a rejection. "We tried something else before and it didn't work" won't cut it. There is a very simple problem statement here, we need a FD handle for various kinds of memory, with a lifetime model that fits a couple of different use cases. The exporter and importer need to understand what type of memory it is and what rules apply to working with it. The required importers are more general that just simple PCI DMA. I feel like this is already exactly DMABUF's mission. Besides, you have been saying to go do this in TEE or whatever, how is that any different from dmabuf2? > > > > > > > That sounds more something for the TEE driver instead of anything DMA-buf > > > > > > > should be dealing with. > > > > > > Has nothing to do with TEE. > > > > > Why? > > > The Linux TEE framework is not used as part of confidential compute. > > > > > > CC already has guest memfd for holding it's private CPU memory. > > > > Where is that coming from and how it is used? What do you mean? guest memfd is the result of years of negotiation in the mm and x86 arch subsystems :( It is used like a normal memfd, and we now have APIs in KVM and iommufd to directly intake and map from a memfd. I expect guestmemfd will soon grow some more generic dmabuf-like lifetime callbacks to avoid pinning - it already has some KVM specific APIs IIRC. But it is 100% exclusively focused on CPU memory and nothing else. > > > This is about confidential MMIO memory. > > > > Who is the exporter and who is the importer of the DMA-buf in this use > > case? In this case Xu is exporting MMIO from VFIO and importing to KVM and iommufd. > > This is also not just about the KVM side, the VM side also has issues > > with DMABUF and CC - only co-operating devices can interact with the > > VM side "encrypted" memory and there needs to be a negotiation as part > > of all buffer setup what the mutual capability is. :\ swiotlb hides > > some of this some times, but confidential P2P is currently unsolved. > > Yes and it is documented by now how that is supposed to happen with > DMA-buf. I doubt that. It is complex and not fully solved in the core code today. 
Many scenarios do not work correctly, devices don't even exist yet that can exercise the hard paths. This is a future problem :( Jason
On Wed, Jan 15, 2025 at 05:34:23PM +0100, Christian König wrote: > Granted, let me try to improve this. > Here is a real world example of one of the issues we ran into and why > CPU mappings of importers are redirected to the exporter. > We have a good bunch of different exporters who track the CPU mappings > of their backing store using address_space objects in one way or > another and then uses unmap_mapping_range() to invalidate those CPU > mappings. > But when importers get the PFNs of the backing store they can look > behind the curtain and directly insert this PFN into the CPU page > tables. > We had literally tons of cases like this where drivers developers cause > access after free issues because the importer created a CPU mappings on > their own without the exporter knowing about it. > This is just one example of what we ran into. Additional to that > basically the whole synchronization between drivers was overhauled as > well because we found that we can't trust importers to always do the > right thing. But this, fundamentally, is importers creating attachments and then *ignoring the lifetime rules of DMABUF*. If you created an attachment, got a move and *ignored the move* because you put the PFN in your own VMA, then you are not following the attachment lifetime rules! To implement this safely the driver would need to use unmap_mapping_range() on the driver VMA inside the move callback, and hook into the VMA fault callback to re-attach the dmabuf. This is where I get into trouble with your argument. It is not that the API has an issue, or that the rules of the API are not logical and functional. You are arguing that even a logical and functional API will be misused by some people and that reviewers will not catch it. Honestly, I don't think that is consistent with the kernel philosophy. We should do our best to make APIs that are hard to misuse, but if we can't achieve that it doesn't mean we stop and give up on problems, we go into the world of APIs that can be misused and we are supposed to rely on the reviewer system to catch it. > You can already turn both a TEE allocated buffer as well as a memfd > into a DMA-buf. So basically TEE and memfd already provides different > interfaces which go beyond what DMA-buf does and allows. > In other words if you want to do things like direct I/O to block or > network devices you can mmap() your memfd and do this while at the same > time send your memfd as DMA-buf to your GPU, V4L or neural accelerator. > Would this be a way you could work with as well? I guess, but this still requires creating a dmabuf2 type thing with very similar semantics and then shimming dmabuf2 to 1 for DRM consumers. I don't see how it addresses your fundamental concern that the semantics we want are an API that is too easy for drivers to abuse. And being more functional and efficient we'd just see people wanting to use dmabuf2 directly instead of bothering with 1. > separate file descriptor representing the private MMIO which iommufd > and KVM uses but you can turn it into a DMA-buf whenever you need to > give it to a DMA-buf importer? Well, it would end up just being used everywhere. I think one person wanted to use this with DRM drivers for some reason, but RDMA would use the dmabuf2 directly because it will be much more efficient than using scatterlist. Honestly, I'd much rather extend dmabuf and see DRM institute some rule that DRM drivers may not use XYZ parts of the improvement. Like maybe we could use some symbol namespaces to really enforce it eg. 
MODULE_IMPORT_NS(DMABUF_NOT_FOR_DRM_USAGE) Some of the improvements we want like the revoke rules for lifetime seem to be agreeable. Block the API that gives you the non-scatterlist attachment. Only VFIO/RDMA/kvm/iommufd will get to implement it. > In this case Xu is exporting MMIO from VFIO and importing to KVM and > iommufd. > > So basically a portion of a PCIe BAR is imported into iommufd? And KVM. We need to get the CPU address into KVM and IOMMU page tables. It must go through a private FD path and not a VMA because of the CC rules about machine check I mentioned earlier. The private FD must have a lifetime model to ensure we don't UAF the PCIe BAR memory. Someone else had some use case where they wanted to put the VFIO MMIO PCIe BAR into a DMABUF and ship it into a GPU driver for somethingsomething virtualization but I didn't understand it. > Let's just say that both the ARM guys as well as the GPU people already > have some pretty "interesting" ways of doing digital rights management > and content protection. Well, that is TEE stuff, TEE and CC are not the same thing, though they have some high level conceptual overlap. In a certain sense CC is a TEE that is built using KVM instead of the TEE subsystem. Using KVM and integrating with the MM brings a whole set of unique challenges that TEE got to avoid.. Jason
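A minimal sketch of how that namespace-based fencing could look, assuming a hypothetical dma_buf_attach_phys() entry point for the non-scatterlist attachment path (EXPORT_SYMBOL_NS_GPL/MODULE_IMPORT_NS are the existing module-namespace machinery; newer kernels spell the namespace as a string literal):

    /* dma-buf core: export the pfn/phys based attach API only in a
     * restricted namespace; dma_buf_attach_phys() is a hypothetical name. */
    EXPORT_SYMBOL_NS_GPL(dma_buf_attach_phys, DMABUF_NOT_FOR_DRM_USAGE);

    /* An allowed importer (VFIO/RDMA/KVM/iommufd) opts in explicitly: */
    MODULE_IMPORT_NS(DMABUF_NOT_FOR_DRM_USAGE);

A DRM driver that never declares the import gets flagged by modpost for using a symbol from a namespace it did not import, which is the enforcement point alluded to here.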
On Wed, Jan 15, 2025 at 09:34:19AM -0400, Jason Gunthorpe wrote: > > Or do you mean some that don't have pages associated with them, and > > thus have pfn_valid fail on them? They still have a PFN, just not > > one that is valid to use in most of the Linux MM. > > He is talking about private interconnect hidden inside clusters of > devices. > > Ie the system may have many GPUs and those GPUs have their own private > interconnect between them. It is not PCI, and packets don't transit > through the CPU SOC at all, so the IOMMU is not involved. > > DMA can happen on that private interconnect, but from a Linux > perspective it is not DMA API DMA, and the addresses used to describe > it are not part of the CPU address space. The initiating device will > have a way to choose which path the DMA goes through when setting up > the DMA. So how is this in any way relevant to dma_buf which operates on a dma_addr_t right now and thus by definition can't be used for these?
On Thu, Jan 16, 2025 at 06:33:48AM +0100, Christoph Hellwig wrote: > On Wed, Jan 15, 2025 at 09:34:19AM -0400, Jason Gunthorpe wrote: > > > Or do you mean some that don't have pages associated with them, and > > > thus have pfn_valid fail on them? They still have a PFN, just not > > > one that is valid to use in most of the Linux MM. > > > > He is talking about private interconnect hidden inside clusters of > > devices. > > > > Ie the system may have many GPUs and those GPUs have their own private > > interconnect between them. It is not PCI, and packets don't transit > > through the CPU SOC at all, so the IOMMU is not involved. > > > > DMA can happen on that private interconnect, but from a Linux > > perspective it is not DMA API DMA, and the addresses used to describe > > it are not part of the CPU address space. The initiating device will > > have a way to choose which path the DMA goes through when setting up > > the DMA. > > So how is this in any way relevant to dma_buf which operates on > a dma_addr_t right now and thus by definition can't be used for > these? Oh, well since this private stuff exists the DRM folks implemented it and used dmabuf to hook it together through the uAPI. To make it work it abuses scatterlist and dma_addr_t to carry this other information. Thus the pushback in this thread that we can't naively fixup dmabuf because this non-dma_addr_t abuse exists and is uAPI. So it also needs some improved architecture to move forward :\ Basically, the scatterlist in the dmabuf API does not follow any of the normal rules scatterlist should follow. It is not using the semantics of dma_addr_t even though that is the type. It is really just an array of addr/len pairs - we can't reason about it in the normal way :( Jason
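For readers unfamiliar with the pattern being criticized, the "abuse" is roughly the following: the exporter never calls dma_map_sg() and instead stuffs a raw bus/MMIO address straight into the scatterlist. A simplified, hypothetical map callback (bar_addr/bar_len stand in for whatever the exporter is really sharing):

    static struct sg_table *p2p_mmio_map(struct dma_buf_attachment *attach,
                                         enum dma_data_direction dir)
    {
        struct sg_table *sgt;
        int ret;

        sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
        if (!sgt)
            return ERR_PTR(-ENOMEM);

        ret = sg_alloc_table(sgt, 1, GFP_KERNEL);
        if (ret) {
            kfree(sgt);
            return ERR_PTR(ret);
        }

        /* No struct page and no dma_map_sg(): the "dma_addr_t" below is
         * just an address/length pair the importer passes through to HW,
         * which is exactly the semantic mismatch described above. */
        sg_dma_address(sgt->sgl) = bar_addr;  /* hypothetical bus address */
        sg_dma_len(sgt->sgl) = bar_len;       /* hypothetical length */

        return sgt;
    }

Such a table cannot safely be handed to anything that expects real DMA-API addresses (IOMMU mappings, PCI offsets, bounce buffering), which is the interoperability problem with the ARM systems mentioned later in the thread.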
On Thu, Jan 16, 2025 at 04:13:13PM +0100, Christian König wrote: >> But this, fundamentally, is importers creating attachments and then >> *ignoring the lifetime rules of DMABUF*. If you created an attachment, >> got a move and *ignored the move* because you put the PFN in your own >> VMA, then you are not following the attachment lifetime rules! > > Move notify is solely for informing the importer that they need to > re-fresh their DMA mappings and eventually block for ongoing DMA to > end. I feel that it is a bit pedantic to say DMA and CPU are somehow different. The DMABUF API gives you a scatterlist, it is reasonable to say that move invalidates the entire scatterlist, CPU and DMA equally. > This semantics doesn't work well for CPU mappings because you need to > hold the reservation lock to make sure that the information stay valid > and you can't hold a lock while returning from a page fault. Sure, I imagine hooking up a VMA is hard - but that doesn't change my point. The semantics can be reasonable and well defined. > Yeah and exactly that is something we don't want to allow because it > means that every importer need to get things right to prevent exporters > from running into problems. You can make the same argument about the DMA address. We should just get rid of DMABUF entirely because people are going to mis-use it and wrongly implement the invalidation callback. I have no idea why GPU drivers want to implement mmap of dmabuf, that seems to be a uniquely GPU thing. We are not going to be doing stuff like that in KVM and other places. And we can implement the invalidation callback with correct locking. Why should we all be punished because DRM drivers seem to have this weird historical mmap problem? I don't think that is a reasonable way to approach building a general purpose linux kernel API. > Well it's not miss-used, it's just a very bad design decision to let > every importer implement functionality which actually belong into a > single point in the exporter. Well, this is the problem. Sure it may be that importers should not implement mmap - but using the PFN side address is needed for more than just mmap! DMA mapping belongs in the importer, and the new DMA API makes this even more explicit by allowing the importer a lot of options to optimize the process of building the HW datastructures. Scatterlist and the enforced representation of the DMA list are very inefficient and we are working to get rid of it. It isn't going to be replaced by any sort of list of DMA addresses though. If you really disagree you can try to convince the NVMe people to give up their optimizations the new DMA API allows so DRM can prevent this code-review problem. I also want the same optimizations in RDMA, and I am also not convinced giving them up is a worthwhile tradeoff. > Why would you want to do a dmabuf2 here? Because I need the same kind of common framework. I need to hook VFIO to RDMA as well. I need to fix RDMA to have working P2P in all cases. I need to hook KVM virtual device stuff to iommufd. Someone else needs VFIO to hook into DRM. How many different times do I need to implement a buffer sharing lifetime model? No, we should not make a VFIO specific thing, we need a general tool to do this properly and cover all the different use cases. That's "dmabuf2" or whatever you want to call it. There are more than enough use cases to justify doing this. That said, I think actually splitting off a separate dmabuf2 is a bad idea: we do not need two things, we should have dmabuf handle all the use cases people have, not just DRM's.
> I don't mind improving the scatterlist approach in any way possible. > I'm just rejecting things which we already tried and turned out to be a > bad idea. > If you make an interface which gives DMA addresses plus additional > information like address space, access hints etc.. to importers that > would be really welcomed. This is not welcomed, having lists of DMA addresses is inefficient and does not match the direction of the DMA API. We are trying very hard to completely remove the lists of DMA addresses in common fast paths. > But exposing PFNs and letting the importers created their DMA mappings > themselves and making CPU mappings themselves is an absolutely clear > no-go. Again, this is what we must have to support the new DMA API, the KVM and IOMMUFD use cases I mentioned. >> In this case Xu is exporting MMIO from VFIO and importing to KVM and >> iommufd. > > So basically a portion of a PCIe BAR is imported into iommufd? Yeah, and KVM. And RDMA. > Then create an interface between VFIO and KVM/iommufd which allows to > pass data between these two. > We already do this between DMA-buf exporters/importers all the time. > Just don't make it general DMA-buf API. I have no idea what this means. We'd need a new API linked to DMABUF that would be optional and used by this part of the world. As I said above we could protect it with some module namespace so you can keep it out of DRM. If you can agree to that then it seems fine.. > > Someone else had some use case where they wanted to put the VFIO MMIO > > PCIe BAR into a DMABUF and ship it into a GPU driver for > > somethingsomething virtualization but I didn't understand it. > > Yeah, that is already perfectly supported. No, it isn't. Christoph is blocking DMABUF in VFIO because he does not want the scatterlist abuses that dmabuf is doing to proliferate. We already have some ARM systems where the naive way typical DMABUF implementations are setting up P2P does not work. Those systems have PCI offset. Getting this to be "perfectly supported" is why we are working on all these aspects to improve the DMA API and remove the scatterlist abuses. >> In a certain sense CC is a TEE that is built using KVM instead of the >> TEE subsystem. Using KVM and integrating with the MM brings a whole >> set of unique challenges that TEE got to avoid.. > > Please go over those challenges in more detail. I need to get a better > understanding of what's going on here. > E.g. who manages encryption keys, who raises the machine check on > violations etc... TEE broadly has Linux launch a secure world that does some private work. The secure worlds tend to be very limited, they are not really VMs and they don't run full Linux inside. CC broadly has the secure world exist at boot and launch Linux and provide services to Linux. The secure world enforces memory isolation on Linux and generates faults on violations. KVM is the gateway to launch new secure worlds and the secure worlds are full VMs with all the device emulation and more. CC is much more like Xen with its hypervisor and DOM0 concepts. From this perspective, the only thing that matters is that CC secure memory is different and special - it is very much like your private memory concept. Only special places that understand it and have the right HW capability can use it. All the consumers need a CPU address to program their HW because of how the secure world security works. Jason
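As a concrete illustration of "implement the invalidation callback with correct locking", a rough sketch of a dynamic importer using today's attach ops (struct my_importer and its fields are hypothetical bookkeeping; error handling trimmed):

    struct my_importer {
        struct dma_buf_attachment *attach;
        struct sg_table *sgt;
    };

    static void my_move_notify(struct dma_buf_attachment *attach)
    {
        struct my_importer *imp = attach->importer_priv;

        /* The exporter calls this with attach->dmabuf->resv already held. */
        dma_resv_assert_held(attach->dmabuf->resv);

        /* Driver specific: quiesce the HW (fences etc.), then drop the
         * stale mapping.  Re-mapping is deferred to the next use. */
        if (imp->sgt) {
            dma_buf_unmap_attachment(attach, imp->sgt, DMA_BIDIRECTIONAL);
            imp->sgt = NULL;
        }
    }

    static const struct dma_buf_attach_ops my_attach_ops = {
        .allow_peer2peer = true,
        .move_notify = my_move_notify,
    };

    /* Called before the importer touches the buffer again. */
    static int my_remap(struct my_importer *imp)
    {
        struct dma_buf *dmabuf = imp->attach->dmabuf;
        struct sg_table *sgt;

        dma_resv_lock(dmabuf->resv, NULL);
        if (!imp->sgt) {
            sgt = dma_buf_map_attachment(imp->attach, DMA_BIDIRECTIONAL);
            if (!IS_ERR(sgt))
                imp->sgt = sgt;
        }
        dma_resv_unlock(dmabuf->resv);

        return imp->sgt ? 0 : -EIO;   /* simplified error handling */
    }

The attachment itself would come from dma_buf_dynamic_attach(dmabuf, dev, &my_attach_ops, imp); the point is that nothing here needs a VMA or an mmap, only the reservation lock and move_notify.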
On Thu, Jan 16, 2025 at 12:07:47PM -0400, Jason Gunthorpe wrote: > On Thu, Jan 16, 2025 at 04:13:13PM +0100, Christian König wrote: > >> But this, fundamentally, is importers creating attachments and then > >> *ignoring the lifetime rules of DMABUF*. If you created an attachment, > >> got a move and *ignored the move* because you put the PFN in your own > >> VMA, then you are not following the attachment lifetime rules! > > > > Move notify is solely for informing the importer that they need to > > re-fresh their DMA mappings and eventually block for ongoing DMA to > > end. > > I feel that it is a bit pedantic to say DMA and CPU are somehow > different. The DMABUF API gives you a scatterlist, it is reasonable to > say that move invalidates the entire scatterlist, CPU and DMA equally. dma-buf doesn't even give you a scatterlist, there are dma-bufs which truly are just handles that are 100% device specific. It really is the "everything is optional" interface. And yes we've had endless amounts of fun where importers tried to peek behind the curtain, largely because scatterlists didn't cleanly separate the dma_addr_t from the struct page side of things. So I understand Christian's concerns here. But that doesn't mean we cannot add another interface that does allow importers to peek behind the curtain. As long as we don't make a mess and confuse importers, and ideally have compat functions so that existing importers can deal with pfn exporters too it's all fine with me. And I also don't see an issue with reasonable pfn semantics that would prevent a generic mmap implementation on top of that, so that dma_buf_mmap() just works for those. The mmu notifier pain is when you consume mmaps/vma in a generic way, but implementing mmap with the current dma_resv locking is dead easy. Unless I've missed something nonobvious and just made a big fool of myself :-) What we cannot demand is that all existing dma-buf exporters switch over to the pfn interfaces. But hey it's the everything optional interface, the only things it guarantees are - fixed length - no copying - refcounting and yay it only took 10 years: consistent locking rules So I really don't understand Christian's fundamental opposition here. Cheers, Sima
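To make "implementing mmap with the current dma_resv locking is dead easy" a bit more concrete, a hedged sketch of an exporter-side fault handler; my_buf and my_buf_pfn() are hypothetical, and the mmap() callback is assumed to have set VM_PFNMAP and vm_private_data:

    static vm_fault_t my_buf_fault(struct vm_fault *vmf)
    {
        struct vm_area_struct *vma = vmf->vma;
        struct my_buf *buf = vma->vm_private_data;      /* hypothetical */
        vm_fault_t ret;

        /* Addresses are only stable while the reservation lock is held. */
        dma_resv_lock(buf->dmabuf->resv, NULL);
        ret = vmf_insert_pfn(vma, vmf->address,
                             my_buf_pfn(buf, vmf->pgoff)); /* hypothetical */
        dma_resv_unlock(buf->dmabuf->resv);

        return ret;
    }

When the exporter moves the buffer it shoots the PTEs down again with unmap_mapping_range() on the address_space it uses to track CPU mappings, so the next access re-faults and picks up the new location.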
On Wed, Jan 15, 2025 at 11:06:53AM +0100, Christian König wrote: > Am 15.01.25 um 09:55 schrieb Simona Vetter: > > > > If we add something > > > > new, we need clear rules and not just "here's the kvm code that uses it". > > > > That's how we've done dma-buf at first, and it was a terrible mess of > > > > mismatched expecations. > > > Yes, that would be wrong. It should be self defined within dmabuf and > > > kvm should adopt to it, move semantics and all. > > Ack. > > > > I feel like we have a plan here. > > I think I have to object a bit on that. > > > Summary from my side: > > > > - Sort out pin vs revocable vs dynamic/moveable semantics, make sure > > importers have no surprises. > > > > - Adopt whatever new dma-api datastructures pops out of the dma-api > > reworks. > > > > - Add pfn based memory access as yet another optional access method, with > > helpers so that exporters who support this get all the others for free. > > > > I don't see a strict ordering between these, imo should be driven by > > actual users of the dma-buf api. > > > > Already done: > > > > - dmem cgroup so that we can resource control device pinnings just landed > > in drm-next for next merge window. So that part is imo sorted and we can > > charge ahead with pinning into device memory without all the concerns > > we've had years ago when discussing that for p2p dma-buf support. > > > > But there might be some work so that we can p2p pin without requiring > > dynamic attachments only, I haven't checked whether that needs > > adjustment in dma-buf.c code or just in exporters. > > > > Anything missing? > > Well as far as I can see this use case is not a good fit for the DMA-buf > interfaces in the first place. DMA-buf deals with devices and buffer > exchange. > > What's necessary here instead is to give an importing VM full access on some > memory for their specific use case. > > That full access includes CPU and DMA mappings, modifying caching > attributes, potentially setting encryption keys for specific ranges etc.... > etc... > > In other words we have a lot of things the importer here should be able to > do which we don't want most of the DMA-buf importers to do. This proposal isn't about forcing existing exporters to allow importers to do new stuff. That stays as-is, because it would break things. It's about adding yet another interface to get at the underlying data, and we have tons of those already. The only difference is that if we don't butcher the design, we'll be able to implement all the existing dma-buf interfaces on top of this new pfn interface, for some neat maximal compatibility. But fundamentally there's never been an expectation that you can take any arbitrary dma-buf and pass it any arbitrary importer, and that is must work. The fundamental promise is that if it _does_ work, then - it's zero copy - and fast, or as fast as we can make it I don't see this any different than all the much more specific prposals and existing code, where a subset of importers/exporters have special rules so that e.g. gpu interconnect or vfio uuid based sharing works. pfn-based sharing is just yet another flavor that exists to get the max amount of speed out of interconnects. Cheers, Sima > > The semantics for things like pin vs revocable vs dynamic/moveable seems > similar, but that's basically it. > > As far as I know the TEE subsystem also represents their allocations as file > descriptors. If I'm not completely mistaken this use case most likely fit's > better there. 
> > > I feel like this is small enough that m-l archives is good enough. For > > some of the bigger projects we do in graphics we sometimes create entries > > in our kerneldoc with wip design consensus and things like that. But > > feels like overkill here. > > > > > My general desire is to move all of RDMA's MR process away from > > > scatterlist and work using only the new DMA API. This will save *huge* > > > amounts of memory in common workloads and be the basis for non-struct > > > page DMA support, including P2P. > > Yeah a more memory efficient structure than the scatterlist would be > > really nice. That would even benefit the very special dma-buf exporters > > where you cannot get a pfn and only the dma_addr_t, altough most of those > > (all maybe even?) have contig buffers, so your scatterlist has only one > > entry. But it would definitely be nice from a design pov. > > Completely agree on that part. > > Scatterlist have a some design flaws, especially mixing the input and out > parameters of the DMA API into the same structure. > > Additional to that DMA addresses are basically missing which bus they belong > to and details how the access should be made (e.g. snoop vs no-snoop > etc...). > > > Aside: A way to more efficiently create compressed scatterlists would be > > neat too, because a lot of drivers hand-roll that and it's a bit brittle > > and kinda silly to duplicate. With compressed I mean just a single entry > > for a contig range, in practice thanks to huge pages/folios and allocators > > trying to hand out contig ranges if there's plenty of memory that saves a > > lot of memory too. But currently it's a bit a pain to construct these > > efficiently, mostly it's just a two-pass approach and then trying to free > > surplus memory or krealloc to fit. Also I don't have good ideas here, but > > dma-api folks might have some from looking at too many things that create > > scatterlists. > > I mailed with Christoph about that a while back as well and we both agreed > that it would probably be a good idea to start defining a data structure to > better encapsulate DMA addresses. > > It's just that nobody had time for that yet and/or I wasn't looped in in the > final discussion about it. > > Regards, > Christian. > > > -Sima
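A small sketch of the "compressed scatterlist" idea mentioned in the quote above, i.e. one entry per physically contiguous run instead of one entry per page; this is the two-pass pattern drivers hand-roll today, and sg_alloc_table_from_pages() already implements much the same thing for page-backed buffers (helper below is illustrative, assumes n_pages >= 1):

    static struct sg_table *build_compressed_sgt(struct page **pages,
                                                 unsigned int n_pages)
    {
        struct sg_table *sgt;
        struct scatterlist *sg;
        struct page *run_start = pages[0];
        unsigned int i, runs = 1, run_len = 1;

        /* Pass 1: count contiguous runs so we can size the table. */
        for (i = 1; i < n_pages; i++)
            if (page_to_pfn(pages[i]) != page_to_pfn(pages[i - 1]) + 1)
                runs++;

        sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
        if (!sgt || sg_alloc_table(sgt, runs, GFP_KERNEL)) {
            kfree(sgt);
            return NULL;
        }

        /* Pass 2: emit one entry per contiguous run. */
        sg = sgt->sgl;
        for (i = 1; i < n_pages; i++) {
            if (page_to_pfn(pages[i]) == page_to_pfn(pages[i - 1]) + 1) {
                run_len++;
                continue;
            }
            sg_set_page(sg, run_start, run_len * PAGE_SIZE, 0);
            sg = sg_next(sg);
            run_start = pages[i];
            run_len = 1;
        }
        sg_set_page(sg, run_start, run_len * PAGE_SIZE, 0);

        return sgt;
    }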
Am 17.01.25 um 15:42 schrieb Simona Vetter: > On Wed, Jan 15, 2025 at 11:06:53AM +0100, Christian König wrote: >> [SNIP] >>> Anything missing? >> Well as far as I can see this use case is not a good fit for the DMA-buf >> interfaces in the first place. DMA-buf deals with devices and buffer >> exchange. >> >> What's necessary here instead is to give an importing VM full access on some >> memory for their specific use case. >> >> That full access includes CPU and DMA mappings, modifying caching >> attributes, potentially setting encryption keys for specific ranges etc.... >> etc... >> >> In other words we have a lot of things the importer here should be able to >> do which we don't want most of the DMA-buf importers to do. > This proposal isn't about forcing existing exporters to allow importers to > do new stuff. That stays as-is, because it would break things. > > It's about adding yet another interface to get at the underlying data, and > we have tons of those already. The only difference is that if we don't > butcher the design, we'll be able to implement all the existing dma-buf > interfaces on top of this new pfn interface, for some neat maximal > compatibility. That sounds like you missed my concern. When an exporter and an importer agree that they want to exchange PFNs instead of DMA addresses then that is perfectly fine. The problems start when you define the semantics of how those PFNs, DMA addresses, private bus addresses, whatever, are exchanged differently from what we have documented for DMA-buf. These semantics are very well defined for DMA-buf now, because that is really important or otherwise things usually seem to work under testing (e.g. without memory pressure) and then totally fall apart in production environments. In other words we have defined what lock you need to hold when calling functions, what a DMA fence is, when exchanged addresses are valid etc... > But fundamentally there's never been an expectation that you can take any > arbitrary dma-buf and pass it any arbitrary importer, and that is must > work. The fundamental promise is that if it _does_ work, then > - it's zero copy > - and fast, or as fast as we can make it > > I don't see this any different than all the much more specific prposals > and existing code, where a subset of importers/exporters have special > rules so that e.g. gpu interconnect or vfio uuid based sharing works. > pfn-based sharing is just yet another flavor that exists to get the max > amount of speed out of interconnects. Please take another look at what is proposed here. The function is called dma_buf_get_pfn_*unlocked* ! This is not following DMA-buf semantics for exchanging addresses and keeping them valid, but rather something more like userptrs. Inserting PFNs into CPU (or probably also IOMMU) page tables has different semantics than what DMA-buf usually does, because as soon as the address is written into the page table it is made public. So you need some kind of mechanism to make sure that this addr you made public stays valid as long as it is public. The usual I/O operations we encapsulate with DMA-fences have fundamentally different semantics because we have the lock which enforces that stuff stays valid and then have a DMA-fence which notes how long the stuff needs to stay valid for an operation to complete. Regards, Christian. > > Cheers, Sima > >> The semantics for things like pin vs revocable vs dynamic/moveable seems >> similar, but that's basically it.
Am 21.06.24 um 00:02 schrieb Xu Yilun: > On Thu, Jan 16, 2025 at 04:13:13PM +0100, Christian König wrote: >> Am 15.01.25 um 18:09 schrieb Jason Gunthorpe: >> >> On Wed, Jan 15, 2025 at 05:34:23PM +0100, Christian König wrote: >> >> Granted, let me try to improve this. >> Here is a real world example of one of the issues we ran into and why >> CPU mappings of importers are redirected to the exporter. >> We have a good bunch of different exporters who track the CPU mappings >> of their backing store using address_space objects in one way or >> another and then uses unmap_mapping_range() to invalidate those CPU >> mappings. >> But when importers get the PFNs of the backing store they can look >> behind the curtain and directly insert this PFN into the CPU page >> tables. >> We had literally tons of cases like this where drivers developers cause >> access after free issues because the importer created a CPU mappings on >> their own without the exporter knowing about it. >> This is just one example of what we ran into. Additional to that >> basically the whole synchronization between drivers was overhauled as >> well because we found that we can't trust importers to always do the >> right thing. >> >> But this, fundamentally, is importers creating attachments and then >> *ignoring the lifetime rules of DMABUF*. If you created an attachment, >> got a move and *ignored the move* because you put the PFN in your own >> VMA, then you are not following the attachment lifetime rules! >> >> Move notify is solely for informing the importer that they need to >> re-fresh their DMA mappings and eventually block for ongoing DMA to end. >> >> This semantics doesn't work well for CPU mappings because you need to hold >> the reservation lock to make sure that the information stay valid and you >> can't hold a lock while returning from a page fault. > Dealing with CPU mapping and resource invalidation is a little hard, but is > resolvable, by using other types of locks. And I guess for now dma-buf > exporters should always handle this CPU mapping VS. invalidation contention if > they support mmap(). > > It is resolvable so with some invalidation notify, a decent importers could > also handle the contention well. That doesn't work like this. See page tables updates under DMA-buf works by using the same locking approach for both the validation and invalidation side. In other words we hold the same lock while inserting and removing entries into/from the page tables. That this here should be an unlocked API means that can only use it with pre-allocated and hard pinned memory without any chance to invalidate it while running. Otherwise you can never be sure of the validity of the address information you got from the exporter. > IIUC now the only concern is importer device drivers are easier to do > something wrong, so move CPU mapping things to exporter. But most of the > exporters are also device drivers, why they are smarter? Exporters always use their invalidation code path no matter if they are exporting their buffers for other to use or if they are stand alone. If you do the invalidation on the importer side you always need both exporter and importer around to test it. Additional to that we have much more importers than exporters. E.g. a lot of simple drivers only import DMA-heap buffers and never exports anything. > And there are increasing mapping needs, today exporters help handle CPU primary > mapping, tomorrow should they also help on all other mappings? Clearly it is > not feasible. 
> So maybe conditionally give trust to some importers. Why should that be necessary? Exporters *must* know what somebody does with their buffers. If you have a use case the exporter doesn't support in their mapping operation then that use case most likely doesn't work in the first place. For example direct I/O is enabled/disabled by exporters on their CPU mappings based on if that works correctly for them. And an importer simply doesn't know if they should use vm_insert_pfn() or vm_insert_page(). We could of course implement that logic into each importer to choose between the different approaches, but then each importer gains logic it only exercises with a specific exporter. And that doesn't seem to be a good idea at all. Regards, Christian. > > Thanks, > Yilun
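A small sketch of the exporter-side decision being described, with hypothetical my_buffer bookkeeping: only the exporter knows whether the backing store is page backed (so get_user_pages() and therefore direct I/O can work) or raw PFNs such as VRAM/MMIO:

    static vm_fault_t my_buffer_fault(struct vm_fault *vmf)
    {
        struct my_buffer *buf = vmf->vma->vm_private_data;  /* hypothetical */

        if (buf->backed_by_pages) {
            /* Page backed: insert the struct page so GUP and hence
             * direct I/O keep working on this mapping. */
            return vm_insert_page(vmf->vma, vmf->address,
                                  buf->pages[vmf->pgoff]) ?
                   VM_FAULT_SIGBUS : VM_FAULT_NOPAGE;
        }

        /* VRAM/MMIO: no struct page, GUP/direct I/O cannot work here. */
        return vmf_insert_pfn(vmf->vma, vmf->address,
                              buf->base_pfn + vmf->pgoff);
    }

The corresponding mmap() callback also has to set matching vma flags (VM_PFNMAP vs. VM_MIXEDMAP and so on), which is more per-exporter knowledge an importer would otherwise have to duplicate.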
On Mon, Jan 20, 2025 at 01:14:12PM +0100, Christian König wrote: What is going wrong with your email? You replied to Simona, but Simona Vetter <simona.vetter@ffwll.ch> is dropped from the To/CC list??? I added the address back, but seems like a weird thing to happen. > Please take another look at what is proposed here. The function is called > dma_buf_get_pfn_*unlocked* ! I don't think Simona and I are defending the implementation in this series. This series needs work. We have been talking about what the implementation should be. I think we've all been clear on the idea that the DMA buf locking rules should apply to any description of the memory, regardless of if the address are CPU, DMA, or private. I agree that the idea of any "get unlocked" concept seems nonsensical and wrong within dmabuf. > Inserting PFNs into CPU (or probably also IOMMU) page tables have a > different semantics than what DMA-buf usually does, because as soon as the > address is written into the page table it is made public. Not really. The KVM/CPU is fully compatible with move semantics, it has restartable page faults and can implement dmabuf's move locking scheme. It can use the resv lock, the fences, move_notify and so on to implement it. It is a bug if this series isn't doing that. The iommu cannot support move semantics. It would need the existing pin semantics (ie we call dma_buf_pin() and don't support move_notify). To work with VFIO we would need to formalize the revoke semantics that Simona was discussing. We already implement both of these modalities in rdma, the locking API is fine and workable with CPU pfns just as well. I've imagined a staged flow here: 1) The new DMA API lands 2) We improve the new DMA API to be fully struct page free, including setting up P2P 3) VFIO provides a dmbuf exporter using the new DMA API's P2P support. We'd have to continue with the scatterlist hacks for now. VFIO would be a move_notify exporter. This should work with RDMA 4) RDMA works to drop scatterlist from its internal flows and use the new DMA API instead. 5) VFIO/RDMA implement a new non-scatterlist DMABUF op to demonstrate the post-scatterlist world and deprecate the scatterlist hacks. 6) We add revoke semantics to dmabuf, and VFIO/RDMA implements them 7) iommufd can import a pinnable revokable dmabuf using CPU pfns through the non-scatterlist op. 8) Relevant GPU drivers implement the non-scatterlist op and RDMA removes support for the deprecated scatterlist hacks. Xu's series has jumped ahead a bit and is missing infrastructure to build it properly. Jason
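For the iommufd/VFIO-style steps in that list, the "pin semantics" half maps onto today's API roughly as below; the revoke half (step 6) is the part that does not exist yet:

    /* Hedged sketch: an importer that cannot tolerate moves uses a
     * static attachment, which makes the exporter pin the backing store
     * for the lifetime of the mapping. */
    static int pin_style_import(struct dma_buf *dmabuf, struct device *dev)
    {
        struct dma_buf_attachment *attach;
        struct sg_table *sgt;

        attach = dma_buf_attach(dmabuf, dev);
        if (IS_ERR(attach))
            return PTR_ERR(attach);

        sgt = dma_buf_map_attachment_unlocked(attach, DMA_BIDIRECTIONAL);
        if (IS_ERR(sgt)) {
            dma_buf_detach(dmabuf, attach);
            return PTR_ERR(sgt);
        }

        /* ... program the IOMMU / device from sgt ... */

        /* Step 6 above would add a formal revoke callback so an exporter
         * like VFIO could still yank access (FLR, hot-unplug) from a
         * pinned importer; no such op exists in dma-buf today. */
        return 0;
    }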
On Mon, Jan 20, 2025 at 01:59:01PM -0400, Jason Gunthorpe wrote: > On Mon, Jan 20, 2025 at 01:14:12PM +0100, Christian König wrote: > What is going wrong with your email? You replied to Simona, but > Simona Vetter <simona.vetter@ffwll.ch> is dropped from the To/CC > list??? I added the address back, but seems like a weird thing to > happen. Might also be funny mailing list stuff, depending how you get these. I read mails over lore and pretty much ignore cc (unless it's not also on any list, since those tend to be security issues) because I get cc'ed on way too much stuff for that to be a useful signal. > > Please take another look at what is proposed here. The function is called > > dma_buf_get_pfn_*unlocked* ! > > I don't think Simona and I are defending the implementation in this > series. This series needs work. Yeah this current series is good for kicking off the discussions, it's defo not close to anything we can merge. > We have been talking about what the implementation should be. I think > we've all been clear on the idea that the DMA buf locking rules should > apply to any description of the memory, regardless of if the address > are CPU, DMA, or private. > > I agree that the idea of any "get unlocked" concept seems nonsensical > and wrong within dmabuf. > > > Inserting PFNs into CPU (or probably also IOMMU) page tables have a > > different semantics than what DMA-buf usually does, because as soon as the > > address is written into the page table it is made public. > > Not really. > > The KVM/CPU is fully compatible with move semantics, it has > restartable page faults and can implement dmabuf's move locking > scheme. It can use the resv lock, the fences, move_notify and so on to > implement it. It is a bug if this series isn't doing that. Yeah I'm not worried about cpu mmap locking semantics. drm/ttm is a pretty clear example that you can implement dma-buf mmap with the rules we have, except the unmap_mapping_range might need a bit fudging with a separate address_space. For cpu mmaps I'm more worried about the arch bits in the pte, stuff like caching mode or encrypted memory bits and things like that. There's vma->vm_pgprot, but it's a mess. But maybe this all is an incentive to clean up that mess a bit. > The iommu cannot support move semantics. It would need the existing > pin semantics (ie we call dma_buf_pin() and don't support > move_notify). To work with VFIO we would need to formalize the revoke > semantics that Simona was discussing. I thought iommuv2 (or whatever linux calls these) has full fault support and could support current move semantics. But yeah for iommu without fault support we need some kind of pin or a newly formalized revoke model. > We already implement both of these modalities in rdma, the locking API > is fine and workable with CPU pfns just as well. > > I've imagined a staged flow here: > > 1) The new DMA API lands > 2) We improve the new DMA API to be fully struct page free, including > setting up P2P > 3) VFIO provides a dmbuf exporter using the new DMA API's P2P > support. We'd have to continue with the scatterlist hacks for now. > VFIO would be a move_notify exporter. This should work with RDMA > 4) RDMA works to drop scatterlist from its internal flows and use the > new DMA API instead. > 5) VFIO/RDMA implement a new non-scatterlist DMABUF op to > demonstrate the post-scatterlist world and deprecate the scatterlist > hacks. 
> 6) We add revoke semantics to dmabuf, and VFIO/RDMA implements them > 7) iommufd can import a pinnable revokable dmabuf using CPU pfns > through the non-scatterlist op. > 8) Relevant GPU drivers implement the non-scatterlist op and RDMA > removes support for the deprecated scatterlist hacks. Sounds pretty reasonable as a first sketch of a proper plan. Of course fully expecting that no plan ever survives implementation intact :-) Cheers, Sima > > Xu's series has jumped ahead a bit and is missing infrastructure to > build it properly. > > Jason
On Mon, Jan 20, 2025 at 07:50:23PM +0100, Simona Vetter wrote: > On Mon, Jan 20, 2025 at 01:59:01PM -0400, Jason Gunthorpe wrote: > > On Mon, Jan 20, 2025 at 01:14:12PM +0100, Christian König wrote: > > What is going wrong with your email? You replied to Simona, but > > Simona Vetter <simona.vetter@ffwll.ch> is dropped from the To/CC > > list??? I added the address back, but seems like a weird thing to > > happen. > > Might also be funny mailing list stuff, depending how you get these. I > read mails over lore and pretty much ignore cc (unless it's not also on > any list, since those tend to be security issues) because I get cc'ed on > way too much stuff for that to be a useful signal. Oh I see, you are sending a Mail-followup-to header that excludes your address, so you don't get any emails at all.. My mutt is dropping you as well. > Yeah I'm not worried about cpu mmap locking semantics. drm/ttm is a pretty > clear example that you can implement dma-buf mmap with the rules we have, > except the unmap_mapping_range might need a bit fudging with a separate > address_space. From my perspective the mmap thing is a bit of a side/DRM-only thing as nothing I'm interested in wants to mmap dmabuf into a VMA. However, I think if you have locking rules that can fit into a VMA fault path and link move_notify to unmap_mapping_range() then you've got a pretty usuable API. > For cpu mmaps I'm more worried about the arch bits in the pte, stuff like > caching mode or encrypted memory bits and things like that. There's > vma->vm_pgprot, but it's a mess. But maybe this all is an incentive to > clean up that mess a bit. I'm convinced we need meta-data along with pfns, there is too much stuff that needs more information than just the address. Cachability, CC encryption, exporting device, etc. This is a topic to partially cross when we talk about how to fully remove struct page requirements from the new DMA API. I'm hoping we can get to something where we describe not just how the pfns should be DMA mapped, but also can describe how they should be CPU mapped. For instance that this PFN space is always mapped uncachable, in CPU and in IOMMU. We also have current bugs in the iommu/vfio side where we are fudging CC stuff, like assuming CPU memory is encrypted (not always true) and that MMIO is non-encrypted (not always true) > I thought iommuv2 (or whatever linux calls these) has full fault support > and could support current move semantics. But yeah for iommu without > fault support we need some kind of pin or a newly formalized revoke model. No, this is HW dependent, including PCI device, and I'm aware of no HW that fully implements this in a way that could be useful to implement arbitary move semantics for VFIO.. Jason
On Mon, Jan 20, 2025 at 03:48:04PM -0400, Jason Gunthorpe wrote: > On Mon, Jan 20, 2025 at 07:50:23PM +0100, Simona Vetter wrote: > > On Mon, Jan 20, 2025 at 01:59:01PM -0400, Jason Gunthorpe wrote: > > > On Mon, Jan 20, 2025 at 01:14:12PM +0100, Christian König wrote: > > > What is going wrong with your email? You replied to Simona, but > > > Simona Vetter <simona.vetter@ffwll.ch> is dropped from the To/CC > > > list??? I added the address back, but seems like a weird thing to > > > happen. > > > > Might also be funny mailing list stuff, depending how you get these. I > > read mails over lore and pretty much ignore cc (unless it's not also on > > any list, since those tend to be security issues) because I get cc'ed on > > way too much stuff for that to be a useful signal. > > Oh I see, you are sending a Mail-followup-to header that excludes your > address, so you don't get any emails at all.. My mutt is dropping you > as well. > > > Yeah I'm not worried about cpu mmap locking semantics. drm/ttm is a pretty > > clear example that you can implement dma-buf mmap with the rules we have, > > except the unmap_mapping_range might need a bit fudging with a separate > > address_space. > > From my perspective the mmap thing is a bit of a side/DRM-only thing > as nothing I'm interested in wants to mmap dmabuf into a VMA. I guess we could just skip mmap on these pfn exporters, but it also means a bit more boilerplate. At least the device mapping / dma_buf_attachment side should be doable with just the pfn and the new dma-api? > However, I think if you have locking rules that can fit into a VMA > fault path and link move_notify to unmap_mapping_range() then you've > got a pretty usuable API. > > > For cpu mmaps I'm more worried about the arch bits in the pte, stuff like > > caching mode or encrypted memory bits and things like that. There's > > vma->vm_pgprot, but it's a mess. But maybe this all is an incentive to > > clean up that mess a bit. > > I'm convinced we need meta-data along with pfns, there is too much > stuff that needs more information than just the address. Cachability, > CC encryption, exporting device, etc. This is a topic to partially > cross when we talk about how to fully remove struct page requirements > from the new DMA API. > > I'm hoping we can get to something where we describe not just how the > pfns should be DMA mapped, but also can describe how they should be > CPU mapped. For instance that this PFN space is always mapped > uncachable, in CPU and in IOMMU. I was pondering whether dma_mmap and friends would be a good place to prototype this and go for a fully generic implementation. But then even those have _wc/_uncached variants. If you go into arch specific stuff, then x86 does have wc/uc/... tracking, but only for memory (set_page_wc and friends iirc). And you can bypass it if you know what you're doing. > We also have current bugs in the iommu/vfio side where we are fudging > CC stuff, like assuming CPU memory is encrypted (not always true) and > that MMIO is non-encrypted (not always true) tbf CC pte flags I just don't grok at all. I've once tried to understand what current exporters and gpu drivers do and just gave up. But that's also a bit why I'm worried here because it's an enigma to me. > > I thought iommuv2 (or whatever linux calls these) has full fault support > > and could support current move semantics. But yeah for iommu without > > fault support we need some kind of pin or a newly formalized revoke model. 
> > No, this is HW dependent, including PCI device, and I'm aware of no HW > that fully implements this in a way that could be useful to implement > arbitary move semantics for VFIO.. Hm I thought we've had at least prototypes floating around of device fault repair, but I guess that only works with ATS/pasid stuff and not general iommu traffic from devices. Definitely needs some device cooperation since the timeouts of a full fault are almost endless. -Sima
On Tue, Jan 21, 2025 at 05:11:32PM +0100, Simona Vetter wrote: > On Mon, Jan 20, 2025 at 03:48:04PM -0400, Jason Gunthorpe wrote: > > On Mon, Jan 20, 2025 at 07:50:23PM +0100, Simona Vetter wrote: > > > On Mon, Jan 20, 2025 at 01:59:01PM -0400, Jason Gunthorpe wrote: > > > > On Mon, Jan 20, 2025 at 01:14:12PM +0100, Christian König wrote: > > > > What is going wrong with your email? You replied to Simona, but > > > > Simona Vetter <simona.vetter@ffwll.ch> is dropped from the To/CC > > > > list??? I added the address back, but seems like a weird thing to > > > > happen. > > > > > > Might also be funny mailing list stuff, depending how you get these. I > > > read mails over lore and pretty much ignore cc (unless it's not also on > > > any list, since those tend to be security issues) because I get cc'ed on > > > way too much stuff for that to be a useful signal. > > > > Oh I see, you are sending a Mail-followup-to header that excludes your > > address, so you don't get any emails at all.. My mutt is dropping you > > as well. > > > > > Yeah I'm not worried about cpu mmap locking semantics. drm/ttm is a pretty > > > clear example that you can implement dma-buf mmap with the rules we have, > > > except the unmap_mapping_range might need a bit fudging with a separate > > > address_space. > > > > From my perspective the mmap thing is a bit of a side/DRM-only thing > > as nothing I'm interested in wants to mmap dmabuf into a VMA. > > I guess we could just skip mmap on these pfn exporters, but it also means > a bit more boilerplate. I have been assuming that dmabuf mmap remains unchanged, that exporters will continue to implement that mmap() callback as today. My main interest has been what data structure is produced in the attach APIs. Eg today we have a struct dma_buf_attachment that returns a sg_table. I'm expecting some kind of new data structure, lets call it "physical list" that is some efficient coding of meta/addr/len tuples that works well with the new DMA API. Matthew has been calling this thing phyr.. So, I imagine, struct dma_buf_attachment gaining an optional feature negotiation and then we have in dma_buf_attachment: union { struct sg_table *sgt; struct physical_list *phyr; }; That's basicaly it, an alternative to scatterlist that has a clean architecture. Now, if you are asking if the current dmabuf mmap callback can be improved with the above? Maybe? phyr should have the neccessary information inside it to populate a VMA - eventually even fully correctly with all the right cachable/encrypted/forbidden/etc flags. So, you could imagine that exporters could just have one routine to generate the phyr list and that goes into the attachment, goes into some common code to fill VMA PTEs, and some other common code that will convert it into the DMABUF scatterlist. If performance is not a concern with these data structure conversions it could be an appealing simplification. And yes, I could imagine the meta information being descriptive enough to support the private interconnect cases, the common code could detect private meta information and just cleanly fail. > At least the device mapping / dma_buf_attachment > side should be doable with just the pfn and the new dma-api? Yes, that would be my first goal post. Figure out some meta information and a container data structure that allows struct page-less P2P mapping through the new DMA API. 
> > I'm hoping we can get to something where we describe not just how the > > pfns should be DMA mapped, but also can describe how they should be > > CPU mapped. For instance that this PFN space is always mapped > > uncachable, in CPU and in IOMMU. > > I was pondering whether dma_mmap and friends would be a good place to > prototype this and go for a fully generic implementation. But then even > those have _wc/_uncached variants. Given that the inability to correctly DMA map P2P MMIO without struct page is a current pain point and current source of hacks in dmabuf exporters, I wanted to make resolving that a priority. However, if you mean what I described above for "fully generic [dmabuf mmap] implementation", then we'd have the phyr datastructure as a dependency to attempt that work. phyr, and particularly the meta information, has a number of stakeholders. I was thinking of going first with rdma's memory registration flow because we are now pretty close to being able to do such a big change, and it can demonstrate most of the requirements. But that doesn't mean mmap couldn't go concurrently on the same agreed datastructure if people are interested. > > We also have current bugs in the iommu/vfio side where we are fudging > > CC stuff, like assuming CPU memory is encrypted (not always true) and > > that MMIO is non-encrypted (not always true) > > tbf CC pte flags I just don't grok at all. I've once tried to understand > what current exporters and gpu drivers do and just gave up. But that's > also a bit why I'm worried here because it's an enigma to me. For CC, inside the secure world, is some information if each PFN inside the VM is 'encrypted' or not. Any VM PTE (including the IOPTEs) pointing at the PFN must match the secure world's view of 'encrypted'. The VM can ask the secure world to change its view at runtime. The way CC has been bolted on to the kernel so far laregly hides this from drivers, so it is troubled to tell in driver code if the PFN you have is 'encrypted' or not. Right now the general rule (that is not always true) is that struct page CPU memory is encrypted and everything else is decrypted. So right now, you can mostly ignore it and the above assumption largely happens for you transparently. However, soon we will have encrypted P2P MMIO which will stress this hiding strategy. > > > I thought iommuv2 (or whatever linux calls these) has full fault support > > > and could support current move semantics. But yeah for iommu without > > > fault support we need some kind of pin or a newly formalized revoke model. > > > > No, this is HW dependent, including PCI device, and I'm aware of no HW > > that fully implements this in a way that could be useful to implement > > arbitary move semantics for VFIO.. > > Hm I thought we've had at least prototypes floating around of device fault > repair, but I guess that only works with ATS/pasid stuff and not general > iommu traffic from devices. Definitely needs some device cooperation since > the timeouts of a full fault are almost endless. Yes, exactly. What all real devices I'm aware have done is make a subset of their traffic work with ATS and PRI, but not all their traffic. Without *all* traffic you can't make any generic assumption in the iommu that a transient non-present won't be fatal to the device. Stuff like dmabuf move semantics rely on transient non-present being non-disruptive... Jason
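To make the shape of that concrete, a purely hypothetical sketch of such a "physical list" entry and the union described above; none of these types exist in the kernel today, and the flag set is exactly the open metadata question (cachability, CC encryption, exporting device, private interconnect):

    /* Hypothetical sketch only. */
    struct phyr {
        phys_addr_t addr;   /* CPU physical / bus address of the range */
        size_t len;
        u32 flags;          /* e.g. cacheable, CC-encrypted, MMIO,
                             * private-interconnect marker, ... */
    };

    struct physical_list {
        struct device *provider;    /* for P2P distance / PCI offset */
        unsigned int nr;
        struct phyr entries[];
    };

    /* In struct dma_buf_attachment, after a feature negotiation: */
    union {
        struct sg_table *sgt;
        struct physical_list *phyr;
    };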
On Mon, Jan 20, 2025 at 02:44:13PM +0100, Christian König wrote: > Am 21.06.24 um 00:02 schrieb Xu Yilun: > > On Thu, Jan 16, 2025 at 04:13:13PM +0100, Christian König wrote: > > > Am 15.01.25 um 18:09 schrieb Jason Gunthorpe: > > > > > > On Wed, Jan 15, 2025 at 05:34:23PM +0100, Christian König wrote: > > > > > > Granted, let me try to improve this. > > > Here is a real world example of one of the issues we ran into and why > > > CPU mappings of importers are redirected to the exporter. > > > We have a good bunch of different exporters who track the CPU mappings > > > of their backing store using address_space objects in one way or > > > another and then uses unmap_mapping_range() to invalidate those CPU > > > mappings. > > > But when importers get the PFNs of the backing store they can look > > > behind the curtain and directly insert this PFN into the CPU page > > > tables. > > > We had literally tons of cases like this where drivers developers cause > > > access after free issues because the importer created a CPU mappings on > > > their own without the exporter knowing about it. > > > This is just one example of what we ran into. Additional to that > > > basically the whole synchronization between drivers was overhauled as > > > well because we found that we can't trust importers to always do the > > > right thing. > > > > > > But this, fundamentally, is importers creating attachments and then > > > *ignoring the lifetime rules of DMABUF*. If you created an attachment, > > > got a move and *ignored the move* because you put the PFN in your own > > > VMA, then you are not following the attachment lifetime rules! > > > > > > Move notify is solely for informing the importer that they need to > > > re-fresh their DMA mappings and eventually block for ongoing DMA to end. > > > > > > This semantics doesn't work well for CPU mappings because you need to hold > > > the reservation lock to make sure that the information stay valid and you > > > can't hold a lock while returning from a page fault. > > Dealing with CPU mapping and resource invalidation is a little hard, but is > > resolvable, by using other types of locks. And I guess for now dma-buf > > exporters should always handle this CPU mapping VS. invalidation contention if > > they support mmap(). > > > > It is resolvable so with some invalidation notify, a decent importers could > > also handle the contention well. > > That doesn't work like this. > > See page tables updates under DMA-buf works by using the same locking > approach for both the validation and invalidation side. In other words we > hold the same lock while inserting and removing entries into/from the page > tables. Not sure what's the issue it causes, maybe I don't get why "you can't hold a lock while returning from a page fault". > > That this here should be an unlocked API means that can only use it with > pre-allocated and hard pinned memory without any chance to invalidate it > while running. Otherwise you can never be sure of the validity of the Then importers can use a locked version to get pfn, and manually use dma_resv lock only to ensure the PFN validity during page table setup. Importers could detect the PFN will be invalid via move notify and remove page table entries. Then find the new PFN next time page fault happens. IIUC, Simona mentions drm/ttm is already doing it this way. I'm not trying to change the CPU mmap things for existing drivers, just to ensure importer mapping is possible with faultable MMU. 
I want the KVM MMU (which is also faultable) to work in this importer-mapping way. Thanks, Yilun
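To make the fault-driven importer flow described above a little more concrete, here is a minimal sketch. Everything named importer_* and struct importer is hypothetical, and the locked dma_buf_get_pfn() variant is only assumed (this series implements just the unlocked one); the point is the locking pattern: resolve the PFN and install the page-table entry under the dma_resv lock, and zap the entry from move_notify so the next access faults and re-fetches the PFN.

/* Hypothetical importer; only the locking pattern is the point. */
static int importer_fault(struct importer *imp, pgoff_t pgoff)
{
	u64 pfn;
	int max_order, ret;

	dma_resv_lock(imp->attach->dmabuf->resv, NULL);

	/* assumed locked variant of the get_pfn interface */
	ret = dma_buf_get_pfn(imp->attach, pgoff, &pfn, &max_order);
	if (!ret)
		ret = importer_install_pte(imp, pgoff, pfn); /* e.g. KVM EPT */

	dma_resv_unlock(imp->attach->dmabuf->resv);
	return ret;
}

static void importer_move_notify(struct dma_buf_attachment *attach)
{
	struct importer *imp = attach->importer_priv;

	/* the exporter already holds the dma_resv lock here */
	importer_zap_ptes(imp);	/* next access faults and looks up the new PFN */
}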
On Tue, Jan 21, 2025 at 01:36:33PM -0400, Jason Gunthorpe wrote: > On Tue, Jan 21, 2025 at 05:11:32PM +0100, Simona Vetter wrote: > > On Mon, Jan 20, 2025 at 03:48:04PM -0400, Jason Gunthorpe wrote: > > > On Mon, Jan 20, 2025 at 07:50:23PM +0100, Simona Vetter wrote: > > > > On Mon, Jan 20, 2025 at 01:59:01PM -0400, Jason Gunthorpe wrote: > > > > > On Mon, Jan 20, 2025 at 01:14:12PM +0100, Christian König wrote: > > > > > What is going wrong with your email? You replied to Simona, but > > > > > Simona Vetter <simona.vetter@ffwll.ch> is dropped from the To/CC > > > > > list??? I added the address back, but seems like a weird thing to > > > > > happen. > > > > > > > > Might also be funny mailing list stuff, depending how you get these. I > > > > read mails over lore and pretty much ignore cc (unless it's not also on > > > > any list, since those tend to be security issues) because I get cc'ed on > > > > way too much stuff for that to be a useful signal. > > > > > > Oh I see, you are sending a Mail-followup-to header that excludes your > > > address, so you don't get any emails at all.. My mutt is dropping you > > > as well. > > > > > > > Yeah I'm not worried about cpu mmap locking semantics. drm/ttm is a pretty > > > > clear example that you can implement dma-buf mmap with the rules we have, > > > > except the unmap_mapping_range might need a bit fudging with a separate > > > > address_space. > > > > > > From my perspective the mmap thing is a bit of a side/DRM-only thing > > > as nothing I'm interested in wants to mmap dmabuf into a VMA. > > > > I guess we could just skip mmap on these pfn exporters, but it also means > > a bit more boilerplate. > > I have been assuming that dmabuf mmap remains unchanged, that > exporters will continue to implement that mmap() callback as today. > > My main interest has been what data structure is produced in the > attach APIs. > > Eg today we have a struct dma_buf_attachment that returns a sg_table. > > I'm expecting some kind of new data structure, lets call it "physical > list" that is some efficient coding of meta/addr/len tuples that works > well with the new DMA API. Matthew has been calling this thing phyr.. > > So, I imagine, struct dma_buf_attachment gaining an optional > feature negotiation and then we have in dma_buf_attachment: > > union { > struct sg_table *sgt; > struct physical_list *phyr; > }; > > That's basicaly it, an alternative to scatterlist that has a clean > architecture. > > Now, if you are asking if the current dmabuf mmap callback can be > improved with the above? Maybe? phyr should have the neccessary > information inside it to populate a VMA - eventually even fully > correctly with all the right cachable/encrypted/forbidden/etc flags. > > So, you could imagine that exporters could just have one routine to > generate the phyr list and that goes into the attachment, goes into > some common code to fill VMA PTEs, and some other common code that > will convert it into the DMABUF scatterlist. If performance is not a > concern with these data structure conversions it could be an appealing > simplification. > > And yes, I could imagine the meta information being descriptive enough > to support the private interconnect cases, the common code could > detect private meta information and just cleanly fail. I'm kinda leaning towards entirely separate dma-buf interfaces for the new phyr stuff, because I fear that adding that to the existing ones will only make the chaos worse. 
But that aside sounds all reasonable, and even that could just be too much worry on my side and mixing phyr into existing attachments (with a pile of importer/exporter flags probably) is fine. For the existing dma-buf importers/exporters I'm kinda hoping for a pure dma_addr_t based list eventually. Going all the way to a phyr based approach for everyone might be too much churn, there's some real bad cruft there. It's not going to work for every case, but it covers a lot of them and might be less pain for existing importers. But in theory it should be possible to use phyr everywhere eventually, as long as there's no obviously api-rules-breaking way to go from a phyr back to a struct page even when that exists. > > At least the device mapping / dma_buf_attachment > > side should be doable with just the pfn and the new dma-api? > > Yes, that would be my first goal post. Figure out some meta > information and a container data structure that allows struct > page-less P2P mapping through the new DMA API. > > > > I'm hoping we can get to something where we describe not just how the > > > pfns should be DMA mapped, but also can describe how they should be > > > CPU mapped. For instance that this PFN space is always mapped > > > uncachable, in CPU and in IOMMU. > > > > I was pondering whether dma_mmap and friends would be a good place to > > prototype this and go for a fully generic implementation. But then even > > those have _wc/_uncached variants. > > Given that the inability to correctly DMA map P2P MMIO without struct > page is a current pain point and current source of hacks in dmabuf > exporters, I wanted to make resolving that a priority. > > However, if you mean what I described above for "fully generic [dmabuf > mmap] implementation", then we'd have the phyr datastructure as a > dependency to attempt that work. > > phyr, and particularly the meta information, has a number of > stakeholders. I was thinking of going first with rdma's memory > registration flow because we are now pretty close to being able to do > such a big change, and it can demonstrate most of the requirements. > > But that doesn't mean mmap couldn't go concurrently on the same agreed > datastructure if people are interested. Yeah cpu mmap needs a lot more, going with a very limited p2p use-case first only makes sense. > > > We also have current bugs in the iommu/vfio side where we are fudging > > > CC stuff, like assuming CPU memory is encrypted (not always true) and > > > that MMIO is non-encrypted (not always true) > > > > tbf CC pte flags I just don't grok at all. I've once tried to understand > > what current exporters and gpu drivers do and just gave up. But that's > > also a bit why I'm worried here because it's an enigma to me. > > For CC, inside the secure world, is some information if each PFN > inside the VM is 'encrypted' or not. Any VM PTE (including the IOPTEs) > pointing at the PFN must match the secure world's view of > 'encrypted'. The VM can ask the secure world to change its view at > runtime. > > The way CC has been bolted on to the kernel so far laregly hides this > from drivers, so it is troubled to tell in driver code if the PFN you > have is 'encrypted' or not. Right now the general rule (that is not > always true) is that struct page CPU memory is encrypted and > everything else is decrypted. > > So right now, you can mostly ignore it and the above assumption > largely happens for you transparently. > > However, soon we will have encrypted P2P MMIO which will stress this > hiding strategy. 
It's already breaking with stuff like virtual gpu drivers, vmwgfx is fiddling around with these bits (at least last I tried to understand this all) and I think a few others do too. > > > > I thought iommuv2 (or whatever linux calls these) has full fault support > > > > and could support current move semantics. But yeah for iommu without > > > > fault support we need some kind of pin or a newly formalized revoke model. > > > > > > No, this is HW dependent, including PCI device, and I'm aware of no HW > > > that fully implements this in a way that could be useful to implement > > > arbitary move semantics for VFIO.. > > > > Hm I thought we've had at least prototypes floating around of device fault > > repair, but I guess that only works with ATS/pasid stuff and not general > > iommu traffic from devices. Definitely needs some device cooperation since > > the timeouts of a full fault are almost endless. > > Yes, exactly. What all real devices I'm aware have done is make a > subset of their traffic work with ATS and PRI, but not all their > traffic. Without *all* traffic you can't make any generic assumption > in the iommu that a transient non-present won't be fatal to the > device. > > Stuff like dmabuf move semantics rely on transient non-present being > non-disruptive... Ah now I get it, at the iommu level you have to pessimistically assume whether a device can handle a fault, and none can for all traffic. I was thinking too much about the driver level where generally the dma-buf you importer are only used for the subset of device functions that can cope with faults on many devices. Cheers, Sima
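For reference, the "physical list" / phyr being discussed is not an existing upstream data structure; a minimal sketch of the meta/addr/len-tuple idea, with all names hypothetical, might look like the following (the sgt/phyr union mirrors the snippet quoted above).

#include <linux/scatterlist.h>
#include <linux/types.h>

/* Hypothetical encoding of meta/addr/len tuples ("phyr"). */
struct phys_range {
	phys_addr_t	addr;
	size_t		len;
	unsigned int	flags;		/* e.g. MMIO, encrypted, private interconnect */
};

struct physical_list {
	unsigned int		nr_ranges;
	struct phys_range	ranges[];
};

/* The negotiated alternative representations carried by an attachment: */
union example_attachment_payload {
	struct sg_table		*sgt;	/* legacy scatterlist path */
	struct physical_list	*phyr;	/* new struct-page-less path */
};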
On Wed, Jan 22, 2025 at 12:04:19PM +0100, Simona Vetter wrote: > I'm kinda leaning towards entirely separate dma-buf interfaces for the new > phyr stuff, because I fear that adding that to the existing ones will only > make the chaos worse. Lets see when some patches come up, if importers have to deal with several formats a single attach interface would probably be simpler for them.. > For the existing dma-buf importers/exporters I'm kinda hoping for a pure > dma_addr_t based list eventually. IMHO the least churn would be to have the old importers call a helper (perhaps implicitly in the core code) to convert phyr into a dma mapped scatterlist and then just keep using their existing code exactly as is. Performance improvement would come from importers switching to use the new dma api internally and avoiding the scatterlist allocation, but even the extra alocation is not so bad. Then, perhaps, and I really hesitate to say this, but perhaps to ease the migration we could store a dma mapped list in a phyr using the meta information. That would allow a dmabuf scatterlist exporter to be converted to a phyr with a helper. The importer would have to have a special path to detect the dma mapped mode and skip the normal mapping. The point would be to let maintained drivers use the new data structures and flows easily and still interoperate with older drivers. Otherwise we have to somehow build parallel scatterlist and phyr code flows into importers :\ > Going all the way to a phyr based > approach for everyone might be too much churn, there's some real bad cruft > there. It's not going to work for every case, but it covers a lot of them > and might be less pain for existing importers. I'd hope we can reach every case.. But I don't know what kind of horrors you guys have :) > But in theory it should be possible to use phyr everywhere eventually, as > long as there's no obviously api-rules-breaking way to go from a phyr back to > a struct page even when that exists. I'd say getting a struct page is a perfectly safe operation, from a phyr API perspective. The dmabuf issue seems to be entirely about following the locking rules, an importer is not allowed to get a struct page and switch from reservation locking to page refcount locking. I get the appeal for DRM of blocking struct page use because that directly prevents the above. But, in RDMA I want to re-use the same phyr datastructure for tracking pin_user_pages() CPU memory, and I must get the struct page back so I can put_user_page() it when done. Perhaps we can find some compromise where the phyr data structure has some kind of flag 'disable struct page' and then the phyr_entry_to_page() API will WARN_ON() or something like that. > > However, soon we will have encrypted P2P MMIO which will stress this > > hiding strategy. > > It's already breaking with stuff like virtual gpu drivers, vmwgfx is > fiddling around with these bits (at least last I tried to understand this > all) and I think a few others do too. IMHO, most places I've seen touching this out side arch code feel very hacky. :\ Jason
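The compatibility helper mentioned above (convert a phyr into a DMA-mapped list so existing importers keep working unchanged) could, reusing the hypothetical phys_range/physical_list sketch from earlier, look roughly like this. dma_map_resource() and dma_mapping_error() are the existing DMA API calls; the helper name, the output array and the missing unwind/P2P handling are all simplifications.

#include <linux/dma-mapping.h>

/*
 * Sketch: DMA-map every physical range of a hypothetical physical_list
 * for one importing device.  Real code would need P2P awareness, a
 * proper output structure and unwinding of earlier mappings on error.
 */
static int physical_list_dma_map(struct device *dev,
				 struct physical_list *phyr,
				 enum dma_data_direction dir,
				 dma_addr_t *out)
{
	unsigned int i;

	for (i = 0; i < phyr->nr_ranges; i++) {
		out[i] = dma_map_resource(dev, phyr->ranges[i].addr,
					  phyr->ranges[i].len, dir, 0);
		if (dma_mapping_error(dev, out[i]))
			return -ENOMEM;
	}

	return 0;
}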
Am 22.01.25 um 12:04 schrieb Simona Vetter: > On Tue, Jan 21, 2025 at 01:36:33PM -0400, Jason Gunthorpe wrote: >> On Tue, Jan 21, 2025 at 05:11:32PM +0100, Simona Vetter wrote: >>> On Mon, Jan 20, 2025 at 03:48:04PM -0400, Jason Gunthorpe wrote: >>>> On Mon, Jan 20, 2025 at 07:50:23PM +0100, Simona Vetter wrote: >>>>> On Mon, Jan 20, 2025 at 01:59:01PM -0400, Jason Gunthorpe wrote: >>>>>> On Mon, Jan 20, 2025 at 01:14:12PM +0100, Christian König wrote: >>>>>> What is going wrong with your email? You replied to Simona, but >>>>>> Simona Vetter <simona.vetter@ffwll.ch> is dropped from the To/CC >>>>>> list??? I added the address back, but seems like a weird thing to >>>>>> happen. >>>>> Might also be funny mailing list stuff, depending how you get these. I >>>>> read mails over lore and pretty much ignore cc (unless it's not also on >>>>> any list, since those tend to be security issues) because I get cc'ed on >>>>> way too much stuff for that to be a useful signal. >>>> Oh I see, you are sending a Mail-followup-to header that excludes your >>>> address, so you don't get any emails at all.. My mutt is dropping you >>>> as well. I'm having all kind of funny phenomena with AMDs mail servers since coming back from xmas vacation. From the news it looks like Outlook on Windows has a new major security issue where just viewing a mail can compromise the system and my educated guess is that our IT guys went into panic mode because of this and has changed something. >> [SNIP] >> I have been assuming that dmabuf mmap remains unchanged, that >> exporters will continue to implement that mmap() callback as today. That sounds really really good to me because that was my major concern when you noted that you want to have PFNs to build up KVM page tables. But you don't want to handle mmap() on your own, you basically don't want to have a VMA for this stuff at all, correct? >> My main interest has been what data structure is produced in the >> attach APIs. >> >> Eg today we have a struct dma_buf_attachment that returns a sg_table. >> >> I'm expecting some kind of new data structure, lets call it "physical >> list" that is some efficient coding of meta/addr/len tuples that works >> well with the new DMA API. Matthew has been calling this thing phyr.. I would not use a data structure at all. Instead we should have something like an iterator/cursor based approach similar to what the new DMA API is doing. >> So, I imagine, struct dma_buf_attachment gaining an optional >> feature negotiation and then we have in dma_buf_attachment: >> >> union { >> struct sg_table *sgt; >> struct physical_list *phyr; >> }; >> >> That's basicaly it, an alternative to scatterlist that has a clean >> architecture. I would rather suggest something like dma_buf_attachment() gets offset and size to map and returns a cursor object you can use to get your address, length and access attributes. And then you can iterate over this cursor and fill in your importer data structure with the necessary information. This way neither the exporter nor the importer need to convert their data back and forth between their specific representations of the information. >> Now, if you are asking if the current dmabuf mmap callback can be >> improved with the above? Maybe? phyr should have the neccessary >> information inside it to populate a VMA - eventually even fully >> correctly with all the right cachable/encrypted/forbidden/etc flags. That won't work like this. 
See the exporter needs to be informed about page faults on the VMA to eventually wait for operations to end and sync caches. Otherwise we either potentially allow access to freed up or re-used memory or run into issues with device cache coherency. >> So, you could imagine that exporters could just have one routine to >> generate the phyr list and that goes into the attachment, goes into >> some common code to fill VMA PTEs, and some other common code that >> will convert it into the DMABUF scatterlist. If performance is not a >> concern with these data structure conversions it could be an appealing >> simplification. >> >> And yes, I could imagine the meta information being descriptive enough >> to support the private interconnect cases, the common code could >> detect private meta information and just cleanly fail. > I'm kinda leaning towards entirely separate dma-buf interfaces for the new > phyr stuff, because I fear that adding that to the existing ones will only > make the chaos worse. But that aside sounds all reasonable, and even that > could just be too much worry on my side and mixing phyr into existing > attachments (with a pile of importer/exporter flags probably) is fine. I lean into the other direction. Dmitry and Thomas have done a really good job at cleaning up all the interaction between dynamic and static exporters / importers. Especially that we now have consistent locking for map_dma_buf() and unmap_dma_buf() should make that transition rather straight forward. > For the existing dma-buf importers/exporters I'm kinda hoping for a pure > dma_addr_t based list eventually. Going all the way to a phyr based > approach for everyone might be too much churn, there's some real bad cruft > there. It's not going to work for every case, but it covers a lot of them > and might be less pain for existing importers. The point is we have use cases that won't work without exchanging DMA addresses any more. For example we have cases with multiple devices are in the same IOMMU domain and re-using their DMA address mappings. > But in theory it should be possible to use phyr everywhere eventually, as > long as there's no obviously api-rules-breaking way to go from a phyr back to > a struct page even when that exists. I would rather say we should stick to DMA addresses as much as possible. What we can do is to add an address space description to the addresses, e.g. if it's a PCIe BUS addr in IOMMU domain X, or of it's a device private bus addr or in the case of sharing with iommufd and KVM PFNs. Regards, Christian. >>> At least the device mapping / dma_buf_attachment >>> side should be doable with just the pfn and the new dma-api? >> Yes, that would be my first goal post. Figure out some meta >> information and a container data structure that allows struct >> page-less P2P mapping through the new DMA API. >> >>>> I'm hoping we can get to something where we describe not just how the >>>> pfns should be DMA mapped, but also can describe how they should be >>>> CPU mapped. For instance that this PFN space is always mapped >>>> uncachable, in CPU and in IOMMU. >>> I was pondering whether dma_mmap and friends would be a good place to >>> prototype this and go for a fully generic implementation. But then even >>> those have _wc/_uncached variants. >> Given that the inability to correctly DMA map P2P MMIO without struct >> page is a current pain point and current source of hacks in dmabuf >> exporters, I wanted to make resolving that a priority. 
>> >> However, if you mean what I described above for "fully generic [dmabuf >> mmap] implementation", then we'd have the phyr datastructure as a >> dependency to attempt that work. >> >> phyr, and particularly the meta information, has a number of >> stakeholders. I was thinking of going first with rdma's memory >> registration flow because we are now pretty close to being able to do >> such a big change, and it can demonstrate most of the requirements. >> >> But that doesn't mean mmap couldn't go concurrently on the same agreed >> datastructure if people are interested. > Yeah cpu mmap needs a lot more, going with a very limited p2p use-case > first only makes sense. > >>>> We also have current bugs in the iommu/vfio side where we are fudging >>>> CC stuff, like assuming CPU memory is encrypted (not always true) and >>>> that MMIO is non-encrypted (not always true) >>> tbf CC pte flags I just don't grok at all. I've once tried to understand >>> what current exporters and gpu drivers do and just gave up. But that's >>> also a bit why I'm worried here because it's an enigma to me. >> For CC, inside the secure world, is some information if each PFN >> inside the VM is 'encrypted' or not. Any VM PTE (including the IOPTEs) >> pointing at the PFN must match the secure world's view of >> 'encrypted'. The VM can ask the secure world to change its view at >> runtime. >> >> The way CC has been bolted on to the kernel so far laregly hides this >> from drivers, so it is troubled to tell in driver code if the PFN you >> have is 'encrypted' or not. Right now the general rule (that is not >> always true) is that struct page CPU memory is encrypted and >> everything else is decrypted. >> >> So right now, you can mostly ignore it and the above assumption >> largely happens for you transparently. >> >> However, soon we will have encrypted P2P MMIO which will stress this >> hiding strategy. > It's already breaking with stuff like virtual gpu drivers, vmwgfx is > fiddling around with these bits (at least last I tried to understand this > all) and I think a few others do too. > >>>>> I thought iommuv2 (or whatever linux calls these) has full fault support >>>>> and could support current move semantics. But yeah for iommu without >>>>> fault support we need some kind of pin or a newly formalized revoke model. >>>> No, this is HW dependent, including PCI device, and I'm aware of no HW >>>> that fully implements this in a way that could be useful to implement >>>> arbitary move semantics for VFIO.. >>> Hm I thought we've had at least prototypes floating around of device fault >>> repair, but I guess that only works with ATS/pasid stuff and not general >>> iommu traffic from devices. Definitely needs some device cooperation since >>> the timeouts of a full fault are almost endless. >> Yes, exactly. What all real devices I'm aware have done is make a >> subset of their traffic work with ATS and PRI, but not all their >> traffic. Without *all* traffic you can't make any generic assumption >> in the iommu that a transient non-present won't be fatal to the >> device. >> >> Stuff like dmabuf move semantics rely on transient non-present being >> non-disruptive... > Ah now I get it, at the iommu level you have to pessimistically assume > whether a device can handle a fault, and none can for all traffic. I was > thinking too much about the driver level where generally the dma-buf you > importer are only used for the subset of device functions that can cope > with faults on many devices. > > Cheers, Sima
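As a rough sketch of the iterator/cursor style of attachment described above (offset and size in, a stream of address/length/attribute tuples out, no intermediate array on either side): nothing like this exists in dma-buf today, and every name below is hypothetical.

#include <linux/dma-buf.h>
#include <linux/types.h>

/* Hypothetical cursor-based mapping interface. */
struct dma_buf_map_cursor {
	u64		addr;	/* meaning depends on the negotiated address space */
	size_t		len;
	unsigned int	attrs;	/* cacheability, encryption, ... */
	/* private iteration state owned by the exporter */
};

int dma_buf_attachment_map_range(struct dma_buf_attachment *attach,
				 u64 offset, size_t size,
				 struct dma_buf_map_cursor *cur);
bool dma_buf_map_cursor_next(struct dma_buf_map_cursor *cur);

/*
 * Usage sketch: the importer walks the cursor and feeds its own HW
 * structures directly, so neither side converts between representations:
 *
 *	dma_buf_attachment_map_range(attach, 0, size, &cur);
 *	do {
 *		program_hw(cur.addr, cur.len, cur.attrs);
 *	} while (dma_buf_map_cursor_next(&cur));
 */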
On Wed, Jan 22, 2025 at 02:29:09PM +0100, Christian König wrote: > I'm having all kind of funny phenomena with AMDs mail servers since coming > back from xmas vacation. :( A few years back our IT fully migrated our email to into Office 365 cloud and gave up all the crazy half on-prem stuff they were doing. The mail started working fully perfectly after that, as long as you use MS's servers directly :\ > But you don't want to handle mmap() on your own, you basically don't want to > have a VMA for this stuff at all, correct? Right, we have no interest in mmap, VMAs or struct page in rdma/kvm/iommu. > > > My main interest has been what data structure is produced in the > > > attach APIs. > > > > > > Eg today we have a struct dma_buf_attachment that returns a sg_table. > > > > > > I'm expecting some kind of new data structure, lets call it "physical > > > list" that is some efficient coding of meta/addr/len tuples that works > > > well with the new DMA API. Matthew has been calling this thing phyr.. > > I would not use a data structure at all. Instead we should have something > like an iterator/cursor based approach similar to what the new DMA API is > doing. I'm certainly open to this idea. There may be some technical challenges, it is a big change from scatterlist today, and function-pointer-per-page sounds like bad performance if there are alot of pages.. RDMA would probably have to stuff this immediately into something like a phyr anyhow because it needs to fully extent the thing being mapped to figure out what the HW page size and geometry should be - that would be trivial though, and a RDMA problem. > > > Now, if you are asking if the current dmabuf mmap callback can be > > > improved with the above? Maybe? phyr should have the neccessary > > > information inside it to populate a VMA - eventually even fully > > > correctly with all the right cachable/encrypted/forbidden/etc flags. > > That won't work like this. Note I said "populate a VMA", ie a helper to build the VMA PTEs only. > See the exporter needs to be informed about page faults on the VMA to > eventually wait for operations to end and sync caches. All of this would still have to be provided outside in the same way as today. > For example we have cases with multiple devices are in the same IOMMU domain > and re-using their DMA address mappings. IMHO this is just another flavour of "private" address flow between two cooperating drivers. It is not a "dma address" in the sense of a dma_addr_t that was output from the DMA API. I think that subtle distinction is very important. When I say pfn/dma address I'm really only talking about standard DMA API flows, used by generic drivers. IMHO, DMABUF needs a private address "escape hatch", and cooperating drivers should do whatever they want when using that flow. The address is *fully private*, so the co-operating drivers can do whatever they want. iommu_map in exporter and pass an IOVA? Fine! pass a PFN and iommu_map in the importer? Also fine! Private is private. > > But in theory it should be possible to use phyr everywhere eventually, as > > long as there's no obviously api-rules-breaking way to go from a phyr back to > > a struct page even when that exists. > > I would rather say we should stick to DMA addresses as much as possible. I remain skeptical of this.. Aside from all the technical reasons I already outlined.. I think it is too much work to have the exporters conditionally build all sorts of different representations of the same thing depending on the importer. 
Like having a lot of DRM drivers generate both a PFN list and a DMA-mapped list in their export code doesn't sound very appealing to me at all. It makes sense that a driver would be able to conditionally generate private and generic based on negotiation, but IMHO, not more than one flavour of generic. Jason
Am 22.01.25 um 15:37 schrieb Jason Gunthorpe: >>>> My main interest has been what data structure is produced in the >>>> attach APIs. >>>> >>>> Eg today we have a struct dma_buf_attachment that returns a sg_table. >>>> >>>> I'm expecting some kind of new data structure, lets call it "physical >>>> list" that is some efficient coding of meta/addr/len tuples that works >>>> well with the new DMA API. Matthew has been calling this thing phyr.. >> I would not use a data structure at all. Instead we should have something >> like an iterator/cursor based approach similar to what the new DMA API is >> doing. > I'm certainly open to this idea. There may be some technical > challenges, it is a big change from scatterlist today, and > function-pointer-per-page sounds like bad performance if there are > alot of pages.. > > RDMA would probably have to stuff this immediately into something like > a phyr anyhow because it needs to fully extent the thing being mapped > to figure out what the HW page size and geometry should be - that > would be trivial though, and a RDMA problem. > >>>> Now, if you are asking if the current dmabuf mmap callback can be >>>> improved with the above? Maybe? phyr should have the neccessary >>>> information inside it to populate a VMA - eventually even fully >>>> correctly with all the right cachable/encrypted/forbidden/etc flags. >> That won't work like this. > Note I said "populate a VMA", ie a helper to build the VMA PTEs only. > >> See the exporter needs to be informed about page faults on the VMA to >> eventually wait for operations to end and sync caches. > All of this would still have to be provided outside in the same way as > today. > >> For example we have cases with multiple devices are in the same IOMMU domain >> and re-using their DMA address mappings. > IMHO this is just another flavour of "private" address flow between > two cooperating drivers. Well that's the point. The inporter is not cooperating here. The importer doesn't have the slightest idea that he is sharing it's DMA addresses with the exporter. All the importer gets is when you want to access this information use this address here. > It is not a "dma address" in the sense of a dma_addr_t that was output > from the DMA API. I think that subtle distinction is very > important. When I say pfn/dma address I'm really only talking about > standard DMA API flows, used by generic drivers. > > IMHO, DMABUF needs a private address "escape hatch", and cooperating > drivers should do whatever they want when using that flow. The address > is *fully private*, so the co-operating drivers can do whatever they > want. iommu_map in exporter and pass an IOVA? Fine! pass a PFN and > iommu_map in the importer? Also fine! Private is private. > >>> But in theory it should be possible to use phyr everywhere eventually, as >>> long as there's no obviously api-rules-breaking way to go from a phyr back to >>> a struct page even when that exists. >> I would rather say we should stick to DMA addresses as much as possible. > I remain skeptical of this.. Aside from all the technical reasons I > already outlined.. > > I think it is too much work to have the exporters conditionally build > all sorts of different representations of the same thing depending on > the importer. Like having alot of DRM drivers generate both a PFN and > DMA mapped list in their export code doesn't sound very appealing to > me at all. Well from experience I can say that it is actually the other way around. 
We have a very limited number of exporters and a lot of different importers. So having complexity in the exporter instead of the importer is absolutely beneficial. PFN is the special case; in other words, it is the private address passed around. And I will push hard not to support that in the DRM drivers or in any DMA-buf heap. > It makes sense that a driver would be able to conditionally generate > private and generic based on negotiation, but IMHO, not more than one > flavour of generic.. I still strongly think that the exporter should talk with the DMA API to set up the access path for the importer and *not* the importer directly. Regards, Christian. > > Jason
On Wed, Jan 22, 2025 at 03:59:11PM +0100, Christian König wrote: > > > For example we have cases with multiple devices are in the same IOMMU domain > > > and re-using their DMA address mappings. > > IMHO this is just another flavour of "private" address flow between > > two cooperating drivers. > > Well that's the point. The inporter is not cooperating here. If the private address relies on a shared iommu_domain controlled by the driver, then yes, the importer MUST be cooperating. For instance, if you send the same private address into RDMA it will explode because it doesn't have any notion of shared iommu_domain mappings, and it certainly doesn't setup any such shared domains. > The importer doesn't have the slightest idea that he is sharing it's DMA > addresses with the exporter. Of course it does. The importer driver would have had to explicitly set this up! The normal kernel behavior is that all drivers get private iommu_domains controled by the DMA API. If your driver is doing something else *it did it deliberately*. Some of that mess in tegra host1x around this area is not well structured, it should not be implicitly setting up domains for drivers. It is old code that hasn't been updated to use the new iommu subsystem approach for driver controled non-DMA API domains. The new iommu architecture has the probing driver disable the DMA API and can then manipulate its iommu domain however it likes, safely. Ie the probing driver is aware and particiapting in disabling the DMA API. Again, either you are using the DMA API and you work in generic ways with generic devices or it is "private" and only co-operating drivers can interwork with private addresses. A private address must not ever be sent to a DMA API using driver and vice versa. IMHO this is an important architecture point and why Christoph was frowning on abusing dma_addr_t to represent things that did NOT come out of the DMA API. > We have a very limited number of exporters and a lot of different importers. > So having complexity in the exporter instead of the importer is absolutely > beneficial. Isn't every DRM driver both an importer and exporter? That is what I was expecting at least.. > I still strongly think that the exporter should talk with the DMA API to > setup the access path for the importer and *not* the importer directly. It is contrary to the design of the new API which wants to co-optimize mapping and HW setup together as one unit. For instance in RDMA we want to hint and control the way the IOMMU mapping works in the DMA API to optimize the RDMA HW side. I can't do those optimizations if I'm not in control of the mapping. The same is probably true on the GPU side too, you want IOVAs that have tidy alignment with your PTE structure, but only the importer understands its own HW to make the correct hints to the DMA API. Jason
Sending it as text mail once more. Am 23.01.25 um 15:32 schrieb Christian König: > Am 23.01.25 um 14:59 schrieb Jason Gunthorpe: >> On Wed, Jan 22, 2025 at 03:59:11PM +0100, Christian König wrote: >>>>> For example we have cases with multiple devices are in the same IOMMU domain >>>>> and re-using their DMA address mappings. >>>> IMHO this is just another flavour of "private" address flow between >>>> two cooperating drivers. >>> Well that's the point. The inporter is not cooperating here. >> If the private address relies on a shared iommu_domain controlled by >> the driver, then yes, the importer MUST be cooperating. For instance, >> if you send the same private address into RDMA it will explode because >> it doesn't have any notion of shared iommu_domain mappings, and it >> certainly doesn't setup any such shared domains. > > Hui? Why the heck should a driver own it's iommu domain? > > The domain is owned and assigned by the PCI subsystem under Linux. > >>> The importer doesn't have the slightest idea that he is sharing it's DMA >>> addresses with the exporter. >> Of course it does. The importer driver would have had to explicitly >> set this up! The normal kernel behavior is that all drivers get >> private iommu_domains controled by the DMA API. If your driver is >> doing something else *it did it deliberately*. > > As far as I know that is simply not correct. Currently IOMMU > domains/groups are usually shared between devices. > > Especially multi function devices get only a single IOMMU domain. > >> Some of that mess in tegra host1x around this area is not well >> structured, it should not be implicitly setting up domains for >> drivers. It is old code that hasn't been updated to use the new iommu >> subsystem approach for driver controled non-DMA API domains. >> >> The new iommu architecture has the probing driver disable the DMA API >> and can then manipulate its iommu domain however it likes, safely. Ie >> the probing driver is aware and particiapting in disabling the DMA >> API. > > Why the heck should we do this? > > That drivers manage all of that on their own sounds like a massive > step in the wrong direction. > >> Again, either you are using the DMA API and you work in generic ways >> with generic devices or it is "private" and only co-operating drivers >> can interwork with private addresses. A private address must not ever >> be sent to a DMA API using driver and vice versa. >> >> IMHO this is an important architecture point and why Christoph was >> frowning on abusing dma_addr_t to represent things that did NOT come >> out of the DMA API. >> >>> We have a very limited number of exporters and a lot of different importers. >>> So having complexity in the exporter instead of the importer is absolutely >>> beneficial. >> Isn't every DRM driver both an importer and exporter? That is what I >> was expecting at least.. >> >>> I still strongly think that the exporter should talk with the DMA API to >>> setup the access path for the importer and *not* the importer directly. >> It is contrary to the design of the new API which wants to co-optimize >> mapping and HW setup together as one unit. > > Yeah and I'm really questioning this design goal. That sounds like > totally going into the wrong direction just because of the RDMA drivers. > >> For instance in RDMA we want to hint and control the way the IOMMU >> mapping works in the DMA API to optimize the RDMA HW side. I can't do >> those optimizations if I'm not in control of the mapping. > > Why? 
What is the technical background here? > >> The same is probably true on the GPU side too, you want IOVAs that >> have tidy alignment with your PTE structure, but only the importer >> understands its own HW to make the correct hints to the DMA API. > > Yeah but then express those as requirements to the DMA API and not > move all the important decisions into the driver where they are > implemented over and over again and potentially broken half the time. > > See drivers are supposed to be simple, small and stupid. They should > be controlled by the core OS and not allowed to do whatever they want. > > Driver developers are not to be trusted to always get everything right if > you make it as complicated as this. > > Regards, > Christian. > >> Jason >
On Thu, Jan 23, 2025 at 03:35:21PM +0100, Christian König wrote: > Sending it as text mail once more. > > Am 23.01.25 um 15:32 schrieb Christian König: > > Am 23.01.25 um 14:59 schrieb Jason Gunthorpe: > > > On Wed, Jan 22, 2025 at 03:59:11PM +0100, Christian König wrote: > > > > > > For example we have cases with multiple devices are in the same IOMMU domain > > > > > > and re-using their DMA address mappings. > > > > > IMHO this is just another flavour of "private" address flow between > > > > > two cooperating drivers. > > > > Well that's the point. The inporter is not cooperating here. > > > If the private address relies on a shared iommu_domain controlled by > > > the driver, then yes, the importer MUST be cooperating. For instance, > > > if you send the same private address into RDMA it will explode because > > > it doesn't have any notion of shared iommu_domain mappings, and it > > > certainly doesn't setup any such shared domains. > > > > Hui? Why the heck should a driver own it's iommu domain? I don't know, you are the one saying the drivers have special shared iommu_domains so DMA BUF need some special design to accommodate it. I'm aware that DRM drivers do directly call into the iommu subsystem and do directly manage their own IOVA. I assumed this is what you were talkinga bout. See below. > > The domain is owned and assigned by the PCI subsystem under Linux. That domain is *exclusively* owned by the DMA API and is only accessed via maps created by DMA API calls. If you are using the DMA API correctly then all of this is abstracted and none of it matters to you. There is no concept of "shared domains" in the DMA API. You call the DMA API, you get a dma_addr_t that is valid for a *single* device, you program it in HW. That is all. There is no reason to dig deeper than this. > > > > The importer doesn't have the slightest idea that he is sharing it's DMA > > > > addresses with the exporter. > > > Of course it does. The importer driver would have had to explicitly > > > set this up! The normal kernel behavior is that all drivers get > > > private iommu_domains controled by the DMA API. If your driver is > > > doing something else *it did it deliberately*. > > > > As far as I know that is simply not correct. Currently IOMMU > > domains/groups are usually shared between devices. No, the opposite. The iommu subsystem tries to maximally isolate devices up to the HW limit. On server platforms every device is expected to get its own iommu domain. > > Especially multi function devices get only a single IOMMU domain. Only if the PCI HW doesn't support ACS. This is all DMA API internal details you shouldn't even be talking about at the DMA BUF level. It is all hidden and simply does not matter to DMA BUF at all. > > > The new iommu architecture has the probing driver disable the DMA API > > > and can then manipulate its iommu domain however it likes, safely. Ie > > > the probing driver is aware and particiapting in disabling the DMA > > > API. > > > > Why the heck should we do this? > > > > That drivers manage all of that on their own sounds like a massive step > > in the wrong direction. I am talking about DRM drivers that HAVE to manage their own for some reason I don't know. 
eg: drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c: tdev->iommu.domain = iommu_domain_alloc(&platform_bus_type); drivers/gpu/drm/msm/msm_iommu.c: domain = iommu_paging_domain_alloc(dev); drivers/gpu/drm/rockchip/rockchip_drm_drv.c: private->domain = iommu_paging_domain_alloc(private->iommu_dev); drivers/gpu/drm/tegra/drm.c: tegra->domain = iommu_paging_domain_alloc(dma_dev); drivers/gpu/host1x/dev.c: host->domain = iommu_paging_domain_alloc(host->dev); Normal simple drivers should never be calling these functions! If you are calling these functions you are not using the DMA API, and, yes, some cases like tegra n1x are actively sharing these special domains across multiple devices and drivers. If you want to pass an IOVA in one of these special driver-created domains then it would be some private address in DMABUF that only works on drivers that have understood they attached to these manually created domains. No DMA API involvement here. > > > > I still strongly think that the exporter should talk with the DMA API to > > > > setup the access path for the importer and *not* the importer directly. > > > It is contrary to the design of the new API which wants to co-optimize > > > mapping and HW setup together as one unit. > > > > Yeah and I'm really questioning this design goal. That sounds like > > totally going into the wrong direction just because of the RDMA > > drivers. Actually it is storage that motivates this. It is just pointless to allocate a dma_addr_t list in the fast path when you don't need it. You can stream the dma_addr_t directly into HW structures that are necessary and already allocated. > > > For instance in RDMA we want to hint and control the way the IOMMU > > > mapping works in the DMA API to optimize the RDMA HW side. I can't do > > > those optimizations if I'm not in control of the mapping. > > > > Why? What is the technical background here? dma-iommu.c chooses an IOVA alignment based on its own reasoning that is not always compatible with the HW. The HW can optimize if the IOVA alignment meets certain restrictions. Much like page tables in a GPU. > > > The same is probably true on the GPU side too, you want IOVAs that > > > have tidy alignment with your PTE structure, but only the importer > > > understands its own HW to make the correct hints to the DMA API. > > > > Yeah but then express those as requirements to the DMA API and not move > > all the important decisions into the driver where they are implemented > > over and over again and potentially broken halve the time. It wouild be in the DMA API, just the per-mapping portion of the API. Same as the multipath, the ATS, and more. It is all per-mapping descisions of the executing HW, not global decisions or something like. Jason
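For context, the pattern in the drivers listed above (a probing driver that has opted out of DMA API ownership and manages its own paging domain) looks roughly like the sketch below. The iommu_* calls are the real kernel APIs (signatures as of recent kernels); struct example_gpu, EXAMPLE_IOVA and the stored fields are illustrative.

#include <linux/iommu.h>

/* Sketch of a driver-owned IOMMU domain, as in the DRM examples above. */
static int example_setup_private_domain(struct example_gpu *gpu,
					struct device *dev)
{
	int ret;

	gpu->domain = iommu_paging_domain_alloc(dev);
	if (IS_ERR(gpu->domain))
		return PTR_ERR(gpu->domain);

	ret = iommu_attach_device(gpu->domain, dev);
	if (ret) {
		iommu_domain_free(gpu->domain);
		return ret;
	}

	/*
	 * From here on the driver owns this IOVA space.  Any address it
	 * iommu_map()s is private to the domain and must never be handed
	 * to a driver that is using the regular DMA API.
	 */
	return iommu_map(gpu->domain, EXAMPLE_IOVA, gpu->bo_phys, gpu->bo_size,
			 IOMMU_READ | IOMMU_WRITE, GFP_KERNEL);
}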
On Thu, Jan 23, 2025 at 04:48:29PM +0100, Christian König wrote: > No, no there are much more cases where drivers simply assume that they > are in the same iommu domain for different devices. This is an illegal assumption and invalid way to use the DMA API. Do not do that, do not architect things in DMABUF to permit that. The dma_addr_t out of the DMA API is only usable by the device passed in, period full stop. If you want to use it with two devices then call the DMA API twice. > E.g. that different > PCI endpoints can use the same dma_addr_t. > For example those classic sound devices for HDMI audio on graphics > cards work like this. > In other words if the device handled by the generic ALSA driver and the > GPU are not in the same iommu domain you run into trouble. Yes, I recall this weird AMD issue as well. IIRC the solution is not clean or "correct". :( I vaugely recall it was caused by a HW bug... > Well it might never been documented but I know of quite a bunch of > different cases that assume that a DMA addr will just ultimately work > for some other device/driver as well. Again, illegal assumption, breaks the abstraction. >> This is all DMA API internal details you shouldn't even be talking >> about at the DMA BUF level. It is all hidden and simply does not >> matter to DMA BUF at all. > > Well we somehow need to support the existing use cases with the new > API. Call the DMA API multiple times, once per device. That is the only correct way to handle this today. DMABUF is already architected like this, each and every attach should be dma mapping and generating a scatterlist for every unique importing device. Improving it to somehow avoid the redundant DMA API map would require new DMA API work. Do NOT randomly assume that devices share dma_addr_t, there is no architected way to ever discover this, it is a complete violation of all the API abstractions. >> If you want to pass an IOVA in one of these special driver-created >> domains then it would be some private address in DMABUF that only >> works on drivers that have understood they attached to these manually >> created domains. No DMA API involvement here. > > That won't fly like this. That would break at least the ALSA use case > and potentially quite a bunch of others. Your AMD ALSA weirdness is not using custom iommu_domains (nor should it), it is a different problem. > dma-iommu.c chooses an IOVA alignment based on its own reasoning that > is not always compatible with the HW. The HW can optimize if the IOVA > alignment meets certain restrictions. Much like page tables in a GPU. > > Yeah, but why can't we tell the DMA API those restrictions instead of > letting the driver manage the address space themselves? How do you propose to do this per-mapping operation without having the HW driver actually call the mapping operation? > > Same as the multipath, the ATS, and more. It is all per-mapping > > descisions of the executing HW, not global decisions or something > > like. > > So the DMA API has some structure or similar to describe the necessary > per-mapping properties? Not fully yet (though some multipath is supported), but I want to slowly move in this direction to solve all of these problems we have :( Jason
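Concretely, the correct pattern today for two importing devices is simply two attachments and two mappings; the dma_addr_t values in each sg_table are valid only for the device that was passed to dma_buf_attach(). A minimal sketch with the existing dma-buf API (error unwinding trimmed):

#include <linux/dma-buf.h>

/* One attachment + mapping per importing device; never share dma_addr_t. */
static int example_map_for_two_devices(struct dma_buf *dmabuf,
				       struct device *dev_a,
				       struct device *dev_b,
				       struct sg_table **sgt_a,
				       struct sg_table **sgt_b)
{
	struct dma_buf_attachment *at_a, *at_b;

	at_a = dma_buf_attach(dmabuf, dev_a);
	if (IS_ERR(at_a))
		return PTR_ERR(at_a);

	at_b = dma_buf_attach(dmabuf, dev_b);
	if (IS_ERR(at_b)) {
		dma_buf_detach(dmabuf, at_a);
		return PTR_ERR(at_b);
	}

	*sgt_a = dma_buf_map_attachment_unlocked(at_a, DMA_BIDIRECTIONAL);
	*sgt_b = dma_buf_map_attachment_unlocked(at_b, DMA_BIDIRECTIONAL);

	/* full error unwinding (unmap/detach) omitted for brevity */
	return 0;
}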
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index 7eeee3a38202..83d1448b6dcc 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -630,10 +630,10 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info) size_t alloc_size = sizeof(struct dma_buf); int ret; - if (WARN_ON(!exp_info->priv || !exp_info->ops - || !exp_info->ops->map_dma_buf - || !exp_info->ops->unmap_dma_buf - || !exp_info->ops->release)) + if (WARN_ON(!exp_info->priv || !exp_info->ops || + (!!exp_info->ops->map_dma_buf != !!exp_info->ops->unmap_dma_buf) || + (!exp_info->ops->map_dma_buf && !exp_info->ops->get_pfn) || + !exp_info->ops->release)) return ERR_PTR(-EINVAL); if (WARN_ON(exp_info->ops->cache_sgt_mapping && @@ -909,7 +909,10 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev, struct dma_buf_attachment *attach; int ret; - if (WARN_ON(!dmabuf || !dev)) + if (WARN_ON(!dmabuf)) + return ERR_PTR(-EINVAL); + + if (WARN_ON(dmabuf->ops->map_dma_buf && !dev)) return ERR_PTR(-EINVAL); if (WARN_ON(importer_ops && !importer_ops->move_notify)) @@ -941,7 +944,7 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev, */ if (dma_buf_attachment_is_dynamic(attach) != dma_buf_is_dynamic(dmabuf)) { - struct sg_table *sgt; + struct sg_table *sgt = NULL; dma_resv_lock(attach->dmabuf->resv, NULL); if (dma_buf_is_dynamic(attach->dmabuf)) { @@ -950,13 +953,16 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev, goto err_unlock; } - sgt = __map_dma_buf(attach, DMA_BIDIRECTIONAL); - if (!sgt) - sgt = ERR_PTR(-ENOMEM); - if (IS_ERR(sgt)) { - ret = PTR_ERR(sgt); - goto err_unpin; + if (dmabuf->ops->map_dma_buf) { + sgt = __map_dma_buf(attach, DMA_BIDIRECTIONAL); + if (!sgt) + sgt = ERR_PTR(-ENOMEM); + if (IS_ERR(sgt)) { + ret = PTR_ERR(sgt); + goto err_unpin; + } } + dma_resv_unlock(attach->dmabuf->resv); attach->sgt = sgt; attach->dir = DMA_BIDIRECTIONAL; @@ -1119,7 +1125,8 @@ struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *attach, might_sleep(); - if (WARN_ON(!attach || !attach->dmabuf)) + if (WARN_ON(!attach || !attach->dmabuf || + !attach->dmabuf->ops->map_dma_buf)) return ERR_PTR(-EINVAL); dma_resv_assert_held(attach->dmabuf->resv); @@ -1195,7 +1202,8 @@ dma_buf_map_attachment_unlocked(struct dma_buf_attachment *attach, might_sleep(); - if (WARN_ON(!attach || !attach->dmabuf)) + if (WARN_ON(!attach || !attach->dmabuf || + !attach->dmabuf->ops->map_dma_buf)) return ERR_PTR(-EINVAL); dma_resv_lock(attach->dmabuf->resv, NULL); @@ -1222,7 +1230,8 @@ void dma_buf_unmap_attachment(struct dma_buf_attachment *attach, { might_sleep(); - if (WARN_ON(!attach || !attach->dmabuf || !sg_table)) + if (WARN_ON(!attach || !attach->dmabuf || + !attach->dmabuf->ops->unmap_dma_buf || !sg_table)) return; dma_resv_assert_held(attach->dmabuf->resv); @@ -1254,7 +1263,8 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach, { might_sleep(); - if (WARN_ON(!attach || !attach->dmabuf || !sg_table)) + if (WARN_ON(!attach || !attach->dmabuf || + !attach->dmabuf->ops->unmap_dma_buf || !sg_table)) return; dma_resv_lock(attach->dmabuf->resv, NULL); @@ -1263,6 +1273,52 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach, } EXPORT_SYMBOL_NS_GPL(dma_buf_unmap_attachment_unlocked, "DMA_BUF"); +/** + * dma_buf_get_pfn_unlocked - + * @attach: [in] attachment to get pfn from + * @pgoff: [in] page offset of the buffer against the start of dma_buf + * @pfn: [out] returns the pfn of the buffer + * @max_order 
[out] returns the max mapping order of the buffer + */ +int dma_buf_get_pfn_unlocked(struct dma_buf_attachment *attach, + pgoff_t pgoff, u64 *pfn, int *max_order) +{ + struct dma_buf *dmabuf = attach->dmabuf; + int ret; + + if (WARN_ON(!attach || !attach->dmabuf || + !attach->dmabuf->ops->get_pfn)) + return -EINVAL; + + /* + * Open: + * + * When dma_buf is dynamic but dma_buf move is disabled, the buffer + * should be pinned before use, See dma_buf_map_attachment() for + * reference. + * + * But for now no pin is intended inside dma_buf_get_pfn(), otherwise + * need another API to unpin the dma_buf. So just fail out this case. + */ + if (dma_buf_is_dynamic(attach->dmabuf) && + !IS_ENABLED(CONFIG_DMABUF_MOVE_NOTIFY)) + return -ENOENT; + + dma_resv_lock(attach->dmabuf->resv, NULL); + ret = dmabuf->ops->get_pfn(attach, pgoff, pfn, max_order); + /* + * Open: + * + * Is dma_resv_wait_timeout() needed? I assume no. The DMA buffer + * content synchronization could be done when the buffer is to be + * mapped by importer. + */ + dma_resv_unlock(attach->dmabuf->resv); + + return ret; +} +EXPORT_SYMBOL_NS_GPL(dma_buf_get_pfn_unlocked, "DMA_BUF"); + /** * dma_buf_move_notify - notify attachments that DMA-buf is moving * @@ -1662,7 +1718,7 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused) attach_count = 0; list_for_each_entry(attach_obj, &buf_obj->attachments, node) { - seq_printf(s, "\t%s\n", dev_name(attach_obj->dev)); + seq_printf(s, "\t%s\n", attach_obj->dev ? dev_name(attach_obj->dev) : NULL); attach_count++; } dma_resv_unlock(buf_obj->resv); diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 36216d28d8bd..b16183edfb3a 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -194,6 +194,17 @@ struct dma_buf_ops { * if the call would block. */ + /** + * @get_pfn: + * + * This is called by dma_buf_get_pfn(). It is used to get the pfn + * of the buffer positioned by the page offset against the start of + * the dma_buf. It can only be called if @attach has been called + * successfully. + */ + int (*get_pfn)(struct dma_buf_attachment *attach, pgoff_t pgoff, + u64 *pfn, int *max_order); + /** * @release: * @@ -629,6 +640,8 @@ dma_buf_map_attachment_unlocked(struct dma_buf_attachment *attach, void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach, struct sg_table *sg_table, enum dma_data_direction direction); +int dma_buf_get_pfn_unlocked(struct dma_buf_attachment *attach, + pgoff_t pgoff, u64 *pfn, int *max_order); int dma_buf_mmap(struct dma_buf *, struct vm_area_struct *, unsigned long);
Introduce a new API for dma-buf importers, and add a dma_buf_ops callback for dma-buf exporters. This API is for subsystem importers that map the dma-buf into some user-defined address space, e.g. for IOMMUFD to map the dma-buf to a userspace IOVA via the IOMMU page table, or for KVM to map the dma-buf to a GPA via the KVM MMU (e.g. EPT). Currently dma-buf is only used to get DMA addresses for a device's default domain by using the kernel DMA APIs. But for these new use cases, importers only need the pfn of the dma-buf resource to build their own mapping tables, so the map_dma_buf() callback is no longer mandatory for exporters. Importers may also choose not to provide a struct device *dev on dma_buf_attach() if they don't call dma_buf_map_attachment(). As with dma_buf_map_attachment(), the importer should first call dma_buf_attach()/dma_buf_dynamic_attach() and then call dma_buf_get_pfn_unlocked(). If the importer chooses to do a dynamic attach, it should also handle the dma-buf move notification. Only the unlocked version of dma_buf_get_pfn is implemented for now, simply because no locked version is needed yet. Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com> --- IIUC, only get_pfn() is needed, no put_pfn(). The whole dma-buf is referenced/dereferenced at dma-buf attach/detach time. Specifically, for static attachment, the exporter should always make the memory resource available/pinned on the first dma_buf_attach(), and release/unpin the memory resource on the last dma_buf_detach(). For dynamic attachment, the exporter may populate & invalidate the memory resource at any time; that's OK as long as the importers follow dma-buf move notification. So no pinning is needed for get_pfn() and no put_pfn() is needed. --- drivers/dma-buf/dma-buf.c | 90 +++++++++++++++++++++++++++++++-------- include/linux/dma-buf.h | 13 ++++++ 2 files changed, 86 insertions(+), 17 deletions(-)
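As a usage sketch of the interface added by this patch: a dynamic importer that never DMA-maps attaches with a move_notify callback and then resolves PFNs on demand. dma_buf_dynamic_attach(), dma_buf_detach() and the new dma_buf_get_pfn_unlocked() are as in this series; struct my_importer and the my_importer_* helpers are hypothetical, and whether dev may be NULL follows the rules described in the commit message above.

#include <linux/dma-buf.h>

static void my_importer_move_notify(struct dma_buf_attachment *attach)
{
	/* invalidate any mappings built from previously returned PFNs */
	my_importer_invalidate(attach->importer_priv);
}

static const struct dma_buf_attach_ops my_importer_attach_ops = {
	.move_notify = my_importer_move_notify,
};

static int my_importer_map_page(struct dma_buf *dmabuf, pgoff_t pgoff,
				struct my_importer *imp)
{
	u64 pfn;
	int max_order;
	int ret;

	/* dev is NULL: this importer never calls dma_buf_map_attachment() */
	imp->attach = dma_buf_dynamic_attach(dmabuf, NULL,
					     &my_importer_attach_ops, imp);
	if (IS_ERR(imp->attach))
		return PTR_ERR(imp->attach);

	ret = dma_buf_get_pfn_unlocked(imp->attach, pgoff, &pfn, &max_order);
	if (ret) {
		dma_buf_detach(dmabuf, imp->attach);
		return ret;
	}

	return my_importer_install_mapping(imp, pgoff, pfn, max_order);
}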