Message ID | 20250107142719.179636-2-yilun.xu@linux.intel.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Private MMIO support for private assigned dev | expand |
Am 07.01.25 um 15:27 schrieb Xu Yilun: > Introduce a new API for dma-buf importer, also add a dma_buf_ops > callback for dma-buf exporter. This API is for subsystem importers who > map the dma-buf to some user defined address space, e.g. for IOMMUFD to > map the dma-buf to userspace IOVA via IOMMU page table, or for KVM to > map the dma-buf to GPA via KVM MMU (e.g. EPT). > > Currently dma-buf is only used to get DMA address for device's default > domain by using kernel DMA APIs. But for these new use-cases, importers > only need the pfn of the dma-buf resource to build their own mapping > tables. As far as I can see I have to fundamentally reject this whole approach. It's intentional DMA-buf design that we don't expose struct pages nor PFNs to the importer. Essentially DMA-buf only transports DMA addresses. In other words the mapping is done by the exporter and *not* the importer. What we certainly can do is to annotate those DMA addresses to a better specify in which domain they are applicable, e.g. if they are PCIe bus addresses or some inter device bus addresses etc... But moving the functionality to map the pages/PFNs to DMA addresses into the importer is an absolutely clear NO-GO. Regards, Christian. > So the map_dma_buf() callback is not mandatory for exporters > anymore. Also the importers could choose not to provide > struct device *dev on dma_buf_attach() if they don't call > dma_buf_map_attachment(). > > Like dma_buf_map_attachment(), the importer should firstly call > dma_buf_attach/dynamic_attach() then call dma_buf_get_pfn_unlocked(). > If the importer choose to do dynamic attach, it also should handle the > dma-buf move notification. > > Only the unlocked version of dma_buf_get_pfn is implemented for now, > just because no locked version is used for now. > > Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com> > > --- > IIUC, Only get_pfn() is needed but no put_pfn(). The whole dma-buf is > de/referenced at dma-buf attach/detach time. > > Specifically, for static attachment, the exporter should always make > memory resource available/pinned on first dma_buf_attach(), and > release/unpin memory resource on last dma_buf_detach(). For dynamic > attachment, the exporter could populate & invalidate the memory > resource at any time, it's OK as long as the importers follow dma-buf > move notification. So no pinning is needed for get_pfn() and no > put_pfn() is needed. > --- > drivers/dma-buf/dma-buf.c | 90 +++++++++++++++++++++++++++++++-------- > include/linux/dma-buf.h | 13 ++++++ > 2 files changed, 86 insertions(+), 17 deletions(-) > > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c > index 7eeee3a38202..83d1448b6dcc 100644 > --- a/drivers/dma-buf/dma-buf.c > +++ b/drivers/dma-buf/dma-buf.c > @@ -630,10 +630,10 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info) > size_t alloc_size = sizeof(struct dma_buf); > int ret; > > - if (WARN_ON(!exp_info->priv || !exp_info->ops > - || !exp_info->ops->map_dma_buf > - || !exp_info->ops->unmap_dma_buf > - || !exp_info->ops->release)) > + if (WARN_ON(!exp_info->priv || !exp_info->ops || > + (!!exp_info->ops->map_dma_buf != !!exp_info->ops->unmap_dma_buf) || > + (!exp_info->ops->map_dma_buf && !exp_info->ops->get_pfn) || > + !exp_info->ops->release)) > return ERR_PTR(-EINVAL); > > if (WARN_ON(exp_info->ops->cache_sgt_mapping && > @@ -909,7 +909,10 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev, > struct dma_buf_attachment *attach; > int ret; > > - if (WARN_ON(!dmabuf || !dev)) > + if (WARN_ON(!dmabuf)) > + return ERR_PTR(-EINVAL); > + > + if (WARN_ON(dmabuf->ops->map_dma_buf && !dev)) > return ERR_PTR(-EINVAL); > > if (WARN_ON(importer_ops && !importer_ops->move_notify)) > @@ -941,7 +944,7 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev, > */ > if (dma_buf_attachment_is_dynamic(attach) != > dma_buf_is_dynamic(dmabuf)) { > - struct sg_table *sgt; > + struct sg_table *sgt = NULL; > > dma_resv_lock(attach->dmabuf->resv, NULL); > if (dma_buf_is_dynamic(attach->dmabuf)) { > @@ -950,13 +953,16 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev, > goto err_unlock; > } > > - sgt = __map_dma_buf(attach, DMA_BIDIRECTIONAL); > - if (!sgt) > - sgt = ERR_PTR(-ENOMEM); > - if (IS_ERR(sgt)) { > - ret = PTR_ERR(sgt); > - goto err_unpin; > + if (dmabuf->ops->map_dma_buf) { > + sgt = __map_dma_buf(attach, DMA_BIDIRECTIONAL); > + if (!sgt) > + sgt = ERR_PTR(-ENOMEM); > + if (IS_ERR(sgt)) { > + ret = PTR_ERR(sgt); > + goto err_unpin; > + } > } > + > dma_resv_unlock(attach->dmabuf->resv); > attach->sgt = sgt; > attach->dir = DMA_BIDIRECTIONAL; > @@ -1119,7 +1125,8 @@ struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *attach, > > might_sleep(); > > - if (WARN_ON(!attach || !attach->dmabuf)) > + if (WARN_ON(!attach || !attach->dmabuf || > + !attach->dmabuf->ops->map_dma_buf)) > return ERR_PTR(-EINVAL); > > dma_resv_assert_held(attach->dmabuf->resv); > @@ -1195,7 +1202,8 @@ dma_buf_map_attachment_unlocked(struct dma_buf_attachment *attach, > > might_sleep(); > > - if (WARN_ON(!attach || !attach->dmabuf)) > + if (WARN_ON(!attach || !attach->dmabuf || > + !attach->dmabuf->ops->map_dma_buf)) > return ERR_PTR(-EINVAL); > > dma_resv_lock(attach->dmabuf->resv, NULL); > @@ -1222,7 +1230,8 @@ void dma_buf_unmap_attachment(struct dma_buf_attachment *attach, > { > might_sleep(); > > - if (WARN_ON(!attach || !attach->dmabuf || !sg_table)) > + if (WARN_ON(!attach || !attach->dmabuf || > + !attach->dmabuf->ops->unmap_dma_buf || !sg_table)) > return; > > dma_resv_assert_held(attach->dmabuf->resv); > @@ -1254,7 +1263,8 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach, > { > might_sleep(); > > - if (WARN_ON(!attach || !attach->dmabuf || !sg_table)) > + if (WARN_ON(!attach || !attach->dmabuf || > + !attach->dmabuf->ops->unmap_dma_buf || !sg_table)) > return; > > dma_resv_lock(attach->dmabuf->resv, NULL); > @@ -1263,6 +1273,52 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach, > } > EXPORT_SYMBOL_NS_GPL(dma_buf_unmap_attachment_unlocked, "DMA_BUF"); > > +/** > + * dma_buf_get_pfn_unlocked - > + * @attach: [in] attachment to get pfn from > + * @pgoff: [in] page offset of the buffer against the start of dma_buf > + * @pfn: [out] returns the pfn of the buffer > + * @max_order [out] returns the max mapping order of the buffer > + */ > +int dma_buf_get_pfn_unlocked(struct dma_buf_attachment *attach, > + pgoff_t pgoff, u64 *pfn, int *max_order) > +{ > + struct dma_buf *dmabuf = attach->dmabuf; > + int ret; > + > + if (WARN_ON(!attach || !attach->dmabuf || > + !attach->dmabuf->ops->get_pfn)) > + return -EINVAL; > + > + /* > + * Open: > + * > + * When dma_buf is dynamic but dma_buf move is disabled, the buffer > + * should be pinned before use, See dma_buf_map_attachment() for > + * reference. > + * > + * But for now no pin is intended inside dma_buf_get_pfn(), otherwise > + * need another API to unpin the dma_buf. So just fail out this case. > + */ > + if (dma_buf_is_dynamic(attach->dmabuf) && > + !IS_ENABLED(CONFIG_DMABUF_MOVE_NOTIFY)) > + return -ENOENT; > + > + dma_resv_lock(attach->dmabuf->resv, NULL); > + ret = dmabuf->ops->get_pfn(attach, pgoff, pfn, max_order); > + /* > + * Open: > + * > + * Is dma_resv_wait_timeout() needed? I assume no. The DMA buffer > + * content synchronization could be done when the buffer is to be > + * mapped by importer. > + */ > + dma_resv_unlock(attach->dmabuf->resv); > + > + return ret; > +} > +EXPORT_SYMBOL_NS_GPL(dma_buf_get_pfn_unlocked, "DMA_BUF"); > + > /** > * dma_buf_move_notify - notify attachments that DMA-buf is moving > * > @@ -1662,7 +1718,7 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused) > attach_count = 0; > > list_for_each_entry(attach_obj, &buf_obj->attachments, node) { > - seq_printf(s, "\t%s\n", dev_name(attach_obj->dev)); > + seq_printf(s, "\t%s\n", attach_obj->dev ? dev_name(attach_obj->dev) : NULL); > attach_count++; > } > dma_resv_unlock(buf_obj->resv); > diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h > index 36216d28d8bd..b16183edfb3a 100644 > --- a/include/linux/dma-buf.h > +++ b/include/linux/dma-buf.h > @@ -194,6 +194,17 @@ struct dma_buf_ops { > * if the call would block. > */ > > + /** > + * @get_pfn: > + * > + * This is called by dma_buf_get_pfn(). It is used to get the pfn > + * of the buffer positioned by the page offset against the start of > + * the dma_buf. It can only be called if @attach has been called > + * successfully. > + */ > + int (*get_pfn)(struct dma_buf_attachment *attach, pgoff_t pgoff, > + u64 *pfn, int *max_order); > + > /** > * @release: > * > @@ -629,6 +640,8 @@ dma_buf_map_attachment_unlocked(struct dma_buf_attachment *attach, > void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach, > struct sg_table *sg_table, > enum dma_data_direction direction); > +int dma_buf_get_pfn_unlocked(struct dma_buf_attachment *attach, > + pgoff_t pgoff, u64 *pfn, int *max_order); > > int dma_buf_mmap(struct dma_buf *, struct vm_area_struct *, > unsigned long);
On Wed, Jan 08, 2025 at 09:01:46AM +0100, Christian König wrote: > Am 07.01.25 um 15:27 schrieb Xu Yilun: > > Introduce a new API for dma-buf importer, also add a dma_buf_ops > > callback for dma-buf exporter. This API is for subsystem importers who > > map the dma-buf to some user defined address space, e.g. for IOMMUFD to > > map the dma-buf to userspace IOVA via IOMMU page table, or for KVM to > > map the dma-buf to GPA via KVM MMU (e.g. EPT). > > > > Currently dma-buf is only used to get DMA address for device's default > > domain by using kernel DMA APIs. But for these new use-cases, importers > > only need the pfn of the dma-buf resource to build their own mapping > > tables. > > As far as I can see I have to fundamentally reject this whole approach. > > It's intentional DMA-buf design that we don't expose struct pages nor PFNs > to the importer. Essentially DMA-buf only transports DMA addresses. > > In other words the mapping is done by the exporter and *not* the importer. > > What we certainly can do is to annotate those DMA addresses to a better > specify in which domain they are applicable, e.g. if they are PCIe bus > addresses or some inter device bus addresses etc... > > But moving the functionality to map the pages/PFNs to DMA addresses into the > importer is an absolutely clear NO-GO. Oh? Having the importer do the mapping is the correct way to operate the DMA API and the new API that Leon has built to fix the scatterlist abuse in dmabuf relies on importer mapping as part of it's construction. Why on earth do you want the exporter to map? That is completely backwards and unworkable in many cases. The disfunctional P2P support in dmabuf is like that principally because of this. That said, I don't think get_pfn() is an especially good interface, but we will need to come with something that passes the physical pfn out. Jason
Am 08.01.25 um 14:23 schrieb Jason Gunthorpe: > On Wed, Jan 08, 2025 at 09:01:46AM +0100, Christian König wrote: >> Am 07.01.25 um 15:27 schrieb Xu Yilun: >>> Introduce a new API for dma-buf importer, also add a dma_buf_ops >>> callback for dma-buf exporter. This API is for subsystem importers who >>> map the dma-buf to some user defined address space, e.g. for IOMMUFD to >>> map the dma-buf to userspace IOVA via IOMMU page table, or for KVM to >>> map the dma-buf to GPA via KVM MMU (e.g. EPT). >>> >>> Currently dma-buf is only used to get DMA address for device's default >>> domain by using kernel DMA APIs. But for these new use-cases, importers >>> only need the pfn of the dma-buf resource to build their own mapping >>> tables. >> As far as I can see I have to fundamentally reject this whole approach. >> >> It's intentional DMA-buf design that we don't expose struct pages nor PFNs >> to the importer. Essentially DMA-buf only transports DMA addresses. >> >> In other words the mapping is done by the exporter and *not* the importer. >> >> What we certainly can do is to annotate those DMA addresses to a better >> specify in which domain they are applicable, e.g. if they are PCIe bus >> addresses or some inter device bus addresses etc... >> >> But moving the functionality to map the pages/PFNs to DMA addresses into the >> importer is an absolutely clear NO-GO. > Oh? > > Having the importer do the mapping is the correct way to operate the > DMA API and the new API that Leon has built to fix the scatterlist > abuse in dmabuf relies on importer mapping as part of it's > construction. Exactly on that I strongly disagree on. DMA-buf works by providing DMA addresses the importer can work with and *NOT* the underlying location of the buffer. > Why on earth do you want the exporter to map? Because the exporter owns the exported buffer and only the exporter knows to how correctly access it. > That is completely backwards and unworkable in many cases. The disfunctional P2P support > in dmabuf is like that principally because of this. No, that is exactly what we need. Using the scatterlist to transport the DMA addresses was clearly a mistake, but providing the DMA addresses by the exporter has proved many times to be the right approach. Keep in mind that the exported buffer is not necessary memory, but can also be MMIO or stuff which is only accessible through address space windows where you can't create a PFN nor struct page for. > That said, I don't think get_pfn() is an especially good interface, > but we will need to come with something that passes the physical pfn > out. No, physical pfn is absolutely not a good way of passing the location of data around because it is limited to what the CPU sees as address space. We have use cases where DMA-buf transports the location of CPU invisible data which only the involved devices can see. Regards, Christian. > > Jason
On Wed, Jan 08, 2025 at 02:44:26PM +0100, Christian König wrote: > > Having the importer do the mapping is the correct way to operate the > > DMA API and the new API that Leon has built to fix the scatterlist > > abuse in dmabuf relies on importer mapping as part of it's > > construction. > > Exactly on that I strongly disagree on. > > DMA-buf works by providing DMA addresses the importer can work with and > *NOT* the underlying location of the buffer. The expectation is that the DMA API will be used to DMA map (most) things, and the DMA API always works with a physaddr_t/pfn argument. Basically, everything that is not a private address space should be supported by improving the DMA API. We are on course for finally getting all the common cases like P2P and MMIO solved here. That alone will take care of alot. For P2P cases we are going toward (PFN + P2P source information) as input to the DMA API. The additional "P2P source information" provides a good way for co-operating drivers to represent private address spaces as well. Both importer and exporter can have full understanding what is being mapped and do the correct things, safely. So, no, we don't loose private address space support when moving to importer mapping, in fact it works better because the importer gets more information about what is going on. I have imagined a staged approach were DMABUF gets a new API that works with the new DMA API to do importer mapping with "P2P source information" and a gradual conversion. Exporter mapping falls down in too many cases already: 1) Private addresses spaces don't work fully well because many devices need some indication what address space is being used and scatter list can't really properly convey that. If the DMABUF has a mixture of CPU and private it becomes a PITA 2) Multi-path PCI can require the importer to make mapping decisions unique to the device and program device specific information for the multi-path. We are doing this in mlx5 today and have hacks because DMABUF is destroying the information the importer needs to choose the correct PCI path. 3) Importing devices need to know if they are working with PCI P2P addresses during mapping because they need to do things like turn on ATS on their DMA. As for multi-path we have the same hacks inside mlx5 today that assume DMABUFs are always P2P because we cannot determine if things are P2P or not after being DMA mapped. 4) TPH bits needs to be programmed into the importer device but are derived based on the NUMA topology of the DMA target. The importer has no idea what the DMA target actually was because the exporter mapping destroyed that information. 5) iommufd and kvm are both using CPU addresses without DMA. No exporter mapping is possible Jason
Am 08.01.25 um 15:58 schrieb Jason Gunthorpe: > On Wed, Jan 08, 2025 at 02:44:26PM +0100, Christian König wrote: > >>> Having the importer do the mapping is the correct way to operate the >>> DMA API and the new API that Leon has built to fix the scatterlist >>> abuse in dmabuf relies on importer mapping as part of it's >>> construction. >> Exactly on that I strongly disagree on. >> >> DMA-buf works by providing DMA addresses the importer can work with and >> *NOT* the underlying location of the buffer. > The expectation is that the DMA API will be used to DMA map (most) > things, and the DMA API always works with a physaddr_t/pfn > argument. Basically, everything that is not a private address space > should be supported by improving the DMA API. We are on course for > finally getting all the common cases like P2P and MMIO solved > here. That alone will take care of alot. Well, from experience the DMA API has failed more often than it actually worked in the way required by drivers. Especially that we tried to hide architectural complexity in there instead of properly expose limitations to drivers is not something I consider a good design approach. So I see putting even more into that extremely critical. > For P2P cases we are going toward (PFN + P2P source information) as > input to the DMA API. The additional "P2P source information" provides > a good way for co-operating drivers to represent private address > spaces as well. Both importer and exporter can have full understanding > what is being mapped and do the correct things, safely. I can say from experience that this is clearly not going to work for all use cases. It would mean that we have to pull a massive amount of driver specific functionality into the DMA API. Things like programming access windows for PCI BARs is completely driver specific and as far as I can see can't be part of the DMA API without things like callbacks. With that in mind the DMA API would become a mid layer between different drivers and that is really not something you are suggesting, isn't it? > So, no, we don't loose private address space support when moving to > importer mapping, in fact it works better because the importer gets > more information about what is going on. Well, sounds like I wasn't able to voice my concern. Let me try again: We should not give importers information they don't need. Especially not information about the backing store of buffers. So that importers get more information about what's going on is a bad thing. > I have imagined a staged approach were DMABUF gets a new API that > works with the new DMA API to do importer mapping with "P2P source > information" and a gradual conversion. To make it clear as maintainer of that subsystem I would reject such a step with all I have. We have already gone down that road and it didn't worked at all and was a really big pain to pull people back from it. > Exporter mapping falls down in too many cases already: > > 1) Private addresses spaces don't work fully well because many devices > need some indication what address space is being used and scatter list > can't really properly convey that. If the DMABUF has a mixture of CPU > and private it becomes a PITA Correct, yes. That's why I said that scatterlist was a bad choice for the interface. But exposing the backing store to importers and then let them do whatever they want with it sounds like an even worse idea. > 2) Multi-path PCI can require the importer to make mapping decisions > unique to the device and program device specific information for the > multi-path. We are doing this in mlx5 today and have hacks because > DMABUF is destroying the information the importer needs to choose the > correct PCI path. That's why the exporter gets the struct device of the importer so that it can plan how those accesses are made. Where exactly is the problem with that? When you have an use case which is not covered by the existing DMA-buf interfaces then please voice that to me and other maintainers instead of implementing some hack. > 3) Importing devices need to know if they are working with PCI P2P > addresses during mapping because they need to do things like turn on > ATS on their DMA. As for multi-path we have the same hacks inside mlx5 > today that assume DMABUFs are always P2P because we cannot determine > if things are P2P or not after being DMA mapped. Why would you need ATS on PCI P2P and not for system memory accesses? > 4) TPH bits needs to be programmed into the importer device but are > derived based on the NUMA topology of the DMA target. The importer has > no idea what the DMA target actually was because the exporter mapping > destroyed that information. Yeah, but again that is completely intentional. I assume you mean TLP processing hints when you say TPH and those should be part of the DMA addresses provided by the exporter. That an importer tries to look behind the curtain and determines the NUMA placement and topology themselves is clearly a no-go from the design perspective. > 5) iommufd and kvm are both using CPU addresses without DMA. No > exporter mapping is possible We have customers using both KVM and XEN with DMA-buf, so I can clearly confirm that this isn't true. Regards, Christian. > > Jason
On Wed, Jan 08, 2025 at 04:25:54PM +0100, Christian König wrote: > Am 08.01.25 um 15:58 schrieb Jason Gunthorpe: > > On Wed, Jan 08, 2025 at 02:44:26PM +0100, Christian König wrote: > > > > > > Having the importer do the mapping is the correct way to operate the > > > > DMA API and the new API that Leon has built to fix the scatterlist > > > > abuse in dmabuf relies on importer mapping as part of it's > > > > construction. > > > Exactly on that I strongly disagree on. > > > > > > DMA-buf works by providing DMA addresses the importer can work with and > > > *NOT* the underlying location of the buffer. > > The expectation is that the DMA API will be used to DMA map (most) > > things, and the DMA API always works with a physaddr_t/pfn > > argument. Basically, everything that is not a private address space > > should be supported by improving the DMA API. We are on course for > > finally getting all the common cases like P2P and MMIO solved > > here. That alone will take care of alot. > > Well, from experience the DMA API has failed more often than it actually > worked in the way required by drivers. The DMA API has been static and very hard to change in these ways for a long time. I think Leon's new API will break through this and we will be able finally address these issues. > > For P2P cases we are going toward (PFN + P2P source information) as > > input to the DMA API. The additional "P2P source information" provides > > a good way for co-operating drivers to represent private address > > spaces as well. Both importer and exporter can have full understanding > > what is being mapped and do the correct things, safely. > > I can say from experience that this is clearly not going to work for all use > cases. > > It would mean that we have to pull a massive amount of driver specific > functionality into the DMA API. That isn't what I mean. There are two distinct parts, the means to describe the source (PFN + P2P source information) that is compatible with the DMA API, and the DMA API itself that works with a few general P2P source information types. Private source information would be detected by co-operating drivers and go down driver private paths. It would be rejected by other drivers. This broadly follows how the new API is working. So here I mean you can use the same PFN + Source API between importer and exporter and the importer can simply detect the special source and do the private stuff. It is not shifting things under the DMA API, it is building along side it using compatible design approaches. You would match the source information, cast it to a driver structure, do whatever driver math is needed to compute the local DMA address and then write it to the device. Nothing is hard or "not going to work" here. > > So, no, we don't loose private address space support when moving to > > importer mapping, in fact it works better because the importer gets > > more information about what is going on. > > Well, sounds like I wasn't able to voice my concern. Let me try again: > > We should not give importers information they don't need. Especially not > information about the backing store of buffers. > > So that importers get more information about what's going on is a bad thing. I strongly disagree because we are suffering today in mlx5 because of this viewpoint. You cannot predict in advance what importers are going to need. I already listed many examples where it does not work today as is. > > I have imagined a staged approach were DMABUF gets a new API that > > works with the new DMA API to do importer mapping with "P2P source > > information" and a gradual conversion. > > To make it clear as maintainer of that subsystem I would reject such a step > with all I have. This is unexpected, so you want to just leave dmabuf broken? Do you have any plan to fix it, to fix the misuse of the DMA API, and all the problems I listed below? This is a big deal, it is causing real problems today. If it going to be like this I think we will stop trying to use dmabuf and do something simpler for vfio/kvm/iommufd :( > We have already gone down that road and it didn't worked at all and > was a really big pain to pull people back from it. Nobody has really seriously tried to improve the DMA API before, so I don't think this is true at all. > > Exporter mapping falls down in too many cases already: > > > > 1) Private addresses spaces don't work fully well because many devices > > need some indication what address space is being used and scatter list > > can't really properly convey that. If the DMABUF has a mixture of CPU > > and private it becomes a PITA > > Correct, yes. That's why I said that scatterlist was a bad choice for the > interface. > > But exposing the backing store to importers and then let them do whatever > they want with it sounds like an even worse idea. You keep saying this without real justification. To me it is a nanny style of API design. But also I don't see how you can possibly fix the above without telling the importer alot more information. > > 2) Multi-path PCI can require the importer to make mapping decisions > > unique to the device and program device specific information for the > > multi-path. We are doing this in mlx5 today and have hacks because > > DMABUF is destroying the information the importer needs to choose the > > correct PCI path. > > That's why the exporter gets the struct device of the importer so that it > can plan how those accesses are made. Where exactly is the problem with > that? A single struct device does not convey the multipath options. We have multiple struct devices (and multiple PCI endpoints) doing DMA concurrently under one driver. Multipath always needs additional meta information in the importer side to tell the device which path to select. A naked dma address is not sufficient. Today we guess that DMABUF will be using P2P and hack to choose a P2P struct device to pass the exporter. We need to know what is in the dmabuf before we can choose which of the multiple struct devices the driver has to use for DMA mapping. But even simple CPU centric cases we will eventually want to select the proper NUMA local PCI channel matching struct device for CPU only buffers. > When you have an use case which is not covered by the existing DMA-buf > interfaces then please voice that to me and other maintainers instead of > implementing some hack. Do you have any suggestion for any of this then? We have a good plan to fix this stuff and more. Many experts in their fields have agreed on the different parts now. We haven't got to dmabuf because I had no idea there would be an objection like this. > > 3) Importing devices need to know if they are working with PCI P2P > > addresses during mapping because they need to do things like turn on > > ATS on their DMA. As for multi-path we have the same hacks inside mlx5 > > today that assume DMABUFs are always P2P because we cannot determine > > if things are P2P or not after being DMA mapped. > > Why would you need ATS on PCI P2P and not for system memory accesses? ATS has a significant performance cost. It is mandatory for PCI P2P, but ideally should be avoided for CPU memory. > > 4) TPH bits needs to be programmed into the importer device but are > > derived based on the NUMA topology of the DMA target. The importer has > > no idea what the DMA target actually was because the exporter mapping > > destroyed that information. > > Yeah, but again that is completely intentional. > > I assume you mean TLP processing hints when you say TPH and those should be > part of the DMA addresses provided by the exporter. Yes, but is not part of the DMA addresses. > That an importer tries to look behind the curtain and determines the NUMA > placement and topology themselves is clearly a no-go from the design > perspective. I strongly disagree, this is important. Drivers need this information in a future TPH/UIO/multipath PCI world. > > 5) iommufd and kvm are both using CPU addresses without DMA. No > > exporter mapping is possible > > We have customers using both KVM and XEN with DMA-buf, so I can clearly > confirm that this isn't true. Today they are mmaping the dma-buf into a VMA and then using KVM's follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable dma-buf must have a CPU PFN. Here Xu implements basically the same path, except without the VMA indirection, and it suddenly not OK? Illogical. Jason
On Wed, Jan 08, 2025 at 12:22:27PM -0400, Jason Gunthorpe wrote: > On Wed, Jan 08, 2025 at 04:25:54PM +0100, Christian König wrote: > > Am 08.01.25 um 15:58 schrieb Jason Gunthorpe: > > > I have imagined a staged approach were DMABUF gets a new API that > > > works with the new DMA API to do importer mapping with "P2P source > > > information" and a gradual conversion. > > > > To make it clear as maintainer of that subsystem I would reject such a step > > with all I have. > > This is unexpected, so you want to just leave dmabuf broken? Do you > have any plan to fix it, to fix the misuse of the DMA API, and all > the problems I listed below? This is a big deal, it is causing real > problems today. > > If it going to be like this I think we will stop trying to use dmabuf > and do something simpler for vfio/kvm/iommufd :( As the gal who help edit the og dma-buf spec 13 years ago, I think adding pfn isn't a terrible idea. By design, dma-buf is the "everything is optional" interface. And in the beginning, even consistent locking was optional, but we've managed to fix that by now :-/ Where I do agree with Christian is that stuffing pfn support into the dma_buf_attachment interfaces feels a bit much wrong. > > We have already gone down that road and it didn't worked at all and > > was a really big pain to pull people back from it. > > Nobody has really seriously tried to improve the DMA API before, so I > don't think this is true at all. Aside, I really hope this finally happens! > > > 3) Importing devices need to know if they are working with PCI P2P > > > addresses during mapping because they need to do things like turn on > > > ATS on their DMA. As for multi-path we have the same hacks inside mlx5 > > > today that assume DMABUFs are always P2P because we cannot determine > > > if things are P2P or not after being DMA mapped. > > > > Why would you need ATS on PCI P2P and not for system memory accesses? > > ATS has a significant performance cost. It is mandatory for PCI P2P, > but ideally should be avoided for CPU memory. Huh, I didn't know that. And yeah kinda means we've butchered the pci p2p stuff a bit I guess ... > > > 5) iommufd and kvm are both using CPU addresses without DMA. No > > > exporter mapping is possible > > > > We have customers using both KVM and XEN with DMA-buf, so I can clearly > > confirm that this isn't true. > > Today they are mmaping the dma-buf into a VMA and then using KVM's > follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable > dma-buf must have a CPU PFN. > > Here Xu implements basically the same path, except without the VMA > indirection, and it suddenly not OK? Illogical. So the big difference is that for follow_pfn() you need mmu_notifier since the mmap might move around, whereas with pfn smashed into dma_buf_attachment you need dma_resv_lock rules, and the move_notify callback if you go dynamic. So I guess my first question is, which locking rules do you want here for pfn importers? If mmu notifiers is fine, then I think the current approach of follow_pfn should be ok. But if you instead dma_resv_lock rules (or the cpu mmap somehow is an issue itself), then I think the clean design is create a new separate access mechanism just for that. It would be the 5th or so (kernel vmap, userspace mmap, dma_buf_attach and driver private stuff like virtio_dma_buf.c where you access your buffer with a uuid), so really not a big deal. And for non-contrived exporters we might be able to implement the other access methods in terms of the pfn method generically, so this wouldn't even be a terrible maintenance burden going forward. And meanwhile all the contrived exporters just keep working as-is. The other part is that cpu mmap is optional, and there's plenty of strange exporters who don't implement. But you can dma map the attachment into plenty devices. This tends to mostly be a thing on SoC devices with some very funky memory. But I guess you don't care about these use-case, so should be ok. I couldn't come up with a good name for these pfn users, maybe dma_buf_pfn_attachment? This does _not_ have a struct device, but maybe some of these new p2p source specifiers (or a list of those which are allowed, no idea how this would need to fit into the new dma api). Cheers, Sima
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index 7eeee3a38202..83d1448b6dcc 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -630,10 +630,10 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info) size_t alloc_size = sizeof(struct dma_buf); int ret; - if (WARN_ON(!exp_info->priv || !exp_info->ops - || !exp_info->ops->map_dma_buf - || !exp_info->ops->unmap_dma_buf - || !exp_info->ops->release)) + if (WARN_ON(!exp_info->priv || !exp_info->ops || + (!!exp_info->ops->map_dma_buf != !!exp_info->ops->unmap_dma_buf) || + (!exp_info->ops->map_dma_buf && !exp_info->ops->get_pfn) || + !exp_info->ops->release)) return ERR_PTR(-EINVAL); if (WARN_ON(exp_info->ops->cache_sgt_mapping && @@ -909,7 +909,10 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev, struct dma_buf_attachment *attach; int ret; - if (WARN_ON(!dmabuf || !dev)) + if (WARN_ON(!dmabuf)) + return ERR_PTR(-EINVAL); + + if (WARN_ON(dmabuf->ops->map_dma_buf && !dev)) return ERR_PTR(-EINVAL); if (WARN_ON(importer_ops && !importer_ops->move_notify)) @@ -941,7 +944,7 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev, */ if (dma_buf_attachment_is_dynamic(attach) != dma_buf_is_dynamic(dmabuf)) { - struct sg_table *sgt; + struct sg_table *sgt = NULL; dma_resv_lock(attach->dmabuf->resv, NULL); if (dma_buf_is_dynamic(attach->dmabuf)) { @@ -950,13 +953,16 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev, goto err_unlock; } - sgt = __map_dma_buf(attach, DMA_BIDIRECTIONAL); - if (!sgt) - sgt = ERR_PTR(-ENOMEM); - if (IS_ERR(sgt)) { - ret = PTR_ERR(sgt); - goto err_unpin; + if (dmabuf->ops->map_dma_buf) { + sgt = __map_dma_buf(attach, DMA_BIDIRECTIONAL); + if (!sgt) + sgt = ERR_PTR(-ENOMEM); + if (IS_ERR(sgt)) { + ret = PTR_ERR(sgt); + goto err_unpin; + } } + dma_resv_unlock(attach->dmabuf->resv); attach->sgt = sgt; attach->dir = DMA_BIDIRECTIONAL; @@ -1119,7 +1125,8 @@ struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *attach, might_sleep(); - if (WARN_ON(!attach || !attach->dmabuf)) + if (WARN_ON(!attach || !attach->dmabuf || + !attach->dmabuf->ops->map_dma_buf)) return ERR_PTR(-EINVAL); dma_resv_assert_held(attach->dmabuf->resv); @@ -1195,7 +1202,8 @@ dma_buf_map_attachment_unlocked(struct dma_buf_attachment *attach, might_sleep(); - if (WARN_ON(!attach || !attach->dmabuf)) + if (WARN_ON(!attach || !attach->dmabuf || + !attach->dmabuf->ops->map_dma_buf)) return ERR_PTR(-EINVAL); dma_resv_lock(attach->dmabuf->resv, NULL); @@ -1222,7 +1230,8 @@ void dma_buf_unmap_attachment(struct dma_buf_attachment *attach, { might_sleep(); - if (WARN_ON(!attach || !attach->dmabuf || !sg_table)) + if (WARN_ON(!attach || !attach->dmabuf || + !attach->dmabuf->ops->unmap_dma_buf || !sg_table)) return; dma_resv_assert_held(attach->dmabuf->resv); @@ -1254,7 +1263,8 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach, { might_sleep(); - if (WARN_ON(!attach || !attach->dmabuf || !sg_table)) + if (WARN_ON(!attach || !attach->dmabuf || + !attach->dmabuf->ops->unmap_dma_buf || !sg_table)) return; dma_resv_lock(attach->dmabuf->resv, NULL); @@ -1263,6 +1273,52 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach, } EXPORT_SYMBOL_NS_GPL(dma_buf_unmap_attachment_unlocked, "DMA_BUF"); +/** + * dma_buf_get_pfn_unlocked - + * @attach: [in] attachment to get pfn from + * @pgoff: [in] page offset of the buffer against the start of dma_buf + * @pfn: [out] returns the pfn of the buffer + * @max_order [out] returns the max mapping order of the buffer + */ +int dma_buf_get_pfn_unlocked(struct dma_buf_attachment *attach, + pgoff_t pgoff, u64 *pfn, int *max_order) +{ + struct dma_buf *dmabuf = attach->dmabuf; + int ret; + + if (WARN_ON(!attach || !attach->dmabuf || + !attach->dmabuf->ops->get_pfn)) + return -EINVAL; + + /* + * Open: + * + * When dma_buf is dynamic but dma_buf move is disabled, the buffer + * should be pinned before use, See dma_buf_map_attachment() for + * reference. + * + * But for now no pin is intended inside dma_buf_get_pfn(), otherwise + * need another API to unpin the dma_buf. So just fail out this case. + */ + if (dma_buf_is_dynamic(attach->dmabuf) && + !IS_ENABLED(CONFIG_DMABUF_MOVE_NOTIFY)) + return -ENOENT; + + dma_resv_lock(attach->dmabuf->resv, NULL); + ret = dmabuf->ops->get_pfn(attach, pgoff, pfn, max_order); + /* + * Open: + * + * Is dma_resv_wait_timeout() needed? I assume no. The DMA buffer + * content synchronization could be done when the buffer is to be + * mapped by importer. + */ + dma_resv_unlock(attach->dmabuf->resv); + + return ret; +} +EXPORT_SYMBOL_NS_GPL(dma_buf_get_pfn_unlocked, "DMA_BUF"); + /** * dma_buf_move_notify - notify attachments that DMA-buf is moving * @@ -1662,7 +1718,7 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused) attach_count = 0; list_for_each_entry(attach_obj, &buf_obj->attachments, node) { - seq_printf(s, "\t%s\n", dev_name(attach_obj->dev)); + seq_printf(s, "\t%s\n", attach_obj->dev ? dev_name(attach_obj->dev) : NULL); attach_count++; } dma_resv_unlock(buf_obj->resv); diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 36216d28d8bd..b16183edfb3a 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -194,6 +194,17 @@ struct dma_buf_ops { * if the call would block. */ + /** + * @get_pfn: + * + * This is called by dma_buf_get_pfn(). It is used to get the pfn + * of the buffer positioned by the page offset against the start of + * the dma_buf. It can only be called if @attach has been called + * successfully. + */ + int (*get_pfn)(struct dma_buf_attachment *attach, pgoff_t pgoff, + u64 *pfn, int *max_order); + /** * @release: * @@ -629,6 +640,8 @@ dma_buf_map_attachment_unlocked(struct dma_buf_attachment *attach, void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach, struct sg_table *sg_table, enum dma_data_direction direction); +int dma_buf_get_pfn_unlocked(struct dma_buf_attachment *attach, + pgoff_t pgoff, u64 *pfn, int *max_order); int dma_buf_mmap(struct dma_buf *, struct vm_area_struct *, unsigned long);
Introduce a new API for dma-buf importer, also add a dma_buf_ops callback for dma-buf exporter. This API is for subsystem importers who map the dma-buf to some user defined address space, e.g. for IOMMUFD to map the dma-buf to userspace IOVA via IOMMU page table, or for KVM to map the dma-buf to GPA via KVM MMU (e.g. EPT). Currently dma-buf is only used to get DMA address for device's default domain by using kernel DMA APIs. But for these new use-cases, importers only need the pfn of the dma-buf resource to build their own mapping tables. So the map_dma_buf() callback is not mandatory for exporters anymore. Also the importers could choose not to provide struct device *dev on dma_buf_attach() if they don't call dma_buf_map_attachment(). Like dma_buf_map_attachment(), the importer should firstly call dma_buf_attach/dynamic_attach() then call dma_buf_get_pfn_unlocked(). If the importer choose to do dynamic attach, it also should handle the dma-buf move notification. Only the unlocked version of dma_buf_get_pfn is implemented for now, just because no locked version is used for now. Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com> --- IIUC, Only get_pfn() is needed but no put_pfn(). The whole dma-buf is de/referenced at dma-buf attach/detach time. Specifically, for static attachment, the exporter should always make memory resource available/pinned on first dma_buf_attach(), and release/unpin memory resource on last dma_buf_detach(). For dynamic attachment, the exporter could populate & invalidate the memory resource at any time, it's OK as long as the importers follow dma-buf move notification. So no pinning is needed for get_pfn() and no put_pfn() is needed. --- drivers/dma-buf/dma-buf.c | 90 +++++++++++++++++++++++++++++++-------- include/linux/dma-buf.h | 13 ++++++ 2 files changed, 86 insertions(+), 17 deletions(-)