diff mbox series

[v2,4/4] vfio/pci: Allow MMIO regions to be exported through dma-buf

Message ID 4-v2-472615b3877e+28f7-vfio_dma_buf_jgg@nvidia.com (mailing list archive)
State New, archived
Headers show
Series Allow MMIO regions to be exported through dma-buf | expand

Commit Message

Jason Gunthorpe Aug. 31, 2022, 11:12 p.m. UTC
dma-buf has become a way to safely acquire a handle to non-struct page
memory that can still have lifetime controlled by the exporter. Notably
RDMA can now import dma-buf FDs and build them into MRs which allows for
PCI P2P operations. Extend this to allow vfio-pci to export MMIO memory
from PCI device BARs.

The patch design loosely follows the pattern in commit
db1a8dd916aa ("habanalabs: add support for dma-buf exporter") except this
does not support pinning.

Instead, this implements what, in the past, we've called a revocable
attachment using move. In normal situations the attachment is pinned, as a
BAR does not change physical address. However when the VFIO device is
closed, or a PCI reset is issued, access to the MMIO memory is revoked.

Revoked means that move occurs, but an attempt to immediately re-map the
memory will fail. In the reset case a future move will be triggered when
MMIO access returns. As both close and reset are under userspace control
it is expected that userspace will suspend use of the dma-buf before doing
these operations, the revoke is purely for kernel self-defense against a
hostile userspace.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/vfio/pci/Makefile          |   1 +
 drivers/vfio/pci/dma_buf.c         | 269 +++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_config.c |   8 +-
 drivers/vfio/pci/vfio_pci_core.c   |  28 ++-
 drivers/vfio/pci/vfio_pci_priv.h   |  23 +++
 include/linux/vfio_pci_core.h      |   1 +
 include/uapi/linux/vfio.h          |  18 ++
 7 files changed, 340 insertions(+), 8 deletions(-)
 create mode 100644 drivers/vfio/pci/dma_buf.c

Comments

Christoph Hellwig Sept. 6, 2022, 9:51 a.m. UTC | #1
> +{
> +	struct vfio_pci_dma_buf *priv = dmabuf->priv;
> +	int rc;
> +
> +	rc = pci_p2pdma_distance_many(priv->vdev->pdev, &attachment->dev, 1,
> +				      true);

This should just use pci_p2pdma_distance.

> +	/*
> +	 * Since the memory being mapped is a device memory it could never be in
> +	 * CPU caches.
> +	 */

DMA_ATTR_SKIP_CPU_SYNC doesn't even apply to dma_map_resource, not sure
where this wisdom comes from.

> +	dma_addr = dma_map_resource(
> +		attachment->dev,
> +		pci_resource_start(priv->vdev->pdev, priv->index) +
> +			priv->offset,
> +		priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);

This is not how P2P addresses are mapped.  You need to use
dma_map_sgtable and have the proper pgmap for it.

The above is just a badly implemented version of the dma-direct
PCI_P2PDMA_MAP_BUS_ADDR case, ignoring mappings through the host
bridge or dma-map-ops interactions.
Christian König Sept. 6, 2022, 10:38 a.m. UTC | #2
Am 06.09.22 um 11:51 schrieb Christoph Hellwig:
>> +{
>> +	struct vfio_pci_dma_buf *priv = dmabuf->priv;
>> +	int rc;
>> +
>> +	rc = pci_p2pdma_distance_many(priv->vdev->pdev, &attachment->dev, 1,
>> +				      true);
> This should just use pci_p2pdma_distance.
>
>> +	/*
>> +	 * Since the memory being mapped is a device memory it could never be in
>> +	 * CPU caches.
>> +	 */
> DMA_ATTR_SKIP_CPU_SYNC doesn't even apply to dma_map_resource, not sure
> where this wisdom comes from.
>
>> +	dma_addr = dma_map_resource(
>> +		attachment->dev,
>> +		pci_resource_start(priv->vdev->pdev, priv->index) +
>> +			priv->offset,
>> +		priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);
> This is not how P2P addresses are mapped.  You need to use
> dma_map_sgtable and have the proper pgmap for it.

The problem is once more that this is MMIO space, in other words 
register BARs which needs to be exported/imported.

Adding struct pages for it generally sounds like the wrong approach 
here. You can't even access this with the CPU or would trigger 
potentially unwanted hardware actions.

Would you mind if I start to tackle this problem?

Regards,
Christian.

>
> The above is just a badly implemented version of the dma-direct
> PCI_P2PDMA_MAP_BUS_ADDR case, ignoring mappings through the host
> bridge or dma-map-ops interactions.
Jason Gunthorpe Sept. 6, 2022, 11:48 a.m. UTC | #3
On Tue, Sep 06, 2022 at 12:38:44PM +0200, Christian König wrote:
> Am 06.09.22 um 11:51 schrieb Christoph Hellwig:
> > > +{
> > > +	struct vfio_pci_dma_buf *priv = dmabuf->priv;
> > > +	int rc;
> > > +
> > > +	rc = pci_p2pdma_distance_many(priv->vdev->pdev, &attachment->dev, 1,
> > > +				      true);
> > This should just use pci_p2pdma_distance.

OK

> > > +	/*
> > > +	 * Since the memory being mapped is a device memory it could never be in
> > > +	 * CPU caches.
> > > +	 */
> > DMA_ATTR_SKIP_CPU_SYNC doesn't even apply to dma_map_resource, not sure
> > where this wisdom comes from.

Habana driver

> > > +	dma_addr = dma_map_resource(
> > > +		attachment->dev,
> > > +		pci_resource_start(priv->vdev->pdev, priv->index) +
> > > +			priv->offset,
> > > +		priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);
> > This is not how P2P addresses are mapped.  You need to use
> > dma_map_sgtable and have the proper pgmap for it.
> 
> The problem is once more that this is MMIO space, in other words register
> BARs which needs to be exported/imported.
> 
> Adding struct pages for it generally sounds like the wrong approach here.
> You can't even access this with the CPU or would trigger potentially
> unwanted hardware actions.

Right, this whole thing is the "standard" that dmabuf has adopted
instead of the struct pages. Once the AMD GPU driver started doing
this some time ago other drivers followed.

Now we have struct pages, almost, but I'm not sure if their limits are
compatible with VFIO? This has to work for small bars as well.

Jason
Oded Gabbay Sept. 6, 2022, 12:34 p.m. UTC | #4
On Tue, Sep 6, 2022 at 2:48 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Sep 06, 2022 at 12:38:44PM +0200, Christian König wrote:
> > Am 06.09.22 um 11:51 schrieb Christoph Hellwig:
> > > > +{
> > > > + struct vfio_pci_dma_buf *priv = dmabuf->priv;
> > > > + int rc;
> > > > +
> > > > + rc = pci_p2pdma_distance_many(priv->vdev->pdev, &attachment->dev, 1,
> > > > +                               true);
> > > This should just use pci_p2pdma_distance.
>
> OK
>
> > > > + /*
> > > > +  * Since the memory being mapped is a device memory it could never be in
> > > > +  * CPU caches.
> > > > +  */
> > > DMA_ATTR_SKIP_CPU_SYNC doesn't even apply to dma_map_resource, not sure
> > > where this wisdom comes from.
>
> Habana driver
I hate to throw the ball at someone else, but I actually copied the
code from the amdgpu driver, from amdgpu_vram_mgr_alloc_sgt() iirc.
And if you remember Jason, you asked why we use this specific define
in the original review you did and I replied the following (to which
you agreed and that's why we added the comment):

"The memory behind this specific dma-buf has *always* resided on the
device itself, i.e. it lives only in the 'device' domain (after all,
it maps a PCI bar address which points to the device memory).
Therefore, it was never in the 'CPU' domain and hence, there is no
need to perform a sync of the memory to the CPU's cache, as it was
never inside that cache to begin with.

This is not the same case as with regular memory which is dma-mapped
and then copied into the device using a dma engine. In that case,
the memory started in the 'CPU' domain and moved to the 'device'
domain. When it is unmapped it will indeed be recycled to be used
for another purpose and therefore we need to sync the CPU cache."

Oded
>
> > > > + dma_addr = dma_map_resource(
> > > > +         attachment->dev,
> > > > +         pci_resource_start(priv->vdev->pdev, priv->index) +
> > > > +                 priv->offset,
> > > > +         priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);
> > > This is not how P2P addresses are mapped.  You need to use
> > > dma_map_sgtable and have the proper pgmap for it.
> >
> > The problem is once more that this is MMIO space, in other words register
> > BARs which needs to be exported/imported.
> >
> > Adding struct pages for it generally sounds like the wrong approach here.
> > You can't even access this with the CPU or would trigger potentially
> > unwanted hardware actions.
>
> Right, this whole thing is the "standard" that dmabuf has adopted
> instead of the struct pages. Once the AMD GPU driver started doing
> this some time ago other drivers followed.
>
> Now we have struct pages, almost, but I'm not sure if their limits are
> compatible with VFIO? This has to work for small bars as well.
>
> Jason
Jason Gunthorpe Sept. 6, 2022, 5:59 p.m. UTC | #5
On Tue, Sep 06, 2022 at 03:34:02PM +0300, Oded Gabbay wrote:

> > > > > + /*
> > > > > +  * Since the memory being mapped is a device memory it could never be in
> > > > > +  * CPU caches.
> > > > > +  */
> > > > DMA_ATTR_SKIP_CPU_SYNC doesn't even apply to dma_map_resource, not sure
> > > > where this wisdom comes from.
> >
> > Habana driver
> I hate to throw the ball at someone else, but I actually copied the
> code from the amdgpu driver, from amdgpu_vram_mgr_alloc_sgt() iirc.
> And if you remember Jason, you asked why we use this specific define
> in the original review you did and I replied the following (to which
> you agreed and that's why we added the comment):

Yes, I remember, but Christophs remark is that DMA_ATTR_SKIP_CPU_SYNC
doesn't even do anything when passed to dma_map_resource().

The only attr that seems to be used is DMA_ATTR_PRIVILEGED from what I
can see.

Jason
Oded Gabbay Sept. 6, 2022, 7:44 p.m. UTC | #6
On Tue, Sep 6, 2022 at 8:59 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Sep 06, 2022 at 03:34:02PM +0300, Oded Gabbay wrote:
>
> > > > > > + /*
> > > > > > +  * Since the memory being mapped is a device memory it could never be in
> > > > > > +  * CPU caches.
> > > > > > +  */
> > > > > DMA_ATTR_SKIP_CPU_SYNC doesn't even apply to dma_map_resource, not sure
> > > > > where this wisdom comes from.
> > >
> > > Habana driver
> > I hate to throw the ball at someone else, but I actually copied the
> > code from the amdgpu driver, from amdgpu_vram_mgr_alloc_sgt() iirc.
> > And if you remember Jason, you asked why we use this specific define
> > in the original review you did and I replied the following (to which
> > you agreed and that's why we added the comment):
>
> Yes, I remember, but Christophs remark is that DMA_ATTR_SKIP_CPU_SYNC
> doesn't even do anything when passed to dma_map_resource().
>
> The only attr that seems to be used is DMA_ATTR_PRIVILEGED from what I
> can see.
>
> Jason
Yes, it appears he is correct.
Probably what happened is that either this was originally copied from
a use of dma_map_page or something similar that does check this
attribute, or maybe dma_map_resource used it in the past and the
underlying code has changed.

Regardless, it seems we can remove it from the calls to
dma_map_resource. I went over the kernel code and it seems only habana
and amdgpu (and amdkfd) are passing this property to dma_map_resource.
All other callers just pass 0.

Oded
Christoph Hellwig Sept. 7, 2022, 12:03 p.m. UTC | #7
On Tue, Sep 06, 2022 at 12:38:44PM +0200, Christian König wrote:
> The problem is once more that this is MMIO space, in other words register
> BARs which needs to be exported/imported.

Everything used for P2P is bar space.

> Adding struct pages for it generally sounds like the wrong approach here.
> You can't even access this with the CPU or would trigger potentially
> unwanted hardware actions.

How would an access from the CPU vs anther device make any difference?

> Would you mind if I start to tackle this problem?

Yes, I've been waiting forever for someone to tacke how the scatterlist
is abused in dma-buf..
Christoph Hellwig Sept. 7, 2022, 12:05 p.m. UTC | #8
On Tue, Sep 06, 2022 at 08:48:28AM -0300, Jason Gunthorpe wrote:
> Right, this whole thing is the "standard" that dmabuf has adopted
> instead of the struct pages. Once the AMD GPU driver started doing
> this some time ago other drivers followed.

But it is simple wrong.  The scatterlist requires struct page backing.
In theory a physical address would be enough, but when Dan Williams
sent patches for that Linus shot them down.

That being said the scatterlist is the wrong interface here (and
probably for most of it's uses).  We really want a lot-level struct
with just the dma_address and length for the DMA side, and leave it
separate from that what is used to generate it (in most cases that
would be a bio_vec).

> Now we have struct pages, almost, but I'm not sure if their limits are
> compatible with VFIO? This has to work for small bars as well.

Why would small BARs be problematic for the pages?  The pages are more
a problem for gigantic BARs do the memory overhead.
Christoph Hellwig Sept. 7, 2022, 12:07 p.m. UTC | #9
On Tue, Sep 06, 2022 at 10:44:45PM +0300, Oded Gabbay wrote:
> Regardless, it seems we can remove it from the calls to
> dma_map_resource. I went over the kernel code and it seems only habana
> and amdgpu (and amdkfd) are passing this property to dma_map_resource.
> All other callers just pass 0.

What really need to happen is a revert of that entire commit, given that
dma_map_resource must not be used for PCIe P2P transactions as explained
in my previous mail.
Jason Gunthorpe Sept. 7, 2022, 12:33 p.m. UTC | #10
On Wed, Sep 07, 2022 at 05:05:57AM -0700, Christoph Hellwig wrote:
> On Tue, Sep 06, 2022 at 08:48:28AM -0300, Jason Gunthorpe wrote:
> > Right, this whole thing is the "standard" that dmabuf has adopted
> > instead of the struct pages. Once the AMD GPU driver started doing
> > this some time ago other drivers followed.
> 
> But it is simple wrong.  The scatterlist requires struct page backing.
> In theory a physical address would be enough, but when Dan Williams
> sent patches for that Linus shot them down.

Yes, you said that, and I said that when the AMD driver first merged
it - but it went in anyhow and now people are using it in a bunch of
places.

I'm happy that Christian wants to start trying to fix it, and will
help him, but it doesn't really impact this. Whatever fix is cooked up
will apply equally to vfio and habana.

> That being said the scatterlist is the wrong interface here (and
> probably for most of it's uses).  We really want a lot-level struct
> with just the dma_address and length for the DMA side, and leave it
> separate from that what is used to generate it (in most cases that
> would be a bio_vec).

Oh definitely

> > Now we have struct pages, almost, but I'm not sure if their limits are
> > compatible with VFIO? This has to work for small bars as well.
> 
> Why would small BARs be problematic for the pages?  The pages are more
> a problem for gigantic BARs do the memory overhead.

How do I get a struct page * for a 4k BAR in vfio?

The docs say:

 ..hotplug api on memory block boundaries. The implementation relies on
 this lack of user-api constraint to allow sub-section sized memory
 ranges to be specified to :c:func:`arch_add_memory`, the top-half of
 memory hotplug. Sub-section support allows for 2MB as the cross-arch
 common alignment granularity for :c:func:`devm_memremap_pages`.

Jason
Christoph Hellwig Sept. 7, 2022, 2:29 p.m. UTC | #11
On Wed, Sep 07, 2022 at 09:33:11AM -0300, Jason Gunthorpe wrote:
> Yes, you said that, and I said that when the AMD driver first merged
> it - but it went in anyhow and now people are using it in a bunch of
> places.

drm folks made up their own weird rules, if they internally stick
to it they have to listen to it given that they ignore review comments,
but it violates the scatterlist API and has not business anywhere
else in the kernel.  And yes, there probably is a reason or two why
the drm code is unusually error prone.

> > Why would small BARs be problematic for the pages?  The pages are more
> > a problem for gigantic BARs do the memory overhead.
> 
> How do I get a struct page * for a 4k BAR in vfio?

I guess we have different definitions of small then :)

But unless my understanding of the code is out out of data,
memremap_pages just requires the (virtual) start address to be 2MB
aligned, not the size.  Adding Dan for comments.

That being said, what is the point of mapping say a 4k BAR for p2p?
You're not going to save a measurable amount of CPU overhead if that
is the only place you transfer to.
Oded Gabbay Sept. 7, 2022, 2:46 p.m. UTC | #12
On Wed, Sep 7, 2022 at 5:30 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Sep 07, 2022 at 09:33:11AM -0300, Jason Gunthorpe wrote:
> > Yes, you said that, and I said that when the AMD driver first merged
> > it - but it went in anyhow and now people are using it in a bunch of
> > places.
>
> drm folks made up their own weird rules, if they internally stick
> to it they have to listen to it given that they ignore review comments,
> but it violates the scatterlist API and has not business anywhere
> else in the kernel.  And yes, there probably is a reason or two why
> the drm code is unusually error prone.
>
> > > Why would small BARs be problematic for the pages?  The pages are more
> > > a problem for gigantic BARs do the memory overhead.
> >
> > How do I get a struct page * for a 4k BAR in vfio?
>
> I guess we have different definitions of small then :)
>
> But unless my understanding of the code is out out of data,
> memremap_pages just requires the (virtual) start address to be 2MB
> aligned, not the size.  Adding Dan for comments.
>
> That being said, what is the point of mapping say a 4k BAR for p2p?
> You're not going to save a measurable amount of CPU overhead if that
> is the only place you transfer to.
I don't know what Jason had in mind, but I can see a use for that for
writing to doorbells of a device.
Today, usually what happens is that peer A reads/writes to peer B's
memory through the large bar and then signals the host the operation
was completed.
Then the host s/w writes to the doorbell of the peer B to let him know
he can continue with the execution as the data is now ready (or can be
recycled).
I can imagine peer A writing directly to the doorbell of peer B, and
usually for that we would like to expose a very small area, probably a
single 4K page.

Oded
Robin Murphy Sept. 7, 2022, 3:01 p.m. UTC | #13
On 2022-09-07 13:33, Jason Gunthorpe wrote:
> On Wed, Sep 07, 2022 at 05:05:57AM -0700, Christoph Hellwig wrote:
>> On Tue, Sep 06, 2022 at 08:48:28AM -0300, Jason Gunthorpe wrote:
>>> Right, this whole thing is the "standard" that dmabuf has adopted
>>> instead of the struct pages. Once the AMD GPU driver started doing
>>> this some time ago other drivers followed.
>>
>> But it is simple wrong.  The scatterlist requires struct page backing.
>> In theory a physical address would be enough, but when Dan Williams
>> sent patches for that Linus shot them down.
> 
> Yes, you said that, and I said that when the AMD driver first merged
> it - but it went in anyhow and now people are using it in a bunch of
> places.
> 
> I'm happy that Christian wants to start trying to fix it, and will
> help him, but it doesn't really impact this. Whatever fix is cooked up
> will apply equally to vfio and habana.

We've just added support for P2P segments in scatterlists, can that not 
be used here?

Robin.

>> That being said the scatterlist is the wrong interface here (and
>> probably for most of it's uses).  We really want a lot-level struct
>> with just the dma_address and length for the DMA side, and leave it
>> separate from that what is used to generate it (in most cases that
>> would be a bio_vec).
> 
> Oh definitely
> 
>>> Now we have struct pages, almost, but I'm not sure if their limits are
>>> compatible with VFIO? This has to work for small bars as well.
>>
>> Why would small BARs be problematic for the pages?  The pages are more
>> a problem for gigantic BARs do the memory overhead.
> 
> How do I get a struct page * for a 4k BAR in vfio?
> 
> The docs say:
> 
>   ..hotplug api on memory block boundaries. The implementation relies on
>   this lack of user-api constraint to allow sub-section sized memory
>   ranges to be specified to :c:func:`arch_add_memory`, the top-half of
>   memory hotplug. Sub-section support allows for 2MB as the cross-arch
>   common alignment granularity for :c:func:`devm_memremap_pages`.
> 
> Jason
Christian König Sept. 7, 2022, 3:08 p.m. UTC | #14
Am 07.09.22 um 14:03 schrieb Christoph Hellwig:
> On Tue, Sep 06, 2022 at 12:38:44PM +0200, Christian König wrote:
>> The problem is once more that this is MMIO space, in other words register
>> BARs which needs to be exported/imported.
> Everything used for P2P is bar space.
>
>> Adding struct pages for it generally sounds like the wrong approach here.
>> You can't even access this with the CPU or would trigger potentially
>> unwanted hardware actions.
> How would an access from the CPU vs anther device make any difference?

The key point is that you can't do any CPU fallback with this as long as 
the CPU wouldn't do exactly the same thing as the original hardware 
device. E.g. not write combine nor do any fully page copies etc...

See what happens here is not really P2P DMA transfer, but rather P2P 
signaling of events.

For a simple example think of a camera and a video codec. The camera is 
pumping video data into system memory the video codec should encode into 
an H264 stream.

So after every frame the camera hardware issues a P2P write into the BAR 
of the video codec to signal that the frame is completed and it can 
start decoding.

This is not even a memory write, but rather just some trigger event. 
That's essentially the background why I think having struct pages and 
sg_tables doesn't make any sense at all for this use case.

>> Would you mind if I start to tackle this problem?
> Yes, I've been waiting forever for someone to tacke how the scatterlist
> is abused in dma-buf..

How about we separate the scatterlist into page and DMA address container?

Regards,
Christian.
Jason Gunthorpe Sept. 7, 2022, 3:23 p.m. UTC | #15
On Wed, Sep 07, 2022 at 07:29:58AM -0700, Christoph Hellwig wrote:
> On Wed, Sep 07, 2022 at 09:33:11AM -0300, Jason Gunthorpe wrote:
> > Yes, you said that, and I said that when the AMD driver first merged
> > it - but it went in anyhow and now people are using it in a bunch of
> > places.
> 
> drm folks made up their own weird rules, if they internally stick
> to it they have to listen to it given that they ignore review comments,
> but it violates the scatterlist API and has not business anywhere
> else in the kernel.  And yes, there probably is a reason or two why
> the drm code is unusually error prone.

That may be, but it is creating problems if DRM gets to do X crazy
thing and nobody else can..

So, we have two issues here

 1) DMABUF abuses the scatter list, but this is very constrainted we have
    this weird special "DMABUF scatterlist" that is only touched by DMABUF
    importers. The imports signal that they understand the format with
    a flag. This is ugly and would be nice to clean to a dma mapped
    address list of some sort.

    I spent alot of time a few years ago removing driver touches of
    the SGL and preparing the RDMA stack to do this kind of change, at
    least.

 2) DMABUF abuses dma_map_resource() for P2P and thus doesn't work in
    certain special cases.

    Rather than jump to ZONE_DEVICE and map_sgl I would like to
    continue to support non-struct page mapping. So, I would suggest
    adding a dma_map_p2p() that can cover off the special cases,
    include the two struct devices as arguments with a physical PFN/size. Do
    the same logic we do under the ZONE_DEVICE stuff.

    Drivers can choose if they want to pay the memory cost of
    ZONE_DEVICE and get faster dma_map or get slower dma_map and save
    memory.

I still think we can address them incrementally - but the
dma_map_p2p() might be small enough to sort out right away, if you are
OK with it.

> > > Why would small BARs be problematic for the pages?  The pages are more
> > > a problem for gigantic BARs do the memory overhead.
> > 
> > How do I get a struct page * for a 4k BAR in vfio?
> 
> I guess we have different definitions of small then :)
> 
> But unless my understanding of the code is out out of data,
> memremap_pages just requires the (virtual) start address to be 2MB
> aligned, not the size.  Adding Dan for comments.

Don't we need the virtual start address to equal the physical pfn for
everything to work properly? eg pfn_to_page?

And we can't over-allocate because another driver might want to also
use ZONE_DEVICE pages for its BAR that is now creating a collision.

So, at least as is, the memmap stuff seems unable to support the case
we have with VFIO.

> That being said, what is the point of mapping say a 4k BAR for p2p?
> You're not going to save a measurable amount of CPU overhead if that
> is the only place you transfer to.

For the purpose this series is chasing, it is for doorbell rings. The
actual data transfer may still bounce through CPU memory (if a CMB is
not available), but the latency reduction of directly signaling the
peer device that the transfer is ready is the key objective. 

Bouncing an interrupt through the CPU to cause it to do a writel() is
very tiem consuming, especially on slow ARM devices, while we have
adequate memory bandwidth for data transfer.

When I look at iommufd, it is for generality and compat. We don't have
knowledge of what the guest will do, so regardless of BAR size we have
to create P2P iommu mappings for every kind of PCI BAR. It is what
vfio is currently doing.

Jason
Christoph Hellwig Sept. 7, 2022, 3:23 p.m. UTC | #16
On Wed, Sep 07, 2022 at 05:08:37PM +0200, Christian König wrote:
> The key point is that you can't do any CPU fallback with this as long as the
> CPU wouldn't do exactly the same thing as the original hardware device. E.g.
> not write combine nor do any fully page copies etc...

That is true for MMIO in general.  You need to use the mmio accessors,
and even then it needs to match what the hardware expects

> This is not even a memory write, but rather just some trigger event. That's
> essentially the background why I think having struct pages and sg_tables
> doesn't make any sense at all for this use case.

Well, Jasons patch uses a sg_table, just in a very broken way.

> > > Would you mind if I start to tackle this problem?
> > Yes, I've been waiting forever for someone to tacke how the scatterlist
> > is abused in dma-buf..
> 
> How about we separate the scatterlist into page and DMA address container?

As said before that is fundamentally the right thing to do.  It just
takes a lot of work.  I think on the dma mapping side we're finally
getting ready for it now that arm uses the common dma-direct code,
although getting rid of the arm iommu ops first would be even better.

Note that this only solves the problem of needing the pages to hold
the physical address.  The other thing the pages are used for in p2p
is that (through the pgmap) it points to the owning device of the p2p
memory and thus allows figuring out how it needs to be accessed.
Christoph Hellwig Sept. 7, 2022, 3:32 p.m. UTC | #17
On Wed, Sep 07, 2022 at 12:23:28PM -0300, Jason Gunthorpe wrote:
>  2) DMABUF abuses dma_map_resource() for P2P and thus doesn't work in
>     certain special cases.

Not just certain special cases, but one of the main use cases.
Basically P2P can happen in two ways:

 a) through a PCIe switch, or
 b) through connected root ports

The open code version here only supports a), only supports it when there
is no offset between the 'phyiscal' address of the BAR seen PCIe
endpoint and the Linux way.  x86 usually (always?) doesn't have an
offset there, but other architectures often do.

Last but not least I don't really see how the code would even work
when an IOMMU is used, as dma_map_resource will return an IOVA that
is only understood by the IOMMU itself, and not the other endpoint.

How was this code even tested?
Jason Gunthorpe Sept. 7, 2022, 4:12 p.m. UTC | #18
On Wed, Sep 07, 2022 at 08:32:23AM -0700, Christoph Hellwig wrote:
> On Wed, Sep 07, 2022 at 12:23:28PM -0300, Jason Gunthorpe wrote:
> >  2) DMABUF abuses dma_map_resource() for P2P and thus doesn't work in
> >     certain special cases.
> 
> Not just certain special cases, but one of the main use cases.
> Basically P2P can happen in two ways:
>
>  a) through a PCIe switch, or
>  b) through connected root ports

Yes, we tested both, both work.

> The open code version here only supports a), only supports it when there
> is no offset between the 'phyiscal' address of the BAR seen PCIe
> endpoint and the Linux way.  x86 usually (always?) doesn't have an
> offset there, but other architectures often do.

The PCI offset is some embedded thing - I've never seen it in a server
platform.

Frankly, it is just bad SOC design and there is good reason why
non-zero needs to be avoided. As soon as you create aliases between
the address spaces you invite trouble. IIRC a SOC I used once put the
memory at 0 -> 4G then put the only PCI aperture at 4g ->
4g+N. However this design requires 64 bit PCI support, which at the
time, the platform didn't have. So they used PCI offset to hackily
alias the aperture over the DDR. I don't remember if they threw out a
bit of DDR to resolve the alias, or if they just didn't support PCI
switches.

In any case, it is a complete mess. You either drastically limit your
BAR size, don't support PCI switches or loose a lot of DDR.

I also seem to remember that iommu and PCI offset don't play nice
together - so for the VFIO use case where the iommu is present I'm
pretty sure we can very safely assume 0 offset. That seems confirmed
by the fact that VFIO has never handled PCI offset in its own P2P path
and P2P works fine in VMs across a wide range of platforms.

That said, I agree we should really have APIs that support this
properly, and dma_map_resource is certainly technically wrong.

So, would you be OK with this series if I try to make a dma_map_p2p()
that resolves the offset issue?

> Last but not least I don't really see how the code would even work
> when an IOMMU is used, as dma_map_resource will return an IOVA that
> is only understood by the IOMMU itself, and not the other endpoint.

I don't understand this.

__iommu_dma_map() will put the given phys into the iommu_domain
associated with 'dev' and return the IOVA it picked.

Here 'dev' is the importing device, it is the device that will issue
the DMA:

+       dma_addr = dma_map_resource(
+               attachment->dev,
+               pci_resource_start(priv->vdev->pdev, priv->index) +
+                       priv->offset,
+               priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);

eg attachment->dev is the PCI device of the RDMA device, not the VFIO
device.

'phys' is the CPU physical of the PCI BAR page, which with 0 PCI
offset is the right thing to program into the IO page table.

> How was this code even tested?

It was tested on a few platforms, like I said above, the cases where
it doesn't work are special, largely embedded, and not anything we
have in our labs - AFAIK.

Jason
Robin Murphy Sept. 7, 2022, 4:31 p.m. UTC | #19
On 2022-09-07 16:23, Jason Gunthorpe wrote:
> On Wed, Sep 07, 2022 at 07:29:58AM -0700, Christoph Hellwig wrote:
>> On Wed, Sep 07, 2022 at 09:33:11AM -0300, Jason Gunthorpe wrote:
>>> Yes, you said that, and I said that when the AMD driver first merged
>>> it - but it went in anyhow and now people are using it in a bunch of
>>> places.
>>
>> drm folks made up their own weird rules, if they internally stick
>> to it they have to listen to it given that they ignore review comments,
>> but it violates the scatterlist API and has not business anywhere
>> else in the kernel.  And yes, there probably is a reason or two why
>> the drm code is unusually error prone.
> 
> That may be, but it is creating problems if DRM gets to do X crazy
> thing and nobody else can..
> 
> So, we have two issues here
> 
>   1) DMABUF abuses the scatter list, but this is very constrainted we have
>      this weird special "DMABUF scatterlist" that is only touched by DMABUF
>      importers. The imports signal that they understand the format with
>      a flag. This is ugly and would be nice to clean to a dma mapped
>      address list of some sort.
> 
>      I spent alot of time a few years ago removing driver touches of
>      the SGL and preparing the RDMA stack to do this kind of change, at
>      least.
> 
>   2) DMABUF abuses dma_map_resource() for P2P and thus doesn't work in
>      certain special cases.

FWIW, dma_map_resource() *is* for P2P in general. The classic case of 
one device poking at another's registers that was the original 
motivation is a standalone DMA engine reading/writing a peripheral 
device's FIFO, so the very similar inter-device doorbell signal is 
absolutely in scope too; VRAM might be a slightly greyer area, but if 
it's still not page-backed kernel memory then I reckon that's fair game.

The only trouble is that it's not geared for *PCI* P2P when that may or 
may not happen entirely upstream of IOMMU translation.

Robin.
Jason Gunthorpe Sept. 7, 2022, 4:47 p.m. UTC | #20
On Wed, Sep 07, 2022 at 05:31:14PM +0100, Robin Murphy wrote:

> The only trouble is that it's not geared for *PCI* P2P when that may or may
> not happen entirely upstream of IOMMU translation.

This is why PCI users have to call the pci_distance stuff before using
dma_map_resource(), it ensures the PCI fabric is setup in a way that
is consistent with the iommu. eg if we have IOMMU turned on then the
fabric must have ACS/etc to ensure that all TLPs are translated.

PCI P2P is very complicated and fragile, sadly.

Thanks,
Jason
Dan Williams Sept. 7, 2022, 5:03 p.m. UTC | #21
Christoph Hellwig wrote:
> On Wed, Sep 07, 2022 at 09:33:11AM -0300, Jason Gunthorpe wrote:
> > Yes, you said that, and I said that when the AMD driver first merged
> > it - but it went in anyhow and now people are using it in a bunch of
> > places.
> 
> drm folks made up their own weird rules, if they internally stick
> to it they have to listen to it given that they ignore review comments,
> but it violates the scatterlist API and has not business anywhere
> else in the kernel.  And yes, there probably is a reason or two why
> the drm code is unusually error prone.
> 
> > > Why would small BARs be problematic for the pages?  The pages are more
> > > a problem for gigantic BARs do the memory overhead.
> > 
> > How do I get a struct page * for a 4k BAR in vfio?
> 
> I guess we have different definitions of small then :)
> 
> But unless my understanding of the code is out out of data,
> memremap_pages just requires the (virtual) start address to be 2MB
> aligned, not the size.  Adding Dan for comments.

The minimum granularity for sparse_add_section() that memremap_pages
uses internally is 2MB, so start and end need to be 2MB aligned. Details
here:

ba72b4c8cf60 mm/sparsemem: support sub-section hotplug
Christoph Hellwig Sept. 9, 2022, 1:24 p.m. UTC | #22
On Wed, Sep 07, 2022 at 01:12:52PM -0300, Jason Gunthorpe wrote:
> The PCI offset is some embedded thing - I've never seen it in a server
> platform.

That's not actually true, e.g. some power system definitively had it,
althiugh I don't know if the current ones do.

But that's not that point.  The offset is a configuration fully
supported by Linux, and someone that just works by using the proper
APIs.  Doing some handwaiving about embedded only or bad design doesn't
matter.  There is a reason why we have these proper APIs and no one
has any business bypassing them.

> I also seem to remember that iommu and PCI offset don't play nice
> together - so for the VFIO use case where the iommu is present I'm
> pretty sure we can very safely assume 0 offset. That seems confirmed
> by the fact that VFIO has never handled PCI offset in its own P2P path
> and P2P works fine in VMs across a wide range of platforms.

I think the offset is one of the reasons why IOVA windows can be
reserved (and maybe also why ppc is so weird).

> So, would you be OK with this series if I try to make a dma_map_p2p()
> that resolves the offset issue?

Well, if it also solves the other issue of invalid scatterlists leaking
outside of drm we can think about it.

> 
> > Last but not least I don't really see how the code would even work
> > when an IOMMU is used, as dma_map_resource will return an IOVA that
> > is only understood by the IOMMU itself, and not the other endpoint.
> 
> I don't understand this.
> 
> __iommu_dma_map() will put the given phys into the iommu_domain
> associated with 'dev' and return the IOVA it picked.

Yes, __iommu_dma_map creates an IOVA for the mapped remote BAR.  That
is the right thing if the I/O goes through the host bridge, but it is
the wrong thing if the I/O goes through the switch - in that case the
IOVA generated is not something that the endpoint that owns the BAR
can even understand.

Take a look at iommu_dma_map_sg and pci_p2pdma_map_segment to see how
this is handled.
Jason Gunthorpe Sept. 9, 2022, 2:09 p.m. UTC | #23
On Fri, Sep 09, 2022 at 06:24:35AM -0700, Christoph Hellwig wrote:
> On Wed, Sep 07, 2022 at 01:12:52PM -0300, Jason Gunthorpe wrote:
> > The PCI offset is some embedded thing - I've never seen it in a server
> > platform.
> 
> That's not actually true, e.g. some power system definitively had it,
> althiugh I don't know if the current ones do.

I thought those were all power embedded systems.

> There is a reason why we have these proper APIs and no one has any
> business bypassing them.

Yes, we should try to support these things, but you said this patch
didn't work and wasn't tested - that is not true at all.

And it isn't like we have APIs just sitting here to solve this
specific problem. So lets make something.

> > So, would you be OK with this series if I try to make a dma_map_p2p()
> > that resolves the offset issue?
> 
> Well, if it also solves the other issue of invalid scatterlists leaking
> outside of drm we can think about it.

The scatterlist stuff has already leaked outside of DRM anyhow.

Again, I think it is very problematic to let DRM get away with things
and then insist all the poor non-DRM people be responsible to clean up
their mess.

I'm skeptical I can fix AMD GPU, but I can try to create a DMABUF op
that returns something that is not a scatterlist and teach RDMA to use
it. So at least the VFIO/RDMA part can avoid the scatter list abuse. I
expected to need non-scatterlist for iommufd anyhow.

Coupled with a series to add some dma_map_resource_pci() that handles
the PCI_P2PDMA_MAP_BUS_ADDR and the PCI offset, would it be an
agreeable direction?

> Take a look at iommu_dma_map_sg and pci_p2pdma_map_segment to see how
> this is handled.

So there is a bug in all these DMABUF implementations, they do ignore
the PCI_P2PDMA_MAP_BUS_ADDR "distance type".

This isn't a real-world problem for VFIO because VFIO is largely
incompatible with the non-ACS configuration that would trigger
PCI_P2PDMA_MAP_BUS_ADDR, and explains why we never saw any
problem. All our systems have ACS turned on so we can use VFIO.

I'm unclear how Habana or AMD have avoided a problem here..

This is much more serious than the pci offset in my mind.

Thanks,
Jason
diff mbox series

Patch

diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 24c524224da5a3..514c12a81127d6 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -2,6 +2,7 @@ 
 
 vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o
+vfio-pci-core-$(CONFIG_DMA_SHARED_BUFFER) += dma_buf.o
 obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
 
 vfio-pci-y := vfio_pci.o
diff --git a/drivers/vfio/pci/dma_buf.c b/drivers/vfio/pci/dma_buf.c
new file mode 100644
index 00000000000000..11396aeabff405
--- /dev/null
+++ b/drivers/vfio/pci/dma_buf.c
@@ -0,0 +1,269 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES.
+ */
+#include <linux/dma-buf.h>
+#include <linux/pci-p2pdma.h>
+#include <linux/dma-resv.h>
+
+#include "vfio_pci_priv.h"
+
+MODULE_IMPORT_NS(DMA_BUF);
+
+struct vfio_pci_dma_buf {
+	struct dma_buf *dmabuf;
+	struct vfio_pci_core_device *vdev;
+	struct list_head dmabufs_elm;
+	unsigned int index;
+	unsigned int orig_nents;
+	size_t offset;
+	bool revoked;
+};
+
+static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
+				   struct dma_buf_attachment *attachment)
+{
+	struct vfio_pci_dma_buf *priv = dmabuf->priv;
+	int rc;
+
+	rc = pci_p2pdma_distance_many(priv->vdev->pdev, &attachment->dev, 1,
+				      true);
+	if (rc < 0)
+		attachment->peer2peer = false;
+	return 0;
+}
+
+static void vfio_pci_dma_buf_unpin(struct dma_buf_attachment *attachment)
+{
+}
+
+static int vfio_pci_dma_buf_pin(struct dma_buf_attachment *attachment)
+{
+	/*
+	 * Uses the dynamic interface but must always allow for
+	 * dma_buf_move_notify() to do revoke
+	 */
+	return -EINVAL;
+}
+
+static struct sg_table *
+vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
+		     enum dma_data_direction dir)
+{
+	size_t sgl_size = dma_get_max_seg_size(attachment->dev);
+	struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
+	struct scatterlist *sgl;
+	struct sg_table *sgt;
+	dma_addr_t dma_addr;
+	unsigned int nents;
+	size_t offset;
+	int ret;
+
+	dma_resv_assert_held(priv->dmabuf->resv);
+
+	if (!attachment->peer2peer)
+		return ERR_PTR(-EPERM);
+
+	if (priv->revoked)
+		return ERR_PTR(-ENODEV);
+
+	sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
+	if (!sgt)
+		return ERR_PTR(-ENOMEM);
+
+	nents = DIV_ROUND_UP(priv->dmabuf->size, sgl_size);
+	ret = sg_alloc_table(sgt, nents, GFP_KERNEL);
+	if (ret)
+		goto err_kfree_sgt;
+
+	/*
+	 * Since the memory being mapped is a device memory it could never be in
+	 * CPU caches.
+	 */
+	dma_addr = dma_map_resource(
+		attachment->dev,
+		pci_resource_start(priv->vdev->pdev, priv->index) +
+			priv->offset,
+		priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);
+	ret = dma_mapping_error(attachment->dev, dma_addr);
+	if (ret)
+		goto err_free_sgt;
+
+	/*
+	 * Break the BAR's physical range up into max sized SGL's according to
+	 * the device's requirement.
+	 */
+	sgl = sgt->sgl;
+	for (offset = 0; offset != priv->dmabuf->size;) {
+		size_t chunk_size = min(priv->dmabuf->size - offset, sgl_size);
+
+		sg_set_page(sgl, NULL, chunk_size, 0);
+		sg_dma_address(sgl) = dma_addr + offset;
+		sg_dma_len(sgl) = chunk_size;
+		sgl = sg_next(sgl);
+		offset += chunk_size;
+	}
+
+	/*
+	 * Because we are not going to include a CPU list we want to have some
+	 * chance that other users will detect this by setting the orig_nents to
+	 * 0 and using only nents (length of DMA list) when going over the sgl
+	 */
+	priv->orig_nents = sgt->orig_nents;
+	sgt->orig_nents = 0;
+	return sgt;
+
+err_free_sgt:
+	sg_free_table(sgt);
+err_kfree_sgt:
+	kfree(sgt);
+	return ERR_PTR(ret);
+}
+
+static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment,
+				   struct sg_table *sgt,
+				   enum dma_data_direction dir)
+{
+	struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
+
+	sgt->orig_nents = priv->orig_nents;
+	dma_unmap_resource(attachment->dev, sg_dma_address(sgt->sgl),
+			   priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);
+	sg_free_table(sgt);
+	kfree(sgt);
+}
+
+static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
+{
+	struct vfio_pci_dma_buf *priv = dmabuf->priv;
+
+	/*
+	 * Either this or vfio_pci_dma_buf_cleanup() will remove from the list.
+	 * The refcount prevents both.
+	 */
+	if (priv->vdev) {
+		down_write(&priv->vdev->memory_lock);
+		list_del_init(&priv->dmabufs_elm);
+		up_write(&priv->vdev->memory_lock);
+		vfio_device_put(&priv->vdev->vdev);
+	}
+	kfree(priv);
+}
+
+static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
+	.attach = vfio_pci_dma_buf_attach,
+	.map_dma_buf = vfio_pci_dma_buf_map,
+	.pin = vfio_pci_dma_buf_pin,
+	.unpin = vfio_pci_dma_buf_unpin,
+	.release = vfio_pci_dma_buf_release,
+	.unmap_dma_buf = vfio_pci_dma_buf_unmap,
+};
+
+int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
+				  struct vfio_device_feature_dma_buf __user *arg,
+				  size_t argsz)
+{
+	struct vfio_device_feature_dma_buf get_dma_buf;
+	DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
+	struct vfio_pci_dma_buf *priv;
+	int ret;
+
+	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
+				 sizeof(get_dma_buf));
+	if (ret != 1)
+		return ret;
+
+	if (copy_from_user(&get_dma_buf, arg, sizeof(get_dma_buf)))
+		return -EFAULT;
+
+	/* For PCI the region_index is the BAR number like everything else */
+	if (get_dma_buf.region_index >= VFIO_PCI_ROM_REGION_INDEX)
+		return -EINVAL;
+
+	exp_info.ops = &vfio_pci_dmabuf_ops;
+	exp_info.size = pci_resource_len(vdev->pdev, get_dma_buf.region_index);
+	if (!exp_info.size)
+		return -EINVAL;
+	if (get_dma_buf.offset || get_dma_buf.length) {
+		if (get_dma_buf.length > exp_info.size ||
+		    get_dma_buf.offset >= exp_info.size ||
+		    get_dma_buf.length > exp_info.size - get_dma_buf.offset ||
+		    get_dma_buf.offset % PAGE_SIZE ||
+		    get_dma_buf.length % PAGE_SIZE)
+			return -EINVAL;
+		exp_info.size = get_dma_buf.length;
+	}
+	exp_info.flags = get_dma_buf.open_flags;
+
+	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+	if (!priv)
+		return -ENOMEM;
+	INIT_LIST_HEAD(&priv->dmabufs_elm);
+	priv->offset = get_dma_buf.offset;
+	priv->index = get_dma_buf.region_index;
+
+	exp_info.priv = priv;
+	priv->dmabuf = dma_buf_export(&exp_info);
+	if (IS_ERR(priv->dmabuf)) {
+		ret = PTR_ERR(priv->dmabuf);
+		kfree(priv);
+		return ret;
+	}
+
+	/* dma_buf_put() now frees priv */
+
+	down_write(&vdev->memory_lock);
+	dma_resv_lock(priv->dmabuf->resv, NULL);
+	priv->revoked = !__vfio_pci_memory_enabled(vdev);
+	priv->vdev = vdev;
+	vfio_device_get(&vdev->vdev);
+	list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
+	dma_resv_unlock(priv->dmabuf->resv);
+	up_write(&vdev->memory_lock);
+
+	/*
+	 * dma_buf_fd() consumes the reference, when the file closes the dmabuf
+	 * will be released.
+	 */
+	return dma_buf_fd(priv->dmabuf, get_dma_buf.open_flags);
+}
+
+void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
+{
+	struct vfio_pci_dma_buf *priv;
+	struct vfio_pci_dma_buf *tmp;
+
+	lockdep_assert_held_write(&vdev->memory_lock);
+
+	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
+		if (!dma_buf_try_get(priv->dmabuf))
+			continue;
+		if (priv->revoked != revoked) {
+			dma_resv_lock(priv->dmabuf->resv, NULL);
+			priv->revoked = revoked;
+			dma_buf_move_notify(priv->dmabuf);
+			dma_resv_unlock(priv->dmabuf->resv);
+		}
+		dma_buf_put(priv->dmabuf);
+	}
+}
+
+void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
+{
+	struct vfio_pci_dma_buf *priv;
+	struct vfio_pci_dma_buf *tmp;
+
+	down_write(&vdev->memory_lock);
+	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
+		if (!dma_buf_try_get(priv->dmabuf))
+			continue;
+		dma_resv_lock(priv->dmabuf->resv, NULL);
+		list_del_init(&priv->dmabufs_elm);
+		priv->vdev = NULL;
+		priv->revoked = true;
+		dma_buf_move_notify(priv->dmabuf);
+		dma_resv_unlock(priv->dmabuf->resv);
+		vfio_device_put(&vdev->vdev);
+		dma_buf_put(priv->dmabuf);
+	}
+	up_write(&vdev->memory_lock);
+}
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index d22921efa25987..f8a9c24d04aeb7 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -584,10 +584,12 @@  static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
 		virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY);
 		new_mem = !!(new_cmd & PCI_COMMAND_MEMORY);
 
-		if (!new_mem)
+		if (!new_mem) {
 			vfio_pci_zap_and_down_write_memory_lock(vdev);
-		else
+			vfio_pci_dma_buf_move(vdev, true);
+		} else {
 			down_write(&vdev->memory_lock);
+		}
 
 		/*
 		 * If the user is writing mem/io enable (new_mem/io) and we
@@ -622,6 +624,8 @@  static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
 		*virt_cmd &= cpu_to_le16(~mask);
 		*virt_cmd |= cpu_to_le16(new_cmd & mask);
 
+		if (__vfio_pci_memory_enabled(vdev))
+			vfio_pci_dma_buf_move(vdev, false);
 		up_write(&vdev->memory_lock);
 	}
 
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 885706944e4d96..7e27bfc60440c2 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -504,6 +504,8 @@  void vfio_pci_core_close_device(struct vfio_device *core_vdev)
 	vfio_spapr_pci_eeh_release(vdev->pdev);
 	vfio_pci_core_disable(vdev);
 
+	vfio_pci_dma_buf_cleanup(vdev);
+
 	mutex_lock(&vdev->igate);
 	if (vdev->err_trigger) {
 		eventfd_ctx_put(vdev->err_trigger);
@@ -980,7 +982,10 @@  int vfio_pci_try_reset_function(struct vfio_pci_core_device *vdev)
 	 */
 	vfio_pci_set_power_state(vdev, PCI_D0);
 
+	vfio_pci_dma_buf_move(vdev, true);
 	ret = pci_try_reset_function(vdev->pdev);
+	if (__vfio_pci_memory_enabled(vdev))
+		vfio_pci_dma_buf_move(vdev, false);
 	up_write(&vdev->memory_lock);
 
 	return ret;
@@ -1210,11 +1215,10 @@  long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
 
-static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
-				       uuid_t __user *arg, size_t argsz)
+static int vfio_pci_core_feature_token(struct vfio_pci_core_device *vdev,
+				       u32 flags, uuid_t __user *arg,
+				       size_t argsz)
 {
-	struct vfio_pci_core_device *vdev =
-		container_of(device, struct vfio_pci_core_device, vdev);
 	uuid_t uuid;
 	int ret;
 
@@ -1241,9 +1245,14 @@  static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
 int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 				void __user *arg, size_t argsz)
 {
+	struct vfio_pci_core_device *vdev =
+		container_of(device, struct vfio_pci_core_device, vdev);
+
 	switch (flags & VFIO_DEVICE_FEATURE_MASK) {
 	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
-		return vfio_pci_core_feature_token(device, flags, arg, argsz);
+		return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
+	case VFIO_DEVICE_FEATURE_DMA_BUF:
+		return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
 	default:
 		return -ENOTTY;
 	}
@@ -1881,6 +1890,7 @@  void vfio_pci_core_init_device(struct vfio_pci_core_device *vdev,
 	INIT_LIST_HEAD(&vdev->vma_list);
 	INIT_LIST_HEAD(&vdev->sriov_pfs_item);
 	init_rwsem(&vdev->memory_lock);
+	INIT_LIST_HEAD(&vdev->dmabufs);
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_init_device);
 
@@ -2227,11 +2237,17 @@  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 	 * cause the PCI config space reset without restoring the original
 	 * state (saved locally in 'vdev->pm_save').
 	 */
-	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
+	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
+		vfio_pci_dma_buf_move(cur, true);
 		vfio_pci_set_power_state(cur, PCI_D0);
+	}
 
 	ret = pci_reset_bus(pdev);
 
+	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
+		if (__vfio_pci_memory_enabled(cur))
+			vfio_pci_dma_buf_move(cur, false);
+
 err_undo:
 	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
 		if (cur == cur_mem)
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index 5b1cb9a9838442..c295a1166e7a9c 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -102,4 +102,27 @@  static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
 	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
 }
 
+#ifdef CONFIG_DMA_SHARED_BUFFER
+int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
+				  struct vfio_device_feature_dma_buf __user *arg,
+				  size_t argsz);
+void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev);
+void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked);
+#else
+static int
+vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
+			      struct vfio_device_feature_dma_buf __user *arg,
+			      size_t argsz)
+{
+	return -ENOTTY;
+}
+static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
+{
+}
+static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev,
+					 bool revoked)
+{
+}
+#endif
+
 #endif
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index e5cf0d3313a694..fb9769d9e5d149 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -93,6 +93,7 @@  struct vfio_pci_core_device {
 	struct mutex		vma_lock;
 	struct list_head	vma_list;
 	struct rw_semaphore	memory_lock;
+	struct list_head	dmabufs;
 };
 
 /* Will be exported for vfio pci drivers usage */
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 733a1cddde30a5..1dcfad6dcb6863 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -986,6 +986,24 @@  enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
 };
 
+/**
+ * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the
+ * region selected.
+ *
+ * open_flags are the typical flags passed to open(2), eg O_RDWR, O_CLOEXEC,
+ * etc. offset/length specify a slice of the region to create the dmabuf from.
+ * If both are 0 then the whole region is used.
+ *
+ * Return: The fd number on success, -1 and errno is set on failure.
+ */
+struct vfio_device_feature_dma_buf {
+	__u32 region_index;
+	__u32 open_flags;
+	__u32 offset;
+	__u64 length;
+};
+#define VFIO_DEVICE_FEATURE_DMA_BUF 3
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**