
[0/4] cover-letter: Allow MMIO regions to be exported through dmabuf

Message ID 20241216095429.210792-1-wguay@fb.com (mailing list archive)

Message

Wei Lin Guay Dec. 16, 2024, 9:54 a.m. UTC
From: Wei Lin Guay <wguay@meta.com>

This is another attempt to revive the patches posted by Jason
Gunthorpe and Vivek Kasireddy, at
https://patchwork.kernel.org/project/linux-media/cover/0-v2-472615b3877e+28f7-vfio_dma_buf_jgg@nvidia.com/
https://lwn.net/Articles/970751/

In addition to the initial proposal by Jason, another promising
application is exposing memory from an AI accelerator (bound to VFIO)
to an RDMA device. This would allow the RDMA device to directly access
the accelerator's memory, thereby facilitating direct data
transactions between the RDMA device and the accelerator.

Below is the motivation text from the original cover letter.

dma-buf has become a way to safely acquire a handle to non-struct page
memory whose lifetime is still controlled by the exporter. Notably,
RDMA can now import dma-buf FDs and build them into MRs, which allows
for PCI P2P operations. Extend this to allow vfio-pci to export MMIO
memory from PCI device BARs.
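
The importer side of this flow already exists in userspace RDMA:
rdma-core provides ibv_reg_dmabuf_mr() to register a dma-buf fd as an
MR. Below is a minimal sketch, assuming a valid PD and a dma-buf fd
already obtained from an exporter such as the vfio-pci one proposed
here:

#include <stdio.h>
#include <infiniband/verbs.h>

/* Minimal sketch: register an already-exported dma-buf fd as an RDMA MR.
 * Assumes 'pd' is a valid protection domain and 'dmabuf_fd' came from an
 * exporter (with this series, a vfio-pci BAR region). */
static struct ibv_mr *register_dmabuf_mr(struct ibv_pd *pd, int dmabuf_fd,
					 size_t len)
{
	struct ibv_mr *mr;

	/* offset and iova both 0: map the dma-buf from its start */
	mr = ibv_reg_dmabuf_mr(pd, 0, len, 0, dmabuf_fd,
			       IBV_ACCESS_LOCAL_WRITE |
			       IBV_ACCESS_REMOTE_READ |
			       IBV_ACCESS_REMOTE_WRITE);
	if (!mr)
		perror("ibv_reg_dmabuf_mr");
	return mr;
}

Once registered, the MR can back P2P transfers without the CPU copying
the data.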

This series supports a use case for SPDK where an NVMe device will be owned
by SPDK through VFIO but interact with an RDMA device. The RDMA device
may directly access the NVMe CMB or directly manipulate the NVMe device's
doorbell using PCI P2P.

However, as a general mechanism, it can support many other scenarios with
VFIO. I imagine this dmabuf approach to be usable by iommufd as well for
generic and safe P2P mappings.

This series goes after the "Break up ioctl dispatch functions to one
function per ioctl" series.
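
The exact uapi lives in patch 3 of the series and is not shown in this
cover letter. Purely as an unconfirmed illustration, assuming a
VFIO_DEVICE_FEATURE-based interface similar to Jason's original posting
(the struct layout and the feature bit below are assumptions, not the
confirmed uapi), the userspace request could look roughly like:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Hypothetical layout, NOT the confirmed uapi of this series; the real
 * definition lives in the series' include/uapi/linux/vfio.h changes. */
struct vfio_dma_buf_arg_sketch {
	__u32 region_index;	/* which BAR to export */
	__u32 open_flags;	/* flags for the new dma-buf fd */
	__u32 offset;		/* start of the range within the BAR */
	__u64 length;		/* number of bytes to export */
};

static int get_bar_dmabuf(int device_fd, __u32 bar, __u64 len)
{
	char buf[sizeof(struct vfio_device_feature) +
		 sizeof(struct vfio_dma_buf_arg_sketch)];
	struct vfio_device_feature *feat = (void *)buf;
	struct vfio_dma_buf_arg_sketch *arg = (void *)feat->data;

	memset(buf, 0, sizeof(buf));
	feat->argsz = sizeof(buf);
	/* the DMA_BUF feature bit is an assumption; see patch 3 */
	feat->flags = VFIO_DEVICE_FEATURE_GET;
	arg->region_index = bar;
	arg->length = len;

	/* on success, the return value would be the new dma-buf fd */
	return ioctl(device_fd, VFIO_DEVICE_FEATURE, feat);
}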

v2:
 - Name the new file dma_buf.c
 - Restore orig_nents before freeing
 - Fix reversed logic around priv->revoked
 - Set priv->index
 - Rebased on v2 "Break up ioctl dispatch functions"
v1: https://lore.kernel.org/r/0-v1-9e6e1739ed95+5fa-vfio_dma_buf_jgg@nvidia.com
Cc: linux-rdma@vger.kernel.org
Cc: Oded Gabbay <ogabbay@kernel.org>
Cc: Christian König <christian.koenig@amd.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Maor Gottlieb <maorg@nvidia.com>
Cc: dri-devel@lists.freedesktop.org
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Jason Gunthorpe (3):
  vfio: Add vfio_device_get()
  dma-buf: Add dma_buf_try_get()
  vfio/pci: Allow MMIO regions to be exported through dma-buf

Wei Lin Guay (1):
  vfio/pci: Allow export dmabuf without move_notify from importer

 drivers/vfio/pci/Makefile          |   1 +
 drivers/vfio/pci/dma_buf.c         | 291 +++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_config.c |   8 +-
 drivers/vfio/pci/vfio_pci_core.c   |  44 ++++-
 drivers/vfio/pci/vfio_pci_priv.h   |  30 +++
 drivers/vfio/vfio_main.c           |   1 +
 include/linux/dma-buf.h            |  13 ++
 include/linux/vfio.h               |   6 +
 include/linux/vfio_pci_core.h      |   1 +
 include/uapi/linux/vfio.h          |  18 ++
 10 files changed, 405 insertions(+), 8 deletions(-)
 create mode 100644 drivers/vfio/pci/dma_buf.c

--
2.43.5

Comments

Wei Lin Guay Dec. 17, 2024, 11:06 a.m. UTC | #1
Hi Christian, 

Thanks again for your prompt response/review.

> On 17 Dec 2024, at 10:53, Christian König <christian.koenig@amd.com> wrote:
> 
> Am 16.12.24 um 17:54 schrieb Keith Busch:
>> On Mon, Dec 16, 2024 at 11:21:39AM +0100, Christian König wrote:
>>> Am 16.12.24 um 10:54 schrieb Wei Lin Guay:
>>>> From: Wei Lin Guay <wguay@meta.com>
>>>> However, as a general mechanism, it can support many other scenarios with
>>>> VFIO. I imagine this dmabuf approach to be usable by iommufd as well for
>>>> generic and safe P2P mappings.
>>>> 
>>>> This series goes after the "Break up ioctl dispatch functions to one
>>>> function per ioctl" series.
>>> Yeah that sounds like it should work.
>>> 
>>> But where is the rest of the series, I only see the cover letter?
>> Should be here:
>> 
>>   https://lore.kernel.org/linux-rdma/20241216095429.210792-2-wguay@fb.com/T/#u
> 
> Please send that out once more with me on explicit CC.

I will resend the patch series. I was experiencing issues with my email client, which inadvertently split the series into two separate emails.

> 
> Apart from that I have to reject the adding of dma_buf_try_get(), that is clearly not something we should do.


Understood. It appears that Vivek has confirmed that his v2 has resolved the issue. I will follow up with him to determine if he plans to resume his patch, and if so, I will apply my last patch on top of his updated patch series.

Thanks,
Wei Lin
> 
> Thanks,
> Christian.
Christian König Dec. 17, 2024, 12:47 p.m. UTC | #2
Am 17.12.24 um 12:06 schrieb Wei Lin Guay:
> [SNIP]
>> Please send that out once more with me on explicit CC.
> I will resend the patch series. I was experiencing issues with my email client, which inadvertently split the series into two separate emails.

Alternatively I can also copy the code from the list archive and explain 
why that doesn't work:

+void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
+{
+    struct vfio_pci_dma_buf *priv;
+    struct vfio_pci_dma_buf *tmp;
+
+    lockdep_assert_held_write(&vdev->memory_lock);

This makes sure that the caller is holding vdev->memory_lock.

+
+    list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
+        if (!dma_buf_try_get(priv->dmabuf))

This here only works because vfio_pci_dma_buf_release() also grabs 
vdev->memory_lock and so we are protected against concurrent releases.

+            continue;
+        if (priv->revoked != revoked) {
+            dma_resv_lock(priv->dmabuf->resv, NULL);
+            priv->revoked = revoked;
+            dma_buf_move_notify(priv->dmabuf);
+            dma_resv_unlock(priv->dmabuf->resv);
+        }
+        dma_buf_put(priv->dmabuf);

The problem is now that this here might drop the last reference which in 
turn calls vfio_pci_dma_buf_release() which also tries to grab 
vdev->memory_lock and so results in a deadlock.

+    }
+}

This pattern was suggested multiple times and I had to reject it every
time because the whole idea is just fundamentally broken.

It's really astonishing how people always come up with the same broken 
pattern.

Regards,
Christian.

>
>> Apart from that I have to reject the adding of dma_buf_try_get(), that is clearly not something we should do.
>
> Understood. It appears that Vivek has confirmed that his v2 has resolved the issue. I will follow up with him to determine if he plans to resume his patch, and if so, I will apply my last patch on top of his updated patch series
>
> Thanks,
> Wei Lin
>> Thanks,
>> Christian.
>
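
The deadlock described above is the classic self-deadlock on a
non-recursive lock: the release path re-acquires a lock its caller
already holds. A user-space analogue in plain C/pthreads, with all names
hypothetical, that hangs in exactly this way if the release callback ran
synchronously:

#include <pthread.h>
#include <stdio.h>

/* 'memory_lock' stands in for vdev->memory_lock, release_cb() for
 * vfio_pci_dma_buf_release(). */
static pthread_mutex_t memory_lock = PTHREAD_MUTEX_INITIALIZER;

static void release_cb(void)
{
	pthread_mutex_lock(&memory_lock);	/* blocks forever: already held */
	pthread_mutex_unlock(&memory_lock);
}

static void drop_last_ref(void)
{
	/* a synchronous final put would call the release handler here */
	release_cb();
}

int main(void)
{
	pthread_mutex_lock(&memory_lock);	/* caller holds the lock... */
	drop_last_ref();			/* ...and release re-takes it */
	pthread_mutex_unlock(&memory_lock);
	puts("never reached with a non-recursive mutex");
	return 0;
}

Whether the vfio path can actually hit this depends on whether the final
dma_buf_put() invokes the release handler synchronously, which is what
the rest of the thread examines.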
Kasireddy, Vivek Dec. 18, 2024, 6:16 a.m. UTC | #3
Hi Christian,

> Subject: Re: [PATCH 0/4] cover-letter: Allow MMIO regions to be exported
> through dmabuf
> 
> 
>> 	I will resend the patch series. I was experiencing issues with my email
>> client, which inadvertently split the series into two separate emails.
> 
> 
> Alternatively I can also copy the code from the list archive and explain why
> that doesn't work:
> 
> +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
> +{
> +    struct vfio_pci_dma_buf *priv;
> +    struct vfio_pci_dma_buf *tmp;
> +
> +    lockdep_assert_held_write(&vdev->memory_lock);
> 
> This makes sure that the caller is holding vdev->memory_lock.
> 
> +
> +    list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
> +        if (!dma_buf_try_get(priv->dmabuf))
> 
> This here only works because vfio_pci_dma_buf_release() also grabs
> vdev->memory_lock and so we are protected against concurrent releases.
> 
> +            continue;
> +        if (priv->revoked != revoked) {
> +            dma_resv_lock(priv->dmabuf->resv, NULL);
> +            priv->revoked = revoked;
> +            dma_buf_move_notify(priv->dmabuf);
> +            dma_resv_unlock(priv->dmabuf->resv);
> +        }
> +        dma_buf_put(priv->dmabuf);
> 
> The problem is now that this here might drop the last reference which in turn
> calls vfio_pci_dma_buf_release() which also tries to grab
> vdev->memory_lock and so results in a deadlock.
AFAICS, vfio_pci_dma_buf_release() would not be called synchronously after the
last reference is dropped by dma_buf_put(). This is because fput(), which is
called by dma_buf_put(), triggers f_op->release() asynchronously; therefore, a
deadlock is unlikely to occur in this scenario, unless I am overlooking something.

Thanks,
Vivek

> 
> +    }
> +}
> 
> This pattern was suggested multiple times and I had to reject it every time
> because the whole idea is just fundamentally broken.
> 
> It's really astonishing how people always come up with the same broken
> pattern.
> 
> Regards,
> Christian.
> 
> 		Apart from that I have to reject the adding of
> dma_buf_try_get(), that is clearly not something we should do.
> 
> 
> 
> 	Understood. It appears that Vivek has confirmed that his v2 has
> resolved the issue. I will follow up with him to determine if he plans to
> resume his patch, and if so, I will apply my last patch on top of his updated
> patch series
> 
> 	Thanks,
> 	Wei Lin
> 
> 
> 		Thanks,
> 		Christian.
> 
Christian König Dec. 18, 2024, 10:01 a.m. UTC | #4
Am 18.12.24 um 07:16 schrieb Kasireddy, Vivek:
> Hi Christian,
>
>> Subject: Re: [PATCH 0/4] cover-letter: Allow MMIO regions to be exported
>> through dmabuf
>>
>>
>>> 	I will resend the patch series. I was experiencing issues with my email
>>> client, which inadvertently split the series into two separate emails.
>>
>> Alternatively I can also copy the code from the list archive and explain why
>> that doesn't work:
>>
>> +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
>> +{
>> +    struct vfio_pci_dma_buf *priv;
>> +    struct vfio_pci_dma_buf *tmp;
>> +
>> +    lockdep_assert_held_write(&vdev->memory_lock);
>>
>> This makes sure that the caller is holding vdev->memory_lock.
>>
>> +
>> +    list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
>> +        if (!dma_buf_try_get(priv->dmabuf))
>>
>> This here only works because vfio_pci_dma_buf_release() also grabs
>> vdev->memory_lock and so we are protected against concurrent releases.
>> +            continue;
>> +        if (priv->revoked != revoked) {
>> +            dma_resv_lock(priv->dmabuf->resv, NULL);
>> +            priv->revoked = revoked;
>> +            dma_buf_move_notify(priv->dmabuf);
>> +            dma_resv_unlock(priv->dmabuf->resv);
>> +        }
>> +        dma_buf_put(priv->dmabuf);
>>
>> The problem is now that this here might drop the last reference which in turn
>> calls vfio_pci_dma_buf_release() which also tries to grab
>> vdev->memory_lock and so results in a deadlock.
> AFAICS, vfio_pci_dma_buf_release() would not be called synchronously after the
> last reference is dropped by dma_buf_put(). This is because fput(), which is
> called by dma_buf_put(), triggers f_op->release() asynchronously; therefore, a
> deadlock is unlikely to occur in this scenario, unless I am overlooking something.

My recollection is that the f_op->release handler is only called 
asynchronously if fput() was issued in interrupt context.

But could be that this information is outdated.

Regards,
Christian.

>
> Thanks,
> Vivek
>
>> +    }
>> +}
>>
>> This pattern was suggested multiple times and I had to reject it every time
>> because the whole idea is just fundamentally broken.
>>
>> It's really astonishing how people always come up with the same broken
>> pattern.
>>
>> Regards,
>> Christian.
>>
>> 		Apart from that I have to reject the adding of
>> dma_buf_try_get(), that is clearly not something we should do.
>>
>>
>>
>> 	Understood. It appears that Vivek has confirmed that his v2 has
>> resolved the issue. I will follow up with him to determine if he plans to
>> resume his patch, and if so, I will apply my last patch on top of his updated
>> patch series
>>
>> 	Thanks,
>> 	Wei Lin
>>
>>
>> 		Thanks,
>> 		Christian.
>>
Simona Vetter Dec. 18, 2024, 10:44 a.m. UTC | #5
On Tue, Dec 17, 2024 at 10:53:32AM +0100, Christian König wrote:
> Am 16.12.24 um 17:54 schrieb Keith Busch:
> > On Mon, Dec 16, 2024 at 11:21:39AM +0100, Christian König wrote:
> > > Am 16.12.24 um 10:54 schrieb Wei Lin Guay:
> > > > From: Wei Lin Guay <wguay@meta.com>
> > > > However, as a general mechanism, it can support many other scenarios with
> > > > VFIO. I imagine this dmabuf approach to be usable by iommufd as well for
> > > > generic and safe P2P mappings.
> > > > 
> > > > This series goes after the "Break up ioctl dispatch functions to one
> > > > function per ioctl" series.
> > > Yeah that sounds like it should work.
> > > 
> > > But where is the rest of the series, I only see the cover letter?
> > Should be here:
> > 
> >    https://lore.kernel.org/linux-rdma/20241216095429.210792-2-wguay@fb.com/T/#u
> 
> Please send that out once more with me on explicit CC.
> 
> Apart from that I have to reject the adding of dma_buf_try_get(), that is
> clearly not something we should do.

Yeah if we do try_get it would need to be at least on a specific dma_buf
type (so checking for dma_buf->ops), since in general this does not work
(unless we add the general code in dma_buf.c somehow to make it work).
Aside from any other design concerns.
-Sima
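
As a minimal sketch of Sima's suggestion, assuming the dma_buf_try_get()
helper from this series and an ops table named vfio_pci_dmabuf_ops from
the unposted patches (both names are assumptions here):

/* Hedged sketch: restrict the try_get to buffers this driver exported
 * itself, by checking the exporter's ops, rather than offering a
 * generic helper in dma_buf.c. */
static struct dma_buf *vfio_pci_dma_buf_try_get(struct dma_buf *dmabuf)
{
	if (dmabuf->ops != &vfio_pci_dmabuf_ops)
		return NULL;
	return dma_buf_try_get(dmabuf);
}

This would keep the unsafe generic helper out of dma_buf.c while still
letting vfio-pci take an opportunistic reference on buffers it exported
itself.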
Kasireddy, Vivek Dec. 19, 2024, 7:02 a.m. UTC | #6
Hi Christian,

> Subject: Re: [PATCH 0/4] cover-letter: Allow MMIO regions to be exported
> through dmabuf
> 
> >>
> >>> 	I will resend the patch series. I was experiencing issues with my email
> >>> client, which inadvertently split the series into two separate emails.
> >>
> >> Alternatively I can also copy the code from the list archive and explain why
> >> that doesn't work:
> >>
> >> +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
> >> +{
> >> +    struct vfio_pci_dma_buf *priv;
> >> +    struct vfio_pci_dma_buf *tmp;
> >> +
> >> +    lockdep_assert_held_write(&vdev->memory_lock);
> >>
> >> This makes sure that the caller is holding vdev->memory_lock.
> >>
> >> +
> >> +    list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
> >> +        if (!dma_buf_try_get(priv->dmabuf))
> >>
> >> This here only works because vfio_pci_dma_buf_release() also grabs
> >> vdev->memory_lock and so we are protected against concurrent releases.
> >> +            continue;
> >> +        if (priv->revoked != revoked) {
> >> +            dma_resv_lock(priv->dmabuf->resv, NULL);
> >> +            priv->revoked = revoked;
> >> +            dma_buf_move_notify(priv->dmabuf);
> >> +            dma_resv_unlock(priv->dmabuf->resv);
> >> +        }
> >> +        dma_buf_put(priv->dmabuf);
> >>
> >> The problem is now that this here might drop the last reference which in turn
> >> calls vfio_pci_dma_buf_release() which also tries to grab
> >> vdev->memory_lock and so results in a deadlock.
> > AFAICS, vfio_pci_dma_buf_release() would not be called synchronously after the
> > last reference is dropped by dma_buf_put(). This is because fput(), which is
> > called by dma_buf_put(), triggers f_op->release() asynchronously; therefore, a
> > deadlock is unlikely to occur in this scenario, unless I am overlooking something.
> 
> My recollection is that the f_op->release handler is only called
> asynchronously if fput() was issued in interrupt context.
Here is the code of fput() from the current master:
void fput(struct file *file)
{
        if (file_ref_put(&file->f_ref)) {
                struct task_struct *task = current;

                if (unlikely(!(file->f_mode & (FMODE_BACKING | FMODE_OPENED)))) {
                        file_free(file);
                        return;
                }
                if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
                        init_task_work(&file->f_task_work, ____fput);
                        if (!task_work_add(task, &file->f_task_work, TWA_RESUME))
                                return;
                        /*
                         * After this task has run exit_task_work(),
                         * task_work_add() will fail.  Fall through to delayed
                         * fput to avoid leaking *file.
                         */
                }

                if (llist_add(&file->f_llist, &delayed_fput_list))
                        schedule_delayed_work(&delayed_fput_work, 1);
        }
}

IIUC, based on the above code, there are two ways in which the
f_op->release() handler is triggered from fput():
- via delayed_fput() for kernel threads and code in interrupt context
- via task_work_run() just before the task/process returns to user-mode

The first scenario is definitely asynchronous, as the release() handler is
called from a worker thread. But I think the second case (which is the most
common and relevant for our use-case) can also be considered asynchronous,
because by the time release() runs, the system call or ioctl in whose
context dma_buf_put() was called has already completed.

Here is a trace from my light testing with the udmabuf driver, where you can
see the release() handler being called by syscall_exit_to_user_mode() :
[  158.464203] Call Trace:
[  158.466681]  <TASK>
[  158.468815]  dump_stack_lvl+0x60/0x80
[  158.472507]  dump_stack+0x14/0x16
[  158.475853]  release_udmabuf+0x2f/0x9f
[  158.479631]  dma_buf_release+0x3c/0x90
[  158.483408]  __dentry_kill+0x8f/0x180
[  158.487098]  dput+0xe7/0x1a0
[  158.490013]  __fput+0x131/0x2b0
[  158.493178]  ____fput+0x19/0x20
[  158.496352]  task_work_run+0x61/0x90
[  158.499959]  syscall_exit_to_user_mode+0x1a4/0x1b0
[  158.504769]  do_syscall_64+0x5b/0x110
[  158.508458]  entry_SYSCALL_64_after_hwframe+0x4b/0x53

And, here is the relevant syscall code (from arch/x86/entry/common.c):
__visible noinstr bool do_syscall_64(struct pt_regs *regs, int nr)
{
        add_random_kstack_offset();
        nr = syscall_enter_from_user_mode(regs, nr);

        instrumentation_begin();

        if (!do_syscall_x64(regs, nr) && !do_syscall_x32(regs, nr) && nr != -1) {
                /* Invalid system call, but still a system call. */
                regs->ax = __x64_sys_ni_syscall(regs);
        }

        instrumentation_end();
        syscall_exit_to_user_mode(regs);

I also confirmed that the release() handler is indeed called after dma_buf_put()
(and not by dma_buf_put()) by adding debug prints before and after
dma_buf_put() and one in the release() handler. Furthermore, I also found
that calling close() on the dmabuf fd in the user-space is one scenario in
which fput() is called synchronously. Here is the relevant trace:
[  302.770910] Call Trace:
[  302.773389]  <TASK>
[  302.775516]  dump_stack_lvl+0x60/0x80
[  302.779209]  dump_stack+0x14/0x16
[  302.782549]  release_udmabuf+0x2f/0x9f
[  302.786329]  dma_buf_release+0x3c/0x90
[  302.790105]  __dentry_kill+0x8f/0x180
[  302.793789]  dput+0xe7/0x1a0
[  302.796703]  __fput+0x131/0x2b0
[  302.799866]  __fput_sync+0x53/0x70
[  302.803299]  __x64_sys_close+0x58/0xc0
[  302.807076]  x64_sys_call+0x126a/0x17d0
[  302.810938]  do_syscall_64+0x4f/0x110
[  302.814622]  entry_SYSCALL_64_after_hwframe+0x4b/0x53

As you can see above, there is indeed a synchronous version of fput() defined
just below fput():
/*
 * synchronous analog of fput(); for kernel threads that might be needed
 * in some umount() (and thus can't use flush_delayed_fput() without
 * risking deadlocks), need to wait for completion of __fput() and know
 * for this specific struct file it won't involve anything that would
 * need them.  Use only if you really need it - at the very least,
 * don't blindly convert fput() by kernel thread to that.
 */
void __fput_sync(struct file *file)
{
	if (file_ref_put(&file->f_ref))
		__fput(file);
}

Based on all the above, I think we can conclude that since dma_buf_put()
does not directly (or synchronously) call the f_op->release() handler, a
deadlock is unlikely to occur in the scenario you described.

Thanks,
Vivek

> 
> But could be that this information is outdated.
> 
> Regards,
> Christian.
> 
> >
> > Thanks,
> > Vivek
> >
> >> +    }
> >> +}
> >>
> >> This pattern was suggested multiple times and I had to reject it every time
> >> because the whole idea is just fundamentally broken.
> >>
> >> It's really astonishing how people always come up with the same broken
> >> pattern.
> >>
> >> Regards,
> >> Christian.
> >>
> >> 		Apart from that I have to reject the adding of
> >> dma_buf_try_get(), that is clearly not something we should do.
> >>
> >>
> >>
> >> 	Understood. It appears that Vivek has confirmed that his v2 has
> >> resolved the issue. I will follow up with him to determine if he plans to
> >> resume his patch, and if so, I will apply my last patch on top of his
> updated
> >> patch series
> >>
> >> 	Thanks,
> >> 	Wei Lin
> >>
> >>
> >> 		Thanks,
> >> 		Christian.
> >>
Christian König Dec. 19, 2024, 10:04 a.m. UTC | #7
Am 19.12.24 um 08:02 schrieb Kasireddy, Vivek:
> Hi Christian,
>
>> Subject: Re: [PATCH 0/4] cover-letter: Allow MMIO regions to be exported
>> through dmabuf
>>
>>>>> 	I will resend the patch series. I was experiencing issues with my email
>>>>> client, which inadvertently split the series into two separate emails.
>>>> Alternatively I can also copy the code from the list archive and explain why
>>>> that doesn't work:
>>>>
>>>> +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
>>>> +{
>>>> +    struct vfio_pci_dma_buf *priv;
>>>> +    struct vfio_pci_dma_buf *tmp;
>>>> +
>>>> +    lockdep_assert_held_write(&vdev->memory_lock);
>>>>
>>>> This makes sure that the caller is holding vdev->memory_lock.
>>>>
>>>> +
>>>> +    list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
>>>> +        if (!dma_buf_try_get(priv->dmabuf))
>>>>
>>>> This here only works because vfio_pci_dma_buf_release() also grabs
>>>> vdev->memory_lock and so we are protected against concurrent releases.
>>>> +            continue;
>>>> +        if (priv->revoked != revoked) {
>>>> +            dma_resv_lock(priv->dmabuf->resv, NULL);
>>>> +            priv->revoked = revoked;
>>>> +            dma_buf_move_notify(priv->dmabuf);
>>>> +            dma_resv_unlock(priv->dmabuf->resv);
>>>> +        }
>>>> +        dma_buf_put(priv->dmabuf);
>>>>
>>>> The problem is now that this here might drop the last reference which in turn
>>>> calls vfio_pci_dma_buf_release() which also tries to grab
>>>> vdev->memory_lock and so results in a deadlock.
>>> AFAICS, vfio_pci_dma_buf_release() would not be called synchronously after the
>>> last reference is dropped by dma_buf_put(). This is because fput(), which is
>>> called by dma_buf_put(), triggers f_op->release() asynchronously; therefore, a
>>> deadlock is unlikely to occur in this scenario, unless I am overlooking something.
>> My recollection is that the f_op->release handler is only called
>> asynchronously if fput() was issued in interrupt context.
> Here is the code of fput() from the current master:
> void fput(struct file *file)
> {
>          if (file_ref_put(&file->f_ref)) {
>                  struct task_struct *task = current;
>
>                  if (unlikely(!(file->f_mode & (FMODE_BACKING | FMODE_OPENED)))) {
>                          file_free(file);
>                          return;
>                  }
>                  if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
>                          init_task_work(&file->f_task_work, ____fput);
>                          if (!task_work_add(task, &file->f_task_work, TWA_RESUME))
>                                  return;
>                          /*
>                           * After this task has run exit_task_work(),
>                           * task_work_add() will fail.  Fall through to delayed
>                           * fput to avoid leaking *file.
>                           */
>                  }
>
>                  if (llist_add(&file->f_llist, &delayed_fput_list))
>                          schedule_delayed_work(&delayed_fput_work, 1);
>          }
> }
>
> IIUC, based on the above code, there are two ways in which the
> f_op->release() handler is triggered from fput():
> - via delayed_fput() for kernel threads and code in interrupt context
> - via task_work_run() just before the task/process returns to user-mode
>
> The first scenario is definitely asynchronous, as the release() handler is
> called from a worker thread. But I think the second case (which is the most
> common and relevant for our use-case) can also be considered asynchronous,
> because by the time release() runs, the system call or ioctl in whose
> context dma_buf_put() was called has already completed.
>
> Here is a trace from my light testing with the udmabuf driver, where you can
> see the release() handler being called by syscall_exit_to_user_mode() :
> [  158.464203] Call Trace:
> [  158.466681]  <TASK>
> [  158.468815]  dump_stack_lvl+0x60/0x80
> [  158.472507]  dump_stack+0x14/0x16
> [  158.475853]  release_udmabuf+0x2f/0x9f
> [  158.479631]  dma_buf_release+0x3c/0x90
> [  158.483408]  __dentry_kill+0x8f/0x180
> [  158.487098]  dput+0xe7/0x1a0
> [  158.490013]  __fput+0x131/0x2b0
> [  158.493178]  ____fput+0x19/0x20
> [  158.496352]  task_work_run+0x61/0x90
> [  158.499959]  syscall_exit_to_user_mode+0x1a4/0x1b0
> [  158.504769]  do_syscall_64+0x5b/0x110
> [  158.508458]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
>
> And, here is the relevant syscall code (from arch/x86/entry/common.c):
> __visible noinstr bool do_syscall_64(struct pt_regs *regs, int nr)
> {
>          add_random_kstack_offset();
>          nr = syscall_enter_from_user_mode(regs, nr);
>
>          instrumentation_begin();
>
>          if (!do_syscall_x64(regs, nr) && !do_syscall_x32(regs, nr) && nr != -1) {
>                  /* Invalid system call, but still a system call. */
>                  regs->ax = __x64_sys_ni_syscall(regs);
>          }
>
>          instrumentation_end();
>          syscall_exit_to_user_mode(regs);
>
> I also confirmed that the release() handler is indeed called after dma_buf_put()
> (and not by dma_buf_put()) by adding debug prints before and after
> dma_buf_put() and one in the release() handler. Furthermore, I also found
> that calling close() on the dmabuf fd in the user-space is one scenario in
> which fput() is called synchronously. Here is the relevant trace:
> [  302.770910] Call Trace:
> [  302.773389]  <TASK>
> [  302.775516]  dump_stack_lvl+0x60/0x80
> [  302.779209]  dump_stack+0x14/0x16
> [  302.782549]  release_udmabuf+0x2f/0x9f
> [  302.786329]  dma_buf_release+0x3c/0x90
> [  302.790105]  __dentry_kill+0x8f/0x180
> [  302.793789]  dput+0xe7/0x1a0
> [  302.796703]  __fput+0x131/0x2b0
> [  302.799866]  __fput_sync+0x53/0x70
> [  302.803299]  __x64_sys_close+0x58/0xc0
> [  302.807076]  x64_sys_call+0x126a/0x17d0
> [  302.810938]  do_syscall_64+0x4f/0x110
> [  302.814622]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
>
> As you can see above, there is indeed a synchronous version of fput() defined
> just below fput():
> /*
>   * synchronous analog of fput(); for kernel threads that might be needed
>   * in some umount() (and thus can't use flush_delayed_fput() without
>   * risking deadlocks), need to wait for completion of __fput() and know
>   * for this specific struct file it won't involve anything that would
>   * need them.  Use only if you really need it - at the very least,
>   * don't blindly convert fput() by kernel thread to that.
>   */
> void __fput_sync(struct file *file)
> {
> 	if (file_ref_put(&file->f_ref))
> 		__fput(file);
> }
>
> Based on all the above, I think we can conclude that since dma_buf_put()
> does not directly (or synchronously) call the f_op->release() handler, a
> deadlock is unlikely to occur in the scenario you described.

Yeah, I agree.

Interesting to know, I wasn't aware that the task_work functionality 
exists for that use case.

Thanks,
Christian.

>
> Thanks,
> Vivek
>
>> But could be that this information is outdated.
>>
>> Regards,
>> Christian.
>>
>>> Thanks,
>>> Vivek
>>>
>>>> +    }
>>>> +}
>>>>
>>>> This pattern was suggested multiple times and I had to rejected it every
>> time
>>>> because the whole idea is just fundamentally broken.
>>>>
>>>> It's really astonishing how people always come up with the same broken
>>>> pattern.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> 		Apart from that I have to reject the adding of
>>>> dma_buf_try_get(), that is clearly not something we should do.
>>>>
>>>>
>>>>
>>>> 	Understood. It appears that Vivek has confirmed that his v2 has
>>>> resolved the issue. I will follow up with him to determine if he plans to
>>>> resume his patch, and if so, I will apply my last patch on top of his
>> updated
>>>> patch series
>>>>
>>>> 	Thanks,
>>>> 	Wei Lin
>>>>
>>>>
>>>> 		Thanks,
>>>> 		Christian.
>>>>