Message ID | 20241216095429.210792-1-wguay@fb.com (mailing list archive)
---|---
Series | cover-letter: Allow MMIO regions to be exported through dmabuf
Hi Christian,

Thanks again for your prompt response/review.

> On 17 Dec 2024, at 10:53, Christian König <christian.koenig@amd.com> wrote:
>
> On 16.12.24 at 17:54, Keith Busch wrote:
>> On Mon, Dec 16, 2024 at 11:21:39AM +0100, Christian König wrote:
>>> On 16.12.24 at 10:54, Wei Lin Guay wrote:
>>>> From: Wei Lin Guay <wguay@meta.com>
>>>> However, as a general mechanism, it can support many other scenarios with
>>>> VFIO. I imagine this dmabuf approach to be usable by iommufd as well for
>>>> generic and safe P2P mappings.
>>>>
>>>> This series goes after the "Break up ioctl dispatch functions to one
>>>> function per ioctl" series.
>>> Yeah that sounds like it should work.
>>>
>>> But where is the rest of the series, I only see the cover letter?
>> Should be here:
>>
>> https://lore.kernel.org/linux-rdma/20241216095429.210792-2-wguay@fb.com/T/#u
>
> Please send that out once more with me on explicit CC.

I will resend the patch series. I was experiencing issues with my email
client, which inadvertently split the series into two separate emails.

> Apart from that I have to reject the adding of dma_buf_try_get(), that is
> clearly not something we should do.

Understood. It appears that Vivek has confirmed that his v2 has resolved
the issue. I will follow up with him to determine whether he plans to
resume his patch, and if so, I will apply my last patch on top of his
updated patch series.

Thanks,
Wei Lin

> Thanks,
> Christian.
On 17.12.24 at 12:06, Wei Lin Guay wrote:
> [SNIP]
>> Please send that out once more with me on explicit CC.
> I will resend the patch series. I was experiencing issues with my email
> client, which inadvertently split the series into two separate emails.

Alternatively I can also copy the code from the list archive and explain
why it doesn't work:

+void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
+{
+	struct vfio_pci_dma_buf *priv;
+	struct vfio_pci_dma_buf *tmp;
+
+	lockdep_assert_held_write(&vdev->memory_lock);

This makes sure that the caller is holding vdev->memory_lock.

+
+	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
+		if (!dma_buf_try_get(priv->dmabuf))

This only works because vfio_pci_dma_buf_release() also grabs
vdev->memory_lock, so we are protected against concurrent releases.

+			continue;
+		if (priv->revoked != revoked) {
+			dma_resv_lock(priv->dmabuf->resv, NULL);
+			priv->revoked = revoked;
+			dma_buf_move_notify(priv->dmabuf);
+			dma_resv_unlock(priv->dmabuf->resv);
+		}
+		dma_buf_put(priv->dmabuf);

The problem is that this might drop the last reference, which in turn
calls vfio_pci_dma_buf_release(). That function also tries to grab
vdev->memory_lock and so results in a deadlock.

+	}
+}

This pattern was suggested multiple times and I had to reject it every
time because the whole idea is fundamentally broken.

It's really astonishing how people always come up with the same broken
pattern.

Regards,
Christian.

>
>> Apart from that I have to reject the adding of dma_buf_try_get(), that
>> is clearly not something we should do.
>
> Understood. It appears that Vivek has confirmed that his v2 has resolved
> the issue. I will follow up with him to determine whether he plans to
> resume his patch, and if so, I will apply my last patch on top of his
> updated patch series.
>
> Thanks,
> Wei Lin
>> Thanks,
>> Christian.
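To make the objection concrete, here is a minimal user-space sketch of the
pattern Christian describes. All names are hypothetical: a default pthread
mutex stands in for vdev->memory_lock, a plain counter for the dma_buf file
reference, and the release is assumed to run synchronously from the final
put, which is exactly the point disputed later in the thread.

	#include <pthread.h>
	#include <stdio.h>
	#include <stdlib.h>

	struct device {
		pthread_mutex_t memory_lock;	/* stands in for vdev->memory_lock */
	};

	struct buf {
		int refcount;			/* stands in for the dma_buf file reference */
		struct device *dev;
	};

	/* Models vfio_pci_dma_buf_release(): the release path takes memory_lock. */
	static void buf_release(struct buf *b)
	{
		/* Caller already holds this lock: undefined behaviour for a
		 * default (non-recursive) mutex, in practice a hang. */
		pthread_mutex_lock(&b->dev->memory_lock);
		/* ... unlink the buffer from the device's dmabuf list ... */
		pthread_mutex_unlock(&b->dev->memory_lock);
		free(b);
	}

	/* A put whose final drop runs release synchronously (the assumed behaviour). */
	static void buf_put(struct buf *b)
	{
		if (--b->refcount == 0)
			buf_release(b);
	}

	int main(void)
	{
		struct device dev;
		struct buf *b = malloc(sizeof(*b));

		pthread_mutex_init(&dev.memory_lock, NULL);
		b->refcount = 1;
		b->dev = &dev;

		/* Models vfio_pci_dma_buf_move(): walk the buffers with the lock held. */
		pthread_mutex_lock(&dev.memory_lock);
		buf_put(b);	/* drops the last reference: hangs inside buf_release() */
		pthread_mutex_unlock(&dev.memory_lock);

		printf("not reached when the release really is synchronous\n");
		return 0;
	}

With a default (non-recursive) mutex the final buf_put() hangs in
buf_release(), trying to retake a lock its own caller holds. Whether
dma_buf_put() really behaves like this synchronous buf_put() is what Vivek
examines below.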
Hi Christian,

> Subject: Re: [PATCH 0/4] cover-letter: Allow MMIO regions to be exported
> through dmabuf
>
>> I will resend the patch series. I was experiencing issues with my email
>> client, which inadvertently split the series into two separate emails.
>
> Alternatively I can also copy the code from the list archive and explain
> why it doesn't work:
>
> +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
> +{
> +	struct vfio_pci_dma_buf *priv;
> +	struct vfio_pci_dma_buf *tmp;
> +
> +	lockdep_assert_held_write(&vdev->memory_lock);
>
> This makes sure that the caller is holding vdev->memory_lock.
>
> +
> +	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
> +		if (!dma_buf_try_get(priv->dmabuf))
>
> This only works because vfio_pci_dma_buf_release() also grabs
> vdev->memory_lock, so we are protected against concurrent releases.
>
> +			continue;
> +		if (priv->revoked != revoked) {
> +			dma_resv_lock(priv->dmabuf->resv, NULL);
> +			priv->revoked = revoked;
> +			dma_buf_move_notify(priv->dmabuf);
> +			dma_resv_unlock(priv->dmabuf->resv);
> +		}
> +		dma_buf_put(priv->dmabuf);
>
> The problem is that this might drop the last reference, which in turn
> calls vfio_pci_dma_buf_release(). That function also tries to grab
> vdev->memory_lock and so results in a deadlock.

AFAICS, vfio_pci_dma_buf_release() would not be called synchronously after
the last reference is dropped by dma_buf_put(). This is because fput(),
which is called by dma_buf_put(), triggers f_op->release() asynchronously;
therefore, a deadlock is unlikely to occur in this scenario, unless I am
overlooking something.

Thanks,
Vivek

> +	}
> +}
>
> This pattern was suggested multiple times and I had to reject it every
> time because the whole idea is fundamentally broken.
>
> It's really astonishing how people always come up with the same broken
> pattern.
>
> Regards,
> Christian.
>
> [SNIP]
On 18.12.24 at 07:16, Kasireddy, Vivek wrote:
> Hi Christian,
>
>> Subject: Re: [PATCH 0/4] cover-letter: Allow MMIO regions to be exported
>> through dmabuf
>>
>> [SNIP]
>>
>> The problem is that this might drop the last reference, which in turn
>> calls vfio_pci_dma_buf_release(). That function also tries to grab
>> vdev->memory_lock and so results in a deadlock.
> AFAICS, vfio_pci_dma_buf_release() would not be called synchronously
> after the last reference is dropped by dma_buf_put(). This is because
> fput(), which is called by dma_buf_put(), triggers f_op->release()
> asynchronously; therefore, a deadlock is unlikely to occur in this
> scenario, unless I am overlooking something.

My recollection is that the f_op->release handler is only called
asynchronously if fput() was issued in interrupt context.

But could be that this information is outdated.

Regards,
Christian.
On Tue, Dec 17, 2024 at 10:53:32AM +0100, Christian König wrote:
> On 16.12.24 at 17:54, Keith Busch wrote:
> > On Mon, Dec 16, 2024 at 11:21:39AM +0100, Christian König wrote:
> > > On 16.12.24 at 10:54, Wei Lin Guay wrote:
> > > > From: Wei Lin Guay <wguay@meta.com>
> > > > However, as a general mechanism, it can support many other scenarios
> > > > with VFIO. I imagine this dmabuf approach to be usable by iommufd as
> > > > well for generic and safe P2P mappings.
> > > >
> > > > This series goes after the "Break up ioctl dispatch functions to one
> > > > function per ioctl" series.
> > > Yeah that sounds like it should work.
> > >
> > > But where is the rest of the series, I only see the cover letter?
> > Should be here:
> >
> > https://lore.kernel.org/linux-rdma/20241216095429.210792-2-wguay@fb.com/T/#u
>
> Please send that out once more with me on explicit CC.
>
> Apart from that I have to reject the adding of dma_buf_try_get(), that is
> clearly not something we should do.

Yeah, if we do try_get it would need to be at least on a specific dma_buf
type (so checking for dma_buf->ops), since in general this does not work
(unless we add the general code in dma_buf.c somehow to make it work).

Aside from any other design concerns.
-Sima
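A sketch of the type gating Sima hints at, in the same hypothetical
user-space terms as above (no such helper exists in dma_buf.c; buf_ops,
my_ops and buf_try_get are all made up for illustration): the try_get is
refused outright unless the buffer uses the caller's own ops, because only
for its own exporter does the caller know how release is serialized
against its locks.

	#include <stdbool.h>
	#include <stdio.h>

	/* Hypothetical stand-ins for struct dma_buf and its ops table. */
	struct buf_ops {
		const char *name;
	};

	struct buf {
		const struct buf_ops *ops;
		int refcount;		/* 0 means a release is already in flight */
	};

	static const struct buf_ops my_ops = { "vfio-pci" };	/* the ops "we" own */

	/*
	 * Opportunistic reference that is only attempted for buffers using
	 * our own ops: for foreign ops the caller cannot know how release()
	 * is serialized against its locks, so it must refuse.
	 */
	static bool buf_try_get(struct buf *b)
	{
		if (b->ops != &my_ops)
			return false;	/* foreign buffer: no safe try_get */
		if (b->refcount == 0)
			return false;	/* already headed for release */
		b->refcount++;
		return true;
	}

	int main(void)
	{
		const struct buf_ops other_ops = { "other-exporter" };
		struct buf mine = { &my_ops, 1 };
		struct buf foreign = { &other_ops, 1 };

		printf("own buffer:     %d\n", buf_try_get(&mine));	/* 1 */
		printf("foreign buffer: %d\n", buf_try_get(&foreign));	/* 0 */
		return 0;
	}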
Hi Christian,

> Subject: Re: [PATCH 0/4] cover-letter: Allow MMIO regions to be exported
> through dmabuf
>
> [SNIP]
>
>> AFAICS, vfio_pci_dma_buf_release() would not be called synchronously
>> after the last reference is dropped by dma_buf_put(). This is because
>> fput(), which is called by dma_buf_put(), triggers f_op->release()
>> asynchronously; therefore, a deadlock is unlikely to occur in this
>> scenario, unless I am overlooking something.
>
> My recollection is that the f_op->release handler is only called
> asynchronously if fput() was issued in interrupt context.

Here is the code of fput() from the current master:

void fput(struct file *file)
{
	if (file_ref_put(&file->f_ref)) {
		struct task_struct *task = current;

		if (unlikely(!(file->f_mode & (FMODE_BACKING | FMODE_OPENED)))) {
			file_free(file);
			return;
		}
		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
			init_task_work(&file->f_task_work, ____fput);
			if (!task_work_add(task, &file->f_task_work, TWA_RESUME))
				return;
			/*
			 * After this task has run exit_task_work(),
			 * task_work_add() will fail. Fall through to delayed
			 * fput to avoid leaking *file.
			 */
		}

		if (llist_add(&file->f_llist, &delayed_fput_list))
			schedule_delayed_work(&delayed_fput_work, 1);
	}
}

IIUC, based on the above code, there are two ways in which the
f_op->release() handler is triggered from fput():
- via delayed_fput() for kernel threads and code in interrupt context
- via task_work_run() just before the task/process returns to user mode

The first scenario is definitely asynchronous, as the release() handler is
called from a worker thread. But I think the second case (which is the
most common one and the relevant one for our use case) can also be
considered asynchronous, because the execution of the system call or ioctl
that led to the context in which dma_buf_put() was called has already
completed.
Here is a trace from my light testing with the udmabuf driver, where you
can see the release() handler being called by syscall_exit_to_user_mode():

[  158.464203] Call Trace:
[  158.466681]  <TASK>
[  158.468815]  dump_stack_lvl+0x60/0x80
[  158.472507]  dump_stack+0x14/0x16
[  158.475853]  release_udmabuf+0x2f/0x9f
[  158.479631]  dma_buf_release+0x3c/0x90
[  158.483408]  __dentry_kill+0x8f/0x180
[  158.487098]  dput+0xe7/0x1a0
[  158.490013]  __fput+0x131/0x2b0
[  158.493178]  ____fput+0x19/0x20
[  158.496352]  task_work_run+0x61/0x90
[  158.499959]  syscall_exit_to_user_mode+0x1a4/0x1b0
[  158.504769]  do_syscall_64+0x5b/0x110
[  158.508458]  entry_SYSCALL_64_after_hwframe+0x4b/0x53

And here is the relevant syscall code (from arch/x86/entry/common.c):

__visible noinstr bool do_syscall_64(struct pt_regs *regs, int nr)
{
	add_random_kstack_offset();
	nr = syscall_enter_from_user_mode(regs, nr);

	instrumentation_begin();

	if (!do_syscall_x64(regs, nr) && !do_syscall_x32(regs, nr) && nr != -1) {
		/* Invalid system call, but still a system call. */
		regs->ax = __x64_sys_ni_syscall(regs);
	}

	instrumentation_end();
	syscall_exit_to_user_mode(regs);

I also confirmed that the release() handler is indeed called after
dma_buf_put() (and not by dma_buf_put()) by adding debug prints before and
after dma_buf_put() and one in the release() handler. Furthermore, I also
found that calling close() on the dmabuf fd in user space is one scenario
in which fput() is called synchronously. Here is the relevant trace:

[  302.770910] Call Trace:
[  302.773389]  <TASK>
[  302.775516]  dump_stack_lvl+0x60/0x80
[  302.779209]  dump_stack+0x14/0x16
[  302.782549]  release_udmabuf+0x2f/0x9f
[  302.786329]  dma_buf_release+0x3c/0x90
[  302.790105]  __dentry_kill+0x8f/0x180
[  302.793789]  dput+0xe7/0x1a0
[  302.796703]  __fput+0x131/0x2b0
[  302.799866]  __fput_sync+0x53/0x70
[  302.803299]  __x64_sys_close+0x58/0xc0
[  302.807076]  x64_sys_call+0x126a/0x17d0
[  302.810938]  do_syscall_64+0x4f/0x110
[  302.814622]  entry_SYSCALL_64_after_hwframe+0x4b/0x53

As you can see above, there is indeed a synchronous version of fput(),
defined just below fput():

/*
 * synchronous analog of fput(); for kernel threads that might be needed
 * in some umount() (and thus can't use flush_delayed_fput() without
 * risking deadlocks), need to wait for completion of __fput() and know
 * for this specific struct file it won't involve anything that would
 * need them. Use only if you really need it - at the very least,
 * don't blindly convert fput() by kernel thread to that.
 */
void __fput_sync(struct file *file)
{
	if (file_ref_put(&file->f_ref))
		__fput(file);
}

Based on all the above, I think we can conclude that since dma_buf_put()
does not directly (or synchronously) call the f_op->release() handler, a
deadlock is unlikely to occur in the scenario you described.

Thanks,
Vivek

> But could be that this information is outdated.
>
> Regards,
> Christian.
>
> [SNIP]
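Vivek's argument can be modeled in the same hypothetical user-space terms
as the earlier deadlock sketch: the final put merely queues the release,
and the queued work runs only after the lock is dropped, mirroring how
fput() defers __fput() to task_work at syscall exit. All names here are
again made up for illustration.

	#include <pthread.h>
	#include <stdio.h>
	#include <stdlib.h>

	struct device {
		pthread_mutex_t memory_lock;
	};

	struct buf {
		int refcount;
		struct buf *next;	/* link for the deferred-release list */
		struct device *dev;
	};

	/* Stands in for the per-task task_work list that fput() appends to. */
	static struct buf *pending_release;

	static void buf_release(struct buf *b)
	{
		pthread_mutex_lock(&b->dev->memory_lock);	/* safe: lock no longer held */
		/* ... unlink the buffer from the device's dmabuf list ... */
		pthread_mutex_unlock(&b->dev->memory_lock);
		free(b);
	}

	/* Like fput(): the final put defers release instead of running it inline. */
	static void buf_put(struct buf *b)
	{
		if (--b->refcount == 0) {
			b->next = pending_release;
			pending_release = b;
		}
	}

	/* Like task_work_run() at syscall exit: run the deferred releases. */
	static void run_pending_release(void)
	{
		while (pending_release) {
			struct buf *b = pending_release;
			pending_release = b->next;
			buf_release(b);
		}
	}

	int main(void)
	{
		struct device dev;
		struct buf *b = malloc(sizeof(*b));

		pthread_mutex_init(&dev.memory_lock, NULL);
		b->refcount = 1;
		b->dev = &dev;

		pthread_mutex_lock(&dev.memory_lock);	/* the revoke path */
		buf_put(b);				/* only queues the release */
		pthread_mutex_unlock(&dev.memory_lock);

		run_pending_release();			/* "syscall exit": no deadlock */
		printf("released after the lock was dropped\n");
		return 0;
	}

Note the caveat visible in Vivek's second trace: close() reaches the
release handler via __fput_sync(), so this deferral guarantee covers
fput() itself, not every path that drops a file reference.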
On 19.12.24 at 08:02, Kasireddy, Vivek wrote:
> Hi Christian,
>
>> Subject: Re: [PATCH 0/4] cover-letter: Allow MMIO regions to be exported
>> through dmabuf
>>
>> [SNIP]
>>
>> My recollection is that the f_op->release handler is only called
>> asynchronously if fput() was issued in interrupt context.
> Here is the code of fput() from the current master:
>
> [SNIP]
>
> IIUC, based on the above code, there are two ways in which the
> f_op->release() handler is triggered from fput():
> - via delayed_fput() for kernel threads and code in interrupt context
> - via task_work_run() just before the task/process returns to user mode
>
> [SNIP]
>
> Based on all the above, I think we can conclude that since dma_buf_put()
> does not directly (or synchronously) call the f_op->release() handler, a
> deadlock is unlikely to occur in the scenario you described.

Yeah, I agree.

Interesting to know, I wasn't aware that the task_work functionality
exists for that use case.

Thanks,
Christian.