| Message ID | 1671216640-157935-3-git-send-email-steven.sistare@oracle.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | fixes for virtual address update |
> From: Steve Sistare <steven.sistare@oracle.com>
> Sent: Saturday, December 17, 2022 2:51 AM
>
> When a vfio container is preserved across exec, the task does not change,
> but it gets a new mm with locked_vm=0. If the user later unmaps a dma
> mapping, locked_vm underflows to a large unsigned value, and a subsequent
> dma map request fails with ENOMEM in __account_locked_vm.
>
> To avoid underflow, grab and save the mm at the time a dma is mapped.
> Use that mm when adjusting locked_vm, rather than re-acquiring the saved
> task's mm, which may have changed. If the saved mm is dead, do nothing.

worth clarifying that locked_vm of the new mm is still not fixed.

> @@ -1664,15 +1666,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>	 * against the locked memory limit and we need to be able to do both
>	 * outside of this call path as pinning can be asynchronous via the
>	 * external interfaces for mdev devices. RLIMIT_MEMLOCK requires a
> -	 * task_struct and VM locked pages requires an mm_struct, however
> -	 * holding an indefinite mm reference is not recommended, therefore we
> -	 * only hold a reference to a task. We could hold a reference to
> -	 * current, however QEMU uses this call path through vCPU threads,
> -	 * which can be killed resulting in a NULL mm and failure in the unmap
> -	 * path when called via a different thread. Avoid this problem by
> -	 * using the group_leader as threads within the same group require
> -	 * both CLONE_THREAD and CLONE_VM and will therefore use the same
> -	 * mm_struct.
> +	 * task_struct and VM locked pages requires an mm_struct.

IMHO the rationale why choosing group_leader still applies...

otherwise this looks good to me:

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
On 12/19/2022 2:48 AM, Tian, Kevin wrote:
>> From: Steve Sistare <steven.sistare@oracle.com>
>> Sent: Saturday, December 17, 2022 2:51 AM
>>
>> When a vfio container is preserved across exec, the task does not change,
>> but it gets a new mm with locked_vm=0. If the user later unmaps a dma
>> mapping, locked_vm underflows to a large unsigned value, and a subsequent
>> dma map request fails with ENOMEM in __account_locked_vm.
>>
>> To avoid underflow, grab and save the mm at the time a dma is mapped.
>> Use that mm when adjusting locked_vm, rather than re-acquiring the saved
>> task's mm, which may have changed. If the saved mm is dead, do nothing.
>
> worth clarifying that locked_vm of the new mm is still not fixed.

Will do.

>> @@ -1664,15 +1666,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>	 * against the locked memory limit and we need to be able to do both
>>	 * outside of this call path as pinning can be asynchronous via the
>>	 * external interfaces for mdev devices. RLIMIT_MEMLOCK requires a
>> -	 * task_struct and VM locked pages requires an mm_struct, however
>> -	 * holding an indefinite mm reference is not recommended, therefore we
>> -	 * only hold a reference to a task. We could hold a reference to
>> -	 * current, however QEMU uses this call path through vCPU threads,
>> -	 * which can be killed resulting in a NULL mm and failure in the unmap
>> -	 * path when called via a different thread. Avoid this problem by
>> -	 * using the group_leader as threads within the same group require
>> -	 * both CLONE_THREAD and CLONE_VM and will therefore use the same
>> -	 * mm_struct.
>> +	 * task_struct and VM locked pages requires an mm_struct.
>
> IMHO the rationale why choosing group_leader still applies...

I don't see why it still applies. With the new code, we may save a reference
to current or current->group_leader, without error. "NULL mm and failure in
the unmap path" will not happen with mmgrab. task->signal->rlimit is shared,
so it does not matter which task we use, or whether the task is dead, as long
as one of the tasks lives, which is guaranteed by the mmget_not_zero() guard.
Am I missing something?

I kept current->group_leader for ease of debugging, so that all dma's are
owned by the same task.

- Steve

> otherwise this looks good to me:
>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
On Tue, 20 Dec 2022 10:01:21 -0500
Steven Sistare <steven.sistare@oracle.com> wrote:

> On 12/19/2022 2:48 AM, Tian, Kevin wrote:
> >> From: Steve Sistare <steven.sistare@oracle.com>
> >> Sent: Saturday, December 17, 2022 2:51 AM
> >> @@ -1664,15 +1666,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> >>	 * against the locked memory limit and we need to be able to do both
> >>	 * outside of this call path as pinning can be asynchronous via the
> >>	 * external interfaces for mdev devices. RLIMIT_MEMLOCK requires a
> >> -	 * task_struct and VM locked pages requires an mm_struct, however
> >> -	 * holding an indefinite mm reference is not recommended, therefore we
> >> -	 * only hold a reference to a task. We could hold a reference to
> >> -	 * current, however QEMU uses this call path through vCPU threads,
> >> -	 * which can be killed resulting in a NULL mm and failure in the unmap
> >> -	 * path when called via a different thread. Avoid this problem by
> >> -	 * using the group_leader as threads within the same group require
> >> -	 * both CLONE_THREAD and CLONE_VM and will therefore use the same
> >> -	 * mm_struct.
> >> +	 * task_struct and VM locked pages requires an mm_struct.
> >
> > IMHO the rationale why choosing group_leader still applies...
>
> I don't see why it still applies. With the new code, we may save a reference
> to current or current->group_leader, without error. "NULL mm and failure in
> the unmap path" will not happen with mmgrab. task->signal->rlimit is shared,
> so it does not matter which task we use, or whether the task is dead, as long
> as one of the tasks lives, which is guaranteed by the mmget_not_zero() guard.
> Am I missing something?
>
> I kept current->group_leader for ease of debugging, so that all dma's are
> owned by the same task.

That much at least would be a good comment to add since the above removes all
justification for why we're storing group_leader as the task rather than
current. Ex:

    QEMU typically calls this path through vCPU threads, which can
    terminate due to vCPU hotplug, therefore historically we've used
    group_leader for task tracking. With the addition of grabbing
    the mm for the life of the DMA tracking structure, this is not
    so much a concern, but we continue to use group_leader for
    debug'ability, ie. all DMA tracking is owned by the same task.

Given the upcoming holidays and reviewers starting to disappear, I suggest we
take this up as a fixes series for v6.2 after the new year. The remainder of
the vfio next branch for v6.2 has already been merged. Thanks,

Alex
On 12/20/2022 4:59 PM, Alex Williamson wrote:
> On Tue, 20 Dec 2022 10:01:21 -0500
> Steven Sistare <steven.sistare@oracle.com> wrote:
>
>> On 12/19/2022 2:48 AM, Tian, Kevin wrote:
>>>> From: Steve Sistare <steven.sistare@oracle.com>
>>>> Sent: Saturday, December 17, 2022 2:51 AM
>>>> @@ -1664,15 +1666,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>>>	 * against the locked memory limit and we need to be able to do both
>>>>	 * outside of this call path as pinning can be asynchronous via the
>>>>	 * external interfaces for mdev devices. RLIMIT_MEMLOCK requires a
>>>> -	 * task_struct and VM locked pages requires an mm_struct, however
>>>> -	 * holding an indefinite mm reference is not recommended, therefore we
>>>> -	 * only hold a reference to a task. We could hold a reference to
>>>> -	 * current, however QEMU uses this call path through vCPU threads,
>>>> -	 * which can be killed resulting in a NULL mm and failure in the unmap
>>>> -	 * path when called via a different thread. Avoid this problem by
>>>> -	 * using the group_leader as threads within the same group require
>>>> -	 * both CLONE_THREAD and CLONE_VM and will therefore use the same
>>>> -	 * mm_struct.
>>>> +	 * task_struct and VM locked pages requires an mm_struct.
>>>
>>> IMHO the rationale why choosing group_leader still applies...
>>
>> I don't see why it still applies. With the new code, we may save a reference
>> to current or current->group_leader, without error. "NULL mm and failure in
>> the unmap path" will not happen with mmgrab. task->signal->rlimit is shared,
>> so it does not matter which task we use, or whether the task is dead, as long
>> as one of the tasks lives, which is guaranteed by the mmget_not_zero() guard.
>> Am I missing something?
>>
>> I kept current->group_leader for ease of debugging, so that all dma's are
>> owned by the same task.
>
> That much at least would be a good comment to add since the above removes all
> justification for why we're storing group_leader as the task rather than
> current. Ex:
>
>     QEMU typically calls this path through vCPU threads, which can
>     terminate due to vCPU hotplug, therefore historically we've used
>     group_leader for task tracking. With the addition of grabbing
>     the mm for the life of the DMA tracking structure, this is not
>     so much a concern, but we continue to use group_leader for
>     debug'ability, ie. all DMA tracking is owned by the same task.
>
> Given the upcoming holidays and reviewers starting to disappear, I suggest we
> take this up as a fixes series for v6.2 after the new year. The remainder of
> the vfio next branch for v6.2 has already been merged. Thanks,

Will do - Steve
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 144f5bb..71f980b 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -100,6 +100,7 @@ struct vfio_dma {
 	struct task_struct	*task;
 	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
 	unsigned long		*bitmap;
+	struct mm_struct	*mm;
 };
 
 struct vfio_batch {
@@ -420,8 +421,8 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
 	if (!npage)
 		return 0;
 
-	mm = async ? get_task_mm(dma->task) : dma->task->mm;
-	if (!mm)
+	mm = dma->mm;
+	if (async && !mmget_not_zero(mm))
 		return -ESRCH; /* process exited */
 
 	ret = mmap_write_lock_killable(mm);
@@ -794,8 +795,8 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
 	struct mm_struct *mm;
 	int ret;
 
-	mm = get_task_mm(dma->task);
-	if (!mm)
+	mm = dma->mm;
+	if (!mmget_not_zero(mm))
 		return -ENODEV;
 
 	ret = vaddr_get_pfns(mm, vaddr, 1, dma->prot, pfn_base, pages);
@@ -1180,6 +1181,7 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
 	vfio_unmap_unpin(iommu, dma, true);
 	vfio_unlink_dma(iommu, dma);
 	put_task_struct(dma->task);
+	mmdrop(dma->mm);
 	vfio_dma_bitmap_free(dma);
 	if (dma->vaddr_invalid) {
 		iommu->vaddr_invalid_count--;
@@ -1664,15 +1666,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	 * against the locked memory limit and we need to be able to do both
 	 * outside of this call path as pinning can be asynchronous via the
 	 * external interfaces for mdev devices. RLIMIT_MEMLOCK requires a
-	 * task_struct and VM locked pages requires an mm_struct, however
-	 * holding an indefinite mm reference is not recommended, therefore we
-	 * only hold a reference to a task. We could hold a reference to
-	 * current, however QEMU uses this call path through vCPU threads,
-	 * which can be killed resulting in a NULL mm and failure in the unmap
-	 * path when called via a different thread. Avoid this problem by
-	 * using the group_leader as threads within the same group require
-	 * both CLONE_THREAD and CLONE_VM and will therefore use the same
-	 * mm_struct.
+	 * task_struct and VM locked pages requires an mm_struct.
 	 *
 	 * Previously we also used the task for testing CAP_IPC_LOCK at the
 	 * time of pinning and accounting, however has_capability() makes use
@@ -1687,6 +1681,8 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	get_task_struct(current->group_leader);
 	dma->task = current->group_leader;
 	dma->lock_cap = capable(CAP_IPC_LOCK);
+	dma->mm = current->mm;
+	mmgrab(dma->mm);
 
 	dma->pfn_list = RB_ROOT;
@@ -3122,9 +3118,8 @@ static int vfio_iommu_type1_dma_rw_chunk(struct vfio_iommu *iommu,
 	    !(dma->prot & IOMMU_READ))
 		return -EPERM;
 
-	mm = get_task_mm(dma->task);
-
-	if (!mm)
+	mm = dma->mm;
+	if (!mmget_not_zero(mm))
 		return -EPERM;
 
 	if (kthread)
When a vfio container is preserved across exec, the task does not change,
but it gets a new mm with locked_vm=0. If the user later unmaps a dma
mapping, locked_vm underflows to a large unsigned value, and a subsequent
dma map request fails with ENOMEM in __account_locked_vm.

To avoid underflow, grab and save the mm at the time a dma is mapped.
Use that mm when adjusting locked_vm, rather than re-acquiring the saved
task's mm, which may have changed. If the saved mm is dead, do nothing.

Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
Cc: stable@vger.kernel.org
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vfio/vfio_iommu_type1.c | 27 +++++++++++----------------
 1 file changed, 11 insertions(+), 16 deletions(-)