diff mbox series

[V6,2/7] vfio/type1: prevent underflow of locked_vm via exec()

Message ID 1671216640-157935-3-git-send-email-steven.sistare@oracle.com (mailing list archive)
State New, archived
Series: fixes for virtual address update

Commit Message

Steven Sistare Dec. 16, 2022, 6:50 p.m. UTC
When a vfio container is preserved across exec, the task does not change,
but it gets a new mm with locked_vm=0.  If the user later unmaps a dma
mapping, locked_vm underflows to a large unsigned value, and a subsequent
dma map request fails with ENOMEM in __account_locked_vm.

To avoid underflow, grab and save the mm at the time a dma is mapped.
Use that mm when adjusting locked_vm, rather than re-acquiring the saved
task's mm, which may have changed.  If the saved mm is dead, do nothing.

Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
Cc: stable@vger.kernel.org
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vfio/vfio_iommu_type1.c | 27 +++++++++++----------------
 1 file changed, 11 insertions(+), 16 deletions(-)

Comments

Tian, Kevin Dec. 19, 2022, 7:48 a.m. UTC | #1
> From: Steve Sistare <steven.sistare@oracle.com>
> Sent: Saturday, December 17, 2022 2:51 AM
> 
> When a vfio container is preserved across exec, the task does not change,
> but it gets a new mm with locked_vm=0.  If the user later unmaps a dma
> mapping, locked_vm underflows to a large unsigned value, and a subsequent
> dma map request fails with ENOMEM in __account_locked_vm.
> 
> To avoid underflow, grab and save the mm at the time a dma is mapped.
> Use that mm when adjusting locked_vm, rather than re-acquiring the saved
> task's mm, which may have changed.  If the saved mm is dead, do nothing.

worth clarifying that locked_vm of the new mm is still not fixed.

> @@ -1664,15 +1666,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  	 * against the locked memory limit and we need to be able to do both
>  	 * outside of this call path as pinning can be asynchronous via the
>  	 * external interfaces for mdev devices.  RLIMIT_MEMLOCK requires a
> -	 * task_struct and VM locked pages requires an mm_struct, however
> -	 * holding an indefinite mm reference is not recommended, therefore we
> -	 * only hold a reference to a task.  We could hold a reference to
> -	 * current, however QEMU uses this call path through vCPU threads,
> -	 * which can be killed resulting in a NULL mm and failure in the unmap
> -	 * path when called via a different thread.  Avoid this problem by
> -	 * using the group_leader as threads within the same group require
> -	 * both CLONE_THREAD and CLONE_VM and will therefore use the same
> -	 * mm_struct.
> +	 * task_struct and VM locked pages requires an mm_struct.

IMHO the rationale why choosing group_leader still applies...

otherwise this looks good to me:

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Steven Sistare Dec. 20, 2022, 3:01 p.m. UTC | #2
On 12/19/2022 2:48 AM, Tian, Kevin wrote:
>> From: Steve Sistare <steven.sistare@oracle.com>
>> Sent: Saturday, December 17, 2022 2:51 AM
>>
>> When a vfio container is preserved across exec, the task does not change,
>> but it gets a new mm with locked_vm=0.  If the user later unmaps a dma
>> mapping, locked_vm underflows to a large unsigned value, and a subsequent
>> dma map request fails with ENOMEM in __account_locked_vm.
>>
>> To avoid underflow, grab and save the mm at the time a dma is mapped.
>> Use that mm when adjusting locked_vm, rather than re-acquiring the saved
>> task's mm, which may have changed.  If the saved mm is dead, do nothing.
> 
> worth clarifying that locked_vm of the new mm is still not fixed.

Will do.

>> @@ -1664,15 +1666,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>  	 * against the locked memory limit and we need to be able to do both
>>  	 * outside of this call path as pinning can be asynchronous via the
>>  	 * external interfaces for mdev devices.  RLIMIT_MEMLOCK requires a
>> -	 * task_struct and VM locked pages requires an mm_struct, however
>> -	 * holding an indefinite mm reference is not recommended, therefore we
>> -	 * only hold a reference to a task.  We could hold a reference to
>> -	 * current, however QEMU uses this call path through vCPU threads,
>> -	 * which can be killed resulting in a NULL mm and failure in the unmap
>> -	 * path when called via a different thread.  Avoid this problem by
>> -	 * using the group_leader as threads within the same group require
>> -	 * both CLONE_THREAD and CLONE_VM and will therefore use the same
>> -	 * mm_struct.
>> +	 * task_struct and VM locked pages requires an mm_struct.
> 
> IMHO the rationale why choosing group_leader still applies...

I don't see why it still applies.  With the new code, we may save a reference 
to current or current->group_leader, without error.  "NULL mm and failure in the 
unmap path" will not happen with mmgrab.  task->signal->rlimit is shared, so it 
does not matter which task we use, or whether the task is dead, as long as 
one of the tasks lives, which is guaranteed by the mmget_not_zero() guard.  Am
I missing something?

I kept current->group_leader for ease of debugging, so that all DMAs are owned 
by the same task.

- Steve

> otherwise this looks good to me:
> 
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Alex Williamson Dec. 20, 2022, 9:59 p.m. UTC | #3
On Tue, 20 Dec 2022 10:01:21 -0500
Steven Sistare <steven.sistare@oracle.com> wrote:

> On 12/19/2022 2:48 AM, Tian, Kevin wrote:
> >> From: Steve Sistare <steven.sistare@oracle.com>
> >> Sent: Saturday, December 17, 2022 2:51 AM
> >> @@ -1664,15 +1666,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> >>  	 * against the locked memory limit and we need to be able to do both
> >>  	 * outside of this call path as pinning can be asynchronous via the
> >>  	 * external interfaces for mdev devices.  RLIMIT_MEMLOCK requires a
> >> -	 * task_struct and VM locked pages requires an mm_struct, however
> >> -	 * holding an indefinite mm reference is not recommended, therefore we
> >> -	 * only hold a reference to a task.  We could hold a reference to
> >> -	 * current, however QEMU uses this call path through vCPU threads,
> >> -	 * which can be killed resulting in a NULL mm and failure in the unmap
> >> -	 * path when called via a different thread.  Avoid this problem by
> >> -	 * using the group_leader as threads within the same group require
> >> -	 * both CLONE_THREAD and CLONE_VM and will therefore use the same
> >> -	 * mm_struct.
> >> +	 * task_struct and VM locked pages requires an mm_struct.
> > 
> > IMHO the rationale why choosing group_leader still applies...  
> 
> I don't see why it still applies.  With the new code, we may save a reference 
> to current or current->group_leader, without error.  "NULL mm and failure in the 
> unmap path" will not happen with mmgrab.  task->signal->rlimit is shared, so it 
> does not matter which task we use, or whether the task is dead, as long as 
> one of the tasks lives, which is guaranteed by the mmget_not_zero() guard.  Am
> I missing something?
> 
> I kept current->group_leader for ease of debugging, so that all dma's are owned 
> by the same task.

That much at least would be a good comment to add since the above
removes all justification for why we're storing group_leader as the
task rather than current.  Ex:

	QEMU typically calls this path through vCPU threads, which can
	terminate due to vCPU hotplug, therefore historically we've used
	group_leader for task tracking.  With the addition of grabbing
	the mm for the life of the DMA tracking structure, this is not
	so much a concern, but we continue to use group_leader for
	debug'ability, ie. all DMA tracking is owned by the same task.

Given the upcoming holidays and reviewers starting to disappear, I
suggest we take this up as a fixes series for v6.2 after the new year.
The remainder of the vfio next branch for v6.2 has already been merged.
Thanks,

Alex
Steven Sistare Dec. 20, 2022, 10:06 p.m. UTC | #4
On 12/20/2022 4:59 PM, Alex Williamson wrote:
> On Tue, 20 Dec 2022 10:01:21 -0500
> Steven Sistare <steven.sistare@oracle.com> wrote:
> 
>> On 12/19/2022 2:48 AM, Tian, Kevin wrote:
>>>> From: Steve Sistare <steven.sistare@oracle.com>
>>>> Sent: Saturday, December 17, 2022 2:51 AM
>>>> @@ -1664,15 +1666,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>>>  	 * against the locked memory limit and we need to be able to do both
>>>>  	 * outside of this call path as pinning can be asynchronous via the
>>>>  	 * external interfaces for mdev devices.  RLIMIT_MEMLOCK requires a
>>>> -	 * task_struct and VM locked pages requires an mm_struct, however
>>>> -	 * holding an indefinite mm reference is not recommended, therefore we
>>>> -	 * only hold a reference to a task.  We could hold a reference to
>>>> -	 * current, however QEMU uses this call path through vCPU threads,
>>>> -	 * which can be killed resulting in a NULL mm and failure in the unmap
>>>> -	 * path when called via a different thread.  Avoid this problem by
>>>> -	 * using the group_leader as threads within the same group require
>>>> -	 * both CLONE_THREAD and CLONE_VM and will therefore use the same
>>>> -	 * mm_struct.
>>>> +	 * task_struct and VM locked pages requires an mm_struct.
>>>
>>> IMHO the rationale why choosing group_leader still applies...  
>>
>> I don't see why it still applies.  With the new code, we may save a reference 
>> to current or current->group_leader, without error.  "NULL mm and failure in the 
>> unmap path" will not happen with mmgrab.  task->signal->rlimit is shared, so it 
>> does not matter which task we use, or whether the task is dead, as long as 
>> one of the tasks lives, which is guaranteed by the mmget_not_zero() guard.  Am
>> I missing something?
>>
>> I kept current->group_leader for ease of debugging, so that all dma's are owned 
>> by the same task.
> 
> That much at least would be a good comment to add since the above
> removes all justification for why we're storing group_leader as the
> task rather than current.  Ex:
> 
> 	QEMU typically calls this path through vCPU threads, which can
> 	terminate due to vCPU hotplug, therefore historically we've used
> 	group_leader for task tracking.  With the addition of grabbing
> 	the mm for the life of the DMA tracking structure, this is not
> 	so much a concern, but we continue to use group_leader for
> 	debug'ability, ie. all DMA tracking is owned by the same task.
> 
> Given the upcoming holidays and reviewers starting to disappear, I
> suggest we take this up as a fixes series for v6.2 after the new year.
> The remainder of the vfio next branch for v6.2 has already been merged.
> Thanks,

Will do - Steve

Patch

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 144f5bb..71f980b 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -100,6 +100,7 @@  struct vfio_dma {
 	struct task_struct	*task;
 	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
 	unsigned long		*bitmap;
+	struct mm_struct	*mm;
 };
 
 struct vfio_batch {
@@ -420,8 +421,8 @@  static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
 	if (!npage)
 		return 0;
 
-	mm = async ? get_task_mm(dma->task) : dma->task->mm;
-	if (!mm)
+	mm = dma->mm;
+	if (async && !mmget_not_zero(mm))
 		return -ESRCH; /* process exited */
 
 	ret = mmap_write_lock_killable(mm);
@@ -794,8 +795,8 @@  static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
 	struct mm_struct *mm;
 	int ret;
 
-	mm = get_task_mm(dma->task);
-	if (!mm)
+	mm = dma->mm;
+	if (!mmget_not_zero(mm))
 		return -ENODEV;
 
 	ret = vaddr_get_pfns(mm, vaddr, 1, dma->prot, pfn_base, pages);
@@ -1180,6 +1181,7 @@  static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
 	vfio_unmap_unpin(iommu, dma, true);
 	vfio_unlink_dma(iommu, dma);
 	put_task_struct(dma->task);
+	mmdrop(dma->mm);
 	vfio_dma_bitmap_free(dma);
 	if (dma->vaddr_invalid) {
 		iommu->vaddr_invalid_count--;
@@ -1664,15 +1666,7 @@  static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	 * against the locked memory limit and we need to be able to do both
 	 * outside of this call path as pinning can be asynchronous via the
 	 * external interfaces for mdev devices.  RLIMIT_MEMLOCK requires a
-	 * task_struct and VM locked pages requires an mm_struct, however
-	 * holding an indefinite mm reference is not recommended, therefore we
-	 * only hold a reference to a task.  We could hold a reference to
-	 * current, however QEMU uses this call path through vCPU threads,
-	 * which can be killed resulting in a NULL mm and failure in the unmap
-	 * path when called via a different thread.  Avoid this problem by
-	 * using the group_leader as threads within the same group require
-	 * both CLONE_THREAD and CLONE_VM and will therefore use the same
-	 * mm_struct.
+	 * task_struct and VM locked pages requires an mm_struct.
 	 *
 	 * Previously we also used the task for testing CAP_IPC_LOCK at the
 	 * time of pinning and accounting, however has_capability() makes use
@@ -1687,6 +1681,8 @@  static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	get_task_struct(current->group_leader);
 	dma->task = current->group_leader;
 	dma->lock_cap = capable(CAP_IPC_LOCK);
+	dma->mm = current->mm;
+	mmgrab(dma->mm);
 
 	dma->pfn_list = RB_ROOT;
 
@@ -3122,9 +3118,8 @@  static int vfio_iommu_type1_dma_rw_chunk(struct vfio_iommu *iommu,
 			!(dma->prot & IOMMU_READ))
 		return -EPERM;
 
-	mm = get_task_mm(dma->task);
-
-	if (!mm)
+	mm = dma->mm;
+	if (!mmget_not_zero(mm))
 		return -EPERM;
 
 	if (kthread)