
vfio: remove useless judgement

Message ID 20220627035109.73745-1-lizhe.67@bytedance.com (mailing list archive)
State New, archived
Series: vfio: remove useless judgement

Commit Message

lizhe.67@bytedance.com June 27, 2022, 3:51 a.m. UTC
From: Li Zhe <lizhe.67@bytedance.com>

In vfio_dma_do_unmap(), we currently prevent a process from unmapping
a vfio dma region whose mm_struct differs from that of vfio_dma->task.
In our virtual machine scenario, which uses kvm and qemu, this check
blocks live-upgrading our qemu: the upgrade uses fork() && exec() to
load the new binary, but the new process cannot issue
VFIO_IOMMU_UNMAP_DMA during vm exit because of this check.

This check was added in commit 8f0d5bb95f76 ("vfio iommu type1: Add
task structure to vfio_dma") for security reasons. However, due to
resource isolation, no task unrelated to the old and new processes can
reach the same vfio_dma struct here, so this patch deletes the check.

Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
Reviewed-by: Jason Gunthorpe <jgg@ziepe.ca>
---
 drivers/vfio/vfio_iommu_type1.c | 6 ------
 1 file changed, 6 deletions(-)

Comments

Alex Williamson June 27, 2022, 10:06 p.m. UTC | #1
Hey Steve, how did you get around this for cpr or is this a gap?
Thanks,

Alex

On Mon, 27 Jun 2022 11:51:09 +0800
lizhe.67@bytedance.com wrote:

> From: Li Zhe <lizhe.67@bytedance.com>
> 
> In function vfio_dma_do_unmap(), we currently prevent process to unmap
> vfio dma region whose mm_struct is different from the vfio_dma->task.
> In our virtual machine scenario which is using kvm and qemu, this
> judgement stops us from liveupgrading our qemu, which uses fork() &&
> exec() to load the new binary but the new process cannot do the
> VFIO_IOMMU_UNMAP_DMA action during vm exit because of this judgement.
>
> This judgement is added in commit 8f0d5bb95f76 ("vfio iommu type1: Add
> task structure to vfio_dma") for the security reason. But it seems that
> no other task who has no family relationship with old and new process
> can get the same vfio_dma struct here for the reason of resource
> isolation. So this patch delete it.
> 
> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> Reviewed-by: Jason Gunthorpe <jgg@ziepe.ca>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 6 ------
>  1 file changed, 6 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index c13b9290e357..a8ff00dad834 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -1377,12 +1377,6 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  
>  		if (!iommu->v2 && iova > dma->iova)
>  			break;
> -		/*
> -		 * Task with same address space who mapped this iova range is
> -		 * allowed to unmap the iova range.
> -		 */
> -		if (dma->task->mm != current->mm)
> -			break;
>  
>  		if (invalidate_vaddr) {
>  			if (dma->vaddr_invalid) {
Steven Sistare June 28, 2022, 12:48 p.m. UTC | #2
For cpr, old qemu directly exec's new qemu, so task does not change.

To support fork+exec, the ownership test needs to be deleted or modified.

Pinned page accounting is another issue, as the parent counts pins in its
mm->locked_vm.  If the child unmaps, it cannot simply decrement its own
mm->locked_vm counter.  As you and I have discussed, the count is also 
wrong in the direct exec model, because exec clears mm->locked_vm.  I am 
thinking vfio could count pins in struct user locked_vm to handle both 
models.  The user struct and its count would persist across direct exec,
and be shared by parent and child for fork+exec.  However, that does change
the RLIMIT_MEMLOCK value that applications must set, because the limit must
accommodate vfio plus other sub-systems that count in user->locked_vm, which
includes io_uring, skbuff, xdp, and perf.  Plus, the limit must accommodate all
processes of that user, not just a single process.

Folks like fork+exec because it allows recovery if the new qemu process fails
to initialize: one can fall back to the original process, once the above issues
are fixed.

- Steve

On 6/27/2022 6:06 PM, Alex Williamson wrote:
> 
> Hey Steve, how did you get around this for cpr or is this a gap?
> Thanks,
> 
> Alex
> 
> On Mon, 27 Jun 2022 11:51:09 +0800
> lizhe.67@bytedance.com wrote:
> 
>> From: Li Zhe <lizhe.67@bytedance.com>
>>
>> In function vfio_dma_do_unmap(), we currently prevent process to unmap
>> vfio dma region whose mm_struct is different from the vfio_dma->task.
>> In our virtual machine scenario which is using kvm and qemu, this
>> judgement stops us from liveupgrading our qemu, which uses fork() &&
>> exec() to load the new binary but the new process cannot do the
>> VFIO_IOMMU_UNMAP_DMA action during vm exit because of this judgement.
>>
>> This judgement is added in commit 8f0d5bb95f76 ("vfio iommu type1: Add
>> task structure to vfio_dma") for the security reason. But it seems that
>> no other task who has no family relationship with old and new process
>> can get the same vfio_dma struct here for the reason of resource
>> isolation. So this patch delete it.
>>
>> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
>> Reviewed-by: Jason Gunthorpe <jgg@ziepe.ca>
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 6 ------
>>  1 file changed, 6 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index c13b9290e357..a8ff00dad834 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -1377,12 +1377,6 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>  
>>  		if (!iommu->v2 && iova > dma->iova)
>>  			break;
>> -		/*
>> -		 * Task with same address space who mapped this iova range is
>> -		 * allowed to unmap the iova range.
>> -		 */
>> -		if (dma->task->mm != current->mm)
>> -			break;
>>  
>>  		if (invalidate_vaddr) {
>>  			if (dma->vaddr_invalid) {
>
Jason Gunthorpe June 28, 2022, 1:03 p.m. UTC | #3
On Tue, Jun 28, 2022 at 08:48:11AM -0400, Steven Sistare wrote:
> For cpr, old qemu directly exec's new qemu, so task does not change.
> 
> To support fork+exec, the ownership test needs to be deleted or modified.
> 
> Pinned page accounting is another issue, as the parent counts pins in its
> mm->locked_vm.  If the child unmaps, it cannot simply decrement its own
> mm->locked_vm counter.

It is fine already:


	mm = async ? get_task_mm(dma->task) : dma->task->mm;
	if (!mm)
		return -ESRCH; /* process exited */

	ret = mmap_write_lock_killable(mm);
	if (!ret) {
		ret = __account_locked_vm(mm, abs(npage), npage > 0, dma->task,
					  dma->lock_cap);

Each 'dma' already stores a pointer to the mm that sourced it and only
manipulates the counter in that mm. AFAICT 'current' is not used
during unmap.

> As you and I have discussed, the count is also wrong in the direct
> exec model, because exec clears mm->locked_vm.

Really? Yikes, I thought exec would generate a new mm?

> I am thinking vfio could count pins in struct user locked_vm to handle both 
> models.  The user struct and its count would persist across direct exec,
> and be shared by parent and child for fork+exec.  However, that does change
> the RLIMIT_MEMLOCK value that applications must set, because the limit must
> accommodate vfio plus other sub-systems that count in user->locked_vm, which
> includes io_uring, skbuff, xdp, and perf.  Plus, the limit must accommodate all
> processes of that user, not just a single process.

We discussed this, for iommufd we are currently planning to go this
way and will See How it Goes.

Jason
Steven Sistare June 28, 2022, 1:54 p.m. UTC | #4
On 6/28/2022 9:03 AM, Jason Gunthorpe wrote:
> On Tue, Jun 28, 2022 at 08:48:11AM -0400, Steven Sistare wrote:
>> For cpr, old qemu directly exec's new qemu, so task does not change.
>>
>> To support fork+exec, the ownership test needs to be deleted or modified.
>>
>> Pinned page accounting is another issue, as the parent counts pins in its
>> mm->locked_vm.  If the child unmaps, it cannot simply decrement its own
>> mm->locked_vm counter.
> 
> It is fine already:
> 
> 	mm = async ? get_task_mm(dma->task) : dma->task->mm;
> 	if (!mm)
> 		return -ESRCH; /* process exited */
> 
> 	ret = mmap_write_lock_killable(mm);
> 	if (!ret) {
> 		ret = __account_locked_vm(mm, abs(npage), npage > 0, dma->task,
> 					  dma->lock_cap);
> 
> Each 'dma' already stores a pointer to the mm that sourced it and only
> manipulates the counter in that mm. AFAICT 'current' is not used
> during unmap.

Ah yes, no problem then.
Limits become looser, though, as the child can pin an additional RLIMIT_MEMLOCK's
worth of pages.  That is the natural consequence of mm->locked_vm being a
per-process count, but probably not what the application wants.  Another
argument for switching to user->locked_vm.

>> As you and I have discussed, the count is also wrong in the direct
>> exec model, because exec clears mm->locked_vm.
> 
> Really? Yikes, I thought exec would generate a new mm?

Yes, exec creates a new mm with locked_vm = 0.  The old locked_vm count is dropped
on the floor.  The existing dma points to the same task, but task->mm has changed,
and dma->task->mm->locked_vm is 0.  An unmap ioctl drives it negative.

I have prototyped a few possible fixes.  One changes vfio to use user->locked_vm.
Another changes to mm->pinned_vm and preserves it during exec.  A third preserves
mm->locked_vm across exec, but that is not practical, because mm->locked_vm mixes
vfio pins and mlocks.  The mlock component must be cleared during exec, and we don't 
have a separate count for it.

>> I am thinking vfio could count pins in struct user locked_vm to handle both 
>> models.  The user struct and its count would persist across direct exec,
>> and be shared by parent and child for fork+exec.  However, that does change
>> the RLIMIT_MEMLOCK value that applications must set, because the limit must
>> accommodate vfio plus other sub-systems that count in user->locked_vm, which
>> includes io_uring, skbuff, xdp, and perf.  Plus, the limit must accommodate all
>> processes of that user, not just a single process.
> 
> We discussed this, for iommufd we are currently planning to go this
> way and will See How it Goes.

Yes, I have followed that thread with interest.

- Steve
Jason Gunthorpe June 28, 2022, 2:04 p.m. UTC | #5
On Tue, Jun 28, 2022 at 09:54:19AM -0400, Steven Sistare wrote:

> >> As you and I have discussed, the count is also wrong in the direct
> >> exec model, because exec clears mm->locked_vm.
> > 
> > Really? Yikes, I thought exec would generate a new mm?
> 
> Yes, exec creates a new mm with locked_vm = 0.  The old locked_vm count is dropped
> on the floor.  The existing dma points to the same task, but task->mm has changed,
> and dma->task->mm->locked_vm is 0.  An unmap ioctl drives it
> negative.

Oh.. This is probably a bug, vfio should never use task->mm, the mm
itself should be held using mmgrab instead.

Otherwise exec case is broken as you describe.

> I have prototyped a few possible fixes.  One changes vfio to use user->locked_vm.
> Another changes to mm->pinned_vm and preserves it during exec.  A third preserves
> mm->locked_vm across exec, but that is not practical, because mm->locked_vm mixes
> vfio pins and mlocks.  The mlock component must be cleared during exec, and we don't 
> have a separate count for it.

Losing locked_vm on exec/fork is the correct and expected behavior
for the core kernel code; the bug is that vfio drives it negative.

Jason
Alex Williamson June 30, 2022, 7:51 p.m. UTC | #6
On Mon, 27 Jun 2022 11:51:09 +0800
lizhe.67@bytedance.com wrote:

> From: Li Zhe <lizhe.67@bytedance.com>
> 
> In function vfio_dma_do_unmap(), we currently prevent process to unmap
> vfio dma region whose mm_struct is different from the vfio_dma->task.
> In our virtual machine scenario which is using kvm and qemu, this
> judgement stops us from liveupgrading our qemu, which uses fork() &&
> exec() to load the new binary but the new process cannot do the
> VFIO_IOMMU_UNMAP_DMA action during vm exit because of this judgement.
> 
> This judgement is added in commit 8f0d5bb95f76 ("vfio iommu type1: Add
> task structure to vfio_dma") for the security reason. But it seems that
> no other task who has no family relationship with old and new process
> can get the same vfio_dma struct here for the reason of resource
> isolation. So this patch delete it.
> 
> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> Reviewed-by: Jason Gunthorpe <jgg@ziepe.ca>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 6 ------
>  1 file changed, 6 deletions(-)

Applied to vfio next branch for v5.20.  Thanks,

Alex

> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index c13b9290e357..a8ff00dad834 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -1377,12 +1377,6 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  
>  		if (!iommu->v2 && iova > dma->iova)
>  			break;
> -		/*
> -		 * Task with same address space who mapped this iova range is
> -		 * allowed to unmap the iova range.
> -		 */
> -		if (dma->task->mm != current->mm)
> -			break;
>  
>  		if (invalidate_vaddr) {
>  			if (dma->vaddr_invalid) {

Patch

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index c13b9290e357..a8ff00dad834 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1377,12 +1377,6 @@  static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 
 		if (!iommu->v2 && iova > dma->iova)
 			break;
-		/*
-		 * Task with same address space who mapped this iova range is
-		 * allowed to unmap the iova range.
-		 */
-		if (dma->task->mm != current->mm)
-			break;
 
 		if (invalidate_vaddr) {
 			if (dma->vaddr_invalid) {