
[V7,2/7] vfio/type1: prevent underflow of locked_vm via exec()

Message ID 1671568765-297322-3-git-send-email-steven.sistare@oracle.com (mailing list archive)
State New, archived
Series: fixes for virtual address update

Commit Message

Steven Sistare Dec. 20, 2022, 8:39 p.m. UTC
When a vfio container is preserved across exec, the task does not change,
but it gets a new mm with locked_vm=0, and loses the count from existing
dma mappings.  If the user later unmaps a dma mapping, locked_vm underflows
to a large unsigned value, and a subsequent dma map request fails with
ENOMEM in __account_locked_vm.

To avoid underflow, grab and save the mm at the time a dma is mapped.
Use that mm when adjusting locked_vm, rather than re-acquiring the saved
task's mm, which may have changed.  If the saved mm is dead, do nothing.

locked_vm is incremented for existing mappings in a subsequent patch.

Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
Cc: stable@vger.kernel.org
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
 drivers/vfio/vfio_iommu_type1.c | 27 +++++++++++----------------
 1 file changed, 11 insertions(+), 16 deletions(-)
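
For readers less familiar with the mm refcounting helpers involved, the
lifecycle the patch sets up looks roughly like this (a simplified sketch
using the helpers named in the diff below, not the literal patch code):

    /* map: pin the mm_struct itself so the pointer stays valid even
     * after the process exits (mmgrab takes an mm_count reference) */
    dma->mm = current->mm;
    mmgrab(dma->mm);

    /* accounting: only touch locked_vm while the address space is still
     * alive; mmget_not_zero() fails once mm_users has dropped to zero */
    if (mmget_not_zero(dma->mm)) {
            /* ... __account_locked_vm() under mmap_write_lock ... */
            mmput(dma->mm);
    }

    /* unmap/remove: release the pinning reference */
    mmdrop(dma->mm);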

Comments

Jason Gunthorpe Jan. 3, 2023, 3:20 p.m. UTC | #1
On Tue, Dec 20, 2022 at 12:39:20PM -0800, Steve Sistare wrote:
> When a vfio container is preserved across exec, the task does not change,
> but it gets a new mm with locked_vm=0, and loses the count from existing
> dma mappings.  If the user later unmaps a dma mapping, locked_vm underflows
> to a large unsigned value, and a subsequent dma map request fails with
> ENOMEM in __account_locked_vm.
> 
> To avoid underflow, grab and save the mm at the time a dma is mapped.
> Use that mm when adjusting locked_vm, rather than re-acquiring the saved
> task's mm, which may have changed.  If the saved mm is dead, do nothing.
> 
> locked_vm is incremented for existing mappings in a subsequent patch.
> 
> Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
> Cc: stable@vger.kernel.org
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 27 +++++++++++----------------
>  1 file changed, 11 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 144f5bb..71f980b 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -100,6 +100,7 @@ struct vfio_dma {
>  	struct task_struct	*task;
>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
>  	unsigned long		*bitmap;
> +	struct mm_struct	*mm;
>  };
>  
>  struct vfio_batch {
> @@ -420,8 +421,8 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>  	if (!npage)
>  		return 0;
>  
> -	mm = async ? get_task_mm(dma->task) : dma->task->mm;
> -	if (!mm)
> +	mm = dma->mm;
> +	if (async && !mmget_not_zero(mm))
>  		return -ESRCH; /* process exited */

Just delete the async, the lock_acct always acts on the dma which
always has a singular mm.

Fix the few callers that need it to do the mmget_not_zero() before
calling in.

> @@ -794,8 +795,8 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
>  	struct mm_struct *mm;
>  	int ret;
>  
> -	mm = get_task_mm(dma->task);
> -	if (!mm)
> +	mm = dma->mm;
> +	if (!mmget_not_zero(mm))
>  		return -ENODEV;

eg things get all confused here where we have the mmget_not_zero

But then we go ahead and call vfio_lock_acct() with true

Jason
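
One way to read this suggestion (a sketch of the caller-side pattern it
implies, assuming the dma->mm field added by this patch; this is not code
from the thread):

    /* asynchronous call sites would take the mm reference themselves */
    if (!mmget_not_zero(dma->mm))
            return -ESRCH;                   /* process exited */
    ret = vfio_lock_acct(dma, -unlocked);    /* no async flag */
    mmput(dma->mm);

This drops the flag from vfio_lock_acct(), but duplicates the get/put at
every asynchronous call site, which is the objection raised in the reply
below.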
Steven Sistare Jan. 3, 2023, 6:12 p.m. UTC | #2
On 1/3/2023 10:20 AM, Jason Gunthorpe wrote:
> On Tue, Dec 20, 2022 at 12:39:20PM -0800, Steve Sistare wrote:
>> When a vfio container is preserved across exec, the task does not change,
>> but it gets a new mm with locked_vm=0, and loses the count from existing
>> dma mappings.  If the user later unmaps a dma mapping, locked_vm underflows
>> to a large unsigned value, and a subsequent dma map request fails with
>> ENOMEM in __account_locked_vm.
>>
>> To avoid underflow, grab and save the mm at the time a dma is mapped.
>> Use that mm when adjusting locked_vm, rather than re-acquiring the saved
>> task's mm, which may have changed.  If the saved mm is dead, do nothing.
>>
>> locked_vm is incremented for existing mappings in a subsequent patch.
>>
>> Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 27 +++++++++++----------------
>>  1 file changed, 11 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index 144f5bb..71f980b 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -100,6 +100,7 @@ struct vfio_dma {
>>  	struct task_struct	*task;
>>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
>>  	unsigned long		*bitmap;
>> +	struct mm_struct	*mm;
>>  };
>>  
>>  struct vfio_batch {
>> @@ -420,8 +421,8 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>>  	if (!npage)
>>  		return 0;
>>  
>> -	mm = async ? get_task_mm(dma->task) : dma->task->mm;
>> -	if (!mm)
>> +	mm = dma->mm;
>> +	if (async && !mmget_not_zero(mm))
>>  		return -ESRCH; /* process exited */
> 
> Just delete the async, the lock_acct always acts on the dma which
> always has a singular mm.
> 
> FIx the few callers that need it to do the mmget_no_zero() before
> calling in.

Most of the callers pass async=true:
  ret = vfio_lock_acct(dma, lock_acct, false);
  vfio_lock_acct(dma, locked - unlocked, true);
  ret = vfio_lock_acct(dma, 1, true);
  vfio_lock_acct(dma, -unlocked, true);
  vfio_lock_acct(dma, -1, true);
  vfio_lock_acct(dma, -unlocked, true);
  ret = mm_lock_acct(task, mm, lock_cap, npage, false);
  mm_lock_acct(dma->task, dma->mm, dma->lock_cap, -npage, true);
  vfio_lock_acct(dma, locked - unlocked, true);

Hoisting mmget_not_zero and mmput will make many call sites messier.

Alex, what is your opinion?

>> @@ -794,8 +795,8 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
>>  	struct mm_struct *mm;
>>  	int ret;
>>  
>> -	mm = get_task_mm(dma->task);
>> -	if (!mm)
>> +	mm = dma->mm;
>> +	if (!mmget_not_zero(mm))
>>  		return -ENODEV;
> 
> eg things get all confused here where we have the mmget_not_zero
> 
> But then we go ahead and call vfio_lock_acct() with true

Yes, I should pass false at this call site to optimize.

- Steve
Jason Gunthorpe Jan. 3, 2023, 7:20 p.m. UTC | #3
On Tue, Jan 03, 2023 at 01:12:53PM -0500, Steven Sistare wrote:
> On 1/3/2023 10:20 AM, Jason Gunthorpe wrote:
> > On Tue, Dec 20, 2022 at 12:39:20PM -0800, Steve Sistare wrote:
> >> When a vfio container is preserved across exec, the task does not change,
> >> but it gets a new mm with locked_vm=0, and loses the count from existing
> >> dma mappings.  If the user later unmaps a dma mapping, locked_vm underflows
> >> to a large unsigned value, and a subsequent dma map request fails with
> >> ENOMEM in __account_locked_vm.
> >>
> >> To avoid underflow, grab and save the mm at the time a dma is mapped.
> >> Use that mm when adjusting locked_vm, rather than re-acquiring the saved
> >> task's mm, which may have changed.  If the saved mm is dead, do nothing.
> >>
> >> locked_vm is incremented for existing mappings in a subsequent patch.
> >>
> >> Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
> >> Cc: stable@vger.kernel.org
> >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> >> ---
> >>  drivers/vfio/vfio_iommu_type1.c | 27 +++++++++++----------------
> >>  1 file changed, 11 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >> index 144f5bb..71f980b 100644
> >> --- a/drivers/vfio/vfio_iommu_type1.c
> >> +++ b/drivers/vfio/vfio_iommu_type1.c
> >> @@ -100,6 +100,7 @@ struct vfio_dma {
> >>  	struct task_struct	*task;
> >>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
> >>  	unsigned long		*bitmap;
> >> +	struct mm_struct	*mm;
> >>  };
> >>  
> >>  struct vfio_batch {
> >> @@ -420,8 +421,8 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
> >>  	if (!npage)
> >>  		return 0;
> >>  
> >> -	mm = async ? get_task_mm(dma->task) : dma->task->mm;
> >> -	if (!mm)
> >> +	mm = dma->mm;
> >> +	if (async && !mmget_not_zero(mm))
> >>  		return -ESRCH; /* process exited */
> > 
> > Just delete the async, the lock_acct always acts on the dma which
> > always has a singular mm.
> > 
> > FIx the few callers that need it to do the mmget_no_zero() before
> > calling in.
> 
> Most of the callers pass async=true:
>   ret = vfio_lock_acct(dma, lock_acct, false);
>   vfio_lock_acct(dma, locked - unlocked, true);
>   ret = vfio_lock_acct(dma, 1, true);
>   vfio_lock_acct(dma, -unlocked, true);
>   vfio_lock_acct(dma, -1, true);
>   vfio_lock_acct(dma, -unlocked, true);
>   ret = mm_lock_acct(task, mm, lock_cap, npage, false);
>   mm_lock_acct(dma->task, dma->mm, dma->lock_cap, -npage, true);
>   vfio_lock_acct(dma, locked - unlocked, true);

Seems like if you make a lock_sub_acct() function that does the -1*
and does the mmget it will be OK?

Jason
Steven Sistare Jan. 6, 2023, 3:14 p.m. UTC | #4
On 1/3/2023 2:20 PM, Jason Gunthorpe wrote:
> On Tue, Jan 03, 2023 at 01:12:53PM -0500, Steven Sistare wrote:
>> On 1/3/2023 10:20 AM, Jason Gunthorpe wrote:
>>> On Tue, Dec 20, 2022 at 12:39:20PM -0800, Steve Sistare wrote:
>>>> When a vfio container is preserved across exec, the task does not change,
>>>> but it gets a new mm with locked_vm=0, and loses the count from existing
>>>> dma mappings.  If the user later unmaps a dma mapping, locked_vm underflows
>>>> to a large unsigned value, and a subsequent dma map request fails with
>>>> ENOMEM in __account_locked_vm.
>>>>
>>>> To avoid underflow, grab and save the mm at the time a dma is mapped.
>>>> Use that mm when adjusting locked_vm, rather than re-acquiring the saved
>>>> task's mm, which may have changed.  If the saved mm is dead, do nothing.
>>>>
>>>> locked_vm is incremented for existing mappings in a subsequent patch.
>>>>
>>>> Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
>>>> Cc: stable@vger.kernel.org
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
>>>> ---
>>>>  drivers/vfio/vfio_iommu_type1.c | 27 +++++++++++----------------
>>>>  1 file changed, 11 insertions(+), 16 deletions(-)
>>>>
>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>>> index 144f5bb..71f980b 100644
>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>> @@ -100,6 +100,7 @@ struct vfio_dma {
>>>>  	struct task_struct	*task;
>>>>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
>>>>  	unsigned long		*bitmap;
>>>> +	struct mm_struct	*mm;
>>>>  };
>>>>  
>>>>  struct vfio_batch {
>>>> @@ -420,8 +421,8 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>>>>  	if (!npage)
>>>>  		return 0;
>>>>  
>>>> -	mm = async ? get_task_mm(dma->task) : dma->task->mm;
>>>> -	if (!mm)
>>>> +	mm = dma->mm;
>>>> +	if (async && !mmget_not_zero(mm))
>>>>  		return -ESRCH; /* process exited */
>>>
>>> Just delete the async, the lock_acct always acts on the dma which
>>> always has a singular mm.
>>>
>>> FIx the few callers that need it to do the mmget_no_zero() before
>>> calling in.
>>
>> Most of the callers pass async=true:
>>   ret = vfio_lock_acct(dma, lock_acct, false);
>>   vfio_lock_acct(dma, locked - unlocked, true);
>>   ret = vfio_lock_acct(dma, 1, true);
>>   vfio_lock_acct(dma, -unlocked, true);
>>   vfio_lock_acct(dma, -1, true);
>>   vfio_lock_acct(dma, -unlocked, true);
>>   ret = mm_lock_acct(task, mm, lock_cap, npage, false);
>>   mm_lock_acct(dma->task, dma->mm, dma->lock_cap, -npage, true);
>>   vfio_lock_acct(dma, locked - unlocked, true);
> 
> Seems like if you make a lock_sub_acct() function that does the -1*
> and does the mmget it will be OK?

Do you mean, provide two versions of vfio_lock_acct?  Simplified:

    vfio_lock_acct()
    {
        mm_lock_acct()
        dma->locked_vm += npage;
    }

    vfio_lock_acct_async()
    {
        mmget_not_zero(dma->mm)

        mm_lock_acct()
        dma->locked_vm += npage;

        mmput(dma->mm);
    }

If so, I will factor this into a separate patch, since it is cosmetic, so stable
can omit it.

- Steve
Jason Gunthorpe Jan. 9, 2023, 1:52 p.m. UTC | #5
On Fri, Jan 06, 2023 at 10:14:57AM -0500, Steven Sistare wrote:
> On 1/3/2023 2:20 PM, Jason Gunthorpe wrote:
> > On Tue, Jan 03, 2023 at 01:12:53PM -0500, Steven Sistare wrote:
> >> On 1/3/2023 10:20 AM, Jason Gunthorpe wrote:
> >>> On Tue, Dec 20, 2022 at 12:39:20PM -0800, Steve Sistare wrote:
> >>>> When a vfio container is preserved across exec, the task does not change,
> >>>> but it gets a new mm with locked_vm=0, and loses the count from existing
> >>>> dma mappings.  If the user later unmaps a dma mapping, locked_vm underflows
> >>>> to a large unsigned value, and a subsequent dma map request fails with
> >>>> ENOMEM in __account_locked_vm.
> >>>>
> >>>> To avoid underflow, grab and save the mm at the time a dma is mapped.
> >>>> Use that mm when adjusting locked_vm, rather than re-acquiring the saved
> >>>> task's mm, which may have changed.  If the saved mm is dead, do nothing.
> >>>>
> >>>> locked_vm is incremented for existing mappings in a subsequent patch.
> >>>>
> >>>> Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
> >>>> Cc: stable@vger.kernel.org
> >>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >>>> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> >>>> ---
> >>>>  drivers/vfio/vfio_iommu_type1.c | 27 +++++++++++----------------
> >>>>  1 file changed, 11 insertions(+), 16 deletions(-)
> >>>>
> >>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >>>> index 144f5bb..71f980b 100644
> >>>> --- a/drivers/vfio/vfio_iommu_type1.c
> >>>> +++ b/drivers/vfio/vfio_iommu_type1.c
> >>>> @@ -100,6 +100,7 @@ struct vfio_dma {
> >>>>  	struct task_struct	*task;
> >>>>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
> >>>>  	unsigned long		*bitmap;
> >>>> +	struct mm_struct	*mm;
> >>>>  };
> >>>>  
> >>>>  struct vfio_batch {
> >>>> @@ -420,8 +421,8 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
> >>>>  	if (!npage)
> >>>>  		return 0;
> >>>>  
> >>>> -	mm = async ? get_task_mm(dma->task) : dma->task->mm;
> >>>> -	if (!mm)
> >>>> +	mm = dma->mm;
> >>>> +	if (async && !mmget_not_zero(mm))
> >>>>  		return -ESRCH; /* process exited */
> >>>
> >>> Just delete the async, the lock_acct always acts on the dma which
> >>> always has a singular mm.
> >>>
> >>> FIx the few callers that need it to do the mmget_no_zero() before
> >>> calling in.
> >>
> >> Most of the callers pass async=true:
> >>   ret = vfio_lock_acct(dma, lock_acct, false);
> >>   vfio_lock_acct(dma, locked - unlocked, true);
> >>   ret = vfio_lock_acct(dma, 1, true);
> >>   vfio_lock_acct(dma, -unlocked, true);
> >>   vfio_lock_acct(dma, -1, true);
> >>   vfio_lock_acct(dma, -unlocked, true);
> >>   ret = mm_lock_acct(task, mm, lock_cap, npage, false);
> >>   mm_lock_acct(dma->task, dma->mm, dma->lock_cap, -npage, true);
> >>   vfio_lock_acct(dma, locked - unlocked, true);
> > 
> > Seems like if you make a lock_sub_acct() function that does the -1*
> > and does the mmget it will be OK?
> 
> Do you mean, provide two versions of vfio_lock_acct?  Simplified:
> 
>     vfio_lock_acct()
>     {
>         mm_lock_acct()
>         dma->locked_vm += npage;
>     }
> 
>     vfio_lock_acct_async()
>     {
>         mmget_not_zero(dma->mm)
> 
>         mm_lock_acct()
>         dma->locked_vm += npage;
> 
>         mmput(dma->mm);
>     }

I was thinking more like 

vfio_lock_acct_subtract()
         mmget_not_zero(dma->mm)
	 mm->locked_vm -= npage

Jason
Steven Sistare Jan. 9, 2023, 3:54 p.m. UTC | #6
On 1/9/2023 8:52 AM, Jason Gunthorpe wrote:
> On Fri, Jan 06, 2023 at 10:14:57AM -0500, Steven Sistare wrote:
>> On 1/3/2023 2:20 PM, Jason Gunthorpe wrote:
>>> On Tue, Jan 03, 2023 at 01:12:53PM -0500, Steven Sistare wrote:
>>>> On 1/3/2023 10:20 AM, Jason Gunthorpe wrote:
>>>>> On Tue, Dec 20, 2022 at 12:39:20PM -0800, Steve Sistare wrote:
>>>>>> When a vfio container is preserved across exec, the task does not change,
>>>>>> but it gets a new mm with locked_vm=0, and loses the count from existing
>>>>>> dma mappings.  If the user later unmaps a dma mapping, locked_vm underflows
>>>>>> to a large unsigned value, and a subsequent dma map request fails with
>>>>>> ENOMEM in __account_locked_vm.
>>>>>>
>>>>>> To avoid underflow, grab and save the mm at the time a dma is mapped.
>>>>>> Use that mm when adjusting locked_vm, rather than re-acquiring the saved
>>>>>> task's mm, which may have changed.  If the saved mm is dead, do nothing.
>>>>>>
>>>>>> locked_vm is incremented for existing mappings in a subsequent patch.
>>>>>>
>>>>>> Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
>>>>>> Cc: stable@vger.kernel.org
>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
>>>>>> ---
>>>>>>  drivers/vfio/vfio_iommu_type1.c | 27 +++++++++++----------------
>>>>>>  1 file changed, 11 insertions(+), 16 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>>>>> index 144f5bb..71f980b 100644
>>>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>>>> @@ -100,6 +100,7 @@ struct vfio_dma {
>>>>>>  	struct task_struct	*task;
>>>>>>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
>>>>>>  	unsigned long		*bitmap;
>>>>>> +	struct mm_struct	*mm;
>>>>>>  };
>>>>>>  
>>>>>>  struct vfio_batch {
>>>>>> @@ -420,8 +421,8 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>>>>>>  	if (!npage)
>>>>>>  		return 0;
>>>>>>  
>>>>>> -	mm = async ? get_task_mm(dma->task) : dma->task->mm;
>>>>>> -	if (!mm)
>>>>>> +	mm = dma->mm;
>>>>>> +	if (async && !mmget_not_zero(mm))
>>>>>>  		return -ESRCH; /* process exited */
>>>>>
>>>>> Just delete the async, the lock_acct always acts on the dma which
>>>>> always has a singular mm.
>>>>>
>>>>> FIx the few callers that need it to do the mmget_no_zero() before
>>>>> calling in.
>>>>
>>>> Most of the callers pass async=true:
>>>>   ret = vfio_lock_acct(dma, lock_acct, false);
>>>>   vfio_lock_acct(dma, locked - unlocked, true);
>>>>   ret = vfio_lock_acct(dma, 1, true);
>>>>   vfio_lock_acct(dma, -unlocked, true);
>>>>   vfio_lock_acct(dma, -1, true);
>>>>   vfio_lock_acct(dma, -unlocked, true);
>>>>   ret = mm_lock_acct(task, mm, lock_cap, npage, false);
>>>>   mm_lock_acct(dma->task, dma->mm, dma->lock_cap, -npage, true);
>>>>   vfio_lock_acct(dma, locked - unlocked, true);
>>>
>>> Seems like if you make a lock_sub_acct() function that does the -1*
>>> and does the mmget it will be OK?
>>
>> Do you mean, provide two versions of vfio_lock_acct?  Simplified:
>>
>>     vfio_lock_acct()
>>     {
>>         mm_lock_acct()
>>         dma->locked_vm += npage;
>>     }
>>
>>     vfio_lock_acct_async()
>>     {
>>         mmget_not_zero(dma->mm)
>>
>>         mm_lock_acct()
>>         dma->locked_vm += npage;
>>
>>         mmput(dma->mm);
>>     }
> 
> I was thinking more like 
> 
> vfio_lock_acct_subtract()
>        mmget_not_zero(dma->mm)
> 	 mm->locked_vm -= npage 
         ^^^^^^
Is this shorthand for open coding __account_locked_vm?  If so, we are
essentially saying the same thing.  My function vfio_lock_acct_async calls 
mm_lock_acct which calls __account_locked_vm.

But, your vfio_lock_acct_subtract does not call mmput, so maybe I still don't
grok your suggestion.

FWIW here are my functions with all error checking:

static int mm_lock_acct(struct task_struct *task, struct mm_struct *mm,
                        bool lock_cap, long npage)
{
        int ret = mmap_write_lock_killable(mm);

        if (!ret) {
                ret = __account_locked_vm(mm, abs(npage), npage > 0, task,
                                          lock_cap);
                mmap_write_unlock(mm);
        }

        return ret;
}

static int vfio_lock_acct(struct vfio_dma *dma, long npage)
{
        int ret;

        if (!npage)
                return 0;

        ret = mm_lock_acct(dma->task, dma->mm, dma->lock_cap, npage);
        if (!ret)
                dma->locked_vm += npage;

        return ret;
}

static int vfio_lock_acct_async(struct vfio_dma *dma, long npage)
{
        int ret;

        if (!npage)
                return 0;

        if (!mmget_not_zero(dma->mm))
                return -ESRCH; /* process exited */

        ret = mm_lock_acct(dma->task, dma->mm, dma->lock_cap, npage);
        if (!ret)
                dma->locked_vm += npage;

        mmput(dma->mm);

        return ret;
}

- Steve
Steven Sistare Jan. 9, 2023, 9:16 p.m. UTC | #7
On 1/9/2023 10:54 AM, Steven Sistare wrote:
> On 1/9/2023 8:52 AM, Jason Gunthorpe wrote:
>> On Fri, Jan 06, 2023 at 10:14:57AM -0500, Steven Sistare wrote:
>>> On 1/3/2023 2:20 PM, Jason Gunthorpe wrote:
>>>> On Tue, Jan 03, 2023 at 01:12:53PM -0500, Steven Sistare wrote:
>>>>> On 1/3/2023 10:20 AM, Jason Gunthorpe wrote:
>>>>>> On Tue, Dec 20, 2022 at 12:39:20PM -0800, Steve Sistare wrote:
>>>>>>> When a vfio container is preserved across exec, the task does not change,
>>>>>>> but it gets a new mm with locked_vm=0, and loses the count from existing
>>>>>>> dma mappings.  If the user later unmaps a dma mapping, locked_vm underflows
>>>>>>> to a large unsigned value, and a subsequent dma map request fails with
>>>>>>> ENOMEM in __account_locked_vm.
>>>>>>>
>>>>>>> To avoid underflow, grab and save the mm at the time a dma is mapped.
>>>>>>> Use that mm when adjusting locked_vm, rather than re-acquiring the saved
>>>>>>> task's mm, which may have changed.  If the saved mm is dead, do nothing.
>>>>>>>
>>>>>>> locked_vm is incremented for existing mappings in a subsequent patch.
>>>>>>>
>>>>>>> Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
>>>>>>> Cc: stable@vger.kernel.org
>>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>>> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
>>>>>>> ---
>>>>>>>  drivers/vfio/vfio_iommu_type1.c | 27 +++++++++++----------------
>>>>>>>  1 file changed, 11 insertions(+), 16 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>>>>>> index 144f5bb..71f980b 100644
>>>>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>>>>> @@ -100,6 +100,7 @@ struct vfio_dma {
>>>>>>>  	struct task_struct	*task;
>>>>>>>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
>>>>>>>  	unsigned long		*bitmap;
>>>>>>> +	struct mm_struct	*mm;
>>>>>>>  };
>>>>>>>  
>>>>>>>  struct vfio_batch {
>>>>>>> @@ -420,8 +421,8 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>>>>>>>  	if (!npage)
>>>>>>>  		return 0;
>>>>>>>  
>>>>>>> -	mm = async ? get_task_mm(dma->task) : dma->task->mm;
>>>>>>> -	if (!mm)
>>>>>>> +	mm = dma->mm;
>>>>>>> +	if (async && !mmget_not_zero(mm))
>>>>>>>  		return -ESRCH; /* process exited */
>>>>>>
>>>>>> Just delete the async, the lock_acct always acts on the dma which
>>>>>> always has a singular mm.
>>>>>>
>>>>>> FIx the few callers that need it to do the mmget_no_zero() before
>>>>>> calling in.
>>>>>
>>>>> Most of the callers pass async=true:
>>>>>   ret = vfio_lock_acct(dma, lock_acct, false);
>>>>>   vfio_lock_acct(dma, locked - unlocked, true);
>>>>>   ret = vfio_lock_acct(dma, 1, true);
>>>>>   vfio_lock_acct(dma, -unlocked, true);
>>>>>   vfio_lock_acct(dma, -1, true);
>>>>>   vfio_lock_acct(dma, -unlocked, true);
>>>>>   ret = mm_lock_acct(task, mm, lock_cap, npage, false);
>>>>>   mm_lock_acct(dma->task, dma->mm, dma->lock_cap, -npage, true);
>>>>>   vfio_lock_acct(dma, locked - unlocked, true);
>>>>
>>>> Seems like if you make a lock_sub_acct() function that does the -1*
>>>> and does the mmget it will be OK?
>>>
>>> Do you mean, provide two versions of vfio_lock_acct?  Simplified:
>>>
>>>     vfio_lock_acct()
>>>     {
>>>         mm_lock_acct()
>>>         dma->locked_vm += npage;
>>>     }
>>>
>>>     vfio_lock_acct_async()
>>>     {
>>>         mmget_not_zero(dma->mm)
>>>
>>>         mm_lock_acct()
>>>         dma->locked_vm += npage;
>>>
>>>         mmput(dma->mm);
>>>     }
>>
>> I was thinking more like 
>>
>> vfio_lock_acct_subtract()
>>        mmget_not_zero(dma->mm)
>> 	 mm->locked_vm -= npage 
>          ^^^^^^
> Is this shorthand for open coding __account_locked_vm?  If so, we are
> essentially saying the same thing.  My function vfio_lock_acct_async calls 
> mm_lock_acct which calls __account_locked_vm.
> 
> But, your vfio_lock_acct_subtract does not call mmput, so maybe I still don't
> grok your suggestion.
> 
> FWIW here are my functions with all error checking:
> 
> static int mm_lock_acct(struct task_struct *task, struct mm_struct *mm,
>                         bool lock_cap, long npage)
> {
>         int ret = mmap_write_lock_killable(mm);
> 
>         if (!ret) {
>                 ret = __account_locked_vm(mm, abs(npage), npage > 0, task,
>                                           lock_cap);
>                 mmap_write_unlock(mm);
>         }
> 
>         return ret;
> }
> 
> static int vfio_lock_acct(struct vfio_dma *dma, long npage)
> {
>         int ret;
> 
>         if (!npage)
>                 return 0;
> 
>         ret = mm_lock_acct(dma->task, dma->mm, dma->lock_cap, npage);
>         if (!ret)
>                 dma->locked_vm += npage;
> 
>         return ret;
> }
> 
> static int vfio_lock_acct_async(struct vfio_dma *dma, long npage)
> {
>         int ret;
> 
>         if (!npage)
>                 return 0;
> 
>         if (!mmget_not_zero(dma->mm))
>                 return -ESRCH; /* process exited */
> 
>         ret = mm_lock_acct(dma->task, dma->mm, dma->lock_cap, npage);
>         if (!ret)
>                 dma->locked_vm += npage;
> 
>         mmput(dma->mm);
> 
>         return ret;
> }

Let's leave the async arg and vfio_lock_acct as is.  We are over-polishing a small
style issue, in pre-existing code, in a soon-to-be dead-end code base.

- Steve
Jason Gunthorpe Jan. 10, 2023, 3:02 p.m. UTC | #8
On Mon, Jan 09, 2023 at 04:16:18PM -0500, Steven Sistare wrote:

> Let's leave the async arg and vfio_lock_acct as is.  We are over-polishing a small
> style issue, in pre-existing code, in a soon-to-be dead-end code base.

fine

Jason

Patch

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 144f5bb..71f980b 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -100,6 +100,7 @@  struct vfio_dma {
 	struct task_struct	*task;
 	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
 	unsigned long		*bitmap;
+	struct mm_struct	*mm;
 };
 
 struct vfio_batch {
@@ -420,8 +421,8 @@  static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
 	if (!npage)
 		return 0;
 
-	mm = async ? get_task_mm(dma->task) : dma->task->mm;
-	if (!mm)
+	mm = dma->mm;
+	if (async && !mmget_not_zero(mm))
 		return -ESRCH; /* process exited */
 
 	ret = mmap_write_lock_killable(mm);
@@ -794,8 +795,8 @@  static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
 	struct mm_struct *mm;
 	int ret;
 
-	mm = get_task_mm(dma->task);
-	if (!mm)
+	mm = dma->mm;
+	if (!mmget_not_zero(mm))
 		return -ENODEV;
 
 	ret = vaddr_get_pfns(mm, vaddr, 1, dma->prot, pfn_base, pages);
@@ -1180,6 +1181,7 @@  static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
 	vfio_unmap_unpin(iommu, dma, true);
 	vfio_unlink_dma(iommu, dma);
 	put_task_struct(dma->task);
+	mmdrop(dma->mm);
 	vfio_dma_bitmap_free(dma);
 	if (dma->vaddr_invalid) {
 		iommu->vaddr_invalid_count--;
@@ -1664,15 +1666,7 @@  static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	 * against the locked memory limit and we need to be able to do both
 	 * outside of this call path as pinning can be asynchronous via the
 	 * external interfaces for mdev devices.  RLIMIT_MEMLOCK requires a
-	 * task_struct and VM locked pages requires an mm_struct, however
-	 * holding an indefinite mm reference is not recommended, therefore we
-	 * only hold a reference to a task.  We could hold a reference to
-	 * current, however QEMU uses this call path through vCPU threads,
-	 * which can be killed resulting in a NULL mm and failure in the unmap
-	 * path when called via a different thread.  Avoid this problem by
-	 * using the group_leader as threads within the same group require
-	 * both CLONE_THREAD and CLONE_VM and will therefore use the same
-	 * mm_struct.
+	 * task_struct and VM locked pages requires an mm_struct.
 	 *
 	 * Previously we also used the task for testing CAP_IPC_LOCK at the
 	 * time of pinning and accounting, however has_capability() makes use
@@ -1687,6 +1681,8 @@  static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	get_task_struct(current->group_leader);
 	dma->task = current->group_leader;
 	dma->lock_cap = capable(CAP_IPC_LOCK);
+	dma->mm = current->mm;
+	mmgrab(dma->mm);
 
 	dma->pfn_list = RB_ROOT;
 
@@ -3122,9 +3118,8 @@  static int vfio_iommu_type1_dma_rw_chunk(struct vfio_iommu *iommu,
 			!(dma->prot & IOMMU_READ))
 		return -EPERM;
 
-	mm = get_task_mm(dma->task);
-
-	if (!mm)
+	mm = dma->mm;
+	if (!mmget_not_zero(mm))
 		return -EPERM;
 
 	if (kthread)