diff mbox series

mm/gup: stop leaking pinned pages in low memory conditions

Message ID 20241016202242.456953-1-jhubbard@nvidia.com (mailing list archive)
State New
Headers show
Series mm/gup: stop leaking pinned pages in low memory conditions | expand

Commit Message

John Hubbard Oct. 16, 2024, 8:22 p.m. UTC
If a driver tries to call any of the pin_user_pages*(FOLL_LONGTERM)
family of functions, and requests "too many" pages, then the call will
erroneously leave pages pinned. This is visible in user space as an
actual memory leak.

Repro is trivial: just make enough pin_user_pages(FOLL_LONGTERM) calls
to exhaust memory.

The root cause of the problem is this sequence, within
__gup_longterm_locked():

    __get_user_pages_locked()
    rc = check_and_migrate_movable_pages()

...which gets retried in a loop. The loop error handling is incomplete,
clearly due to a somewhat unusual and complicated tri-state error API.
But anyway, if -ENOMEM, or in fact, any unexpected error is returned
from check_and_migrate_movable_pages(), then __gup_longterm_locked()
happily returns the error, while leaving the pages pinned.

In the failed case, which is an app that requests (via a device driver)
30720000000 bytes to be pinned, and then exits, I see this:

    $ grep foll /proc/vmstat
        nr_foll_pin_acquired 7502048
        nr_foll_pin_released 2048

And after applying this patch, it returns to balanced pins:

    $ grep foll /proc/vmstat
        nr_foll_pin_acquired 7502048
        nr_foll_pin_released 7502048

Fix this by unpinning the pages that __get_user_pages_locked() has
pinned, in such error cases.

Fixes: 24a95998e9ba ("mm/gup.c: simplify and fix check_and_migrate_movable_pages() return codes")
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Shigeru Yoshida <syoshida@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 mm/gup.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

Comments

Andrew Morton Oct. 16, 2024, 9:57 p.m. UTC | #1
On Wed, 16 Oct 2024 13:22:42 -0700 John Hubbard <jhubbard@nvidia.com> wrote:

> If a driver tries to call any of the pin_user_pages*(FOLL_LONGTERM)
> family of functions, and requests "too many" pages, then the call will
> erroneously leave pages pinned. This is visible in user space as an
> actual memory leak.
> 
> Repro is trivial: just make enough pin_user_pages(FOLL_LONGTERM) calls
> to exhaust memory.
> 
> The root cause of the problem is this sequence, within
> __gup_longterm_locked():
> 
>     __get_user_pages_locked()
>     rc = check_and_migrate_movable_pages()
> 
> ...which gets retried in a loop. The loop error handling is incomplete,
> clearly due to a somewhat unusual and complicated tri-state error API.
> But anyway, if -ENOMEM, or in fact, any unexpected error is returned
> from check_and_migrate_movable_pages(), then __gup_longterm_locked()
> happily returns the error, while leaving the pages pinned.
> 
> In the failed case, which is an app that requests (via a device driver)
> 30720000000 bytes to be pinned, and then exits, I see this:
> 
>     $ grep foll /proc/vmstat
>         nr_foll_pin_acquired 7502048
>         nr_foll_pin_released 2048
> 
> And after applying this patch, it returns to balanced pins:
> 
>     $ grep foll /proc/vmstat
>         nr_foll_pin_acquired 7502048
>         nr_foll_pin_released 7502048
> 
> Fix this by unpinning the pages that __get_user_pages_locked() has
> pinned, in such error cases.

Thanks.

> Fixes: 24a95998e9ba ("mm/gup.c: simplify and fix check_and_migrate_movable_pages() return codes")

I'll add this to the -stable backport pile, although this seems a bit
marginal?
John Hubbard Oct. 16, 2024, 10:05 p.m. UTC | #2
On 10/16/24 2:57 PM, Andrew Morton wrote:
> On Wed, 16 Oct 2024 13:22:42 -0700 John Hubbard <jhubbard@nvidia.com> wrote:
...
>> Fix this by unpinning the pages that __get_user_pages_locked() has
>> pinned, in such error cases.
> 
> Thanks.
> 
>> Fixes: 24a95998e9ba ("mm/gup.c: simplify and fix check_and_migrate_movable_pages() return codes")
> 
> I'll add this to the -stable backport pile, although this seems a bit
> marginal?

I'm on the fence about that. It is marginal: you have to
exhaust memory. On the other hand, a real user reported
this bug to us.

I guess I'd lean toward "correctness in -stable", and
add it to the pile, in the end.


thanks,
Alistair Popple Oct. 16, 2024, 10:13 p.m. UTC | #3
John Hubbard <jhubbard@nvidia.com> writes:

> If a driver tries to call any of the pin_user_pages*(FOLL_LONGTERM)
> family of functions, and requests "too many" pages, then the call will
> erroneously leave pages pinned. This is visible in user space as an
> actual memory leak.
>
> Repro is trivial: just make enough pin_user_pages(FOLL_LONGTERM) calls
> to exhaust memory.
>
> The root cause of the problem is this sequence, within
> __gup_longterm_locked():
>
>     __get_user_pages_locked()
>     rc = check_and_migrate_movable_pages()
>
> ...which gets retried in a loop. The loop error handling is incomplete,
> clearly due to a somewhat unusual and complicated tri-state error API.
> But anyway, if -ENOMEM, or in fact, any unexpected error is returned
> from check_and_migrate_movable_pages(), then __gup_longterm_locked()
> happily returns the error, while leaving the pages pinned.
>
> In the failed case, which is an app that requests (via a device driver)
> 30720000000 bytes to be pinned, and then exits, I see this:
>
>     $ grep foll /proc/vmstat
>         nr_foll_pin_acquired 7502048
>         nr_foll_pin_released 2048
>
> And after applying this patch, it returns to balanced pins:
>
>     $ grep foll /proc/vmstat
>         nr_foll_pin_acquired 7502048
>         nr_foll_pin_released 7502048
>
> Fix this by unpinning the pages that __get_user_pages_locked() has
> pinned, in such error cases.
>
> Fixes: 24a95998e9ba ("mm/gup.c: simplify and fix check_and_migrate_movable_pages() return codes")
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Shigeru Yoshida <syoshida@redhat.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  mm/gup.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index a82890b46a36..24acf53c8294 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2492,6 +2492,17 @@ static long __gup_longterm_locked(struct mm_struct *mm,
>  
>  		/* FOLL_LONGTERM implies FOLL_PIN */
>  		rc = check_and_migrate_movable_pages(nr_pinned_pages, pages);
> +
> +		/*
> +		 * The __get_user_pages_locked() call happens before we know
> +		 * that whether it's possible to successfully complete the whole
> +		 * operation. To compensate for this, if we get an unexpected
> +		 * error (such as -ENOMEM) then we must unpin everything, before
> +		 * erroring out.
> +		 */
> +		if (rc != -EAGAIN && rc != 0)
> +			unpin_user_pages(pages, nr_pinned_pages);

I know this is going to fall out of the loop in the next line but given
this is an error handling case it would be nice if this was made
explicit here with a break statement. It looks correct to me though so:

Reviewed-by: Alistair Popple <apopple@nvidia.com>

> +
>  	} while (rc == -EAGAIN);
>  	memalloc_pin_restore(flags);
>  	return rc ? rc : nr_pinned_pages;
John Hubbard Oct. 16, 2024, 10:22 p.m. UTC | #4
On 10/16/24 3:13 PM, Alistair Popple wrote:
> John Hubbard <jhubbard@nvidia.com> writes:
...
>> diff --git a/mm/gup.c b/mm/gup.c
>> index a82890b46a36..24acf53c8294 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -2492,6 +2492,17 @@ static long __gup_longterm_locked(struct mm_struct *mm,
>>   
>>   		/* FOLL_LONGTERM implies FOLL_PIN */
>>   		rc = check_and_migrate_movable_pages(nr_pinned_pages, pages);
>> +
>> +		/*
>> +		 * The __get_user_pages_locked() call happens before we know
>> +		 * that whether it's possible to successfully complete the whole
>> +		 * operation. To compensate for this, if we get an unexpected
>> +		 * error (such as -ENOMEM) then we must unpin everything, before
>> +		 * erroring out.
>> +		 */
>> +		if (rc != -EAGAIN && rc != 0)
>> +			unpin_user_pages(pages, nr_pinned_pages);
> 
> I know this is going to fall out of the loop in the next line but given
> this is an error handling case it would be nice if this was made
> explicit here with a break statement. It looks correct to me though so:

Yes, I'm not sure that adding another line of code for cosmetics really
makes this better. At this point, smaller is better IMHO, and then the
next step should be something bigger, such as refactoring to avoid the
tri-state return codes.

> 
> Reviewed-by: Alistair Popple <apopple@nvidia.com>

Thanks for the super fast response and the review!


thanks,
Andrew Morton Oct. 16, 2024, 10:41 p.m. UTC | #5
On Wed, 16 Oct 2024 15:05:28 -0700 John Hubbard <jhubbard@nvidia.com> wrote:

> On 10/16/24 2:57 PM, Andrew Morton wrote:
> > On Wed, 16 Oct 2024 13:22:42 -0700 John Hubbard <jhubbard@nvidia.com> wrote:
> ...
> >> Fix this by unpinning the pages that __get_user_pages_locked() has
> >> pinned, in such error cases.
> > 
> > Thanks.
> > 
> >> Fixes: 24a95998e9ba ("mm/gup.c: simplify and fix check_and_migrate_movable_pages() return codes")
> > 
> > I'll add this to the -stable backport pile, although this seems a bit
> > marginal?
> 
> I'm on the fence about that. It is marginal: you have to
> exhaust memory. On the other hand, a real user reported
> this bug to us.
> 
> I guess I'd lean toward "correctness in -stable", and
> add it to the pile, in the end.

Thanks.  It's a super-simple patch, which helps the decision.
David Hildenbrand Oct. 17, 2024, 8:51 a.m. UTC | #6
On 16.10.24 22:22, John Hubbard wrote:
> If a driver tries to call any of the pin_user_pages*(FOLL_LONGTERM)
> family of functions, and requests "too many" pages, then the call will
> erroneously leave pages pinned. This is visible in user space as an
> actual memory leak.
> 
> Repro is trivial: just make enough pin_user_pages(FOLL_LONGTERM) calls
> to exhaust memory.
> 
> The root cause of the problem is this sequence, within
> __gup_longterm_locked():
> 
>      __get_user_pages_locked()
>      rc = check_and_migrate_movable_pages()
> 
> ...which gets retried in a loop. The loop error handling is incomplete,
> clearly due to a somewhat unusual and complicated tri-state error API.
> But anyway, if -ENOMEM, or in fact, any unexpected error is returned
> from check_and_migrate_movable_pages(), then __gup_longterm_locked()
> happily returns the error, while leaving the pages pinned.
> 
> In the failed case, which is an app that requests (via a device driver)
> 30720000000 bytes to be pinned, and then exits, I see this:
> 
>      $ grep foll /proc/vmstat
>          nr_foll_pin_acquired 7502048
>          nr_foll_pin_released 2048
> 
> And after applying this patch, it returns to balanced pins:
> 
>      $ grep foll /proc/vmstat
>          nr_foll_pin_acquired 7502048
>          nr_foll_pin_released 7502048
> 
> Fix this by unpinning the pages that __get_user_pages_locked() has
> pinned, in such error cases.
> 
> Fixes: 24a95998e9ba ("mm/gup.c: simplify and fix check_and_migrate_movable_pages() return codes")
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Shigeru Yoshida <syoshida@redhat.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>   mm/gup.c | 11 +++++++++++
>   1 file changed, 11 insertions(+)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index a82890b46a36..24acf53c8294 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2492,6 +2492,17 @@ static long __gup_longterm_locked(struct mm_struct *mm,
>   
>   		/* FOLL_LONGTERM implies FOLL_PIN */
>   		rc = check_and_migrate_movable_pages(nr_pinned_pages, pages);
> +
> +		/*
> +		 * The __get_user_pages_locked() call happens before we know
> +		 * that whether it's possible to successfully complete the whole
> +		 * operation. To compensate for this, if we get an unexpected
> +		 * error (such as -ENOMEM) then we must unpin everything, before
> +		 * erroring out.
> +		 */
> +		if (rc != -EAGAIN && rc != 0)
> +			unpin_user_pages(pages, nr_pinned_pages);
> +
>   	} while (rc == -EAGAIN);

Wouldn't it be cleaner to simply have here after the loop (possibly even 
after the memalloc_pin_restore())

if (rc)
	unpin_user_pages(pages, nr_pinned_pages);

But maybe I am missing something.

>   	memalloc_pin_restore(flags);
>   	return rc ? rc : nr_pinned_pages;
David Hildenbrand Oct. 17, 2024, 8:53 a.m. UTC | #7
On 17.10.24 10:51, David Hildenbrand wrote:
> On 16.10.24 22:22, John Hubbard wrote:
>> If a driver tries to call any of the pin_user_pages*(FOLL_LONGTERM)
>> family of functions, and requests "too many" pages, then the call will
>> erroneously leave pages pinned. This is visible in user space as an
>> actual memory leak.
>>
>> Repro is trivial: just make enough pin_user_pages(FOLL_LONGTERM) calls
>> to exhaust memory.
>>
>> The root cause of the problem is this sequence, within
>> __gup_longterm_locked():
>>
>>       __get_user_pages_locked()
>>       rc = check_and_migrate_movable_pages()
>>
>> ...which gets retried in a loop. The loop error handling is incomplete,
>> clearly due to a somewhat unusual and complicated tri-state error API.
>> But anyway, if -ENOMEM, or in fact, any unexpected error is returned
>> from check_and_migrate_movable_pages(), then __gup_longterm_locked()
>> happily returns the error, while leaving the pages pinned.
>>
>> In the failed case, which is an app that requests (via a device driver)
>> 30720000000 bytes to be pinned, and then exits, I see this:
>>
>>       $ grep foll /proc/vmstat
>>           nr_foll_pin_acquired 7502048
>>           nr_foll_pin_released 2048
>>
>> And after applying this patch, it returns to balanced pins:
>>
>>       $ grep foll /proc/vmstat
>>           nr_foll_pin_acquired 7502048
>>           nr_foll_pin_released 7502048
>>
>> Fix this by unpinning the pages that __get_user_pages_locked() has
>> pinned, in such error cases.
>>
>> Fixes: 24a95998e9ba ("mm/gup.c: simplify and fix check_and_migrate_movable_pages() return codes")
>> Cc: Alistair Popple <apopple@nvidia.com>
>> Cc: Shigeru Yoshida <syoshida@redhat.com>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Jason Gunthorpe <jgg@nvidia.com>
>> Cc: Minchan Kim <minchan@kernel.org>
>> Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
>> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
>> ---
>>    mm/gup.c | 11 +++++++++++
>>    1 file changed, 11 insertions(+)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index a82890b46a36..24acf53c8294 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -2492,6 +2492,17 @@ static long __gup_longterm_locked(struct mm_struct *mm,
>>    
>>    		/* FOLL_LONGTERM implies FOLL_PIN */
>>    		rc = check_and_migrate_movable_pages(nr_pinned_pages, pages);
>> +
>> +		/*
>> +		 * The __get_user_pages_locked() call happens before we know
>> +		 * that whether it's possible to successfully complete the whole
>> +		 * operation. To compensate for this, if we get an unexpected
>> +		 * error (such as -ENOMEM) then we must unpin everything, before
>> +		 * erroring out.
>> +		 */
>> +		if (rc != -EAGAIN && rc != 0)
>> +			unpin_user_pages(pages, nr_pinned_pages);
>> +
>>    	} while (rc == -EAGAIN);
> 
> Wouldn't it be cleaner to simply have here after the loop (possibly even
> after the memalloc_pin_restore())
> 
> if (rc)
> 	unpin_user_pages(pages, nr_pinned_pages);
> 
> But maybe I am missing something.

And staring at memfd_pin_folios(), don't we have the same issue there if
check_and_migrate_movable_folios() fails?


diff --git a/mm/gup.c b/mm/gup.c
index a82890b46a36..f79974d38608 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -3708,12 +3708,10 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
                 ret = check_and_migrate_movable_folios(nr_folios, folios);
         } while (ret == -EAGAIN);
  
-       memalloc_pin_restore(flags);
-       return ret ? ret : nr_folios;
  err:
         memalloc_pin_restore(flags);
-       unpin_folios(folios, nr_folios);
-
-       return ret;
+       if (ret)
+               unpin_folios(folios, nr_folios);
+       return ret ? ret : nr_folios;
  }
  EXPORT_SYMBOL_GPL(memfd_pin_folios);
John Hubbard Oct. 17, 2024, 4:54 p.m. UTC | #8
On 10/17/24 1:51 AM, David Hildenbrand wrote:
> On 16.10.24 22:22, John Hubbard wrote:
...
>> diff --git a/mm/gup.c b/mm/gup.c
>> index a82890b46a36..24acf53c8294 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -2492,6 +2492,17 @@ static long __gup_longterm_locked(struct 
>> mm_struct *mm,
>>           /* FOLL_LONGTERM implies FOLL_PIN */
>>           rc = check_and_migrate_movable_pages(nr_pinned_pages, pages);
>> +
>> +        /*
>> +         * The __get_user_pages_locked() call happens before we know
>> +         * that whether it's possible to successfully complete the whole

oops, s/that whether/that/

>> +         * operation. To compensate for this, if we get an unexpected
>> +         * error (such as -ENOMEM) then we must unpin everything, before
>> +         * erroring out.
>> +         */
>> +        if (rc != -EAGAIN && rc != 0)
>> +            unpin_user_pages(pages, nr_pinned_pages);
>> +
>>       } while (rc == -EAGAIN);
> 
> Wouldn't it be cleaner to simply have here after the loop (possibly even 
> after the memalloc_pin_restore())
> 
> if (rc)
>      unpin_user_pages(pages, nr_pinned_pages);
> 
> But maybe I am missing something.

Yes, I think you are correct. That is cleaner. Let me retest real quick just
in case, and then send a v2 that also picks up the typo and moves the 
comment
to the new location.

thanks,
John Hubbard Oct. 17, 2024, 5:06 p.m. UTC | #9
On 10/17/24 1:53 AM, David Hildenbrand wrote:
> On 17.10.24 10:51, David Hildenbrand wrote:
>> On 16.10.24 22:22, John Hubbard wrote:
...
> And staring at memfd_pin_folios(), don't we have the same issue there if
> check_and_migrate_movable_folios() fails?

Yes, it looks very clearly like the exact same bug, in a different location.
This complicated return code is the gift that keeps on giving. Although
likely people are just copying the pattern, which had the problem.


> 
> diff --git a/mm/gup.c b/mm/gup.c
> index a82890b46a36..f79974d38608 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -3708,12 +3708,10 @@ long memfd_pin_folios(struct file *memfd, loff_t 
> start, loff_t end,
>                  ret = check_and_migrate_movable_folios(nr_folios, folios);
>          } while (ret == -EAGAIN);
> 
> -       memalloc_pin_restore(flags);
> -       return ret ? ret : nr_folios;
>   err:
>          memalloc_pin_restore(flags);
> -       unpin_folios(folios, nr_folios);
> -
> -       return ret;
> +       if (ret)
> +               unpin_folios(folios, nr_folios);
> +       return ret ? ret : nr_folios;

That looks correct. I can send this out with the other patch as a tiny
2-patch series since they are related. Would you prefer to appear
as a Signed-off-by, or a Suggested-by, or "other"? :)

>   }
>   EXPORT_SYMBOL_GPL(memfd_pin_folios);
> 
> 

thanks,
David Hildenbrand Oct. 17, 2024, 5:10 p.m. UTC | #10
On 17.10.24 19:06, John Hubbard wrote:
> On 10/17/24 1:53 AM, David Hildenbrand wrote:
>> On 17.10.24 10:51, David Hildenbrand wrote:
>>> On 16.10.24 22:22, John Hubbard wrote:
> ...
>> And staring at memfd_pin_folios(), don't we have the same issue there if
>> check_and_migrate_movable_folios() fails?
> 
> Yes, it looks very clearly like the exact same bug, in a different location.
> This complicated return code is the gift that keeps on giving. Although
> likely people are just copying the pattern, which had the problem.
> 
> 
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index a82890b46a36..f79974d38608 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -3708,12 +3708,10 @@ long memfd_pin_folios(struct file *memfd, loff_t
>> start, loff_t end,
>>                   ret = check_and_migrate_movable_folios(nr_folios, folios);
>>           } while (ret == -EAGAIN);
>>
>> -       memalloc_pin_restore(flags);
>> -       return ret ? ret : nr_folios;
>>    err:
>>           memalloc_pin_restore(flags);
>> -       unpin_folios(folios, nr_folios);
>> -
>> -       return ret;
>> +       if (ret)
>> +               unpin_folios(folios, nr_folios);
>> +       return ret ? ret : nr_folios;
> 
> That looks correct. I can send this out with the other patch as a tiny
> 2-patch series since they are related. Would you prefer to appear
> as a Signed-off-by, or a Suggested-by, or "other"? :)

Suggested-by: please :)
Alistair Popple Oct. 17, 2024, 9:28 p.m. UTC | #11
David Hildenbrand <david@redhat.com> writes:

> On 16.10.24 22:22, John Hubbard wrote:
>> If a driver tries to call any of the pin_user_pages*(FOLL_LONGTERM)
>> family of functions, and requests "too many" pages, then the call will
>> erroneously leave pages pinned. This is visible in user space as an
>> actual memory leak.
>> Repro is trivial: just make enough pin_user_pages(FOLL_LONGTERM)
>> calls
>> to exhaust memory.
>> The root cause of the problem is this sequence, within
>> __gup_longterm_locked():
>>      __get_user_pages_locked()
>>      rc = check_and_migrate_movable_pages()
>> ...which gets retried in a loop. The loop error handling is
>> incomplete,
>> clearly due to a somewhat unusual and complicated tri-state error API.
>> But anyway, if -ENOMEM, or in fact, any unexpected error is returned
>> from check_and_migrate_movable_pages(), then __gup_longterm_locked()
>> happily returns the error, while leaving the pages pinned.
>> In the failed case, which is an app that requests (via a device
>> driver)
>> 30720000000 bytes to be pinned, and then exits, I see this:
>>      $ grep foll /proc/vmstat
>>          nr_foll_pin_acquired 7502048
>>          nr_foll_pin_released 2048
>> And after applying this patch, it returns to balanced pins:
>>      $ grep foll /proc/vmstat
>>          nr_foll_pin_acquired 7502048
>>          nr_foll_pin_released 7502048
>> Fix this by unpinning the pages that __get_user_pages_locked() has
>> pinned, in such error cases.
>> Fixes: 24a95998e9ba ("mm/gup.c: simplify and fix
>> check_and_migrate_movable_pages() return codes")
>> Cc: Alistair Popple <apopple@nvidia.com>
>> Cc: Shigeru Yoshida <syoshida@redhat.com>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Jason Gunthorpe <jgg@nvidia.com>
>> Cc: Minchan Kim <minchan@kernel.org>
>> Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
>> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
>> ---
>>   mm/gup.c | 11 +++++++++++
>>   1 file changed, 11 insertions(+)
>> diff --git a/mm/gup.c b/mm/gup.c
>> index a82890b46a36..24acf53c8294 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -2492,6 +2492,17 @@ static long __gup_longterm_locked(struct mm_struct *mm,
>>     		/* FOLL_LONGTERM implies FOLL_PIN */
>>   		rc = check_and_migrate_movable_pages(nr_pinned_pages, pages);
>> +
>> +		/*
>> +		 * The __get_user_pages_locked() call happens before we know
>> +		 * that whether it's possible to successfully complete the whole
>> +		 * operation. To compensate for this, if we get an unexpected
>> +		 * error (such as -ENOMEM) then we must unpin everything, before
>> +		 * erroring out.
>> +		 */
>> +		if (rc != -EAGAIN && rc != 0)
>> +			unpin_user_pages(pages, nr_pinned_pages);
>> +
>>   	} while (rc == -EAGAIN);
>
> Wouldn't it be cleaner to simply have here after the loop (possibly
> even after the memalloc_pin_restore())
>
> if (rc)
> 	unpin_user_pages(pages, nr_pinned_pages);
>
> But maybe I am missing something.

I initially thought the same thing but I'm not sure it is
correct. Consider what happens when __get_user_pages_locked() fails
earlier in the loop. In this case it will have bailed out of the loop
with rc <= 0 but we shouldn't call unpin_user_pages().

>>   	memalloc_pin_restore(flags);
>>   	return rc ? rc : nr_pinned_pages;
David Hildenbrand Oct. 17, 2024, 9:47 p.m. UTC | #12
On 17.10.24 23:28, Alistair Popple wrote:
> 
> David Hildenbrand <david@redhat.com> writes:
> 
>> On 16.10.24 22:22, John Hubbard wrote:
>>> If a driver tries to call any of the pin_user_pages*(FOLL_LONGTERM)
>>> family of functions, and requests "too many" pages, then the call will
>>> erroneously leave pages pinned. This is visible in user space as an
>>> actual memory leak.
>>> Repro is trivial: just make enough pin_user_pages(FOLL_LONGTERM)
>>> calls
>>> to exhaust memory.
>>> The root cause of the problem is this sequence, within
>>> __gup_longterm_locked():
>>>       __get_user_pages_locked()
>>>       rc = check_and_migrate_movable_pages()
>>> ...which gets retried in a loop. The loop error handling is
>>> incomplete,
>>> clearly due to a somewhat unusual and complicated tri-state error API.
>>> But anyway, if -ENOMEM, or in fact, any unexpected error is returned
>>> from check_and_migrate_movable_pages(), then __gup_longterm_locked()
>>> happily returns the error, while leaving the pages pinned.
>>> In the failed case, which is an app that requests (via a device
>>> driver)
>>> 30720000000 bytes to be pinned, and then exits, I see this:
>>>       $ grep foll /proc/vmstat
>>>           nr_foll_pin_acquired 7502048
>>>           nr_foll_pin_released 2048
>>> And after applying this patch, it returns to balanced pins:
>>>       $ grep foll /proc/vmstat
>>>           nr_foll_pin_acquired 7502048
>>>           nr_foll_pin_released 7502048
>>> Fix this by unpinning the pages that __get_user_pages_locked() has
>>> pinned, in such error cases.
>>> Fixes: 24a95998e9ba ("mm/gup.c: simplify and fix
>>> check_and_migrate_movable_pages() return codes")
>>> Cc: Alistair Popple <apopple@nvidia.com>
>>> Cc: Shigeru Yoshida <syoshida@redhat.com>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Cc: Jason Gunthorpe <jgg@nvidia.com>
>>> Cc: Minchan Kim <minchan@kernel.org>
>>> Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
>>> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
>>> ---
>>>    mm/gup.c | 11 +++++++++++
>>>    1 file changed, 11 insertions(+)
>>> diff --git a/mm/gup.c b/mm/gup.c
>>> index a82890b46a36..24acf53c8294 100644
>>> --- a/mm/gup.c
>>> +++ b/mm/gup.c
>>> @@ -2492,6 +2492,17 @@ static long __gup_longterm_locked(struct mm_struct *mm,
>>>      		/* FOLL_LONGTERM implies FOLL_PIN */
>>>    		rc = check_and_migrate_movable_pages(nr_pinned_pages, pages);
>>> +
>>> +		/*
>>> +		 * The __get_user_pages_locked() call happens before we know
>>> +		 * that whether it's possible to successfully complete the whole
>>> +		 * operation. To compensate for this, if we get an unexpected
>>> +		 * error (such as -ENOMEM) then we must unpin everything, before
>>> +		 * erroring out.
>>> +		 */
>>> +		if (rc != -EAGAIN && rc != 0)
>>> +			unpin_user_pages(pages, nr_pinned_pages);
>>> +
>>>    	} while (rc == -EAGAIN);
>>
>> Wouldn't it be cleaner to simply have here after the loop (possibly
>> even after the memalloc_pin_restore())
>>
>> if (rc)
>> 	unpin_user_pages(pages, nr_pinned_pages);
>>
>> But maybe I am missing something.
> 
> I initially thought the same thing but I'm not sure it is
> correct. Consider what happens when __get_user_pages_locked() fails
> earlier in the loop. In this case it will have bailed out of the loop
> with rc <= 0 but we shouldn't call unpin_user_pages().

Ah, I see what you mean, I primarily only stared at the diff.

We should likely avoid using nr_pinned_pages as a temporary variable that
can hold an error value.
John Hubbard Oct. 17, 2024, 9:57 p.m. UTC | #13
On 10/17/24 2:47 PM, David Hildenbrand wrote:
> On 17.10.24 23:28, Alistair Popple wrote:
>> David Hildenbrand <david@redhat.com> writes:
>>> On 16.10.24 22:22, John Hubbard wrote:
...
>>>> +        if (rc != -EAGAIN && rc != 0)
>>>> +            unpin_user_pages(pages, nr_pinned_pages);
>>>> +
>>>>        } while (rc == -EAGAIN);
>>>
>>> Wouldn't it be cleaner to simply have here after the loop (possibly
>>> even after the memalloc_pin_restore())
>>>
>>> if (rc)
>>>     unpin_user_pages(pages, nr_pinned_pages);
>>>
>>> But maybe I am missing something.
>>
>> I initially thought the same thing but I'm not sure it is
>> correct. Consider what happens when __get_user_pages_locked() fails
>> earlier in the loop. In this case it will have bailed out of the loop
>> with rc <= 0 but we shouldn't call unpin_user_pages().

doh. yes. Thanks for catching that, Alistair! I actually considered
it during the first draft, too, but got tunnel vision during the
review, sigh.

> 
> Ah, I see what you mean, I primarily only stared at the diff.
> 
> We should likely avoid using nr_pinned_pages as a temporary variable that
> can hold an error value.
> 

OK, I still want to defer all the pretty refactoring ideas into some
future effort, so for now, let's just leave v1 alone except for fixing
the typo in the comment, yes?

I'll still send out a 2-patch series with that, plus a suitably
modified fix for the memfd case too.


thanks,
David Hildenbrand Oct. 17, 2024, 10:03 p.m. UTC | #14
On 17.10.24 23:57, John Hubbard wrote:
> On 10/17/24 2:47 PM, David Hildenbrand wrote:
>> On 17.10.24 23:28, Alistair Popple wrote:
>>> David Hildenbrand <david@redhat.com> writes:
>>>> On 16.10.24 22:22, John Hubbard wrote:
> ...
>>>>> +        if (rc != -EAGAIN && rc != 0)
>>>>> +            unpin_user_pages(pages, nr_pinned_pages);
>>>>> +
>>>>>         } while (rc == -EAGAIN);
>>>>
>>>> Wouldn't it be cleaner to simply have here after the loop (possibly
>>>> even after the memalloc_pin_restore())
>>>>
>>>> if (rc)
>>>>      unpin_user_pages(pages, nr_pinned_pages);
>>>>
>>>> But maybe I am missing something.
>>>
>>> I initially thought the same thing but I'm not sure it is
>>> correct. Consider what happens when __get_user_pages_locked() fails
>>> earlier in the loop. In this case it will have bailed out of the loop
>>> with rc <= 0 but we shouldn't call unpin_user_pages().
> 
> doh. yes. Thanks for catching that, Alistair! I actually considered
> it during the first draft, too, but got tunnel vision during the
> review, sigh.
> 
>>
>> Ah, I see what you mean, I primarily only stared at the diff.
>>
>> We should likely avoid using nr_pinned_pages as a temporary variable that
>> can hold an error value.
>>
> 
> OK, I still want to defer all the pretty refactoring ideas into some
> future effort, so for now, let's just leave v1 alone except for fixing
> the typo in the comment, yes?

Fine with me!

Acked-by: David Hildenbrand <david@redhat.com>
diff mbox series

Patch

diff --git a/mm/gup.c b/mm/gup.c
index a82890b46a36..24acf53c8294 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2492,6 +2492,17 @@  static long __gup_longterm_locked(struct mm_struct *mm,
 
 		/* FOLL_LONGTERM implies FOLL_PIN */
 		rc = check_and_migrate_movable_pages(nr_pinned_pages, pages);
+
+		/*
+		 * The __get_user_pages_locked() call happens before we know
+		 * that whether it's possible to successfully complete the whole
+		 * operation. To compensate for this, if we get an unexpected
+		 * error (such as -ENOMEM) then we must unpin everything, before
+		 * erroring out.
+		 */
+		if (rc != -EAGAIN && rc != 0)
+			unpin_user_pages(pages, nr_pinned_pages);
+
 	} while (rc == -EAGAIN);
 	memalloc_pin_restore(flags);
 	return rc ? rc : nr_pinned_pages;