
[v3,2/4] mm/gup: clean up follow_pfn_pte() slightly

Message ID 20220203093232.572380-3-jhubbard@nvidia.com (mailing list archive)
State New
Series mm/gup: some cleanups

Commit Message

John Hubbard Feb. 3, 2022, 9:32 a.m. UTC
Regardless of any FOLL_* flags, get_user_pages() and its variants should
handle PFN-only entries by stopping early, if the caller expected
**pages to be filled in.

This makes for a more reliable API, as compared to the previous approach
of skipping over such entries (and thus leaving them silently
unwritten).

Cc: Peter Xu <peterx@redhat.com>
Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 mm/gup.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

Comments

Claudio Imbrenda Feb. 3, 2022, 1:31 p.m. UTC | #1
On Thu, 3 Feb 2022 01:32:30 -0800
John Hubbard <jhubbard@nvidia.com> wrote:

> Regardless of any FOLL_* flags, get_user_pages() and its variants should
> handle PFN-only entries by stopping early, if the caller expected
> **pages to be filled in.
> 
> This makes for a more reliable API, as compared to the previous approach
> of skipping over such entries (and thus leaving them silently
> unwritten).
> 
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  mm/gup.c | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 65575ae3602f..cad3f28492e3 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -439,10 +439,6 @@ static struct page *no_page_table(struct vm_area_struct *vma,
>  static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
>  		pte_t *pte, unsigned int flags)
>  {
> -	/* No page to get reference */
> -	if (flags & (FOLL_GET | FOLL_PIN))
> -		return -EFAULT;
> -
>  	if (flags & FOLL_TOUCH) {
>  		pte_t entry = *pte;
>  
> @@ -1180,8 +1176,13 @@ static long __get_user_pages(struct mm_struct *mm,
>  		} else if (PTR_ERR(page) == -EEXIST) {
>  			/*
>  			 * Proper page table entry exists, but no corresponding
> -			 * struct page.
> +			 * struct page. If the caller expects **pages to be
> +			 * filled in, bail out now, because that can't be done
> +			 * for this page.
>  			 */
> +			if (pages)
> +				goto out;
> +
>  			goto next_page;
>  		} else if (IS_ERR(page)) {
>  			ret = PTR_ERR(page);

I'm not an expert; can you explain why this is better, and why it does
not cause new issues?

If I understand correctly, the problem you are trying to solve is that
in some cases you might try to get n pages, but you only get m < n
pages instead, because some don't have an associated struct page, and
the missing pages might even be in the middle.

The `pages` array would contain the list of pages actually pinned (or
gotten?), but it won't tell which of the requested pages were pinned
(e.g. if some pages in the middle of the run were skipped).

With your patch you will stop at the first page without a struct page,
meaning that if the caller tries again, it will get 0 pages. Why won't
this cause issues?

Why will this not cause problems when the `pages` parameter is NULL?


sorry for the dumb questions, but this seems a rather important change,
and I think in these circumstances you can't have too much
documentation.
Jan Kara Feb. 3, 2022, 1:53 p.m. UTC | #2
On Thu 03-02-22 01:32:30, John Hubbard wrote:
> Regardless of any FOLL_* flags, get_user_pages() and its variants should
> handle PFN-only entries by stopping early, if the caller expected
> **pages to be filled in.
> 
> This makes for a more reliable API, as compared to the previous approach
> of skipping over such entries (and thus leaving them silently
> unwritten).
> 
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  mm/gup.c | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 65575ae3602f..cad3f28492e3 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -439,10 +439,6 @@ static struct page *no_page_table(struct vm_area_struct *vma,
>  static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
>  		pte_t *pte, unsigned int flags)
>  {
> -	/* No page to get reference */
> -	if (flags & (FOLL_GET | FOLL_PIN))
> -		return -EFAULT;
> -
>  	if (flags & FOLL_TOUCH) {
>  		pte_t entry = *pte;
>  

This will also modify the error code returned from follow_page(). A quick
audit shows that at least the user in mm/migrate.c will propagate this
error code to userspace, and I'm not sure the change in error code won't
break something... EEXIST is a bit of a strange error code to get from
move_pages(2).

								Honza
Jason Gunthorpe Feb. 3, 2022, 3:01 p.m. UTC | #3
On Thu, Feb 03, 2022 at 02:53:52PM +0100, Jan Kara wrote:
> On Thu 03-02-22 01:32:30, John Hubbard wrote:
> > Regardless of any FOLL_* flags, get_user_pages() and its variants should
> > handle PFN-only entries by stopping early, if the caller expected
> > **pages to be filled in.
> > 
> > This makes for a more reliable API, as compared to the previous approach
> > of skipping over such entries (and thus leaving them silently
> > unwritten).
> > 
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
> > Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> >  mm/gup.c | 11 ++++++-----
> >  1 file changed, 6 insertions(+), 5 deletions(-)
> > 
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 65575ae3602f..cad3f28492e3 100644
> > +++ b/mm/gup.c
> > @@ -439,10 +439,6 @@ static struct page *no_page_table(struct vm_area_struct *vma,
> >  static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
> >  		pte_t *pte, unsigned int flags)
> >  {
> > -	/* No page to get reference */
> > -	if (flags & (FOLL_GET | FOLL_PIN))
> > -		return -EFAULT;
> > -
> >  	if (flags & FOLL_TOUCH) {
> >  		pte_t entry = *pte;
> >  
> 
> This will also modify the error code returned from follow_page(). 

Er, but isn't that the whole point of this entire design? It is what
the commit that added it says:

commit 1027e4436b6a5c413c95d95e50d0f26348a602ac
Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Date:   Fri Sep 4 15:47:55 2015 -0700

    mm: make GUP handle pfn mapping unless FOLL_GET is requested
    
    With DAX, pfn mapping becoming more common.  The patch adjusts GUP code to
    cover pfn mapping for cases when we don't need struct page to proceed.
    
    To make it possible, let's change follow_page() code to return -EEXIST
    error code if proper page table entry exists, but no corresponding struct
    page.  __get_user_page() would ignore the error code and move to the next
    page frame.
    
    The immediate effect of the change is working MAP_POPULATE and mlock() on
    DAX mappings.

> A quick audit shows that at least the user in mm/migrate.c will
> propagate this error code to userspace and I'm not sure the change
> in error code will not break something... EEXIST is a bit strange
> error code to get from move_pages(2).

That makes sense, maybe move_pages should squash the return codes to
EBUSY?

Jason
Matthew Wilcox Feb. 3, 2022, 3:18 p.m. UTC | #4
On Thu, Feb 03, 2022 at 11:01:23AM -0400, Jason Gunthorpe wrote:
> On Thu, Feb 03, 2022 at 02:53:52PM +0100, Jan Kara wrote:
> > On Thu 03-02-22 01:32:30, John Hubbard wrote:
> > > Regardless of any FOLL_* flags, get_user_pages() and its variants should
> > > handle PFN-only entries by stopping early, if the caller expected
> > > **pages to be filled in.
> > > 
> > > This makes for a more reliable API, as compared to the previous approach
> > > of skipping over such entries (and thus leaving them silently
> > > unwritten).
> > > 
> > > Cc: Peter Xu <peterx@redhat.com>
> > > Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
> > > Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> > > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > > Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> > >  mm/gup.c | 11 ++++++-----
> > >  1 file changed, 6 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/mm/gup.c b/mm/gup.c
> > > index 65575ae3602f..cad3f28492e3 100644
> > > +++ b/mm/gup.c
> > > @@ -439,10 +439,6 @@ static struct page *no_page_table(struct vm_area_struct *vma,
> > >  static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
> > >  		pte_t *pte, unsigned int flags)
> > >  {
> > > -	/* No page to get reference */
> > > -	if (flags & (FOLL_GET | FOLL_PIN))
> > > -		return -EFAULT;
> > > -
> > >  	if (flags & FOLL_TOUCH) {
> > >  		pte_t entry = *pte;
> > >  
> > 
> > This will also modify the error code returned from follow_page(). 
> 
> Er, but isn't that the whole point of this entire design? It is what
> the commit that added it says:
> 
> commit 1027e4436b6a5c413c95d95e50d0f26348a602ac
> Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Date:   Fri Sep 4 15:47:55 2015 -0700
> 
>     mm: make GUP handle pfn mapping unless FOLL_GET is requested
>     
>     With DAX, pfn mapping becoming more common.  The patch adjusts GUP code to
>     cover pfn mapping for cases when we don't need struct page to proceed.
>     
>     To make it possible, let's change follow_page() code to return -EEXIST
>     error code if proper page table entry exists, but no corresponding struct
>     page.  __get_user_page() would ignore the error code and move to the next
>     page frame.
>     
>     The immediate effect of the change is working MAP_POPULATE and mlock() on
>     DAX mappings.
> 
> > A quick audit shows that at least the user in mm/migrate.c will
> > propagate this error code to userspace and I'm not sure the change
> > in error code will not break something... EEXIST is a bit strange
> > error code to get from move_pages(2).
> 
> That makes sense, maybe move_pages should squash the return codes to
> EBUSY?

I think EFAULT is the closest:
              This  is  a  zero  page  or the memory area is not mapped by the
              process.

EBUSY implies it can be tried again later.
John Hubbard Feb. 3, 2022, 8:53 p.m. UTC | #5
On 2/3/22 05:31, Claudio Imbrenda wrote:
...
>> @@ -1180,8 +1176,13 @@ static long __get_user_pages(struct mm_struct *mm,
>>   		} else if (PTR_ERR(page) == -EEXIST) {
>>   			/*
>>   			 * Proper page table entry exists, but no corresponding
>> -			 * struct page.
>> +			 * struct page. If the caller expects **pages to be
>> +			 * filled in, bail out now, because that can't be done
>> +			 * for this page.
>>   			 */
>> +			if (pages)
>> +				goto out;
>> +
>>   			goto next_page;
>>   		} else if (IS_ERR(page)) {
>>   			ret = PTR_ERR(page);
> 
> I'm not an expert, can you explain why this is better, and why it does
> not cause new issues?
> 
> If I understand correctly, the problem you are trying to solve is that
> in some cases you might try to get n pages, but you only get m < n
> pages instead, because some don't have an associated struct page, and
> the missing pages might even be in the middle.
> 
> The `pages` array would contain the list of pages actually pinned
> (getted?), but this won't tell which of the requested pages have been
> pinned (e.g. if some pages in the middle of the run were skipped)
> 

The get_user_pages() API doesn't leave holes in the middle of **pages,
ever. Instead, it stops at the first error and reports the number of
pages that were successfully pinned. And the caller is responsible for
unpinning.

 From __get_user_pages()'s kerneldoc documentation:

  * Returns either number of pages pinned (which may be less than the
  * number requested), or an error. Details about the return value:
  *
  * -- If nr_pages is 0, returns 0.
  * -- If nr_pages is >0, but no pages were pinned, returns -errno.
  * -- If nr_pages is >0, and some pages were pinned, returns the number of
  *    pages pinned. Again, this may be less than nr_pages.
  * -- 0 return value is possible when the fault would need to be retried.
  *
  * The caller is responsible for releasing returned @pages, via put_page().

So the **pages array doesn't have holes, and the caller just counts up
from the beginning of **pages and stops at nr_pages.


> With your patch you will stop at the first page without a struct page,
> meaning that if the caller tries again, it will get 0 pages. Why won't
> this cause issues?

Callers are already written to deal with this case.

> 
> Why will this not cause problems when the `pages` parameter is NULL?

The behavior is unchanged here if pages == NULL. But maybe you meant: if
pages != NULL? In that case, the new behavior is to stop early and return
fewer pages than requested (your m < n), which is (I am claiming) better
than just leaving garbage values in **pages.

Another approach would be to fill in PTR_ERR(page) values, but GUP is a
well-established and widely used API, and that would be a large change
requiring updates to a lot of caller code.
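
Roughly, every such caller would then have to grow something like the
following (a purely hypothetical fragment, assuming nr and pages come
from the surrounding caller; this is not how GUP behaves today or after
this patch):

        /* Hypothetical: if GUP reported PFN-only slots via PTR_ERR() values. */
        for (i = 0; i < nr; i++) {
                if (IS_ERR(pages[i]))
                        continue;       /* PFN-only mapping: no struct page here */
                /* ... otherwise use pages[i] as before ... */
        }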

> 
> 
> sorry for the dumb questions, but this seems a rather important change,
> and I think in these circumstances you can't have too much
> documentation.
> 

Thanks for reviewing this!


thanks,
John Hubbard Feb. 3, 2022, 9:19 p.m. UTC | #6
On 2/3/22 07:18, Matthew Wilcox wrote:
...
>>> This will also modify the error code returned from follow_page().
>>
>> Er, but isn't that the whole point of this entire design? It is what
>> the commit that added it says:
>>
>> commit 1027e4436b6a5c413c95d95e50d0f26348a602ac
>> Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Date:   Fri Sep 4 15:47:55 2015 -0700
>>
>>      mm: make GUP handle pfn mapping unless FOLL_GET is requested
>>      
>>      With DAX, pfn mapping becoming more common.  The patch adjusts GUP code to
>>      cover pfn mapping for cases when we don't need struct page to proceed.
>>      
>>      To make it possible, let's change follow_page() code to return -EEXIST
>>      error code if proper page table entry exists, but no corresponding struct
>>      page.  __get_user_page() would ignore the error code and move to the next
>>      page frame.
>>      
>>      The immediate effect of the change is working MAP_POPULATE and mlock() on
>>      DAX mappings.
>>
>>> A quick audit shows that at least the user in mm/migrate.c will
>>> propagate this error code to userspace and I'm not sure the change
>>> in error code will not break something... EEXIST is a bit strange
>>> error code to get from move_pages(2).
>>
>> That makes sense, maybe move_pages should squash the return codes to
>> EBUSY?
> 
> I think EFAULT is the closest:
>                This  is  a  zero  page  or the memory area is not mapped by the
>                process.
> 
> EBUSY implies it can be tried again later.
> 

OK. I definitely need to rework the commit description now, but the diffs are
looking like this:

diff --git a/mm/gup.c b/mm/gup.c
index 65575ae3602f..cad3f28492e3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -439,10 +439,6 @@ static struct page *no_page_table(struct vm_area_struct *vma,
  static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
  		pte_t *pte, unsigned int flags)
  {
-	/* No page to get reference */
-	if (flags & (FOLL_GET | FOLL_PIN))
-		return -EFAULT;
-
  	if (flags & FOLL_TOUCH) {
  		pte_t entry = *pte;

@@ -1180,8 +1176,13 @@ static long __get_user_pages(struct mm_struct *mm,
  		} else if (PTR_ERR(page) == -EEXIST) {
  			/*
  			 * Proper page table entry exists, but no corresponding
-			 * struct page.
+			 * struct page. If the caller expects **pages to be
+			 * filled in, bail out now, because that can't be done
+			 * for this page.
  			 */
+			if (pages)
+				goto out;
+
  			goto next_page;
  		} else if (IS_ERR(page)) {
  			ret = PTR_ERR(page);
diff --git a/mm/migrate.c b/mm/migrate.c
index c7da064b4781..be0d5ae36dc1 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1761,6 +1761,13 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
  			continue;
  		}

+		/*
+		 * The move_pages() man page does not have an -EEXIST choice, so
+		 * use -EFAULT instead.
+		 */
+		if (err == -EEXIST)
+			err = -EFAULT;
+
  		/*
  		 * If the page is already on the target node (!err), store the
  		 * node, otherwise, store the err.

thanks,

Patch

diff --git a/mm/gup.c b/mm/gup.c
index 65575ae3602f..cad3f28492e3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -439,10 +439,6 @@  static struct page *no_page_table(struct vm_area_struct *vma,
 static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
 		pte_t *pte, unsigned int flags)
 {
-	/* No page to get reference */
-	if (flags & (FOLL_GET | FOLL_PIN))
-		return -EFAULT;
-
 	if (flags & FOLL_TOUCH) {
 		pte_t entry = *pte;
 
@@ -1180,8 +1176,13 @@  static long __get_user_pages(struct mm_struct *mm,
 		} else if (PTR_ERR(page) == -EEXIST) {
 			/*
 			 * Proper page table entry exists, but no corresponding
-			 * struct page.
+			 * struct page. If the caller expects **pages to be
+			 * filled in, bail out now, because that can't be done
+			 * for this page.
 			 */
+			if (pages)
+				goto out;
+
 			goto next_page;
 		} else if (IS_ERR(page)) {
 			ret = PTR_ERR(page);