Message ID | 20220203093232.572380-3-jhubbard@nvidia.com (mailing list archive)
---|---
State | New
Series | mm/gup: some cleanups
On Thu, 3 Feb 2022 01:32:30 -0800 John Hubbard <jhubbard@nvidia.com> wrote:

> Regardless of any FOLL_* flags, get_user_pages() and its variants should
> handle PFN-only entries by stopping early, if the caller expected
> **pages to be filled in.
>
> This makes for a more reliable API, as compared to the previous approach
> of skipping over such entries (and thus leaving them silently
> unwritten).
>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  mm/gup.c | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 65575ae3602f..cad3f28492e3 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -439,10 +439,6 @@ static struct page *no_page_table(struct vm_area_struct *vma,
>  static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
>  		pte_t *pte, unsigned int flags)
>  {
> -	/* No page to get reference */
> -	if (flags & (FOLL_GET | FOLL_PIN))
> -		return -EFAULT;
> -
>  	if (flags & FOLL_TOUCH) {
>  		pte_t entry = *pte;
>
> @@ -1180,8 +1176,13 @@ static long __get_user_pages(struct mm_struct *mm,
>  		} else if (PTR_ERR(page) == -EEXIST) {
>  			/*
>  			 * Proper page table entry exists, but no corresponding
> -			 * struct page.
> +			 * struct page. If the caller expects **pages to be
> +			 * filled in, bail out now, because that can't be done
> +			 * for this page.
>  			 */
> +			if (pages)
> +				goto out;
> +
>  			goto next_page;
>  		} else if (IS_ERR(page)) {
>  			ret = PTR_ERR(page);

I'm not an expert, can you explain why this is better, and why it does
not cause new issues?

If I understand correctly, the problem you are trying to solve is that
in some cases you might try to get n pages, but you only get m < n
pages instead, because some don't have an associated struct page, and
the missing pages might even be in the middle.

The `pages` array would contain the list of pages actually pinned
(getted?), but this won't tell which of the requested pages have been
pinned (e.g. if some pages in the middle of the run were skipped).

With your patch you will stop at the first page without a struct page,
meaning that if the caller tries again, it will get 0 pages. Why won't
this cause issues? Why will this not cause problems when the `pages`
parameter is NULL?

Sorry for the dumb questions, but this seems a rather important change,
and I think in these circumstances you can't have too much
documentation.
On Thu 03-02-22 01:32:30, John Hubbard wrote:
> Regardless of any FOLL_* flags, get_user_pages() and its variants should
> handle PFN-only entries by stopping early, if the caller expected
> **pages to be filled in.
>
> This makes for a more reliable API, as compared to the previous approach
> of skipping over such entries (and thus leaving them silently
> unwritten).
>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  mm/gup.c | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 65575ae3602f..cad3f28492e3 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -439,10 +439,6 @@ static struct page *no_page_table(struct vm_area_struct *vma,
>  static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
>  		pte_t *pte, unsigned int flags)
>  {
> -	/* No page to get reference */
> -	if (flags & (FOLL_GET | FOLL_PIN))
> -		return -EFAULT;
> -
>  	if (flags & FOLL_TOUCH) {
>  		pte_t entry = *pte;
>

This will also modify the error code returned from follow_page(). A
quick audit shows that at least the user in mm/migrate.c will propagate
this error code to userspace and I'm not sure the change in error code
will not break something... EEXIST is a bit strange error code to get
from move_pages(2).

								Honza
On Thu, Feb 03, 2022 at 02:53:52PM +0100, Jan Kara wrote:
> On Thu 03-02-22 01:32:30, John Hubbard wrote:
> > Regardless of any FOLL_* flags, get_user_pages() and its variants should
> > handle PFN-only entries by stopping early, if the caller expected
> > **pages to be filled in.
> >
> > This makes for a more reliable API, as compared to the previous approach
> > of skipping over such entries (and thus leaving them silently
> > unwritten).
> >
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
> > Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> >  mm/gup.c | 11 ++++++-----
> >  1 file changed, 6 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 65575ae3602f..cad3f28492e3 100644
> > +++ b/mm/gup.c
> > @@ -439,10 +439,6 @@ static struct page *no_page_table(struct vm_area_struct *vma,
> >  static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
> >  		pte_t *pte, unsigned int flags)
> >  {
> > -	/* No page to get reference */
> > -	if (flags & (FOLL_GET | FOLL_PIN))
> > -		return -EFAULT;
> > -
> >  	if (flags & FOLL_TOUCH) {
> >  		pte_t entry = *pte;
> >
>
> This will also modify the error code returned from follow_page().

Er, but isn't that the whole point of this entire design? It is what
the commit that added it says:

commit 1027e4436b6a5c413c95d95e50d0f26348a602ac
Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Date:   Fri Sep 4 15:47:55 2015 -0700

    mm: make GUP handle pfn mapping unless FOLL_GET is requested

    With DAX, pfn mapping becoming more common. The patch adjusts GUP code to
    cover pfn mapping for cases when we don't need struct page to proceed.

    To make it possible, let's change follow_page() code to return -EEXIST
    error code if proper page table entry exists, but no corresponding struct
    page. __get_user_page() would ignore the error code and move to the next
    page frame.

    The immediate effect of the change is working MAP_POPULATE and mlock() on
    DAX mappings.

> A quick audit shows that at least the user in mm/migrate.c will
> propagate this error code to userspace and I'm not sure the change
> in error code will not break something... EEXIST is a bit strange
> error code to get from move_pages(2).

That makes sense, maybe move_pages should squash the return codes to
EEXIST?

Jason
On Thu, Feb 03, 2022 at 11:01:23AM -0400, Jason Gunthorpe wrote:
> On Thu, Feb 03, 2022 at 02:53:52PM +0100, Jan Kara wrote:
> > On Thu 03-02-22 01:32:30, John Hubbard wrote:
> > > Regardless of any FOLL_* flags, get_user_pages() and its variants should
> > > handle PFN-only entries by stopping early, if the caller expected
> > > **pages to be filled in.
> > >
> > > This makes for a more reliable API, as compared to the previous approach
> > > of skipping over such entries (and thus leaving them silently
> > > unwritten).
> > >
> > > Cc: Peter Xu <peterx@redhat.com>
> > > Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
> > > Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> > > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > > Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> > >  mm/gup.c | 11 ++++++-----
> > >  1 file changed, 6 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/mm/gup.c b/mm/gup.c
> > > index 65575ae3602f..cad3f28492e3 100644
> > > +++ b/mm/gup.c
> > > @@ -439,10 +439,6 @@ static struct page *no_page_table(struct vm_area_struct *vma,
> > >  static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
> > >  		pte_t *pte, unsigned int flags)
> > >  {
> > > -	/* No page to get reference */
> > > -	if (flags & (FOLL_GET | FOLL_PIN))
> > > -		return -EFAULT;
> > > -
> > >  	if (flags & FOLL_TOUCH) {
> > >  		pte_t entry = *pte;
> > >
> >
> > This will also modify the error code returned from follow_page().
>
> Er, but isn't that the whole point of this entire design? It is what
> the commit that added it says:
>
> commit 1027e4436b6a5c413c95d95e50d0f26348a602ac
> Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Date:   Fri Sep 4 15:47:55 2015 -0700
>
>     mm: make GUP handle pfn mapping unless FOLL_GET is requested
>
>     With DAX, pfn mapping becoming more common. The patch adjusts GUP code to
>     cover pfn mapping for cases when we don't need struct page to proceed.
>
>     To make it possible, let's change follow_page() code to return -EEXIST
>     error code if proper page table entry exists, but no corresponding struct
>     page. __get_user_page() would ignore the error code and move to the next
>     page frame.
>
>     The immediate effect of the change is working MAP_POPULATE and mlock() on
>     DAX mappings.
>
> > A quick audit shows that at least the user in mm/migrate.c will
> > propagate this error code to userspace and I'm not sure the change
> > in error code will not break something... EEXIST is a bit strange
> > error code to get from move_pages(2).
>
> That makes sense, maybe move_pages should squash the return codes to
> EEXIST?

I think EFAULT is the closest:

	This is a zero page or the memory area is not mapped by the
	process.

EBUSY implies it can be tried again later.
On 2/3/22 05:31, Claudio Imbrenda wrote:
...
>> @@ -1180,8 +1176,13 @@ static long __get_user_pages(struct mm_struct *mm,
>> 		} else if (PTR_ERR(page) == -EEXIST) {
>> 			/*
>> 			 * Proper page table entry exists, but no corresponding
>> -			 * struct page.
>> +			 * struct page. If the caller expects **pages to be
>> +			 * filled in, bail out now, because that can't be done
>> +			 * for this page.
>> 			 */
>> +			if (pages)
>> +				goto out;
>> +
>> 			goto next_page;
>> 		} else if (IS_ERR(page)) {
>> 			ret = PTR_ERR(page);
>
> I'm not an expert, can you explain why this is better, and why it does
> not cause new issues?
>
> If I understand correctly, the problem you are trying to solve is that
> in some cases you might try to get n pages, but you only get m < n
> pages instead, because some don't have an associated struct page, and
> the missing pages might even be in the middle.
>
> The `pages` array would contain the list of pages actually pinned
> (getted?), but this won't tell which of the requested pages have been
> pinned (e.g. if some pages in the middle of the run were skipped)

The get_user_pages() API doesn't leave holes in the middle, ever.
Instead, it stops at the first error, and reports the number of pages
that were successfully pinned. And the caller is responsible for
unpinning. From __get_user_pages()'s kerneldoc documentation:

 * Returns either number of pages pinned (which may be less than the
 * number requested), or an error. Details about the return value:
 *
 * -- If nr_pages is 0, returns 0.
 * -- If nr_pages is >0, but no pages were pinned, returns -errno.
 * -- If nr_pages is >0, and some pages were pinned, returns the number of
 *    pages pinned. Again, this may be less than nr_pages.
 * -- 0 return value is possible when the fault would need to be retried.
 *
 * The caller is responsible for releasing returned @pages, via put_page().

So the **pages array doesn't have holes, and the caller just counts up
from the beginning of **pages and stops at nr_pages.

> With your patch you will stop at the first page without a struct page,
> meaning that if the caller tries again, it will get 0 pages. Why won't
> this cause issues?

Callers are already written to deal with this case.

> Why will this not cause problems when the `pages` parameter is NULL?

The behavior is unchanged here if pages == NULL. But maybe you meant,
"if pages != NULL"? And in that case, the new behavior is to stop early
and return n < m, which is (I am claiming) better than just leaving
garbage values in **pages.

Another approach would be to fill in PTR_ERR(page) values, but GUP is a
well-established and widely used API, and that would be a large change
requiring a lot of caller code to be updated.

> Sorry for the dumb questions, but this seems a rather important change,
> and I think in these circumstances you can't have too much
> documentation.

Thanks for reviewing this!

thanks,
On 2/3/22 07:18, Matthew Wilcox wrote:
...
>>> This will also modify the error code returned from follow_page().
>>
>> Er, but isn't that the whole point of this entire design? It is what
>> the commit that added it says:
>>
>> commit 1027e4436b6a5c413c95d95e50d0f26348a602ac
>> Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Date:   Fri Sep 4 15:47:55 2015 -0700
>>
>>     mm: make GUP handle pfn mapping unless FOLL_GET is requested
>>
>>     With DAX, pfn mapping becoming more common. The patch adjusts GUP code to
>>     cover pfn mapping for cases when we don't need struct page to proceed.
>>
>>     To make it possible, let's change follow_page() code to return -EEXIST
>>     error code if proper page table entry exists, but no corresponding struct
>>     page. __get_user_page() would ignore the error code and move to the next
>>     page frame.
>>
>>     The immediate effect of the change is working MAP_POPULATE and mlock() on
>>     DAX mappings.
>>
>>> A quick audit shows that at least the user in mm/migrate.c will
>>> propagate this error code to userspace and I'm not sure the change
>>> in error code will not break something... EEXIST is a bit strange
>>> error code to get from move_pages(2).
>>
>> That makes sense, maybe move_pages should squash the return codes to
>> EEXIST?
>
> I think EFAULT is the closest:
> 	This is a zero page or the memory area is not mapped by the
> 	process.
>
> EBUSY implies it can be tried again later.

OK.

I definitely need to rework the commit description now, but the diffs
are looking like this:

diff --git a/mm/gup.c b/mm/gup.c
index 65575ae3602f..cad3f28492e3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -439,10 +439,6 @@ static struct page *no_page_table(struct vm_area_struct *vma,
 static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
 		pte_t *pte, unsigned int flags)
 {
-	/* No page to get reference */
-	if (flags & (FOLL_GET | FOLL_PIN))
-		return -EFAULT;
-
 	if (flags & FOLL_TOUCH) {
 		pte_t entry = *pte;

@@ -1180,8 +1176,13 @@ static long __get_user_pages(struct mm_struct *mm,
 	} else if (PTR_ERR(page) == -EEXIST) {
 		/*
 		 * Proper page table entry exists, but no corresponding
-		 * struct page.
+		 * struct page. If the caller expects **pages to be
+		 * filled in, bail out now, because that can't be done
+		 * for this page.
 		 */
+		if (pages)
+			goto out;
+
 		goto next_page;
 	} else if (IS_ERR(page)) {
 		ret = PTR_ERR(page);

diff --git a/mm/migrate.c b/mm/migrate.c
index c7da064b4781..be0d5ae36dc1 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1761,6 +1761,13 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
 			continue;
 		}

+		/*
+		 * The move_pages() man page does not have an -EEXIST choice, so
+		 * use -EFAULT instead.
+		 */
+		if (err == -EEXIST)
+			err = -EFAULT;
+
 		/*
 		 * If the page is already on the target node (!err), store the
 		 * node, otherwise, store the err.

thanks,