| Message ID | 20231118063233.733523-4-vivek.kasireddy@intel.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | mm/gup: Introduce pin_user_pages_fd() for pinning shmem/hugetlbfs file pages (v4) |
On 18.11.23 07:32, Vivek Kasireddy wrote:
> For drivers that would like to longterm-pin the pages associated
> with a file, the pin_user_pages_fd() API provides an option to
> not only pin the pages via FOLL_PIN but also to check and migrate
> them if they reside in movable zone or CMA block. This API
> currently works with files that belong to either shmem or hugetlbfs.
> Files belonging to other filesystems are rejected for now.
>
> The pages need to be located first before pinning them via FOLL_PIN.
> If they are found in the page cache, they can be immediately pinned.
> Otherwise, they need to be allocated using the filesystem specific
> APIs and then pinned.
>
> v2:
> - Drop gup_flags and improve comments and commit message (David)
> - Allocate a page if we cannot find in page cache for the hugetlbfs
>   case as well (David)
> - Don't unpin pages if there is a migration related failure (David)
> - Drop the unnecessary nr_pages <= 0 check (Jason)
> - Have the caller of the API pass in file * instead of fd (Jason)
>
> v3: (David)
> - Enclose the huge page allocation code with #ifdef CONFIG_HUGETLB_PAGE
>   (Build error reported by kernel test robot <lkp@intel.com>)
> - Don't forget memalloc_pin_restore() on non-migration related errors
> - Improve the readability of the cleanup code associated with
>   non-migration related errors
> - Augment the comments by describing FOLL_LONGTERM like behavior
> - Include the R-b tag from Jason
>
> v4:
> - Remove the local variable "page" and instead use 3 return statements
>   in alloc_file_page() (David)
> - Add the R-b tag from David
>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Gerd Hoffmann <kraxel@redhat.com>
> Cc: Dongwon Kim <dongwon.kim@intel.com>
> Cc: Junxiao Chang <junxiao.chang@intel.com>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> (v2)
> Reviewed-by: David Hildenbrand <david@redhat.com> (v3)
> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> ---

[...]

> +static struct page *alloc_file_page(struct file *file, pgoff_t idx)
> +{
> +#ifdef CONFIG_HUGETLB_PAGE
> +	struct folio *folio;
> +	int err;
> +
> +	if (is_file_hugepages(file)) {
> +		folio = alloc_hugetlb_folio_nodemask(hstate_file(file),
> +						     NUMA_NO_NODE,
> +						     NULL,
> +						     GFP_USER);
> +		if (folio && folio_try_get(folio)) {
> +			err = hugetlb_add_to_page_cache(folio,
> +							file->f_mapping,
> +							idx);
> +			if (err) {
> +				folio_put(folio);
> +				free_huge_folio(folio);
> +				return ERR_PTR(err);
> +			}
> +			return &folio->page;

While looking at the user of pin_user_pages_fd(), I realized something:

Assume idx is not aligned to the hugetlb page size.
find_get_page_flags() would always return a tail page in that case, but
you'd be returning the head page here.

See pagecache_get_page()->folio_file_page(folio, index);

> +		}
> +		return ERR_PTR(-ENOMEM);
> +	}
> +#endif
> +	return shmem_read_mapping_page(file->f_mapping, idx);
> +}
> +
> +/**
> + * pin_user_pages_fd() - pin user pages associated with a file
> + * @file:     the file whose pages are to be pinned
> + * @start:    starting file offset
> + * @nr_pages: number of pages from start to pin
> + * @pages:    array that receives pointers to the pages pinned.
> + *            Should be at-least nr_pages long.
> + *
> + * Attempt to pin pages associated with a file that belongs to either shmem
> + * or hugetlb. The pages are either found in the page cache or allocated if
> + * necessary. Once the pages are located, they are all pinned via FOLL_PIN.
> + * And, these pinned pages need to be released either using unpin_user_pages()
> + * or unpin_user_page().
> + *
> + * It must be noted that the pages may be pinned for an indefinite amount
> + * of time. And, in most cases, the duration of time they may stay pinned
> + * would be controlled by the userspace. This behavior is effectively the
> + * same as using FOLL_LONGTERM with other GUP APIs.
> + *
> + * Returns number of pages pinned. This would be equal to the number of
> + * pages requested. If no pages were pinned, it returns -errno.
> + */
> +long pin_user_pages_fd(struct file *file, pgoff_t start,
> +		       unsigned long nr_pages, struct page **pages)
> +{
> +	struct page *page;
> +	unsigned int flags, i;
> +	long ret;
> +
> +	if (start < 0)
> +		return -EINVAL;
> +
> +	if (!file)
> +		return -EINVAL;
> +
> +	if (!shmem_file(file) && !is_file_hugepages(file))
> +		return -EINVAL;
> +
> +	flags = memalloc_pin_save();
> +	do {
> +		for (i = 0; i < nr_pages; i++) {
> +			/*
> +			 * In most cases, we should be able to find the page
> +			 * in the page cache. If we cannot find it, we try to
> +			 * allocate one and add it to the page cache.
> +			 */
> +			page = find_get_page_flags(file->f_mapping,
> +						   start + i,
> +						   FGP_ACCESSED);
> +			if (!page) {
> +				page = alloc_file_page(file, start + i);
> +				if (IS_ERR(page)) {
> +					ret = PTR_ERR(page);
> +					goto err;

While looking at above, I do wonder: what if two parties tried to alloc
the page at the same time? I suspect we'd want to handle -EEXIST a bit
nicer here, right?
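For reference, the tail-page behavior David points out comes from how the
page-cache lookup resolves a subpage inside a large folio. A condensed copy
of folio_file_page() from include/linux/pagemap.h (as of roughly the kernel
this thread targets; check your tree) shows the masking:

```c
/*
 * Condensed from include/linux/pagemap.h: given a folio and a file
 * index, return the subpage of the folio that backs that index. For a
 * hugetlb folio and a non-aligned index this is a tail page, which is
 * what find_get_page_flags() ends up returning via
 * pagecache_get_page() -> folio_file_page().
 */
static inline struct page *folio_file_page(struct folio *folio, pgoff_t index)
{
	/* Mask the index into the folio to select the subpage. */
	return folio_page(folio, index & (folio_nr_pages(folio) - 1));
}
```

So a cache hit for a misaligned index yields a tail page, while the
allocation path above returns the head page for the same index.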
Hi David,

> On 18.11.23 07:32, Vivek Kasireddy wrote:
> > For drivers that would like to longterm-pin the pages associated
> > with a file, the pin_user_pages_fd() API provides an option to
> > not only pin the pages via FOLL_PIN but also to check and migrate
> > them if they reside in movable zone or CMA block. This API
> > currently works with files that belong to either shmem or hugetlbfs.
> > Files belonging to other filesystems are rejected for now.

[...]

> > +			if (err) {
> > +				folio_put(folio);
> > +				free_huge_folio(folio);
> > +				return ERR_PTR(err);
> > +			}
> > +			return &folio->page;
>
> While looking at the user of pin_user_pages_fd(), I realized something:
>
> Assume idx is not aligned to the hugetlb page size.
> find_get_page_flags() would always return a tail page in that case, but
> you'd be returning the head page here.
>
> See pagecache_get_page()->folio_file_page(folio, index);

Thank you for catching this. Looking at how udmabuf uses this API for
the hugetlb case:

	hpstate = hstate_file(memfd);
	mapidx = list[i].offset >> huge_page_shift(hpstate);
	do {
		nr_pages = shmem_file(memfd) ? pgcnt : 1;
		ret = pin_user_pages_fd(memfd, mapidx, nr_pages,
					ubuf->pages + pgbuf);

As the raw file offset is translated into huge page size units
(mapidx), I was expecting find_get_page_flags() to return a head page.
I did not realize that it now returns tail pages, given that it had
returned head pages in the previous kernel versions I had tested, IIRC.
As my goal is to only grab the head pages, __filemap_get_folio() seems
like the right API to use instead of find_get_page_flags(). With this
change, the hugetlb subtest (which I had not run on kernels >= 6.7
until now) that fails with kernel 6.7-rc1 now works as expected.

[...]

> > +			page = find_get_page_flags(file->f_mapping,
> > +						   start + i,
> > +						   FGP_ACCESSED);
> > +			if (!page) {
> > +				page = alloc_file_page(file, start + i);
> > +				if (IS_ERR(page)) {
> > +					ret = PTR_ERR(page);
> > +					goto err;
>
> While looking at above, I do wonder: what if two parties tried to alloc
> the page at the same time? I suspect we'd want to handle -EEXIST a bit
> nicer here, right?

At least with the udmabuf use-case, there cannot be multiple entities
allocating and adding a page at a given offset in the memfd at the same
time. However, I can make the following changes to protect against this
potential failure mode:

	do {
		for (i = 0; i < nr_pages; i++) {
+retry:
+			page = NULL;
			/*
			 * In most cases, we should be able to find the page
			 * in the page cache. If we cannot find it, we try to
			 * allocate one and add it to the page cache.
			 */
-			page = find_get_page_flags(file->f_mapping,
-						   start + i,
-						   FGP_ACCESSED);
+			folio = __filemap_get_folio(file->f_mapping,
+						    start + i,
+						    FGP_ACCESSED, 0);
+			if (!IS_ERR(folio))
+				page = &folio->page;
+
			if (!page) {
				page = alloc_file_page(file, start + i);
				if (IS_ERR(page)) {
					ret = PTR_ERR(page);
+					if (ret == -EEXIST)
+						goto retry;
					goto err;
				}

Thanks,
Vivek

> --
> Cheers,
>
> David / dhildenb
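Taken together, the two changes Vivek sketches above would make the lookup
loop look roughly like the following. This is an untested consolidation of
the diff fragments in the reply, not the final patch; it assumes a new
`struct folio *folio` local alongside the existing declarations:

```c
	flags = memalloc_pin_save();
	do {
		for (i = 0; i < nr_pages; i++) {
retry:
			page = NULL;
			/*
			 * __filemap_get_folio() hands back the folio itself,
			 * so a hugetlb lookup yields the head page rather
			 * than a tail page for a non-aligned index.
			 */
			folio = __filemap_get_folio(file->f_mapping, start + i,
						    FGP_ACCESSED, 0);
			if (!IS_ERR(folio))
				page = &folio->page;

			if (!page) {
				page = alloc_file_page(file, start + i);
				if (IS_ERR(page)) {
					ret = PTR_ERR(page);
					/*
					 * A racing allocator added a page at
					 * this offset first; look it up again.
					 */
					if (ret == -EEXIST)
						goto retry;
					goto err;
				}
			}
			/* ... pin via try_grab_page(page, FOLL_PIN) as before ... */
		}
		ret = check_and_migrate_movable_pages(nr_pages, pages);
	} while (ret == -EAGAIN);
```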
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 418d26608ece..1b675fa35059 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2472,6 +2472,8 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 		    struct page **pages, unsigned int gup_flags);
 long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 		    struct page **pages, unsigned int gup_flags);
+long pin_user_pages_fd(struct file *file, pgoff_t start,
+		       unsigned long nr_pages, struct page **pages);
 
 int get_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages);
diff --git a/mm/gup.c b/mm/gup.c
index 231711efa390..875c51d13ee5 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -3410,3 +3410,111 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 				     &locked, gup_flags);
 }
 EXPORT_SYMBOL(pin_user_pages_unlocked);
+
+static struct page *alloc_file_page(struct file *file, pgoff_t idx)
+{
+#ifdef CONFIG_HUGETLB_PAGE
+	struct folio *folio;
+	int err;
+
+	if (is_file_hugepages(file)) {
+		folio = alloc_hugetlb_folio_nodemask(hstate_file(file),
+						     NUMA_NO_NODE,
+						     NULL,
+						     GFP_USER);
+		if (folio && folio_try_get(folio)) {
+			err = hugetlb_add_to_page_cache(folio,
+							file->f_mapping,
+							idx);
+			if (err) {
+				folio_put(folio);
+				free_huge_folio(folio);
+				return ERR_PTR(err);
+			}
+			return &folio->page;
+		}
+		return ERR_PTR(-ENOMEM);
+	}
+#endif
+	return shmem_read_mapping_page(file->f_mapping, idx);
+}
+
+/**
+ * pin_user_pages_fd() - pin user pages associated with a file
+ * @file:     the file whose pages are to be pinned
+ * @start:    starting file offset
+ * @nr_pages: number of pages from start to pin
+ * @pages:    array that receives pointers to the pages pinned.
+ *            Should be at-least nr_pages long.
+ *
+ * Attempt to pin pages associated with a file that belongs to either shmem
+ * or hugetlb. The pages are either found in the page cache or allocated if
+ * necessary. Once the pages are located, they are all pinned via FOLL_PIN.
+ * And, these pinned pages need to be released either using unpin_user_pages()
+ * or unpin_user_page().
+ *
+ * It must be noted that the pages may be pinned for an indefinite amount
+ * of time. And, in most cases, the duration of time they may stay pinned
+ * would be controlled by the userspace. This behavior is effectively the
+ * same as using FOLL_LONGTERM with other GUP APIs.
+ *
+ * Returns number of pages pinned. This would be equal to the number of
+ * pages requested. If no pages were pinned, it returns -errno.
+ */
+long pin_user_pages_fd(struct file *file, pgoff_t start,
+		       unsigned long nr_pages, struct page **pages)
+{
+	struct page *page;
+	unsigned int flags, i;
+	long ret;
+
+	if (start < 0)
+		return -EINVAL;
+
+	if (!file)
+		return -EINVAL;
+
+	if (!shmem_file(file) && !is_file_hugepages(file))
+		return -EINVAL;
+
+	flags = memalloc_pin_save();
+	do {
+		for (i = 0; i < nr_pages; i++) {
+			/*
+			 * In most cases, we should be able to find the page
+			 * in the page cache. If we cannot find it, we try to
+			 * allocate one and add it to the page cache.
+			 */
+			page = find_get_page_flags(file->f_mapping,
+						   start + i,
+						   FGP_ACCESSED);
+			if (!page) {
+				page = alloc_file_page(file, start + i);
+				if (IS_ERR(page)) {
+					ret = PTR_ERR(page);
+					goto err;
+				}
+			}
+			ret = try_grab_page(page, FOLL_PIN);
+			if (unlikely(ret))
+				goto err;
+
+			pages[i] = page;
+			put_page(pages[i]);
+		}
+
+		ret = check_and_migrate_movable_pages(nr_pages, pages);
+	} while (ret == -EAGAIN);
+
+	memalloc_pin_restore(flags);
+	return ret ? ret : nr_pages;
+err:
+	memalloc_pin_restore(flags);
+	while (i-- > 0)
+		if (pages[i])
+			unpin_user_page(pages[i]);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pin_user_pages_fd);
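For context, a caller of the new API would look roughly like the sketch
below. This is a hypothetical driver-side example, not part of the patch;
the function name my_pin_memfd_pages() is illustrative. Per the kernel-doc
above, a successful call pins all nr pages, which must later be released
with unpin_user_pages():

```c
/*
 * Hypothetical usage example (not from the patch): pin 'nr' pages of a
 * shmem/hugetlbfs-backed file starting at file page offset 'start'. On
 * success, every entry of 'pages' is FOLL_PIN-pinned and must be
 * released with unpin_user_pages() when the driver is done with them.
 */
static long my_pin_memfd_pages(struct file *memfd, pgoff_t start,
			       unsigned long nr, struct page **pages)
{
	long ret;

	ret = pin_user_pages_fd(memfd, start, nr, pages);
	if (ret < 0)
		return ret;	/* -errno: no pages were pinned */

	/* ... use the pages, e.g. map them for long-term DMA ... */

	unpin_user_pages(pages, nr);
	return 0;
}
```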