diff mbox series

[v2,05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN

Message ID 20191103211813.213227-6-jhubbard@nvidia.com (mailing list archive)
State New, archived
Headers show
Series mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM | expand

Commit Message

John Hubbard Nov. 3, 2019, 9:18 p.m. UTC
Introduce pin_user_pages*() variations of get_user_pages*() calls,
and also pin_longterm_pages*() variations.

These variants all set FOLL_PIN, which is also introduced, and
thoroughly documented.

The pin_longterm*() variants also set FOLL_LONGTERM, in addition
to FOLL_PIN:

    pin_user_pages()
    pin_user_pages_remote()
    pin_user_pages_fast()

    pin_longterm_pages()
    pin_longterm_pages_remote()
    pin_longterm_pages_fast()

All pages that are pinned via the above calls, must be unpinned via
put_user_page().

The underlying rules are:

* These are gup-internal flags, so the call sites should not directly
set FOLL_PIN nor FOLL_LONGTERM. That behavior is enforced with
assertions, for the new FOLL_PIN flag. However, for the pre-existing
FOLL_LONGTERM flag, which has some call sites that still directly
set FOLL_LONGTERM, there is no assertion yet.

* Call sites that want to indicate that they are going to do DirectIO
  ("DIO") or something with similar characteristics, should call a
  get_user_pages()-like wrapper call that sets FOLL_PIN. These wrappers
  will:
        * Start with "pin_user_pages" instead of "get_user_pages". That
          makes it easy to find and audit the call sites.
        * Set FOLL_PIN

* For pages that are received via FOLL_PIN, those pages must be returned
  via put_user_page().

Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases
in this documentation. (I've reworded it and expanded on it slightly.)

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 Documentation/vm/index.rst          |   1 +
 Documentation/vm/pin_user_pages.rst | 212 ++++++++++++++++++++++
 include/linux/mm.h                  |  62 ++++++-
 mm/gup.c                            | 265 +++++++++++++++++++++++++---
 4 files changed, 514 insertions(+), 26 deletions(-)
 create mode 100644 Documentation/vm/pin_user_pages.rst

Comments

Jerome Glisse Nov. 4, 2019, 5:33 p.m. UTC | #1
On Sun, Nov 03, 2019 at 01:18:00PM -0800, John Hubbard wrote:
> Introduce pin_user_pages*() variations of get_user_pages*() calls,
> and also pin_longterm_pages*() variations.
> 
> These variants all set FOLL_PIN, which is also introduced, and
> thoroughly documented.
> 
> The pin_longterm*() variants also set FOLL_LONGTERM, in addition
> to FOLL_PIN:
> 
>     pin_user_pages()
>     pin_user_pages_remote()
>     pin_user_pages_fast()
> 
>     pin_longterm_pages()
>     pin_longterm_pages_remote()
>     pin_longterm_pages_fast()
> 
> All pages that are pinned via the above calls, must be unpinned via
> put_user_page().
> 
> The underlying rules are:
> 
> * These are gup-internal flags, so the call sites should not directly
> set FOLL_PIN nor FOLL_LONGTERM. That behavior is enforced with
> assertions, for the new FOLL_PIN flag. However, for the pre-existing
> FOLL_LONGTERM flag, which has some call sites that still directly
> set FOLL_LONGTERM, there is no assertion yet.
> 
> * Call sites that want to indicate that they are going to do DirectIO
>   ("DIO") or something with similar characteristics, should call a
>   get_user_pages()-like wrapper call that sets FOLL_PIN. These wrappers
>   will:
>         * Start with "pin_user_pages" instead of "get_user_pages". That
>           makes it easy to find and audit the call sites.
>         * Set FOLL_PIN
> 
> * For pages that are received via FOLL_PIN, those pages must be returned
>   via put_user_page().
> 
> Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases
> in this documentation. (I've reworded it and expanded on it slightly.)
> 
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>

Few nitpick belows, nonetheless:

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

> ---
>  Documentation/vm/index.rst          |   1 +
>  Documentation/vm/pin_user_pages.rst | 212 ++++++++++++++++++++++
>  include/linux/mm.h                  |  62 ++++++-
>  mm/gup.c                            | 265 +++++++++++++++++++++++++---
>  4 files changed, 514 insertions(+), 26 deletions(-)
>  create mode 100644 Documentation/vm/pin_user_pages.rst
> 

[...]

> diff --git a/Documentation/vm/pin_user_pages.rst b/Documentation/vm/pin_user_pages.rst
> new file mode 100644
> index 000000000000..3910f49ca98c
> --- /dev/null
> +++ b/Documentation/vm/pin_user_pages.rst

[...]

> +
> +FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags
> +==========================================================
> +
> +Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for describing
> +these categories:
> +
> +CASE 1: Direct IO (DIO)
> +-----------------------
> +There are GUP references to pages that are serving
> +as DIO buffers. These buffers are needed for a relatively short time (so they
> +are not "long term"). No special synchronization with page_mkclean() or
> +munmap() is provided. Therefore, flags to set at the call site are: ::
> +
> +    FOLL_PIN
> +
> +...but rather than setting FOLL_PIN directly, call sites should use one of
> +the pin_user_pages*() routines that set FOLL_PIN.
> +
> +CASE 2: RDMA
> +------------
> +There are GUP references to pages that are serving as DMA
> +buffers. These buffers are needed for a long time ("long term"). No special
> +synchronization with page_mkclean() or munmap() is provided. Therefore, flags
> +to set at the call site are: ::
> +
> +    FOLL_PIN | FOLL_LONGTERM
> +
> +NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's
> +because DAX pages do not have a separate page cache, and so "pinning" implies
> +locking down file system blocks, which is not (yet) supported in that way.
> +
> +CASE 3: ODP
> +-----------
> +(Mellanox/Infiniband On Demand Paging: the hardware supports
> +replayable page faulting). There are GUP references to pages serving as DMA
> +buffers. For ODP, MMU notifiers are used to synchronize with page_mkclean()
> +and munmap(). Therefore, normal GUP calls are sufficient, so neither flag
> +needs to be set.

I would not include ODP or anything like it here, they do not use
GUP anymore and i believe it is more confusing here. I would how-
ever include some text in this documentation explaining that hard-
ware that support page fault is superior as it does not incur any
of the issues described here.

> +
> +CASE 4: Pinning for struct page manipulation only
> +-------------------------------------------------
> +Here, normal GUP calls are sufficient, so neither flag needs to be set.
> +

[...]

> diff --git a/mm/gup.c b/mm/gup.c
> index 199da99e8ffc..1aea48427879 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c

[...]

> @@ -1014,7 +1018,16 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>  		BUG_ON(*locked != 1);
>  	}
>  
> -	if (pages)
> +	/*
> +	 * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
> +	 * is to set FOLL_GET if the caller wants pages[] filled in (but has
> +	 * carelessly failed to specify FOLL_GET), so keep doing that, but only
> +	 * for FOLL_GET, not for the newer FOLL_PIN.
> +	 *
> +	 * FOLL_PIN always expects pages to be non-null, but no need to assert
> +	 * that here, as any failures will be obvious enough.
> +	 */
> +	if (pages && !(flags & FOLL_PIN))
>  		flags |= FOLL_GET;

Did you look at user that have pages and not FOLL_GET set ?
I believe it would be better to first fix them to end up
with FOLL_GET set and then error out if pages is != NULL but
nor FOLL_GET or FOLL_PIN is set.

>  
>  	pages_done = 0;

> @@ -2373,24 +2402,9 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
>  	return ret;
>  }
>  
> -/**
> - * get_user_pages_fast() - pin user pages in memory
> - * @start:	starting user address
> - * @nr_pages:	number of pages from start to pin
> - * @gup_flags:	flags modifying pin behaviour
> - * @pages:	array that receives pointers to the pages pinned.
> - *		Should be at least nr_pages long.
> - *
> - * Attempt to pin user pages in memory without taking mm->mmap_sem.
> - * If not successful, it will fall back to taking the lock and
> - * calling get_user_pages().
> - *
> - * Returns number of pages pinned. This may be fewer than the number
> - * requested. If nr_pages is 0 or negative, returns 0. If no pages
> - * were pinned, returns -errno.
> - */
> -int get_user_pages_fast(unsigned long start, int nr_pages,
> -			unsigned int gup_flags, struct page **pages)
> +static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
> +					unsigned int gup_flags,
> +					struct page **pages)

Usualy function are rename to _old_func_name ie add _ in front. So
here it would become _get_user_pages_fast but i know some people
don't like that as sometimes we endup with ___function_overloaded :)

>  {
>  	unsigned long addr, len, end;
>  	int nr = 0, ret = 0;


> @@ -2435,4 +2449,215 @@ int get_user_pages_fast(unsigned long start, int nr_pages,

[...]

> +/**
> + * pin_user_pages_remote() - pin pages for (typically) use by Direct IO, and
> + * return the pages to the user.

Not a fan of (typically) maybe:
pin_user_pages_remote() - pin pages of a remote process (task != current)

I think here the remote part if more important that DIO. Remote is use by
other thing that DIO.

> + *
> + * Nearly the same as get_user_pages_remote(), except that FOLL_PIN is set. See
> + * get_user_pages_remote() for documentation on the function arguments, because
> + * the arguments here are identical.
> + *
> + * FOLL_PIN means that the pages must be released via put_user_page(). Please
> + * see Documentation/vm/pin_user_pages.rst for details.
> + *
> + * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
> + * is NOT intended for Case 2 (RDMA: long-term pins).
> + */
> +long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
> +			   unsigned long start, unsigned long nr_pages,
> +			   unsigned int gup_flags, struct page **pages,
> +			   struct vm_area_struct **vmas, int *locked)
> +{
> +	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
> +	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> +		return -EINVAL;
> +
> +	gup_flags |= FOLL_TOUCH | FOLL_REMOTE | FOLL_PIN;
> +
> +	return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
> +				       locked, gup_flags);
> +}
> +EXPORT_SYMBOL(pin_user_pages_remote);
> +
> +/**
> + * pin_longterm_pages_remote() - pin pages for (typically) use by Direct IO, and
> + * return the pages to the user.

I think you copy pasted this from pin_user_pages_remote() :)

> + *
> + * Nearly the same as get_user_pages_remote(), but note that FOLL_TOUCH is not
> + * set, and FOLL_PIN and FOLL_LONGTERM are set. See get_user_pages_remote() for
> + * documentation on the function arguments, because the arguments here are
> + * identical.
> + *
> + * FOLL_PIN means that the pages must be released via put_user_page(). Please
> + * see Documentation/vm/pin_user_pages.rst for further details.
> + *
> + * FOLL_LONGTERM means that the pages are being pinned for "long term" use,
> + * typically by a non-CPU device, and we cannot be sure that waiting for a
> + * pinned page to become unpin will be effective.
> + *
> + * This is intended for Case 2 (RDMA: long-term pins) in
> + * Documentation/vm/pin_user_pages.rst.
> + */
> +long pin_longterm_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
> +			       unsigned long start, unsigned long nr_pages,
> +			       unsigned int gup_flags, struct page **pages,
> +			       struct vm_area_struct **vmas, int *locked)
> +{
> +	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
> +	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> +		return -EINVAL;
> +
> +	/*
> +	 * FIXME: as noted in the get_user_pages_remote() implementation, it
> +	 * is not yet possible to safely set FOLL_LONGTERM here. FOLL_LONGTERM
> +	 * needs to be set, but for now the best we can do is a "TODO" item.
> +	 */
> +	gup_flags |= FOLL_REMOTE | FOLL_PIN;

Wouldn't it be better to not add pin_longterm_pages_remote() until
it can be properly implemented ?
John Hubbard Nov. 4, 2019, 7:04 p.m. UTC | #2
On 11/4/19 9:33 AM, Jerome Glisse wrote:
...
> 
> Few nitpick belows, nonetheless:
> 
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> [...]
>> +
>> +CASE 3: ODP
>> +-----------
>> +(Mellanox/Infiniband On Demand Paging: the hardware supports
>> +replayable page faulting). There are GUP references to pages serving as DMA
>> +buffers. For ODP, MMU notifiers are used to synchronize with page_mkclean()
>> +and munmap(). Therefore, normal GUP calls are sufficient, so neither flag
>> +needs to be set.
> 
> I would not include ODP or anything like it here, they do not use
> GUP anymore and i believe it is more confusing here. I would how-
> ever include some text in this documentation explaining that hard-
> ware that support page fault is superior as it does not incur any
> of the issues described here.

OK, agreed, here's a new write up that I'll put in v3:


CASE 3: ODP
-----------
Advanced, but non-CPU (DMA) hardware that supports replayable page faults.
Here, a well-written driver doesn't normally need to pin pages at all. However,
if the driver does choose to do so, it can register MMU notifiers for the range,
and will be called back upon invalidation. Either way (avoiding page pinning, or
using MMU notifiers to unpin upon request), there is proper synchronization with 
both filesystem and mm (page_mkclean(), munmap(), etc).

Therefore, neither flag needs to be set.

It's worth mentioning here that pinning pages should not be the first design
choice. If page fault capable hardware is available, then the software should
be written so that it does not pin pages. This allows mm and filesystems to
operate more efficiently and reliably.

> [...]
> 
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 199da99e8ffc..1aea48427879 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
> 
> [...]
> 
>> @@ -1014,7 +1018,16 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>>  		BUG_ON(*locked != 1);
>>  	}
>>  
>> -	if (pages)
>> +	/*
>> +	 * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
>> +	 * is to set FOLL_GET if the caller wants pages[] filled in (but has
>> +	 * carelessly failed to specify FOLL_GET), so keep doing that, but only
>> +	 * for FOLL_GET, not for the newer FOLL_PIN.
>> +	 *
>> +	 * FOLL_PIN always expects pages to be non-null, but no need to assert
>> +	 * that here, as any failures will be obvious enough.
>> +	 */
>> +	if (pages && !(flags & FOLL_PIN))
>>  		flags |= FOLL_GET;
> 
> Did you look at user that have pages and not FOLL_GET set ?
> I believe it would be better to first fix them to end up
> with FOLL_GET set and then error out if pages is != NULL but
> nor FOLL_GET or FOLL_PIN is set.
> 

I was perhaps overly cautious, and didn't go there. However, it's probably
doable, given that there was already the following in __get_user_pages():

    VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));

...which will have conditioned people and code to set FOLL_GET together with
pages. So I agree that the time is right.

In order to make bisecting future failures simpler, I can insert a patch right 
before this one, that changes the FOLL_GET setting into an assert, like this:

diff --git a/mm/gup.c b/mm/gup.c
index 8f236a335ae9..be338961e80d 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1014,8 +1014,8 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
                BUG_ON(*locked != 1);
        }
 
-       if (pages)
-               flags |= FOLL_GET;
+       if (pages && WARN_ON_ONCE(!(gup_flags & FOLL_GET)))
+               return -EINVAL;
 
        pages_done = 0;
        lock_dropped = false;


...and then add in FOLL_PIN, with this patch.

>>  
>>  	pages_done = 0;
> 
>> @@ -2373,24 +2402,9 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
>>  	return ret;
>>  }
>>  
>> -/**
>> - * get_user_pages_fast() - pin user pages in memory
>> - * @start:	starting user address
>> - * @nr_pages:	number of pages from start to pin
>> - * @gup_flags:	flags modifying pin behaviour
>> - * @pages:	array that receives pointers to the pages pinned.
>> - *		Should be at least nr_pages long.
>> - *
>> - * Attempt to pin user pages in memory without taking mm->mmap_sem.
>> - * If not successful, it will fall back to taking the lock and
>> - * calling get_user_pages().
>> - *
>> - * Returns number of pages pinned. This may be fewer than the number
>> - * requested. If nr_pages is 0 or negative, returns 0. If no pages
>> - * were pinned, returns -errno.
>> - */
>> -int get_user_pages_fast(unsigned long start, int nr_pages,
>> -			unsigned int gup_flags, struct page **pages)
>> +static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
>> +					unsigned int gup_flags,
>> +					struct page **pages)
> 
> Usualy function are rename to _old_func_name ie add _ in front. So
> here it would become _get_user_pages_fast but i know some people
> don't like that as sometimes we endup with ___function_overloaded :)

Exactly: the __get_user_pages* names were already used for *non*-internal
routines, so I attempted to pick the next best naming prefix.

> 
>>  {
>>  	unsigned long addr, len, end;
>>  	int nr = 0, ret = 0;
> 
> 
>> @@ -2435,4 +2449,215 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
> 
> [...]
> 
>> +/**
>> + * pin_user_pages_remote() - pin pages for (typically) use by Direct IO, and
>> + * return the pages to the user.
> 
> Not a fan of (typically) maybe:
> pin_user_pages_remote() - pin pages of a remote process (task != current)
> 
> I think here the remote part if more important that DIO. Remote is use by
> other thing that DIO.

Yes, good point. I'll use your wording:

 * pin_user_pages_remote() - pin pages of a remote process (task != current)



> 
>> + *
>> + * Nearly the same as get_user_pages_remote(), except that FOLL_PIN is set. See
>> + * get_user_pages_remote() for documentation on the function arguments, because
>> + * the arguments here are identical.
>> + *
>> + * FOLL_PIN means that the pages must be released via put_user_page(). Please
>> + * see Documentation/vm/pin_user_pages.rst for details.
>> + *
>> + * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
>> + * is NOT intended for Case 2 (RDMA: long-term pins).
>> + */
>> +long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
>> +			   unsigned long start, unsigned long nr_pages,
>> +			   unsigned int gup_flags, struct page **pages,
>> +			   struct vm_area_struct **vmas, int *locked)
>> +{
>> +	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
>> +	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
>> +		return -EINVAL;
>> +
>> +	gup_flags |= FOLL_TOUCH | FOLL_REMOTE | FOLL_PIN;
>> +
>> +	return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
>> +				       locked, gup_flags);
>> +}
>> +EXPORT_SYMBOL(pin_user_pages_remote);
>> +
>> +/**
>> + * pin_longterm_pages_remote() - pin pages for (typically) use by Direct IO, and
>> + * return the pages to the user.
> 
> I think you copy pasted this from pin_user_pages_remote() :)

I admit to nothing, with respect to copy-paste! :)

This one can simply be:

 * pin_longterm_pages_remote() - pin pages of a remote process (task != current)


> 
>> + *
>> + * Nearly the same as get_user_pages_remote(), but note that FOLL_TOUCH is not
>> + * set, and FOLL_PIN and FOLL_LONGTERM are set. See get_user_pages_remote() for
>> + * documentation on the function arguments, because the arguments here are
>> + * identical.
>> + *
>> + * FOLL_PIN means that the pages must be released via put_user_page(). Please
>> + * see Documentation/vm/pin_user_pages.rst for further details.
>> + *
>> + * FOLL_LONGTERM means that the pages are being pinned for "long term" use,
>> + * typically by a non-CPU device, and we cannot be sure that waiting for a
>> + * pinned page to become unpin will be effective.
>> + *
>> + * This is intended for Case 2 (RDMA: long-term pins) in
>> + * Documentation/vm/pin_user_pages.rst.
>> + */
>> +long pin_longterm_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
>> +			       unsigned long start, unsigned long nr_pages,
>> +			       unsigned int gup_flags, struct page **pages,
>> +			       struct vm_area_struct **vmas, int *locked)
>> +{
>> +	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
>> +	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
>> +		return -EINVAL;
>> +
>> +	/*
>> +	 * FIXME: as noted in the get_user_pages_remote() implementation, it
>> +	 * is not yet possible to safely set FOLL_LONGTERM here. FOLL_LONGTERM
>> +	 * needs to be set, but for now the best we can do is a "TODO" item.
>> +	 */
>> +	gup_flags |= FOLL_REMOTE | FOLL_PIN;
> 
> Wouldn't it be better to not add pin_longterm_pages_remote() until
> it can be properly implemented ?
> 

Well, the problem is that I need each call site that requires FOLL_PIN
to use a proper wrapper. It's the FOLL_PIN that is the focus here, because
there is a hard, bright rule, which is: if and only if a caller sets
FOLL_PIN, then the dma-page tracking happens, and put_user_page() must
be called.

So this leaves me with only two reasonable choices:

a) Convert the call site as above: pin_longterm_pages_remote(), which sets
FOLL_PIN (the key point!), and leaves the FOLL_LONGTERM situation exactly
as it has been so far. When the FOLL_LONGTERM situation is fixed, the call
site *might* not need any changes to adopt the working gup.c code.

b) Convert the call site to pin_user_pages_remote(), which also sets
FOLL_PIN, and also leaves the FOLL_LONGTERM situation exactly as before.
There would also be a comment at the call site, to the effect of, "this
is the wrong call to make: it really requires FOLL_LONGTERM behavior".

When the FOLL_LONGTERM situation is fixed, the call site will need to be
changed to pin_longterm_pages_remote().

So you can probably see why I picked (a).


thanks,

John Hubbard
NVIDIA
Jerome Glisse Nov. 4, 2019, 7:18 p.m. UTC | #3
On Mon, Nov 04, 2019 at 11:04:38AM -0800, John Hubbard wrote:
> On 11/4/19 9:33 AM, Jerome Glisse wrote:
> ...
> > 
> > Few nitpick belows, nonetheless:
> > 
> > Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> > [...]
> >> +
> >> +CASE 3: ODP
> >> +-----------
> >> +(Mellanox/Infiniband On Demand Paging: the hardware supports
> >> +replayable page faulting). There are GUP references to pages serving as DMA
> >> +buffers. For ODP, MMU notifiers are used to synchronize with page_mkclean()
> >> +and munmap(). Therefore, normal GUP calls are sufficient, so neither flag
> >> +needs to be set.
> > 
> > I would not include ODP or anything like it here, they do not use
> > GUP anymore and i believe it is more confusing here. I would how-
> > ever include some text in this documentation explaining that hard-
> > ware that support page fault is superior as it does not incur any
> > of the issues described here.
> 
> OK, agreed, here's a new write up that I'll put in v3:
> 
> 
> CASE 3: ODP
> -----------

ODP is RDMA, maybe Hardware with page fault support instead

> Advanced, but non-CPU (DMA) hardware that supports replayable page faults.
> Here, a well-written driver doesn't normally need to pin pages at all. However,
> if the driver does choose to do so, it can register MMU notifiers for the range,
> and will be called back upon invalidation. Either way (avoiding page pinning, or
> using MMU notifiers to unpin upon request), there is proper synchronization with 
> both filesystem and mm (page_mkclean(), munmap(), etc).
> 
> Therefore, neither flag needs to be set.

In fact GUP should never be use with those.

> 
> It's worth mentioning here that pinning pages should not be the first design
> choice. If page fault capable hardware is available, then the software should
> be written so that it does not pin pages. This allows mm and filesystems to
> operate more efficiently and reliably.
> 
> > [...]
> > 
> >> diff --git a/mm/gup.c b/mm/gup.c
> >> index 199da99e8ffc..1aea48427879 100644
> >> --- a/mm/gup.c
> >> +++ b/mm/gup.c
> > 
> > [...]
> > 
> >> @@ -1014,7 +1018,16 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
> >>  		BUG_ON(*locked != 1);
> >>  	}
> >>  
> >> -	if (pages)
> >> +	/*
> >> +	 * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
> >> +	 * is to set FOLL_GET if the caller wants pages[] filled in (but has
> >> +	 * carelessly failed to specify FOLL_GET), so keep doing that, but only
> >> +	 * for FOLL_GET, not for the newer FOLL_PIN.
> >> +	 *
> >> +	 * FOLL_PIN always expects pages to be non-null, but no need to assert
> >> +	 * that here, as any failures will be obvious enough.
> >> +	 */
> >> +	if (pages && !(flags & FOLL_PIN))
> >>  		flags |= FOLL_GET;
> > 
> > Did you look at user that have pages and not FOLL_GET set ?
> > I believe it would be better to first fix them to end up
> > with FOLL_GET set and then error out if pages is != NULL but
> > nor FOLL_GET or FOLL_PIN is set.
> > 
> 
> I was perhaps overly cautious, and didn't go there. However, it's probably
> doable, given that there was already the following in __get_user_pages():
> 
>     VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));
> 
> ...which will have conditioned people and code to set FOLL_GET together with
> pages. So I agree that the time is right.
> 
> In order to make bisecting future failures simpler, I can insert a patch right 
> before this one, that changes the FOLL_GET setting into an assert, like this:
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 8f236a335ae9..be338961e80d 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1014,8 +1014,8 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>                 BUG_ON(*locked != 1);
>         }
>  
> -       if (pages)
> -               flags |= FOLL_GET;
> +       if (pages && WARN_ON_ONCE(!(gup_flags & FOLL_GET)))
> +               return -EINVAL;
>  
>         pages_done = 0;
>         lock_dropped = false;
> 
> 
> ...and then add in FOLL_PIN, with this patch.

looks good but double check that it should not happens, i will try
to check on my side too.

> 
> >>  
> >>  	pages_done = 0;
> > 
> >> @@ -2373,24 +2402,9 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
> >>  	return ret;
> >>  }
> >>  
> >> -/**
> >> - * get_user_pages_fast() - pin user pages in memory
> >> - * @start:	starting user address
> >> - * @nr_pages:	number of pages from start to pin
> >> - * @gup_flags:	flags modifying pin behaviour
> >> - * @pages:	array that receives pointers to the pages pinned.
> >> - *		Should be at least nr_pages long.
> >> - *
> >> - * Attempt to pin user pages in memory without taking mm->mmap_sem.
> >> - * If not successful, it will fall back to taking the lock and
> >> - * calling get_user_pages().
> >> - *
> >> - * Returns number of pages pinned. This may be fewer than the number
> >> - * requested. If nr_pages is 0 or negative, returns 0. If no pages
> >> - * were pinned, returns -errno.
> >> - */
> >> -int get_user_pages_fast(unsigned long start, int nr_pages,
> >> -			unsigned int gup_flags, struct page **pages)
> >> +static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
> >> +					unsigned int gup_flags,
> >> +					struct page **pages)
> > 
> > Usualy function are rename to _old_func_name ie add _ in front. So
> > here it would become _get_user_pages_fast but i know some people
> > don't like that as sometimes we endup with ___function_overloaded :)
> 
> Exactly: the __get_user_pages* names were already used for *non*-internal
> routines, so I attempted to pick the next best naming prefix.

Didn't know we were that far in the ___ :)

> > 
> >>  {
> >>  	unsigned long addr, len, end;
> >>  	int nr = 0, ret = 0;
> > 
> > 
> >> @@ -2435,4 +2449,215 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
> > 
> > [...]
> > 
> >> +/**
> >> + * pin_user_pages_remote() - pin pages for (typically) use by Direct IO, and
> >> + * return the pages to the user.
> > 
> > Not a fan of (typically) maybe:
> > pin_user_pages_remote() - pin pages of a remote process (task != current)
> > 
> > I think here the remote part if more important that DIO. Remote is use by
> > other thing that DIO.
> 
> Yes, good point. I'll use your wording:
> 
>  * pin_user_pages_remote() - pin pages of a remote process (task != current)
> 
> 
> 
> > 
> >> + *
> >> + * Nearly the same as get_user_pages_remote(), except that FOLL_PIN is set. See
> >> + * get_user_pages_remote() for documentation on the function arguments, because
> >> + * the arguments here are identical.
> >> + *
> >> + * FOLL_PIN means that the pages must be released via put_user_page(). Please
> >> + * see Documentation/vm/pin_user_pages.rst for details.
> >> + *
> >> + * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
> >> + * is NOT intended for Case 2 (RDMA: long-term pins).
> >> + */
> >> +long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
> >> +			   unsigned long start, unsigned long nr_pages,
> >> +			   unsigned int gup_flags, struct page **pages,
> >> +			   struct vm_area_struct **vmas, int *locked)
> >> +{
> >> +	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
> >> +	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> >> +		return -EINVAL;
> >> +
> >> +	gup_flags |= FOLL_TOUCH | FOLL_REMOTE | FOLL_PIN;
> >> +
> >> +	return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
> >> +				       locked, gup_flags);
> >> +}
> >> +EXPORT_SYMBOL(pin_user_pages_remote);
> >> +
> >> +/**
> >> + * pin_longterm_pages_remote() - pin pages for (typically) use by Direct IO, and
> >> + * return the pages to the user.
> > 
> > I think you copy pasted this from pin_user_pages_remote() :)
> 
> I admit to nothing, with respect to copy-paste! :)
> 
> This one can simply be:
> 
>  * pin_longterm_pages_remote() - pin pages of a remote process (task != current)
> 
> 
> > 
> >> + *
> >> + * Nearly the same as get_user_pages_remote(), but note that FOLL_TOUCH is not
> >> + * set, and FOLL_PIN and FOLL_LONGTERM are set. See get_user_pages_remote() for
> >> + * documentation on the function arguments, because the arguments here are
> >> + * identical.
> >> + *
> >> + * FOLL_PIN means that the pages must be released via put_user_page(). Please
> >> + * see Documentation/vm/pin_user_pages.rst for further details.
> >> + *
> >> + * FOLL_LONGTERM means that the pages are being pinned for "long term" use,
> >> + * typically by a non-CPU device, and we cannot be sure that waiting for a
> >> + * pinned page to become unpin will be effective.
> >> + *
> >> + * This is intended for Case 2 (RDMA: long-term pins) in
> >> + * Documentation/vm/pin_user_pages.rst.
> >> + */
> >> +long pin_longterm_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
> >> +			       unsigned long start, unsigned long nr_pages,
> >> +			       unsigned int gup_flags, struct page **pages,
> >> +			       struct vm_area_struct **vmas, int *locked)
> >> +{
> >> +	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
> >> +	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> >> +		return -EINVAL;
> >> +
> >> +	/*
> >> +	 * FIXME: as noted in the get_user_pages_remote() implementation, it
> >> +	 * is not yet possible to safely set FOLL_LONGTERM here. FOLL_LONGTERM
> >> +	 * needs to be set, but for now the best we can do is a "TODO" item.
> >> +	 */
> >> +	gup_flags |= FOLL_REMOTE | FOLL_PIN;
> > 
> > Wouldn't it be better to not add pin_longterm_pages_remote() until
> > it can be properly implemented ?
> > 
> 
> Well, the problem is that I need each call site that requires FOLL_PIN
> to use a proper wrapper. It's the FOLL_PIN that is the focus here, because
> there is a hard, bright rule, which is: if and only if a caller sets
> FOLL_PIN, then the dma-page tracking happens, and put_user_page() must
> be called.
> 
> So this leaves me with only two reasonable choices:
> 
> a) Convert the call site as above: pin_longterm_pages_remote(), which sets
> FOLL_PIN (the key point!), and leaves the FOLL_LONGTERM situation exactly
> as it has been so far. When the FOLL_LONGTERM situation is fixed, the call
> site *might* not need any changes to adopt the working gup.c code.
> 
> b) Convert the call site to pin_user_pages_remote(), which also sets
> FOLL_PIN, and also leaves the FOLL_LONGTERM situation exactly as before.
> There would also be a comment at the call site, to the effect of, "this
> is the wrong call to make: it really requires FOLL_LONGTERM behavior".
> 
> When the FOLL_LONGTERM situation is fixed, the call site will need to be
> changed to pin_longterm_pages_remote().
> 
> So you can probably see why I picked (a).

But right now nobody has FOLL_LONGTERM and FOLL_REMOTE. So you should
never have the need for pin_longterm_pages_remote(). My fear is that
longterm has implication and it would be better to not drop this implication
by adding a wrapper that does not do what the name says.

So do not introduce pin_longterm_pages_remote() until its first user
happens. This is option c)

Cheers,
Jérôme
John Hubbard Nov. 4, 2019, 7:30 p.m. UTC | #4
On 11/4/19 11:18 AM, Jerome Glisse wrote:
> On Mon, Nov 04, 2019 at 11:04:38AM -0800, John Hubbard wrote:
>> On 11/4/19 9:33 AM, Jerome Glisse wrote:
>> ...
>>>
>>> Few nitpick belows, nonetheless:
>>>
>>> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
>>> [...]
>>>> +
>>>> +CASE 3: ODP
>>>> +-----------
>>>> +(Mellanox/Infiniband On Demand Paging: the hardware supports
>>>> +replayable page faulting). There are GUP references to pages serving as DMA
>>>> +buffers. For ODP, MMU notifiers are used to synchronize with page_mkclean()
>>>> +and munmap(). Therefore, normal GUP calls are sufficient, so neither flag
>>>> +needs to be set.
>>>
>>> I would not include ODP or anything like it here, they do not use
>>> GUP anymore and i believe it is more confusing here. I would how-
>>> ever include some text in this documentation explaining that hard-
>>> ware that support page fault is superior as it does not incur any
>>> of the issues described here.
>>
>> OK, agreed, here's a new write up that I'll put in v3:
>>
>>
>> CASE 3: ODP
>> -----------
> 
> ODP is RDMA, maybe Hardware with page fault support instead
> 
>> Advanced, but non-CPU (DMA) hardware that supports replayable page faults.

OK, so:

    "RDMA hardware with page faulting support."

for the first sentence.


>> Here, a well-written driver doesn't normally need to pin pages at all. However,
>> if the driver does choose to do so, it can register MMU notifiers for the range,
>> and will be called back upon invalidation. Either way (avoiding page pinning, or
>> using MMU notifiers to unpin upon request), there is proper synchronization with 
>> both filesystem and mm (page_mkclean(), munmap(), etc).
>>
>> Therefore, neither flag needs to be set.
> 
> In fact GUP should never be use with those.


Yes. The next paragraph says that, but maybe not strong enough.


>>
>> It's worth mentioning here that pinning pages should not be the first design
>> choice. If page fault capable hardware is available, then the software should
>> be written so that it does not pin pages. This allows mm and filesystems to
>> operate more efficiently and reliably.

Here's what we have after the above changes:

CASE 3: ODP
-----------
RDMA hardware with page faulting support. Here, a well-written driver doesn't
normally need to pin pages at all. However, if the driver does choose to do so,
it can register MMU notifiers for the range, and will be called back upon
invalidation. Either way (avoiding page pinning, or using MMU notifiers to unpin
upon request), there is proper synchronization with both filesystem and mm
(page_mkclean(), munmap(), etc).

Therefore, neither flag needs to be set.

In this case, ideally, neither get_user_pages() nor pin_user_pages() should be 
called. Instead, the software should be written so that it does not pin pages. 
This allows mm and filesystems to operate more efficiently and reliably.

>>> [...]
>>>
>>>> @@ -1014,7 +1018,16 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>>>>  		BUG_ON(*locked != 1);
>>>>  	}
>>>>  
>>>> -	if (pages)
>>>> +	/*
>>>> +	 * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
>>>> +	 * is to set FOLL_GET if the caller wants pages[] filled in (but has
>>>> +	 * carelessly failed to specify FOLL_GET), so keep doing that, but only
>>>> +	 * for FOLL_GET, not for the newer FOLL_PIN.
>>>> +	 *
>>>> +	 * FOLL_PIN always expects pages to be non-null, but no need to assert
>>>> +	 * that here, as any failures will be obvious enough.
>>>> +	 */
>>>> +	if (pages && !(flags & FOLL_PIN))
>>>>  		flags |= FOLL_GET;
>>>
>>> Did you look at user that have pages and not FOLL_GET set ?
>>> I believe it would be better to first fix them to end up
>>> with FOLL_GET set and then error out if pages is != NULL but
>>> nor FOLL_GET or FOLL_PIN is set.
>>>
>>
>> I was perhaps overly cautious, and didn't go there. However, it's probably
>> doable, given that there was already the following in __get_user_pages():
>>
>>     VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));
>>
>> ...which will have conditioned people and code to set FOLL_GET together with
>> pages. So I agree that the time is right.
>>
>> In order to make bisecting future failures simpler, I can insert a patch right 
>> before this one, that changes the FOLL_GET setting into an assert, like this:
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 8f236a335ae9..be338961e80d 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -1014,8 +1014,8 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>>                 BUG_ON(*locked != 1);
>>         }
>>  
>> -       if (pages)
>> -               flags |= FOLL_GET;
>> +       if (pages && WARN_ON_ONCE(!(gup_flags & FOLL_GET)))
>> +               return -EINVAL;
>>  
>>         pages_done = 0;
>>         lock_dropped = false;
>>
>>
>> ...and then add in FOLL_PIN, with this patch.
> 
> looks good but double check that it should not happens, i will try
> to check on my side too.

Yes, I'll look.

...
>>>> +	 */
>>>> +	gup_flags |= FOLL_REMOTE | FOLL_PIN;
>>>
>>> Wouldn't it be better to not add pin_longterm_pages_remote() until
>>> it can be properly implemented ?
>>>
>>
>> Well, the problem is that I need each call site that requires FOLL_PIN
>> to use a proper wrapper. It's the FOLL_PIN that is the focus here, because
>> there is a hard, bright rule, which is: if and only if a caller sets
>> FOLL_PIN, then the dma-page tracking happens, and put_user_page() must
>> be called.
>>
>> So this leaves me with only two reasonable choices:
>>
>> a) Convert the call site as above: pin_longterm_pages_remote(), which sets
>> FOLL_PIN (the key point!), and leaves the FOLL_LONGTERM situation exactly
>> as it has been so far. When the FOLL_LONGTERM situation is fixed, the call
>> site *might* not need any changes to adopt the working gup.c code.
>>
>> b) Convert the call site to pin_user_pages_remote(), which also sets
>> FOLL_PIN, and also leaves the FOLL_LONGTERM situation exactly as before.
>> There would also be a comment at the call site, to the effect of, "this
>> is the wrong call to make: it really requires FOLL_LONGTERM behavior".
>>
>> When the FOLL_LONGTERM situation is fixed, the call site will need to be
>> changed to pin_longterm_pages_remote().
>>
>> So you can probably see why I picked (a).
> 
> But right now nobody has FOLL_LONGTERM and FOLL_REMOTE. So you should
> never have the need for pin_longterm_pages_remote(). My fear is that
> longterm has implication and it would be better to not drop this implication
> by adding a wrapper that does not do what the name says.
> 
> So do not introduce pin_longterm_pages_remote() until its first user
> happens. This is option c)
> 

Almost forgot, though: there is already another user: Infiniband:

drivers/infiniband/core/umem_odp.c:646:         npages = pin_longterm_pages_remote(owning_process, owning_mm,



thanks,

John Hubbard
NVIDIA
Jerome Glisse Nov. 4, 2019, 7:52 p.m. UTC | #5
On Mon, Nov 04, 2019 at 11:30:32AM -0800, John Hubbard wrote:
> On 11/4/19 11:18 AM, Jerome Glisse wrote:
> > On Mon, Nov 04, 2019 at 11:04:38AM -0800, John Hubbard wrote:
> >> On 11/4/19 9:33 AM, Jerome Glisse wrote:
> >> ...
> >>>
> >>> Few nitpick belows, nonetheless:
> >>>
> >>> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> >>> [...]
> >>>> +
> >>>> +CASE 3: ODP
> >>>> +-----------
> >>>> +(Mellanox/Infiniband On Demand Paging: the hardware supports
> >>>> +replayable page faulting). There are GUP references to pages serving as DMA
> >>>> +buffers. For ODP, MMU notifiers are used to synchronize with page_mkclean()
> >>>> +and munmap(). Therefore, normal GUP calls are sufficient, so neither flag
> >>>> +needs to be set.
> >>>
> >>> I would not include ODP or anything like it here, they do not use
> >>> GUP anymore and i believe it is more confusing here. I would how-
> >>> ever include some text in this documentation explaining that hard-
> >>> ware that support page fault is superior as it does not incur any
> >>> of the issues described here.
> >>
> >> OK, agreed, here's a new write up that I'll put in v3:
> >>
> >>
> >> CASE 3: ODP
> >> -----------
> > 
> > ODP is RDMA, maybe Hardware with page fault support instead
> > 
> >> Advanced, but non-CPU (DMA) hardware that supports replayable page faults.
> 
> OK, so:
> 
>     "RDMA hardware with page faulting support."
> 
> for the first sentence.

I would drop RDMA completely, RDMA is just one example, they are GPU, FPGA and
others that are in that category. See below

> 
> 
> >> Here, a well-written driver doesn't normally need to pin pages at all. However,
> >> if the driver does choose to do so, it can register MMU notifiers for the range,
> >> and will be called back upon invalidation. Either way (avoiding page pinning, or
> >> using MMU notifiers to unpin upon request), there is proper synchronization with 
> >> both filesystem and mm (page_mkclean(), munmap(), etc).
> >>
> >> Therefore, neither flag needs to be set.
> > 
> > In fact GUP should never be use with those.
> 
> 
> Yes. The next paragraph says that, but maybe not strong enough.
> 
> 
> >>
> >> It's worth mentioning here that pinning pages should not be the first design
> >> choice. If page fault capable hardware is available, then the software should
> >> be written so that it does not pin pages. This allows mm and filesystems to
> >> operate more efficiently and reliably.
> 
> Here's what we have after the above changes:
> 
> CASE 3: ODP
> -----------
> RDMA hardware with page faulting support. Here, a well-written driver doesn't

CASE3: Hardware with page fault support
---------------------------------------

Here, a well-written ....


> normally need to pin pages at all. However, if the driver does choose to do so,
> it can register MMU notifiers for the range, and will be called back upon
> invalidation. Either way (avoiding page pinning, or using MMU notifiers to unpin
> upon request), there is proper synchronization with both filesystem and mm
> (page_mkclean(), munmap(), etc).
> 
> Therefore, neither flag needs to be set.
> 
> In this case, ideally, neither get_user_pages() nor pin_user_pages() should be 
> called. Instead, the software should be written so that it does not pin pages. 
> This allows mm and filesystems to operate more efficiently and reliably.
> 
> >>> [...]
> >>>
> >>>> @@ -1014,7 +1018,16 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
> >>>>  		BUG_ON(*locked != 1);
> >>>>  	}
> >>>>  
> >>>> -	if (pages)
> >>>> +	/*
> >>>> +	 * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
> >>>> +	 * is to set FOLL_GET if the caller wants pages[] filled in (but has
> >>>> +	 * carelessly failed to specify FOLL_GET), so keep doing that, but only
> >>>> +	 * for FOLL_GET, not for the newer FOLL_PIN.
> >>>> +	 *
> >>>> +	 * FOLL_PIN always expects pages to be non-null, but no need to assert
> >>>> +	 * that here, as any failures will be obvious enough.
> >>>> +	 */
> >>>> +	if (pages && !(flags & FOLL_PIN))
> >>>>  		flags |= FOLL_GET;
> >>>
> >>> Did you look at user that have pages and not FOLL_GET set ?
> >>> I believe it would be better to first fix them to end up
> >>> with FOLL_GET set and then error out if pages is != NULL but
> >>> nor FOLL_GET or FOLL_PIN is set.
> >>>
> >>
> >> I was perhaps overly cautious, and didn't go there. However, it's probably
> >> doable, given that there was already the following in __get_user_pages():
> >>
> >>     VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));
> >>
> >> ...which will have conditioned people and code to set FOLL_GET together with
> >> pages. So I agree that the time is right.
> >>
> >> In order to make bisecting future failures simpler, I can insert a patch right 
> >> before this one, that changes the FOLL_GET setting into an assert, like this:
> >>
> >> diff --git a/mm/gup.c b/mm/gup.c
> >> index 8f236a335ae9..be338961e80d 100644
> >> --- a/mm/gup.c
> >> +++ b/mm/gup.c
> >> @@ -1014,8 +1014,8 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
> >>                 BUG_ON(*locked != 1);
> >>         }
> >>  
> >> -       if (pages)
> >> -               flags |= FOLL_GET;
> >> +       if (pages && WARN_ON_ONCE(!(gup_flags & FOLL_GET)))
> >> +               return -EINVAL;
> >>  
> >>         pages_done = 0;
> >>         lock_dropped = false;
> >>
> >>
> >> ...and then add in FOLL_PIN, with this patch.
> > 
> > looks good but double check that it should not happens, i will try
> > to check on my side too.
> 
> Yes, I'll look.
> 
> ...
> >>>> +	 */
> >>>> +	gup_flags |= FOLL_REMOTE | FOLL_PIN;
> >>>
> >>> Wouldn't it be better to not add pin_longterm_pages_remote() until
> >>> it can be properly implemented ?
> >>>
> >>
> >> Well, the problem is that I need each call site that requires FOLL_PIN
> >> to use a proper wrapper. It's the FOLL_PIN that is the focus here, because
> >> there is a hard, bright rule, which is: if and only if a caller sets
> >> FOLL_PIN, then the dma-page tracking happens, and put_user_page() must
> >> be called.
> >>
> >> So this leaves me with only two reasonable choices:
> >>
> >> a) Convert the call site as above: pin_longterm_pages_remote(), which sets
> >> FOLL_PIN (the key point!), and leaves the FOLL_LONGTERM situation exactly
> >> as it has been so far. When the FOLL_LONGTERM situation is fixed, the call
> >> site *might* not need any changes to adopt the working gup.c code.
> >>
> >> b) Convert the call site to pin_user_pages_remote(), which also sets
> >> FOLL_PIN, and also leaves the FOLL_LONGTERM situation exactly as before.
> >> There would also be a comment at the call site, to the effect of, "this
> >> is the wrong call to make: it really requires FOLL_LONGTERM behavior".
> >>
> >> When the FOLL_LONGTERM situation is fixed, the call site will need to be
> >> changed to pin_longterm_pages_remote().
> >>
> >> So you can probably see why I picked (a).
> > 
> > But right now nobody has FOLL_LONGTERM and FOLL_REMOTE. So you should
> > never have the need for pin_longterm_pages_remote(). My fear is that
> > longterm has implication and it would be better to not drop this implication
> > by adding a wrapper that does not do what the name says.
> > 
> > So do not introduce pin_longterm_pages_remote() until its first user
> > happens. This is option c)
> > 
> 
> Almost forgot, though: there is already another user: Infiniband:
> 
> drivers/infiniband/core/umem_odp.c:646:         npages = pin_longterm_pages_remote(owning_process, owning_mm,

odp do not need that, i thought the HMM convertion was already upstream
but seems not, in any case odp do not need the longterm case it only
so best is to revert that user to gup_fast or something until it get
converted to HMM.

Cheers,
Jérôme
John Hubbard Nov. 4, 2019, 8:09 p.m. UTC | #6
Jason, a question for you at the bottom.

On 11/4/19 11:52 AM, Jerome Glisse wrote:
...
>> CASE 3: ODP
>> -----------
>> RDMA hardware with page faulting support. Here, a well-written driver doesn't
> 
> CASE3: Hardware with page fault support
> ---------------------------------------
> 
> Here, a well-written ....
> 

Ah, OK. So just drop the first sentence, yes.

...
>>>>>> +	 */
>>>>>> +	gup_flags |= FOLL_REMOTE | FOLL_PIN;
>>>>>
>>>>> Wouldn't it be better to not add pin_longterm_pages_remote() until
>>>>> it can be properly implemented ?
>>>>>
>>>>
>>>> Well, the problem is that I need each call site that requires FOLL_PIN
>>>> to use a proper wrapper. It's the FOLL_PIN that is the focus here, because
>>>> there is a hard, bright rule, which is: if and only if a caller sets
>>>> FOLL_PIN, then the dma-page tracking happens, and put_user_page() must
>>>> be called.
>>>>
>>>> So this leaves me with only two reasonable choices:
>>>>
>>>> a) Convert the call site as above: pin_longterm_pages_remote(), which sets
>>>> FOLL_PIN (the key point!), and leaves the FOLL_LONGTERM situation exactly
>>>> as it has been so far. When the FOLL_LONGTERM situation is fixed, the call
>>>> site *might* not need any changes to adopt the working gup.c code.
>>>>
>>>> b) Convert the call site to pin_user_pages_remote(), which also sets
>>>> FOLL_PIN, and also leaves the FOLL_LONGTERM situation exactly as before.
>>>> There would also be a comment at the call site, to the effect of, "this
>>>> is the wrong call to make: it really requires FOLL_LONGTERM behavior".
>>>>
>>>> When the FOLL_LONGTERM situation is fixed, the call site will need to be
>>>> changed to pin_longterm_pages_remote().
>>>>
>>>> So you can probably see why I picked (a).
>>>
>>> But right now nobody has FOLL_LONGTERM and FOLL_REMOTE. So you should
>>> never have the need for pin_longterm_pages_remote(). My fear is that
>>> longterm has implication and it would be better to not drop this implication
>>> by adding a wrapper that does not do what the name says.
>>>
>>> So do not introduce pin_longterm_pages_remote() until its first user
>>> happens. This is option c)
>>>
>>
>> Almost forgot, though: there is already another user: Infiniband:
>>
>> drivers/infiniband/core/umem_odp.c:646:         npages = pin_longterm_pages_remote(owning_process, owning_mm,
> 
> odp do not need that, i thought the HMM convertion was already upstream
> but seems not, in any case odp do not need the longterm case it only
> so best is to revert that user to gup_fast or something until it get
> converted to HMM.
> 

Note for Jason: the (a) or (b) items are talking about the vfio case, which is
one of the two call sites that now use pin_longterm_pages_remote(), and the
other one is infiniband:

drivers/infiniband/core/umem_odp.c:646:         npages = pin_longterm_pages_remote(owning_process, owning_mm,
drivers/vfio/vfio_iommu_type1.c:353:            ret = pin_longterm_pages_remote(NULL, mm, vaddr, 1,


Jerome, Jason: I really don't want to revert the put_page() to put_user_page() 
conversions that are already throughout the IB driver--pointless churn, right?
I'd rather either delete them in Jason's tree, or go with what I have here
while waiting for the deletion.

Maybe we should just settle on (a) or (b), so that the IB driver ends up with
the wrapper functions? In fact, if it's getting deleted, then I'd prefer leaving
it at (a), since that's simple...

Jason should weigh in on how he wants this to go, with respect to branching
and merging, since it sounds like that will conflict with the hmm branch 
(ha, I'm overdue in reviewing his mmu notifier series, that's what I get for
being late).

thanks,

John Hubbard
NVIDIA
Jason Gunthorpe Nov. 4, 2019, 8:31 p.m. UTC | #7
On Mon, Nov 04, 2019 at 12:09:05PM -0800, John Hubbard wrote:

> Note for Jason: the (a) or (b) items are talking about the vfio case, which is
> one of the two call sites that now use pin_longterm_pages_remote(), and the
> other one is infiniband:
> 
> drivers/infiniband/core/umem_odp.c:646:         npages = pin_longterm_pages_remote(owning_process, owning_mm,

This is a mistake, it is not a longterm pin and does not need FOLL_PIN
semantics

> Jason should weigh in on how he wants this to go, with respect to branching
> and merging, since it sounds like that will conflict with the hmm branch 

I think since you don't need to change this site things should be
fine?

Jason
Jerome Glisse Nov. 4, 2019, 8:31 p.m. UTC | #8
On Mon, Nov 04, 2019 at 12:09:05PM -0800, John Hubbard wrote:
> Jason, a question for you at the bottom.
> 
> On 11/4/19 11:52 AM, Jerome Glisse wrote:
> ...
> >> CASE 3: ODP
> >> -----------
> >> RDMA hardware with page faulting support. Here, a well-written driver doesn't
> > 
> > CASE3: Hardware with page fault support
> > ---------------------------------------
> > 
> > Here, a well-written ....
> > 
> 
> Ah, OK. So just drop the first sentence, yes.
> 
> ...
> >>>>>> +	 */
> >>>>>> +	gup_flags |= FOLL_REMOTE | FOLL_PIN;
> >>>>>
> >>>>> Wouldn't it be better to not add pin_longterm_pages_remote() until
> >>>>> it can be properly implemented ?
> >>>>>
> >>>>
> >>>> Well, the problem is that I need each call site that requires FOLL_PIN
> >>>> to use a proper wrapper. It's the FOLL_PIN that is the focus here, because
> >>>> there is a hard, bright rule, which is: if and only if a caller sets
> >>>> FOLL_PIN, then the dma-page tracking happens, and put_user_page() must
> >>>> be called.
> >>>>
> >>>> So this leaves me with only two reasonable choices:
> >>>>
> >>>> a) Convert the call site as above: pin_longterm_pages_remote(), which sets
> >>>> FOLL_PIN (the key point!), and leaves the FOLL_LONGTERM situation exactly
> >>>> as it has been so far. When the FOLL_LONGTERM situation is fixed, the call
> >>>> site *might* not need any changes to adopt the working gup.c code.
> >>>>
> >>>> b) Convert the call site to pin_user_pages_remote(), which also sets
> >>>> FOLL_PIN, and also leaves the FOLL_LONGTERM situation exactly as before.
> >>>> There would also be a comment at the call site, to the effect of, "this
> >>>> is the wrong call to make: it really requires FOLL_LONGTERM behavior".
> >>>>
> >>>> When the FOLL_LONGTERM situation is fixed, the call site will need to be
> >>>> changed to pin_longterm_pages_remote().
> >>>>
> >>>> So you can probably see why I picked (a).
> >>>
> >>> But right now nobody has FOLL_LONGTERM and FOLL_REMOTE. So you should
> >>> never have the need for pin_longterm_pages_remote(). My fear is that
> >>> longterm has implication and it would be better to not drop this implication
> >>> by adding a wrapper that does not do what the name says.
> >>>
> >>> So do not introduce pin_longterm_pages_remote() until its first user
> >>> happens. This is option c)
> >>>
> >>
> >> Almost forgot, though: there is already another user: Infiniband:
> >>
> >> drivers/infiniband/core/umem_odp.c:646:         npages = pin_longterm_pages_remote(owning_process, owning_mm,
> > 
> > odp do not need that, i thought the HMM convertion was already upstream
> > but seems not, in any case odp do not need the longterm case it only
> > so best is to revert that user to gup_fast or something until it get
> > converted to HMM.
> > 
> 
> Note for Jason: the (a) or (b) items are talking about the vfio case, which is
> one of the two call sites that now use pin_longterm_pages_remote(), and the
> other one is infiniband:
> 
> drivers/infiniband/core/umem_odp.c:646:         npages = pin_longterm_pages_remote(owning_process, owning_mm,
> drivers/vfio/vfio_iommu_type1.c:353:            ret = pin_longterm_pages_remote(NULL, mm, vaddr, 1,

vfio should be reverted until it can be properly implemented.
The issue is that when you fix the implementation you might
break vfio existing user and thus regress the kernel from user
point of view. So i rather have the change to vfio reverted,
i believe it was not well understood when it got upstream,
between in my 5.4 tree it is still gup_remote not longterm.


> Jerome, Jason: I really don't want to revert the put_page() to put_user_page() 
> conversions that are already throughout the IB driver--pointless churn, right?
> I'd rather either delete them in Jason's tree, or go with what I have here
> while waiting for the deletion.
> 
> Maybe we should just settle on (a) or (b), so that the IB driver ends up with
> the wrapper functions? In fact, if it's getting deleted, then I'd prefer leaving
> it at (a), since that's simple...
> 
> Jason should weigh in on how he wants this to go, with respect to branching
> and merging, since it sounds like that will conflict with the hmm branch 
> (ha, I'm overdue in reviewing his mmu notifier series, that's what I get for
> being late).
> 
> thanks,
> 
> John Hubbard
> NVIDIA
David Rientjes Nov. 4, 2019, 8:33 p.m. UTC | #9
On Sun, 3 Nov 2019, John Hubbard wrote:

> Introduce pin_user_pages*() variations of get_user_pages*() calls,
> and also pin_longterm_pages*() variations.
> 
> These variants all set FOLL_PIN, which is also introduced, and
> thoroughly documented.
> 
> The pin_longterm*() variants also set FOLL_LONGTERM, in addition
> to FOLL_PIN:
> 
>     pin_user_pages()
>     pin_user_pages_remote()
>     pin_user_pages_fast()
> 
>     pin_longterm_pages()
>     pin_longterm_pages_remote()
>     pin_longterm_pages_fast()
> 
> All pages that are pinned via the above calls, must be unpinned via
> put_user_page().
> 

Hi John,

I'm curious what consideration is given to what pageblock migrate types 
that FOLL_PIN and FOLL_LONGTERM pages originate from, assuming that 
longterm would want to originate from MIGRATE_UNMOVABLE pageblocks for the 
purposes of anti-fragmentation?
Jason Gunthorpe Nov. 4, 2019, 8:37 p.m. UTC | #10
On Mon, Nov 04, 2019 at 03:31:53PM -0500, Jerome Glisse wrote:
> > Note for Jason: the (a) or (b) items are talking about the vfio case, which is
> > one of the two call sites that now use pin_longterm_pages_remote(), and the
> > other one is infiniband:
> > 
> > drivers/infiniband/core/umem_odp.c:646:         npages = pin_longterm_pages_remote(owning_process, owning_mm,
> > drivers/vfio/vfio_iommu_type1.c:353:            ret = pin_longterm_pages_remote(NULL, mm, vaddr, 1,
> 
> vfio should be reverted until it can be properly implemented.
> The issue is that when you fix the implementation you might
> break vfio existing user and thus regress the kernel from user
> point of view. So i rather have the change to vfio reverted,
> i believe it was not well understood when it got upstream,
> between in my 5.4 tree it is still gup_remote not longterm.

It is clearly a bug, vfio must use LONGTERM, and does right above this
remote call:

        if (mm == current->mm) {
                ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
                                     vmas);
        } else {
                ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
                                            vmas, NULL);


I'm not even sure that it really makes any sense to build a 'if' like
that, surely just always call remote??

Jason
John Hubbard Nov. 4, 2019, 8:40 p.m. UTC | #11
On 11/4/19 12:31 PM, Jason Gunthorpe wrote:
> On Mon, Nov 04, 2019 at 12:09:05PM -0800, John Hubbard wrote:
> 
>> Note for Jason: the (a) or (b) items are talking about the vfio case, which is
>> one of the two call sites that now use pin_longterm_pages_remote(), and the
>> other one is infiniband:
>>
>> drivers/infiniband/core/umem_odp.c:646:         npages = pin_longterm_pages_remote(owning_process, owning_mm,
> 
> This is a mistake, it is not a longterm pin and does not need FOLL_PIN
> semantics

OK! So it really just wants to be get_user_pages_remote() / put_page()? I'll
change it back to that.

> 
>> Jason should weigh in on how he wants this to go, with respect to branching
>> and merging, since it sounds like that will conflict with the hmm branch 
> 
> I think since you don't need to change this site things should be
> fine?
> 

Right. 


thanks,

John Hubbard
NVIDIA
Jerome Glisse Nov. 4, 2019, 8:48 p.m. UTC | #12
On Mon, Nov 04, 2019 at 12:33:09PM -0800, David Rientjes wrote:
> 
> 
> On Sun, 3 Nov 2019, John Hubbard wrote:
> 
> > Introduce pin_user_pages*() variations of get_user_pages*() calls,
> > and also pin_longterm_pages*() variations.
> > 
> > These variants all set FOLL_PIN, which is also introduced, and
> > thoroughly documented.
> > 
> > The pin_longterm*() variants also set FOLL_LONGTERM, in addition
> > to FOLL_PIN:
> > 
> >     pin_user_pages()
> >     pin_user_pages_remote()
> >     pin_user_pages_fast()
> > 
> >     pin_longterm_pages()
> >     pin_longterm_pages_remote()
> >     pin_longterm_pages_fast()
> > 
> > All pages that are pinned via the above calls, must be unpinned via
> > put_user_page().
> > 
> 
> Hi John,
> 
> I'm curious what consideration is given to what pageblock migrate types 
> that FOLL_PIN and FOLL_LONGTERM pages originate from, assuming that 
> longterm would want to originate from MIGRATE_UNMOVABLE pageblocks for the 
> purposes of anti-fragmentation?

We do not control page block, GUP can happens on _any_ page that is
map inside a process (anonymous private vma or regular file back one).

Cheers,
Jérôme
John Hubbard Nov. 4, 2019, 8:57 p.m. UTC | #13
On 11/4/19 12:37 PM, Jason Gunthorpe wrote:
> On Mon, Nov 04, 2019 at 03:31:53PM -0500, Jerome Glisse wrote:
>>> Note for Jason: the (a) or (b) items are talking about the vfio case, which is
>>> one of the two call sites that now use pin_longterm_pages_remote(), and the
>>> other one is infiniband:
>>>
>>> drivers/infiniband/core/umem_odp.c:646:         npages = pin_longterm_pages_remote(owning_process, owning_mm,
>>> drivers/vfio/vfio_iommu_type1.c:353:            ret = pin_longterm_pages_remote(NULL, mm, vaddr, 1,
>>
>> vfio should be reverted until it can be properly implemented.
>> The issue is that when you fix the implementation you might
>> break vfio existing user and thus regress the kernel from user
>> point of view. So i rather have the change to vfio reverted,
>> i believe it was not well understood when it got upstream,
>> between in my 5.4 tree it is still gup_remote not longterm.
> 
> It is clearly a bug, vfio must use LONGTERM, and does right above this
> remote call:
> 
>         if (mm == current->mm) {
>                 ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
>                                      vmas);
>         } else {
>                 ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
>                                             vmas, NULL);
> 
> 
> I'm not even sure that it really makes any sense to build a 'if' like
> that, surely just always call remote??
> 


Right, and I thought about this when converting, and realized that the above 
code is working around the current gup.c limitations, which are "cannot support
gup remote with FOLL_LONGTERM".

Given that observation, the code is getting itself some FOLL_LONGTERM support
for the non-remote case, and only hitting the limitation if the mm really is
non-current.

And if you look at my patch, it keeps the same behavior, while adding in the
new wrapper calls.

So...thoughts, preferences?


thanks,

John Hubbard
NVIDIA
Jason Gunthorpe Nov. 4, 2019, 9:15 p.m. UTC | #14
On Mon, Nov 04, 2019 at 12:57:59PM -0800, John Hubbard wrote:
> On 11/4/19 12:37 PM, Jason Gunthorpe wrote:
> > On Mon, Nov 04, 2019 at 03:31:53PM -0500, Jerome Glisse wrote:
> >>> Note for Jason: the (a) or (b) items are talking about the vfio case, which is
> >>> one of the two call sites that now use pin_longterm_pages_remote(), and the
> >>> other one is infiniband:
> >>>
> >>> drivers/infiniband/core/umem_odp.c:646:         npages = pin_longterm_pages_remote(owning_process, owning_mm,
> >>> drivers/vfio/vfio_iommu_type1.c:353:            ret = pin_longterm_pages_remote(NULL, mm, vaddr, 1,
> >>
> >> vfio should be reverted until it can be properly implemented.
> >> The issue is that when you fix the implementation you might
> >> break vfio existing user and thus regress the kernel from user
> >> point of view. So i rather have the change to vfio reverted,
> >> i believe it was not well understood when it got upstream,
> >> between in my 5.4 tree it is still gup_remote not longterm.
> > 
> > It is clearly a bug, vfio must use LONGTERM, and does right above this
> > remote call:
> > 
> >         if (mm == current->mm) {
> >                 ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
> >                                      vmas);
> >         } else {
> >                 ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
> >                                             vmas, NULL);
> > 
> > 
> > I'm not even sure that it really makes any sense to build a 'if' like
> > that, surely just always call remote??
> > 
> 
> 
> Right, and I thought about this when converting, and realized that the above 
> code is working around the current gup.c limitations, which are "cannot support
> gup remote with FOLL_LONGTERM".

But AFAICT it doesn't have a problem, the protection test is just too
strict, and I guess the control flow needs a bit of fixing..

The issue is this:

static __always_inline long __get_user_pages_locked():
{
        if (locked) {
                /* if VM_FAULT_RETRY can be returned, vmas become invalid */
                BUG_ON(vmas);
                /* check caller initialized locked */
                BUG_ON(*locked != 1);
        }


so remote could be written as:

if (gup_flags & FOLL_LONGTERM) {
   if (WARN_ON_ONCE(locked))
        return -EINVAL;
   return __gup_longterm_locked(...)
}

return __get_user_pages_locked(...)

??

Jason
John Hubbard Nov. 4, 2019, 9:34 p.m. UTC | #15
On 11/4/19 1:15 PM, Jason Gunthorpe wrote:
...
>> Right, and I thought about this when converting, and realized that the above 
>> code is working around the current gup.c limitations, which are "cannot support
>> gup remote with FOLL_LONGTERM".
> 
> But AFAICT it doesn't have a problem, the protection test is just too
> strict, and I guess the control flow needs a bit of fixing..
> 
> The issue is this:
> 
> static __always_inline long __get_user_pages_locked():
> {
>         if (locked) {
>                 /* if VM_FAULT_RETRY can be returned, vmas become invalid */
>                 BUG_ON(vmas);
>                 /* check caller initialized locked */
>                 BUG_ON(*locked != 1);
>         }
> 
> 
> so remote could be written as:
> 
> if (gup_flags & FOLL_LONGTERM) {
>    if (WARN_ON_ONCE(locked))
>         return -EINVAL;
>    return __gup_longterm_locked(...)
> }
> 
> return __get_user_pages_locked(...)
> 
> ??

Yes, that loosens it up just enough for the vfio case (which doesn't set 
"locked") to get through, great! OK, I'll put that (the above plus 
corresponding vfio fix) in a separate patch first. 

This should clear things up nicely.


thanks,
Mike Rapoport Nov. 5, 2019, 1:10 p.m. UTC | #16
On Sun, Nov 03, 2019 at 01:18:00PM -0800, John Hubbard wrote:
> Introduce pin_user_pages*() variations of get_user_pages*() calls,
> and also pin_longterm_pages*() variations.
> 
> These variants all set FOLL_PIN, which is also introduced, and
> thoroughly documented.
> 
> The pin_longterm*() variants also set FOLL_LONGTERM, in addition
> to FOLL_PIN:
> 
>     pin_user_pages()
>     pin_user_pages_remote()
>     pin_user_pages_fast()
> 
>     pin_longterm_pages()
>     pin_longterm_pages_remote()
>     pin_longterm_pages_fast()
> 
> All pages that are pinned via the above calls, must be unpinned via
> put_user_page().
> 
> The underlying rules are:
> 
> * These are gup-internal flags, so the call sites should not directly
> set FOLL_PIN nor FOLL_LONGTERM. That behavior is enforced with
> assertions, for the new FOLL_PIN flag. However, for the pre-existing
> FOLL_LONGTERM flag, which has some call sites that still directly
> set FOLL_LONGTERM, there is no assertion yet.
> 
> * Call sites that want to indicate that they are going to do DirectIO
>   ("DIO") or something with similar characteristics, should call a
>   get_user_pages()-like wrapper call that sets FOLL_PIN. These wrappers
>   will:
>         * Start with "pin_user_pages" instead of "get_user_pages". That
>           makes it easy to find and audit the call sites.
>         * Set FOLL_PIN
> 
> * For pages that are received via FOLL_PIN, those pages must be returned
>   via put_user_page().
> 
> Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases
> in this documentation. (I've reworded it and expanded on it slightly.)
> 
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  Documentation/vm/index.rst          |   1 +
>  Documentation/vm/pin_user_pages.rst | 212 ++++++++++++++++++++++

I think it belongs to Documentation/core-api.

>  include/linux/mm.h                  |  62 ++++++-
>  mm/gup.c                            | 265 +++++++++++++++++++++++++---
>  4 files changed, 514 insertions(+), 26 deletions(-)
>  create mode 100644 Documentation/vm/pin_user_pages.rst
> 
> diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
> index e8d943b21cf9..7194efa3554a 100644
> --- a/Documentation/vm/index.rst
> +++ b/Documentation/vm/index.rst
> @@ -44,6 +44,7 @@ descriptions of data structures and algorithms.
>     page_migration
>     page_frags
>     page_owner
> +   pin_user_pages
>     remap_file_pages
>     slub
>     split_page_table_lock
> diff --git a/Documentation/vm/pin_user_pages.rst b/Documentation/vm/pin_user_pages.rst
> new file mode 100644
> index 000000000000..3910f49ca98c
> --- /dev/null
> +++ b/Documentation/vm/pin_user_pages.rst
> @@ -0,0 +1,212 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +====================================================
> +pin_user_pages() and related calls
> +====================================================

I know this is too much to ask, but having pin_user_pages() a part of more
general GUP description would be really great :)

> +
> +.. contents:: :local:
> +
> +Overview
> +========
> +
> +This document describes the following functions: ::
> +
> + pin_user_pages
> + pin_user_pages_fast
> + pin_user_pages_remote
> +
> + pin_longterm_pages
> + pin_longterm_pages_fast
> + pin_longterm_pages_remote
> +
> +Basic description of FOLL_PIN
> +=============================
> +
> +A new flag for get_user_pages ("gup") has been added: FOLL_PIN. FOLL_PIN has

Consider reading this after, say, half a year ;-)

> +significant interactions and interdependencies with FOLL_LONGTERM, so both are
> +covered here.
> +
> +Both FOLL_PIN and FOLL_LONGTERM are "internal" to gup, meaning that neither
> +FOLL_PIN nor FOLL_LONGTERM should not appear at the gup call sites. This allows
> +the associated wrapper functions  (pin_user_pages and others) to set the correct
> +combination of these flags, and to check for problems as well.
> +
> +FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
> +multiple threads and call sites are free to pin the same struct pages, via both
> +FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the
> +other, not the struct page(s).
> +
> +The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN
> +uses a different reference counting technique.
> +
> +FOLL_PIN is a prerequisite to FOLL_LONGTGERM. Another way of saying that is,
> +FOLL_LONGTERM is a specific case, more restrictive case of FOLL_PIN.
> +
> +Which flags are set by each wrapper
> +===================================
> +
> +Only FOLL_PIN and FOLL_LONGTERM are covered here. These flags are added to
> +whatever flags the caller provides::
> +
> + Function                    gup flags (FOLL_PIN or FOLL_LONGTERM only)
> + --------                    ------------------------------------------
> + pin_user_pages              FOLL_PIN
> + pin_user_pages_fast         FOLL_PIN
> + pin_user_pages_remote       FOLL_PIN
> +
> + pin_longterm_pages          FOLL_PIN | FOLL_LONGTERM
> + pin_longterm_pages_fast     FOLL_PIN | FOLL_LONGTERM
> + pin_longterm_pages_remote   FOLL_PIN | FOLL_LONGTERM
> +
> +Tracking dma-pinned pages
> +=========================
> +
> +Some of the key design constraints, and solutions, for tracking dma-pinned
> +pages:
> +
> +* An actual reference count, per struct page, is required. This is because
> +  multiple processes may pin and unpin a page.
> +
> +* False positives (reporting that a page is dma-pinned, when in fact it is not)
> +  are acceptable, but false negatives are not.
> +
> +* struct page may not be increased in size for this, and all fields are already
> +  used.
> +
> +* Given the above, we can overload the page->_refcount field by using, sort of,
> +  the upper bits in that field for a dma-pinned count. "Sort of", means that,
> +  rather than dividing page->_refcount into bit fields, we simple add a medium-
> +  large value (GUP_PIN_COUNTING_BIAS, initially chosen to be 1024: 10 bits) to
> +  page->_refcount. This provides fuzzy behavior: if a page has get_page() called
> +  on it 1024 times, then it will appear to have a single dma-pinned count.
> +  And again, that's acceptable.
> +
> +This also leads to limitations: there are only 31-10==21 bits available for a
> +counter that increments 10 bits at a time.
> +
> +TODO: for 1GB and larger huge pages, this is cutting it close. That's because
> +when pin_user_pages() follows such pages, it increments the head page by "1"
> +(where "1" used to mean "+1" for get_user_pages(), but now means "+1024" for
> +pin_user_pages()) for each tail page. So if you have a 1GB huge page:
> +
> +* There are 256K (18 bits) worth of 4 KB tail pages.
> +* There are 21 bits available to count up via GUP_PIN_COUNTING_BIAS (that is,
> +  10 bits at a time)
> +* There are 21 - 18 == 3 bits available to count. Except that there aren't,
> +  because you need to allow for a few normal get_page() calls on the head page,
> +  as well. Fortunately, the approach of using addition, rather than "hard"
> +  bitfields, within page->_refcount, allows for sharing these bits gracefully.
> +  But we're still looking at about 8 references.
> +
> +This, however, is a missing feature more than anything else, because it's easily
> +solved by addressing an obvious inefficiency in the original get_user_pages()
> +approach of retrieving pages: stop treating all the pages as if they were
> +PAGE_SIZE. Retrieve huge pages as huge pages. The callers need to be aware of
> +this, so some work is required. Once that's in place, this limitation mostly
> +disappears from view, because there will be ample refcounting range available.
> +
> +* Callers must specifically request "dma-pinned tracking of pages". In other
> +  words, just calling get_user_pages() will not suffice; a new set of functions,
> +  pin_user_page() and related, must be used.
> +
> +FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags
> +==========================================================
> +
> +Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for describing
> +these categories:
> +
> +CASE 1: Direct IO (DIO)
> +-----------------------
> +There are GUP references to pages that are serving
> +as DIO buffers. These buffers are needed for a relatively short time (so they
> +are not "long term"). No special synchronization with page_mkclean() or
> +munmap() is provided. Therefore, flags to set at the call site are: ::
> +
> +    FOLL_PIN
> +
> +...but rather than setting FOLL_PIN directly, call sites should use one of
> +the pin_user_pages*() routines that set FOLL_PIN.
> +
> +CASE 2: RDMA
> +------------
> +There are GUP references to pages that are serving as DMA
> +buffers. These buffers are needed for a long time ("long term"). No special
> +synchronization with page_mkclean() or munmap() is provided. Therefore, flags
> +to set at the call site are: ::
> +
> +    FOLL_PIN | FOLL_LONGTERM
> +
> +NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's
> +because DAX pages do not have a separate page cache, and so "pinning" implies
> +locking down file system blocks, which is not (yet) supported in that way.
> +
> +CASE 3: ODP
> +-----------
> +(Mellanox/Infiniband On Demand Paging: the hardware supports
> +replayable page faulting). There are GUP references to pages serving as DMA
> +buffers. For ODP, MMU notifiers are used to synchronize with page_mkclean()
> +and munmap(). Therefore, normal GUP calls are sufficient, so neither flag
> +needs to be set.
> +
> +CASE 4: Pinning for struct page manipulation only
> +-------------------------------------------------
> +Here, normal GUP calls are sufficient, so neither flag needs to be set.
> +
> +page_dma_pinned(): the whole point of pinning
> +=============================================
> +
> +The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able
> +to query, "is this page DMA-pinned?" That allows code such as page_mkclean()
> +(and file system writeback code in general) to make informed decisions about
> +what to do when a page cannot be unmapped due to such pins.
> +
> +What to do in those cases is the subject of a years-long series of discussions
> +and debates (see the References at the end of this document). It's a TODO item
> +here: fill in the details once that's worked out. Meanwhile, it's safe to say
> +that having this available: ::
> +
> +        static inline bool page_dma_pinned(struct page *page)
> +
> +...is a prerequisite to solving the long-running gup+DMA problem.
> +
> +Another way of thinking about FOLL_GET, FOLL_PIN, and FOLL_LONGTERM
> +===================================================================
> +
> +Another way of thinking about these flags is as a progression of restrictions:
> +FOLL_GET is for struct page manipulation, without affecting the data that the
> +struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for
> +short term pins on pages whose data *will* get accessed. As such, FOLL_PIN is
> +a "more severe" form of pinning. And finally, FOLL_LONGTERM is an even more
> +restrictive case that has FOLL_PIN as a prerequisite: this is for pages that
> +will be pinned longterm, and whose data will be accessed.
> +
> +Unit testing
> +============
> +This file::
> +
> + tools/testing/selftests/vm/gup_benchmark.c
> +
> +has the following new calls to exercise the new pin*() wrapper functions:
> +
> +* PIN_FAST_BENCHMARK (./gup_benchmark -a)
> +* PIN_LONGTERM_BENCHMARK (./gup_benchmark -a)
> +* PIN_BENCHMARK (./gup_benchmark -a)
> +
> +You can monitor how many total dma-pinned pages have been acquired and released
> +since the system was booted, via two new /proc/vmstat entries: ::
> +
> +    /proc/vmstat/nr_foll_pin_requested
> +    /proc/vmstat/nr_foll_pin_requested
> +
> +Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is
> +because there is a noticeable performance drop in put_user_page(), when they
> +are activated.
> +
> +References
> +==========
> +
> +* `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_
> +* `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_
> +* `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_
> +
> +John Hubbard, October, 2019
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index cc292273e6ba..cdfb6fedb271 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1526,9 +1526,23 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
>  			    unsigned long start, unsigned long nr_pages,
>  			    unsigned int gup_flags, struct page **pages,
>  			    struct vm_area_struct **vmas, int *locked);
> +long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
> +			   unsigned long start, unsigned long nr_pages,
> +			   unsigned int gup_flags, struct page **pages,
> +			   struct vm_area_struct **vmas, int *locked);
> +long pin_longterm_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
> +			       unsigned long start, unsigned long nr_pages,
> +			       unsigned int gup_flags, struct page **pages,
> +			       struct vm_area_struct **vmas, int *locked);
>  long get_user_pages(unsigned long start, unsigned long nr_pages,
>  			    unsigned int gup_flags, struct page **pages,
>  			    struct vm_area_struct **vmas);
> +long pin_user_pages(unsigned long start, unsigned long nr_pages,
> +		    unsigned int gup_flags, struct page **pages,
> +		    struct vm_area_struct **vmas);
> +long pin_longterm_pages(unsigned long start, unsigned long nr_pages,
> +			unsigned int gup_flags, struct page **pages,
> +			struct vm_area_struct **vmas);
>  long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
>  		    unsigned int gup_flags, struct page **pages, int *locked);
>  long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
> @@ -1536,6 +1550,10 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
>  
>  int get_user_pages_fast(unsigned long start, int nr_pages,
>  			unsigned int gup_flags, struct page **pages);
> +int pin_user_pages_fast(unsigned long start, int nr_pages,
> +			unsigned int gup_flags, struct page **pages);
> +int pin_longterm_pages_fast(unsigned long start, int nr_pages,
> +			    unsigned int gup_flags, struct page **pages);
>  
>  int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
>  int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
> @@ -2594,13 +2612,15 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
>  #define FOLL_ANON	0x8000	/* don't do file mappings */
>  #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
>  #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
> +#define FOLL_PIN	0x40000	/* pages must be released via put_user_page() */
>  
>  /*
> - * NOTE on FOLL_LONGTERM:
> + * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
> + * other. Here is what they mean, and how to use them:
>   *
>   * FOLL_LONGTERM indicates that the page will be held for an indefinite time
> - * period _often_ under userspace control.  This is contrasted with
> - * iov_iter_get_pages() where usages which are transient.
> + * period _often_ under userspace control.  This is in contrast to
> + * iov_iter_get_pages(), where usages which are transient.
>   *
>   * FIXME: For pages which are part of a filesystem, mappings are subject to the
>   * lifetime enforced by the filesystem and we need guarantees that longterm
> @@ -2615,11 +2635,41 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
>   * Currently only get_user_pages() and get_user_pages_fast() support this flag
>   * and calls to get_user_pages_[un]locked are specifically not allowed.  This
>   * is due to an incompatibility with the FS DAX check and
> - * FAULT_FLAG_ALLOW_RETRY
> + * FAULT_FLAG_ALLOW_RETRY.
>   *
> - * In the CMA case: longterm pins in a CMA region would unnecessarily fragment
> - * that region.  And so CMA attempts to migrate the page before pinning when
> + * In the CMA case: long term pins in a CMA region would unnecessarily fragment
> + * that region.  And so, CMA attempts to migrate the page before pinning, when
>   * FOLL_LONGTERM is specified.
> + *
> + * FOLL_PIN indicates that a special kind of tracking (not just page->_refcount,
> + * but an additional pin counting system) will be invoked. This is intended for
> + * anything that gets a page reference and then touches page data (for example,
> + * Direct IO). This lets the filesystem know that some non-file-system entity is
> + * potentially changing the pages' data. In contrast to FOLL_GET (whose pages
> + * are released via put_page()), FOLL_PIN pages must be released, ultimately, by
> + * a call to put_user_page().
> + *
> + * FOLL_PIN is similar to FOLL_GET: both of these pin pages. They use different
> + * and separate refcounting mechanisms, however, and that means that each has
> + * its own acquire and release mechanisms:
> + *
> + *     FOLL_GET: get_user_pages*() to acquire, and put_page() to release.
> + *
> + *     FOLL_PIN: pin_user_pages*() or pin_longterm_pages*() to acquire, and
> + *               put_user_pages to release.
> + *
> + * FOLL_PIN and FOLL_GET are mutually exclusive for a given function call.
> + * (The underlying pages may experience both FOLL_GET-based and FOLL_PIN-based
> + * calls applied to them, and that's perfectly OK. This is a constraint on the
> + * callers, not on the pages.)
> + *
> + * FOLL_PIN and FOLL_LONGTERM should be set internally by the pin_user_page*()
> + * and pin_longterm_*() APIs, never directly by the caller. That's in order to
> + * help avoid mismatches when releasing pages: get_user_pages*() pages must be
> + * released via put_page(), while pin_user_pages*() pages must be released via
> + * put_user_page().
> + *
> + * Please see Documentation/vm/pin_user_pages.rst for more information.
>   */
>  
>  static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
> diff --git a/mm/gup.c b/mm/gup.c
> index 199da99e8ffc..1aea48427879 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -179,6 +179,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>  	spinlock_t *ptl;
>  	pte_t *ptep, pte;
>  
> +	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
> +	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
> +			 (FOLL_PIN | FOLL_GET)))
> +		return ERR_PTR(-EINVAL);
>  retry:
>  	if (unlikely(pmd_bad(*pmd)))
>  		return no_page_table(vma, flags);
> @@ -790,7 +794,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  
>  	start = untagged_addr(start);
>  
> -	VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));
> +	VM_BUG_ON(!!pages != !!(gup_flags & (FOLL_GET | FOLL_PIN)));
>  
>  	/*
>  	 * If FOLL_FORCE is set then do not force a full fault as the hinting
> @@ -1014,7 +1018,16 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>  		BUG_ON(*locked != 1);
>  	}
>  
> -	if (pages)
> +	/*
> +	 * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
> +	 * is to set FOLL_GET if the caller wants pages[] filled in (but has
> +	 * carelessly failed to specify FOLL_GET), so keep doing that, but only
> +	 * for FOLL_GET, not for the newer FOLL_PIN.
> +	 *
> +	 * FOLL_PIN always expects pages to be non-null, but no need to assert
> +	 * that here, as any failures will be obvious enough.
> +	 */
> +	if (pages && !(flags & FOLL_PIN))
>  		flags |= FOLL_GET;
>  
>  	pages_done = 0;
> @@ -1151,6 +1164,14 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
>  		unsigned int gup_flags, struct page **pages,
>  		struct vm_area_struct **vmas, int *locked)
>  {
> +	/*
> +	 * FOLL_PIN must only be set internally by the pin_user_page*() and
> +	 * pin_longterm_*() APIs, never directly by the caller, so enforce that
> +	 * with an assertion:
> +	 */
> +	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
> +		return -EINVAL;
> +
>  	/*
>  	 * FIXME: Current FOLL_LONGTERM behavior is incompatible with
>  	 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
> @@ -1608,6 +1629,14 @@ long get_user_pages(unsigned long start, unsigned long nr_pages,
>  		unsigned int gup_flags, struct page **pages,
>  		struct vm_area_struct **vmas)
>  {
> +	/*
> +	 * FOLL_PIN must only be set internally by the pin_user_page*() and
> +	 * pin_longterm_*() APIs, never directly by the caller, so enforce that
> +	 * with an assertion:
> +	 */
> +	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
> +		return -EINVAL;
> +
>  	return __gup_longterm_locked(current, current->mm, start, nr_pages,
>  				     pages, vmas, gup_flags | FOLL_TOUCH);
>  }
> @@ -2373,24 +2402,9 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
>  	return ret;
>  }
>  
> -/**
> - * get_user_pages_fast() - pin user pages in memory
> - * @start:	starting user address
> - * @nr_pages:	number of pages from start to pin
> - * @gup_flags:	flags modifying pin behaviour
> - * @pages:	array that receives pointers to the pages pinned.
> - *		Should be at least nr_pages long.
> - *
> - * Attempt to pin user pages in memory without taking mm->mmap_sem.
> - * If not successful, it will fall back to taking the lock and
> - * calling get_user_pages().
> - *
> - * Returns number of pages pinned. This may be fewer than the number
> - * requested. If nr_pages is 0 or negative, returns 0. If no pages
> - * were pinned, returns -errno.
> - */
> -int get_user_pages_fast(unsigned long start, int nr_pages,
> -			unsigned int gup_flags, struct page **pages)
> +static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
> +					unsigned int gup_flags,
> +					struct page **pages)
>  {
>  	unsigned long addr, len, end;
>  	int nr = 0, ret = 0;
> @@ -2435,4 +2449,215 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
>  
>  	return ret;
>  }
> +
> +/**
> + * get_user_pages_fast() - pin user pages in memory
> + * @start:	starting user address
> + * @nr_pages:	number of pages from start to pin
> + * @gup_flags:	flags modifying pin behaviour
> + * @pages:	array that receives pointers to the pages pinned.
> + *		Should be at least nr_pages long.
> + *
> + * Attempt to pin user pages in memory without taking mm->mmap_sem.
> + * If not successful, it will fall back to taking the lock and
> + * calling get_user_pages().
> + *
> + * Returns number of pages pinned. This may be fewer than the number requested.
> + * If nr_pages is 0 or negative, returns 0. If no pages were pinned, returns
> + * -errno.
> + */
> +int get_user_pages_fast(unsigned long start, int nr_pages,
> +			unsigned int gup_flags, struct page **pages)
> +{
> +	/*
> +	 * FOLL_PIN must only be set internally by the pin_user_page*() and
> +	 * pin_longterm_*() APIs, never directly by the caller, so enforce that:
> +	 */
> +	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
> +		return -EINVAL;
> +
> +	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
> +}
>  EXPORT_SYMBOL_GPL(get_user_pages_fast);
> +
> +/**
> + * pin_user_pages_fast() - pin user pages in memory without taking locks
> + *
> + * Nearly the same as get_user_pages_fast(), except that FOLL_PIN is set. See
> + * get_user_pages_fast() for documentation on the function arguments, because
> + * the arguments here are identical.
> + *
> + * FOLL_PIN means that the pages must be released via put_user_page(). Please
> + * see Documentation/vm/pin_user_pages.rst for further details.
> + *
> + * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
> + * is NOT intended for Case 2 (RDMA: long-term pins).
> + */
> +int pin_user_pages_fast(unsigned long start, int nr_pages,
> +			unsigned int gup_flags, struct page **pages)
> +{
> +	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
> +	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> +		return -EINVAL;
> +
> +	gup_flags |= FOLL_PIN;
> +	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
> +}
> +EXPORT_SYMBOL_GPL(pin_user_pages_fast);
> +
> +/**
> + * pin_longterm_pages_fast() - pin user pages in memory without taking locks
> + *
> + * Nearly the same as get_user_pages_fast(), except that FOLL_PIN and
> + * FOLL_LONGTERM are set. See get_user_pages_fast() for documentation on the
> + * function arguments, because the arguments here are identical.
> + *
> + * FOLL_PIN means that the pages must be released via put_user_page(). Please
> + * see Documentation/vm/pin_user_pages.rst for further details.
> + *
> + * FOLL_LONGTERM means that the pages are being pinned for "long term" use,
> + * typically by a non-CPU device, and we cannot be sure that waiting for a
> + * pinned page to become unpin will be effective.
> + *
> + * This is intended for Case 2 (RDMA: long-term pins) of the FOLL_PIN
> + * documentation.
> + */
> +int pin_longterm_pages_fast(unsigned long start, int nr_pages,
> +			    unsigned int gup_flags, struct page **pages)
> +{
> +	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
> +	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> +		return -EINVAL;
> +
> +	gup_flags |= (FOLL_PIN | FOLL_LONGTERM);
> +	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
> +}
> +EXPORT_SYMBOL_GPL(pin_longterm_pages_fast);
> +
> +/**
> + * pin_user_pages_remote() - pin pages for (typically) use by Direct IO, and
> + * return the pages to the user.
> + *
> + * Nearly the same as get_user_pages_remote(), except that FOLL_PIN is set. See
> + * get_user_pages_remote() for documentation on the function arguments, because
> + * the arguments here are identical.
> + *
> + * FOLL_PIN means that the pages must be released via put_user_page(). Please
> + * see Documentation/vm/pin_user_pages.rst for details.
> + *
> + * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
> + * is NOT intended for Case 2 (RDMA: long-term pins).
> + */
> +long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
> +			   unsigned long start, unsigned long nr_pages,
> +			   unsigned int gup_flags, struct page **pages,
> +			   struct vm_area_struct **vmas, int *locked)
> +{
> +	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
> +	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> +		return -EINVAL;
> +
> +	gup_flags |= FOLL_TOUCH | FOLL_REMOTE | FOLL_PIN;
> +
> +	return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
> +				       locked, gup_flags);
> +}
> +EXPORT_SYMBOL(pin_user_pages_remote);
> +
> +/**
> + * pin_longterm_pages_remote() - pin pages for (typically) use by Direct IO, and
> + * return the pages to the user.
> + *
> + * Nearly the same as get_user_pages_remote(), but note that FOLL_TOUCH is not
> + * set, and FOLL_PIN and FOLL_LONGTERM are set. See get_user_pages_remote() for
> + * documentation on the function arguments, because the arguments here are
> + * identical.
> + *
> + * FOLL_PIN means that the pages must be released via put_user_page(). Please
> + * see Documentation/vm/pin_user_pages.rst for further details.
> + *
> + * FOLL_LONGTERM means that the pages are being pinned for "long term" use,
> + * typically by a non-CPU device, and we cannot be sure that waiting for a
> + * pinned page to become unpin will be effective.
> + *
> + * This is intended for Case 2 (RDMA: long-term pins) in
> + * Documentation/vm/pin_user_pages.rst.
> + */
> +long pin_longterm_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
> +			       unsigned long start, unsigned long nr_pages,
> +			       unsigned int gup_flags, struct page **pages,
> +			       struct vm_area_struct **vmas, int *locked)
> +{
> +	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
> +	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> +		return -EINVAL;
> +
> +	/*
> +	 * FIXME: as noted in the get_user_pages_remote() implementation, it
> +	 * is not yet possible to safely set FOLL_LONGTERM here. FOLL_LONGTERM
> +	 * needs to be set, but for now the best we can do is a "TODO" item.
> +	 */
> +	gup_flags |= FOLL_REMOTE | FOLL_PIN;
> +
> +	return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
> +				       locked, gup_flags);
> +}
> +EXPORT_SYMBOL(pin_longterm_pages_remote);
> +
> +/**
> + * pin_user_pages() - pin user pages in memory for use by other devices
> + *
> + * Nearly the same as get_user_pages(), except that FOLL_TOUCH is not set, and
> + * FOLL_PIN is set.
> + *
> + * FOLL_PIN means that the pages must be released via put_user_page(). Please
> + * see Documentation/vm/pin_user_pages.rst for details.
> + *
> + * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
> + * is NOT intended for Case 2 (RDMA: long-term pins).
> + */
> +long pin_user_pages(unsigned long start, unsigned long nr_pages,
> +		    unsigned int gup_flags, struct page **pages,
> +		    struct vm_area_struct **vmas)
> +{
> +	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
> +	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> +		return -EINVAL;
> +
> +	gup_flags |= FOLL_PIN;
> +	return __gup_longterm_locked(current, current->mm, start, nr_pages,
> +				     pages, vmas, gup_flags);
> +}
> +EXPORT_SYMBOL(pin_user_pages);
> +
> +/**
> + * pin_longterm_pages() - pin user pages in memory for long-term use (RDMA,
> + * typically)
> + *
> + * Nearly the same as get_user_pages(), except that FOLL_PIN and FOLL_LONGTERM
> + * are set. See get_user_pages_fast() for documentation on the function
> + * arguments, because the arguments here are identical.
> + *
> + * FOLL_PIN means that the pages must be released via put_user_page(). Please
> + * see Documentation/vm/pin_user_pages.rst for further details.
> + *
> + * FOLL_LONGTERM means that the pages are being pinned for "long term" use,
> + * typically by a non-CPU device, and we cannot be sure that waiting for a
> + * pinned page to become unpin will be effective.
> + *
> + * This is intended for Case 2 (RDMA: long-term pins) in
> + * Documentation/vm/pin_user_pages.rst.
> + */
> +long pin_longterm_pages(unsigned long start, unsigned long nr_pages,
> +			unsigned int gup_flags, struct page **pages,
> +			struct vm_area_struct **vmas)
> +{
> +	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
> +	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> +		return -EINVAL;
> +
> +	gup_flags |= FOLL_PIN | FOLL_LONGTERM;
> +	return __gup_longterm_locked(current, current->mm, start, nr_pages,
> +				     pages, vmas, gup_flags);
> +}
> +EXPORT_SYMBOL(pin_longterm_pages);
> -- 
> 2.23.0
> 
>
John Hubbard Nov. 5, 2019, 7 p.m. UTC | #17
On 11/5/19 5:10 AM, Mike Rapoport wrote:
...
>> ---
>>  Documentation/vm/index.rst          |   1 +
>>  Documentation/vm/pin_user_pages.rst | 212 ++++++++++++++++++++++
> 
> I think it belongs to Documentation/core-api.

Done:

diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index ab0eae1c153a..413f7d7c8642 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -31,6 +31,7 @@ Core utilities
    generic-radix-tree
    memory-allocation
    mm-api
+   pin_user_pages
    gfp_mask-from-fs-io
    timekeeping
    boot-time-mm


...
>> diff --git a/Documentation/vm/pin_user_pages.rst b/Documentation/vm/pin_user_pages.rst
>> new file mode 100644
>> index 000000000000..3910f49ca98c
>> --- /dev/null
>> +++ b/Documentation/vm/pin_user_pages.rst
>> @@ -0,0 +1,212 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +====================================================
>> +pin_user_pages() and related calls
>> +====================================================
> 
> I know this is too much to ask, but having pin_user_pages() a part of more
> general GUP description would be really great :)
> 

Yes, definitely. But until I saw the reaction to the pin_user_pages() API
family, I didn't want to write too much--it could have all been tossed out
in favor of a whole different API. But now that we've had some initial
reviews, I'm much more confident in being able to write about the larger 
API set.

So yes, I'll put that on my pending list.


...
>> +This document describes the following functions: ::
>> +
>> + pin_user_pages
>> + pin_user_pages_fast
>> + pin_user_pages_remote
>> +
>> + pin_longterm_pages
>> + pin_longterm_pages_fast
>> + pin_longterm_pages_remote
>> +
>> +Basic description of FOLL_PIN
>> +=============================
>> +
>> +A new flag for get_user_pages ("gup") has been added: FOLL_PIN. FOLL_PIN has
> 
> Consider reading this after, say, half a year ;-)
> 

OK, OK. I knew when I wrote that that it was not going to stay new forever, but
somehow failed to write the right thing anyway. :) 

Here's a revised set of paragraphs:

Basic description of FOLL_PIN
=============================

FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the get_user_pages*()
("gup") family of functions. FOLL_PIN has significant interactions and
interdependencies with FOLL_LONGTERM, so both are covered here.

Both FOLL_PIN and FOLL_LONGTERM are internal to gup, meaning that neither
FOLL_PIN nor FOLL_LONGTERM should not appear at the gup call sites. This allows
the associated wrapper functions  (pin_user_pages() and others) to set the
correct combination of these flags, and to check for problems as well.


thanks,

John Hubbard
NVIDIA
Ira Weiny Nov. 7, 2019, 2:25 a.m. UTC | #18
> 
> 
> ...
> >> +This document describes the following functions: ::
> >> +
> >> + pin_user_pages
> >> + pin_user_pages_fast
> >> + pin_user_pages_remote
> >> +
> >> + pin_longterm_pages
> >> + pin_longterm_pages_fast
> >> + pin_longterm_pages_remote
> >> +
> >> +Basic description of FOLL_PIN
> >> +=============================
> >> +
> >> +A new flag for get_user_pages ("gup") has been added: FOLL_PIN. FOLL_PIN has
> > 
> > Consider reading this after, say, half a year ;-)
> > 
> 
> OK, OK. I knew when I wrote that that it was not going to stay new forever, but
> somehow failed to write the right thing anyway. :) 
> 
> Here's a revised set of paragraphs:
> 
> Basic description of FOLL_PIN
> =============================
> 
> FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the get_user_pages*()
> ("gup") family of functions. FOLL_PIN has significant interactions and
> interdependencies with FOLL_LONGTERM, so both are covered here.
> 
> Both FOLL_PIN and FOLL_LONGTERM are internal to gup, meaning that neither
> FOLL_PIN nor FOLL_LONGTERM should not appear at the gup call sites. This allows
> the associated wrapper functions  (pin_user_pages() and others) to set the
> correct combination of these flags, and to check for problems as well.

I like this revision as well.

Ira
Mike Rapoport Nov. 7, 2019, 8:07 a.m. UTC | #19
On Tue, Nov 05, 2019 at 11:00:06AM -0800, John Hubbard wrote:
> On 11/5/19 5:10 AM, Mike Rapoport wrote:
> ...
> >> ---
> >>  Documentation/vm/index.rst          |   1 +
> >>  Documentation/vm/pin_user_pages.rst | 212 ++++++++++++++++++++++
> > 
> > I think it belongs to Documentation/core-api.
> 
> Done:
> 
> diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
> index ab0eae1c153a..413f7d7c8642 100644
> --- a/Documentation/core-api/index.rst
> +++ b/Documentation/core-api/index.rst
> @@ -31,6 +31,7 @@ Core utilities
>     generic-radix-tree
>     memory-allocation
>     mm-api
> +   pin_user_pages
>     gfp_mask-from-fs-io
>     timekeeping
>     boot-time-mm

Thanks!
 
> ...
> >> diff --git a/Documentation/vm/pin_user_pages.rst b/Documentation/vm/pin_user_pages.rst
> >> new file mode 100644
> >> index 000000000000..3910f49ca98c
> >> --- /dev/null
> >> +++ b/Documentation/vm/pin_user_pages.rst
> >> @@ -0,0 +1,212 @@
> >> +.. SPDX-License-Identifier: GPL-2.0
> >> +
> >> +====================================================
> >> +pin_user_pages() and related calls
> >> +====================================================
> > 
> > I know this is too much to ask, but having pin_user_pages() a part of more
> > general GUP description would be really great :)
> > 
> 
> Yes, definitely. But until I saw the reaction to the pin_user_pages() API
> family, I didn't want to write too much--it could have all been tossed out
> in favor of a whole different API. But now that we've had some initial
> reviews, I'm much more confident in being able to write about the larger 
> API set.
> 
> So yes, I'll put that on my pending list.
> 
> 
> ...
> >> +This document describes the following functions: ::
> >> +
> >> + pin_user_pages
> >> + pin_user_pages_fast
> >> + pin_user_pages_remote
> >> +
> >> + pin_longterm_pages
> >> + pin_longterm_pages_fast
> >> + pin_longterm_pages_remote
> >> +
> >> +Basic description of FOLL_PIN
> >> +=============================
> >> +
> >> +A new flag for get_user_pages ("gup") has been added: FOLL_PIN. FOLL_PIN has
> > 
> > Consider reading this after, say, half a year ;-)
> > 
> 
> OK, OK. I knew when I wrote that that it was not going to stay new forever, but
> somehow failed to write the right thing anyway. :) 
> 
> Here's a revised set of paragraphs:
> 
> Basic description of FOLL_PIN
> =============================
> 
> FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the get_user_pages*()
> ("gup") family of functions. FOLL_PIN has significant interactions and
> interdependencies with FOLL_LONGTERM, so both are covered here.
> 
> Both FOLL_PIN and FOLL_LONGTERM are internal to gup, meaning that neither
> FOLL_PIN nor FOLL_LONGTERM should not appear at the gup call sites. This allows
> the associated wrapper functions  (pin_user_pages() and others) to set the
> correct combination of these flags, and to check for problems as well.

Great, thanks! 
 
> thanks,
> 
> John Hubbard
> NVIDIA
diff mbox series

Patch

diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index e8d943b21cf9..7194efa3554a 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -44,6 +44,7 @@  descriptions of data structures and algorithms.
    page_migration
    page_frags
    page_owner
+   pin_user_pages
    remap_file_pages
    slub
    split_page_table_lock
diff --git a/Documentation/vm/pin_user_pages.rst b/Documentation/vm/pin_user_pages.rst
new file mode 100644
index 000000000000..3910f49ca98c
--- /dev/null
+++ b/Documentation/vm/pin_user_pages.rst
@@ -0,0 +1,212 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+====================================================
+pin_user_pages() and related calls
+====================================================
+
+.. contents:: :local:
+
+Overview
+========
+
+This document describes the following functions: ::
+
+ pin_user_pages
+ pin_user_pages_fast
+ pin_user_pages_remote
+
+ pin_longterm_pages
+ pin_longterm_pages_fast
+ pin_longterm_pages_remote
+
+Basic description of FOLL_PIN
+=============================
+
+A new flag for get_user_pages ("gup") has been added: FOLL_PIN. FOLL_PIN has
+significant interactions and interdependencies with FOLL_LONGTERM, so both are
+covered here.
+
+Both FOLL_PIN and FOLL_LONGTERM are "internal" to gup, meaning that neither
+FOLL_PIN nor FOLL_LONGTERM should not appear at the gup call sites. This allows
+the associated wrapper functions  (pin_user_pages and others) to set the correct
+combination of these flags, and to check for problems as well.
+
+FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
+multiple threads and call sites are free to pin the same struct pages, via both
+FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the
+other, not the struct page(s).
+
+The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN
+uses a different reference counting technique.
+
+FOLL_PIN is a prerequisite to FOLL_LONGTGERM. Another way of saying that is,
+FOLL_LONGTERM is a specific case, more restrictive case of FOLL_PIN.
+
+Which flags are set by each wrapper
+===================================
+
+Only FOLL_PIN and FOLL_LONGTERM are covered here. These flags are added to
+whatever flags the caller provides::
+
+ Function                    gup flags (FOLL_PIN or FOLL_LONGTERM only)
+ --------                    ------------------------------------------
+ pin_user_pages              FOLL_PIN
+ pin_user_pages_fast         FOLL_PIN
+ pin_user_pages_remote       FOLL_PIN
+
+ pin_longterm_pages          FOLL_PIN | FOLL_LONGTERM
+ pin_longterm_pages_fast     FOLL_PIN | FOLL_LONGTERM
+ pin_longterm_pages_remote   FOLL_PIN | FOLL_LONGTERM
+
+Tracking dma-pinned pages
+=========================
+
+Some of the key design constraints, and solutions, for tracking dma-pinned
+pages:
+
+* An actual reference count, per struct page, is required. This is because
+  multiple processes may pin and unpin a page.
+
+* False positives (reporting that a page is dma-pinned, when in fact it is not)
+  are acceptable, but false negatives are not.
+
+* struct page may not be increased in size for this, and all fields are already
+  used.
+
+* Given the above, we can overload the page->_refcount field by using, sort of,
+  the upper bits in that field for a dma-pinned count. "Sort of", means that,
+  rather than dividing page->_refcount into bit fields, we simple add a medium-
+  large value (GUP_PIN_COUNTING_BIAS, initially chosen to be 1024: 10 bits) to
+  page->_refcount. This provides fuzzy behavior: if a page has get_page() called
+  on it 1024 times, then it will appear to have a single dma-pinned count.
+  And again, that's acceptable.
+
+This also leads to limitations: there are only 31-10==21 bits available for a
+counter that increments 10 bits at a time.
+
+TODO: for 1GB and larger huge pages, this is cutting it close. That's because
+when pin_user_pages() follows such pages, it increments the head page by "1"
+(where "1" used to mean "+1" for get_user_pages(), but now means "+1024" for
+pin_user_pages()) for each tail page. So if you have a 1GB huge page:
+
+* There are 256K (18 bits) worth of 4 KB tail pages.
+* There are 21 bits available to count up via GUP_PIN_COUNTING_BIAS (that is,
+  10 bits at a time)
+* There are 21 - 18 == 3 bits available to count. Except that there aren't,
+  because you need to allow for a few normal get_page() calls on the head page,
+  as well. Fortunately, the approach of using addition, rather than "hard"
+  bitfields, within page->_refcount, allows for sharing these bits gracefully.
+  But we're still looking at about 8 references.
+
+This, however, is a missing feature more than anything else, because it's easily
+solved by addressing an obvious inefficiency in the original get_user_pages()
+approach of retrieving pages: stop treating all the pages as if they were
+PAGE_SIZE. Retrieve huge pages as huge pages. The callers need to be aware of
+this, so some work is required. Once that's in place, this limitation mostly
+disappears from view, because there will be ample refcounting range available.
+
+* Callers must specifically request "dma-pinned tracking of pages". In other
+  words, just calling get_user_pages() will not suffice; a new set of functions,
+  pin_user_page() and related, must be used.
+
+FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags
+==========================================================
+
+Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for describing
+these categories:
+
+CASE 1: Direct IO (DIO)
+-----------------------
+There are GUP references to pages that are serving
+as DIO buffers. These buffers are needed for a relatively short time (so they
+are not "long term"). No special synchronization with page_mkclean() or
+munmap() is provided. Therefore, flags to set at the call site are: ::
+
+    FOLL_PIN
+
+...but rather than setting FOLL_PIN directly, call sites should use one of
+the pin_user_pages*() routines that set FOLL_PIN.
+
+CASE 2: RDMA
+------------
+There are GUP references to pages that are serving as DMA
+buffers. These buffers are needed for a long time ("long term"). No special
+synchronization with page_mkclean() or munmap() is provided. Therefore, flags
+to set at the call site are: ::
+
+    FOLL_PIN | FOLL_LONGTERM
+
+NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's
+because DAX pages do not have a separate page cache, and so "pinning" implies
+locking down file system blocks, which is not (yet) supported in that way.
+
+CASE 3: ODP
+-----------
+(Mellanox/Infiniband On Demand Paging: the hardware supports
+replayable page faulting). There are GUP references to pages serving as DMA
+buffers. For ODP, MMU notifiers are used to synchronize with page_mkclean()
+and munmap(). Therefore, normal GUP calls are sufficient, so neither flag
+needs to be set.
+
+CASE 4: Pinning for struct page manipulation only
+-------------------------------------------------
+Here, normal GUP calls are sufficient, so neither flag needs to be set.
+
+page_dma_pinned(): the whole point of pinning
+=============================================
+
+The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able
+to query, "is this page DMA-pinned?" That allows code such as page_mkclean()
+(and file system writeback code in general) to make informed decisions about
+what to do when a page cannot be unmapped due to such pins.
+
+What to do in those cases is the subject of a years-long series of discussions
+and debates (see the References at the end of this document). It's a TODO item
+here: fill in the details once that's worked out. Meanwhile, it's safe to say
+that having this available: ::
+
+        static inline bool page_dma_pinned(struct page *page)
+
+...is a prerequisite to solving the long-running gup+DMA problem.
+
+Another way of thinking about FOLL_GET, FOLL_PIN, and FOLL_LONGTERM
+===================================================================
+
+Another way of thinking about these flags is as a progression of restrictions:
+FOLL_GET is for struct page manipulation, without affecting the data that the
+struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for
+short term pins on pages whose data *will* get accessed. As such, FOLL_PIN is
+a "more severe" form of pinning. And finally, FOLL_LONGTERM is an even more
+restrictive case that has FOLL_PIN as a prerequisite: this is for pages that
+will be pinned longterm, and whose data will be accessed.
+
+Unit testing
+============
+This file::
+
+ tools/testing/selftests/vm/gup_benchmark.c
+
+has the following new calls to exercise the new pin*() wrapper functions:
+
+* PIN_FAST_BENCHMARK (./gup_benchmark -a)
+* PIN_LONGTERM_BENCHMARK (./gup_benchmark -a)
+* PIN_BENCHMARK (./gup_benchmark -a)
+
+You can monitor how many total dma-pinned pages have been acquired and released
+since the system was booted, via two new /proc/vmstat entries: ::
+
+    /proc/vmstat/nr_foll_pin_requested
+    /proc/vmstat/nr_foll_pin_requested
+
+Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is
+because there is a noticeable performance drop in put_user_page(), when they
+are activated.
+
+References
+==========
+
+* `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_
+* `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_
+* `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_
+
+John Hubbard, October, 2019
diff --git a/include/linux/mm.h b/include/linux/mm.h
index cc292273e6ba..cdfb6fedb271 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1526,9 +1526,23 @@  long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
 			    unsigned long start, unsigned long nr_pages,
 			    unsigned int gup_flags, struct page **pages,
 			    struct vm_area_struct **vmas, int *locked);
+long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
+			   unsigned long start, unsigned long nr_pages,
+			   unsigned int gup_flags, struct page **pages,
+			   struct vm_area_struct **vmas, int *locked);
+long pin_longterm_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
+			       unsigned long start, unsigned long nr_pages,
+			       unsigned int gup_flags, struct page **pages,
+			       struct vm_area_struct **vmas, int *locked);
 long get_user_pages(unsigned long start, unsigned long nr_pages,
 			    unsigned int gup_flags, struct page **pages,
 			    struct vm_area_struct **vmas);
+long pin_user_pages(unsigned long start, unsigned long nr_pages,
+		    unsigned int gup_flags, struct page **pages,
+		    struct vm_area_struct **vmas);
+long pin_longterm_pages(unsigned long start, unsigned long nr_pages,
+			unsigned int gup_flags, struct page **pages,
+			struct vm_area_struct **vmas);
 long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
 		    unsigned int gup_flags, struct page **pages, int *locked);
 long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
@@ -1536,6 +1550,10 @@  long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 
 int get_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages);
+int pin_user_pages_fast(unsigned long start, int nr_pages,
+			unsigned int gup_flags, struct page **pages);
+int pin_longterm_pages_fast(unsigned long start, int nr_pages,
+			    unsigned int gup_flags, struct page **pages);
 
 int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
 int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
@@ -2594,13 +2612,15 @@  struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_ANON	0x8000	/* don't do file mappings */
 #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
+#define FOLL_PIN	0x40000	/* pages must be released via put_user_page() */
 
 /*
- * NOTE on FOLL_LONGTERM:
+ * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
+ * other. Here is what they mean, and how to use them:
  *
  * FOLL_LONGTERM indicates that the page will be held for an indefinite time
- * period _often_ under userspace control.  This is contrasted with
- * iov_iter_get_pages() where usages which are transient.
+ * period _often_ under userspace control.  This is in contrast to
+ * iov_iter_get_pages(), where usages which are transient.
  *
  * FIXME: For pages which are part of a filesystem, mappings are subject to the
  * lifetime enforced by the filesystem and we need guarantees that longterm
@@ -2615,11 +2635,41 @@  struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
  * Currently only get_user_pages() and get_user_pages_fast() support this flag
  * and calls to get_user_pages_[un]locked are specifically not allowed.  This
  * is due to an incompatibility with the FS DAX check and
- * FAULT_FLAG_ALLOW_RETRY
+ * FAULT_FLAG_ALLOW_RETRY.
  *
- * In the CMA case: longterm pins in a CMA region would unnecessarily fragment
- * that region.  And so CMA attempts to migrate the page before pinning when
+ * In the CMA case: long term pins in a CMA region would unnecessarily fragment
+ * that region.  And so, CMA attempts to migrate the page before pinning, when
  * FOLL_LONGTERM is specified.
+ *
+ * FOLL_PIN indicates that a special kind of tracking (not just page->_refcount,
+ * but an additional pin counting system) will be invoked. This is intended for
+ * anything that gets a page reference and then touches page data (for example,
+ * Direct IO). This lets the filesystem know that some non-file-system entity is
+ * potentially changing the pages' data. In contrast to FOLL_GET (whose pages
+ * are released via put_page()), FOLL_PIN pages must be released, ultimately, by
+ * a call to put_user_page().
+ *
+ * FOLL_PIN is similar to FOLL_GET: both of these pin pages. They use different
+ * and separate refcounting mechanisms, however, and that means that each has
+ * its own acquire and release mechanisms:
+ *
+ *     FOLL_GET: get_user_pages*() to acquire, and put_page() to release.
+ *
+ *     FOLL_PIN: pin_user_pages*() or pin_longterm_pages*() to acquire, and
+ *               put_user_pages to release.
+ *
+ * FOLL_PIN and FOLL_GET are mutually exclusive for a given function call.
+ * (The underlying pages may experience both FOLL_GET-based and FOLL_PIN-based
+ * calls applied to them, and that's perfectly OK. This is a constraint on the
+ * callers, not on the pages.)
+ *
+ * FOLL_PIN and FOLL_LONGTERM should be set internally by the pin_user_page*()
+ * and pin_longterm_*() APIs, never directly by the caller. That's in order to
+ * help avoid mismatches when releasing pages: get_user_pages*() pages must be
+ * released via put_page(), while pin_user_pages*() pages must be released via
+ * put_user_page().
+ *
+ * Please see Documentation/vm/pin_user_pages.rst for more information.
  */
 
 static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
diff --git a/mm/gup.c b/mm/gup.c
index 199da99e8ffc..1aea48427879 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -179,6 +179,10 @@  static struct page *follow_page_pte(struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	pte_t *ptep, pte;
 
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
+			 (FOLL_PIN | FOLL_GET)))
+		return ERR_PTR(-EINVAL);
 retry:
 	if (unlikely(pmd_bad(*pmd)))
 		return no_page_table(vma, flags);
@@ -790,7 +794,7 @@  static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 
 	start = untagged_addr(start);
 
-	VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));
+	VM_BUG_ON(!!pages != !!(gup_flags & (FOLL_GET | FOLL_PIN)));
 
 	/*
 	 * If FOLL_FORCE is set then do not force a full fault as the hinting
@@ -1014,7 +1018,16 @@  static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
 		BUG_ON(*locked != 1);
 	}
 
-	if (pages)
+	/*
+	 * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
+	 * is to set FOLL_GET if the caller wants pages[] filled in (but has
+	 * carelessly failed to specify FOLL_GET), so keep doing that, but only
+	 * for FOLL_GET, not for the newer FOLL_PIN.
+	 *
+	 * FOLL_PIN always expects pages to be non-null, but no need to assert
+	 * that here, as any failures will be obvious enough.
+	 */
+	if (pages && !(flags & FOLL_PIN))
 		flags |= FOLL_GET;
 
 	pages_done = 0;
@@ -1151,6 +1164,14 @@  long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned int gup_flags, struct page **pages,
 		struct vm_area_struct **vmas, int *locked)
 {
+	/*
+	 * FOLL_PIN must only be set internally by the pin_user_page*() and
+	 * pin_longterm_*() APIs, never directly by the caller, so enforce that
+	 * with an assertion:
+	 */
+	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
+		return -EINVAL;
+
 	/*
 	 * FIXME: Current FOLL_LONGTERM behavior is incompatible with
 	 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
@@ -1608,6 +1629,14 @@  long get_user_pages(unsigned long start, unsigned long nr_pages,
 		unsigned int gup_flags, struct page **pages,
 		struct vm_area_struct **vmas)
 {
+	/*
+	 * FOLL_PIN must only be set internally by the pin_user_page*() and
+	 * pin_longterm_*() APIs, never directly by the caller, so enforce that
+	 * with an assertion:
+	 */
+	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
+		return -EINVAL;
+
 	return __gup_longterm_locked(current, current->mm, start, nr_pages,
 				     pages, vmas, gup_flags | FOLL_TOUCH);
 }
@@ -2373,24 +2402,9 @@  static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
 	return ret;
 }
 
-/**
- * get_user_pages_fast() - pin user pages in memory
- * @start:	starting user address
- * @nr_pages:	number of pages from start to pin
- * @gup_flags:	flags modifying pin behaviour
- * @pages:	array that receives pointers to the pages pinned.
- *		Should be at least nr_pages long.
- *
- * Attempt to pin user pages in memory without taking mm->mmap_sem.
- * If not successful, it will fall back to taking the lock and
- * calling get_user_pages().
- *
- * Returns number of pages pinned. This may be fewer than the number
- * requested. If nr_pages is 0 or negative, returns 0. If no pages
- * were pinned, returns -errno.
- */
-int get_user_pages_fast(unsigned long start, int nr_pages,
-			unsigned int gup_flags, struct page **pages)
+static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
+					unsigned int gup_flags,
+					struct page **pages)
 {
 	unsigned long addr, len, end;
 	int nr = 0, ret = 0;
@@ -2435,4 +2449,215 @@  int get_user_pages_fast(unsigned long start, int nr_pages,
 
 	return ret;
 }
+
+/**
+ * get_user_pages_fast() - pin user pages in memory
+ * @start:	starting user address
+ * @nr_pages:	number of pages from start to pin
+ * @gup_flags:	flags modifying pin behaviour
+ * @pages:	array that receives pointers to the pages pinned.
+ *		Should be at least nr_pages long.
+ *
+ * Attempt to pin user pages in memory without taking mm->mmap_sem.
+ * If not successful, it will fall back to taking the lock and
+ * calling get_user_pages().
+ *
+ * Returns number of pages pinned. This may be fewer than the number requested.
+ * If nr_pages is 0 or negative, returns 0. If no pages were pinned, returns
+ * -errno.
+ */
+int get_user_pages_fast(unsigned long start, int nr_pages,
+			unsigned int gup_flags, struct page **pages)
+{
+	/*
+	 * FOLL_PIN must only be set internally by the pin_user_page*() and
+	 * pin_longterm_*() APIs, never directly by the caller, so enforce that:
+	 */
+	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
+		return -EINVAL;
+
+	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
+}
 EXPORT_SYMBOL_GPL(get_user_pages_fast);
+
+/**
+ * pin_user_pages_fast() - pin user pages in memory without taking locks
+ *
+ * Nearly the same as get_user_pages_fast(), except that FOLL_PIN is set. See
+ * get_user_pages_fast() for documentation on the function arguments, because
+ * the arguments here are identical.
+ *
+ * FOLL_PIN means that the pages must be released via put_user_page(). Please
+ * see Documentation/vm/pin_user_pages.rst for further details.
+ *
+ * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
+ * is NOT intended for Case 2 (RDMA: long-term pins).
+ */
+int pin_user_pages_fast(unsigned long start, int nr_pages,
+			unsigned int gup_flags, struct page **pages)
+{
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
+		return -EINVAL;
+
+	gup_flags |= FOLL_PIN;
+	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
+}
+EXPORT_SYMBOL_GPL(pin_user_pages_fast);
+
+/**
+ * pin_longterm_pages_fast() - pin user pages in memory without taking locks
+ *
+ * Nearly the same as get_user_pages_fast(), except that FOLL_PIN and
+ * FOLL_LONGTERM are set. See get_user_pages_fast() for documentation on the
+ * function arguments, because the arguments here are identical.
+ *
+ * FOLL_PIN means that the pages must be released via put_user_page(). Please
+ * see Documentation/vm/pin_user_pages.rst for further details.
+ *
+ * FOLL_LONGTERM means that the pages are being pinned for "long term" use,
+ * typically by a non-CPU device, and we cannot be sure that waiting for a
+ * pinned page to become unpin will be effective.
+ *
+ * This is intended for Case 2 (RDMA: long-term pins) of the FOLL_PIN
+ * documentation.
+ */
+int pin_longterm_pages_fast(unsigned long start, int nr_pages,
+			    unsigned int gup_flags, struct page **pages)
+{
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
+		return -EINVAL;
+
+	gup_flags |= (FOLL_PIN | FOLL_LONGTERM);
+	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
+}
+EXPORT_SYMBOL_GPL(pin_longterm_pages_fast);
+
+/**
+ * pin_user_pages_remote() - pin pages for (typically) use by Direct IO, and
+ * return the pages to the user.
+ *
+ * Nearly the same as get_user_pages_remote(), except that FOLL_PIN is set. See
+ * get_user_pages_remote() for documentation on the function arguments, because
+ * the arguments here are identical.
+ *
+ * FOLL_PIN means that the pages must be released via put_user_page(). Please
+ * see Documentation/vm/pin_user_pages.rst for details.
+ *
+ * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
+ * is NOT intended for Case 2 (RDMA: long-term pins).
+ */
+long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
+			   unsigned long start, unsigned long nr_pages,
+			   unsigned int gup_flags, struct page **pages,
+			   struct vm_area_struct **vmas, int *locked)
+{
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
+		return -EINVAL;
+
+	gup_flags |= FOLL_TOUCH | FOLL_REMOTE | FOLL_PIN;
+
+	return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
+				       locked, gup_flags);
+}
+EXPORT_SYMBOL(pin_user_pages_remote);
+
+/**
+ * pin_longterm_pages_remote() - pin pages for (typically) use by Direct IO, and
+ * return the pages to the user.
+ *
+ * Nearly the same as get_user_pages_remote(), but note that FOLL_TOUCH is not
+ * set, and FOLL_PIN and FOLL_LONGTERM are set. See get_user_pages_remote() for
+ * documentation on the function arguments, because the arguments here are
+ * identical.
+ *
+ * FOLL_PIN means that the pages must be released via put_user_page(). Please
+ * see Documentation/vm/pin_user_pages.rst for further details.
+ *
+ * FOLL_LONGTERM means that the pages are being pinned for "long term" use,
+ * typically by a non-CPU device, and we cannot be sure that waiting for a
+ * pinned page to become unpin will be effective.
+ *
+ * This is intended for Case 2 (RDMA: long-term pins) in
+ * Documentation/vm/pin_user_pages.rst.
+ */
+long pin_longterm_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
+			       unsigned long start, unsigned long nr_pages,
+			       unsigned int gup_flags, struct page **pages,
+			       struct vm_area_struct **vmas, int *locked)
+{
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
+		return -EINVAL;
+
+	/*
+	 * FIXME: as noted in the get_user_pages_remote() implementation, it
+	 * is not yet possible to safely set FOLL_LONGTERM here. FOLL_LONGTERM
+	 * needs to be set, but for now the best we can do is a "TODO" item.
+	 */
+	gup_flags |= FOLL_REMOTE | FOLL_PIN;
+
+	return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
+				       locked, gup_flags);
+}
+EXPORT_SYMBOL(pin_longterm_pages_remote);
+
+/**
+ * pin_user_pages() - pin user pages in memory for use by other devices
+ *
+ * Nearly the same as get_user_pages(), except that FOLL_TOUCH is not set, and
+ * FOLL_PIN is set.
+ *
+ * FOLL_PIN means that the pages must be released via put_user_page(). Please
+ * see Documentation/vm/pin_user_pages.rst for details.
+ *
+ * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
+ * is NOT intended for Case 2 (RDMA: long-term pins).
+ */
+long pin_user_pages(unsigned long start, unsigned long nr_pages,
+		    unsigned int gup_flags, struct page **pages,
+		    struct vm_area_struct **vmas)
+{
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
+		return -EINVAL;
+
+	gup_flags |= FOLL_PIN;
+	return __gup_longterm_locked(current, current->mm, start, nr_pages,
+				     pages, vmas, gup_flags);
+}
+EXPORT_SYMBOL(pin_user_pages);
+
+/**
+ * pin_longterm_pages() - pin user pages in memory for long-term use (RDMA,
+ * typically)
+ *
+ * Nearly the same as get_user_pages(), except that FOLL_PIN and FOLL_LONGTERM
+ * are set. See get_user_pages_fast() for documentation on the function
+ * arguments, because the arguments here are identical.
+ *
+ * FOLL_PIN means that the pages must be released via put_user_page(). Please
+ * see Documentation/vm/pin_user_pages.rst for further details.
+ *
+ * FOLL_LONGTERM means that the pages are being pinned for "long term" use,
+ * typically by a non-CPU device, and we cannot be sure that waiting for a
+ * pinned page to become unpin will be effective.
+ *
+ * This is intended for Case 2 (RDMA: long-term pins) in
+ * Documentation/vm/pin_user_pages.rst.
+ */
+long pin_longterm_pages(unsigned long start, unsigned long nr_pages,
+			unsigned int gup_flags, struct page **pages,
+			struct vm_area_struct **vmas)
+{
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
+		return -EINVAL;
+
+	gup_flags |= FOLL_PIN | FOLL_LONGTERM;
+	return __gup_longterm_locked(current, current->mm, start, nr_pages,
+				     pages, vmas, gup_flags);
+}
+EXPORT_SYMBOL(pin_longterm_pages);