diff mbox series

[v4,1/3] mm: Don't pin ZERO_PAGE in pin_user_pages()

Message ID 20230526214142.958751-2-dhowells@redhat.com (mailing list archive)
State New, archived
Headers show
Series block: Make old dio use iov_iter_extract_pages() and page pinning | expand

Commit Message

David Howells May 26, 2023, 9:41 p.m. UTC
Make pin_user_pages*() leave a ZERO_PAGE unpinned if it extracts a pointer
to it from the page tables and make unpin_user_page*() correspondingly
ignore a ZERO_PAGE when unpinning.  We don't want to risk overrunning a
zero page's refcount as we're only allowed ~2 million pins on it -
something that userspace can conceivably trigger.

Add a pair of functions to test whether a page or a folio is a ZERO_PAGE.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@infradead.org>
cc: David Hildenbrand <david@redhat.com>
cc: Lorenzo Stoakes <lstoakes@gmail.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jan Kara <jack@suse.cz>
cc: Jeff Layton <jlayton@kernel.org>
cc: Jason Gunthorpe <jgg@nvidia.com>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: Hillf Danton <hdanton@sina.com>
cc: Christian Brauner <brauner@kernel.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-kernel@vger.kernel.org
cc: linux-mm@kvack.org
---

Notes:
    ver #3)
     - Move is_zero_page() and is_zero_folio() to mm.h for dependency reasons.
     - Add more comments and adjust the docs.
    
    ver #2)
     - Fix use of ZERO_PAGE().
     - Add is_zero_page() and is_zero_folio() wrappers.
     - Return the zero page obtained, not ZERO_PAGE(0) unconditionally.

 Documentation/core-api/pin_user_pages.rst |  6 +++++
 include/linux/mm.h                        | 26 +++++++++++++++++--
 mm/gup.c                                  | 31 ++++++++++++++++++++++-
 3 files changed, 60 insertions(+), 3 deletions(-)

Comments

Lorenzo Stoakes May 27, 2023, 7:46 p.m. UTC | #1
On Fri, May 26, 2023 at 10:41:40PM +0100, David Howells wrote:
> Make pin_user_pages*() leave a ZERO_PAGE unpinned if it extracts a pointer
> to it from the page tables and make unpin_user_page*() correspondingly
> ignore a ZERO_PAGE when unpinning.  We don't want to risk overrunning a
> zero page's refcount as we're only allowed ~2 million pins on it -
> something that userspace can conceivably trigger.
>
> Add a pair of functions to test whether a page or a folio is a ZERO_PAGE.
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Christoph Hellwig <hch@infradead.org>
> cc: David Hildenbrand <david@redhat.com>
> cc: Lorenzo Stoakes <lstoakes@gmail.com>
> cc: Andrew Morton <akpm@linux-foundation.org>
> cc: Jens Axboe <axboe@kernel.dk>
> cc: Al Viro <viro@zeniv.linux.org.uk>
> cc: Matthew Wilcox <willy@infradead.org>
> cc: Jan Kara <jack@suse.cz>
> cc: Jeff Layton <jlayton@kernel.org>
> cc: Jason Gunthorpe <jgg@nvidia.com>
> cc: Logan Gunthorpe <logang@deltatee.com>
> cc: Hillf Danton <hdanton@sina.com>
> cc: Christian Brauner <brauner@kernel.org>
> cc: Linus Torvalds <torvalds@linux-foundation.org>
> cc: linux-fsdevel@vger.kernel.org
> cc: linux-block@vger.kernel.org
> cc: linux-kernel@vger.kernel.org
> cc: linux-mm@kvack.org
> ---
>
> Notes:
>     ver #3)
>      - Move is_zero_page() and is_zero_folio() to mm.h for dependency reasons.
>      - Add more comments and adjust the docs.
>
>     ver #2)
>      - Fix use of ZERO_PAGE().
>      - Add is_zero_page() and is_zero_folio() wrappers.
>      - Return the zero page obtained, not ZERO_PAGE(0) unconditionally.
>
>  Documentation/core-api/pin_user_pages.rst |  6 +++++
>  include/linux/mm.h                        | 26 +++++++++++++++++--
>  mm/gup.c                                  | 31 ++++++++++++++++++++++-
>  3 files changed, 60 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
> index 9fb0b1080d3b..d3c1f6d8c0e0 100644
> --- a/Documentation/core-api/pin_user_pages.rst
> +++ b/Documentation/core-api/pin_user_pages.rst
> @@ -112,6 +112,12 @@ pages:
>  This also leads to limitations: there are only 31-10==21 bits available for a
>  counter that increments 10 bits at a time.
>
> +* Because of that limitation, special handling is applied to the zero pages
> +  when using FOLL_PIN.  We only pretend to pin a zero page - we don't alter its
> +  refcount or pincount at all (it is permanent, so there's no need).  The
> +  unpinning functions also don't do anything to a zero page.  This is
> +  transparent to the caller.
> +

Great! Cheers. :)

>  * Callers must specifically request "dma-pinned tracking of pages". In other
>    words, just calling get_user_pages() will not suffice; a new set of functions,
>    pin_user_page() and related, must be used.
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 27ce77080c79..3c2f6b452586 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1910,6 +1910,28 @@ static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma,
>  	return page_maybe_dma_pinned(page);
>  }
>
> +/**
> + * is_zero_page - Query if a page is a zero page
> + * @page: The page to query
> + *
> + * This returns true if @page is one of the permanent zero pages.
> + */
> +static inline bool is_zero_page(const struct page *page)
> +{
> +	return is_zero_pfn(page_to_pfn(page));
> +}
> +
> +/**
> + * is_zero_folio - Query if a folio is a zero page
> + * @folio: The folio to query
> + *
> + * This returns true if @folio is one of the permanent zero pages.
> + */
> +static inline bool is_zero_folio(const struct folio *folio)
> +{
> +	return is_zero_page(&folio->page);
> +}
> +
>  /* MIGRATE_CMA and ZONE_MOVABLE do not allow pin pages */
>  #ifdef CONFIG_MIGRATION
>  static inline bool is_longterm_pinnable_page(struct page *page)
> @@ -1920,8 +1942,8 @@ static inline bool is_longterm_pinnable_page(struct page *page)
>  	if (mt == MIGRATE_CMA || mt == MIGRATE_ISOLATE)
>  		return false;
>  #endif
> -	/* The zero page may always be pinned */
> -	if (is_zero_pfn(page_to_pfn(page)))
> +	/* The zero page can be "pinned" but gets special handling. */
> +	if (is_zero_page(page))
>  		return true;
>
>  	/* Coherent device memory must always allow eviction. */
> diff --git a/mm/gup.c b/mm/gup.c
> index bbe416236593..ad28261dcafd 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -51,7 +51,8 @@ static inline void sanity_check_pinned_pages(struct page **pages,
>  		struct page *page = *pages;
>  		struct folio *folio = page_folio(page);
>
> -		if (!folio_test_anon(folio))
> +		if (is_zero_page(page) ||
> +		    !folio_test_anon(folio))
>  			continue;
>  		if (!folio_test_large(folio) || folio_test_hugetlb(folio))
>  			VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page), page);
> @@ -131,6 +132,13 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>  	else if (flags & FOLL_PIN) {
>  		struct folio *folio;
>
> +		/*
> +		 * Don't take a pin on the zero page - it's not going anywhere
> +		 * and it is used in a *lot* of places.
> +		 */
> +		if (is_zero_page(page))
> +			return page_folio(page);
> +
>  		/*
>  		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
>  		 * right zone, so fail and let the caller fall back to the slow
> @@ -180,6 +188,8 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>  static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
>  {
>  	if (flags & FOLL_PIN) {
> +		if (is_zero_folio(folio))
> +			return;
>  		node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
>  		if (folio_test_large(folio))
>  			atomic_sub(refs, &folio->_pincount);
> @@ -224,6 +234,13 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
>  	if (flags & FOLL_GET)
>  		folio_ref_inc(folio);
>  	else if (flags & FOLL_PIN) {
> +		/*
> +		 * Don't take a pin on the zero page - it's not going anywhere
> +		 * and it is used in a *lot* of places.
> +		 */
> +		if (is_zero_page(page))
> +			return 0;
> +
>  		/*
>  		 * Similar to try_grab_folio(): be sure to *also*
>  		 * increment the normal page refcount field at least once,
> @@ -3079,6 +3096,9 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast);
>   *
>   * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
>   * see Documentation/core-api/pin_user_pages.rst for further details.
> + *
> + * Note that if a zero_page is amongst the returned pages, it will not have
> + * pins in it and unpin_user_page() will not remove pins from it.
>   */
>  int pin_user_pages_fast(unsigned long start, int nr_pages,
>  			unsigned int gup_flags, struct page **pages)
> @@ -3110,6 +3130,9 @@ EXPORT_SYMBOL_GPL(pin_user_pages_fast);
>   *
>   * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
>   * see Documentation/core-api/pin_user_pages.rst for details.
> + *
> + * Note that if a zero_page is amongst the returned pages, it will not have
> + * pins in it and unpin_user_page*() will not remove pins from it.
>   */
>  long pin_user_pages_remote(struct mm_struct *mm,
>  			   unsigned long start, unsigned long nr_pages,
> @@ -3143,6 +3166,9 @@ EXPORT_SYMBOL(pin_user_pages_remote);
>   *
>   * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
>   * see Documentation/core-api/pin_user_pages.rst for details.
> + *
> + * Note that if a zero_page is amongst the returned pages, it will not have
> + * pins in it and unpin_user_page*() will not remove pins from it.
>   */
>  long pin_user_pages(unsigned long start, unsigned long nr_pages,
>  		    unsigned int gup_flags, struct page **pages,
> @@ -3161,6 +3187,9 @@ EXPORT_SYMBOL(pin_user_pages);
>   * pin_user_pages_unlocked() is the FOLL_PIN variant of
>   * get_user_pages_unlocked(). Behavior is the same, except that this one sets
>   * FOLL_PIN and rejects FOLL_GET.
> + *
> + * Note that if a zero_page is amongst the returned pages, it will not have
> + * pins in it and unpin_user_page*() will not remove pins from it.
>   */
>  long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
>  			     struct page **pages, unsigned int gup_flags)
>

All looks good, thanks for adding comments/updating doc!

Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Christoph Hellwig May 31, 2023, 3:57 a.m. UTC | #2
Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
David Hildenbrand May 31, 2023, 7:34 a.m. UTC | #3
On 26.05.23 23:41, David Howells wrote:
> Make pin_user_pages*() leave a ZERO_PAGE unpinned if it extracts a pointer
> to it from the page tables and make unpin_user_page*() correspondingly
> ignore a ZERO_PAGE when unpinning.  We don't want to risk overrunning a
> zero page's refcount as we're only allowed ~2 million pins on it -
> something that userspace can conceivably trigger.

2 million pins (FOLL_PIN, which increments the refcount by 1024) or 2
million references?

> 
> Add a pair of functions to test whether a page or a folio is a ZERO_PAGE.
> 

[...]

> diff --git a/mm/gup.c b/mm/gup.c
> index bbe416236593..ad28261dcafd 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -51,7 +51,8 @@ static inline void sanity_check_pinned_pages(struct page **pages,
>   		struct page *page = *pages;
>   		struct folio *folio = page_folio(page);
>   
> -		if (!folio_test_anon(folio))
> +		if (is_zero_page(page) ||
> +		    !folio_test_anon(folio))

Nit: Fits into a single line (without harming readability IMHO).

>   			continue;
>   		if (!folio_test_large(folio) || folio_test_hugetlb(folio))
>   			VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page), page);
> @@ -131,6 +132,13 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>   	else if (flags & FOLL_PIN) {
>   		struct folio *folio;
>   
> +		/*
> +		 * Don't take a pin on the zero page - it's not going anywhere
> +		 * and it is used in a *lot* of places.
> +		 */
> +		if (is_zero_page(page))
> +			return page_folio(page);
> +
>   		/*
>   		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
>   		 * right zone, so fail and let the caller fall back to the slow
> @@ -180,6 +188,8 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>   static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
>   {
>   	if (flags & FOLL_PIN) {
> +		if (is_zero_folio(folio))
> +			return;
>   		node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
>   		if (folio_test_large(folio))
>   			atomic_sub(refs, &folio->_pincount);
> @@ -224,6 +234,13 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
>   	if (flags & FOLL_GET)
>   		folio_ref_inc(folio);
>   	else if (flags & FOLL_PIN) {
> +		/*
> +		 * Don't take a pin on the zero page - it's not going anywhere
> +		 * and it is used in a *lot* of places.
> +		 */
> +		if (is_zero_page(page))
> +			return 0;
> +
>   		/*
>   		 * Similar to try_grab_folio(): be sure to *also*
>   		 * increment the normal page refcount field at least once,
> @@ -3079,6 +3096,9 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast);
>    *
>    * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
>    * see Documentation/core-api/pin_user_pages.rst for further details.
> + *
> + * Note that if a zero_page is amongst the returned pages, it will not have
> + * pins in it and unpin_user_page() will not remove pins from it.
>    */

"it will not have pins in it" sounds fairly weird to a non-native speaker.

"Note that the refcount of any zero_pages returned among the pinned 
pages will not be incremented, and unpin_user_page() will similarly not 
decrement it."


Acked-by: David Hildenbrand <david@redhat.com>
David Howells May 31, 2023, 8:35 a.m. UTC | #4
David Hildenbrand <david@redhat.com> wrote:

> > Make pin_user_pages*() leave a ZERO_PAGE unpinned if it extracts a pointer
> > to it from the page tables and make unpin_user_page*() correspondingly
> > ignore a ZERO_PAGE when unpinning.  We don't want to risk overrunning a
> > zero page's refcount as we're only allowed ~2 million pins on it -
> > something that userspace can conceivably trigger.
> 
> 2 million pins (FOLL_PIN, which increments the refcount by 1024) or 2
> million references?

Definitely pins.  It's tricky because we've been using "pinned" to mean held
by a refcount or held by a flag too.

2 million pins on the zero page is in the realms of possibility.  It only
takes 32768 64-page DIO writes.

> > @@ -3079,6 +3096,9 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast);
> >    *
> >    * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
> >    * see Documentation/core-api/pin_user_pages.rst for further details.
> > + *
> > + * Note that if a zero_page is amongst the returned pages, it will not have
> > + * pins in it and unpin_user_page() will not remove pins from it.
> >    */
> 
> "it will not have pins in it" sounds fairly weird to a non-native speaker.

Oh, I know.  The problem is that "pin" is now really ambiguous.  Can we change
"FOLL_PIN" to "FOLL_NAIL"?  Or maybe "FOLL_SCREW" - your pages are screwed if
you use DIO and fork at the same time.

> "Note that the refcount of any zero_pages returned among the pinned pages will
> not be incremented, and unpin_user_page() will similarly not decrement it."

That's not really right (although it happens to be true), because we're
talking primarily about the pin counter, not the refcount - and they may be
separate.

David
David Hildenbrand May 31, 2023, 8:46 a.m. UTC | #5
On 31.05.23 10:35, David Howells wrote:
> David Hildenbrand <david@redhat.com> wrote:
> 
>>> Make pin_user_pages*() leave a ZERO_PAGE unpinned if it extracts a pointer
>>> to it from the page tables and make unpin_user_page*() correspondingly
>>> ignore a ZERO_PAGE when unpinning.  We don't want to risk overrunning a
>>> zero page's refcount as we're only allowed ~2 million pins on it -
>>> something that userspace can conceivably trigger.
>>
>> 2 million pins (FOLL_PIN, which increments the refcount by 1024) or 2
>> million references?
> 
> Definitely pins.  It's tricky because we've been using "pinned" to mean held
> by a refcount or held by a flag too.
> 

Yes, it would be clearer if we would be using "pinned" now only for 
FOLL_PIN and everything else is simply "taking a temporary reference on 
the page".

> 2 million pins on the zero page is in the realms of possibility.  It only
> takes 32768 64-page DIO writes.
> 
>>> @@ -3079,6 +3096,9 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast);
>>>     *
>>>     * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
>>>     * see Documentation/core-api/pin_user_pages.rst for further details.
>>> + *
>>> + * Note that if a zero_page is amongst the returned pages, it will not have
>>> + * pins in it and unpin_user_page() will not remove pins from it.
>>>     */
>>
>> "it will not have pins in it" sounds fairly weird to a non-native speaker.
> 
> Oh, I know.  The problem is that "pin" is now really ambiguous.  Can we change
> "FOLL_PIN" to "FOLL_NAIL"?  Or maybe "FOLL_SCREW" - your pages are screwed if
> you use DIO and fork at the same time.
> 

I'm hoping that "pinning" will be "FOLL_PIN" (intention to access page 
content) and everything else is simply "taking a temporary page reference".

>> "Note that the refcount of any zero_pages returned among the pinned pages will
>> not be incremented, and unpin_user_page() will similarly not decrement it."
> 
> That's not really right (although it happens to be true), because we're
> talking primarily about the pin counter, not the refcount - and they may be
> separate.

In any case (FOLL_PIN/FOLL_GET) you increment/decrement the refcount. If 
we have a separate pincount, we increment/decrement the refcount by 1 
when (un)pinning.

Sure, if we'd have a separate pincount we'd also not be modifying it.
David Howells May 31, 2023, 1:55 p.m. UTC | #6
David Hildenbrand <david@redhat.com> wrote:

> Yes, it would be clearer if we would be using "pinned" now only for FOLL_PIN

You're not likely to get that.  "To pin" is too useful a verb that gets used
in other contexts too.  For that reason, I think FOLL_PIN was a poor choice of
name:-/.  I guess the English language has got somewhat overloaded.  Maybe
FOLL_PEG? ;-)

> and everything else is simply "taking a temporary reference on the page".

Excluding refs taken with pins, many refs are more permanent than pins as, so
far as I'm aware, pins only last for the duration of an I/O operation.

> >> "Note that the refcount of any zero_pages returned among the pinned pages will
> >> not be incremented, and unpin_user_page() will similarly not decrement it."
> > That's not really right (although it happens to be true), because we're
> > talking primarily about the pin counter, not the refcount - and they may be
> > separate.
> 
> In any case (FOLL_PIN/FOLL_GET) you increment/decrement the refcount. If we
> have a separate pincount, we increment/decrement the refcount by 1 when
> (un)pinning.

FOLL_GET isn't relevant here - only FOLL_PIN.  Yes, as it happens, we count a
ref if we count a pin, but that's kind of irrelevant; what matters is that the
effect must be undone with un-PUP.

It would be nice not to get a ref on the zero page in FOLL_GET, but I don't
think we can do that yet.  Too many places assume that GUP will give them a
ref they can release later via ordinary methods.

David
Lorenzo Stoakes May 31, 2023, 2:10 p.m. UTC | #7
On Wed, May 31, 2023 at 02:55:35PM +0100, David Howells wrote:
> David Hildenbrand <david@redhat.com> wrote:
>
> > Yes, it would be clearer if we would be using "pinned" now only for FOLL_PIN
>
> You're not likely to get that.  "To pin" is too useful a verb that gets used
> in other contexts too.  For that reason, I think FOLL_PIN was a poor choice of
> name:-/.  I guess the English language has got somewhat overloaded.  Maybe
> FOLL_PEG? ;-)
>

Might I suggest FOLL_FOLL? As we are essentially 'following' the page once
'pinned'. We could differentiate between what is currently FOLL_GET and
FOLL_PIN by using FOLL_FOLL and FOLL_FOLL_FOLL.

Patch series coming soon...

> > and everything else is simply "taking a temporary reference on the page".
>
> Excluding refs taken with pins, many refs are more permanent than pins as, so
> far as I'm aware, pins only last for the duration of an I/O operation.
>
> > >> "Note that the refcount of any zero_pages returned among the pinned pages will
> > >> not be incremented, and unpin_user_page() will similarly not decrement it."
> > > That's not really right (although it happens to be true), because we're
> > > talking primarily about the pin counter, not the refcount - and they may be
> > > separate.
> >
> > In any case (FOLL_PIN/FOLL_GET) you increment/decrement the refcount. If we
> > have a separate pincount, we increment/decrement the refcount by 1 when
> > (un)pinning.
>
> FOLL_GET isn't relevant here - only FOLL_PIN.  Yes, as it happens, we count a
> ref if we count a pin, but that's kind of irrelevant; what matters is that the
> effect must be undone with un-PUP.
>
> It would be nice not to get a ref on the zero page in FOLL_GET, but I don't
> think we can do that yet.  Too many places assume that GUP will give them a
> ref they can release later via ordinary methods.
>
> David
>
David Hildenbrand June 1, 2023, 10:14 a.m. UTC | #8
On 31.05.23 15:55, David Howells wrote:
> David Hildenbrand <david@redhat.com> wrote:
> 
>> Yes, it would be clearer if we would be using "pinned" now only for FOLL_PIN
> 
> You're not likely to get that.  "To pin" is too useful a verb that gets used
> in other contexts too.  For that reason, I think FOLL_PIN was a poor choice of
> name:-/.  I guess the English language has got somewhat overloaded.  Maybe
> FOLL_PEG? ;-)

You're probably right. FOLL_PIN and all around that is "get me an 
additional reference on the page and make sure I can DMA it without any 
undesired side-effects".

FOLL_PIN_DMA would have been clearer (and matches 
folio_maybe_dma_pinned() ) ... but then, there are some use cases where 
want the same semantics but not actually perform DMA, but simply 
read/write via the directmap (e.g., vmsplice, some io_uring cases). 
Sure, one could say that they behave like DMA: access page content at 
any time.

Saying a page is pinned (additional refcount) while it has a pincount of 0
does indeed cause confusion.

... but once we start renaming FOLL_PIN, pincount, ... we also have to 
rename pin_user_pages() and friends, and things get nasty.

> 
>> and everything else is simply "taking a temporary reference on the page".
> 
> Excluding refs taken with pins, many refs are more permanent than pins as, so
> far as I'm aware, pins only last for the duration of an I/O operation.

I was more thinking along the lines of FOLL_GET vs. FOLL_PIN. Once we 
consider any references we might have on a page, things get more tricky 
indeed.

> 
>>>> "Note that the refcount of any zero_pages returned among the pinned pages will
>>>> not be incremented, and unpin_user_page() will similarly not decrement it."
>>> That's not really right (although it happens to be true), because we're
>>> talking primarily about the pin counter, not the refcount - and they may be
>>> separate.
>>
>> In any case (FOLL_PIN/FOLL_GET) you increment/decrement the refcount. If we
>> have a separate pincount, we increment/decrement the refcount by 1 when
>> (un)pinning.
> 
> FOLL_GET isn't relevant here - only FOLL_PIN.  Yes, as it happens, we count a
> ref if we count a pin, but that's kind of irrelevant; what matters is that the
> effect must be undone with un-PUP.

The point I was trying to make is that we always modify the refcount, 
and in some cases (FOLL_PIN on order > 0) also the pincount.

But if you define "pins" as "additional reference", we're on the same page.

> 
> It would be nice not to get a ref on the zero page in FOLL_GET, but I don't
> think we can do that yet.  Too many places assume that GUP will give them a
> ref they can release later via ordinary methods.

No, we can't, I'm afraid.

Patch

diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
index 9fb0b1080d3b..d3c1f6d8c0e0 100644
--- a/Documentation/core-api/pin_user_pages.rst
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -112,6 +112,12 @@  pages:
 This also leads to limitations: there are only 31-10==21 bits available for a
 counter that increments 10 bits at a time.
 
+* Because of that limitation, special handling is applied to the zero pages
+  when using FOLL_PIN.  We only pretend to pin a zero page - we don't alter its
+  refcount or pincount at all (it is permanent, so there's no need).  The
+  unpinning functions also don't do anything to a zero page.  This is
+  transparent to the caller.
+
 * Callers must specifically request "dma-pinned tracking of pages". In other
   words, just calling get_user_pages() will not suffice; a new set of functions,
   pin_user_page() and related, must be used.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27ce77080c79..3c2f6b452586 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1910,6 +1910,28 @@  static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma,
 	return page_maybe_dma_pinned(page);
 }
 
+/**
+ * is_zero_page - Query if a page is a zero page
+ * @page: The page to query
+ *
+ * This returns true if @page is one of the permanent zero pages.
+ */
+static inline bool is_zero_page(const struct page *page)
+{
+	return is_zero_pfn(page_to_pfn(page));
+}
+
+/**
+ * is_zero_folio - Query if a folio is a zero page
+ * @folio: The folio to query
+ *
+ * This returns true if @folio is one of the permanent zero pages.
+ */
+static inline bool is_zero_folio(const struct folio *folio)
+{
+	return is_zero_page(&folio->page);
+}
+
 /* MIGRATE_CMA and ZONE_MOVABLE do not allow pin pages */
 #ifdef CONFIG_MIGRATION
 static inline bool is_longterm_pinnable_page(struct page *page)
@@ -1920,8 +1942,8 @@  static inline bool is_longterm_pinnable_page(struct page *page)
 	if (mt == MIGRATE_CMA || mt == MIGRATE_ISOLATE)
 		return false;
 #endif
-	/* The zero page may always be pinned */
-	if (is_zero_pfn(page_to_pfn(page)))
+	/* The zero page can be "pinned" but gets special handling. */
+	if (is_zero_page(page))
 		return true;
 
 	/* Coherent device memory must always allow eviction. */
diff --git a/mm/gup.c b/mm/gup.c
index bbe416236593..ad28261dcafd 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -51,7 +51,8 @@  static inline void sanity_check_pinned_pages(struct page **pages,
 		struct page *page = *pages;
 		struct folio *folio = page_folio(page);
 
-		if (!folio_test_anon(folio))
+		if (is_zero_page(page) ||
+		    !folio_test_anon(folio))
 			continue;
 		if (!folio_test_large(folio) || folio_test_hugetlb(folio))
 			VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page), page);
@@ -131,6 +132,13 @@  struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
 	else if (flags & FOLL_PIN) {
 		struct folio *folio;
 
+		/*
+		 * Don't take a pin on the zero page - it's not going anywhere
+		 * and it is used in a *lot* of places.
+		 */
+		if (is_zero_page(page))
+			return page_folio(page);
+
 		/*
 		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
 		 * right zone, so fail and let the caller fall back to the slow
@@ -180,6 +188,8 @@  struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
 static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
 {
 	if (flags & FOLL_PIN) {
+		if (is_zero_folio(folio))
+			return;
 		node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
 		if (folio_test_large(folio))
 			atomic_sub(refs, &folio->_pincount);
@@ -224,6 +234,13 @@  int __must_check try_grab_page(struct page *page, unsigned int flags)
 	if (flags & FOLL_GET)
 		folio_ref_inc(folio);
 	else if (flags & FOLL_PIN) {
+		/*
+		 * Don't take a pin on the zero page - it's not going anywhere
+		 * and it is used in a *lot* of places.
+		 */
+		if (is_zero_page(page))
+			return 0;
+
 		/*
 		 * Similar to try_grab_folio(): be sure to *also*
 		 * increment the normal page refcount field at least once,
@@ -3079,6 +3096,9 @@  EXPORT_SYMBOL_GPL(get_user_pages_fast);
  *
  * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
  * see Documentation/core-api/pin_user_pages.rst for further details.
+ *
+ * Note that if a zero_page is amongst the returned pages, it will not have
+ * pins in it and unpin_user_page() will not remove pins from it.
  */
 int pin_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages)
@@ -3110,6 +3130,9 @@  EXPORT_SYMBOL_GPL(pin_user_pages_fast);
  *
  * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
  * see Documentation/core-api/pin_user_pages.rst for details.
+ *
+ * Note that if a zero_page is amongst the returned pages, it will not have
+ * pins in it and unpin_user_page*() will not remove pins from it.
  */
 long pin_user_pages_remote(struct mm_struct *mm,
 			   unsigned long start, unsigned long nr_pages,
@@ -3143,6 +3166,9 @@  EXPORT_SYMBOL(pin_user_pages_remote);
  *
  * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
  * see Documentation/core-api/pin_user_pages.rst for details.
+ *
+ * Note that if a zero_page is amongst the returned pages, it will not have
+ * pins in it and unpin_user_page*() will not remove pins from it.
  */
 long pin_user_pages(unsigned long start, unsigned long nr_pages,
 		    unsigned int gup_flags, struct page **pages,
@@ -3161,6 +3187,9 @@  EXPORT_SYMBOL(pin_user_pages);
  * pin_user_pages_unlocked() is the FOLL_PIN variant of
  * get_user_pages_unlocked(). Behavior is the same, except that this one sets
  * FOLL_PIN and rejects FOLL_GET.
+ *
+ * Note that if a zero_page is amongst the returned pages, it will not have
+ * pins in it and unpin_user_page*() will not remove pins from it.
  */
 long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 			     struct page **pages, unsigned int gup_flags)