diff mbox series

[1/6] mm/gup: introduce pin_user_page()

Message ID 20220827083607.2345453-2-jhubbard@nvidia.com (mailing list archive)
State New, archived
Headers show
Series convert most filesystems to pin_user_pages_fast() | expand

Commit Message

John Hubbard Aug. 27, 2022, 8:36 a.m. UTC
pin_user_page() is an externally-usable version of try_grab_page(), but
with semantics that match get_page(), so that it can act as a drop-in
replacement for get_page(). Specifically, pin_user_page() has a void
return type.

pin_user_page() elevates a page's refcount using FOLL_PIN rules. This
means that the caller must release the page via unpin_user_page().

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm.h |  1 +
 mm/gup.c           | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 34 insertions(+)

Comments

David Hildenbrand Aug. 29, 2022, 12:07 p.m. UTC | #1
On 27.08.22 10:36, John Hubbard wrote:
> pin_user_page() is an externally-usable version of try_grab_page(), but
> with semantics that match get_page(), so that it can act as a drop-in
> replacement for get_page(). Specifically, pin_user_page() has a void
> return type.
> 
> pin_user_page() elevates a page's refcount using FOLL_PIN rules. This
> means that the caller must release the page via unpin_user_page().
> 
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  include/linux/mm.h |  1 +
>  mm/gup.c           | 33 +++++++++++++++++++++++++++++++++
>  2 files changed, 34 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 982f2607180b..85a105157334 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1869,6 +1869,7 @@ long pin_user_pages_remote(struct mm_struct *mm,
>  long get_user_pages(unsigned long start, unsigned long nr_pages,
>  			    unsigned int gup_flags, struct page **pages,
>  			    struct vm_area_struct **vmas);
> +void pin_user_page(struct page *page);
>  long pin_user_pages(unsigned long start, unsigned long nr_pages,
>  		    unsigned int gup_flags, struct page **pages,
>  		    struct vm_area_struct **vmas);
> diff --git a/mm/gup.c b/mm/gup.c
> index 5abdaf487460..245ccb41ed8c 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -3213,6 +3213,39 @@ long pin_user_pages(unsigned long start, unsigned long nr_pages,
>  }
>  EXPORT_SYMBOL(pin_user_pages);
>  
> +/**
> + * pin_user_page() - apply a FOLL_PIN reference to a page
> + *
> + * @page: the page to be pinned.
> + *
> + * This is similar to get_user_pages(), except that the page's refcount is
> + * elevated using FOLL_PIN, instead of FOLL_GET.
> + *
> + * IMPORTANT: The caller must release the page via unpin_user_page().
> + *
> + */
> +void pin_user_page(struct page *page)
> +{
> +	struct folio *folio = page_folio(page);
> +
> +	WARN_ON_ONCE(folio_ref_count(folio) <= 0);
> +

We should warn if the page is anon and !exclusive.

I assume the intend is to use pin_user_page() only to duplicate pins, right?
John Hubbard Aug. 29, 2022, 7:33 p.m. UTC | #2
On 8/29/22 05:07, David Hildenbrand wrote:
>> +/**
>> + * pin_user_page() - apply a FOLL_PIN reference to a page
>> + *
>> + * @page: the page to be pinned.
>> + *
>> + * This is similar to get_user_pages(), except that the page's refcount is
>> + * elevated using FOLL_PIN, instead of FOLL_GET.

Actually, my commit log has a more useful documentation of this routine,
and given the questions below, I think I'll change to that:

 * pin_user_page() is an externally-usable version of try_grab_page(), but with
 * semantics that match get_page(), so that it can act as a drop-in replacement
 * for get_page().
 *
 * pin_user_page() elevates a page's refcount using FOLL_PIN rules. This means
 * that the caller must release the page via unpin_user_page().


>> + *
>> + * IMPORTANT: The caller must release the page via unpin_user_page().
>> + *
>> + */
>> +void pin_user_page(struct page *page)
>> +{
>> +	struct folio *folio = page_folio(page);
>> +
>> +	WARN_ON_ONCE(folio_ref_count(folio) <= 0);
>> +
> 
> We should warn if the page is anon and !exclusive.

That would be sort of OK, because pin_user_page() is being created
specifically for file system (O_DIRECT cases) use, and so the pages
should mostly be file-backed, rather than anon. Although I'm a little
vague about whether all of these iov_iter cases are really always
file-backed pages, especially for cases such as splice(2) to an
O_DIRECT-opened file, that Al Viro mentioned [1].

Can you walk me through the reasoning for why we need to keep out
anon shared pages? 

> 
> I assume the intend is to use pin_user_page() only to duplicate pins, right?
> 

Well, yes or no, depending on your use of the term "pin":

pin_user_page() is used on a page that already has a refcount >= 1 (so
no worries about speculative pinning should apply here), but the page
does not necessarily have any FOLL_PIN's applied to it yet (so it's not
"pinned" in the FOLL_PIN sense).


[1] https://lore.kernel.org/all/Ywq5VrSrY341UVpL@ZenIV/


thanks,
David Hildenbrand Aug. 30, 2022, 12:17 p.m. UTC | #3
On 29.08.22 21:33, John Hubbard wrote:
> On 8/29/22 05:07, David Hildenbrand wrote:
>>> +/**
>>> + * pin_user_page() - apply a FOLL_PIN reference to a page
>>> + *
>>> + * @page: the page to be pinned.
>>> + *
>>> + * This is similar to get_user_pages(), except that the page's refcount is
>>> + * elevated using FOLL_PIN, instead of FOLL_GET.
> 
> Actually, my commit log has a more useful documentation of this routine,
> and given the questions below, I think I'll change to that:
> 
>  * pin_user_page() is an externally-usable version of try_grab_page(), but with
>  * semantics that match get_page(), so that it can act as a drop-in replacement
>  * for get_page().
>  *
>  * pin_user_page() elevates a page's refcount using FOLL_PIN rules. This means
>  * that the caller must release the page via unpin_user_page().

Some thoughts:

a) Can we generalize such that pages with a dedicated pincount
(multi-page folios) are also covered? Maybe avoiding the refcount
terminology would be best.

b) Should we directly work on folios?

c) Would it be valid to pass in a tail page right now?

> 
>>> + *
>>> + * IMPORTANT: The caller must release the page via unpin_user_page().
>>> + *
>>> + */
>>> +void pin_user_page(struct page *page)
>>> +{
>>> +	struct folio *folio = page_folio(page);
>>> +
>>> +	WARN_ON_ONCE(folio_ref_count(folio) <= 0);
>>> +
>>
>> We should warn if the page is anon and !exclusive.
> 
> That would be sort of OK, because pin_user_page() is being created
> specifically for file system (O_DIRECT cases) use, and so the pages
> should mostly be file-backed, rather than anon. Although I'm a little
> vague about whether all of these iov_iter cases are really always
> file-backed pages, especially for cases such as splice(2) to an
> O_DIRECT-opened file, that Al Viro mentioned [1].

If we can, we should document that this interface is not for anonymous
pages and WARN if pinning an anonymous page via this interface.

The only reasonable way to obtain a pin on an anonymous page is via the
page table. Here, FOLL_PIN should be used to do the right thing -- for
example, unshare first (break COW) instead of pinning a shared anonymous
page.

Nothing would speak against duplicating such a pin using this interface
(we'd have to sanity check that the page we're pinning may already be
pinned), but I assume the pages we pin here are *not* necessarily
obtained via GUP FOLL_PIN.

I would be curious under which scenarios we could end up here with an
anonymous page and how we obtained that reference (if not via GUP).

> 
> Can you walk me through the reasoning for why we need to keep out
> anon shared pages? 

We make sure to only pin anonymous pages that are exclusive and check
when unpinning -- see sanity_check_pinned_pages(), there is also a
comment in there -- that pinned anonymous pages are in fact still
exclusive, otherwise we might have a BUG lurking somewhere that can
result in memory corruptions or leaking information between processes.

For example, once we'd pinned an anonymous pages that are not marked
exclusive (!PageAnonExclusive), or we'd be sharing a page that is
pinned, the next write fault would replace the page in the user page
table due to breaking COW, and the GUP pin would point at a different
page than the page table.

Disallowing pinning of anon pages that may be shared in any case
(FOLL_LONGTERM or not) simplifies GUP handling and allows for such
sanity checks.

(side note: after recent VM_BUG_ON discussions we might want to convert
the VM_BUG_ON_PAGE in sanity_check_pinned_pages())

> 
>>
>> I assume the intend is to use pin_user_page() only to duplicate pins, right?
>>
> 
> Well, yes or no, depending on your use of the term "pin":
> 
> pin_user_page() is used on a page that already has a refcount >= 1 (so
> no worries about speculative pinning should apply here), but the page
> does not necessarily have any FOLL_PIN's applied to it yet (so it's not
> "pinned" in the FOLL_PIN sense).

Okay, then we should really figure out if/how anonymous pages could end
up here. I assume they can't, really. But let's see :)
John Hubbard Aug. 30, 2022, 9:42 p.m. UTC | #4
On 8/30/22 05:17, David Hildenbrand wrote:
> On 29.08.22 21:33, John Hubbard wrote:
>> On 8/29/22 05:07, David Hildenbrand wrote:
>>>> +/**
>>>> + * pin_user_page() - apply a FOLL_PIN reference to a page
>>>> + *
>>>> + * @page: the page to be pinned.
>>>> + *
>>>> + * This is similar to get_user_pages(), except that the page's refcount is
>>>> + * elevated using FOLL_PIN, instead of FOLL_GET.
>>
>> Actually, my commit log has a more useful documentation of this routine,
>> and given the questions below, I think I'll change to that:
>>
>>  * pin_user_page() is an externally-usable version of try_grab_page(), but with
>>  * semantics that match get_page(), so that it can act as a drop-in replacement
>>  * for get_page().
>>  *
>>  * pin_user_page() elevates a page's refcount using FOLL_PIN rules. This means
>>  * that the caller must release the page via unpin_user_page().
> 
> Some thoughts:
> 
> a) Can we generalize such that pages with a dedicated pincount
> (multi-page folios) are also covered? Maybe avoiding the refcount
> terminology would be best.
> 
> b) Should we directly work on folios?
> 
> c) Would it be valid to pass in a tail page right now?

I would fervently prefer to defer those kinds of questions and ideas,
because the call sites are dealing simply in pages. And this is 
really for file system call sites. The folio conversion is a larger
thing. Below...

> 
>>
>>>> + *
>>>> + * IMPORTANT: The caller must release the page via unpin_user_page().
>>>> + *
>>>> + */
>>>> +void pin_user_page(struct page *page)
>>>> +{
>>>> +	struct folio *folio = page_folio(page);
>>>> +
>>>> +	WARN_ON_ONCE(folio_ref_count(folio) <= 0);
>>>> +
>>>
>>> We should warn if the page is anon and !exclusive.
>>
>> That would be sort of OK, because pin_user_page() is being created
>> specifically for file system (O_DIRECT cases) use, and so the pages
>> should mostly be file-backed, rather than anon. Although I'm a little
>> vague about whether all of these iov_iter cases are really always
>> file-backed pages, especially for cases such as splice(2) to an
>> O_DIRECT-opened file, that Al Viro mentioned [1].
> 
> If we can, we should document that this interface is not for anonymous
> pages and WARN if pinning an anonymous page via this interface.

Yes. OK, I've rewritten the documentation again, and changed the 
warning to just 

    WARN_ON_ONCE(PageAnon(page));

So it looks like this now, what do you think?

/**
 * pin_user_page() - apply a FOLL_PIN reference to a file-backed page that the
 * caller already owns.
 *
 * @page: the page to be pinned.
 *
 * pin_user_page() elevates a page's refcount using FOLL_PIN rules. This means
 * that the caller must release the page via unpin_user_page().
 *
 * pin_user_page() is intended as a drop-in replacement for get_page(). This
 * provides a way for callers to do a subsequent unpin_user_page() on the
 * affected page. However, it is only intended for use by callers (file systems,
 * block/bio) that have a file-backed page. Anonymous pages are not expected nor
 * supported, and will generate a warning.
 *
 * pin_user_page() may also be thought of as an externally-usable version of
 * try_grab_page(), but with semantics that match get_page(), so that it can act
 * as a drop-in replacement for get_page().
 *
 * IMPORTANT: The caller must release the page via unpin_user_page().
 *
 */

> 
> The only reasonable way to obtain a pin on an anonymous page is via the
> page table. Here, FOLL_PIN should be used to do the right thing -- for
> example, unshare first (break COW) instead of pinning a shared anonymous
> page.
> 
> Nothing would speak against duplicating such a pin using this interface
> (we'd have to sanity check that the page we're pinning may already be
> pinned), but I assume the pages we pin here are *not* necessarily
> obtained via GUP FOLL_PIN.
> 
> I would be curious under which scenarios we could end up here with an
> anonymous page and how we obtained that reference (if not via GUP).

Let's see if the warning ever fires. I expect not, but if it does,
then I can add the !PageAnonExclusive qualifier to the warning.

> 
>>
>> Can you walk me through the reasoning for why we need to keep out
>> anon shared pages? 
> 
> We make sure to only pin anonymous pages that are exclusive and check
> when unpinning -- see sanity_check_pinned_pages(), there is also a
> comment in there -- that pinned anonymous pages are in fact still
> exclusive, otherwise we might have a BUG lurking somewhere that can
> result in memory corruptions or leaking information between processes.
> 
> For example, once we'd pinned an anonymous pages that are not marked
> exclusive (!PageAnonExclusive), or we'd be sharing a page that is
> pinned, the next write fault would replace the page in the user page
> table due to breaking COW, and the GUP pin would point at a different
> page than the page table.

Right, OK it all clicks together, thanks for that.

> 
> Disallowing pinning of anon pages that may be shared in any case
> (FOLL_LONGTERM or not) simplifies GUP handling and allows for such
> sanity checks.
> 
> (side note: after recent VM_BUG_ON discussions we might want to convert
> the VM_BUG_ON_PAGE in sanity_check_pinned_pages())
> 
>>
>>>
>>> I assume the intend is to use pin_user_page() only to duplicate pins, right?
>>>
>>
>> Well, yes or no, depending on your use of the term "pin":
>>
>> pin_user_page() is used on a page that already has a refcount >= 1 (so
>> no worries about speculative pinning should apply here), but the page
>> does not necessarily have any FOLL_PIN's applied to it yet (so it's not
>> "pinned" in the FOLL_PIN sense).
> 
> Okay, then we should really figure out if/how anonymous pages could end
> up here. I assume they can't, really. But let's see :)
> 
> 

Yes, again, the warning will reveal that. Which reminds me that I also
have it on my list to also write some test cases that do things like Al
Viro suggested: ITER_BVEC + O_DIRECT.


thanks,
John Hubbard Aug. 31, 2022, 12:06 a.m. UTC | #5
On 8/30/22 05:17, David Hildenbrand wrote:
> (side note: after recent VM_BUG_ON discussions we might want to convert
> the VM_BUG_ON_PAGE in sanity_check_pinned_pages())

Just realized I skipped over this point in my other response.

So, there are 28 "BUG_ON" calls in mm/gup.c, only 3 of which are 
BUILD_BUG_ON().

Maybe a single patch that covers all 25 at once is the way to go, I'm
thinking. And of course, any patch that touches any of those lines
should convert to a WARN* variant--whichever lands first.


thanks,
diff mbox series

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 982f2607180b..85a105157334 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1869,6 +1869,7 @@  long pin_user_pages_remote(struct mm_struct *mm,
 long get_user_pages(unsigned long start, unsigned long nr_pages,
 			    unsigned int gup_flags, struct page **pages,
 			    struct vm_area_struct **vmas);
+void pin_user_page(struct page *page);
 long pin_user_pages(unsigned long start, unsigned long nr_pages,
 		    unsigned int gup_flags, struct page **pages,
 		    struct vm_area_struct **vmas);
diff --git a/mm/gup.c b/mm/gup.c
index 5abdaf487460..245ccb41ed8c 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -3213,6 +3213,39 @@  long pin_user_pages(unsigned long start, unsigned long nr_pages,
 }
 EXPORT_SYMBOL(pin_user_pages);
 
+/**
+ * pin_user_page() - apply a FOLL_PIN reference to a page
+ *
+ * @page: the page to be pinned.
+ *
+ * This is similar to get_user_pages(), except that the page's refcount is
+ * elevated using FOLL_PIN, instead of FOLL_GET.
+ *
+ * IMPORTANT: The caller must release the page via unpin_user_page().
+ *
+ */
+void pin_user_page(struct page *page)
+{
+	struct folio *folio = page_folio(page);
+
+	WARN_ON_ONCE(folio_ref_count(folio) <= 0);
+
+	/*
+	 * Similar to try_grab_page(): be sure to *also*
+	 * increment the normal page refcount field at least once,
+	 * so that the page really is pinned.
+	 */
+	if (folio_test_large(folio)) {
+		folio_ref_add(folio, 1);
+		atomic_add(1, folio_pincount_ptr(folio));
+	} else {
+		folio_ref_add(folio, GUP_PIN_COUNTING_BIAS);
+	}
+
+	node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, 1);
+}
+EXPORT_SYMBOL(pin_user_page);
+
 /*
  * pin_user_pages_unlocked() is the FOLL_PIN variant of
  * get_user_pages_unlocked(). Behavior is the same, except that this one sets