[v4,2/3] mm: introduce put_user_page*(), placeholder versions

Message ID 20181008211623.30796-3-jhubbard@nvidia.com (mailing list archive)
State New, archived
Series get_user_pages*() and RDMA: first steps

Commit Message

john.hubbard@gmail.com Oct. 8, 2018, 9:16 p.m. UTC
From: John Hubbard <jhubbard@nvidia.com>

Introduces put_user_page(), which simply calls put_page().
This provides a way to update all get_user_pages*() callers,
so that they call put_user_page(), instead of put_page().

Also introduces put_user_pages(), and a few dirty/locked variations,
as a replacement for release_pages(), and also as a replacement
for open-coded loops that release multiple pages.
These may be used for subsequent performance improvements,
via batching of pages to be released.
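As a rough illustration of the new calls (a userspace sketch only; the stub struct page and refcounting below are stand-ins for the real kernel types, not the actual implementation):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's page refcounting,
 * only to illustrate the call-site conversion. */
struct page { int refcount; };

static void put_page(struct page *page)
{
	page->refcount--;
}

/* New wrapper: behaviorally identical to put_page() for now, but it
 * marks the reference as one taken by get_user_pages*(). */
static void put_user_page(struct page *page)
{
	put_page(page);
}

/* Replacement for open-coded loops that release multiple pages. */
static void put_user_pages(struct page **pages, unsigned long npages)
{
	unsigned long index;

	for (index = 0; index < npages; index++)
		put_user_page(pages[index]);
}
```

A caller that previously did `for (i = 0; i < npages; i++) put_page(pages[i]);` after get_user_pages() would instead call `put_user_pages(pages, npages);`.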

This prepares for eventually fixing the problem described
in [1], and is following a plan listed in [2], [3], [4].

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

[2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubbard@nvidia.com
    Proposed steps for fixing get_user_pages() + DMA problems.

[3] https://lkml.kernel.org/r/20180710082100.mkdwngdv5kkrcz6n@quack2.suse.cz
    Bounce buffers (otherwise [2] is not really viable).

[4] https://lkml.kernel.org/r/20181003162115.GG24030@quack2.suse.cz
    Follow-up discussions.

CC: Matthew Wilcox <willy@infradead.org>
CC: Michal Hocko <mhocko@kernel.org>
CC: Christopher Lameter <cl@linux.com>
CC: Jason Gunthorpe <jgg@ziepe.ca>
CC: Dan Williams <dan.j.williams@intel.com>
CC: Jan Kara <jack@suse.cz>
CC: Al Viro <viro@zeniv.linux.org.uk>
CC: Jerome Glisse <jglisse@redhat.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: Ralph Campbell <rcampbell@nvidia.com>

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm.h | 49 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 47 insertions(+), 2 deletions(-)

Comments

Andrew Morton Oct. 9, 2018, 12:14 a.m. UTC | #1
On Mon,  8 Oct 2018 14:16:22 -0700 john.hubbard@gmail.com wrote:

> From: John Hubbard <jhubbard@nvidia.com>
> 
> Introduces put_user_page(), which simply calls put_page().
> This provides a way to update all get_user_pages*() callers,
> so that they call put_user_page(), instead of put_page().
> 
> Also introduces put_user_pages(), and a few dirty/locked variations,
> as a replacement for release_pages(), and also as a replacement
> for open-coded loops that release multiple pages.
> These may be used for subsequent performance improvements,
> via batching of pages to be released.
> 
> This prepares for eventually fixing the problem described
> in [1], and is following a plan listed in [2], [3], [4].
> 
> [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
> 
> [2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubbard@nvidia.com
>     Proposed steps for fixing get_user_pages() + DMA problems.
> 
> [3] https://lkml.kernel.org/r/20180710082100.mkdwngdv5kkrcz6n@quack2.suse.cz
>     Bounce buffers (otherwise [2] is not really viable).
> 
> [4] https://lkml.kernel.org/r/20181003162115.GG24030@quack2.suse.cz
>     Follow-up discussions.
> 
> ...
>
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -137,6 +137,8 @@ extern int overcommit_ratio_handler(struct ctl_table *, int, void __user *,
>  				    size_t *, loff_t *);
>  extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
>  				    size_t *, loff_t *);
> +int set_page_dirty(struct page *page);
> +int set_page_dirty_lock(struct page *page);
>  
>  #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
>  
> @@ -943,6 +945,51 @@ static inline void put_page(struct page *page)
>  		__put_page(page);
>  }
>  
> +/*
> + * Pages that were pinned via get_user_pages*() should be released via
> + * either put_user_page(), or one of the put_user_pages*() routines
> + * below.
> + */
> +static inline void put_user_page(struct page *page)
> +{
> +	put_page(page);
> +}
> +
> +static inline void put_user_pages_dirty(struct page **pages,
> +					unsigned long npages)
> +{
> +	unsigned long index;
> +
> +	for (index = 0; index < npages; index++) {
> +		if (!PageDirty(pages[index]))

Both put_page() and set_page_dirty() handle compound pages.  But
because of the above statement, put_user_pages_dirty() might misbehave? 
Or maybe it won't - perhaps the intent here is to skip dirtying the
head page if the sub page is clean?  Please clarify, explain and add
comment if so.

> +			set_page_dirty(pages[index]);
> +
> +		put_user_page(pages[index]);
> +	}
> +}
> +
> +static inline void put_user_pages_dirty_lock(struct page **pages,
> +					     unsigned long npages)
> +{
> +	unsigned long index;
> +
> +	for (index = 0; index < npages; index++) {
> +		if (!PageDirty(pages[index]))
> +			set_page_dirty_lock(pages[index]);

Ditto.

> +		put_user_page(pages[index]);
> +	}
> +}
> +
> +static inline void put_user_pages(struct page **pages,
> +				  unsigned long npages)
> +{
> +	unsigned long index;
> +
> +	for (index = 0; index < npages; index++)
> +		put_user_page(pages[index]);
> +}
> +

Otherwise looks OK.  Ish.  But it would be nice if that comment were to
explain *why* get_user_pages() pages must be released with
put_user_page().

Also, maintainability.  What happens if someone now uses put_page() by
mistake?  Kernel fails in some mysterious fashion?  How can we prevent
this from occurring as code evolves?  Is there a cheap way of detecting
this bug at runtime?
Jan Kara Oct. 9, 2018, 8:30 a.m. UTC | #2
On Mon 08-10-18 17:14:42, Andrew Morton wrote:
> On Mon,  8 Oct 2018 14:16:22 -0700 john.hubbard@gmail.com wrote:
> > +		put_user_page(pages[index]);
> > +	}
> > +}
> > +
> > +static inline void put_user_pages(struct page **pages,
> > +				  unsigned long npages)
> > +{
> > +	unsigned long index;
> > +
> > +	for (index = 0; index < npages; index++)
> > +		put_user_page(pages[index]);
> > +}
> > +
> 
> Otherwise looks OK.  Ish.  But it would be nice if that comment were to
> explain *why* get_user_pages() pages must be released with
> put_user_page().

The reason is that eventually we want to track reference from GUP
separately but you're right that it would be good to have a comment about
that somewhere.

> Also, maintainability.  What happens if someone now uses put_page() by
> mistake?  Kernel fails in some mysterious fashion?  How can we prevent
> this from occurring as code evolves?  Is there a cheap way of detecting
> this bug at runtime?

The same will happen as with any other reference counting bug - the special
user reference will leak. It will be pretty hard to debug, I agree. I was
thinking about whether we could provide some type safety against such bugs,
such as get_user_pages() not returning struct page pointers but rather some
other special type, but it would result in a large amount of additional churn
as we'd have to propagate this different type e.g. through the IO path so
that IO completion routines could properly call put_user_pages(). So I'm
not sure it's really worth it.

								Honza
Andrew Morton Oct. 9, 2018, 11:20 p.m. UTC | #3
On Tue, 9 Oct 2018 10:30:25 +0200 Jan Kara <jack@suse.cz> wrote:

> > Also, maintainability.  What happens if someone now uses put_page() by
> > mistake?  Kernel fails in some mysterious fashion?  How can we prevent
> > this from occurring as code evolves?  Is there a cheap way of detecting
> > this bug at runtime?
> 
> The same will happen as with any other reference counting bug - the special
> user reference will leak. It will be pretty hard to debug I agree. I was
> thinking about whether we could provide some type safety against such bugs
> such as get_user_pages() not returning struct page pointers but rather some
> other special type but it would result in a big amount of additional churn
> as we'd have to propagate this different type e.g. through the IO path so
> that IO completion routines could properly call put_user_pages(). So I'm
> not sure it's really worth it.

I'm not really understanding.  Patch 3/3 changes just one infiniband
driver to use put_user_page().  But the changelogs here imply (to me)
that every user of get_user_pages() needs to be converted to
s/put_page/put_user_page/.

Methinks a bit more explanation is needed in these changelogs?
John Hubbard Oct. 10, 2018, 12:32 a.m. UTC | #4
On 10/9/18 4:20 PM, Andrew Morton wrote:
> On Tue, 9 Oct 2018 10:30:25 +0200 Jan Kara <jack@suse.cz> wrote:
> 
>>> Also, maintainability.  What happens if someone now uses put_page() by
>>> mistake?  Kernel fails in some mysterious fashion?  How can we prevent
>>> this from occurring as code evolves?  Is there a cheap way of detecting
>>> this bug at runtime?
>>
>> The same will happen as with any other reference counting bug - the special
>> user reference will leak. It will be pretty hard to debug I agree. I was
>> thinking about whether we could provide some type safety against such bugs
>> such as get_user_pages() not returning struct page pointers but rather some
>> other special type but it would result in a big amount of additional churn
>> as we'd have to propagate this different type e.g. through the IO path so
>> that IO completion routines could properly call put_user_pages(). So I'm
>> not sure it's really worth it.
> 
> I'm not really understanding.  Patch 3/3 changes just one infiniband
> driver to use put_user_page().  But the changelogs here imply (to me)
> that every user of get_user_pages() needs to be converted to
> s/put_page/put_user_page/.
> 
> Methinks a bit more explanation is needed in these changelogs?
> 

OK, yes, it does sound like the explanation is falling short. I'll work on something 
clearer. Did the proposed steps in the changelogs, such as:
  
[2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubbard@nvidia.com
    Proposed steps for fixing get_user_pages() + DMA problems.

help at all, or is it just too many references, and I should write the words
directly in the changelog?

Anyway, patch 3/3 is just a working example (which we do want to submit, though), and
many more conversions will follow. But they don't have to be done all upfront--they
can be done in follow-up patchsets.

The put_user_page*() routines are, at this point, not going to significantly change
behavior. 

I'm working on an RFC that will show what the long-term fix to get_user_pages and
put_user_pages will look like. But meanwhile it's good to get started on converting
all of the call sites.

thanks,
John Hubbard Oct. 10, 2018, 12:42 a.m. UTC | #5
On 10/8/18 5:14 PM, Andrew Morton wrote:
> On Mon,  8 Oct 2018 14:16:22 -0700 john.hubbard@gmail.com wrote:
> 
>> From: John Hubbard <jhubbard@nvidia.com>
[...]
>> +/*
>> + * Pages that were pinned via get_user_pages*() should be released via
>> + * either put_user_page(), or one of the put_user_pages*() routines
>> + * below.
>> + */
>> +static inline void put_user_page(struct page *page)
>> +{
>> +	put_page(page);
>> +}
>> +
>> +static inline void put_user_pages_dirty(struct page **pages,
>> +					unsigned long npages)
>> +{
>> +	unsigned long index;
>> +
>> +	for (index = 0; index < npages; index++) {
>> +		if (!PageDirty(pages[index]))
> 
> Both put_page() and set_page_dirty() handle compound pages.  But
> because of the above statement, put_user_pages_dirty() might misbehave? 
> Or maybe it won't - perhaps the intent here is to skip dirtying the
> head page if the sub page is clean?  Please clarify, explain and add
> comment if so.
> 

Yes, technically, the accounting is wrong: we normally use the head page to 
track dirtiness, and here, that is not done. (Nor was it done before this
patch). However, it's not causing problems in code today because sub pages
are released at about the same time as head pages, so the head page does get 
properly checked at some point. And that means that set_page_dirty*() gets
called if it needs to be called. 

Obviously this is a little fragile, in that it depends on the caller behaving 
a certain way. And in any case, the long-term fix (coming later) *also* only
operates on the head page. So actually, instead of a comment, I think it's good 
to just insert

	page = compound_head(page);

...into these new routines, right now. I'll do that.
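(A userspace mock-up of what the fixed routine would look like; the three-field struct page and the helpers below are illustrative stand-ins for the kernel's real types, where compound_head() maps a tail page back to its head page:)

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins: in the kernel, dirtiness is tracked on the
 * head page of a compound page, which is why the fix below folds every
 * page to its head first. */
struct page {
	struct page *head;   /* points to itself for a head page */
	int dirty;
	int refcount;
};

static struct page *compound_head(struct page *page)
{
	return page->head;
}

static void set_page_dirty(struct page *page)
{
	page->dirty = 1;
}

static void put_user_page(struct page *page)
{
	page->refcount--;
}

/* The proposed fix: operate on the head page, so the dirty check and
 * the dirtying agree with where the state is actually tracked. */
static void put_user_pages_dirty(struct page **pages, unsigned long npages)
{
	unsigned long index;

	for (index = 0; index < npages; index++) {
		struct page *page = compound_head(pages[index]);

		if (!page->dirty)
			set_page_dirty(page);

		put_user_page(page);
	}
}
```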

[...]
> 
> Otherwise looks OK.  Ish.  But it would be nice if that comment were to
> explain *why* get_user_pages() pages must be released with
> put_user_page().
> 

Yes, will do.

> Also, maintainability.  What happens if someone now uses put_page() by
> mistake?  Kernel fails in some mysterious fashion?  How can we prevent
> this from occurring as code evolves?  Is there a cheap way of detecting
> this bug at runtime?
> 

It might be possible to do a few run-time checks, such as "does the page that
came back to put_user_page() have the correct flags?", but it's harder (without
having a dedicated page flag) to detect the other direction: "did someone pass
a get_user_pages page to put_page()?"

As Jan said in his reply, converting get_user_pages (and put_user_page) to 
work with a new data type that wraps struct pages, would solve it, but that's
an awfully large change. Still...given how much of a mess this can turn into 
if it's wrong, I wonder if it's worth it--maybe? 

thanks,
Jan Kara Oct. 10, 2018, 8:59 a.m. UTC | #6
On Tue 09-10-18 17:42:09, John Hubbard wrote:
> On 10/8/18 5:14 PM, Andrew Morton wrote:
> > Also, maintainability.  What happens if someone now uses put_page() by
> > mistake?  Kernel fails in some mysterious fashion?  How can we prevent
> > this from occurring as code evolves?  Is there a cheap way of detecting
> > this bug at runtime?
> > 
> 
> It might be possible to do a few run-time checks, such as "does page that came 
> back to put_user_page() have the correct flags?", but it's harder (without 
> having a dedicated page flag) to detect the other direction: "did someone page 
> in a get_user_pages page, to put_page?"
> 
> As Jan said in his reply, converting get_user_pages (and put_user_page) to 
> work with a new data type that wraps struct pages, would solve it, but that's
> an awfully large change. Still...given how much of a mess this can turn into 
> if it's wrong, I wonder if it's worth it--maybe? 

I'm certainly not opposed to looking into it. But after looking into this
for a while it is not clear to me how to convert e.g. fs/direct-io.c or
fs/iomap.c. They pass the reference from gup() via
bio->bi_io_vec[]->bv_page and then release it after IO completion.
Propagating the new type to ->bv_page is not good as lower layers do not
really care how the page is pinned in memory. But we do need to somehow
pass the information to the IO completion functions in a robust manner.

Hmm, what about the following:

1) Make gup() return new type - struct user_page *? In practice that would
be just a struct page pointer with 0 bit set so that people are forced to
use proper helpers and not just force types (and the setting of bit 0 and
masking back would be hidden behind CONFIG_DEBUG_USER_PAGE_REFERENCES for
performance reasons). Also the transition would have to be gradual so we'd
have to name the function differently and use it from converted code.

2) Provide helper bio_add_user_page() that will take user_page, convert it
to struct page, add it to the bio, and flag the bio as having pages with
user references. That code would also make sure the bio is consistent in
having only user-referenced pages in that case. IO completion (like
bio_check_pages_dirty(), bio_release_pages() etc.) will check the flag and
use the appropriate release function.

3) I have noticed fs/direct-io.c may submit the zero page for IO when it needs
to clear stuff, so we'll probably need a helper function to acquire a 'user pin'
reference given a page pointer so that that code can be kept reasonably
simple and pass user_page references all around.

So this way we could maintain reasonable confidence that refcounts didn't
get mixed up. Thoughts?
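Point 1) above could be sketched in plain C roughly as follows (the helper names are invented, and the kernel version would hide the tagging behind CONFIG_DEBUG_USER_PAGE_REFERENCES; the alignment of struct page guarantees bit 0 of the pointer is free):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

struct page { int refcount; };

/* Hypothetical opaque handle: a struct page pointer with bit 0 set, so
 * that code cannot mix it up with a plain page pointer without going
 * through these helpers. */
struct user_page;

static struct user_page *page_to_user_page(struct page *page)
{
	return (struct user_page *)((uintptr_t)page | 1);
}

static struct page *user_page_to_page(struct user_page *upage)
{
	return (struct page *)((uintptr_t)upage & ~(uintptr_t)1);
}

static void put_user_page(struct user_page *upage)
{
	user_page_to_page(upage)->refcount--;
}
```

With the handle type in place, passing a struct user_page * straight to put_page() no longer type-checks, which is the point; with the debug option off, the helpers could compile down to no-op casts.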

								Honza
John Hubbard Oct. 10, 2018, 11:23 p.m. UTC | #7
On 10/10/18 1:59 AM, Jan Kara wrote:
> On Tue 09-10-18 17:42:09, John Hubbard wrote:
>> On 10/8/18 5:14 PM, Andrew Morton wrote:
>>> Also, maintainability.  What happens if someone now uses put_page() by
>>> mistake?  Kernel fails in some mysterious fashion?  How can we prevent
>>> this from occurring as code evolves?  Is there a cheap way of detecting
>>> this bug at runtime?
>>>
>>
>> It might be possible to do a few run-time checks, such as "does page that came 
>> back to put_user_page() have the correct flags?", but it's harder (without 
>> having a dedicated page flag) to detect the other direction: "did someone page 
>> in a get_user_pages page, to put_page?"
>>
>> As Jan said in his reply, converting get_user_pages (and put_user_page) to 
>> work with a new data type that wraps struct pages, would solve it, but that's
>> an awfully large change. Still...given how much of a mess this can turn into 
>> if it's wrong, I wonder if it's worth it--maybe? 
> 
> I'm certainly not opposed to looking into it. But after looking into this
> for a while it is not clear to me how to convert e.g. fs/direct-io.c or
> fs/iomap.c. They pass the reference from gup() via
> bio->bi_io_vec[]->bv_page and then release it after IO completion.
> Propagating the new type to ->bv_page is not good as lower layer do not
> really care how the page is pinned in memory. But we do need to somehow
> pass the information to the IO completion functions in a robust manner.
> 

You know, that problem has to be solved in either case: even if we do not
use a new data type for get_user_pages, we still need to clearly, accurately
match up the get/put pairs. And for the complicated systems (block IO, and
GPU DRM layer, especially) one of the things that has caused me concern is 
the way the pages all end up in a large, complicated pool, and put_page is
used to free all of them, indiscriminately.

So I'm glad you're looking at ways to disambiguate this for the bio system.

> Hmm, what about the following:
> 
> 1) Make gup() return new type - struct user_page *? In practice that would
> be just a struct page pointer with 0 bit set so that people are forced to
> use proper helpers and not just force types (and the setting of bit 0 and
> masking back would be hidden behind CONFIG_DEBUG_USER_PAGE_REFERENCES for
> performance reasons). Also the transition would have to be gradual so we'd
> have to name the function differently and use it from converted code.

Yes. That seems perfect: it just fades away if you're not debugging, but we
can catch lots of problems when CONFIG_DEBUG_USER_PAGE_REFERENCES is set. 

> 
> 2) Provide helper bio_add_user_page() that will take user_page, convert it
> to struct page, add it to the bio, and flag the bio as having pages with
> user references. That code would also make sure the bio is consistent in
> having only user-referenced pages in that case. IO completion (like
> bio_check_pages_dirty(), bio_release_pages() etc.) will check the flag and
> use approprite release function.

I'm very new to bio, so I have to ask: can we be sure that the same types of 
pages are always used, within each bio? Because otherwise, we'll have to plumb 
it all the way down to bio_vec's--or so it appears, based on my reading of 
bio_release_pages() and surrounding code.

> 
> 3) I have noticed fs/direct-io.c may submit zero page for IO when it needs
> to clear stuff so we'll probably need a helper function to acquire 'user pin'
> reference given a page pointer so that that code can be kept reasonably
> simple and pass user_page references all around.
>

This only works if we don't set page flags, because if we do set page flags 
on the single, global zero page, that will break the world. So I'm not sure
that the zero page usage in fs/direct-io.c is going to survive the conversion
to this new approach. :)
 
> So this way we could maintain reasonable confidence that refcounts didn't
> get mixed up. Thoughts?
> 

After thinking some more about the complicated situations in bio and DRM,
and looking into the future (bug reports...), I am leaning toward your 
struct user_page approach. 

I'm looking forward to hearing other opinions on whether it's worth it to go
and do this fairly intrusive change, in return for, probably, fewer bugs along
the way.


thanks,
Andrew Morton Oct. 10, 2018, 11:43 p.m. UTC | #8
On Tue, 9 Oct 2018 17:32:16 -0700 John Hubbard <jhubbard@nvidia.com> wrote:

> > I'm not really understanding.  Patch 3/3 changes just one infiniband
> > driver to use put_user_page().  But the changelogs here imply (to me)
> > that every user of get_user_pages() needs to be converted to
> > s/put_page/put_user_page/.
> > 
> > Methinks a bit more explanation is needed in these changelogs?
> > 
> 
> OK, yes, it does sound like the explanation is falling short. I'll work on something 
> clearer. Did the proposed steps in the changelogs, such as:
>   
> [2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubbard@nvidia.com
>     Proposed steps for fixing get_user_pages() + DMA problems.
> 
> help at all, or is it just too many references, and I should write the words
> directly in the changelog?
> 
> Anyway, patch 3/3 is a just a working example (which we do want to submit, though), and
> many more conversions will follow. But they don't have to be done all upfront--they
> can be done in follow up patchsets. 
> 
> The put_user_page*() routines are, at this point, not going to significantly change
> behavior. 
> 
> I'm working on an RFC that will show what the long-term fix to get_user_pages and
> put_user_pages will look like. But meanwhile it's good to get started on converting
> all of the call sites.

I see.  Yes, please do put all of it into the changelog[s].
Andrew Morton Oct. 10, 2018, 11:45 p.m. UTC | #9
On Tue, 9 Oct 2018 17:42:09 -0700 John Hubbard <jhubbard@nvidia.com> wrote:

> > Also, maintainability.  What happens if someone now uses put_page() by
> > mistake?  Kernel fails in some mysterious fashion?  How can we prevent
> > this from occurring as code evolves?  Is there a cheap way of detecting
> > this bug at runtime?
> > 
> 
> It might be possible to do a few run-time checks, such as "does page that came 
> back to put_user_page() have the correct flags?", but it's harder (without 
> having a dedicated page flag) to detect the other direction: "did someone page 
> in a get_user_pages page, to put_page?"
> 
> As Jan said in his reply, converting get_user_pages (and put_user_page) to 
> work with a new data type that wraps struct pages, would solve it, but that's
> an awfully large change. Still...given how much of a mess this can turn into 
> if it's wrong, I wonder if it's worth it--maybe? 

This is a real worry.  If someone uses a mistaken put_page() then how
will that bug manifest at runtime?  Under what set of circumstances
will the kernel trigger the bug?

(btw, please cc me on all patches, not just [0/n]!)
Jan Kara Oct. 11, 2018, 8:42 a.m. UTC | #10
On Wed 10-10-18 16:23:35, John Hubbard wrote:
> On 10/10/18 1:59 AM, Jan Kara wrote:
> > On Tue 09-10-18 17:42:09, John Hubbard wrote:
> >> On 10/8/18 5:14 PM, Andrew Morton wrote:
> >>> Also, maintainability.  What happens if someone now uses put_page() by
> >>> mistake?  Kernel fails in some mysterious fashion?  How can we prevent
> >>> this from occurring as code evolves?  Is there a cheap way of detecting
> >>> this bug at runtime?
> >>>
> >>
> >> It might be possible to do a few run-time checks, such as "does page that came 
> >> back to put_user_page() have the correct flags?", but it's harder (without 
> >> having a dedicated page flag) to detect the other direction: "did someone page 
> >> in a get_user_pages page, to put_page?"
> >>
> >> As Jan said in his reply, converting get_user_pages (and put_user_page) to 
> >> work with a new data type that wraps struct pages, would solve it, but that's
> >> an awfully large change. Still...given how much of a mess this can turn into 
> >> if it's wrong, I wonder if it's worth it--maybe? 
> > 
> > I'm certainly not opposed to looking into it. But after looking into this
> > for a while it is not clear to me how to convert e.g. fs/direct-io.c or
> > fs/iomap.c. They pass the reference from gup() via
> > bio->bi_io_vec[]->bv_page and then release it after IO completion.
> > Propagating the new type to ->bv_page is not good as lower layer do not
> > really care how the page is pinned in memory. But we do need to somehow
> > pass the information to the IO completion functions in a robust manner.
> > 
> 
> You know, that problem has to be solved in either case: even if we do not
> use a new data type for get_user_pages, we still need to clearly, accurately
> match up the get/put pairs. And for the complicated systems (block IO, and
> GPU DRM layer, especially) one of the things that has caused me concern is 
> the way the pages all end up in a large, complicated pool, and put_page is
> used to free all of them, indiscriminately.

Agreed.

> > Hmm, what about the following:
> > 
> > 1) Make gup() return new type - struct user_page *? In practice that would
> > be just a struct page pointer with 0 bit set so that people are forced to
> > use proper helpers and not just force types (and the setting of bit 0 and
> > masking back would be hidden behind CONFIG_DEBUG_USER_PAGE_REFERENCES for
> > performance reasons). Also the transition would have to be gradual so we'd
> > have to name the function differently and use it from converted code.
> 
> Yes. That seems perfect: it just fades away if you're not debugging, but we
> can catch lots of problems when CONFIG_DEBUG_USER_PAGE_REFERENCES is set. 

Yeah, and when you suspect issues with page pinning, you can try to run
with CONFIG_DEBUG_USER_PAGE_REFERENCES enabled. It's not like the overhead
is going to be huge. But it could be measurable for some workloads...

> > 2) Provide helper bio_add_user_page() that will take user_page, convert it
> > to struct page, add it to the bio, and flag the bio as having pages with
> > user references. That code would also make sure the bio is consistent in
> > having only user-referenced pages in that case. IO completion (like
> > bio_check_pages_dirty(), bio_release_pages() etc.) will check the flag and
> > use the appropriate release function.
> 
> I'm very new to bio, so I have to ask: can we be sure that the same types of 
> pages are always used, within each bio? Because otherwise, we'll have to plumb 
> it all the way down to bio_vec's--or so it appears, based on my reading of 
> bio_release_pages() and surrounding code.

No, we cannot be sure (the zero page usage within DIO code is one example
when it currently is not true) although usually it is the case that same
type of pages is used for one bio. But bio_add_page() (and thus similarly
bio_add_user_page()) is fine to refuse adding a page to the bio and the
caller then has to submit the current bio and start a new one. So we can
just reuse this mechanism when we detect that currently passed page is of a
different type than other pages in the bio.
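A miniature sketch of that refusal rule (the bio fields and the helper name here are invented; a real bio_add_user_page() would live in the block layer and work on bio_vecs):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative miniature of the proposed rule: a bio either holds only
 * user-pinned pages or only ordinary ones, never a mix. */
struct page { int user_pinned; };

struct bio {
	struct page *pages[8];
	unsigned int cnt;
	bool has_user_pages;
};

/* Returns false (refuses the page) on a type mismatch, mirroring how
 * bio_add_page() may refuse a page, forcing the caller to submit the
 * current bio and start a new one. */
static bool bio_add_page_checked(struct bio *bio, struct page *page)
{
	if (bio->cnt == 8)
		return false;
	if (bio->cnt > 0 && bio->has_user_pages != (bool)page->user_pinned)
		return false;

	if (bio->cnt == 0)
		bio->has_user_pages = page->user_pinned;
	bio->pages[bio->cnt++] = page;
	return true;
}
```

On a false return, the caller submits the bio it has and retries the page in a fresh bio, exactly as with a full bio today.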

> > 3) I have noticed fs/direct-io.c may submit zero page for IO when it needs
> > to clear stuff so we'll probably need a helper function to acquire 'user pin'
> > reference given a page pointer so that that code can be kept reasonably
> > simple and pass user_page references all around.
> >
> 
> This only works if we don't set page flags, because if we do set page flags 
> on the single, global zero page, that will break the world. So I'm not sure
> that the zero page usage in fs/directio.c is going to survive the conversion
> to this new approach. :)

Hum, we can always allocate single page filled with zeros for the use by
DIO code itself. But at this point I'm actually not sure why "user pinning"
of zero page would be an issue. After all if you have private anonymous
read-only mapping, it is going to be backed by zero pages and
get_user_pages() and put_user_pages() have to properly detect pin attempts
for these pages and handle them consistently... So DIO code does not seem
to be doing anything that special.

> > So this way we could maintain reasonable confidence that refcounts didn't
> > get mixed up. Thoughts?
> > 
> 
> After thinking some more about the complicated situations in bio and DRM,
> and looking into the future (bug reports...), I am leaning toward your 
> struct user_page approach. 
> 
> I'm looking forward to hearing other opinions on whether it's worth it to go
> and do this fairly intrusive change, in return for, probably, fewer bugs along
> the way.

Yeah, at this point I think it is worth it as it's probably going to save
us quite some hair-tearing when debugging stuff.

								Honza
Jan Kara Oct. 11, 2018, 8:49 a.m. UTC | #11
On Wed 10-10-18 16:45:41, Andrew Morton wrote:
> On Tue, 9 Oct 2018 17:42:09 -0700 John Hubbard <jhubbard@nvidia.com> wrote:
> 
> > > Also, maintainability.  What happens if someone now uses put_page() by
> > > mistake?  Kernel fails in some mysterious fashion?  How can we prevent
> > > this from occurring as code evolves?  Is there a cheap way of detecting
> > > this bug at runtime?
> > > 
> > 
> > It might be possible to do a few run-time checks, such as "does page that came 
> > back to put_user_page() have the correct flags?", but it's harder (without 
> > having a dedicated page flag) to detect the other direction: "did someone page 
> > in a get_user_pages page, to put_page?"
> > 
> > As Jan said in his reply, converting get_user_pages (and put_user_page) to 
> > work with a new data type that wraps struct pages, would solve it, but that's
> > an awfully large change. Still...given how much of a mess this can turn into 
> > if it's wrong, I wonder if it's worth it--maybe? 
> 
> This is a real worry.  If someone uses a mistaken put_page() then how
> will that bug manifest at runtime?  Under what set of circumstances
> will the kernel trigger the bug?

At runtime such a bug will manifest as a page that can never be evicted from
memory. We could warn in put_page() if the page reference count drops below
the bare minimum for the given user pin count, which would be able to catch some
issues, but it won't be 100% reliable. So at this point I'm leaning more
towards making get_user_pages() return a different type than just
struct page * to make it much harder for the refcount to go wrong...
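That warning heuristic might look roughly like this userspace sketch (the separate user_pin_count field is hypothetical; today's kernel keeps no such counter, which is exactly why the check can't be fully reliable):

```c
#include <assert.h>
#include <stdio.h>

/* Illustrative only: a page tracking both its total refcount and, in
 * the hypothetical scheme, a separate count of outstanding user pins. */
struct page {
	int refcount;
	int user_pin_count;
};

/* If the total refcount ever drops below the outstanding user-pin
 * count, some caller must have released a GUP reference with plain
 * put_page(). Returns 1 where the kernel would WARN_ON(). */
static int put_page_checked(struct page *page)
{
	page->refcount--;
	if (page->refcount < page->user_pin_count) {
		fprintf(stderr, "WARN: refcount below user pin count\n");
		return 1;
	}
	return 0;
}
```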

								Honza
Jason Gunthorpe Oct. 11, 2018, 1:20 p.m. UTC | #12
On Thu, Oct 11, 2018 at 10:49:29AM +0200, Jan Kara wrote:

> > This is a real worry.  If someone uses a mistaken put_page() then how
> > will that bug manifest at runtime?  Under what set of circumstances
> > will the kernel trigger the bug?
> 
> At runtime such bug will manifest as a page that can never be evicted from
> memory. We could warn in put_page() if page reference count drops below
> bare minimum for given user pin count which would be able to catch some
> issues but it won't be 100% reliable. So at this point I'm more leaning
> towards making get_user_pages() return a different type than just
> struct page * to make it much harder for refcount to go wrong...

At least for the infiniband code being used as an example here we take
the struct page from get_user_pages, then stick it in a sgl, and at
put_page time we get the page back out of the sgl via sg_page()

So type safety will not help this case... I wonder how many other
users are similar? I think this is a pretty reasonable flow for DMA
with user pages.

Jason
John Hubbard Oct. 12, 2018, 1:23 a.m. UTC | #13
On 10/11/18 6:20 AM, Jason Gunthorpe wrote:
> On Thu, Oct 11, 2018 at 10:49:29AM +0200, Jan Kara wrote:
> 
>>> This is a real worry.  If someone uses a mistaken put_page() then how
>>> will that bug manifest at runtime?  Under what set of circumstances
>>> will the kernel trigger the bug?
>>
>> At runtime such bug will manifest as a page that can never be evicted from
>> memory. We could warn in put_page() if page reference count drops below
>> bare minimum for given user pin count which would be able to catch some
>> issues but it won't be 100% reliable. So at this point I'm more leaning
>> towards making get_user_pages() return a different type than just
>> struct page * to make it much harder for refcount to go wrong...
> 
> At least for the infiniband code being used as an example here we take
> the struct page from get_user_pages, then stick it in a sgl, and at
> put_page time we get the page back out of the sgl via sg_page()
> 
> So type safety will not help this case... I wonder how many other
> users are similar? I think this is a pretty reasonable flow for DMA
> with user pages.
> 

That is true. The infiniband code, fortunately, never mixes the two page
types into the same pool (or sg list), so it's actually an easier example
than some other subsystems. But, yes, type safety doesn't help there. I can 
take a moment to look around at the other areas, to quantify how much a type
safety change might help.

Back to page flags again, out of desperation:

How much do we know about the page types that all of these subsystems
use? In other words, can we, for example, use bit 1 of page->lru.next (see [1]
for context) as the "dma-pinned" page flag, while tracking pages within parts 
of the kernel that call a mix of alloc_pages, get_user_pages, and other allocators? 
In order for that to work, page->index, page->private, and bit 1 of page->mapping
must not be used. I doubt that this is always going to hold, but...does it?

Other ideas: provide a fast lookup tree that tracks pages that were 
obtained via get_user_pages. Before calling put_page or put_user_page,
use that tree to decide which to call. Or anything along the lines of
"yet another way to track pages without using page flags".

(Also, Andrew: "ack" on your point about CC-ing you on all patches; I've fixed
my scripts accordingly. Sorry about that.)


[1] Matthew Wilcox's idea for stealing some tracking bits, by removing
dma-pinned pages from the LRU:

   https://lore.kernel.org/r/20180619090255.GA25522@bombadil.infradead.org


thanks,
John Hubbard Oct. 12, 2018, 3:53 a.m. UTC | #14
On 10/11/18 6:23 PM, John Hubbard wrote:
> On 10/11/18 6:20 AM, Jason Gunthorpe wrote:
>> On Thu, Oct 11, 2018 at 10:49:29AM +0200, Jan Kara wrote:
>>
>>>> This is a real worry.  If someone uses a mistaken put_page() then how
>>>> will that bug manifest at runtime?  Under what set of circumstances
>>>> will the kernel trigger the bug?
>>>
>>> At runtime such bug will manifest as a page that can never be evicted from
>>> memory. We could warn in put_page() if page reference count drops below
>>> bare minimum for given user pin count which would be able to catch some
>>> issues but it won't be 100% reliable. So at this point I'm more leaning
>>> towards making get_user_pages() return a different type than just
>>> struct page * to make it much harder for refcount to go wrong...
>>
>> At least for the infiniband code being used as an example here we take
>> the struct page from get_user_pages, then stick it in a sgl, and at
>> put_page time we get the page back out of the sgl via sg_page()
>>
>> So type safety will not help this case... I wonder how many other
>> users are similar? I think this is a pretty reasonable flow for DMA
>> with user pages.
>>
> 
> That is true. The infiniband code, fortunately, never mixes the two page
> types into the same pool (or sg list), so it's actually an easier example
> than some other subsystems. But, yes, type safety doesn't help there. I can 
> take a moment to look around at the other areas, to quantify how much a type
> safety change might help.
> 
> Back to page flags again, out of desperation:
> 
> How much do we know about the page types that all of these subsystems
> use? In other words, can we, for example, use bit 1 of page->lru.next (see [1]
> for context) as the "dma-pinned" page flag, while tracking pages within parts 
> of the kernel that call a mix of alloc_pages, get_user_pages, and other allocators? 
> In order for that to work, page->index, page->private, and bit 1 of page->mapping
> must not be used. I doubt that this is always going to hold, but...does it?
> 

Oops, pardon me, please ignore that nonsense about page->index and page->private
and page->mapping; that's actually fine (I was seeing "union" where "struct" was
written; too much staring at this code).

So actually, I think maybe we can just use bit 1 in page->lru.next to sort out
which pages are dma-pinned, in the calling code, just like we're going to do
in writeback situations. This should also allow the run-time checking that
Andrew was hoping for:

    put_user_page(): assert that the page is dma-pinned
    put_page(): assert that the page is *not* dma-pinned

...both of which depend on that bit being, essentially, available as sort of a
general page flag. And in fact, if it's not, then the whole approach is dead anyway.

Am I missing anything? This avoids the need to change the get_user_pages interface.
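For illustration, a kernel-style sketch (not compilable as-is; assumes bit 1
of page->lru.next really is available for this) of how that bit might be
tested on the put_user_page() side:

```c
/* Hypothetical sketch: bit 1 of page->lru.next marks a dma-pinned page. */
#define PAGE_DMA_PINNED		2UL

static inline bool page_dma_pinned(struct page *page)
{
	return (unsigned long)READ_ONCE(page->lru.next) & PAGE_DMA_PINNED;
}

static inline void put_user_page(struct page *page)
{
	/* Catch callers that should have used put_page() instead. */
	VM_WARN_ON_ONCE(!page_dma_pinned(page));
	put_page(page);
}
```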


thanks,
Jan Kara Oct. 18, 2018, 10:19 a.m. UTC | #15
On Thu 11-10-18 20:53:34, John Hubbard wrote:
> On 10/11/18 6:23 PM, John Hubbard wrote:
> > On 10/11/18 6:20 AM, Jason Gunthorpe wrote:
> >> On Thu, Oct 11, 2018 at 10:49:29AM +0200, Jan Kara wrote:
> >>
> >>>> This is a real worry.  If someone uses a mistaken put_page() then how
> >>>> will that bug manifest at runtime?  Under what set of circumstances
> >>>> will the kernel trigger the bug?
> >>>
> >>> At runtime such bug will manifest as a page that can never be evicted from
> >>> memory. We could warn in put_page() if page reference count drops below
> >>> bare minimum for given user pin count which would be able to catch some
> >>> issues but it won't be 100% reliable. So at this point I'm more leaning
> >>> towards making get_user_pages() return a different type than just
> >>> struct page * to make it much harder for refcount to go wrong...
> >>
> >> At least for the infiniband code being used as an example here we take
> >> the struct page from get_user_pages, then stick it in a sgl, and at
> >> put_page time we get the page back out of the sgl via sg_page()
> >>
> >> So type safety will not help this case... I wonder how many other
> >> users are similar? I think this is a pretty reasonable flow for DMA
> >> with user pages.
> >>
> > 
> > That is true. The infiniband code, fortunately, never mixes the two page
> > types into the same pool (or sg list), so it's actually an easier example
> > than some other subsystems. But, yes, type safety doesn't help there. I can 
> > take a moment to look around at the other areas, to quantify how much a type
> > safety change might help.
> > 
> > Back to page flags again, out of desperation:
> > 
> > How much do we know about the page types that all of these subsystems
> > use? In other words, can we, for example, use bit 1 of page->lru.next (see [1]
> > for context) as the "dma-pinned" page flag, while tracking pages within parts 
> > of the kernel that call a mix of alloc_pages, get_user_pages, and other allocators? 
> > In order for that to work, page->index, page->private, and bit 1 of page->mapping
> > must not be used. I doubt that this is always going to hold, but...does it?
> > 
> 
> Oops, pardon me, please ignore that nonsense about page->index and page->private
> and page->mapping, that's actually fine (I was seeing "union", where "struct" was
> written--too much staring at this code). 
> 
> So actually, I think maybe we can just use bit 1 in page->lru.next to sort out
> which pages are dma-pinned, in the calling code, just like we're going to do
> in writeback situations. This should also allow run-time checking that Andrew was 
> hoping for:
> 
>     put_user_page(): assert that the page is dma-pinned
>     put_page(): assert that the page is *not* dma-pinned
> 
> ...both of which depend on that bit being, essentially, available as sort
> of a general page flag. And in fact, if it's not, then the whole approach
> is dead anyway.

Well, put_page() cannot assert that the page is not dma-pinned, as someone can
still do get_page(), put_page() on a dma-pinned page, and that must not barf. But
put_page() could assert that, if the page is pinned, the refcount is >=
the pincount. That will detect leaked pin references relatively quickly.
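A kernel-style sketch of that assertion (hypothetical; it assumes a
page_pincount() accessor, which does not exist in this patchset):

```c
/* Hypothetical sketch: detect leaked pin references at release time. */
static inline void put_page_assert_pins(struct page *page)
{
	int pins = page_pincount(page);	/* hypothetical accessor */

	/* A pinned page must hold at least one reference per pin. */
	VM_WARN_ON_ONCE(pins > 0 && page_ref_count(page) < pins);
	put_page(page);
}
```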

								Honza
Jason Gunthorpe Oct. 22, 2018, 7:43 p.m. UTC | #16
On Thu, Oct 11, 2018 at 06:23:24PM -0700, John Hubbard wrote:
> On 10/11/18 6:20 AM, Jason Gunthorpe wrote:
> > On Thu, Oct 11, 2018 at 10:49:29AM +0200, Jan Kara wrote:
> > 
> >>> This is a real worry.  If someone uses a mistaken put_page() then how
> >>> will that bug manifest at runtime?  Under what set of circumstances
> >>> will the kernel trigger the bug?
> >>
> >> At runtime such bug will manifest as a page that can never be evicted from
> >> memory. We could warn in put_page() if page reference count drops below
> >> bare minimum for given user pin count which would be able to catch some
> >> issues but it won't be 100% reliable. So at this point I'm more leaning
> >> towards making get_user_pages() return a different type than just
> >> struct page * to make it much harder for refcount to go wrong...
> > 
> > At least for the infiniband code being used as an example here we take
> > the struct page from get_user_pages, then stick it in a sgl, and at
> > put_page time we get the page back out of the sgl via sg_page()
> > 
> > So type safety will not help this case... I wonder how many other
> > users are similar? I think this is a pretty reasonable flow for DMA
> > with user pages.
> > 
> 
> That is true. The infiniband code, fortunately, never mixes the two page
> types into the same pool (or sg list), so it's actually an easier example
> than some other subsystems. But, yes, type safety doesn't help there. I can 
> take a moment to look around at the other areas, to quantify how much a type
> safety change might help.

Are most (all?) of the places working with SGLs?

Maybe we could just have a 'get_user_pages_to_sgl' and a 'put_pages_sgl'
sort of interface that handles all this, instead of trying to make
something that is struct page based?

It seems easier to get an extra bit for user/!user into the SGL
data structure?
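A hypothetical sketch of what such an interface's signatures might look like
(names and arguments invented here, not from any posted patch):

```c
/*
 * Hypothetical sketch: pin user memory straight into a scatterlist,
 * marking each entry as a user page so release can do the right thing.
 */
long get_user_pages_to_sgl(unsigned long start, unsigned long nr_pages,
			   unsigned int gup_flags, struct scatterlist *sgl);

/*
 * Release every page in the sgl, consulting the per-entry user/!user
 * bit to choose between put_user_page() and put_page().
 */
void put_pages_sgl(struct scatterlist *sgl, int nents, bool make_dirty);
```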

Jason
John Hubbard Nov. 5, 2018, 7:17 a.m. UTC | #17
On 10/22/18 12:43 PM, Jason Gunthorpe wrote:
> On Thu, Oct 11, 2018 at 06:23:24PM -0700, John Hubbard wrote:
>> On 10/11/18 6:20 AM, Jason Gunthorpe wrote:
>>> On Thu, Oct 11, 2018 at 10:49:29AM +0200, Jan Kara wrote:
>>>
>>>>> This is a real worry.  If someone uses a mistaken put_page() then how
>>>>> will that bug manifest at runtime?  Under what set of circumstances
>>>>> will the kernel trigger the bug?
>>>>
>>>> At runtime such bug will manifest as a page that can never be evicted from
>>>> memory. We could warn in put_page() if page reference count drops below
>>>> bare minimum for given user pin count which would be able to catch some
>>>> issues but it won't be 100% reliable. So at this point I'm more leaning
>>>> towards making get_user_pages() return a different type than just
>>>> struct page * to make it much harder for refcount to go wrong...
>>>
>>> At least for the infiniband code being used as an example here we take
>>> the struct page from get_user_pages, then stick it in a sgl, and at
>>> put_page time we get the page back out of the sgl via sg_page()
>>>
>>> So type safety will not help this case... I wonder how many other
>>> users are similar? I think this is a pretty reasonable flow for DMA
>>> with user pages.
>>>
>>
>> That is true. The infiniband code, fortunately, never mixes the two page
>> types into the same pool (or sg list), so it's actually an easier example
>> than some other subsystems. But, yes, type safety doesn't help there. I can 
>> take a moment to look around at the other areas, to quantify how much a type
>> safety change might help.
> 
> Are most (all?) of the places working with SGLs?

I finally put together a spreadsheet, in order to answer this sort of thing.
Some notes:

a) There are around 100 call sites of either get_user_pages*(), or indirect
calls via iov_iter_get_pages*().

b) There are only a few SGL users. Most call sites are ad hoc, instead: some loop
that either can be collapsed nicely into the new put_user_pages*() APIs, or...
cannot.

c) The real problem is the roughly 20+ iov_iter_get_pages*() call sites. I need
to change both the iov_iter system a little bit, and also change the callers
so that they don't pile all the gup-pinned pages into the same page** array
that also contains other allocation types. This can be done; it just takes
time. That's the good news.
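As an illustration of the loops that "collapse nicely" in (b), a hypothetical
caller conversion might look like this (illustrative driver code, not from the
patchset):

```c
/* Before: an open-coded release loop in some gup user. */
for (i = 0; i < npages; i++) {
	set_page_dirty_lock(pages[i]);
	put_page(pages[i]);
}

/* After: one call to the new helper, with the same intent. */
put_user_pages_dirty_lock(pages, npages);
```

(The helper additionally skips set_page_dirty_lock() for pages that are
already dirty.)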

> 
> Maybe we could just have a 'get_user_pages_to_sgl' and 'put_pages_sgl'
> sort of interface that handled all this instead of trying to make
> something that is struct page based?
> 
> It seems easier to get an extra bit for user/!user in the SGL
> datastructure?
> 

So at the moment I don't think we need this *_sgl interface. We need iov_iter*
changes instead.

thanks,
John Hubbard Nov. 5, 2018, 7:25 a.m. UTC | #18
On 10/18/18 3:19 AM, Jan Kara wrote:
> On Thu 11-10-18 20:53:34, John Hubbard wrote:
>> On 10/11/18 6:23 PM, John Hubbard wrote:
>>> On 10/11/18 6:20 AM, Jason Gunthorpe wrote:
>>>> On Thu, Oct 11, 2018 at 10:49:29AM +0200, Jan Kara wrote:
[...]
> Well, put_page() cannot assert page is not dma-pinned as someone can still
> to get_page(), put_page() on dma-pinned page and that must not barf. But
> put_page() could assert that if the page is pinned, refcount is >=
> pincount. That will detect leaked pin references relatively quickly.
> 

That assertion is definitely a lifesaver. I've been working on a combination
of finishing up more call site conversions and runtime testing, and this
lights up the missing conversions pretty nicely.

As I mentioned in another thread just now, I'll send out an updated RFC this week,
so that people can look through it well before the LPC (next week).

thanks,
Jan Kara Nov. 5, 2018, 8:37 a.m. UTC | #19
On Sun 04-11-18 23:17:58, John Hubbard wrote:
> On 10/22/18 12:43 PM, Jason Gunthorpe wrote:
> > On Thu, Oct 11, 2018 at 06:23:24PM -0700, John Hubbard wrote:
> >> On 10/11/18 6:20 AM, Jason Gunthorpe wrote:
> >>> On Thu, Oct 11, 2018 at 10:49:29AM +0200, Jan Kara wrote:
> >>>
> >>>>> This is a real worry.  If someone uses a mistaken put_page() then how
> >>>>> will that bug manifest at runtime?  Under what set of circumstances
> >>>>> will the kernel trigger the bug?
> >>>>
> >>>> At runtime such bug will manifest as a page that can never be evicted from
> >>>> memory. We could warn in put_page() if page reference count drops below
> >>>> bare minimum for given user pin count which would be able to catch some
> >>>> issues but it won't be 100% reliable. So at this point I'm more leaning
> >>>> towards making get_user_pages() return a different type than just
> >>>> struct page * to make it much harder for refcount to go wrong...
> >>>
> >>> At least for the infiniband code being used as an example here we take
> >>> the struct page from get_user_pages, then stick it in a sgl, and at
> >>> put_page time we get the page back out of the sgl via sg_page()
> >>>
> >>> So type safety will not help this case... I wonder how many other
> >>> users are similar? I think this is a pretty reasonable flow for DMA
> >>> with user pages.
> >>>
> >>
> >> That is true. The infiniband code, fortunately, never mixes the two page
> >> types into the same pool (or sg list), so it's actually an easier example
> >> than some other subsystems. But, yes, type safety doesn't help there. I can 
> >> take a moment to look around at the other areas, to quantify how much a type
> >> safety change might help.
> > 
> > Are most (all?) of the places working with SGLs?
> 
> I finally put together a spreadsheet, in order to answer this sort of thing.
> Some notes:
> 
> a) There are around 100 call sites of either get_user_pages*(), or indirect
> calls via iov_iter_get_pages*().

Quite a bit...

> b) There are only a few SGL users. Most are ad-hoc, instead: some loop that
> either can be collapsed nicely into the new put_user_pages*() APIs, or...
> cannot.
> 
> c) The real problem is: around 20+ iov_iter_get_pages*() call sites. I need
> to change both the  iov_iter system a little bit, and also change the callers
> so that they don't pile all the gup-pinned pages into the same page** array
> that also contains other allocation types. This can be done, it just takes
> time, that's the good news.

Yes, but looking into the iov_iter_get_pages() users, a lot of them end up
feeding the result into an SGL, an SKB (which is basically the same thing,
just for networking), or a BVEC (which is again a very similar thing, just for
the generic block layer). I'm not saying that we must have an _sgl() interface,
as untangling all those users might be just too complex, but there is certainly
some space for unification and common interfaces ;)

								Honza

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0416a7204be3..0490f4a71b9c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -137,6 +137,8 @@  extern int overcommit_ratio_handler(struct ctl_table *, int, void __user *,
 				    size_t *, loff_t *);
 extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
 				    size_t *, loff_t *);
+int set_page_dirty(struct page *page);
+int set_page_dirty_lock(struct page *page);
 
 #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
 
@@ -943,6 +945,51 @@  static inline void put_page(struct page *page)
 		__put_page(page);
 }
 
+/*
+ * Pages that were pinned via get_user_pages*() should be released via
+ * either put_user_page(), or one of the put_user_pages*() routines
+ * below.
+ */
+static inline void put_user_page(struct page *page)
+{
+	put_page(page);
+}
+
+static inline void put_user_pages_dirty(struct page **pages,
+					unsigned long npages)
+{
+	unsigned long index;
+
+	for (index = 0; index < npages; index++) {
+		if (!PageDirty(pages[index]))
+			set_page_dirty(pages[index]);
+
+		put_user_page(pages[index]);
+	}
+}
+
+static inline void put_user_pages_dirty_lock(struct page **pages,
+					     unsigned long npages)
+{
+	unsigned long index;
+
+	for (index = 0; index < npages; index++) {
+		if (!PageDirty(pages[index]))
+			set_page_dirty_lock(pages[index]);
+
+		put_user_page(pages[index]);
+	}
+}
+
+static inline void put_user_pages(struct page **pages,
+				  unsigned long npages)
+{
+	unsigned long index;
+
+	for (index = 0; index < npages; index++)
+		put_user_page(pages[index]);
+}
+
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
 #endif
@@ -1534,8 +1581,6 @@  int redirty_page_for_writepage(struct writeback_control *wbc,
 void account_page_dirtied(struct page *page, struct address_space *mapping);
 void account_page_cleaned(struct page *page, struct address_space *mapping,
 			  struct bdi_writeback *wb);
-int set_page_dirty(struct page *page);
-int set_page_dirty_lock(struct page *page);
 void __cancel_dirty_page(struct page *page);
 static inline void cancel_dirty_page(struct page *page)
 {