Message ID | 20190802022005.5117-1-jhubbard@nvidia.com (mailing list archive) |
---|---|
Headers | show |
Series | put_user_pages(): miscellaneous call sites | expand |
On Thu 01-08-19 19:19:31, john.hubbard@gmail.com wrote: [...] > 2) Convert all of the call sites for get_user_pages*(), to > invoke put_user_page*(), instead of put_page(). This involves dozens of > call sites, and will take some time. How do we make sure this is the case and it will remain the case in the future? There must be some automagic to enforce/check that. It is simply not manageable to do it every now and then because then 3) will simply be never safe. Have you considered coccinele or some other scripted way to do the transition? I have no idea how to deal with future changes that would break the balance though.
On Fri 02-08-19 11:12:44, Michal Hocko wrote: > On Thu 01-08-19 19:19:31, john.hubbard@gmail.com wrote: > [...] > > 2) Convert all of the call sites for get_user_pages*(), to > > invoke put_user_page*(), instead of put_page(). This involves dozens of > > call sites, and will take some time. > > How do we make sure this is the case and it will remain the case in the > future? There must be some automagic to enforce/check that. It is simply > not manageable to do it every now and then because then 3) will simply > be never safe. > > Have you considered coccinele or some other scripted way to do the > transition? I have no idea how to deal with future changes that would > break the balance though. Yeah, that's why I've been suggesting at LSF/MM that we may need to create a gup wrapper - say vaddr_pin_pages() - and track which sites dropping references got converted by using this wrapper instead of gup. The counterpart would then be more logically named as unpin_page() or whatever instead of put_user_page(). Sure this is not completely foolproof (you can create new callsite using vaddr_pin_pages() and then just drop refs using put_page()) but I suppose it would be a high enough barrier for missed conversions... Thoughts? Honza
On Fri, Aug 02, 2019 at 02:41:46PM +0200, Jan Kara wrote: > On Fri 02-08-19 11:12:44, Michal Hocko wrote: > > On Thu 01-08-19 19:19:31, john.hubbard@gmail.com wrote: > > [...] > > > 2) Convert all of the call sites for get_user_pages*(), to > > > invoke put_user_page*(), instead of put_page(). This involves dozens of > > > call sites, and will take some time. > > > > How do we make sure this is the case and it will remain the case in the > > future? There must be some automagic to enforce/check that. It is simply > > not manageable to do it every now and then because then 3) will simply > > be never safe. > > > > Have you considered coccinele or some other scripted way to do the > > transition? I have no idea how to deal with future changes that would > > break the balance though. > > Yeah, that's why I've been suggesting at LSF/MM that we may need to create > a gup wrapper - say vaddr_pin_pages() - and track which sites dropping > references got converted by using this wrapper instead of gup. The > counterpart would then be more logically named as unpin_page() or whatever > instead of put_user_page(). Sure this is not completely foolproof (you can > create new callsite using vaddr_pin_pages() and then just drop refs using > put_page()) but I suppose it would be a high enough barrier for missed > conversions... Thoughts? I think the API we really need is get_user_bvec() / put_user_bvec(), and I know Christoph has been putting some work into that. That avoids doing refcount operations on hundreds of pages if the page in question is a huge page. Once people are switched over to that, they won't be tempted to manually call put_page() on the individual constituent pages of a bvec.
On Fri 02-08-19 07:24:43, Matthew Wilcox wrote: > On Fri, Aug 02, 2019 at 02:41:46PM +0200, Jan Kara wrote: > > On Fri 02-08-19 11:12:44, Michal Hocko wrote: > > > On Thu 01-08-19 19:19:31, john.hubbard@gmail.com wrote: > > > [...] > > > > 2) Convert all of the call sites for get_user_pages*(), to > > > > invoke put_user_page*(), instead of put_page(). This involves dozens of > > > > call sites, and will take some time. > > > > > > How do we make sure this is the case and it will remain the case in the > > > future? There must be some automagic to enforce/check that. It is simply > > > not manageable to do it every now and then because then 3) will simply > > > be never safe. > > > > > > Have you considered coccinele or some other scripted way to do the > > > transition? I have no idea how to deal with future changes that would > > > break the balance though. > > > > Yeah, that's why I've been suggesting at LSF/MM that we may need to create > > a gup wrapper - say vaddr_pin_pages() - and track which sites dropping > > references got converted by using this wrapper instead of gup. The > > counterpart would then be more logically named as unpin_page() or whatever > > instead of put_user_page(). Sure this is not completely foolproof (you can > > create new callsite using vaddr_pin_pages() and then just drop refs using > > put_page()) but I suppose it would be a high enough barrier for missed > > conversions... Thoughts? > > I think the API we really need is get_user_bvec() / put_user_bvec(), > and I know Christoph has been putting some work into that. That avoids > doing refcount operations on hundreds of pages if the page in question is > a huge page. Once people are switched over to that, they won't be tempted > to manually call put_page() on the individual constituent pages of a bvec. Well, get_user_bvec() is certainly a good API for one class of users but just looking at the above series, you'll see there are *many* places that just don't work with bvecs at all and you need something for those. Honza
On 8/2/19 7:52 AM, Jan Kara wrote: > On Fri 02-08-19 07:24:43, Matthew Wilcox wrote: >> On Fri, Aug 02, 2019 at 02:41:46PM +0200, Jan Kara wrote: >>> On Fri 02-08-19 11:12:44, Michal Hocko wrote: >>>> On Thu 01-08-19 19:19:31, john.hubbard@gmail.com wrote: >>>> [...] >>>>> 2) Convert all of the call sites for get_user_pages*(), to >>>>> invoke put_user_page*(), instead of put_page(). This involves dozens of >>>>> call sites, and will take some time. >>>> >>>> How do we make sure this is the case and it will remain the case in the >>>> future? There must be some automagic to enforce/check that. It is simply >>>> not manageable to do it every now and then because then 3) will simply >>>> be never safe. >>>> >>>> Have you considered coccinele or some other scripted way to do the >>>> transition? I have no idea how to deal with future changes that would >>>> break the balance though. Hi Michal, Yes, I've thought about it, and coccinelle falls a bit short (it's not smart enough to know which put_page()'s to convert). However, there is a debug option planned: a yet-to-be-posted commit [1] uses struct page extensions (obviously protected by CONFIG_DEBUG_GET_USER_PAGES_REFERENCES) to add a redundant counter. That allows: void __put_page(struct page *page) { ... /* Someone called put_page() instead of put_user_page() */ WARN_ON_ONCE(atomic_read(&page_ext->pin_count) > 0); >>> >>> Yeah, that's why I've been suggesting at LSF/MM that we may need to create >>> a gup wrapper - say vaddr_pin_pages() - and track which sites dropping >>> references got converted by using this wrapper instead of gup. The >>> counterpart would then be more logically named as unpin_page() or whatever >>> instead of put_user_page(). Sure this is not completely foolproof (you can >>> create new callsite using vaddr_pin_pages() and then just drop refs using >>> put_page()) but I suppose it would be a high enough barrier for missed >>> conversions... Thoughts? The debug option above is still a bit simplistic in its implementation (and maybe not taking full advantage of the data it has), but I think it's preferable, because it monitors the "core" and WARNs. Instead of the wrapper, I'm thinking: documentation and the passage of time, plus the debug option (perhaps enhanced--probably once I post it someone will notice opportunities), yes? >> >> I think the API we really need is get_user_bvec() / put_user_bvec(), >> and I know Christoph has been putting some work into that. That avoids >> doing refcount operations on hundreds of pages if the page in question is >> a huge page. Once people are switched over to that, they won't be tempted >> to manually call put_page() on the individual constituent pages of a bvec. > > Well, get_user_bvec() is certainly a good API for one class of users but > just looking at the above series, you'll see there are *many* places that > just don't work with bvecs at all and you need something for those. > Yes, there are quite a few places that don't involve _bvec, as we can see right here. So we need something. Andrew asked for a debug option some time ago, and several people (Dave Hansen, Dan Williams, Jerome) had the idea of vmap-ing gup pages separately, so you can definitely tell where each page came from. I'm hoping not to have to go to that level of complexity though. [1] "mm/gup: debug tracking of get_user_pages() references" : https://github.com/johnhubbard/linux/commit/21ff7d6161ec2a14d3f9d17c98abb00cc969d4d6 thanks,
On Fri 02-08-19 12:14:09, John Hubbard wrote: > On 8/2/19 7:52 AM, Jan Kara wrote: > > On Fri 02-08-19 07:24:43, Matthew Wilcox wrote: > > > On Fri, Aug 02, 2019 at 02:41:46PM +0200, Jan Kara wrote: > > > > On Fri 02-08-19 11:12:44, Michal Hocko wrote: > > > > > On Thu 01-08-19 19:19:31, john.hubbard@gmail.com wrote: > > > > > [...] > > > > > > 2) Convert all of the call sites for get_user_pages*(), to > > > > > > invoke put_user_page*(), instead of put_page(). This involves dozens of > > > > > > call sites, and will take some time. > > > > > > > > > > How do we make sure this is the case and it will remain the case in the > > > > > future? There must be some automagic to enforce/check that. It is simply > > > > > not manageable to do it every now and then because then 3) will simply > > > > > be never safe. > > > > > > > > > > Have you considered coccinele or some other scripted way to do the > > > > > transition? I have no idea how to deal with future changes that would > > > > > break the balance though. > > Hi Michal, > > Yes, I've thought about it, and coccinelle falls a bit short (it's not smart > enough to know which put_page()'s to convert). However, there is a debug > option planned: a yet-to-be-posted commit [1] uses struct page extensions > (obviously protected by CONFIG_DEBUG_GET_USER_PAGES_REFERENCES) to add > a redundant counter. That allows: > > void __put_page(struct page *page) > { > ... > /* Someone called put_page() instead of put_user_page() */ > WARN_ON_ONCE(atomic_read(&page_ext->pin_count) > 0); > > > > > > > > > Yeah, that's why I've been suggesting at LSF/MM that we may need to create > > > > a gup wrapper - say vaddr_pin_pages() - and track which sites dropping > > > > references got converted by using this wrapper instead of gup. The > > > > counterpart would then be more logically named as unpin_page() or whatever > > > > instead of put_user_page(). Sure this is not completely foolproof (you can > > > > create new callsite using vaddr_pin_pages() and then just drop refs using > > > > put_page()) but I suppose it would be a high enough barrier for missed > > > > conversions... Thoughts? > > The debug option above is still a bit simplistic in its implementation > (and maybe not taking full advantage of the data it has), but I think > it's preferable, because it monitors the "core" and WARNs. > > Instead of the wrapper, I'm thinking: documentation and the passage of > time, plus the debug option (perhaps enhanced--probably once I post it > someone will notice opportunities), yes? So I think your debug option and my suggested renaming serve a bit different purposes (and thus both make sense). If you do the renaming, you can just grep to see unconverted sites. Also when someone merges new GUP user (unaware of the new rules) while you switch GUP to use pins instead of ordinary references, you'll get compilation error in case of renaming instead of hard to debug refcount leak without the renaming. And such conflict is almost bound to happen given the size of GUP patch set... Also the renaming serves against the "coding inertia" - i.e., GUP is around for ages so people just use it without checking any documentation or comments. After switching how GUP works, what used to be correct isn't anymore so renaming the function serves as a warning that something has really changed. Your refcount debug patches are good to catch bugs in the conversions done but that requires you to be able to excercise the code path in the first place which may require particular HW or so, and you also have to enable the debug option which means you already aim at verifying the GUP references are treated properly. Honza
On Wed 07-08-19 10:37:26, Jan Kara wrote: > On Fri 02-08-19 12:14:09, John Hubbard wrote: > > On 8/2/19 7:52 AM, Jan Kara wrote: > > > On Fri 02-08-19 07:24:43, Matthew Wilcox wrote: > > > > On Fri, Aug 02, 2019 at 02:41:46PM +0200, Jan Kara wrote: > > > > > On Fri 02-08-19 11:12:44, Michal Hocko wrote: > > > > > > On Thu 01-08-19 19:19:31, john.hubbard@gmail.com wrote: > > > > > > [...] > > > > > > > 2) Convert all of the call sites for get_user_pages*(), to > > > > > > > invoke put_user_page*(), instead of put_page(). This involves dozens of > > > > > > > call sites, and will take some time. > > > > > > > > > > > > How do we make sure this is the case and it will remain the case in the > > > > > > future? There must be some automagic to enforce/check that. It is simply > > > > > > not manageable to do it every now and then because then 3) will simply > > > > > > be never safe. > > > > > > > > > > > > Have you considered coccinele or some other scripted way to do the > > > > > > transition? I have no idea how to deal with future changes that would > > > > > > break the balance though. > > > > Hi Michal, > > > > Yes, I've thought about it, and coccinelle falls a bit short (it's not smart > > enough to know which put_page()'s to convert). However, there is a debug > > option planned: a yet-to-be-posted commit [1] uses struct page extensions > > (obviously protected by CONFIG_DEBUG_GET_USER_PAGES_REFERENCES) to add > > a redundant counter. That allows: > > > > void __put_page(struct page *page) > > { > > ... > > /* Someone called put_page() instead of put_user_page() */ > > WARN_ON_ONCE(atomic_read(&page_ext->pin_count) > 0); > > > > > > > > > > > > Yeah, that's why I've been suggesting at LSF/MM that we may need to create > > > > > a gup wrapper - say vaddr_pin_pages() - and track which sites dropping > > > > > references got converted by using this wrapper instead of gup. The > > > > > counterpart would then be more logically named as unpin_page() or whatever > > > > > instead of put_user_page(). Sure this is not completely foolproof (you can > > > > > create new callsite using vaddr_pin_pages() and then just drop refs using > > > > > put_page()) but I suppose it would be a high enough barrier for missed > > > > > conversions... Thoughts? > > > > The debug option above is still a bit simplistic in its implementation > > (and maybe not taking full advantage of the data it has), but I think > > it's preferable, because it monitors the "core" and WARNs. > > > > Instead of the wrapper, I'm thinking: documentation and the passage of > > time, plus the debug option (perhaps enhanced--probably once I post it > > someone will notice opportunities), yes? > > So I think your debug option and my suggested renaming serve a bit > different purposes (and thus both make sense). If you do the renaming, you > can just grep to see unconverted sites. Also when someone merges new GUP > user (unaware of the new rules) while you switch GUP to use pins instead of > ordinary references, you'll get compilation error in case of renaming > instead of hard to debug refcount leak without the renaming. And such > conflict is almost bound to happen given the size of GUP patch set... Also > the renaming serves against the "coding inertia" - i.e., GUP is around for > ages so people just use it without checking any documentation or comments. > After switching how GUP works, what used to be correct isn't anymore so > renaming the function serves as a warning that something has really > changed. Fully agreed! > Your refcount debug patches are good to catch bugs in the conversions done > but that requires you to be able to excercise the code path in the first > place which may require particular HW or so, and you also have to enable > the debug option which means you already aim at verifying the GUP > references are treated properly. > > Honza > > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR
On Wed, Aug 07, 2019 at 10:46:49AM +0200, Michal Hocko wrote: > On Wed 07-08-19 10:37:26, Jan Kara wrote: > > On Fri 02-08-19 12:14:09, John Hubbard wrote: > > > On 8/2/19 7:52 AM, Jan Kara wrote: > > > > On Fri 02-08-19 07:24:43, Matthew Wilcox wrote: > > > > > On Fri, Aug 02, 2019 at 02:41:46PM +0200, Jan Kara wrote: > > > > > > On Fri 02-08-19 11:12:44, Michal Hocko wrote: > > > > > > > On Thu 01-08-19 19:19:31, john.hubbard@gmail.com wrote: > > > > > > > [...] > > > > > > > > 2) Convert all of the call sites for get_user_pages*(), to > > > > > > > > invoke put_user_page*(), instead of put_page(). This involves dozens of > > > > > > > > call sites, and will take some time. > > > > > > > > > > > > > > How do we make sure this is the case and it will remain the case in the > > > > > > > future? There must be some automagic to enforce/check that. It is simply > > > > > > > not manageable to do it every now and then because then 3) will simply > > > > > > > be never safe. > > > > > > > > > > > > > > Have you considered coccinele or some other scripted way to do the > > > > > > > transition? I have no idea how to deal with future changes that would > > > > > > > break the balance though. > > > > > > Hi Michal, > > > > > > Yes, I've thought about it, and coccinelle falls a bit short (it's not smart > > > enough to know which put_page()'s to convert). However, there is a debug > > > option planned: a yet-to-be-posted commit [1] uses struct page extensions > > > (obviously protected by CONFIG_DEBUG_GET_USER_PAGES_REFERENCES) to add > > > a redundant counter. That allows: > > > > > > void __put_page(struct page *page) > > > { > > > ... > > > /* Someone called put_page() instead of put_user_page() */ > > > WARN_ON_ONCE(atomic_read(&page_ext->pin_count) > 0); > > > > > > > > > > > > > > > Yeah, that's why I've been suggesting at LSF/MM that we may need to create > > > > > > a gup wrapper - say vaddr_pin_pages() - and track which sites dropping > > > > > > references got converted by using this wrapper instead of gup. The > > > > > > counterpart would then be more logically named as unpin_page() or whatever > > > > > > instead of put_user_page(). Sure this is not completely foolproof (you can > > > > > > create new callsite using vaddr_pin_pages() and then just drop refs using > > > > > > put_page()) but I suppose it would be a high enough barrier for missed > > > > > > conversions... Thoughts? > > > > > > The debug option above is still a bit simplistic in its implementation > > > (and maybe not taking full advantage of the data it has), but I think > > > it's preferable, because it monitors the "core" and WARNs. > > > > > > Instead of the wrapper, I'm thinking: documentation and the passage of > > > time, plus the debug option (perhaps enhanced--probably once I post it > > > someone will notice opportunities), yes? > > > > So I think your debug option and my suggested renaming serve a bit > > different purposes (and thus both make sense). If you do the renaming, you > > can just grep to see unconverted sites. Also when someone merges new GUP > > user (unaware of the new rules) while you switch GUP to use pins instead of > > ordinary references, you'll get compilation error in case of renaming > > instead of hard to debug refcount leak without the renaming. And such > > conflict is almost bound to happen given the size of GUP patch set... Also > > the renaming serves against the "coding inertia" - i.e., GUP is around for > > ages so people just use it without checking any documentation or comments. > > After switching how GUP works, what used to be correct isn't anymore so > > renaming the function serves as a warning that something has really > > changed. > > Fully agreed! Ok Prior to this I've been basing all my work for the RDMA/FS DAX stuff in Johns put_user_pages()... (Including when I proposed failing truncate with a lease in June [1]) However, based on the suggestions in that thread it became clear that a new interface was going to need to be added to pass in the "RDMA file" information to GUP to associate file pins with the correct processes... I have many drawings on my white board with "a whole lot of lines" on them to make sure that if a process opens a file, mmaps it, pins it with RDMA, _closes_ it, and ummaps it; that the resulting file pin can still be traced back to the RDMA context and all the processes which may have access to it.... No matter where the original context may have come from. I believe I have accomplished that. Before I go on, I would like to say that the "imbalance" of get_user_pages() and put_page() bothers me from a purist standpoint... However, since this discussion cropped up I went ahead and ported my work to Linus' current master (5.3-rc3+) and in doing so I only had to steal a bit of Johns code... Sorry John... :-( I don't have the commit messages all cleaned up and I know there may be some discussion on these new interfaces but I wanted to throw this series out there because I think it may be what Jan and Michal are driving at (or at least in that direction. Right now only RDMA and DAX FS's are supported. Other users of GUP will still fail on a DAX file and regular files will still be at risk.[2] I've pushed this work (based 5.3-rc3+ (33920f1ec5bf)) here[3]: https://github.com/weiny2/linux-kernel/tree/linus-rdmafsdax-b0-v3 I think the most relevant patch to this conversation is: https://github.com/weiny2/linux-kernel/commit/5d377653ba5cf11c3b716f904b057bee6641aaf6 I stole Jans suggestion for a name as the name I used while prototyping was pretty bad... So Thanks Jan... ;-) Also thanks to John for his contribution on some of this. I'm still tweaking put_user_pages under the hood on the DAX path. Ira [1] https://lwn.net/Articles/790544/ [2] I've been looking into how to support io_uring next but I've had some issue getting a test program to actually call GUP in that code path... :-( [3] If it would be easier I can just throw an RFC on the list but right now the cover letter and some of the commit messages are full of the old stuff and various ideas I have had... > > > Your refcount debug patches are good to catch bugs in the conversions done > > but that requires you to be able to excercise the code path in the first > > place which may require particular HW or so, and you also have to enable > > the debug option which means you already aim at verifying the GUP > > references are treated properly. > > > > Honza > > > > -- > > Jan Kara <jack@suse.com> > > SUSE Labs, CR > > -- > Michal Hocko > SUSE Labs
On 8/7/19 7:36 PM, Ira Weiny wrote: > On Wed, Aug 07, 2019 at 10:46:49AM +0200, Michal Hocko wrote: >> On Wed 07-08-19 10:37:26, Jan Kara wrote: >>> On Fri 02-08-19 12:14:09, John Hubbard wrote: >>>> On 8/2/19 7:52 AM, Jan Kara wrote: >>>>> On Fri 02-08-19 07:24:43, Matthew Wilcox wrote: >>>>>> On Fri, Aug 02, 2019 at 02:41:46PM +0200, Jan Kara wrote: >>>>>>> On Fri 02-08-19 11:12:44, Michal Hocko wrote: >>>>>>>> On Thu 01-08-19 19:19:31, john.hubbard@gmail.com wrote: [...] > Before I go on, I would like to say that the "imbalance" of get_user_pages() > and put_page() bothers me from a purist standpoint... However, since this > discussion cropped up I went ahead and ported my work to Linus' current master > (5.3-rc3+) and in doing so I only had to steal a bit of Johns code... Sorry > John... :-( > > I don't have the commit messages all cleaned up and I know there may be some > discussion on these new interfaces but I wanted to throw this series out there > because I think it may be what Jan and Michal are driving at (or at least in > that direction. > > Right now only RDMA and DAX FS's are supported. Other users of GUP will still > fail on a DAX file and regular files will still be at risk.[2] > > I've pushed this work (based 5.3-rc3+ (33920f1ec5bf)) here[3]: > > https://github.com/weiny2/linux-kernel/tree/linus-rdmafsdax-b0-v3 > > I think the most relevant patch to this conversation is: > > https://github.com/weiny2/linux-kernel/commit/5d377653ba5cf11c3b716f904b057bee6641aaf6 > ohhh...can you please avoid using the old __put_user_pages_dirty() function? I thought I'd caught things early enough to get away with the rename and deletion of that. You could either: a) open code an implementation of vaddr_put_pages_dirty_lock() that doesn't call any of the *put_user_pages_dirty*() variants, or b) include my first patch ("") are part of your series, or c) base this on Andrews's tree, which already has merged in my first patch. thanks,
> > On 8/7/19 7:36 PM, Ira Weiny wrote: > > On Wed, Aug 07, 2019 at 10:46:49AM +0200, Michal Hocko wrote: > >> On Wed 07-08-19 10:37:26, Jan Kara wrote: > >>> On Fri 02-08-19 12:14:09, John Hubbard wrote: > >>>> On 8/2/19 7:52 AM, Jan Kara wrote: > >>>>> On Fri 02-08-19 07:24:43, Matthew Wilcox wrote: > >>>>>> On Fri, Aug 02, 2019 at 02:41:46PM +0200, Jan Kara wrote: > >>>>>>> On Fri 02-08-19 11:12:44, Michal Hocko wrote: > >>>>>>>> On Thu 01-08-19 19:19:31, john.hubbard@gmail.com wrote: > [...] > > Before I go on, I would like to say that the "imbalance" of > > get_user_pages() and put_page() bothers me from a purist standpoint... > > However, since this discussion cropped up I went ahead and ported my > > work to Linus' current master > > (5.3-rc3+) and in doing so I only had to steal a bit of Johns code... > > Sorry John... :-( > > > > I don't have the commit messages all cleaned up and I know there may > > be some discussion on these new interfaces but I wanted to throw this > > series out there because I think it may be what Jan and Michal are > > driving at (or at least in that direction. > > > > Right now only RDMA and DAX FS's are supported. Other users of GUP > > will still fail on a DAX file and regular files will still be at > > risk.[2] > > > > I've pushed this work (based 5.3-rc3+ (33920f1ec5bf)) here[3]: > > > > https://github.com/weiny2/linux-kernel/tree/linus-rdmafsdax-b0-v3 > > > > I think the most relevant patch to this conversation is: > > > > https://github.com/weiny2/linux- > kernel/commit/5d377653ba5cf11c3b716f90 > > 4b057bee6641aaf6 > > > > ohhh...can you please avoid using the old __put_user_pages_dirty() > function? Agreed... I did not like that. Part of the reason I did not post this is I'm still trying to figure out what has landed and what I can and can't depend on. For example, Christoph H. was proposing changes to some of the GUP calls which may conflict. But I'm not sure his changes are moving forward. So rather than waiting for the dust to settle I decided to see how hard it would be to get this rebased against mainline and working. Turns out it was not too hard. I think that is because, as time has moved on it seems that, for some users such as RDMA, a simple put_user_page() is not going to be sufficient. We need something else to allow GUP to keep track of the file pins as we discussed. So I'm starting to think some of this could go in at the same time. > I thought I'd caught things early enough to get away with the > rename and deletion of that. You could either: > > a) open code an implementation of vaddr_put_pages_dirty_lock() that > doesn't call any of the *put_user_pages_dirty*() variants, or > > b) include my first patch ("") are part of your series, or > > c) base this on Andrews's tree, which already has merged in my first patch. > Yep I can do this. I did not realize that Andrew had accepted any of this work. I'll check out his tree. But I don't think he is going to accept this series through his tree. So what is the ETA on that landing in Linus' tree? To that point I'm still not sure who would take all this as I am now touching mm, procfs, rdma, ext4, and xfs. I just thought I would chime in with my progress because I'm to a point where things are working and so I can submit the code but I'm not sure what I can/should depend on landing... Also, now that 0day has run overnight it has found issues with this rebase so I need to clean those up... Perhaps I will base on Andrew's tree prior to doing that... Thanks, Ira > > thanks, > -- > John Hubbard > NVIDIA
On 8/8/19 9:25 AM, Weiny, Ira wrote: >> >> On 8/7/19 7:36 PM, Ira Weiny wrote: >>> On Wed, Aug 07, 2019 at 10:46:49AM +0200, Michal Hocko wrote: >>>> On Wed 07-08-19 10:37:26, Jan Kara wrote: >>>>> On Fri 02-08-19 12:14:09, John Hubbard wrote: >>>>>> On 8/2/19 7:52 AM, Jan Kara wrote: >>>>>>> On Fri 02-08-19 07:24:43, Matthew Wilcox wrote: >>>>>>>> On Fri, Aug 02, 2019 at 02:41:46PM +0200, Jan Kara wrote: >>>>>>>>> On Fri 02-08-19 11:12:44, Michal Hocko wrote: >>>>>>>>>> On Thu 01-08-19 19:19:31, john.hubbard@gmail.com wrote: >> [...] > Yep I can do this. I did not realize that Andrew had accepted any of this work. I'll check out his tree. But I don't think he is going to accept this series through his tree. So what is the ETA on that landing in Linus' tree? > I'd expect it to go into 5.4, according to my understanding of how the release cycles are arranged. > To that point I'm still not sure who would take all this as I am now touching mm, procfs, rdma, ext4, and xfs. > > I just thought I would chime in with my progress because I'm to a point where things are working and so I can submit the code but I'm not sure what I can/should depend on landing... Also, now that 0day has run overnight it has found issues with this rebase so I need to clean those up... Perhaps I will base on Andrew's tree prior to doing that... I'm certainly not the right person to answer, but in spite of that, I'd think Andrew's tree is a reasonable place for it. Sort of. thanks,
On Wed 07-08-19 19:36:37, Ira Weiny wrote: > On Wed, Aug 07, 2019 at 10:46:49AM +0200, Michal Hocko wrote: > > > So I think your debug option and my suggested renaming serve a bit > > > different purposes (and thus both make sense). If you do the renaming, you > > > can just grep to see unconverted sites. Also when someone merges new GUP > > > user (unaware of the new rules) while you switch GUP to use pins instead of > > > ordinary references, you'll get compilation error in case of renaming > > > instead of hard to debug refcount leak without the renaming. And such > > > conflict is almost bound to happen given the size of GUP patch set... Also > > > the renaming serves against the "coding inertia" - i.e., GUP is around for > > > ages so people just use it without checking any documentation or comments. > > > After switching how GUP works, what used to be correct isn't anymore so > > > renaming the function serves as a warning that something has really > > > changed. > > > > Fully agreed! > > Ok Prior to this I've been basing all my work for the RDMA/FS DAX stuff in > Johns put_user_pages()... (Including when I proposed failing truncate with a > lease in June [1]) > > However, based on the suggestions in that thread it became clear that a new > interface was going to need to be added to pass in the "RDMA file" information > to GUP to associate file pins with the correct processes... > > I have many drawings on my white board with "a whole lot of lines" on them to > make sure that if a process opens a file, mmaps it, pins it with RDMA, _closes_ > it, and ummaps it; that the resulting file pin can still be traced back to the > RDMA context and all the processes which may have access to it.... No matter > where the original context may have come from. I believe I have accomplished > that. > > Before I go on, I would like to say that the "imbalance" of get_user_pages() > and put_page() bothers me from a purist standpoint... However, since this > discussion cropped up I went ahead and ported my work to Linus' current master > (5.3-rc3+) and in doing so I only had to steal a bit of Johns code... Sorry > John... :-( > > I don't have the commit messages all cleaned up and I know there may be some > discussion on these new interfaces but I wanted to throw this series out there > because I think it may be what Jan and Michal are driving at (or at least in > that direction. > > Right now only RDMA and DAX FS's are supported. Other users of GUP will still > fail on a DAX file and regular files will still be at risk.[2] > > I've pushed this work (based 5.3-rc3+ (33920f1ec5bf)) here[3]: > > https://github.com/weiny2/linux-kernel/tree/linus-rdmafsdax-b0-v3 > > I think the most relevant patch to this conversation is: > > https://github.com/weiny2/linux-kernel/commit/5d377653ba5cf11c3b716f904b057bee6641aaf6 > > I stole Jans suggestion for a name as the name I used while prototyping was > pretty bad... So Thanks Jan... ;-) For your function, I'd choose a name like vaddr_pin_leased_pages() so that association with a lease is clear from the name :) Also I'd choose the counterpart to be vaddr_unpin_leased_page[s](). Especially having put_page in the name looks confusing to me... Honza
On Thu 08-08-19 16:25:04, Weiny, Ira wrote: > > I thought I'd caught things early enough to get away with the > > rename and deletion of that. You could either: > > > > a) open code an implementation of vaddr_put_pages_dirty_lock() that > > doesn't call any of the *put_user_pages_dirty*() variants, or > > > > b) include my first patch ("") are part of your series, or > > > > c) base this on Andrews's tree, which already has merged in my first patch. > > > > Yep I can do this. I did not realize that Andrew had accepted any of > this work. I'll check out his tree. But I don't think he is going to > accept this series through his tree. So what is the ETA on that landing > in Linus' tree? > > To that point I'm still not sure who would take all this as I am now > touching mm, procfs, rdma, ext4, and xfs. MM tree would be one candidate for routing but there are other options that would make sense as well - Dan's tree, VFS tree, or even I can pickup the patches to my tree if needed. But let's worry about the routing after we have working and reviewed patches... Honza
> > On Wed 07-08-19 19:36:37, Ira Weiny wrote: > > On Wed, Aug 07, 2019 at 10:46:49AM +0200, Michal Hocko wrote: > > > > So I think your debug option and my suggested renaming serve a bit > > > > different purposes (and thus both make sense). If you do the > > > > renaming, you can just grep to see unconverted sites. Also when > > > > someone merges new GUP user (unaware of the new rules) while you > > > > switch GUP to use pins instead of ordinary references, you'll get > > > > compilation error in case of renaming instead of hard to debug > > > > refcount leak without the renaming. And such conflict is almost > > > > bound to happen given the size of GUP patch set... Also the > > > > renaming serves against the "coding inertia" - i.e., GUP is around for > ages so people just use it without checking any documentation or comments. > > > > After switching how GUP works, what used to be correct isn't > > > > anymore so renaming the function serves as a warning that > > > > something has really changed. > > > > > > Fully agreed! > > > > Ok Prior to this I've been basing all my work for the RDMA/FS DAX > > stuff in Johns put_user_pages()... (Including when I proposed failing > > truncate with a lease in June [1]) > > > > However, based on the suggestions in that thread it became clear that > > a new interface was going to need to be added to pass in the "RDMA > > file" information to GUP to associate file pins with the correct processes... > > > > I have many drawings on my white board with "a whole lot of lines" on > > them to make sure that if a process opens a file, mmaps it, pins it > > with RDMA, _closes_ it, and ummaps it; that the resulting file pin can > > still be traced back to the RDMA context and all the processes which > > may have access to it.... No matter where the original context may > > have come from. I believe I have accomplished that. > > > > Before I go on, I would like to say that the "imbalance" of > > get_user_pages() and put_page() bothers me from a purist standpoint... > > However, since this discussion cropped up I went ahead and ported my > > work to Linus' current master > > (5.3-rc3+) and in doing so I only had to steal a bit of Johns code... > > Sorry John... :-( > > > > I don't have the commit messages all cleaned up and I know there may > > be some discussion on these new interfaces but I wanted to throw this > > series out there because I think it may be what Jan and Michal are > > driving at (or at least in that direction. > > > > Right now only RDMA and DAX FS's are supported. Other users of GUP > > will still fail on a DAX file and regular files will still be at > > risk.[2] > > > > I've pushed this work (based 5.3-rc3+ (33920f1ec5bf)) here[3]: > > > > https://github.com/weiny2/linux-kernel/tree/linus-rdmafsdax-b0-v3 > > > > I think the most relevant patch to this conversation is: > > > > https://github.com/weiny2/linux- > kernel/commit/5d377653ba5cf11c3b716f90 > > 4b057bee6641aaf6 > > > > I stole Jans suggestion for a name as the name I used while > > prototyping was pretty bad... So Thanks Jan... ;-) > > For your function, I'd choose a name like vaddr_pin_leased_pages() so that > association with a lease is clear from the name :) My gut was to just change this as you suggested. But the fact is that these calls can get used on anonymous pages as well. So the "leased" semantic may not apply... OTOH if a file is encountered it will fail the pin... :-/ I'm going to leave it for now and get the patches submitted to the list... > Also I'd choose the > counterpart to be vaddr_unpin_leased_page[s](). Especially having put_page > in the name looks confusing to me... Ah yes, totally agree with the "pin/unpin" symmetry. I've changed from "put" to "unpin"... Thanks, Ira > > Honza > > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR
From: John Hubbard <jhubbard@nvidia.com> Hi, These are best characterized as miscellaneous conversions: many (not all) call sites that don't involve biovec or iov_iter, nor mm/. It also leaves out a few call sites that require some more work. These are mostly pretty simple ones. It's probably best to send all of these via Andrew's -mm tree, assuming that there are no significant merge conflicts with ongoing work in other trees (which I doubt, given that these are small changes). These patches apply to the latest linux.git. Patch #1 is also already in Andrew's tree, but given the broad non-linux-mm Cc list, I thought it would be more convenient to just include that patch here, so that people can use linux.git as the base--even though these are probably destined for linux-mm. This is part a tree-wide conversion, as described in commit fc1d8e7cca2d ("mm: introduce put_user_page*(), placeholder versions"). That commit has an extensive description of the problem and the planned steps to solve it, but the highlites are: 1) Provide put_user_page*() routines, intended to be used for releasing pages that were pinned via get_user_pages*(). 2) Convert all of the call sites for get_user_pages*(), to invoke put_user_page*(), instead of put_page(). This involves dozens of call sites, and will take some time. 3) After (2) is complete, use get_user_pages*() and put_user_page*() to implement tracking of these pages. This tracking will be separate from the existing struct page refcounting. 4) Use the tracking and identification of these pages, to implement special handling (especially in writeback paths) when the pages are backed by a filesystem. And a few references, also from that commit: [1] https://lwn.net/Articles/774411/ : "DMA and get_user_pages()" [2] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()" Ira Weiny (1): fs/binfmt_elf: convert put_page() to put_user_page*() John Hubbard (33): mm/gup: add make_dirty arg to put_user_pages_dirty_lock() net/rds: convert put_page() to put_user_page*() net/ceph: convert put_page() to put_user_page*() x86/kvm: convert put_page() to put_user_page*() drm/etnaviv: convert release_pages() to put_user_pages() drm/i915: convert put_page() to put_user_page*() drm/radeon: convert put_page() to put_user_page*() media/ivtv: convert put_page() to put_user_page*() media/v4l2-core/mm: convert put_page() to put_user_page*() genwqe: convert put_page() to put_user_page*() scif: convert put_page() to put_user_page*() vmci: convert put_page() to put_user_page*() rapidio: convert put_page() to put_user_page*() oradax: convert put_page() to put_user_page*() staging/vc04_services: convert put_page() to put_user_page*() drivers/tee: convert put_page() to put_user_page*() vfio: convert put_page() to put_user_page*() fbdev/pvr2fb: convert put_page() to put_user_page*() fsl_hypervisor: convert put_page() to put_user_page*() xen: convert put_page() to put_user_page*() fs/exec.c: convert put_page() to put_user_page*() orangefs: convert put_page() to put_user_page*() uprobes: convert put_page() to put_user_page*() futex: convert put_page() to put_user_page*() mm/frame_vector.c: convert put_page() to put_user_page*() mm/gup_benchmark.c: convert put_page() to put_user_page*() mm/memory.c: convert put_page() to put_user_page*() mm/madvise.c: convert put_page() to put_user_page*() mm/process_vm_access.c: convert put_page() to put_user_page*() crypt: convert put_page() to put_user_page*() nfs: convert put_page() to put_user_page*() goldfish_pipe: convert put_page() to put_user_page*() kernel/events/core.c: convert put_page() to put_user_page*() arch/x86/kvm/svm.c | 4 +- crypto/af_alg.c | 7 +- drivers/gpu/drm/etnaviv/etnaviv_gem.c | 4 +- drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 9 +- drivers/gpu/drm/radeon/radeon_ttm.c | 2 +- drivers/infiniband/core/umem.c | 5 +- drivers/infiniband/hw/hfi1/user_pages.c | 5 +- drivers/infiniband/hw/qib/qib_user_pages.c | 5 +- drivers/infiniband/hw/usnic/usnic_uiom.c | 5 +- drivers/infiniband/sw/siw/siw_mem.c | 10 +- drivers/media/pci/ivtv/ivtv-udma.c | 14 +-- drivers/media/pci/ivtv/ivtv-yuv.c | 10 +- drivers/media/v4l2-core/videobuf-dma-sg.c | 3 +- drivers/misc/genwqe/card_utils.c | 17 +-- drivers/misc/mic/scif/scif_rma.c | 17 ++- drivers/misc/vmw_vmci/vmci_context.c | 2 +- drivers/misc/vmw_vmci/vmci_queue_pair.c | 11 +- drivers/platform/goldfish/goldfish_pipe.c | 9 +- drivers/rapidio/devices/rio_mport_cdev.c | 9 +- drivers/sbus/char/oradax.c | 2 +- .../interface/vchiq_arm/vchiq_2835_arm.c | 10 +- drivers/tee/tee_shm.c | 10 +- drivers/vfio/vfio_iommu_type1.c | 8 +- drivers/video/fbdev/pvr2fb.c | 3 +- drivers/virt/fsl_hypervisor.c | 7 +- drivers/xen/gntdev.c | 5 +- drivers/xen/privcmd.c | 7 +- fs/binfmt_elf.c | 2 +- fs/binfmt_elf_fdpic.c | 2 +- fs/exec.c | 2 +- fs/nfs/direct.c | 4 +- fs/orangefs/orangefs-bufmap.c | 7 +- include/linux/mm.h | 5 +- kernel/events/core.c | 2 +- kernel/events/uprobes.c | 6 +- kernel/futex.c | 10 +- mm/frame_vector.c | 4 +- mm/gup.c | 115 ++++++++---------- mm/gup_benchmark.c | 2 +- mm/madvise.c | 2 +- mm/memory.c | 2 +- mm/process_vm_access.c | 18 +-- net/ceph/pagevec.c | 8 +- net/rds/info.c | 5 +- net/rds/message.c | 2 +- net/rds/rdma.c | 15 ++- virt/kvm/kvm_main.c | 4 +- 47 files changed, 151 insertions(+), 266 deletions(-)