Message ID | 20190805222019.28592-2-jhubbard@nvidia.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | mm/: 3 more put_user_page() conversions | expand |
On Mon 05-08-19 15:20:17, john.hubbard@gmail.com wrote: > From: John Hubbard <jhubbard@nvidia.com> > > For pages that were retained via get_user_pages*(), release those pages > via the new put_user_page*() routines, instead of via put_page() or > release_pages(). Hmm, this is an interesting code path. There seems to be a mix of pages in the game. We get one page via follow_page_mask but then other pages in the range are filled by __munlock_pagevec_fill and that does a direct pte walk. Is using put_user_page correct in this case? Could you explain why in the changelog? > This is part a tree-wide conversion, as described in commit fc1d8e7cca2d > ("mm: introduce put_user_page*(), placeholder versions"). > > Cc: Dan Williams <dan.j.williams@intel.com> > Cc: Daniel Black <daniel@linux.ibm.com> > Cc: Jan Kara <jack@suse.cz> > Cc: Jérôme Glisse <jglisse@redhat.com> > Cc: Matthew Wilcox <willy@infradead.org> > Cc: Mike Kravetz <mike.kravetz@oracle.com> > Signed-off-by: John Hubbard <jhubbard@nvidia.com> > --- > mm/mlock.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/mm/mlock.c b/mm/mlock.c > index a90099da4fb4..b980e6270e8a 100644 > --- a/mm/mlock.c > +++ b/mm/mlock.c > @@ -345,7 +345,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone) > get_page(page); /* for putback_lru_page() */ > __munlock_isolated_page(page); > unlock_page(page); > - put_page(page); /* from follow_page_mask() */ > + put_user_page(page); /* from follow_page_mask() */ > } > } > } > @@ -467,7 +467,7 @@ void munlock_vma_pages_range(struct vm_area_struct *vma, > if (page && !IS_ERR(page)) { > if (PageTransTail(page)) { > VM_BUG_ON_PAGE(PageMlocked(page), page); > - put_page(page); /* follow_page_mask() */ > + put_user_page(page); /* follow_page_mask() */ > } else if (PageTransHuge(page)) { > lock_page(page); > /* > @@ -478,7 +478,7 @@ void munlock_vma_pages_range(struct vm_area_struct *vma, > */ > page_mask = munlock_vma_page(page); > unlock_page(page); > - put_page(page); /* follow_page_mask() */ > + put_user_page(page); /* follow_page_mask() */ > } else { > /* > * Non-huge pages are handled in batches via > -- > 2.22.0
On 8/7/19 4:01 AM, Michal Hocko wrote: > On Mon 05-08-19 15:20:17, john.hubbard@gmail.com wrote: >> From: John Hubbard <jhubbard@nvidia.com> >> >> For pages that were retained via get_user_pages*(), release those pages >> via the new put_user_page*() routines, instead of via put_page() or >> release_pages(). > > Hmm, this is an interesting code path. There seems to be a mix of pages > in the game. We get one page via follow_page_mask but then other pages > in the range are filled by __munlock_pagevec_fill and that does a direct > pte walk. Is using put_user_page correct in this case? Could you explain > why in the changelog? > Actually, I think follow_page_mask() gets all the pages, right? And the get_page() in __munlock_pagevec_fill() is there to allow a pagevec_release() later. But I still think I mighthave missed an error case, because the pvec_putback in __munlock_pagevec() is never doing put_user_page() on the put-backed pages. Let me sort through this one more time and maybe I'll need to actually change the code. And either way, comments and changelog will need some notes, agreed. thanks,
On Wed 07-08-19 16:32:08, John Hubbard wrote: > On 8/7/19 4:01 AM, Michal Hocko wrote: > > On Mon 05-08-19 15:20:17, john.hubbard@gmail.com wrote: > >> From: John Hubbard <jhubbard@nvidia.com> > >> > >> For pages that were retained via get_user_pages*(), release those pages > >> via the new put_user_page*() routines, instead of via put_page() or > >> release_pages(). > > > > Hmm, this is an interesting code path. There seems to be a mix of pages > > in the game. We get one page via follow_page_mask but then other pages > > in the range are filled by __munlock_pagevec_fill and that does a direct > > pte walk. Is using put_user_page correct in this case? Could you explain > > why in the changelog? > > > > Actually, I think follow_page_mask() gets all the pages, right? And the > get_page() in __munlock_pagevec_fill() is there to allow a pagevec_release() > later. Maybe I am misreading the code (looking at Linus tree) but munlock_vma_pages_range calls follow_page for the start address and then if not THP tries to fill up the pagevec with few more pages (up to end), do the shortcut via manual pte walk as an optimization and use generic get_page there.
On 8/8/19 8:21 AM, Michal Hocko wrote: > On Wed 07-08-19 16:32:08, John Hubbard wrote: >> On 8/7/19 4:01 AM, Michal Hocko wrote: >>> On Mon 05-08-19 15:20:17, john.hubbard@gmail.com wrote: >>>> From: John Hubbard <jhubbard@nvidia.com> >>>> >>>> For pages that were retained via get_user_pages*(), release those pages >>>> via the new put_user_page*() routines, instead of via put_page() or >>>> release_pages(). >>> >>> Hmm, this is an interesting code path. There seems to be a mix of pages >>> in the game. We get one page via follow_page_mask but then other pages >>> in the range are filled by __munlock_pagevec_fill and that does a direct >>> pte walk. Is using put_user_page correct in this case? Could you explain >>> why in the changelog? >>> >> >> Actually, I think follow_page_mask() gets all the pages, right? And the >> get_page() in __munlock_pagevec_fill() is there to allow a pagevec_release() >> later. > > Maybe I am misreading the code (looking at Linus tree) but munlock_vma_pages_range > calls follow_page for the start address and then if not THP tries to > fill up the pagevec with few more pages (up to end), do the shortcut > via manual pte walk as an optimization and use generic get_page there. That's true. However, I'm not sure munlocking is where the put_user_page() machinery is intended to be used anyway? These are short-term pins for struct page manipulation, not e.g. dirtying of page contents. Reading commit fc1d8e7cca2d I don't think this case falls within the reasoning there. Perhaps not all GUP users should be converted to the planned separate GUP tracking, and instead we should have a GUP/follow_page_mask() variant that keeps using get_page/put_page?
On 8/8/19 4:09 AM, Vlastimil Babka wrote: > On 8/8/19 8:21 AM, Michal Hocko wrote: >> On Wed 07-08-19 16:32:08, John Hubbard wrote: >>> On 8/7/19 4:01 AM, Michal Hocko wrote: >>>> On Mon 05-08-19 15:20:17, john.hubbard@gmail.com wrote: >>>>> From: John Hubbard <jhubbard@nvidia.com> >>> Actually, I think follow_page_mask() gets all the pages, right? And the >>> get_page() in __munlock_pagevec_fill() is there to allow a pagevec_release() >>> later. >> >> Maybe I am misreading the code (looking at Linus tree) but munlock_vma_pages_range >> calls follow_page for the start address and then if not THP tries to >> fill up the pagevec with few more pages (up to end), do the shortcut >> via manual pte walk as an optimization and use generic get_page there. > Yes, I see it finally, thanks. :) > That's true. However, I'm not sure munlocking is where the > put_user_page() machinery is intended to be used anyway? These are > short-term pins for struct page manipulation, not e.g. dirtying of page > contents. Reading commit fc1d8e7cca2d I don't think this case falls > within the reasoning there. Perhaps not all GUP users should be > converted to the planned separate GUP tracking, and instead we should > have a GUP/follow_page_mask() variant that keeps using get_page/put_page? > Interesting. So far, the approach has been to get all the gup callers to release via put_user_page(), but if we add in Jan's and Ira's vaddr_pin_pages() wrapper, then maybe we could leave some sites unconverted. However, in order to do so, we would have to change things so that we have one set of APIs (gup) that do *not* increment a pin count, and another set (vaddr_pin_pages) that do. Is that where we want to go...? I have a tracking patch that only deals with gup/pup. I could post as an RFC, but I think it might just muddy the waters at this point, anyway it's this one: https://github.com/johnhubbard/linux/commit/a0fb73ce0a39c74f0d1fb6bd9d866f660f762eae thanks,
On 8/8/19 12:20 PM, John Hubbard wrote: > On 8/8/19 4:09 AM, Vlastimil Babka wrote: >> On 8/8/19 8:21 AM, Michal Hocko wrote: >>> On Wed 07-08-19 16:32:08, John Hubbard wrote: >>>> On 8/7/19 4:01 AM, Michal Hocko wrote: >>>>> On Mon 05-08-19 15:20:17, john.hubbard@gmail.com wrote: >>>>>> From: John Hubbard <jhubbard@nvidia.com> >>>> Actually, I think follow_page_mask() gets all the pages, right? And the >>>> get_page() in __munlock_pagevec_fill() is there to allow a pagevec_release() >>>> later. >>> >>> Maybe I am misreading the code (looking at Linus tree) but munlock_vma_pages_range >>> calls follow_page for the start address and then if not THP tries to >>> fill up the pagevec with few more pages (up to end), do the shortcut >>> via manual pte walk as an optimization and use generic get_page there. >> > > Yes, I see it finally, thanks. :) > >> That's true. However, I'm not sure munlocking is where the >> put_user_page() machinery is intended to be used anyway? These are >> short-term pins for struct page manipulation, not e.g. dirtying of page >> contents. Reading commit fc1d8e7cca2d I don't think this case falls >> within the reasoning there. Perhaps not all GUP users should be >> converted to the planned separate GUP tracking, and instead we should >> have a GUP/follow_page_mask() variant that keeps using get_page/put_page? >> > > Interesting. So far, the approach has been to get all the gup callers to > release via put_user_page(), but if we add in Jan's and Ira's vaddr_pin_pages() > wrapper, then maybe we could leave some sites unconverted. > > However, in order to do so, we would have to change things so that we have > one set of APIs (gup) that do *not* increment a pin count, and another set > (vaddr_pin_pages) that do. > > Is that where we want to go...? > Oh, and meanwhile, I'm leaning toward a cheap fix: just use gup_fast() instead of get_page(), and also fix the releasing code. So this incremental patch, on top of the existing one, should do it: diff --git a/mm/mlock.c b/mm/mlock.c index b980e6270e8a..2ea272c6fee3 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -318,18 +318,14 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone) /* * We won't be munlocking this page in the next phase * but we still need to release the follow_page_mask() - * pin. We cannot do it under lru_lock however. If it's - * the last pin, __page_cache_release() would deadlock. + * pin. */ - pagevec_add(&pvec_putback, pvec->pages[i]); + put_user_page(pages[i]); pvec->pages[i] = NULL; } __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked); spin_unlock_irq(&zone->zone_pgdat->lru_lock); - /* Now we can release pins of pages that we are not munlocking */ - pagevec_release(&pvec_putback); - /* Phase 2: page munlock */ for (i = 0; i < nr; i++) { struct page *page = pvec->pages[i]; @@ -394,6 +390,8 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec, start += PAGE_SIZE; while (start < end) { struct page *page = NULL; + int ret; + pte++; if (pte_present(*pte)) page = vm_normal_page(vma, start, *pte); @@ -411,7 +409,13 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec, if (PageTransCompound(page)) break; - get_page(page); + /* + * Use get_user_pages_fast(), instead of get_page() so that the + * releasing code can unconditionally call put_user_page(). + */ + ret = get_user_pages_fast(start, 1, 0, &page); + if (ret != 1) + break; /* * Increase the address that will be returned *before* the * eventual break due to pvec becoming full by adding the page thanks,
On Thu, Aug 08, 2019 at 03:59:15PM -0700, John Hubbard wrote: > On 8/8/19 12:20 PM, John Hubbard wrote: > > On 8/8/19 4:09 AM, Vlastimil Babka wrote: > >> On 8/8/19 8:21 AM, Michal Hocko wrote: > >>> On Wed 07-08-19 16:32:08, John Hubbard wrote: > >>>> On 8/7/19 4:01 AM, Michal Hocko wrote: > >>>>> On Mon 05-08-19 15:20:17, john.hubbard@gmail.com wrote: > >>>>>> From: John Hubbard <jhubbard@nvidia.com> > >>>> Actually, I think follow_page_mask() gets all the pages, right? And the > >>>> get_page() in __munlock_pagevec_fill() is there to allow a pagevec_release() > >>>> later. > >>> > >>> Maybe I am misreading the code (looking at Linus tree) but munlock_vma_pages_range > >>> calls follow_page for the start address and then if not THP tries to > >>> fill up the pagevec with few more pages (up to end), do the shortcut > >>> via manual pte walk as an optimization and use generic get_page there. > >> > > > > Yes, I see it finally, thanks. :) > > > >> That's true. However, I'm not sure munlocking is where the > >> put_user_page() machinery is intended to be used anyway? These are > >> short-term pins for struct page manipulation, not e.g. dirtying of page > >> contents. Reading commit fc1d8e7cca2d I don't think this case falls > >> within the reasoning there. Perhaps not all GUP users should be > >> converted to the planned separate GUP tracking, and instead we should > >> have a GUP/follow_page_mask() variant that keeps using get_page/put_page? > >> > > > > Interesting. So far, the approach has been to get all the gup callers to > > release via put_user_page(), but if we add in Jan's and Ira's vaddr_pin_pages() > > wrapper, then maybe we could leave some sites unconverted. > > > > However, in order to do so, we would have to change things so that we have > > one set of APIs (gup) that do *not* increment a pin count, and another set > > (vaddr_pin_pages) that do. > > > > Is that where we want to go...? > > > > Oh, and meanwhile, I'm leaning toward a cheap fix: just use gup_fast() instead > of get_page(), and also fix the releasing code. So this incremental patch, on > top of the existing one, should do it: > > diff --git a/mm/mlock.c b/mm/mlock.c > index b980e6270e8a..2ea272c6fee3 100644 > --- a/mm/mlock.c > +++ b/mm/mlock.c > @@ -318,18 +318,14 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone) > /* > * We won't be munlocking this page in the next phase > * but we still need to release the follow_page_mask() > - * pin. We cannot do it under lru_lock however. If it's > - * the last pin, __page_cache_release() would deadlock. > + * pin. > */ > - pagevec_add(&pvec_putback, pvec->pages[i]); > + put_user_page(pages[i]); > pvec->pages[i] = NULL; > } > __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked); > spin_unlock_irq(&zone->zone_pgdat->lru_lock); > > - /* Now we can release pins of pages that we are not munlocking */ > - pagevec_release(&pvec_putback); > - I'm not an expert but this skips a call to lru_add_drain(). Is that ok? > /* Phase 2: page munlock */ > for (i = 0; i < nr; i++) { > struct page *page = pvec->pages[i]; > @@ -394,6 +390,8 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec, > start += PAGE_SIZE; > while (start < end) { > struct page *page = NULL; > + int ret; > + > pte++; > if (pte_present(*pte)) > page = vm_normal_page(vma, start, *pte); > @@ -411,7 +409,13 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec, > if (PageTransCompound(page)) > break; > > - get_page(page); > + /* > + * Use get_user_pages_fast(), instead of get_page() so that the > + * releasing code can unconditionally call put_user_page(). > + */ > + ret = get_user_pages_fast(start, 1, 0, &page); > + if (ret != 1) > + break; I like the idea of making this a get/put pair but I'm feeling uneasy about how this is really supposed to work. For sure the GUP/PUP was supposed to be separate from [get|put]_page. Ira > /* > * Increase the address that will be returned *before* the > * eventual break due to pvec becoming full by adding the page > > > thanks, > -- > John Hubbard > NVIDIA
On 8/8/19 4:41 PM, Ira Weiny wrote: > On Thu, Aug 08, 2019 at 03:59:15PM -0700, John Hubbard wrote: >> On 8/8/19 12:20 PM, John Hubbard wrote: >>> On 8/8/19 4:09 AM, Vlastimil Babka wrote: >>>> On 8/8/19 8:21 AM, Michal Hocko wrote: >>>>> On Wed 07-08-19 16:32:08, John Hubbard wrote: >>>>>> On 8/7/19 4:01 AM, Michal Hocko wrote: >>>>>>> On Mon 05-08-19 15:20:17, john.hubbard@gmail.com wrote: ... >> Oh, and meanwhile, I'm leaning toward a cheap fix: just use gup_fast() instead >> of get_page(), and also fix the releasing code. So this incremental patch, on >> top of the existing one, should do it: >> >> diff --git a/mm/mlock.c b/mm/mlock.c >> index b980e6270e8a..2ea272c6fee3 100644 >> --- a/mm/mlock.c >> +++ b/mm/mlock.c >> @@ -318,18 +318,14 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone) >> /* >> * We won't be munlocking this page in the next phase >> * but we still need to release the follow_page_mask() >> - * pin. We cannot do it under lru_lock however. If it's >> - * the last pin, __page_cache_release() would deadlock. >> + * pin. >> */ >> - pagevec_add(&pvec_putback, pvec->pages[i]); >> + put_user_page(pages[i]); correction, make that: put_user_page(pvec->pages[i]); (This is not fully tested yet.) >> pvec->pages[i] = NULL; >> } >> __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked); >> spin_unlock_irq(&zone->zone_pgdat->lru_lock); >> >> - /* Now we can release pins of pages that we are not munlocking */ >> - pagevec_release(&pvec_putback); >> - > > I'm not an expert but this skips a call to lru_add_drain(). Is that ok? Yes: unless I'm missing something, there is no reason to go through lru_add_drain in this case. These are gup'd pages that are not going to get any further processing. > >> /* Phase 2: page munlock */ >> for (i = 0; i < nr; i++) { >> struct page *page = pvec->pages[i]; >> @@ -394,6 +390,8 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec, >> start += PAGE_SIZE; >> while (start < end) { >> struct page *page = NULL; >> + int ret; >> + >> pte++; >> if (pte_present(*pte)) >> page = vm_normal_page(vma, start, *pte); >> @@ -411,7 +409,13 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec, >> if (PageTransCompound(page)) >> break; >> >> - get_page(page); >> + /* >> + * Use get_user_pages_fast(), instead of get_page() so that the >> + * releasing code can unconditionally call put_user_page(). >> + */ >> + ret = get_user_pages_fast(start, 1, 0, &page); >> + if (ret != 1) >> + break; > > I like the idea of making this a get/put pair but I'm feeling uneasy about how > this is really supposed to work. > > For sure the GUP/PUP was supposed to be separate from [get|put]_page. > Actually, they both take references on the page. And it is absolutely OK to call them both on the same page. But anyway, we're not mixing them up here. If you follow the code paths, either gup or follow_page_mask() is used, and then put_user_page() releases. So...you haven't actually pointed to a bug here, right? :) thanks,
On 8/9/19 12:59 AM, John Hubbard wrote: >>> That's true. However, I'm not sure munlocking is where the >>> put_user_page() machinery is intended to be used anyway? These are >>> short-term pins for struct page manipulation, not e.g. dirtying of page >>> contents. Reading commit fc1d8e7cca2d I don't think this case falls >>> within the reasoning there. Perhaps not all GUP users should be >>> converted to the planned separate GUP tracking, and instead we should >>> have a GUP/follow_page_mask() variant that keeps using get_page/put_page? >>> >> >> Interesting. So far, the approach has been to get all the gup callers to >> release via put_user_page(), but if we add in Jan's and Ira's vaddr_pin_pages() >> wrapper, then maybe we could leave some sites unconverted. >> >> However, in order to do so, we would have to change things so that we have >> one set of APIs (gup) that do *not* increment a pin count, and another set >> (vaddr_pin_pages) that do. >> >> Is that where we want to go...? >> We already have a FOLL_LONGTERM flag, isn't that somehow related? And if it's not exactly the same thing, perhaps a new gup flag to distinguish which kind of pinning to use? > Oh, and meanwhile, I'm leaning toward a cheap fix: just use gup_fast() instead > of get_page(), and also fix the releasing code. So this incremental patch, on > top of the existing one, should do it: > ... > @@ -411,7 +409,13 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec, > if (PageTransCompound(page)) > break; > > - get_page(page); > + /* > + * Use get_user_pages_fast(), instead of get_page() so that the > + * releasing code can unconditionally call put_user_page(). > + */ > + ret = get_user_pages_fast(start, 1, 0, &page); Um the whole reason of __munlock_pagevec_fill() was to avoid the full page walk cost, which made a 14% difference, see 7a8010cd3627 ("mm: munlock: manual pte walk in fast path instead of follow_page_mask()") Replacing simple get_page() with page walk to satisfy API requirements seems rather suboptimal to me. > + if (ret != 1) > + break; > /* > * Increase the address that will be returned *before* the > * eventual break due to pvec becoming full by adding the page > > > thanks, >
On Fri 09-08-19 10:12:48, Vlastimil Babka wrote: > On 8/9/19 12:59 AM, John Hubbard wrote: > >>> That's true. However, I'm not sure munlocking is where the > >>> put_user_page() machinery is intended to be used anyway? These are > >>> short-term pins for struct page manipulation, not e.g. dirtying of page > >>> contents. Reading commit fc1d8e7cca2d I don't think this case falls > >>> within the reasoning there. Perhaps not all GUP users should be > >>> converted to the planned separate GUP tracking, and instead we should > >>> have a GUP/follow_page_mask() variant that keeps using get_page/put_page? > >>> > >> > >> Interesting. So far, the approach has been to get all the gup callers to > >> release via put_user_page(), but if we add in Jan's and Ira's vaddr_pin_pages() > >> wrapper, then maybe we could leave some sites unconverted. > >> > >> However, in order to do so, we would have to change things so that we have > >> one set of APIs (gup) that do *not* increment a pin count, and another set > >> (vaddr_pin_pages) that do. > >> > >> Is that where we want to go...? > >> > > We already have a FOLL_LONGTERM flag, isn't that somehow related? And if > it's not exactly the same thing, perhaps a new gup flag to distinguish > which kind of pinning to use? Agreed. This is a shiny example how forcing all existing gup users into the new scheme is subotimal at best. Not the mention the overal fragility mention elsewhere. I dislike the conversion even more now. Sorry if this was already discussed already but why the new pinning is not bound to FOLL_LONGTERM (ideally hidden by an interface so that users do not have to care about the flag) only?
On 8/9/19 1:23 AM, Michal Hocko wrote: > On Fri 09-08-19 10:12:48, Vlastimil Babka wrote: >> On 8/9/19 12:59 AM, John Hubbard wrote: >>>>> That's true. However, I'm not sure munlocking is where the >>>>> put_user_page() machinery is intended to be used anyway? These are >>>>> short-term pins for struct page manipulation, not e.g. dirtying of page >>>>> contents. Reading commit fc1d8e7cca2d I don't think this case falls >>>>> within the reasoning there. Perhaps not all GUP users should be >>>>> converted to the planned separate GUP tracking, and instead we should >>>>> have a GUP/follow_page_mask() variant that keeps using get_page/put_page? >>>>> >>>> >>>> Interesting. So far, the approach has been to get all the gup callers to >>>> release via put_user_page(), but if we add in Jan's and Ira's vaddr_pin_pages() >>>> wrapper, then maybe we could leave some sites unconverted. >>>> >>>> However, in order to do so, we would have to change things so that we have >>>> one set of APIs (gup) that do *not* increment a pin count, and another set >>>> (vaddr_pin_pages) that do. >>>> >>>> Is that where we want to go...? >>>> >> >> We already have a FOLL_LONGTERM flag, isn't that somehow related? And if >> it's not exactly the same thing, perhaps a new gup flag to distinguish >> which kind of pinning to use? > > Agreed. This is a shiny example how forcing all existing gup users into > the new scheme is subotimal at best. Not the mention the overal > fragility mention elsewhere. I dislike the conversion even more now. > > Sorry if this was already discussed already but why the new pinning is > not bound to FOLL_LONGTERM (ideally hidden by an interface so that users > do not have to care about the flag) only? > Oh, it's been discussed alright, but given how some of the discussions have gone, I certainly am not surprised that there are still questions and criticisms! Especially since I may have misunderstood some of the points, along the way. It's been quite a merry go round. :) Anyway, what I'm hearing now is: for gup(FOLL_LONGTERM), apply the pinned tracking. And therefore only do put_user_page() on pages that were pinned with FOLL_LONGTERM. For short term pins, let the locking do what it will: things can briefly block and all will be well. Also, that may or may not come with a wrapper function, courtesy of Jan and Ira. Is that about right? It's late here, but I don't immediately recall any problems with doing it that way... thanks,
On Fri 09-08-19 02:05:15, John Hubbard wrote: > On 8/9/19 1:23 AM, Michal Hocko wrote: > > On Fri 09-08-19 10:12:48, Vlastimil Babka wrote: > > > On 8/9/19 12:59 AM, John Hubbard wrote: > > > > > > That's true. However, I'm not sure munlocking is where the > > > > > > put_user_page() machinery is intended to be used anyway? These are > > > > > > short-term pins for struct page manipulation, not e.g. dirtying of page > > > > > > contents. Reading commit fc1d8e7cca2d I don't think this case falls > > > > > > within the reasoning there. Perhaps not all GUP users should be > > > > > > converted to the planned separate GUP tracking, and instead we should > > > > > > have a GUP/follow_page_mask() variant that keeps using get_page/put_page? > > > > > > > > > > Interesting. So far, the approach has been to get all the gup callers to > > > > > release via put_user_page(), but if we add in Jan's and Ira's vaddr_pin_pages() > > > > > wrapper, then maybe we could leave some sites unconverted. > > > > > > > > > > However, in order to do so, we would have to change things so that we have > > > > > one set of APIs (gup) that do *not* increment a pin count, and another set > > > > > (vaddr_pin_pages) that do. > > > > > > > > > > Is that where we want to go...? > > > > > > > > > > > We already have a FOLL_LONGTERM flag, isn't that somehow related? And if > > > it's not exactly the same thing, perhaps a new gup flag to distinguish > > > which kind of pinning to use? > > > > Agreed. This is a shiny example how forcing all existing gup users into > > the new scheme is subotimal at best. Not the mention the overal > > fragility mention elsewhere. I dislike the conversion even more now. > > > > Sorry if this was already discussed already but why the new pinning is > > not bound to FOLL_LONGTERM (ideally hidden by an interface so that users > > do not have to care about the flag) only? > > > > Oh, it's been discussed alright, but given how some of the discussions have gone, > I certainly am not surprised that there are still questions and criticisms! > Especially since I may have misunderstood some of the points, along the way. > It's been quite a merry go round. :) Yeah, I've tried to follow them but just gave up at some point. > Anyway, what I'm hearing now is: for gup(FOLL_LONGTERM), apply the pinned tracking. > And therefore only do put_user_page() on pages that were pinned with > FOLL_LONGTERM. For short term pins, let the locking do what it will: > things can briefly block and all will be well. > > Also, that may or may not come with a wrapper function, courtesy of Jan > and Ira. > > Is that about right? It's late here, but I don't immediately recall any > problems with doing it that way... Yes that makes more sense to me. Whoever needs that tracking should opt-in for it. Otherwise you just risk problems like the one discussed in the mlock path (because we do a strange stuff in the name of performance) and a never ending whack a mole where new users do not follow the new API usage and that results in all sorts of weird issues. Thanks!
On Fri 09-08-19 10:23:07, Michal Hocko wrote: > On Fri 09-08-19 10:12:48, Vlastimil Babka wrote: > > On 8/9/19 12:59 AM, John Hubbard wrote: > > >>> That's true. However, I'm not sure munlocking is where the > > >>> put_user_page() machinery is intended to be used anyway? These are > > >>> short-term pins for struct page manipulation, not e.g. dirtying of page > > >>> contents. Reading commit fc1d8e7cca2d I don't think this case falls > > >>> within the reasoning there. Perhaps not all GUP users should be > > >>> converted to the planned separate GUP tracking, and instead we should > > >>> have a GUP/follow_page_mask() variant that keeps using get_page/put_page? > > >>> > > >> > > >> Interesting. So far, the approach has been to get all the gup callers to > > >> release via put_user_page(), but if we add in Jan's and Ira's vaddr_pin_pages() > > >> wrapper, then maybe we could leave some sites unconverted. > > >> > > >> However, in order to do so, we would have to change things so that we have > > >> one set of APIs (gup) that do *not* increment a pin count, and another set > > >> (vaddr_pin_pages) that do. > > >> > > >> Is that where we want to go...? > > >> > > > > We already have a FOLL_LONGTERM flag, isn't that somehow related? And if > > it's not exactly the same thing, perhaps a new gup flag to distinguish > > which kind of pinning to use? > > Agreed. This is a shiny example how forcing all existing gup users into > the new scheme is subotimal at best. Not the mention the overal > fragility mention elsewhere. I dislike the conversion even more now. > > Sorry if this was already discussed already but why the new pinning is > not bound to FOLL_LONGTERM (ideally hidden by an interface so that users > do not have to care about the flag) only? The new tracking cannot be bound to FOLL_LONGTERM. Anything that gets page reference and then touches page data (e.g. direct IO) needs the new kind of tracking so that filesystem knows someone is messing with the page data. So what John is trying to address is a different (although related) problem to someone pinning a page for a long time. In principle, I'm not strongly opposed to a new FOLL flag to determine whether a pin or an ordinary page reference will be acquired at least as an internal implementation detail inside mm/gup.c. But I would really like to discourage new GUP users taking just page reference as the most clueless users (drivers) usually need a pin in the sense John implements. So in terms of API I'd strongly prefer to deprecate GUP as an API, provide vaddr_pin_pages() for drivers to get their buffer pages pinned and then for those few users who really know what they are doing (and who are not interested in page contents) we can have APIs like follow_page() to get a page reference from a virtual address. Honza
On Fri 09-08-19 15:58:13, Jan Kara wrote: > On Fri 09-08-19 10:23:07, Michal Hocko wrote: > > On Fri 09-08-19 10:12:48, Vlastimil Babka wrote: > > > On 8/9/19 12:59 AM, John Hubbard wrote: > > > >>> That's true. However, I'm not sure munlocking is where the > > > >>> put_user_page() machinery is intended to be used anyway? These are > > > >>> short-term pins for struct page manipulation, not e.g. dirtying of page > > > >>> contents. Reading commit fc1d8e7cca2d I don't think this case falls > > > >>> within the reasoning there. Perhaps not all GUP users should be > > > >>> converted to the planned separate GUP tracking, and instead we should > > > >>> have a GUP/follow_page_mask() variant that keeps using get_page/put_page? > > > >>> > > > >> > > > >> Interesting. So far, the approach has been to get all the gup callers to > > > >> release via put_user_page(), but if we add in Jan's and Ira's vaddr_pin_pages() > > > >> wrapper, then maybe we could leave some sites unconverted. > > > >> > > > >> However, in order to do so, we would have to change things so that we have > > > >> one set of APIs (gup) that do *not* increment a pin count, and another set > > > >> (vaddr_pin_pages) that do. > > > >> > > > >> Is that where we want to go...? > > > >> > > > > > > We already have a FOLL_LONGTERM flag, isn't that somehow related? And if > > > it's not exactly the same thing, perhaps a new gup flag to distinguish > > > which kind of pinning to use? > > > > Agreed. This is a shiny example how forcing all existing gup users into > > the new scheme is subotimal at best. Not the mention the overal > > fragility mention elsewhere. I dislike the conversion even more now. > > > > Sorry if this was already discussed already but why the new pinning is > > not bound to FOLL_LONGTERM (ideally hidden by an interface so that users > > do not have to care about the flag) only? > > The new tracking cannot be bound to FOLL_LONGTERM. Anything that gets page > reference and then touches page data (e.g. direct IO) needs the new kind of > tracking so that filesystem knows someone is messing with the page data. > So what John is trying to address is a different (although related) problem > to someone pinning a page for a long time. OK, I see. Thanks for the clarification. > In principle, I'm not strongly opposed to a new FOLL flag to determine > whether a pin or an ordinary page reference will be acquired at least as an > internal implementation detail inside mm/gup.c. But I would really like to > discourage new GUP users taking just page reference as the most clueless > users (drivers) usually need a pin in the sense John implements. So in > terms of API I'd strongly prefer to deprecate GUP as an API, provide > vaddr_pin_pages() for drivers to get their buffer pages pinned and then for > those few users who really know what they are doing (and who are not > interested in page contents) we can have APIs like follow_page() to get a > page reference from a virtual address. Yes, going with a dedicated API sounds much better to me. Whether a dedicated FOLL flag is used internally is not that important. I am also for making the underlying gup to be really internal to the core kernel. Thanks!
> On Fri 09-08-19 15:58:13, Jan Kara wrote: > > On Fri 09-08-19 10:23:07, Michal Hocko wrote: > > > On Fri 09-08-19 10:12:48, Vlastimil Babka wrote: > > > > On 8/9/19 12:59 AM, John Hubbard wrote: > > > > >>> That's true. However, I'm not sure munlocking is where the > > > > >>> put_user_page() machinery is intended to be used anyway? These > > > > >>> are short-term pins for struct page manipulation, not e.g. > > > > >>> dirtying of page contents. Reading commit fc1d8e7cca2d I don't > > > > >>> think this case falls within the reasoning there. Perhaps not > > > > >>> all GUP users should be converted to the planned separate GUP > > > > >>> tracking, and instead we should have a GUP/follow_page_mask() > variant that keeps using get_page/put_page? > > > > >>> > > > > >> > > > > >> Interesting. So far, the approach has been to get all the gup > > > > >> callers to release via put_user_page(), but if we add in Jan's > > > > >> and Ira's vaddr_pin_pages() wrapper, then maybe we could leave > some sites unconverted. > > > > >> > > > > >> However, in order to do so, we would have to change things so > > > > >> that we have one set of APIs (gup) that do *not* increment a > > > > >> pin count, and another set > > > > >> (vaddr_pin_pages) that do. > > > > >> > > > > >> Is that where we want to go...? > > > > >> > > > > > > > > We already have a FOLL_LONGTERM flag, isn't that somehow related? > > > > And if it's not exactly the same thing, perhaps a new gup flag to > > > > distinguish which kind of pinning to use? > > > > > > Agreed. This is a shiny example how forcing all existing gup users > > > into the new scheme is subotimal at best. Not the mention the overal > > > fragility mention elsewhere. I dislike the conversion even more now. > > > > > > Sorry if this was already discussed already but why the new pinning > > > is not bound to FOLL_LONGTERM (ideally hidden by an interface so > > > that users do not have to care about the flag) only? > > > > The new tracking cannot be bound to FOLL_LONGTERM. Anything that gets > > page reference and then touches page data (e.g. direct IO) needs the > > new kind of tracking so that filesystem knows someone is messing with the > page data. > > So what John is trying to address is a different (although related) > > problem to someone pinning a page for a long time. > > OK, I see. Thanks for the clarification. Not to beat a dead horse but FOLL_LONGTERM also has implications now for CMA pages which may or may not (I'm not an expert on those pages) need special tracking. > > > In principle, I'm not strongly opposed to a new FOLL flag to determine > > whether a pin or an ordinary page reference will be acquired at least > > as an internal implementation detail inside mm/gup.c. But I would > > really like to discourage new GUP users taking just page reference as > > the most clueless users (drivers) usually need a pin in the sense John > > implements. So in terms of API I'd strongly prefer to deprecate GUP as > > an API, provide > > vaddr_pin_pages() for drivers to get their buffer pages pinned and > > then for those few users who really know what they are doing (and who > > are not interested in page contents) we can have APIs like > > follow_page() to get a page reference from a virtual address. > > Yes, going with a dedicated API sounds much better to me. Whether a > dedicated FOLL flag is used internally is not that important. I am also for > making the underlying gup to be really internal to the core kernel. +1 I think GUP is too confusing. I've been working with the details for many months now and it continues to confuse me. :-( My patches should be posted soon (based on mmotm) and I'll have my flame suit on so we can debate the interface. Ira
> > On 8/8/19 4:41 PM, Ira Weiny wrote: > > On Thu, Aug 08, 2019 at 03:59:15PM -0700, John Hubbard wrote: > >> On 8/8/19 12:20 PM, John Hubbard wrote: > >>> On 8/8/19 4:09 AM, Vlastimil Babka wrote: > >>>> On 8/8/19 8:21 AM, Michal Hocko wrote: > >>>>> On Wed 07-08-19 16:32:08, John Hubbard wrote: > >>>>>> On 8/7/19 4:01 AM, Michal Hocko wrote: > >>>>>>> On Mon 05-08-19 15:20:17, john.hubbard@gmail.com wrote: > ... > >> Oh, and meanwhile, I'm leaning toward a cheap fix: just use > >> gup_fast() instead of get_page(), and also fix the releasing code. So > >> this incremental patch, on top of the existing one, should do it: > >> > >> diff --git a/mm/mlock.c b/mm/mlock.c > >> index b980e6270e8a..2ea272c6fee3 100644 > >> --- a/mm/mlock.c > >> +++ b/mm/mlock.c > >> @@ -318,18 +318,14 @@ static void __munlock_pagevec(struct pagevec > *pvec, struct zone *zone) > >> /* > >> * We won't be munlocking this page in the next phase > >> * but we still need to release the follow_page_mask() > >> - * pin. We cannot do it under lru_lock however. If it's > >> - * the last pin, __page_cache_release() would deadlock. > >> + * pin. > >> */ > >> - pagevec_add(&pvec_putback, pvec->pages[i]); > >> + put_user_page(pages[i]); > > correction, make that: > put_user_page(pvec->pages[i]); > > (This is not fully tested yet.) > > >> pvec->pages[i] = NULL; > >> } > >> __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked); > >> spin_unlock_irq(&zone->zone_pgdat->lru_lock); > >> > >> - /* Now we can release pins of pages that we are not munlocking */ > >> - pagevec_release(&pvec_putback); > >> - > > > > I'm not an expert but this skips a call to lru_add_drain(). Is that ok? > > Yes: unless I'm missing something, there is no reason to go through > lru_add_drain in this case. These are gup'd pages that are not going to get > any further processing. > > > > >> /* Phase 2: page munlock */ > >> for (i = 0; i < nr; i++) { > >> struct page *page = pvec->pages[i]; @@ -394,6 +390,8 > >> @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec, > >> start += PAGE_SIZE; > >> while (start < end) { > >> struct page *page = NULL; > >> + int ret; > >> + > >> pte++; > >> if (pte_present(*pte)) > >> page = vm_normal_page(vma, start, *pte); @@ > >> -411,7 +409,13 @@ static unsigned long __munlock_pagevec_fill(struct > pagevec *pvec, > >> if (PageTransCompound(page)) > >> break; > >> > >> - get_page(page); > >> + /* > >> + * Use get_user_pages_fast(), instead of get_page() so that the > >> + * releasing code can unconditionally call put_user_page(). > >> + */ > >> + ret = get_user_pages_fast(start, 1, 0, &page); > >> + if (ret != 1) > >> + break; > > > > I like the idea of making this a get/put pair but I'm feeling uneasy > > about how this is really supposed to work. > > > > For sure the GUP/PUP was supposed to be separate from [get|put]_page. > > > > Actually, they both take references on the page. And it is absolutely OK to call > them both on the same page. > > But anyway, we're not mixing them up here. If you follow the code paths, > either gup or follow_page_mask() is used, and then put_user_page() > releases. > > So...you haven't actually pointed to a bug here, right? :) No... no bug. sorry this was just a general comment on semantics. But in keeping with the semantics discussion it is further confusing that follow_page_mask() is also mixed in here... Which is where my comment was driving toward. If you call GUP there should be a PUP. Get_page/put_page... follow_page/unfollow_page... ??? ;-) Ok now I'm off the rails... but that was the point... I think Jan and Michal are onto something here WRT internal vs external interfaces. Ira
On 8/9/19 11:14 AM, Weiny, Ira wrote: >> On Fri 09-08-19 15:58:13, Jan Kara wrote: >>> On Fri 09-08-19 10:23:07, Michal Hocko wrote: >>>> On Fri 09-08-19 10:12:48, Vlastimil Babka wrote: >>>>> On 8/9/19 12:59 AM, John Hubbard wrote: ... >>> In principle, I'm not strongly opposed to a new FOLL flag to determine >>> whether a pin or an ordinary page reference will be acquired at least >>> as an internal implementation detail inside mm/gup.c. But I would >>> really like to discourage new GUP users taking just page reference as >>> the most clueless users (drivers) usually need a pin in the sense John >>> implements. So in terms of API I'd strongly prefer to deprecate GUP as >>> an API, provide >>> vaddr_pin_pages() for drivers to get their buffer pages pinned and >>> then for those few users who really know what they are doing (and who >>> are not interested in page contents) we can have APIs like >>> follow_page() to get a page reference from a virtual address. >> >> Yes, going with a dedicated API sounds much better to me. Whether a >> dedicated FOLL flag is used internally is not that important. I am also for >> making the underlying gup to be really internal to the core kernel. > > +1 > > I think GUP is too confusing. I've been working with the details for many months now and it continues to confuse me. :-( > > My patches should be posted soon (based on mmotm) and I'll have my flame suit on so we can debate the interface. > OK, so: use FOLL_PIN as an internal gup flag. FOLL_PIN will get set by the new vaddr_pin_pages*() wrapper calls. Then, put_user_page*() shall only be invoked from call sites that use FOLL_PIN. With that approach in mind, I can sweep through my callsite conversion patchset and drop a few patches. There are actually quite a few patches that just want to find the page, not really operate on its data. And the conversion of the actual gup() calls can be done almost independently of the put_user_page*() conversions, if necessary (and it sounds like with your patchset, it is). btw, as part of the conversion, to make merging and call site conversion smoother, maybe it's OK to pass in FOLL_PIN to existing gup() calls, with the intent to convert them to use vaddr_pin_pages.) thanks,
diff --git a/mm/mlock.c b/mm/mlock.c index a90099da4fb4..b980e6270e8a 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -345,7 +345,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone) get_page(page); /* for putback_lru_page() */ __munlock_isolated_page(page); unlock_page(page); - put_page(page); /* from follow_page_mask() */ + put_user_page(page); /* from follow_page_mask() */ } } } @@ -467,7 +467,7 @@ void munlock_vma_pages_range(struct vm_area_struct *vma, if (page && !IS_ERR(page)) { if (PageTransTail(page)) { VM_BUG_ON_PAGE(PageMlocked(page), page); - put_page(page); /* follow_page_mask() */ + put_user_page(page); /* follow_page_mask() */ } else if (PageTransHuge(page)) { lock_page(page); /* @@ -478,7 +478,7 @@ void munlock_vma_pages_range(struct vm_area_struct *vma, */ page_mask = munlock_vma_page(page); unlock_page(page); - put_page(page); /* follow_page_mask() */ + put_user_page(page); /* follow_page_mask() */ } else { /* * Non-huge pages are handled in batches via