Message ID | 20220125033700.69705-1-peterx@redhat.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm: Fix invalid page pointer returned with FOLL_PIN gups | expand |
On 1/24/22 19:37, Peter Xu wrote: > Alex reported invalid page pointer returned with pin_user_pages_remote() from > vfio after upstream commit 4b6c33b32296 ("vfio/type1: Prepare for batched > pinning with struct vfio_batch"). This problem breaks NVIDIA vfio mdev. > > It turns out that it's not the fault of the vfio commit; however after vfio > switches to a full page buffer to store the page pointers it starts to expose > the problem easier. > > The problem is for VM_PFNMAP vmas we should normally fail with an -EFAULT then > vfio will carry on to handle the MMIO regions. However when the bug triggered, > follow_page_mask() returned -EEXIST for such a page, which will jump over the > current page, leaving that entry in **pages untouched. However the caller is > not aware of it, hence the caller will reference the page as usual even if the > pointer data can be anything. > > We had that -EEXIST logic since commit 1027e4436b6a ("mm: make GUP handle pfn > mapping unless FOLL_GET is requested") which seems very reasonable. It could > be that when we reworked GUP with FOLL_PIN we could have overlooked that > special path in commit 3faa52c03f44 ("mm/gup: track FOLL_PIN pages"), even if > that commit rightfully touched up follow_devmap_pud() on checking FOLL_PIN when > it needs to return an -EEXIST. > > Since at it, add another WARN_ON_ONCE() at the -EEXIST handling to make sure we > mustn't have **pages set when reaching there, because otherwise it means the > caller will try to read a garbage right after __get_user_pages() returns. > > Attaching the Fixes to the FOLL_PIN rework commit, as it happened later than > 1027e4436b6a. > > Cc: Andrea Arcangeli <aarcange@redhat.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Jan Kara <jack@suse.cz> > Cc: Jérôme Glisse <jglisse@redhat.com> > Cc: John Hubbard <jhubbard@nvidia.com> > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > Fixes: 3faa52c03f44 ("mm/gup: track FOLL_PIN pages") > Reported-by: Alex Williamson <alex.williamson@redhat.com> > Debugged-by: Alex Williamson <alex.williamson@redhat.com> > Signed-off-by: Peter Xu <peterx@redhat.com> > --- > mm/gup.c | 8 +++++++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/mm/gup.c b/mm/gup.c > index f0af462ac1e2..8ebc04058e97 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -440,7 +440,7 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address, > pte_t *pte, unsigned int flags) > { > /* No page to get reference */ > - if (flags & FOLL_GET) > + if (flags & (FOLL_GET | FOLL_PIN)) > return -EFAULT; Yes. This clearly fixes the problem that the patch describes, and also clearly matches up with the Fixes tag. So that's correct. > > if (flags & FOLL_TOUCH) { > @@ -1181,7 +1181,13 @@ static long __get_user_pages(struct mm_struct *mm, > /* > * Proper page table entry exists, but no corresponding > * struct page. > + * > + * Warn if we jumped over even with a valid **pages. > + * It shouldn't trigger in practise, but when there's > + * buggy returns on -EEXIST we'll warn before returning > + * an invalid page pointer in the array. > */ > + WARN_ON_ONCE(pages); Here, however, I think we need to consider this a little more carefully, and attempt to actually fix up this case. It is never going to be OK here, to return a **pages array that has these little landmines of potentially uninitialized pointers. And so continuing on *at all* seems very wrong. Can we bail out at this point, without breaking the world? I think we can... Also: this part, even if it remains as is, should be a separate fix and a separate patch, IMHO. thanks,
On Wed, Jan 26, 2022 at 04:15:02PM -0800, John Hubbard wrote: > > We had that -EEXIST logic since commit 1027e4436b6a ("mm: make GUP handle pfn > > mapping unless FOLL_GET is requested") which seems very reasonable. It could > > be that when we reworked GUP with FOLL_PIN we could have overlooked that > > special path in commit 3faa52c03f44 ("mm/gup: track FOLL_PIN pages"), even if > > that commit rightfully touched up follow_devmap_pud() on checking FOLL_PIN when > > it needs to return an -EEXIST. It sounds like this commit was all about changing the behavior of follow_page() It feels like that is another ill-fated holdover from the effort to make pageless DAX that doesn't exist anymore. Can we safely drop it now? Regardless.. > > diff --git a/mm/gup.c b/mm/gup.c > > index f0af462ac1e2..8ebc04058e97 100644 > > +++ b/mm/gup.c > > @@ -440,7 +440,7 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address, > > pte_t *pte, unsigned int flags) > > { > > /* No page to get reference */ > > - if (flags & FOLL_GET) > > + if (flags & (FOLL_GET | FOLL_PIN)) > > return -EFAULT; > > Yes. This clearly fixes the problem that the patch describes, and also > clearly matches up with the Fixes tag. So that's correct. It is a really confusing though, why not just always return -EEXIST here? The caller will always see the error code and refrain from trying to pin it and unwind upwards, just the same as -EFAULT. We shouldn't need to test the flags at this point at all. > > if (flags & FOLL_TOUCH) { > > @@ -1181,7 +1181,13 @@ static long __get_user_pages(struct mm_struct *mm, > > /* > > * Proper page table entry exists, but no corresponding > > * struct page. > > + * > > + * Warn if we jumped over even with a valid **pages. > > + * It shouldn't trigger in practise, but when there's > > + * buggy returns on -EEXIST we'll warn before returning > > + * an invalid page pointer in the array. > > */ > > + WARN_ON_ONCE(pages); > > Here, however, I think we need to consider this a little more carefully, > and attempt to actually fix up this case. It is never going to be OK > here, to return a **pages array that has these little landmines of > potentially uninitialized pointers. And so continuing on *at all* seems > very wrong. Indeed, it should just be like this: @@ -1182,6 +1182,10 @@ static long __get_user_pages(struct mm_struct *mm, * Proper page table entry exists, but no corresponding * struct page. */ + if (pages) { + page = ERR_PTR(-EFAULT); + goto out; + } goto next_page; } else if (IS_ERR(page)) { ret = PTR_ERR(page); Jason
Hi, Jason, John, On Wed, Jan 26, 2022 at 08:42:06PM -0400, Jason Gunthorpe wrote: > On Wed, Jan 26, 2022 at 04:15:02PM -0800, John Hubbard wrote: > > > > We had that -EEXIST logic since commit 1027e4436b6a ("mm: make GUP handle pfn > > > mapping unless FOLL_GET is requested") which seems very reasonable. It could > > > be that when we reworked GUP with FOLL_PIN we could have overlooked that > > > special path in commit 3faa52c03f44 ("mm/gup: track FOLL_PIN pages"), even if > > > that commit rightfully touched up follow_devmap_pud() on checking FOLL_PIN when > > > it needs to return an -EEXIST. > > It sounds like this commit was all about changing the behavior of > follow_page() > > It feels like that is another ill-fated holdover from the effort to > make pageless DAX that doesn't exist anymore. > > Can we safely drop it now? Not quite familiar with the dax effort, but just to note again that this patch fixes an immediate breakage, hence IMHO we should still merge it first (not mention it's a one-liner..). > > Regardless.. > > > > diff --git a/mm/gup.c b/mm/gup.c > > > index f0af462ac1e2..8ebc04058e97 100644 > > > +++ b/mm/gup.c > > > @@ -440,7 +440,7 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address, > > > pte_t *pte, unsigned int flags) > > > { > > > /* No page to get reference */ > > > - if (flags & FOLL_GET) > > > + if (flags & (FOLL_GET | FOLL_PIN)) > > > return -EFAULT; > > > > Yes. This clearly fixes the problem that the patch describes, and also > > clearly matches up with the Fixes tag. So that's correct. > > It is a really confusing though, why not just always return -EEXIST > here? Because in current code GUP handles -EEXIST and -EFAULT differently? We do early bail out on -EFAULT. -EEXIST was first introduced in 2015 from Kirill for not failing some mlock() or mmap(MAP_POPULATE) on dax (1027e4436b6). Then in 2017 it got used again with pud-sized thp (a00cc7d9dd93d) on dax too. They seem to service the same goal and it seems to be designed that -EEXIST shouldn't fail GUP immediately. > > The caller will always see the error code and refrain from trying to > pin it and unwind upwards, just the same as -EFAULT. > > We shouldn't need to test the flags at this point at all. > > > > if (flags & FOLL_TOUCH) { > > > @@ -1181,7 +1181,13 @@ static long __get_user_pages(struct mm_struct *mm, > > > /* > > > * Proper page table entry exists, but no corresponding > > > * struct page. > > > + * > > > + * Warn if we jumped over even with a valid **pages. > > > + * It shouldn't trigger in practise, but when there's > > > + * buggy returns on -EEXIST we'll warn before returning > > > + * an invalid page pointer in the array. > > > */ > > > + WARN_ON_ONCE(pages); > > > > Here, however, I think we need to consider this a little more carefully, > > and attempt to actually fix up this case. It is never going to be OK > > here, to return a **pages array that has these little landmines of > > potentially uninitialized pointers. And so continuing on *at all* seems > > very wrong. > > Indeed, it should just be like this: > > @@ -1182,6 +1182,10 @@ static long __get_user_pages(struct mm_struct *mm, > * Proper page table entry exists, but no corresponding > * struct page. > */ > + if (pages) { > + page = ERR_PTR(-EFAULT); > + goto out; > + } > goto next_page; > } else if (IS_ERR(page)) { > ret = PTR_ERR(page); IIUC not failing -EEXIST immediately seems to be what we want. We've discussed the only (unless I overlooked something...) two sites that can return -EEXIST in follow_page_mask() context; if it is triggered somewhere else unexpectedly it is a programming error and needs fixing, imho. Hence the gup code shouldn't rely on the -EEXIST -> -EFAULT transition to work at all in any existing case. From that POV, WARN_ON_ONCE() helps better on exposing an illegal return of -EEXIST (as mentioned in the commit message) than the -EFAULT convertion, IMHO. Thanks,
On Thu, Jan 27, 2022 at 05:19:56PM +0800, Peter Xu wrote: > > > > diff --git a/mm/gup.c b/mm/gup.c > > > > index f0af462ac1e2..8ebc04058e97 100644 > > > > +++ b/mm/gup.c > > > > @@ -440,7 +440,7 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address, > > > > pte_t *pte, unsigned int flags) > > > > { > > > > /* No page to get reference */ > > > > - if (flags & FOLL_GET) > > > > + if (flags & (FOLL_GET | FOLL_PIN)) > > > > return -EFAULT; > > > > > > Yes. This clearly fixes the problem that the patch describes, and also > > > clearly matches up with the Fixes tag. So that's correct. > > > > It is a really confusing though, why not just always return -EEXIST > > here? > > Because in current code GUP handles -EEXIST and -EFAULT differently? That has nothing to do with here. We shouldn't be deciding what the top layer does way down here. Return the correct error code for what was discovered at this layer the upper loop should make the decision what it should do > We do early bail out on -EFAULT. -EEXIST was first introduced in 2015 from > Kirill for not failing some mlock() or mmap(MAP_POPULATE) on dax (1027e4436b6). > Then in 2017 it got used again with pud-sized thp (a00cc7d9dd93d) on dax too. > They seem to service the same goal and it seems to be designed that -EEXIST > shouldn't fail GUP immediately. It must fail GUP immeidately if there is a pages list. Callers that want an early failure must pass in NULL for pages, it is just that simple. It has nothing to do with the FOLL flags. A WARN_ON would be appropriate to compare the FOLL flags against the pages. eg FOLL_GET without a pages is nonsense and should be immediately aborted. On the other hand, we avoid this by construction internal to gup.c > > > Here, however, I think we need to consider this a little more carefully, > > > and attempt to actually fix up this case. It is never going to be OK > > > here, to return a **pages array that has these little landmines of > > > potentially uninitialized pointers. And so continuing on *at all* seems > > > very wrong. > > > > Indeed, it should just be like this: > > > > @@ -1182,6 +1182,10 @@ static long __get_user_pages(struct mm_struct *mm, > > * Proper page table entry exists, but no corresponding > > * struct page. > > */ > > + if (pages) { > > + page = ERR_PTR(-EFAULT); > > + goto out; > > + } > > goto next_page; > > } else if (IS_ERR(page)) { > > ret = PTR_ERR(page); > > IIUC not failing -EEXIST immediately seems to be what we want. Which is what this does, for the only case it is acceptable - a null page list. > From that POV, WARN_ON_ONCE() helps better on exposing an illegal return of > -EEXIST (as mentioned in the commit message) than the -EFAULT convertion, IMHO. Again, that is upside down, -EEXIST should not be a illegal return. It should be valid, have a defined meaning 'the vaddr exists but has no struct page' and the top loop, and only the top loop, makes the decision what to do about it. Jason
On Thu, Jan 27, 2022 at 11:25:38AM -0400, Jason Gunthorpe wrote: > On Thu, Jan 27, 2022 at 05:19:56PM +0800, Peter Xu wrote: > > > > > > diff --git a/mm/gup.c b/mm/gup.c > > > > > index f0af462ac1e2..8ebc04058e97 100644 > > > > > +++ b/mm/gup.c > > > > > @@ -440,7 +440,7 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address, > > > > > pte_t *pte, unsigned int flags) > > > > > { > > > > > /* No page to get reference */ > > > > > - if (flags & FOLL_GET) > > > > > + if (flags & (FOLL_GET | FOLL_PIN)) > > > > > return -EFAULT; > > > > > > > > Yes. This clearly fixes the problem that the patch describes, and also > > > > clearly matches up with the Fixes tag. So that's correct. > > > > > > It is a really confusing though, why not just always return -EEXIST > > > here? > > > > Because in current code GUP handles -EEXIST and -EFAULT differently? > > That has nothing to do with here. We shouldn't be deciding what the > top layer does way down here. Return the correct error code for what > was discovered at this layer the upper loop should make the decision > what it should do > > > We do early bail out on -EFAULT. -EEXIST was first introduced in 2015 from > > Kirill for not failing some mlock() or mmap(MAP_POPULATE) on dax (1027e4436b6). > > Then in 2017 it got used again with pud-sized thp (a00cc7d9dd93d) on dax too. > > They seem to service the same goal and it seems to be designed that -EEXIST > > shouldn't fail GUP immediately. > > It must fail GUP immeidately if there is a pages list. Right, but my point is we don't have an user at all for follow_page_mask() returning -EEXIST with a **page which is non-NULL. Or did I miss it? > > Callers that want an early failure must pass in NULL for pages, it is > just that simple. It has nothing to do with the FOLL flags. > > A WARN_ON would be appropriate to compare the FOLL flags against the > pages. eg FOLL_GET without a pages is nonsense and should be > immediately aborted. On the other hand, we avoid this by construction > internal to gup.c We have something like that already, although it's only a VM_BUG_ON() not a BUG_ON() or WARN_ON() at the entry of __get_user_pages(): VM_BUG_ON(!!pages != !!(gup_flags & (FOLL_GET | FOLL_PIN))); > > > > > Here, however, I think we need to consider this a little more carefully, > > > > and attempt to actually fix up this case. It is never going to be OK > > > > here, to return a **pages array that has these little landmines of > > > > potentially uninitialized pointers. And so continuing on *at all* seems > > > > very wrong. > > > > > > Indeed, it should just be like this: > > > > > > @@ -1182,6 +1182,10 @@ static long __get_user_pages(struct mm_struct *mm, > > > * Proper page table entry exists, but no corresponding > > > * struct page. > > > */ > > > + if (pages) { > > > + page = ERR_PTR(-EFAULT); > > > + goto out; > > > + } > > > goto next_page; > > > } else if (IS_ERR(page)) { > > > ret = PTR_ERR(page); > > > > IIUC not failing -EEXIST immediately seems to be what we want. > > Which is what this does, for the only case it is acceptable - a null > page list. > > > From that POV, WARN_ON_ONCE() helps better on exposing an illegal return of > > -EEXIST (as mentioned in the commit message) than the -EFAULT convertion, IMHO. > > Again, that is upside down, -EEXIST should not be a illegal return. It > should be valid, have a defined meaning 'the vaddr exists but has no > struct page' and the top loop, and only the top loop, makes the > decision what to do about it. I believe this works too and I think I get your point, but as stated above it's just not used yet so the path is not useful to any real code path. Especially with above VM_BUG_ON() it means if we'll go into the "if (pages)" we should have already triggered the VM_BUG_ON() condition when entering the function. Thanks,
On Fri, Jan 28, 2022 at 09:36:14AM +0800, Peter Xu wrote: > On Thu, Jan 27, 2022 at 11:25:38AM -0400, Jason Gunthorpe wrote: > > On Thu, Jan 27, 2022 at 05:19:56PM +0800, Peter Xu wrote: > > > > > > > > diff --git a/mm/gup.c b/mm/gup.c > > > > > > index f0af462ac1e2..8ebc04058e97 100644 > > > > > > +++ b/mm/gup.c > > > > > > @@ -440,7 +440,7 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address, > > > > > > pte_t *pte, unsigned int flags) > > > > > > { > > > > > > /* No page to get reference */ > > > > > > - if (flags & FOLL_GET) > > > > > > + if (flags & (FOLL_GET | FOLL_PIN)) > > > > > > return -EFAULT; > > > > > > > > > > Yes. This clearly fixes the problem that the patch describes, and also > > > > > clearly matches up with the Fixes tag. So that's correct. > > > > > > > > It is a really confusing though, why not just always return -EEXIST > > > > here? > > > > > > Because in current code GUP handles -EEXIST and -EFAULT differently? > > > > That has nothing to do with here. We shouldn't be deciding what the > > top layer does way down here. Return the correct error code for what > > was discovered at this layer the upper loop should make the decision > > what it should do > > > > > We do early bail out on -EFAULT. -EEXIST was first introduced in 2015 from > > > Kirill for not failing some mlock() or mmap(MAP_POPULATE) on dax (1027e4436b6). > > > Then in 2017 it got used again with pud-sized thp (a00cc7d9dd93d) on dax too. > > > They seem to service the same goal and it seems to be designed that -EEXIST > > > shouldn't fail GUP immediately. > > > > It must fail GUP immeidately if there is a pages list. > > Right, but my point is we don't have an user at all for follow_page_mask() > returning -EEXIST with a **page which is non-NULL. Or did I miss > it? We don't, but that isn't the point - it is about understandability not utility. There are only two call chains that involve follow_page_mask(), one always wants the -EEXIST and the other wants -EEXIST sometimes and sometimes wants to return -EFAULT for -EEXIST The solution is not to change follow_page_mask() but to be consistent throughout that -EEXIST means we encountered a populated but non-struct page PTE and the single place that wants -EEXIST to be reported as -EFAULT (__get_user_pages) should make that transformation. For instance, should we return -EEXIST in other cases like devmap and more that have PTEs present but are not returnable? It is already really confusing (and probably wrong) that we sometimes return NULL for populated PTEs. NULL causes faultin_page() to be called on something we already know is populated, seems like nonsense. I certainly don't want to see every place like that doing some if to transform -EEXIST into -EFAULT. > > Callers that want an early failure must pass in NULL for pages, it is > > just that simple. It has nothing to do with the FOLL flags. > > > > A WARN_ON would be appropriate to compare the FOLL flags against the > > pages. eg FOLL_GET without a pages is nonsense and should be > > immediately aborted. On the other hand, we avoid this by construction > > internal to gup.c > > We have something like that already, although it's only a VM_BUG_ON() not a > BUG_ON() or WARN_ON() at the entry of __get_user_pages(): > > VM_BUG_ON(!!pages != !!(gup_flags & (FOLL_GET | FOLL_PIN))); Right > I believe this works too and I think I get your point, but as stated above it's > just not used yet so the path is not useful to any real code path. It is not about useful, it is about understandable and consistency.. There should be clear rules about when and what errno to return in every case, we should trend in that direction. Jason
On 1/27/22 17:36, Peter Xu wrote: > On Thu, Jan 27, 2022 at 11:25:38AM -0400, Jason Gunthorpe wrote: >> On Thu, Jan 27, 2022 at 05:19:56PM +0800, Peter Xu wrote: >> >>>>>> diff --git a/mm/gup.c b/mm/gup.c >>>>>> index f0af462ac1e2..8ebc04058e97 100644 >>>>>> +++ b/mm/gup.c >>>>>> @@ -440,7 +440,7 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address, >>>>>> pte_t *pte, unsigned int flags) >>>>>> { >>>>>> /* No page to get reference */ >>>>>> - if (flags & FOLL_GET) >>>>>> + if (flags & (FOLL_GET | FOLL_PIN)) >>>>>> return -EFAULT; >>>>> >>>>> Yes. This clearly fixes the problem that the patch describes, and also >>>>> clearly matches up with the Fixes tag. So that's correct. >>>> >>>> It is a really confusing though, why not just always return -EEXIST >>>> here? >>> >>> Because in current code GUP handles -EEXIST and -EFAULT differently? >> >> That has nothing to do with here. We shouldn't be deciding what the >> top layer does way down here. Return the correct error code for what >> was discovered at this layer the upper loop should make the decision >> what it should do >> >>> We do early bail out on -EFAULT. -EEXIST was first introduced in 2015 from >>> Kirill for not failing some mlock() or mmap(MAP_POPULATE) on dax (1027e4436b6). >>> Then in 2017 it got used again with pud-sized thp (a00cc7d9dd93d) on dax too. >>> They seem to service the same goal and it seems to be designed that -EEXIST >>> shouldn't fail GUP immediately. >> >> It must fail GUP immeidately if there is a pages list. > > Right, but my point is we don't have an user at all for follow_page_mask() > returning -EEXIST with a **page which is non-NULL. Or did I miss it? > What you are missing is that other people are potentially writing code that we haven't seen yet, and that code may use follow_page_mask(). The idea, therefore, is to make it a good building block. >> >> Callers that want an early failure must pass in NULL for pages, it is >> just that simple. It has nothing to do with the FOLL flags. >> >> A WARN_ON would be appropriate to compare the FOLL flags against the >> pages. eg FOLL_GET without a pages is nonsense and should be >> immediately aborted. On the other hand, we avoid this by construction >> internal to gup.c > > We have something like that already, although it's only a VM_BUG_ON() not a > BUG_ON() or WARN_ON() at the entry of __get_user_pages(): > > VM_BUG_ON(!!pages != !!(gup_flags & (FOLL_GET | FOLL_PIN))); > >> >>>>> Here, however, I think we need to consider this a little more carefully, >>>>> and attempt to actually fix up this case. It is never going to be OK >>>>> here, to return a **pages array that has these little landmines of >>>>> potentially uninitialized pointers. And so continuing on *at all* seems >>>>> very wrong. >>>> >>>> Indeed, it should just be like this: >>>> >>>> @@ -1182,6 +1182,10 @@ static long __get_user_pages(struct mm_struct *mm, >>>> * Proper page table entry exists, but no corresponding >>>> * struct page. >>>> */ >>>> + if (pages) { >>>> + page = ERR_PTR(-EFAULT); >>>> + goto out; >>>> + } >>>> goto next_page; >>>> } else if (IS_ERR(page)) { >>>> ret = PTR_ERR(page); >>> >>> IIUC not failing -EEXIST immediately seems to be what we want. >> >> Which is what this does, for the only case it is acceptable - a null >> page list. >> >>> From that POV, WARN_ON_ONCE() helps better on exposing an illegal return of >>> -EEXIST (as mentioned in the commit message) than the -EFAULT convertion, IMHO. >> >> Again, that is upside down, -EEXIST should not be a illegal return. It >> should be valid, have a defined meaning 'the vaddr exists but has no >> struct page' and the top loop, and only the top loop, makes the >> decision what to do about it. > > I believe this works too and I think I get your point, but as stated above it's > just not used yet so the path is not useful to any real code path. It's a really important point. This lower level routine needs to be fixed up so that it behaves in a way that is both correct and reasonable. And it is not reasonable to load up **pages with garbage pointers, regardless of whether any callers are suffering from that yet. Again, the argument that "no one uses this code path yet, so it's OK for it to have pitfalls and weirdness" is not the way to go, at least, not if it's feasible to provide a better alternative. And here, it is very feasible. > > Especially with above VM_BUG_ON() it means if we'll go into the "if (pages)" we > should have already triggered the VM_BUG_ON() condition when entering the function. > And let's not forget that VM_BUG_ON() is a debug-only assertion. So you may very well *not* have triggered anything. Even if code is functionally correct, there are still API design considerations, such as, "how usable is it?", and such. For example, Rusty Russell's "How do I make this hard to misuse?" rules [1]. [1] https://ozlabs.org/~rusty/index.cgi/tech/2008-03-30.html thanks,
On Thu, Jan 27, 2022 at 10:31:27PM -0400, Jason Gunthorpe wrote: > We don't, but that isn't the point - it is about understandability not > utility. > > There are only two call chains that involve follow_page_mask(), one > always wants the -EEXIST and the other wants -EEXIST sometimes and > sometimes wants to return -EFAULT for -EEXIST > > The solution is not to change follow_page_mask() but to be consistent > throughout that -EEXIST means we encountered a populated but > non-struct page PTE and the single place that wants -EEXIST to be > reported as -EFAULT (__get_user_pages) should make that > transformation. Currently -EEXIST is not defined like that. To me, it's defined as: This pte is populated but there's no page struct that binds to it. Let's skip this pte and continue (because we've checked there's no **pages requested). That services use cases like mlock() very well. IMHO that's fine. If I understand correctly, what is proposed above is actually to redefine -EEXIST into: This pte is populated but there's no page struct that binds to it. Then the caller will decide how we should react on it. I kind of agree that the latter one looks cleaner, but I don't have extremely strong reason to refactor it immediately. More importantly, I still don't know what will be the side effect if we immediately remove the two checks in follow_page_mask(), and how much changeset we'll need to redefine -EEXIST here. Let's discuss the two sites that returns -EEXIST one by one.. follow_pfn_pte() should be relatively easy. If with the -EEXIST to -EFAULT convertion proposed in __get_user_pages(), we could have dropped below two lines: ---8<--- diff --git a/mm/gup.c b/mm/gup.c index 8ebc04058e97..b3a109114968 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -439,10 +439,6 @@ static struct page *no_page_table(struct vm_area_struct *vma, static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address, pte_t *pte, unsigned int flags) { - /* No page to get reference */ - if (flags & (FOLL_GET | FOLL_PIN)) - return -EFAULT; - if (flags & FOLL_TOUCH) { pte_t entry = *pte; ---8<--- Then the convertion will make sure it behaves like before. There is a very slight side effect on FOLL_TOUCH above but it shouldn't be a major concern, at least not in the vfio use case - I believe most of the guest pages should already have the dirty bit set anyways, and since they're pinned they won't be a good candidate for reclaims already. IMHO the other site follow_devmap_pud() is tricker, because we can't simply remove the two lines here: ---8<--- diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 406a3c28c026..0fe2253f922c 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1171,15 +1171,6 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr, if (flags & FOLL_TOUCH) touch_pud(vma, addr, pud, flags); - /* - * device mapped pages can only be returned if the - * caller will manage the page reference count. - * - * At least one of FOLL_GET | FOLL_PIN must be set, so assert that here: - */ - if (!(flags & (FOLL_GET | FOLL_PIN))) - return ERR_PTR(-EEXIST); - pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT; *pgmap = get_dev_pagemap(pfn, *pgmap); if (!*pgmap) ---8<--- Because with the old code we'll 100% return -EEXIST as long as we don't have GET|PIN (e.g. mlock), but then later when the two lines removed we'll start to try call get_dev_pagemap(), if it returned something we'll start to fetch the page. That part looks fine, __get_user_pages() will still continue but ignore the page* returned. But what if get_dev_pagemap() returned NULL? Then we'll start to return -EFAULT where we used to constantly return -EEXIST. I'm afraid it'll stop working for some mlock() or MAP_POPULATE case that Kirill wanted to fix before, then it could break things. We could definitely add more code to fix up above to make sure mlock() on huge pud of DAX will keep working. However I hope I have somehow clarified the point that it's not as trivial as it looks like to change the semantics of -EEXIST for follow_page_mask() immediately, and to fix the vfio mdev breakage we'd better merge the oneliner fix first even if we want to rework -EEXIST. Summary: I see no problem at all on either way, it's just that for this bug fix it's straightforward to keep the -EEXIST definition, rather than clean it up, because otherwise the change will grow. Note that we will need to backport this patch, I think it's proper we keep the changeset small and leave the cleanups for later. > > For instance, should we return -EEXIST in other cases like devmap and > more that have PTEs present but are not returnable? It is already > really confusing (and probably wrong) that we sometimes return NULL > for populated PTEs. NULL causes faultin_page() to be called on > something we already know is populated, seems like nonsense. Could you elaborate what's the "return NULL" confusion you mentioned? > > I certainly don't want to see every place like that doing some if to > transform -EEXIST into -EFAULT. > > > > Callers that want an early failure must pass in NULL for pages, it is > > > just that simple. It has nothing to do with the FOLL flags. > > > > > > A WARN_ON would be appropriate to compare the FOLL flags against the > > > pages. eg FOLL_GET without a pages is nonsense and should be > > > immediately aborted. On the other hand, we avoid this by construction > > > internal to gup.c > > > > We have something like that already, although it's only a VM_BUG_ON() not a > > BUG_ON() or WARN_ON() at the entry of __get_user_pages(): > > > > VM_BUG_ON(!!pages != !!(gup_flags & (FOLL_GET | FOLL_PIN))); > > Right > > > I believe this works too and I think I get your point, but as stated above it's > > just not used yet so the path is not useful to any real code path. > > It is not about useful, it is about understandable and > consistency.. There should be clear rules about when and what errno to > return in every case, we should trend in that direction. I see that both you and John has a strong preference on at least the WARN_ON_ONCE() in the patch. Do you think it's okay I repost with only the one-liner fix, which will keep the Fixes but drop the WARN_ON_ONCE? Then we can leave the rest as follow up. For my own preference, I really like to keep the patch as-is, because as I mentioned it helps to identify issues with existing code already (it'll even help the reader when looking at the -EEXIST handling because that's not obvious). But I don't strongly ask for that, I still sincerely wish the discussion won't block the real simple bugfix. Thanks,
On Thu, Jan 27, 2022 at 06:32:27PM -0800, John Hubbard wrote: > What you are missing is that other people are potentially writing code > that we haven't seen yet, and that code may use follow_page_mask(). The > idea, therefore, is to make it a good building block. Yes, actually that's why I attached the WARN_ON_ONCE() since when people do add that new code they'll quickly discover before posting, as long as they'll still test the patch.. Then people may even start to wonder whether we should rework -EEXIST if necessary, and IMHO that's the better place for a rework, not within the bugfix here. Please check the email I replied to Jason. I hope both of you would agree that I can repost with the fix only but drop the WARN_ON_ONCE(), or any explicit suggestion on how we should move forward with the fix would be welcomed. Thanks,
On 1/27/22 19:26, Peter Xu wrote: ... > I see that both you and John has a strong preference on at least the > WARN_ON_ONCE() in the patch. > > Do you think it's okay I repost with only the one-liner fix, which will keep > the Fixes but drop the WARN_ON_ONCE? Then we can leave the rest as follow up. > I think that's OK with me, anyway. You'll recall that I initially requested that you split this into two patches, after all. Would you like me to post a follow-up that does the refactoring that Jason and I are requesting? I see that we have some fundamental differences in opinion about how this should be done, so rather than drive you crazy with debating, maybe that would be smoother? :) thanks,
On Thu, Jan 27, 2022 at 09:57:34PM -0800, John Hubbard wrote: > On 1/27/22 19:26, Peter Xu wrote: > ... > > I see that both you and John has a strong preference on at least the > > WARN_ON_ONCE() in the patch. > > > > Do you think it's okay I repost with only the one-liner fix, which will keep > > the Fixes but drop the WARN_ON_ONCE? Then we can leave the rest as follow up. > > > > I think that's OK with me, anyway. You'll recall that I initially requested > that you split this into two patches, after all. > > Would you like me to post a follow-up that does the refactoring that Jason > and I are requesting? I see that we have some fundamental differences in > opinion about how this should be done, so rather than drive you crazy with > debating, maybe that would be smoother? :) Sure thing. :-) Please just double check that the pud devmap will still always work on mlock(). I believe both of you are much more familiar than me on that; it just still seems a little bit tricky. I'll repost this one, thanks.
On Fri, Jan 28, 2022 at 11:26:16AM +0800, Peter Xu wrote: > diff --git a/mm/gup.c b/mm/gup.c > index 8ebc04058e97..b3a109114968 100644 > +++ b/mm/gup.c > @@ -439,10 +439,6 @@ static struct page *no_page_table(struct vm_area_struct *vma, > static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address, > pte_t *pte, unsigned int flags) > { > - /* No page to get reference */ > - if (flags & (FOLL_GET | FOLL_PIN)) > - return -EFAULT; > - > if (flags & FOLL_TOUCH) { > pte_t entry = *pte; > > Then the convertion will make sure it behaves like before. There is a very > slight side effect on FOLL_TOUCH above but it shouldn't be a major concern, at > least not in the vfio use case You mean that it touches the page even though it will not return the page? This is what devmap is doing already :\ > IMHO the other site follow_devmap_pud() is tricker, because we can't simply > remove the two lines here: Sure, but I didn't ask to fix everything, just trend toward something sane - if you drop the 4 line above then your patch will do what it needs to do, we don't need to jump into devmap to fix this bug. > - /* > - * device mapped pages can only be returned if the > - * caller will manage the page reference count. > - * > - * At least one of FOLL_GET | FOLL_PIN must be set, so assert that here: > - */ > - if (!(flags & (FOLL_GET | FOLL_PIN))) > - return ERR_PTR(-EEXIST); > - > pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT; > *pgmap = get_dev_pagemap(pfn, *pgmap); > if (!*pgmap) > > Because with the old code we'll 100% return -EEXIST as long as we don't have > GET|PIN (e.g. mlock), but then later when the two lines removed we'll start to > try call get_dev_pagemap(), if it returned something we'll start to fetch the > page. This is part of the "return codes don't make any sense" I was complaining about.. devmap has a populated PTE and a struct page, it shouldn't be returning -EFAULT. Failure to get the struct page is either EEXIST or NULL (take a fault to resolve the race). [as we've discussed before the get_dev_pagemap shouldn't even be here at all, I suppose this is why it is so weird looking] > That part looks fine, __get_user_pages() will still continue but ignore the > page* returned. IMHO it would be clearer if the __get_user_pages() checked the struct page zone and rejected the cases it doesn't want than to sprinkle these checks all over the place based on PMD flags. We want to go in this direction anyhow so we can remove the pmd_dax() flag.. > But what if get_dev_pagemap() returned NULL? Then we'll start to return > -EFAULT where we used to constantly return -EEXIST. IHMO most of the EFAULTs in the walker call chain are pretty questionable.. they should ideally be EEXIST or NULL. eg why does get_gate_page() return EFAULT for a non-present page? That should be NULL and faultin_page() should generate the EFAULT > > For instance, should we return -EEXIST in other cases like devmap and > > more that have PTEs present but are not returnable? It is already > > really confusing (and probably wrong) that we sometimes return NULL > > for populated PTEs. NULL causes faultin_page() to be called on > > something we already know is populated, seems like nonsense. > > Could you elaborate what's the "return NULL" confusion you mentioned? Returning null for a popoulated PTE as a permanent failure is nonsense. NULL means 'call faultin_page()', it should be used on populated pages that ought to have a struct page but for some racy reason is unavailable. The faultin should resolve that race and make the pte either fully non-present or restore the struct page. > > It is not about useful, it is about understandable and > > consistency.. There should be clear rules about when and what errno to > > return in every case, we should trend in that direction. > > I see that both you and John has a strong preference on at least the > WARN_ON_ONCE() in the patch. Well, I wanted to see the WARN_ON_ONCE avoided by making it unnecessary by construction, not just deleted :\ Jason
diff --git a/mm/gup.c b/mm/gup.c index f0af462ac1e2..8ebc04058e97 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -440,7 +440,7 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address, pte_t *pte, unsigned int flags) { /* No page to get reference */ - if (flags & FOLL_GET) + if (flags & (FOLL_GET | FOLL_PIN)) return -EFAULT; if (flags & FOLL_TOUCH) { @@ -1181,7 +1181,13 @@ static long __get_user_pages(struct mm_struct *mm, /* * Proper page table entry exists, but no corresponding * struct page. + * + * Warn if we jumped over even with a valid **pages. + * It shouldn't trigger in practise, but when there's + * buggy returns on -EEXIST we'll warn before returning + * an invalid page pointer in the array. */ + WARN_ON_ONCE(pages); goto next_page; } else if (IS_ERR(page)) { ret = PTR_ERR(page);
Alex reported invalid page pointer returned with pin_user_pages_remote() from vfio after upstream commit 4b6c33b32296 ("vfio/type1: Prepare for batched pinning with struct vfio_batch"). This problem breaks NVIDIA vfio mdev. It turns out that it's not the fault of the vfio commit; however after vfio switches to a full page buffer to store the page pointers it starts to expose the problem easier. The problem is for VM_PFNMAP vmas we should normally fail with an -EFAULT then vfio will carry on to handle the MMIO regions. However when the bug triggered, follow_page_mask() returned -EEXIST for such a page, which will jump over the current page, leaving that entry in **pages untouched. However the caller is not aware of it, hence the caller will reference the page as usual even if the pointer data can be anything. We had that -EEXIST logic since commit 1027e4436b6a ("mm: make GUP handle pfn mapping unless FOLL_GET is requested") which seems very reasonable. It could be that when we reworked GUP with FOLL_PIN we could have overlooked that special path in commit 3faa52c03f44 ("mm/gup: track FOLL_PIN pages"), even if that commit rightfully touched up follow_devmap_pud() on checking FOLL_PIN when it needs to return an -EEXIST. Since at it, add another WARN_ON_ONCE() at the -EEXIST handling to make sure we mustn't have **pages set when reaching there, because otherwise it means the caller will try to read a garbage right after __get_user_pages() returns. Attaching the Fixes to the FOLL_PIN rework commit, as it happened later than 1027e4436b6a. Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jan Kara <jack@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Fixes: 3faa52c03f44 ("mm/gup: track FOLL_PIN pages") Reported-by: Alex Williamson <alex.williamson@redhat.com> Debugged-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> --- mm/gup.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-)