Message ID | 20230626085035.e66992e96b4c6d37dad54bd9@linux-foundation.org (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [GIT,PULL] MM updates for 6.5-rc1 | expand |
On Mon, 26 Jun 2023 at 08:50, Andrew Morton <akpm@linux-foundation.org> wrote: > > Linus, please merge the MM updates for the 6.5-rc cycle. > [...] > merge conflict in mm/gup.c, vs block tree: > https://lkml.kernel.org/r/20230616115856.3ce7682c@canb.auug.org.au Hmm. I think this merge resolution in -next is wrong. It now does a common folio = try_get_folio(page, refs); if (flags & FOLL_GET) return folio; for both FOLL_GET and FOLL_PIN, and then *after* that for the FOLL_PIN case it does that /* * Don't take a pin on the zero page - it's not going anywhere * and it is used in a *lot* of places. */ if (is_zero_page(page)) return page_folio(page); but by then it has already done the try_get_folio(). End result: it has already updated refcounts, despite the comment saying not to do that. So I think it needs to match the comment (and the try_grab_page() logic), and just basically if (flags & FOLL_GET) return try_get_folio(page, refs); if (is_zero_page(page)) return page_folio(page); folio = try_get_folio(page, refs); if (!folio) return NULL; instead. That's what my resolution is going to do, but let's add others to the participants list just in case somebody goes "No, Linus, the reason -next does that is XYZ"... Linus
On Wed, 28 Jun 2023 at 10:27, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > So I think it needs to match the comment (and the try_grab_page() > logic), and just basically > > if (flags & FOLL_GET) > return try_get_folio(page, refs); > > if (is_zero_page(page)) > return page_folio(page); > > folio = try_get_folio(page, refs); > if (!folio) > return NULL; > > instead. Side note: I think we should just do the "FOLL_GET" doesn't touch the refcount either, which would make this all become just if (is_zero_page(page)) return page_folio(page); folio = try_get_folio(page, refs); if (!folio) return NULL; but then we would need to fix try_grab_page() and gup_put_folio() and friends to match. And any other cases I haven't thought of. Long long ago we used to have the logic that PG_reserved meant that no refcounting was done, and that automatically handled the zero page(s). But that was removed back in 2005... That old commit even talks about this issue: A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and thus mapcounted and count towards shared rss). These writes to the struct page could cause excessive cacheline bouncing on big systems. There are a number of ways this could be addressed if it is an issue. It's commit b5810039a54e ("[PATCH] core remove PageReserved") in case anybody wants to do some historical archaeology. Linus
The pull request you sent on Mon, 26 Jun 2023 08:50:35 -0700:
> https://lkml.kernel.org/r/20230504132148.182960-1-broonie@finisterre.sirena.org.uk Thanks.
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/6e17c6de3ddf3073741d9c91a796ee696914d8a0
Thank you!
Linus Torvalds <torvalds@linux-foundation.org> wrote: > So I think it needs to match the comment (and the try_grab_page() > logic), and just basically > > if (flags & FOLL_GET) > return try_get_folio(page, refs); > > if (is_zero_page(page)) > return page_folio(page); > > folio = try_get_folio(page, refs); > if (!folio) > return NULL; > > instead. That looks right. David
Linus Torvalds <torvalds@linux-foundation.org> wrote: > Side note: I think we should just do the "FOLL_GET" doesn't touch the > refcount either, which would make this all become just Yeah... but there's a lot of places that would potentially need fixing. iov_iter_get_pages*(), for example, is used to grab pages and push them to all sorts of destinations, including pipes and sk_buffs. > but then we would need to fix try_grab_page() and gup_put_folio() and > friends to match. And any other cases I haven't thought of. And put_page(), folio_put(), ... I wonder if there would be a noticeable performance loss from adding an is_zero_page() check into those. It might also make sense to include the entire kernel post-init static image in that, perhaps a check: if ((unsigned long)(virt_address) & KERNEL_MASK == KERNEL_BASE) return; David
On Mon, 26 Jun 2023 at 08:50, Andrew Morton <akpm@linux-foundation.org> wrote: > > Linus, please merge the MM updates for the 6.5-rc cycle. Hmm. I have merged this, as pr-tracker-bot already noticed, but I found a bug after merging it. mm/memory.c: __access_remote_vm() is entirely broken for the HAVE_IOREMAP_PROT case (ie all normal architectures), because it does (skipping the non-HAVE_IOREMAP_PROT case): struct vm_area_struct *vma = NULL; struct page *page = get_user_page_vma_remote(mm, addr, gup_flags, &vma); if (IS_ERR_OR_NULL(page)) { [ ... ] if (!vma) break; if (vma->vm_ops && vma->vm_ops->access) res = vma->vm_ops->access(vma, addr, buf, ... but get_user_page_vma_remote() doesn't even set vma if it fails! So that "if (!vma)" case will always trigger, and the whole ->access() thing is never done. So that __access_remote_vm() conversion in commit ca5e863233e8 ("mm/gup: remove vmas parameter from get_user_pages_remote()") is entirely broken. Now, I don't disagree with removing the vmas parameter. I just disagree with the get_user_page_vma_remote() helper use here. I think the minimal fix is to just put the vma_lookup() back in the error case: --- a/mm/memory.c +++ b/mm/memory.c @@ -5592,6 +5592,7 @@ int __access_remote_vm(struct mm_struct *mm, * Check if this is a VM_IO | VM_PFNMAP VMA, which * we can access using slightly different code. */ + vma = vma_lookup(mm, addr); if (!vma) break; if (vma->vm_ops && vma->vm_ops->access) and I'll commit that fix for now. Anybody who disagrees, please holler. Linus
On Wed, Jun 28, 2023 at 12:19:42PM -0700, Linus Torvalds wrote: > On Mon, 26 Jun 2023 at 08:50, Andrew Morton <akpm@linux-foundation.org> wrote: > > > > Linus, please merge the MM updates for the 6.5-rc cycle. > > Hmm. I have merged this, as pr-tracker-bot already noticed, but I > found a bug after merging it. > > mm/memory.c: __access_remote_vm() is entirely broken for the > HAVE_IOREMAP_PROT case (ie all normal architectures), because it does > (skipping the non-HAVE_IOREMAP_PROT case): > > struct vm_area_struct *vma = NULL; > struct page *page = get_user_page_vma_remote(mm, addr, > gup_flags, &vma); > > if (IS_ERR_OR_NULL(page)) { > [ ... ] > if (!vma) > break; > if (vma->vm_ops && vma->vm_ops->access) > res = vma->vm_ops->access(vma, addr, buf, ... > > but get_user_page_vma_remote() doesn't even set vma if it fails! > > So that "if (!vma)" case will always trigger, and the whole ->access() > thing is never done. Ugh yeah... This came about because the helper function handles the vma_lookup() case but obviously we now only bother with the vma_lookup() if the gup_remote() succeeds. It's made a little trickier by the fact we warn on !vma if the gup succeeded, so probably special casing this one makes sense for now. There's been discussion about eliminating this weirdo thing gup returning 0 but being treated as a not-failure, I will probably try to patch this soon for the one usecase where it matters (uprobes) and maybe also look at this whole odd 'special mappings are ok in this one place' thing here. > > So that __access_remote_vm() conversion in commit ca5e863233e8 > ("mm/gup: remove vmas parameter from get_user_pages_remote()") is > entirely broken. > > Now, I don't disagree with removing the vmas parameter. I just > disagree with the get_user_page_vma_remote() helper use here. > > I think the minimal fix is to just put the vma_lookup() back in the error case: > > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -5592,6 +5592,7 @@ int __access_remote_vm(struct mm_struct *mm, > * Check if this is a VM_IO | VM_PFNMAP VMA, which > * we can access using slightly different code. > */ > + vma = vma_lookup(mm, addr); > if (!vma) > break; > if (vma->vm_ops && vma->vm_ops->access) > > and I'll commit that fix for now. Anybody who disagrees, please holler. > > Linus Yeah I think this is the best fix in this case, we're not doing any additional work since this wouldn't have run in the helper.