Message ID: 20231124-vfs-fixes-3420a81c0abe@brauner (mailing list archive)
State:      New, archived
Series:     [GIT,PULL] vfs fixes
On Fri, 24 Nov 2023 at 02:28, Christian Brauner <brauner@kernel.org> wrote:
>
> * Fix a bug introduced with the iov_iter rework from last cycle.
>
>   This broke /proc/kcore by copying too much and without the correct
>   offset.

Ugh. I think the whole /proc/kcore vmalloc handling is just COMPLETELY
broken. It does this:

        /*
         * vmalloc uses spinlocks, so we optimistically try to
         * read memory. If this fails, fault pages in and try
         * again until we are done.
         */
        while (true) {
                read += vread_iter(iter, src, left);
                if (read == tsz)
                        break;

                src += read;
                left -= read;

                if (fault_in_iov_iter_writeable(iter, left)) {
                        ret = -EFAULT;
                        goto out;
                }
        }

and that is just broken beyond words for two totally independent reasons:

 (a) vread_iter() looks like it can fail because of not having a
     source, and return 0 (I dunno - it seems to try to avoid that,
     but it all looks pretty dodgy).

     At that point fault_in_iov_iter_writeable() will try to fault in
     the destination, which may work just fine, but if the source was
     the problem, you'd have an endless loop.

 (b) That "read += X" is completely broken anyway. It should be just
     a "=".

So that whole loop is crazy broken, and only works for the case where
you get it all in one go. This code is crap.

Now, I think it all works in practice for one simple reason: I doubt
anybody uses this (and it looks like the callees in the while loop try
very hard to always fill the whole area - maybe people noticed the
first bug and tried to work around it that way).

I guess there is at least one test program, but it presumably doesn't
trigger or care about the bugs here.

But I think we can get rid of this all, and just deal with the
KCORE_VMALLOC case exactly the same way we already deal with VMEMMAP
and TEXT: by just doing copy_from_kernel_nofault() into a bounce
buffer, and then doing a regular _copy_to_iter() or whatever.

NOTE! I looked at the code, and threw up in my mouth a little, and
maybe I missed something. Maybe it all works fine. But Omar - since
you found the original problem, may I implore you to test this
attached patch?

I just like how the patch looks:

 6 files changed, 1 insertion(+), 368 deletions(-)

and those 350+ deleted lines really looked disgusting to me.

This patch is on top of the pull I did, because obviously the fix in
that pull was correct, I just think we should go further and get rid
of this whole mess entirely.

                Linus
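For reference, the VMEMMAP/TEXT-style handling Linus is proposing for
KCORE_VMALLOC would look roughly like the sketch below. This is not the
attached patch itself; it assumes the buf, tsz, iter, ret, and start
locals already used by read_kcore_iter() in fs/proc/kcore.c:

        case KCORE_VMALLOC:
                /*
                 * Read through a bounce buffer so an unmapped or
                 * otherwise unreadable source cannot fault or loop:
                 * a failed copy_from_kernel_nofault() simply yields
                 * zeroes to userspace.
                 */
                if (copy_from_kernel_nofault(buf, (void *)start, tsz)) {
                        if (iov_iter_zero(tsz, iter) != tsz) {
                                ret = -EFAULT;
                                goto out;
                        }
                } else if (_copy_to_iter(buf, tsz, iter) != tsz) {
                        ret = -EFAULT;
                        goto out;
                }
                break;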
The pull request you sent on Fri, 24 Nov 2023 11:27:28 +0100:
> git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.7-rc3.fixes
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/fa2b906f5148883e2d0be8952767469c2e3de274
Thank you!
On Fri, 24 Nov 2023 at 10:25, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I just like how the patch looks:
>
>  6 files changed, 1 insertion(+), 368 deletions(-)
>
> and those 350+ deleted lines really looked disgusting to me.

Gaah. I guess it's the VM_IOREMAP case that is the cause of all this
horridness.

So we'd have to know not to mess with IO mappings. Annoying.

But I think my patch could still act as a starting point, except that

        case KCORE_VMALLOC:

would have to have some kind of "if this is not a regular vmalloc,
just skip it" logic.

So I guess we can't remove all those lines, but we *could* replace all
the vread_iter() code with a "bool can_I_vread_this()" instead. So the
fallback would still be to just do the bounce buffer copy.

Or something. Oh well.

                Linus
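A "can_I_vread_this()" along the lines Linus sketches could be a thin
wrapper around the existing vmap-area lookup. A hypothetical version
(the name, and rejecting only VM_IOREMAP, are assumptions, not code
from the thread):

        #include <linux/vmalloc.h>

        /*
         * True if addr is backed by a plain vmalloc mapping that is
         * safe to read with ordinary loads; false for IO mappings
         * and for addresses with no vmalloc area at all.
         */
        static bool can_i_vread_this(const void *addr)
        {
                struct vm_struct *area = find_vm_area(addr);

                if (!area)
                        return false;

                return !(area->flags & VM_IOREMAP);
        }

With that, the KCORE_VMALLOC case could zero-fill whatever the helper
rejects and use the bounce-buffer copy for everything else.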
On Fri, 24 Nov 2023 at 10:52, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Gaah. I guess it's the VM_IOREMAP case that is the cause of all this
> horridness.
>
> So we'd have to know not to mess with IO mappings. Annoying.

Doing a debian code search, I see a number of programs that do a
"stat()" on the kcore file, to get some notion of "system memory
size". I don't think it's valid, but whatever. We probably shouldn't
change it.

I also see some programs that actually read the ELF notes and sections
for dumping purposes.

But does anybody actually run gdb on that thing or similar? That's the
original model for that file, but it was always more of a gimmick than
anything else.

Because we could just say "read zeroes from KCORE_VMALLOC" and be done
with it that way.

                Linus
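The zero-fill variant would be only a few lines; a sketch against the
same read_kcore_iter() locals assumed above:

        case KCORE_VMALLOC:
                /* Don't pretend to dump vmalloc space: hand back zeroes. */
                if (iov_iter_zero(tsz, iter) != tsz) {
                        ret = -EFAULT;
                        goto out;
                }
                break;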
> Because we could just say "read zeroes from KCORE_VMALLOC" and be done
> with it that way.

Let's try to do that and see what happens. If we get away with it then
great; if not, we can think about fixing this.
On Fri, Nov 24, 2023 at 10:25:15AM -0800, Linus Torvalds wrote:
> On Fri, 24 Nov 2023 at 02:28, Christian Brauner <brauner@kernel.org> wrote:
> >
> > * Fix a bug introduced with the iov_iter rework from last cycle.
> >
> >   This broke /proc/kcore by copying too much and without the correct
> >   offset.
>
> Ugh. I think the whole /proc/kcore vmalloc handling is just COMPLETELY
> broken.

Ugh, I didn't even look at that closely because the fix was obviously
correct for that helper alone. Let's try and just return zeroed memory
like you suggested in your last mail before we bother fixing any of
this.

Long-term plan, it'd be nice to just get distros to start turning
/proc/kcore off. Maybe I underestimate legitimate users, but this
requires CAP_SYS_RAWIO so it really can only be useful to pretty
privileged stuff anyway.
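The access control Christian refers to is a capability check at open
time in fs/proc/kcore.c. Trimmed to the relevant check (the real
open_kcore() also sets up a per-open buffer and other state):

        static int open_kcore(struct inode *inode, struct file *filp)
        {
                if (!capable(CAP_SYS_RAWIO))
                        return -EPERM;
                /* ... per-open buffer setup elided ... */
                return 0;
        }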
On Sat, Nov 25, 2023 at 02:10:52PM +0100, Christian Brauner wrote:
> On Fri, Nov 24, 2023 at 10:25:15AM -0800, Linus Torvalds wrote:
> > Ugh. I think the whole /proc/kcore vmalloc handling is just COMPLETELY
> > broken.
>
> Ugh, I didn't even look at that closely because the fix was obviously
> correct for that helper alone. Let's try and just return zeroed memory
> like you suggested in your last mail before we bother fixing any of
> this.
>
> Long-term plan, it'd be nice to just get distros to start turning
> /proc/kcore off. Maybe I underestimate legitimate users, but this
> requires CAP_SYS_RAWIO so it really can only be useful to pretty
> privileged stuff anyway.

drgn needs /proc/kcore for debugging the live kernel, which is a very
important use case for lots of our users. And it does in fact read from
KCORE_VMALLOC segments, which is why I found and fixed this bug. I'm
happy to clean up this code, although it's a holiday weekend here so I
won't get to it immediately, of course. But please don't rip this out.

Omar
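For context on why drgn cares: a consumer like it treats /proc/kcore as
an ELF core file and walks the program headers to translate a kernel
virtual address (including vmalloc addresses exposed via KCORE_VMALLOC
segments) into a file offset to pread(). A minimal userspace
illustration, not drgn's actual code:

        #include <elf.h>
        #include <sys/types.h>
        #include <unistd.h>

        /*
         * Map a kernel virtual address to its offset in /proc/kcore,
         * or -1 if no PT_LOAD segment covers it.
         */
        static off_t kcore_offset(int fd, unsigned long vaddr)
        {
                Elf64_Ehdr ehdr;

                if (pread(fd, &ehdr, sizeof(ehdr), 0) != sizeof(ehdr))
                        return -1;

                for (int i = 0; i < ehdr.e_phnum; i++) {
                        Elf64_Phdr phdr;
                        off_t off = ehdr.e_phoff + (off_t)i * ehdr.e_phentsize;

                        if (pread(fd, &phdr, sizeof(phdr), off) != sizeof(phdr))
                                return -1;
                        if (phdr.p_type == PT_LOAD &&
                            vaddr >= phdr.p_vaddr &&
                            vaddr - phdr.p_vaddr < phdr.p_memsz)
                                return phdr.p_offset + (vaddr - phdr.p_vaddr);
                }
                return -1;
        }

This is also why silently reading zeroes from KCORE_VMALLOC would break
such tools: the segment would still be advertised in the headers, but
reads through it would no longer return real memory.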
On Sat, Nov 25, 2023 at 05:28:49AM -0800, Omar Sandoval wrote:
> On Sat, Nov 25, 2023 at 02:10:52PM +0100, Christian Brauner wrote:
> > Let's try and just return zeroed memory like you suggested in your
> > last mail before we bother fixing any of this.
> >
> > Long-term plan, it'd be nice to just get distros to start turning
> > /proc/kcore off. Maybe I underestimate legitimate users, but this
> > requires CAP_SYS_RAWIO so it really can only be useful to pretty
> > privileged stuff anyway.
>
> drgn needs /proc/kcore for debugging the live kernel, which is a very
> important use case for lots of our users. And it does in fact read from
> KCORE_VMALLOC segments, which is why I found and fixed this bug. I'm
> happy to clean up this code, although it's a holiday weekend here so I
> won't get to it immediately, of course. But please don't rip this out.

Ugh, I see. I just grepped through the drgn repo. I didn't realize that
you were actively using it and not just testing it. If this is actively
used then we won't break you, of course. And yeah, if you would fix it
that would be great. Given that you're the main active user who happens
to have kernel experience, you might even want to be made responsible
for this file in the future. ;)