Message ID | 20200529232723.44942-8-viro@ZenIV.linux.org.uk (mailing list archive)
---|---
State | New, archived
Series | [1/9] pselect6() and friends: take handling the combined 6th/7th args into helper
On Fri, May 29, 2020 at 4:27 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> a/arch/x86/kvm/hyperv.c
> -		if (__clear_user((void __user *)addr, sizeof(u32)))
> +		if (__put_user(0, (u32 __user *)addr))

I'm not doubting that this is a correct transformation and an improvement, but why is it using that double-underscore version in the first place?

There's a __copy_to_user() in kvm_hv_set_msr_pw() in addition to this one in kvm_hv_set_msr(). Both go back to 2011 and commit 8b0cedff040b ("KVM: use __copy_to_user/__clear_user to write guest page") and both look purely like "pointlessly avoid the access_ok".

All these KVM "optimizations" seem entirely pointless, since access_ok() isn't the problem. And the address _claims_ to be verified, but I'm not seeing it. There is not a single 'access_ok()' anywhere in arch/x86/kvm/ that I can see.

It looks like the argument for the address being validated is that it comes from "gfn_to_hva()", which should only return host-virtual-addresses. That may be true.

But "should" is not "does", and honestly, the cost of gfn_to_hva() is high enough that then using that as an argument for removing "access_ok()" smells.

So I would suggest just removing all these completely bogus double-underscore versions. It's pointless, it's wrong, and it's unsafe. This isn't even some critical path, but even if it was, that user address check isn't where the problems are.

Linus
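[The distinction Linus is drawing — put_user() performs the access_ok() range check, __put_user() skips it — can be sketched as a minimal userspace model. All names here (FAKE_TASK_SIZE, checked_put, unchecked_put) are hypothetical stand-ins, not the kernel's actual implementation.]

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the top of the user address range. */
#define FAKE_TASK_SIZE ((uint64_t)0x7ffffffff000ULL)

/* Model of access_ok(): reject any range that reaches kernel space. */
static int fake_access_ok(uint64_t addr, uint64_t len)
{
	return addr <= FAKE_TASK_SIZE - len;
}

/* put_user-like: range check first, then the raw store. */
static int checked_put(uint64_t addr, uint32_t val, uint32_t *backing)
{
	if (!fake_access_ok(addr, sizeof(val)))
		return -1;	/* -EFAULT in the kernel */
	*backing = val;		/* stands in for the raw store */
	return 0;
}

/* __put_user-like: no range check at all; the caller is trusted to
 * have validated the address -- which is exactly the trust being
 * questioned in this thread. */
static int unchecked_put(uint32_t val, uint32_t *backing)
{
	*backing = val;
	return 0;
}
```

The checked variant rejects a kernel-range address outright; the double-underscore variant stores unconditionally, so its safety is only as good as the caller's validation.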
On Fri, May 29, 2020 at 04:52:59PM -0700, Linus Torvalds wrote:
> On Fri, May 29, 2020 at 4:27 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > a/arch/x86/kvm/hyperv.c
> > -		if (__clear_user((void __user *)addr, sizeof(u32)))
> > +		if (__put_user(0, (u32 __user *)addr))
>
> I'm not doubting that this is a correct transformation and an
> improvement, but why is it using that double-underscore version in the
> first place?
>
> There's a __copy_to_user() in kvm_hv_set_msr_pw() in addition to this
> one in kvm_hv_set_msr(). Both go back to 2011 and commit 8b0cedff040b
> ("KVM: use __copy_to_user/__clear_user to write guest page") and both
> look purely like "pointlessly avoid the access_ok".
>
> All these KVM "optimizations" seem entirely pointless, since
> access_ok() isn't the problem. And the address _claims_ to be
> verified, but I'm not seeing it. There is not a single 'access_ok()'
> anywhere in arch/x86/kvm/ that I can see.
>
> It looks like the argument for the address being validated is that it
> comes from "gfn_to_hva()", which should only return
> host-virtual-addresses. That may be true.
>
> But "should" is not "does", and honestly, the cost of gfn_to_hva() is
> high enough that then using that as an argument for removing
> "access_ok()" smells.
>
> So I would suggest just removing all these completely bogus
> double-underscore versions. It's pointless, it's wrong, and it's
> unsafe.

It's a bit trickier than that, but I want to deal with that at the same time as the rest of kvm/vhost stuff. So for this series I just went for minimal change. There's quite a pile of vhost and kvm stuff, but it's not ready yet - wait for the next cycle.
On Sat, May 30, 2020 at 03:31:47PM +0100, Al Viro wrote:
> It's a bit trickier than that, but I want to deal with that at the same
> time as the rest of kvm/vhost stuff. So for this series I just went
> for minimal change. There's quite a pile of vhost and kvm stuff,
> but it's not ready yet - wait for the next cycle.

BTW, regarding uaccess plans for the next cycle:

* regset mess (at least the ->get() side)
* killing more compat_alloc_user_space() call sites (_maybe_ all of it,
  if we are lucky enough; v4l2 is a bitch in that respect, but I've some
  ideas on how to deal with that - need to discuss with mchehab)
* sorting the remaining (harder) parts of i915 out
* kvm/vhost
* fault_in_pages_...() series
  That should get rid of almost all __... ones outside of arch/*; might
  actually kill copy_in_user() off as well.
* finally lifting stac/clac out of raw_copy_{to,from}_user().
> On Fri, May 29, 2020 at 04:52:59PM -0700, Linus Torvalds wrote:
>> It looks like the argument for the address being validated is that it
>> comes from "gfn_to_hva()", which should only return
>> host-virtual-addresses. That may be true.

Yes, the access_ok is done in __kvm_set_memory_region and gfn_to_hva() returns a page-aligned address so it's obviously ok for a u32. But I have no objections to removing the __ because if a read or write is in the hot path it will use kvm_write_guest_cached and similar.

Paolo

>> But "should" is not "does", and honestly, the cost of gfn_to_hva() is
>> high enough that then using that as an argument for removing
>> "access_ok()" smells.
On Sat, May 30, 2020 at 9:20 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> Yes, the access_ok is done in __kvm_set_memory_region and gfn_to_hva()
> returns a page-aligned address so it's obviously ok for a u32.

It's not that it's "obviously ok for an u32". It is _not_ obviously ok for a user address.

There's actually no access_ok() done in the lookup path at all, and what gfn_to_hva() actually ends up doing in the end is __gfn_to_hva_memslot(), which has zero overflow protection at all, and just does

	slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE;

without us having _ever_ checked that 'gfn' parameter.

Yes, at some point in the very very distant past, __kvm_set_memory_region() has validated mem->{userspace_addr,memory_size}. But even that validation is actually very questionable, since it's not even done for all of the memory slots, only the "user" ones.

So if at any point we have a non-user slot, or if at any point the gfn thing was mis-calculated and {over,under}flows, there are no protections what-so-ever.

In other words, it really looks like kvm is entirely dependent on magic and luck and a desperate hope that there are no other bugs to keep the end result as a user address. Because if _any_ bug or oversight in that kvm_memory_slot handling ever happens, you end up with random crap.

So no. I disagree. There is absolutely nothing "obviously ok" about any of that kvm code. Quite the reverse.

I'd argue that it's very much obviously *NOT* ok, even while it might just happen to work.

That double underscore needs to go away. It's either actively buggy right now and I see no proof it isn't, or it's a bug just waiting to happen in the future.

Linus
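[The unchecked arithmetic Linus quotes from __gfn_to_hva_memslot() can be modeled in a few lines of userspace C. The struct and constants below are hypothetical stand-ins, not the kernel's definitions; the point is that a gfn nobody validated makes the result land arbitrarily far outside the slot, including at a kernel-range address.]

```c
#include <assert.h>
#include <stdint.h>

#define FAKE_PAGE_SIZE   4096ULL
#define FAKE_PAGE_OFFSET 0xffff888000000000ULL	/* x86-64 direct-map base */

/* Hypothetical memslot with the fields the quoted expression uses. */
struct fake_memslot {
	uint64_t base_gfn;
	uint64_t npages;
	uint64_t userspace_addr;
};

/* The quoted arithmetic, verbatim: no bounds check against npages,
 * no overflow check -- any gfn the caller passes is trusted. */
static uint64_t fake_gfn_to_hva(const struct fake_memslot *slot, uint64_t gfn)
{
	return slot->userspace_addr +
	       (gfn - slot->base_gfn) * FAKE_PAGE_SIZE;
}
```

With a 16-page slot, a gfn one page past base_gfn yields an address inside the slot; a gfn far past it walks straight up to (and past) the kernel mapping, which is why everything hinges on the kvm_is_error_hva() check the callers do afterwards.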
On Sat, May 30, 2020 at 10:57:24AM -0700, Linus Torvalds wrote:
> So no. I disagree. There is absolutely nothing "obviously ok" about
> any of that kvm code. Quite the reverse.
>
> I'd argue that it's very much obviously *NOT* ok, even while it might
> just happen to work.

Actually, it's somewhat less brittle than you think (on non-mips, at least) and not due to those long-ago access_ok().

> That double underscore needs to go away. It's either actively buggy
> right now and I see no proof it isn't, or it's a bug just waiting to
> happen in the future.

FWIW, the kvm side of things (vhost is yet another pile of fun) is

[x86] kvm_hv_set_msr_pw():
arch/x86/kvm/hyperv.c:1027:	if (__copy_to_user((void __user *)addr, instructions, 4))
	HV_X64_MSR_HYPERCALL
arch/x86/kvm/hyperv.c:1132:	if (__clear_user((void __user *)addr, sizeof(u32)))
	HV_X64_MSR_VP_ASSIST_PAGE

in both cases addr comes from
	gfn = data >> HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT;
	addr = kvm_vcpu_gfn_to_hva(vcpu, gfn);
	if (kvm_is_error_hva(addr))
		return 1;

[x86] FNAME(walk_addr_generic), very hot:
arch/x86/kvm/mmu/paging_tmpl.h:403:	if (unlikely(__get_user(pte, ptep_user)))
	index = PT_INDEX(addr, walker->level);
	...
	offset = index * sizeof(pt_element_t);
	...
	host_addr = kvm_vcpu_gfn_to_hva_prot(vcpu, real_gfn,
					     &walker->pte_writable[walker->level - 1]);
	if (unlikely(kvm_is_error_hva(host_addr)))
		goto error;
	ptep_user = (pt_element_t __user *)((void *)host_addr + offset);

__kvm_read_guest_page():
virt/kvm/kvm_main.c:2252:	r = __copy_from_user(data, (void __user *)addr + offset, len);
	addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
	if (kvm_is_error_hva(addr))
		return -EFAULT;

__kvm_read_guest_atomic():
virt/kvm/kvm_main.c:2326:	r = __copy_from_user_inatomic(data, (void __user *)addr + offset, len);
	addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
	if (kvm_is_error_hva(addr))
		return -EFAULT;

__kvm_write_guest_page():
virt/kvm/kvm_main.c:2353:	r = __copy_to_user((void __user *)addr + offset, data, len);
	addr = gfn_to_hva_memslot(memslot, gfn);
	if (kvm_is_error_hva(addr))
		return -EFAULT;

kvm_write_guest_offset_cached():
virt/kvm/kvm_main.c:2490:	r = __copy_to_user((void __user *)ghc->hva + offset, data, len);
	if (kvm_is_error_hva(ghc->hva))
		return -EFAULT;

kvm_read_guest_cached():
virt/kvm/kvm_main.c:2525:	r = __copy_from_user(data, (void __user *)ghc->hva, len);
	if (kvm_is_error_hva(ghc->hva))
		return -EFAULT;

default kvm_is_error_hva() is addr >= PAGE_OFFSET; however, on mips and s390 it's IS_ERR_VALUE(). Sure, we can use non-__ variants, but is access_ok() the right primitive here? We want userland memory, set_fs() be damned.
On Sat, May 30, 2020 at 11:39 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Actually, it's somewhat less brittle than you think (on non-mips, at least)
> and not due to those long-ago access_ok().

It really isn't.

Your very first statement shows how broken it is:

> FWIW, the kvm side of things (vhost is yet another pile of fun) is
>
> [x86] kvm_hv_set_msr_pw():
> arch/x86/kvm/hyperv.c:1027:	if (__copy_to_user((void __user *)addr, instructions, 4))
> 	HV_X64_MSR_HYPERCALL
> arch/x86/kvm/hyperv.c:1132:	if (__clear_user((void __user *)addr, sizeof(u32)))
> 	HV_X64_MSR_VP_ASSIST_PAGE
> in both cases addr comes from
> 	gfn = data >> HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT;
> 	addr = kvm_vcpu_gfn_to_hva(vcpu, gfn);
> 	if (kvm_is_error_hva(addr))
> 		return 1;

Just look at that. You have _zero_ indication that 'addr' is a user space address. It could be a kernel address.

That kvm_vcpu_gfn_to_hva() function is a complicated mess that first looks for the right 'memslot', and basically uses a search with a default slot to try to figure it out. It doesn't even use locking for any of it, but assumes the arrays are stable, and that it can use atomics to reliably read and set the last successfully found slot.

And none of that code verifies that the end result is a user address.

It _literally_ all depends on this optimistically lock-free code being bug-free, and never using a slot that isn't a user slot. And as mentioned, there _are_ non-user memslots.

It's fragile as hell.

And it's all completely and utterly pointless. ALL of the above is incredibly much more expensive than just checking the damn address range. So the optimization is completely bogus to begin with, and all it results in is that any bug in this _incredibly_ subtle code will be a security problem.

And I don't understand why you mention set_fs() vs access_ok(). None of this code has anything that messes with set_fs(). The access_ok() is garbage and shouldn't exist, and those user accesses should all use the checking versions and the double underscores are wrong.

I have no idea why you think the double underscores could _possibly_ be worth defending.

Linus
On Sat, May 30, 2020 at 11:52:44AM -0700, Linus Torvalds wrote:
>
> It really isn't.
>
> Your very first statement shows how broken it is:
>
> > FWIW, the kvm side of things (vhost is yet another pile of fun) is
> >
> > [x86] kvm_hv_set_msr_pw():
> > arch/x86/kvm/hyperv.c:1027:	if (__copy_to_user((void __user *)addr, instructions, 4))
> > 	HV_X64_MSR_HYPERCALL
> > arch/x86/kvm/hyperv.c:1132:	if (__clear_user((void __user *)addr, sizeof(u32)))
> > 	HV_X64_MSR_VP_ASSIST_PAGE
> > in both cases addr comes from
> > 	gfn = data >> HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT;
> > 	addr = kvm_vcpu_gfn_to_hva(vcpu, gfn);
> > 	if (kvm_is_error_hva(addr))
> > 		return 1;
>
> Just look at that. You have _zero_ indication that 'addr' is a user
> space address. It could be a kernel address.
>
> That kvm_vcpu_gfn_to_hva() function is a complicated mess that first
> looks for the right 'memslot', and basically uses a search with a
> default slot to try to figure it out. It doesn't even use locking for
> any of it, but assumes the arrays are stable, and that it can use
> atomics to reliably read and set the last successfully found slot.
>
> And none of that code verifies that the end result is a user address.

kvm_is_error_hva() is
	return addr >= PAGE_OFFSET;
On Sat, May 30, 2020 at 11:52:44AM -0700, Linus Torvalds wrote:
> And I don't understand why you mention set_fs() vs access_ok(). None
> of this code has anything that messes with set_fs(). The access_ok()
> is garbage and shouldn't exist, and those user accesses should all use
> the checking versions and the double underscores are wrong.
>
> I have no idea why you think the double underscores could _possibly_
> be worth defending.

I do not. What I'm saying is that this just might be a beast different from *both* __... and the normal ones. I'm not saying that this __put_user() (or __clear_user(), etc.) is the right primitive here. If anything, it's closer to the situation for (x86) copy_stack_trace().
On Sat, May 30, 2020 at 12:14 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> > > And none of that code verifies that the end result is a user address.
> >
> > kvm_is_error_hva() is
> > 	return addr >= PAGE_OFFSET;

Ahh, that's what I missed. It won't work on other architectures, but within x86 it's fine.

Linus
On Sat, May 30, 2020 at 08:19:40PM +0100, Al Viro wrote:
> On Sat, May 30, 2020 at 11:52:44AM -0700, Linus Torvalds wrote:
>
> > And I don't understand why you mention set_fs() vs access_ok(). None
> > of this code has anything that messes with set_fs(). The access_ok()
> > is garbage and shouldn't exist, and those user accesses should all use
> > the checking versions and the double underscores are wrong.
> >
> > I have no idea why you think the double underscores could _possibly_
> > be worth defending.
>
> I do not. What I'm saying is that this just might be a beast different
> from *both* __... and the normal ones. I'm not saying that this
> __put_user() (or __clear_user(), etc.) is the right primitive here.
> If anything, it's closer to the situation for (x86) copy_stack_trace().

... and no, I'm not saying that copy_stack_trace() should stay with __get_user() either. It feels like we are lacking primitives needed to express that cleanly and copy_stack_trace() currently cobbles something up out of what we have. Which works for arch-specific code, but yes, that kind of thing is brittle for arch-independent places like virt/kvm; I wonder if e.g. s390 is really OK there.
On Sat, May 30, 2020 at 12:20:54PM -0700, Linus Torvalds wrote:
> On Sat, May 30, 2020 at 12:14 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > > And none of that code verifies that the end result is a user address.
> >
> > kvm_is_error_hva() is
> > 	return addr >= PAGE_OFFSET;
>
> Ahh, that's what I missed. It won't work on other architectures, but
> within x86 it's fine.

FWIW, we use virt/kvm on x86, powerpc, mips, s390 and arm64.

For x86 and powerpc the check is, AFAICS, OK (ppc kernel might start higher than PAGE_OFFSET, but not lower than it). For arm64... not sure - I'm not familiar with the virtual address space layout we use there. mips does *NOT* get that protection at all - there kvm_is_error_hva() is IS_ERR_VALUE() (thus the "at least on non-mips" upthread). And for s390 it's also IS_ERR_VALUE(), but that's a separate can of worms - there access_ok() is constant true; if we ever hit any of that code in virt/kvm while under KERNEL_DS, we are well and truly fucked there.
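[The gap between the two kvm_is_error_hva() variants Al describes can be made concrete with a small model. The constants below are hypothetical x86-64-style stand-ins: the default check rejects the entire kernel range, while an IS_ERR_VALUE()-style check only rejects the top 4095 values of the address space, so an ordinary kernel address sails through it.]

```c
#include <assert.h>
#include <stdint.h>

#define FAKE_PAGE_OFFSET 0xffff888000000000ULL	/* x86-64 direct-map base */
#define MAX_ERRNO        4095ULL

/* Default kvm_is_error_hva(): anything at or above the kernel mapping
 * is treated as an error, so a passing hva is a user address. */
static int is_error_hva_default(uint64_t addr)
{
	return addr >= FAKE_PAGE_OFFSET;
}

/* IS_ERR_VALUE()-style check (mips/s390): only the last MAX_ERRNO
 * values -- the ones that encode -errno -- are treated as errors. */
static int is_error_hva_err_value(uint64_t addr)
{
	return addr >= (uint64_t)-MAX_ERRNO;
}
```

A typical kernel address is rejected by the first check but accepted by the second, which is the "mips does *NOT* get that protection" point.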
On Sat, May 30, 2020 at 08:42:32PM +0100, Al Viro wrote:
> On Sat, May 30, 2020 at 12:20:54PM -0700, Linus Torvalds wrote:
> > On Sat, May 30, 2020 at 12:14 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > > And none of that code verifies that the end result is a user address.
> > >
> > > kvm_is_error_hva() is
> > > 	return addr >= PAGE_OFFSET;
> >
> > Ahh, that's what I missed. It won't work on other architectures, but
> > within x86 it's fine.
>
> FWIW, we use virt/kvm on x86, powerpc, mips, s390 and arm64.
>
> For x86 and powerpc the check is, AFAICS, OK (ppc kernel might start
> higher than PAGE_OFFSET, but not lower than it). For arm64... not
> sure - I'm not familiar with the virtual address space layout we use
> there. mips does *NOT* get that protection at all - there kvm_is_error_hva()
> is IS_ERR_VALUE() (thus the "at least on non-mips" upthread). And
> for s390 it's also IS_ERR_VALUE(), but that's a separate can of worms -
> there access_ok() is constant true; if we ever hit any of that code in
> virt/kvm while under KERNEL_DS, we are well and truly fucked there.

Anyway, I really think it's too big to handle this cycle, what with the amount of other stuff already in queue. If anything, that __put_user() is a useful marker of the things that will need attention.

That's arch/x86, and the test excluding the kernel space is just upstream of that call, so IMO that's robust enough for now. Crossing the limit just into the beginning of kernel space is theoretically possible, but that would depend upon slot->userspace_addr not being page-aligned (and would attempt to zero up to 3 bytes past the PAGE_OFFSET in any case). If we get memory corruption going on, we have much worse problems than that. And it would have to be memory corruption - ->userspace_addr is assign-once, there's only one place doing the assignments and the alignment check is shortly upstream of it, so all instances must have that field page-aligned all the time.

We'll need to sort the kvm-related issues out, but let's leave it for the next cycle.
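[Al's alignment argument is pure arithmetic and can be checked in a few lines. The sketch below is a hypothetical model (the FAKE_* constants stand in for the real x86-64 values): because PAGE_OFFSET itself is page-aligned, a page-aligned hva that passed the addr < PAGE_OFFSET test cannot have a sizeof(u32) store straddle into kernel space; only a misaligned userspace_addr could put the write within 3 bytes of the boundary.]

```c
#include <assert.h>
#include <stdint.h>

#define FAKE_PAGE_SIZE   4096ULL
#define FAKE_PAGE_OFFSET 0xffff888000000000ULL	/* page-aligned by construction */

/* Returns 1 iff a len-byte store at addr stays entirely below the
 * kernel mapping, given that addr is page-aligned and already passed
 * the addr < PAGE_OFFSET check (the two preconditions Al relies on). */
static int write_stays_below_kernel(uint64_t addr, uint64_t len)
{
	if (addr % FAKE_PAGE_SIZE != 0)	/* alignment assumption broken */
		return 0;
	if (addr >= FAKE_PAGE_OFFSET)	/* would fail the upstream check */
		return 0;
	/* addr is at most PAGE_OFFSET - PAGE_SIZE here, so any write
	 * of len <= PAGE_SIZE ends at or below PAGE_OFFSET. */
	return addr + len <= FAKE_PAGE_OFFSET;
}
```

The highest page-aligned address that passes the check is PAGE_OFFSET - PAGE_SIZE, and a 4-byte store there ends well short of the boundary; a misaligned address such as PAGE_OFFSET - 2 is exactly the "up to 3 bytes past PAGE_OFFSET" case the mail describes.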
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index bcefa9d4e57e..b85b211d4676 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -1129,7 +1129,7 @@ static int kvm_hv_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data, bool host)
 		 * only, there can be valuable data in the rest which needs
 		 * to be preserved e.g. on migration.
 		 */
-		if (__clear_user((void __user *)addr, sizeof(u32)))
+		if (__put_user(0, (u32 __user *)addr))
 			return 1;
 		hv_vcpu->hv_vapic = data;
 		kvm_vcpu_mark_page_dirty(vcpu, gfn);