diff mbox series

[8/9] x86: kvm_hv_set_msr(): use __put_user() instead of 32bit __clear_user()

Message ID 20200529232723.44942-8-viro@ZenIV.linux.org.uk (mailing list archive)
State New, archived
Headers show
Series [1/9] pselect6() and friends: take handling the combined 6th/7th args into helper | expand

Commit Message

Al Viro May 29, 2020, 11:27 p.m. UTC
From: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 arch/x86/kvm/hyperv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Linus Torvalds May 29, 2020, 11:52 p.m. UTC | #1
On Fri, May 29, 2020 at 4:27 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> a/arch/x86/kvm/hyperv.c
> -               if (__clear_user((void __user *)addr, sizeof(u32)))
> +               if (__put_user(0, (u32 __user *)addr))

I'm not doubting that this is a correct transformation and an
improvement, but why is it using that double-underscore version in the
first place?

There's a __copy_to_user() in kvm_hv_set_msr_pw() in addition to this
one in kvm_hv_set_msr(). Both go back to 2011 and commit 8b0cedff040b
("KVM: use __copy_to_user/__clear_user to write guest page") and both
look purely like "pointlessly avoid the access_ok".

All these KVM "optimizations" seem entirely pointless, since
access_ok() isn't the problem. And the address _claims_ to be
verified, but I'm not seeing it. There is not a single 'access_ok()'
anywhere in arch/x86/kvm/ that I can see.

It looks like the argument for the address being validated is that it
comes from "gfn_to_hva()", which should only return
host-virtual-addresses. That may be true.

But "should" is not "does", and honestly, the cost of gfn_to_hva() is
high enough that then using that as an argument for removing
"access_ok()" smells.

So I would suggest just removing all these completely bogus
double-underscore versions. It's pointless, it's wrong, and it's
unsafe.

This isn't even some critical path, but even if it was, that user
address check isn't where the problems are.

             Linus
Al Viro May 30, 2020, 2:31 p.m. UTC | #2
On Fri, May 29, 2020 at 04:52:59PM -0700, Linus Torvalds wrote:
> On Fri, May 29, 2020 at 4:27 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > a/arch/x86/kvm/hyperv.c
> > -               if (__clear_user((void __user *)addr, sizeof(u32)))
> > +               if (__put_user(0, (u32 __user *)addr))
> 
> I'm not doubting that this is a correct transformation and an
> improvement, but why is it using that double-underscore version in the
> first place?
> 
> There's a __copy_to_user() in kvm_hv_set_msr_pw() in addition to this
> one in kvm_hv_set_msr(). Both go back to 2011 and commit 8b0cedff040b
> ("KVM: use __copy_to_user/__clear_user to write guest page") and both
> look purely like "pointlessly avoid the access_ok".
> 
> All these KVM "optimizations" seem entirely pointless, since
> access_ok() isn't the problem. And the address _claims_ to be
> verified, but I'm not seeing it. There is not a single 'access_ok()'
> anywhere in arch/x86/kvm/ that I can see.
> 
> It looks like the argument for the address being validated is that it
> comes from "gfn_to_hva()", which should only return
> host-virtual-addresses. That may be true.
> 
> But "should" is not "does", and honestly, the cost of gfn_to_hva() is
> high enough that then using that as an argument for removing
> "access_ok()" smells.
> 
> So I would suggest just removing all these completely bogus
> double-underscore versions. It's pointless, it's wrong, and it's
> unsafe.

It's a bit trickier than that, but I want to deal with that at the same
time as the rest of kvm/vhost stuff.  So for this series I just went
for minimal change.  There's quite a pile of vhost and kvm stuff,
but it's not ready yet - wait for the next cycle.
Al Viro May 30, 2020, 2:52 p.m. UTC | #3
On Sat, May 30, 2020 at 03:31:47PM +0100, Al Viro wrote:

> It's a bit trickier than that, but I want to deal with that at the same
> time as the rest of kvm/vhost stuff.  So for this series I just went
> for minimal change.  There's quite a pile of vhost and kvm stuff,
> but it's not ready yet - wait for the next cycle.

BTW, regarding uaccess plans for the next cycle:
	* regset mess (at least the ->get() side)
	* killing more compat_alloc_user_space() call sites (_maybe_
all of it, if we are lucky enough; v4l2 is a bitch in that respect,
but I've some ideas on how to deal with that - need to discuss with
mchehab)
	* sorting the remaining (harder) parts of i915 out
	* kvm/vhost
	* fault_in_pages_...() series
That should get rid of almost all __... ones outside of arch/*; might
actually kill copy_in_user() off as well.
	* finally lifting stac/clac out of raw_copy_{to,from}_user().
Paolo Bonzini May 30, 2020, 4:20 p.m. UTC | #4
> On Fri, May 29, 2020 at 04:52:59PM -0700, Linus Torvalds wrote:
>> It looks like the argument for the address being validated is that it
>> comes from "gfn_to_hva()", which should only return
>> host-virtual-addresses. That may be true.

Yes, the access_ok is done in __kvm_set_memory_region and gfn_to_hva()
returns a page-aligned address so it's obviously ok for a u32.

But I have no objections to removing the __ because if a read or write
is in the hot path it will use kvm_write_guest_cached and similar.

Paolo

>> But "should" is not "does", and honestly, the cost of gfn_to_hva() is
>> high enough that then using that as an argument for removing
>> "access_ok()" smells.
Linus Torvalds May 30, 2020, 5:57 p.m. UTC | #5
On Sat, May 30, 2020 at 9:20 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> Yes, the access_ok is done in __kvm_set_memory_region and gfn_to_hva()
> returns a page-aligned address so it's obviously ok for a u32.

It's not that it's "obviously ok for an u32".

It is _not_ obviously ok for a user address. There's actually no
access_ok() done in the lookup path at all, and what gfn_to_hva()
actually ends up doing in the end is __gfn_to_hva_memslot(), which has
zero overflow protection at all, and just does

        slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE;

without us having _ever_ checked that 'gfn' parameter.

Yes, at some point in the very very distant past,
__kvm_set_memory_region() has validated
mem->{userspace_addr,memory_size}.  But even that validation is
actually very questionable, since it's not even done for all of the
memory slots, only the "user" ones.

So if at any point we have a non-user slot, of it at any point the gfn
thing was mis-calculated and {over,under}flows, there are no
protections what-so-ever.

In other words, it really looks like kvm is entirely dependent on
magic and luck and a desperate hope that there are no other bugs to
keep the end result as a user address.

Because if _any_ bug or oversight in that kvm_memory_slot handling
ever happens, you end up with random crap.

So no. I disagree. There is absolutely nothing "obviously ok" about
any of that kvm code. Quite the reverse.

I'd argue that it's very much obviously *NOT* ok, even while it might
just happen to work.

That double underscore needs to go away. It's either actively buggy
right now and I see no proof it isn't, or it's a bug just waiting to
happen in the future.

               Linus
Al Viro May 30, 2020, 6:38 p.m. UTC | #6
On Sat, May 30, 2020 at 10:57:24AM -0700, Linus Torvalds wrote:

> So no. I disagree. There is absolutely nothing "obviously ok" about
> any of that kvm code. Quite the reverse.
> 
> I'd argue that it's very much obviously *NOT* ok, even while it might
> just happen to work.

Actually, it's somewhat less brittle than you think (on non-mips, at least)
and not due to those long-ago access_ok().

> That double underscore needs to go away. It's either actively buggy
> right now and I see no proof it isn't, or it's a bug just waiting to
> happen in the future.

FWIW, the kvm side of things (vhost is yet another pile of fun) is

[x86] kvm_hv_set_msr_pw():
arch/x86/kvm/hyperv.c:1027:             if (__copy_to_user((void __user *)addr, instructions, 4))
	HV_X64_MSR_HYPERCALL
arch/x86/kvm/hyperv.c:1132:             if (__clear_user((void __user *)addr, sizeof(u32)))
	HV_X64_MSR_VP_ASSIST_PAGE
in both cases addr comes from
                gfn = data >> HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT;
                addr = kvm_vcpu_gfn_to_hva(vcpu, gfn);
                if (kvm_is_error_hva(addr))
                        return 1;

[x86] FNAME(walk_addr_generic), very hot:
arch/x86/kvm/mmu/paging_tmpl.h:403:             if (unlikely(__get_user(pte, ptep_user)))
                index = PT_INDEX(addr, walker->level);
                ...
                offset    = index * sizeof(pt_element_t);
		...
                host_addr = kvm_vcpu_gfn_to_hva_prot(vcpu, real_gfn,
                                            &walker->pte_writable[walker->level - 1]);
                if (unlikely(kvm_is_error_hva(host_addr)))
                        goto error;
                ptep_user = (pt_element_t __user *)((void *)host_addr + offset);

__kvm_read_guest_page():
virt/kvm/kvm_main.c:2252:       r = __copy_from_user(data, (void __user *)addr + offset, len);
        addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
        if (kvm_is_error_hva(addr))
                return -EFAULT;

__kvm_read_guest_atomic():
virt/kvm/kvm_main.c:2326:       r = __copy_from_user_inatomic(data, (void __user *)addr + offset, len);
        addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
        if (kvm_is_error_hva(addr))
                return -EFAULT;

__kvm_write_guest_page():
virt/kvm/kvm_main.c:2353:       r = __copy_to_user((void __user *)addr + offset, data, len);
        addr = gfn_to_hva_memslot(memslot, gfn);
        if (kvm_is_error_hva(addr))
                return -EFAULT;

kvm_write_guest_offset_cached():
virt/kvm/kvm_main.c:2490:       r = __copy_to_user((void __user *)ghc->hva + offset, data, len);
        if (kvm_is_error_hva(ghc->hva))
                return -EFAULT;

kvm_read_guest_cached():
virt/kvm/kvm_main.c:2525:       r = __copy_from_user(data, (void __user *)ghc->hva, len);
        if (kvm_is_error_hva(ghc->hva))
                return -EFAULT;

default kvm_is_error_hva() is addr >= PAGE_OFFSET; however, on mips and s390 it's
IS_ERR_VALUE().

Sure, we can use non-__ variants, but is access_ok() the right primitive here?
We want userland memory, set_fs() be damned.
Linus Torvalds May 30, 2020, 6:52 p.m. UTC | #7
On Sat, May 30, 2020 at 11:39 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Actually, it's somewhat less brittle than you think (on non-mips, at least)
> and not due to those long-ago access_ok().

It really isn't.

Your very first statement shows how broken it is:

> FWIW, the kvm side of things (vhost is yet another pile of fun) is
>
> [x86] kvm_hv_set_msr_pw():
> arch/x86/kvm/hyperv.c:1027:             if (__copy_to_user((void __user *)addr, instructions, 4))
>         HV_X64_MSR_HYPERCALL
> arch/x86/kvm/hyperv.c:1132:             if (__clear_user((void __user *)addr, sizeof(u32)))
>         HV_X64_MSR_VP_ASSIST_PAGE
> in both cases addr comes from
>                 gfn = data >> HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT;
>                 addr = kvm_vcpu_gfn_to_hva(vcpu, gfn);
>                 if (kvm_is_error_hva(addr))
>                         return 1;

Just look at that. You have _zero_ indication that 'adds" is a user
space address. It could be a kernel address.

That kvm_vcpu_gfn_to_hva() function is a complicated mess that first
looks for the right 'memslot', and basically uses a search with a
default slot to try to figure it out. It doesn't even use locking for
any of it, but assumes the arrays are stable, and that it can use
atomics to reliably read and set the last successfully found slot.

And none of that code verifies that the end result is a user address.

It _literally_ all depends on this optimistically lock-free code being
bug-free, and never using a slot that isn't a user slot. And as
mentioned, there _are_ non-user memslots.

It's fragile as hell.

And it's all completely and utterly pointless. ALL of the above is
incredibly much more expensive than just checking the damn address
range.

So the optimization is completely bogus to begin with, and all it
results in is that any bug in this _incredibly_ subtle code will be a
security proiblem.

And I don't understand why you mention set_fs() vs access_ok(). None
of this code has anything that messes with set_fs(). The access_ok()
is garbage and shouldn't exist, and those user accesses should all use
the checking versions and the double underscores are wrong.

I have no idea why you think the double underscores could _possibly_
be worth defending.

            Linus
Al Viro May 30, 2020, 7:14 p.m. UTC | #8
On Sat, May 30, 2020 at 11:52:44AM -0700, Linus Torvalds wrote:
> 
> It really isn't.
> 
> Your very first statement shows how broken it is:
> 
> > FWIW, the kvm side of things (vhost is yet another pile of fun) is
> >
> > [x86] kvm_hv_set_msr_pw():
> > arch/x86/kvm/hyperv.c:1027:             if (__copy_to_user((void __user *)addr, instructions, 4))
> >         HV_X64_MSR_HYPERCALL
> > arch/x86/kvm/hyperv.c:1132:             if (__clear_user((void __user *)addr, sizeof(u32)))
> >         HV_X64_MSR_VP_ASSIST_PAGE
> > in both cases addr comes from
> >                 gfn = data >> HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT;
> >                 addr = kvm_vcpu_gfn_to_hva(vcpu, gfn);
> >                 if (kvm_is_error_hva(addr))
> >                         return 1;
> 
> Just look at that. You have _zero_ indication that 'adds" is a user
> space address. It could be a kernel address.
> 
> That kvm_vcpu_gfn_to_hva() function is a complicated mess that first
> looks for the right 'memslot', and basically uses a search with a
> default slot to try to figure it out. It doesn't even use locking for
> any of it, but assumes the arrays are stable, and that it can use
> atomics to reliably read and set the last successfully found slot.
> 
> And none of that code verifies that the end result is a user address.

kvm_is_error_hva() is
	return addr >= PAGE_OFFSET;
Al Viro May 30, 2020, 7:19 p.m. UTC | #9
On Sat, May 30, 2020 at 11:52:44AM -0700, Linus Torvalds wrote:

> And I don't understand why you mention set_fs() vs access_ok(). None
> of this code has anything that messes with set_fs(). The access_ok()
> is garbage and shouldn't exist, and those user accesses should all use
> the checking versions and the double underscores are wrong.
> 
> I have no idea why you think the double underscores could _possibly_
> be worth defending.

I do not.  What I'm saying is that this just might be a beast different
from *both* __... and the normal ones.  I'm not saying that this
__put_user() (or __clear_user(), etc.) is the right primitive here.
If anything, it's closer to the situation for (x86) copy_stack_trace().
Linus Torvalds May 30, 2020, 7:20 p.m. UTC | #10
On Sat, May 30, 2020 at 12:14 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> > And none of that code verifies that the end result is a user address.
>
> kvm_is_error_hva() is
>         return addr >= PAGE_OFFSET;

Ahh, that's what I missed. It won't work on other architectures, but
within x86 it's fine.

                  Linus
Al Viro May 30, 2020, 7:27 p.m. UTC | #11
On Sat, May 30, 2020 at 08:19:40PM +0100, Al Viro wrote:
> On Sat, May 30, 2020 at 11:52:44AM -0700, Linus Torvalds wrote:
> 
> > And I don't understand why you mention set_fs() vs access_ok(). None
> > of this code has anything that messes with set_fs(). The access_ok()
> > is garbage and shouldn't exist, and those user accesses should all use
> > the checking versions and the double underscores are wrong.
> > 
> > I have no idea why you think the double underscores could _possibly_
> > be worth defending.
> 
> I do not.  What I'm saying is that this just might be a beast different
> from *both* __... and the normal ones.  I'm not saying that this
> __put_user() (or __clear_user(), etc.) is the right primitive here.
> If anything, it's closer to the situation for (x86) copy_stack_trace().

... and no, I'm not saying that copy_stack_trace() should stay with
__get_user() either.  It feels like we are lacking primitives needed
to express that cleanly and copy_stack_trace() currently cobbles something
up out of what we have.  Which works for arch-specific code, but yes,
that kind of thing is brittle for arch-independent places like virt/kvm;
I wonder if e.g. s390 is really OK there.
Al Viro May 30, 2020, 7:42 p.m. UTC | #12
On Sat, May 30, 2020 at 12:20:54PM -0700, Linus Torvalds wrote:
> On Sat, May 30, 2020 at 12:14 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > > And none of that code verifies that the end result is a user address.
> >
> > kvm_is_error_hva() is
> >         return addr >= PAGE_OFFSET;
> 
> Ahh, that's what I missed. It won't work on other architectures, but
> within x86 it's fine.

FWIW, we use virt/kvm on x86, powerpc, mips, s390 and arm64.

For x86 and powerpc the check is, AFAICS, OK (ppc kernel might start
higher than PAGE_OFFSET, but not lower than it).  For arm64... not
sure - I'm not familiar with the virtual address space layout we use
there.  mips does *NOT* get that protection at all - there kvm_is_error_hva()
is IS_ERR_VALUE() (thus the "at least on non-mips" upthread).  And
for s390 it's also IS_ERR_VALUE(), but that's an separate can of worms -
there access_ok() is constant true; if we ever hit any of that code in
virt/kvm while under KERNEL_DS, we are well and truly fucked there.
Al Viro May 30, 2020, 8:43 p.m. UTC | #13
On Sat, May 30, 2020 at 08:42:32PM +0100, Al Viro wrote:
> On Sat, May 30, 2020 at 12:20:54PM -0700, Linus Torvalds wrote:
> > On Sat, May 30, 2020 at 12:14 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > > And none of that code verifies that the end result is a user address.
> > >
> > > kvm_is_error_hva() is
> > >         return addr >= PAGE_OFFSET;
> > 
> > Ahh, that's what I missed. It won't work on other architectures, but
> > within x86 it's fine.
> 
> FWIW, we use virt/kvm on x86, powerpc, mips, s390 and arm64.
> 
> For x86 and powerpc the check is, AFAICS, OK (ppc kernel might start
> higher than PAGE_OFFSET, but not lower than it).  For arm64... not
> sure - I'm not familiar with the virtual address space layout we use
> there.  mips does *NOT* get that protection at all - there kvm_is_error_hva()
> is IS_ERR_VALUE() (thus the "at least on non-mips" upthread).  And
> for s390 it's also IS_ERR_VALUE(), but that's an separate can of worms -
> there access_ok() is constant true; if we ever hit any of that code in
> virt/kvm while under KERNEL_DS, we are well and truly fucked there.

Anyway, I really think it's too big to handle this cycle, what with the
amount of other stuff already in queue.  If anything, that __put_user()
is a useful marker of the things that will need attention.  That's arch/x86
and the test excluding the kernel space is just upstream of that call,
so IMO that's robust enough for now.  Crossing the limit just into the
beginning of kernel space is theoretically possible, but that would
depend upon slot->userspace_addr not being page-aligned (and would attempt
to zero up to 3 bytes past the PAGE_OFFSET in any case).  If we get
memory corruption going on, we have much worse problems than that.
And it would have to be memory corruption - ->userspace_addr is assign-once,
there's only one place doing the assignments and alignment check is
shortly upstream of it, so all instances must have that field page-aligned
all the time.

We'll need to sort the kvm-related issues out, but let's leave it for the
next cycle.
diff mbox series

Patch

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index bcefa9d4e57e..b85b211d4676 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -1129,7 +1129,7 @@  static int kvm_hv_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data, bool host)
 		 * only, there can be valuable data in the rest which needs
 		 * to be preserved e.g. on migration.
 		 */
-		if (__clear_user((void __user *)addr, sizeof(u32)))
+		if (__put_user(0, (u32 __user *)addr))
 			return 1;
 		hv_vcpu->hv_vapic = data;
 		kvm_vcpu_mark_page_dirty(vcpu, gfn);