diff mbox series

[1/2] KVM: nVMX: fix CR4_READ_SHADOW when L0 updates CR4 during a signal

Message ID 20240416123558.212040-1-julian.stecklina@cyberus-technology.de (mailing list archive)
State New
Headers show
Series [1/2] KVM: nVMX: fix CR4_READ_SHADOW when L0 updates CR4 during a signal | expand

Commit Message

Julian Stecklina April 16, 2024, 12:35 p.m. UTC
From: Thomas Prescher <thomas.prescher@cyberus-technology.de>

This issue occurs when the kernel is interrupted by a signal while
running a L2 guest. If the signal is meant to be delivered to the L0
VMM, and L0 updates CR4 for L1, i.e. when the VMM sets
KVM_SYNC_X86_SREGS in kvm_run->kvm_dirty_regs, the kernel programs an
incorrect read shadow value for L2's CR4.

The result is that the guest can read a value for CR4 where bits from
L1 have leaked into L2.

We found this issue by running uXen [1] as L2 in VirtualBox/KVM [2].
The issue can also easily be reproduced in Qemu/KVM if we force a sreg
sync on each call to KVM_RUN [3]. The issue can also be reproduced by
running a L2 Windows 10. In the Windows case, CR4.VMXE leaks from L1
to L2 causing the OS to blue-screen with a kernel thread exception
during TLB invalidation where the following code sequence triggers the
issue:

mov rax, cr4 <--- L2 reads CR4 with contents from L1
mov rcx, cr4
btc 0x7, rax <--- L2 toggles CR4.PGE
mov cr4, rax <--- #GP because L2 writes CR4 with reserved bits set
mov cr4, rcx

The existing code seems to fixup CR4_READ_SHADOW after calling
vmx_set_cr4 except in __set_sregs_common. While we could fix it there
as well, it's easier to just handle it centrally.

There might be a similar issue with CR0.

[1] https://github.com/OpenXT/uxen
[2] https://github.com/cyberus-technology/virtualbox-kvm
[3] https://github.com/tpressure/qemu/commit/d64c9d5e76f3f3b747bea7653d677bd61e13aafe

Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de>
---
 arch/x86/kvm/vmx/vmx.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Comments

Sean Christopherson April 16, 2024, 2:35 p.m. UTC | #1
On Tue, Apr 16, 2024, Julian Stecklina wrote:
> From: Thomas Prescher <thomas.prescher@cyberus-technology.de>
> 
> This issue occurs when the kernel is interrupted by a signal while
> running a L2 guest. If the signal is meant to be delivered to the L0
> VMM, and L0 updates CR4 for L1, i.e. when the VMM sets
> KVM_SYNC_X86_SREGS in kvm_run->kvm_dirty_regs, the kernel programs an
> incorrect read shadow value for L2's CR4.
> 
> The result is that the guest can read a value for CR4 where bits from
> L1 have leaked into L2.

No, this is a userspace bug.  If L2 is active when userspace stuffs register state,
then from KVM's perspective the incoming value is L2's value.  E.g. if userspace
*wants* to update L2 CR4 for whatever reason, this patch would result in L2 getting
a stale value, i.e. the value of CR4 at the time of VM-Enter.

And even if userspace wants to change L1, this patch is wrong, as KVM is writing
vmcs02.GUEST_CR4, i.e. is clobbering the L2 CR4 that was programmed by L1, *and*
is dropping the CR4 value that userspace wanted to stuff for L1.

To fix this, your userspace needs to either wait until L2 isn't active, or force
the vCPU out of L2 (which isn't easy, but it's doable if absolutely necessary).

Pulling in a snippet from the initial bug report[*],

 : The reason why this triggers in VirtualBox and not in Qemu is that there are
 : cases where VirtualBox marks CR4 dirty even though it hasn't changed.

simply not trying to stuff register state dirty when L2 is active sounds like it
would resolve the issue.

https://lore.kernel.org/all/af2ede328efee9dc3761333bd47648ee6f752686.camel@cyberus-technology.de

> We found this issue by running uXen [1] as L2 in VirtualBox/KVM [2].
> The issue can also easily be reproduced in Qemu/KVM if we force a sreg
> sync on each call to KVM_RUN [3]. The issue can also be reproduced by
> running a L2 Windows 10. In the Windows case, CR4.VMXE leaks from L1
> to L2 causing the OS to blue-screen with a kernel thread exception
> during TLB invalidation where the following code sequence triggers the
> issue:
> 
> mov rax, cr4 <--- L2 reads CR4 with contents from L1
> mov rcx, cr4
> btc 0x7, rax <--- L2 toggles CR4.PGE
> mov cr4, rax <--- #GP because L2 writes CR4 with reserved bits set
> mov cr4, rcx
> 
> The existing code seems to fixup CR4_READ_SHADOW after calling
> vmx_set_cr4 except in __set_sregs_common. While we could fix it there
> as well, it's easier to just handle it centrally.
> 
> There might be a similar issue with CR0.
> 
> [1] https://github.com/OpenXT/uxen
> [2] https://github.com/cyberus-technology/virtualbox-kvm
> [3] https://github.com/tpressure/qemu/commit/d64c9d5e76f3f3b747bea7653d677bd61e13aafe
> 
> Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
> Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de>

SoB is reversed, yours should come after Thomas'.

> ---
>  arch/x86/kvm/vmx/vmx.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 6780313914f8..0d4af00245f3 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -3474,7 +3474,11 @@ void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>  			hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE);
>  	}
>  
> -	vmcs_writel(CR4_READ_SHADOW, cr4);
> +	if (is_guest_mode(vcpu))
> +		vmcs_writel(CR4_READ_SHADOW, nested_read_cr4(get_vmcs12(vcpu)));
> +	else
> +		vmcs_writel(CR4_READ_SHADOW, cr4);
> +
>  	vmcs_writel(GUEST_CR4, hw_cr4);
>  
>  	if ((cr4 ^ old_cr4) & (X86_CR4_OSXSAVE | X86_CR4_PKE))
> -- 
> 2.43.2
>
Thomas Prescher April 16, 2024, 3:08 p.m. UTC | #2
Hi Sean,

On Tue, 2024-04-16 at 07:35 -0700, Sean Christopherson wrote:
> On Tue, Apr 16, 2024, Julian Stecklina wrote:
> > From: Thomas Prescher <thomas.prescher@cyberus-technology.de>
> > 
> > This issue occurs when the kernel is interrupted by a signal while
> > running a L2 guest. If the signal is meant to be delivered to the
> > L0
> > VMM, and L0 updates CR4 for L1, i.e. when the VMM sets
> > KVM_SYNC_X86_SREGS in kvm_run->kvm_dirty_regs, the kernel programs
> > an
> > incorrect read shadow value for L2's CR4.
> > 
> > The result is that the guest can read a value for CR4 where bits
> > from
> > L1 have leaked into L2.
> 
> No, this is a userspace bug.  If L2 is active when userspace stuffs
> register state,
> then from KVM's perspective the incoming value is L2's value.  E.g.
> if userspace
> *wants* to update L2 CR4 for whatever reason, this patch would result
> in L2 getting
> a stale value, i.e. the value of CR4 at the time of VM-Enter.
> 
> And even if userspace wants to change L1, this patch is wrong, as KVM
> is writing
> vmcs02.GUEST_CR4, i.e. is clobbering the L2 CR4 that was programmed
> by L1, *and*
> is dropping the CR4 value that userspace wanted to stuff for L1.
> 
> To fix this, your userspace needs to either wait until L2 isn't
> active, or force
> the vCPU out of L2 (which isn't easy, but it's doable if absolutely
> necessary).

What you say makes sense. Is there any way for
userspace to detect whether L2 is currently active after
returning from KVM_RUN? I couldn't find anything in the official
documentation https://docs.kernel.org/virt/kvm/api.html

Can you point me into the right direction?

> 
> Pulling in a snippet from the initial bug report[*],
> 
>  : The reason why this triggers in VirtualBox and not in Qemu is that
> there are
>  : cases where VirtualBox marks CR4 dirty even though it hasn't
> changed.
> 
> simply not trying to stuff register state dirty when L2 is active
> sounds like it
> would resolve the issue.
> 
> https://lore.kernel.org/all/af2ede328efee9dc3761333bd47648ee6f752686.camel@cyberus-technology.de
> 
> > We found this issue by running uXen [1] as L2 in VirtualBox/KVM
> > [2].
> > The issue can also easily be reproduced in Qemu/KVM if we force a
> > sreg
> > sync on each call to KVM_RUN [3]. The issue can also be reproduced
> > by
> > running a L2 Windows 10. In the Windows case, CR4.VMXE leaks from
> > L1
> > to L2 causing the OS to blue-screen with a kernel thread exception
> > during TLB invalidation where the following code sequence triggers
> > the
> > issue:
> > 
> > mov rax, cr4 <--- L2 reads CR4 with contents from L1
> > mov rcx, cr4
> > btc 0x7, rax <--- L2 toggles CR4.PGE
> > mov cr4, rax <--- #GP because L2 writes CR4 with reserved bits set
> > mov cr4, rcx
> > 
> > The existing code seems to fixup CR4_READ_SHADOW after calling
> > vmx_set_cr4 except in __set_sregs_common. While we could fix it
> > there
> > as well, it's easier to just handle it centrally.
> > 
> > There might be a similar issue with CR0.
> > 
> > [1] https://github.com/OpenXT/uxen
> > [2] https://github.com/cyberus-technology/virtualbox-kvm
> > [3]
> > https://github.com/tpressure/qemu/commit/d64c9d5e76f3f3b747bea7653d677bd61e13aafe
> > 
> > Signed-off-by: Julian Stecklina
> > <julian.stecklina@cyberus-technology.de>
> > Signed-off-by: Thomas Prescher
> > <thomas.prescher@cyberus-technology.de>
> 
> SoB is reversed, yours should come after Thomas'.
> 
> > ---
> >  arch/x86/kvm/vmx/vmx.c | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 6780313914f8..0d4af00245f3 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -3474,7 +3474,11 @@ void vmx_set_cr4(struct kvm_vcpu *vcpu,
> > unsigned long cr4)
> >  			hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP |
> > X86_CR4_PKE);
> >  	}
> >  
> > -	vmcs_writel(CR4_READ_SHADOW, cr4);
> > +	if (is_guest_mode(vcpu))
> > +		vmcs_writel(CR4_READ_SHADOW,
> > nested_read_cr4(get_vmcs12(vcpu)));
> > +	else
> > +		vmcs_writel(CR4_READ_SHADOW, cr4);
> > +
> >  	vmcs_writel(GUEST_CR4, hw_cr4);
> >  
> >  	if ((cr4 ^ old_cr4) & (X86_CR4_OSXSAVE | X86_CR4_PKE))
> > -- 
> > 2.43.2
> >
Sean Christopherson April 16, 2024, 3:17 p.m. UTC | #3
On Tue, Apr 16, 2024, Thomas Prescher wrote:
> Hi Sean,
> 
> On Tue, 2024-04-16 at 07:35 -0700, Sean Christopherson wrote:
> > On Tue, Apr 16, 2024, Julian Stecklina wrote:
> > > From: Thomas Prescher <thomas.prescher@cyberus-technology.de>
> > > 
> > > This issue occurs when the kernel is interrupted by a signal while
> > > running a L2 guest. If the signal is meant to be delivered to the L0 VMM,
> > > and L0 updates CR4 for L1, i.e. when the VMM sets KVM_SYNC_X86_SREGS in
> > > kvm_run->kvm_dirty_regs, the kernel programs an incorrect read shadow
> > > value for L2's CR4.
> > > 
> > > The result is that the guest can read a value for CR4 where bits from L1
> > > have leaked into L2.
> > 
> > No, this is a userspace bug.  If L2 is active when userspace stuffs
> > register state, then from KVM's perspective the incoming value is L2's
> > value.  E.g.  if userspace *wants* to update L2 CR4 for whatever reason,
> > this patch would result in L2 getting a stale value, i.e. the value of CR4
> > at the time of VM-Enter.
> > 
> > And even if userspace wants to change L1, this patch is wrong, as KVM is
> > writing vmcs02.GUEST_CR4, i.e. is clobbering the L2 CR4 that was programmed
> > by L1, *and* is dropping the CR4 value that userspace wanted to stuff for
> > L1.
> > 
> > To fix this, your userspace needs to either wait until L2 isn't active, or
> > force the vCPU out of L2 (which isn't easy, but it's doable if absolutely
> > necessary).
> 
> What you say makes sense. Is there any way for
> userspace to detect whether L2 is currently active after
> returning from KVM_RUN? I couldn't find anything in the official
> documentation https://docs.kernel.org/virt/kvm/api.html
> 
> Can you point me into the right direction?

Hmm, the only way to query that information is via KVM_GET_NESTED_STATE, which is
a bit unfortunate as that is a fairly "heavy" ioctl().
Thomas Prescher April 16, 2024, 5:31 p.m. UTC | #4
On Tue, 2024-04-16 at 08:17 -0700, Sean Christopherson wrote:
> On Tue, Apr 16, 2024, Thomas Prescher wrote:
> > Hi Sean,
> > 
> > On Tue, 2024-04-16 at 07:35 -0700, Sean Christopherson wrote:
> > > On Tue, Apr 16, 2024, Julian Stecklina wrote:
> > > > From: Thomas Prescher <thomas.prescher@cyberus-technology.de>
> > > > 
> > > > This issue occurs when the kernel is interrupted by a signal
> > > > while
> > > > running a L2 guest. If the signal is meant to be delivered to
> > > > the L0 VMM,
> > > > and L0 updates CR4 for L1, i.e. when the VMM sets
> > > > KVM_SYNC_X86_SREGS in
> > > > kvm_run->kvm_dirty_regs, the kernel programs an incorrect read
> > > > shadow
> > > > value for L2's CR4.
> > > > 
> > > > The result is that the guest can read a value for CR4 where
> > > > bits from L1
> > > > have leaked into L2.
> > > 
> > > No, this is a userspace bug.  If L2 is active when userspace
> > > stuffs
> > > register state, then from KVM's perspective the incoming value is
> > > L2's
> > > value.  E.g.  if userspace *wants* to update L2 CR4 for whatever
> > > reason,
> > > this patch would result in L2 getting a stale value, i.e. the
> > > value of CR4
> > > at the time of VM-Enter.
> > > 
> > > And even if userspace wants to change L1, this patch is wrong, as
> > > KVM is
> > > writing vmcs02.GUEST_CR4, i.e. is clobbering the L2 CR4 that was
> > > programmed
> > > by L1, *and* is dropping the CR4 value that userspace wanted to
> > > stuff for
> > > L1.
> > > 
> > > To fix this, your userspace needs to either wait until L2 isn't
> > > active, or
> > > force the vCPU out of L2 (which isn't easy, but it's doable if
> > > absolutely
> > > necessary).
> > 
> > What you say makes sense. Is there any way for
> > userspace to detect whether L2 is currently active after
> > returning from KVM_RUN? I couldn't find anything in the official
> > documentation https://docs.kernel.org/virt/kvm/api.html
> > 
> > Can you point me into the right direction?
> 
> Hmm, the only way to query that information is via
> KVM_GET_NESTED_STATE, which is
> a bit unfortunate as that is a fairly "heavy" ioctl().
 
Indeed. What still does not make sense to me is that userspace should
be able to modify the L2 state. From what I can see, even in this
scenario, L0 exits with the L1 state. So what you are saying is that
userspace should be allowed to change L2 even if it receives the
architectural state from L1? What would be the use-case for this
scenario?

If the above is true, I think we should document this explicitly
because it's not obvious, at least not for me ;)

How would you feel if we extend struct kvm_run with a
nested_guest_active flag? This should be fairly cheap and would make
the integration into VirtualBox/KVM much easier. We could also only
sync this flag explicitly, e.g. with a KVM_SYNC_X86_NESTED_GUEST?
Sean Christopherson April 16, 2024, 6:07 p.m. UTC | #5
On Tue, Apr 16, 2024, Thomas Prescher wrote:
> On Tue, 2024-04-16 at 08:17 -0700, Sean Christopherson wrote:
> > On Tue, Apr 16, 2024, Thomas Prescher wrote:
> > > Hi Sean,
> > > 
> > > On Tue, 2024-04-16 at 07:35 -0700, Sean Christopherson wrote:
> > > > On Tue, Apr 16, 2024, Julian Stecklina wrote:
> > > > > From: Thomas Prescher <thomas.prescher@cyberus-technology.de>
> > > > > 
> > > > > This issue occurs when the kernel is interrupted by a signal while
> > > > > running a L2 guest. If the signal is meant to be delivered to the L0
> > > > > VMM, and L0 updates CR4 for L1, i.e. when the VMM sets
> > > > > KVM_SYNC_X86_SREGS in kvm_run->kvm_dirty_regs, the kernel programs an
> > > > > incorrect read shadow value for L2's CR4.
> > > > > 
> > > > > The result is that the guest can read a value for CR4 where bits from
> > > > > L1 have leaked into L2.
> > > > 
> > > > No, this is a userspace bug.  If L2 is active when userspace stuffs
> > > > register state, then from KVM's perspective the incoming value is L2's
> > > > value.  E.g.  if userspace *wants* to update L2 CR4 for whatever
> > > > reason, this patch would result in L2 getting a stale value, i.e. the
> > > > value of CR4 at the time of VM-Enter.
> > > > 
> > > > And even if userspace wants to change L1, this patch is wrong, as KVM
> > > > is writing vmcs02.GUEST_CR4, i.e. is clobbering the L2 CR4 that was
> > > > programmed by L1, *and* is dropping the CR4 value that userspace wanted
> > > > to stuff for L1.
> > > > 
> > > > To fix this, your userspace needs to either wait until L2 isn't active,
> > > > or force the vCPU out of L2 (which isn't easy, but it's doable if
> > > > absolutely necessary).
> > > 
> > > What you say makes sense. Is there any way for
> > > userspace to detect whether L2 is currently active after
> > > returning from KVM_RUN? I couldn't find anything in the official
> > > documentation https://docs.kernel.org/virt/kvm/api.html
> > > 
> > > Can you point me into the right direction?
> > 
> > Hmm, the only way to query that information is via KVM_GET_NESTED_STATE,
> > which is a bit unfortunate as that is a fairly "heavy" ioctl().

Hur dur, I forgot that KVM provides a "guest_mode" stat.  Userspace can do
KVM_GET_STATS_FD on the vCPU FD to get a file handle to the binary stats, and
then you wouldn't need to call back into KVM just to query guest_mode.

Ah, and I also forgot that we have kvm_run.flags, so adding KVM_RUN_X86_GUEST_MODE
would also be trivial (I almost suggested it earlier, but didn't want to add a
new field to kvm_run without a very good reason).

> Indeed. What still does not make sense to me is that userspace should be able
> to modify the L2 state. From what I can see, even in this scenario, L0 exits
> with the L1 state.

No, KVM exits with L2.  Or at least, KVM is supposed to exit with L2 state.  Exiting
with L1 state would actually be quite difficult, as KVM would need to manually
switch to vmcs01, pull out state, and then switch back to vmcs02.  And for GPRs,
KVM doesn't have L1 state because most GPRs are "volatile", i.e. are clobbered by
VM-Enter and thus need to be manually managed by the hypervisor.

I assume you're using kvm_run.kvm_valid_regs to instruct KVM to save vCPU state
when exiting to userspace?  That path grabs state straight from the vCPU, without
regard to is_guest_mode().  If you're seeing L1 state, then there's likely a bug
somewhere, likely in userspace again.   While valid_regs kvm_run.kvm_valid_regs
might not be heavily used, the underlying code is very heavily used for doing
save/restore while L2 is active, e.g. for live migration.

> So what you are saying is that userspace should be allowed to change L2 even
> if it receives the architectural state from L1?

No, see above.

> What would be the use-case for this scenario?
> 
> If the above is true, I think we should document this explicitly
> because it's not obvious, at least not for me ;)
> 
> How would you feel if we extend struct kvm_run with a
> nested_guest_active flag? This should be fairly cheap and would make
> the integration into VirtualBox/KVM much easier. We could also only
> sync this flag explicitly, e.g. with a KVM_SYNC_X86_NESTED_GUEST?

Heh, see above regarding KVM_RUN_X86_GUEST_MODE.
Thomas Prescher April 17, 2024, 1:05 p.m. UTC | #6
On Tue, 2024-04-16 at 11:07 -0700, Sean Christopherson wrote:
> On Tue, Apr 16, 2024, Thomas Prescher wrote:
> > On Tue, 2024-04-16 at 08:17 -0700, Sean Christopherson wrote:
> > > On Tue, Apr 16, 2024, Thomas Prescher wrote:
> > > > Hi Sean,
> > > > 
> > > > On Tue, 2024-04-16 at 07:35 -0700, Sean Christopherson wrote:
> > > > > On Tue, Apr 16, 2024, Julian Stecklina wrote:
> > > > > > From: Thomas Prescher
> > > > > > <thomas.prescher@cyberus-technology.de>
> > > > > > 
> > > > > > This issue occurs when the kernel is interrupted by a
> > > > > > signal while
> > > > > > running a L2 guest. If the signal is meant to be delivered
> > > > > > to the L0
> > > > > > VMM, and L0 updates CR4 for L1, i.e. when the VMM sets
> > > > > > KVM_SYNC_X86_SREGS in kvm_run->kvm_dirty_regs, the kernel
> > > > > > programs an
> > > > > > incorrect read shadow value for L2's CR4.
> > > > > > 
> > > > > > The result is that the guest can read a value for CR4 where
> > > > > > bits from
> > > > > > L1 have leaked into L2.
> > > > > 
> > > > > No, this is a userspace bug.  If L2 is active when userspace
> > > > > stuffs
> > > > > register state, then from KVM's perspective the incoming
> > > > > value is L2's
> > > > > value.  E.g.  if userspace *wants* to update L2 CR4 for
> > > > > whatever
> > > > > reason, this patch would result in L2 getting a stale value,
> > > > > i.e. the
> > > > > value of CR4 at the time of VM-Enter.
> > > > > 
> > > > > And even if userspace wants to change L1, this patch is
> > > > > wrong, as KVM
> > > > > is writing vmcs02.GUEST_CR4, i.e. is clobbering the L2 CR4
> > > > > that was
> > > > > programmed by L1, *and* is dropping the CR4 value that
> > > > > userspace wanted
> > > > > to stuff for L1.
> > > > > 
> > > > > To fix this, your userspace needs to either wait until L2
> > > > > isn't active,
> > > > > or force the vCPU out of L2 (which isn't easy, but it's
> > > > > doable if
> > > > > absolutely necessary).
> > > > 
> > > > What you say makes sense. Is there any way for
> > > > userspace to detect whether L2 is currently active after
> > > > returning from KVM_RUN? I couldn't find anything in the
> > > > official
> > > > documentation https://docs.kernel.org/virt/kvm/api.html
> > > > 
> > > > Can you point me into the right direction?
> > > 
> > > Hmm, the only way to query that information is via
> > > KVM_GET_NESTED_STATE,
> > > which is a bit unfortunate as that is a fairly "heavy" ioctl().
> 
> Hur dur, I forgot that KVM provides a "guest_mode" stat.  Userspace
> can do
> KVM_GET_STATS_FD on the vCPU FD to get a file handle to the binary
> stats, and
> then you wouldn't need to call back into KVM just to query
> guest_mode.
> 
> Ah, and I also forgot that we have kvm_run.flags, so adding
> KVM_RUN_X86_GUEST_MODE
> would also be trivial (I almost suggested it earlier, but didn't want
> to add a
> new field to kvm_run without a very good reason).

Thanks for the pointers. This is really helpful.

I tried the "guest_mode" stat as you suggested and it solves the
immediate issue we have with VirtualBox/KVM.

What I don't understand is that we do not get the effective CR4 value
of the L2 guest in kvm_run.s.regs.sregs.cr4. Instead, userland sees the
contents of Vmcs::GUEST_CR4. Shouldn't this be the combination of
GUEST_CR4, GUEST_CR4_MASK and CR4_READ_SHADOW, i.e. what L2 actually
sees as CR4 value?

If this is expected, can you please explain the reasoning behind this
interface decision? For me, it does not make sense that writing back
the same value we receive at exit time causes a change in what L2 sees
for CR4.

Another question is: when we want to save the VM state during a
savevm/loadvm cycle, we kick all vCPUs via a singal and save their
state. If any vCPU runs in L2 at the time of the exit, we somehow need
to let it continue to run until we get an exit with the L1 state. Is
there a mechanism to help us here? 

> 
> > Indeed. What still does not make sense to me is that userspace
> > should be able
> > to modify the L2 state. From what I can see, even in this scenario,
> > L0 exits
> > with the L1 state.
> 
> No, KVM exits with L2.  Or at least, KVM is supposed to exit with L2
> state.  Exiting
> with L1 state would actually be quite difficult, as KVM would need to
> manually
> switch to vmcs01, pull out state, and then switch back to vmcs02. 
> And for GPRs,
> KVM doesn't have L1 state because most GPRs are "volatile", i.e. are
> clobbered by
> VM-Enter and thus need to be manually managed by the hypervisor.
> 
> I assume you're using kvm_run.kvm_valid_regs to instruct KVM to save
> vCPU state
> when exiting to userspace?  That path grabs state straight from the
> vCPU, without
> regard to is_guest_mode().  If you're seeing L1 state, then there's
> likely a bug
> somewhere, likely in userspace again.   While valid_regs
> kvm_run.kvm_valid_regs
> might not be heavily used, the underlying code is very heavily used
> for doing
> save/restore while L2 is active, e.g. for live migration.
> 
> > So what you are saying is that userspace should be allowed to
> > change L2 even
> > if it receives the architectural state from L1?
> 
> No, see above.
> 
> > What would be the use-case for this scenario?
> > 
> > If the above is true, I think we should document this explicitly
> > because it's not obvious, at least not for me ;)
> > 
> > How would you feel if we extend struct kvm_run with a
> > nested_guest_active flag? This should be fairly cheap and would
> > make
> > the integration into VirtualBox/KVM much easier. We could also only
> > sync this flag explicitly, e.g. with a KVM_SYNC_X86_NESTED_GUEST?
> 
> Heh, see above regarding KVM_RUN_X86_GUEST_MODE.
Sean Christopherson April 17, 2024, 4:11 p.m. UTC | #7
On Wed, Apr 17, 2024, Thomas Prescher wrote:
> On Tue, 2024-04-16 at 11:07 -0700, Sean Christopherson wrote:
> > Hur dur, I forgot that KVM provides a "guest_mode" stat.  Userspace can do
> > KVM_GET_STATS_FD on the vCPU FD to get a file handle to the binary stats,
> > and then you wouldn't need to call back into KVM just to query guest_mode.
> > 
> > Ah, and I also forgot that we have kvm_run.flags, so adding
> > KVM_RUN_X86_GUEST_MODE would also be trivial (I almost suggested it
> > earlier, but didn't want to add a new field to kvm_run without a very good
> > reason).
> 
> Thanks for the pointers. This is really helpful.
> 
> I tried the "guest_mode" stat as you suggested and it solves the
> immediate issue we have with VirtualBox/KVM.

Note, 

> What I don't understand is that we do not get the effective CR4 value
> of the L2 guest in kvm_run.s.regs.sregs.cr4.

Because what you're asking for is *not* the effective CR4 value of L2.

E.g. if L1 is using legacy shadowing paging to run L2, L1 is likely going to run
L2 with GUEST_CR0.PG=1, GUEST_CR4.PAE=1, and GUEST_CR4.PSE=0 (though PSE is largely
irrelevant), i.e. will either use PAE paging or 64-bit paging to shadow L2.

But L2 itself could be unpaged (CR0.PG=0, CR4.PAE=x, CR4.PSE=x), using 32-bit
paging (CR0.PG=1, CR4.PAE=0, CR4.PSE=0), or using 32-bit paging with 4MiB hugepages
(CR0.PG=1, CR4.PAE=0, CR4.PSE=1).  In all of those cases, the effective CR0 and CR4
values consumed by hardware are CR0.PG=1, CR4.PAE=1, and CR4.PSE.

Or to convolute things even further, if L0 is running L1 with shadowing paging,
and L1 is running L2 with shadow paging but doing something weird and using PSE
paging, then it would be possible to end up with:

  vmcs12->guest_cr4:
     .pae = 0
     .pse = 1

  vmcs12->cr4_read_shadow:
     .pae = 0
     .pse = 0

  vmcs02->guest_cr4:
     .pae = 1
     .pse = 0

> Instead, userland sees the contents of Vmcs::GUEST_CR4. Shouldn't this be the
> combination of GUEST_CR4, GUEST_CR4_MASK and CR4_READ_SHADOW, i.e. what L2
> actually sees as CR4 value?

Because KVM_{G,S}ET_SREGS (and all other uAPIs in that vein) are defined to operate
on actual vCPU state, and having them do something different if the vCPU is in guest
mode would confusing/odd, and nonsensical to differences between VMX and SVM.

SVM doesn't have per-bit CR0/CR4 controls, i.e. CR4 loads and stores need to be
intercepted, and so having KVM_{G,S}ET_SREGS operate on CR4_READ_SHADOW for VMX
would yield different ABI for VMX versus SVM.

Note, what L2 *sees* is not a combination of the above; what L2 sees is purely
CR4_READ_SHADOW.  The other fields are consulted only if L2 attempts to load CR4.

> If this is expected, can you please explain the reasoning behind this
> interface decision? For me, it does not make sense that writing back
> the same value we receive at exit time causes a change in what L2 sees
> for CR4.

I doubt there was ever a concious decision, rather it never came up and thus the
code is the result of doing nothing when nested VMX support was added.

That said, KVM's behavior is probably the least awful choice.  The changelog of
the proposed patch is wrong when it says:

  If the signal is meant to be delivered to the L0 VMM, and L0 updates CR4 for L1

because the update isn't for L1, it's for the active vCPU state, which is L2.

At first glance, skipping the vmcs02.CR4_READ_SHADOW seems to make sense, but it
would create a bizarre inconsistency as KVM_SET_SREGS would effectively override
vmcs12->guest_cr4, but not vmcs12->cr4_read_shadow.  KVM doesn't know the intent
of userspace, i.e. KVM can't know if userspace wants to change just the effective
value for CR4, or if userspace wants to change the effective *and* observable
value for CR4.

In your case, where writing CR4 is spurious, preserving the read shadow works,
but if there were some use case where userspace actually wanted to change L2's
CR4, leaving the read shadow set to vmcs12 would be wrong.

The whole situation is rather nonsensical, because if userspace actually did change
CR4, the changes would be lost on the next nested VM-Exit => VM-Entry.  That could
be solved by writing to vmcs12, but that creates a headache of its own because then
userspace changes to L2 become visible to L1, without userspace explicitly requesting
that.

Unfortunately, simply disallowing state save/restore when L2 is active doesn't
work either, because userspace needs to be able to save/restore state that _isn't_
context switched by hardware, i.e. isn't in the VMCS or VMCB.

In short, yes, it's goofy and annoying, but there's no great solution and the
issue really does need to be solved/avoided in userspace

> Another question is: when we want to save the VM state during a
> savevm/loadvm cycle, we kick all vCPUs via a singal and save their
> state. If any vCPU runs in L2 at the time of the exit, we somehow need
> to let it continue to run until we get an exit with the L1 state. Is
> there a mechanism to help us here? 

Hmm, no?  What is it you're trying to do, i.e. why are you doing save/load?  If
you really want to save/load _all_ state, the right thing to do is to also save
and load nested state.
Sean Christopherson April 17, 2024, 4:28 p.m. UTC | #8
On Wed, Apr 17, 2024, Sean Christopherson wrote:
> On Wed, Apr 17, 2024, Thomas Prescher wrote:
> > On Tue, 2024-04-16 at 11:07 -0700, Sean Christopherson wrote:
> > > Hur dur, I forgot that KVM provides a "guest_mode" stat.  Userspace can do
> > > KVM_GET_STATS_FD on the vCPU FD to get a file handle to the binary stats,
> > > and then you wouldn't need to call back into KVM just to query guest_mode.
> > > 
> > > Ah, and I also forgot that we have kvm_run.flags, so adding
> > > KVM_RUN_X86_GUEST_MODE would also be trivial (I almost suggested it
> > > earlier, but didn't want to add a new field to kvm_run without a very good
> > > reason).
> > 
> > Thanks for the pointers. This is really helpful.
> > 
> > I tried the "guest_mode" stat as you suggested and it solves the
> > immediate issue we have with VirtualBox/KVM.
> 
> Note, 

Gah, got distracted.  I was going to say that we should add KVM_RUN_X86_GUEST_MODE,
because stats aren't guaranteed ABI[*], i.e. relying on guest_mode could prove
problematic in the long run (though that's unlikely).

[*] https://lore.kernel.org/all/CABgObfZ4kqaXLaOAOj4aGB5GAe9GxOmJmOP+7kdke6OqA35HzA@mail.gmail.com
Thomas Prescher April 18, 2024, 1:46 p.m. UTC | #9
On Wed, 2024-04-17 at 09:11 -0700, Sean Christopherson wrote:
> On Wed, Apr 17, 2024, Thomas Prescher wrote:
> > On Tue, 2024-04-16 at 11:07 -0700, Sean Christopherson wrote:
> > > Hur dur, I forgot that KVM provides a "guest_mode" stat. 
> > > Userspace can do
> > > KVM_GET_STATS_FD on the vCPU FD to get a file handle to the
> > > binary stats,
> > > and then you wouldn't need to call back into KVM just to query
> > > guest_mode.
> > > 
> > > Ah, and I also forgot that we have kvm_run.flags, so adding
> > > KVM_RUN_X86_GUEST_MODE would also be trivial (I almost suggested
> > > it
> > > earlier, but didn't want to add a new field to kvm_run without a
> > > very good
> > > reason).
> > 
> > Thanks for the pointers. This is really helpful.
> > 
> > I tried the "guest_mode" stat as you suggested and it solves the
> > immediate issue we have with VirtualBox/KVM.
> 
> Note, 
> 
> > What I don't understand is that we do not get the effective CR4
> > value
> > of the L2 guest in kvm_run.s.regs.sregs.cr4.
> 
> Because what you're asking for is *not* the effective CR4 value of
> L2.
> 
> E.g. if L1 is using legacy shadowing paging to run L2, L1 is likely
> going to run
> L2 with GUEST_CR0.PG=1, GUEST_CR4.PAE=1, and GUEST_CR4.PSE=0 (though
> PSE is largely
> irrelevant), i.e. will either use PAE paging or 64-bit paging to
> shadow L2.
> 
> But L2 itself could be unpaged (CR0.PG=0, CR4.PAE=x, CR4.PSE=x),
> using 32-bit
> paging (CR0.PG=1, CR4.PAE=0, CR4.PSE=0), or using 32-bit paging with
> 4MiB hugepages
> (CR0.PG=1, CR4.PAE=0, CR4.PSE=1).  In all of those cases, the
> effective CR0 and CR4
> values consumed by hardware are CR0.PG=1, CR4.PAE=1, and CR4.PSE.
> 
> Or to convolute things even further, if L0 is running L1 with
> shadowing paging,
> and L1 is running L2 with shadow paging but doing something weird and
> using PSE
> paging, then it would be possible to end up with:
> 
>   vmcs12->guest_cr4:
>      .pae = 0
>      .pse = 1
> 
>   vmcs12->cr4_read_shadow:
>      .pae = 0
>      .pse = 0
> 
>   vmcs02->guest_cr4:
>      .pae = 1
>      .pse = 0
> 
> > Instead, userland sees the contents of Vmcs::GUEST_CR4. Shouldn't
> > this be the
> > combination of GUEST_CR4, GUEST_CR4_MASK and CR4_READ_SHADOW, i.e.
> > what L2
> > actually sees as CR4 value?
> 
> Because KVM_{G,S}ET_SREGS (and all other uAPIs in that vein) are
> defined to operate
> on actual vCPU state, and having them do something different if the
> vCPU is in guest
> mode would confusing/odd, and nonsensical to differences between VMX
> and SVM.
> 
> SVM doesn't have per-bit CR0/CR4 controls, i.e. CR4 loads and stores
> need to be
> intercepted, and so having KVM_{G,S}ET_SREGS operate on
> CR4_READ_SHADOW for VMX
> would yield different ABI for VMX versus SVM.
> 
> Note, what L2 *sees* is not a combination of the above; what L2 sees
> is purely
> CR4_READ_SHADOW.  The other fields are consulted only if L2 attempts
> to load CR4.
> 
> > If this is expected, can you please explain the reasoning behind
> > this
> > interface decision? For me, it does not make sense that writing
> > back
> > the same value we receive at exit time causes a change in what L2
> > sees
> > for CR4.
> 
> I doubt there was ever a concious decision, rather it never came up
> and thus the
> code is the result of doing nothing when nested VMX support was
> added.
> 
> That said, KVM's behavior is probably the least awful choice.  The
> changelog of
> the proposed patch is wrong when it says:
> 
>   If the signal is meant to be delivered to the L0 VMM, and L0
> updates CR4 for L1
> 
> because the update isn't for L1, it's for the active vCPU state,
> which is L2.
> 
> At first glance, skipping the vmcs02.CR4_READ_SHADOW seems to make
> sense, but it
> would create a bizarre inconsistency as KVM_SET_SREGS would
> effectively override
> vmcs12->guest_cr4, but not vmcs12->cr4_read_shadow.  KVM doesn't know
> the intent
> of userspace, i.e. KVM can't know if userspace wants to change just
> the effective
> value for CR4, or if userspace wants to change the effective *and*
> observable
> value for CR4.
> 
> In your case, where writing CR4 is spurious, preserving the read
> shadow works,
> but if there were some use case where userspace actually wanted to
> change L2's
> CR4, leaving the read shadow set to vmcs12 would be wrong.
> 
> The whole situation is rather nonsensical, because if userspace
> actually did change
> CR4, the changes would be lost on the next nested VM-Exit => VM-
> Entry.  That could
> be solved by writing to vmcs12, but that creates a headache of its
> own because then
> userspace changes to L2 become visible to L1, without userspace
> explicitly requesting
> that.
> 
> Unfortunately, simply disallowing state save/restore when L2 is
> active doesn't
> work either, because userspace needs to be able to save/restore state
> that _isn't_
> context switched by hardware, i.e. isn't in the VMCS or VMCB.
> 
> In short, yes, it's goofy and annoying, but there's no great solution
> and the
> issue really does need to be solved/avoided in userspace
> 
> > Another question is: when we want to save the VM state during a
> > savevm/loadvm cycle, we kick all vCPUs via a singal and save their
> > state. If any vCPU runs in L2 at the time of the exit, we somehow
> > need
> > to let it continue to run until we get an exit with the L1 state.
> > Is
> > there a mechanism to help us here? 
> 
> Hmm, no?  What is it you're trying to do, i.e. why are you doing
> save/load?  If
> you really want to save/load _all_ state, the right thing to do is to
> also save
> and load nested state.
> 

You are right. After your pointers and looking at the nesting code
again, I think I know what to do. Just to make sure I understand this
correctly: 

If L0 exits with L2 state, KVM_GET_NESTED_STATE will have
KVM_STATE_NESTED_RUN_PENDING set in the flags field. So when we restore
the vCPU state after a vmsave/vmload cycle, we don't need to update
anything in kvm_run.s.regs because KVM will enter the L2 immediately.
Is that correct?
Thomas Prescher April 18, 2024, 1:48 p.m. UTC | #10
On Wed, 2024-04-17 at 09:28 -0700, Sean Christopherson wrote:
> On Wed, Apr 17, 2024, Sean Christopherson wrote:
> > On Wed, Apr 17, 2024, Thomas Prescher wrote:
> > > On Tue, 2024-04-16 at 11:07 -0700, Sean Christopherson wrote:
> > > > Hur dur, I forgot that KVM provides a "guest_mode" stat. 
> > > > Userspace can do
> > > > KVM_GET_STATS_FD on the vCPU FD to get a file handle to the
> > > > binary stats,
> > > > and then you wouldn't need to call back into KVM just to query
> > > > guest_mode.
> > > > 
> > > > Ah, and I also forgot that we have kvm_run.flags, so adding
> > > > KVM_RUN_X86_GUEST_MODE would also be trivial (I almost
> > > > suggested it
> > > > earlier, but didn't want to add a new field to kvm_run without
> > > > a very good
> > > > reason).
> > > 
> > > Thanks for the pointers. This is really helpful.
> > > 
> > > I tried the "guest_mode" stat as you suggested and it solves the
> > > immediate issue we have with VirtualBox/KVM.
> > 
> > Note, 
> 
> Gah, got distracted.  I was going to say that we should add
> KVM_RUN_X86_GUEST_MODE,
> because stats aren't guaranteed ABI[*], i.e. relying on guest_mode
> could prove
> problematic in the long run (though that's unlikely).
> 
> [*]
> https://lore.kernel.org/all/CABgObfZ4kqaXLaOAOj4aGB5GAe9GxOmJmOP+7kdke6OqA35HzA@mail.gmail.com

Allright. I will propose a patch that sets the KVM_RUN_X86_GUEST_MODE
flag in the next couple of days. Do we also need a new capability for
this flag so userland can query whether this field is actually updated
by KVM?
Sean Christopherson April 18, 2024, 6:28 p.m. UTC | #11
On Thu, Apr 18, 2024, Thomas Prescher wrote:
> On Wed, 2024-04-17 at 09:28 -0700, Sean Christopherson wrote:
> > On Wed, Apr 17, 2024, Sean Christopherson wrote:
> > > On Wed, Apr 17, 2024, Thomas Prescher wrote:
> > > > On Tue, 2024-04-16 at 11:07 -0700, Sean Christopherson wrote:
> > > > > Hur dur, I forgot that KVM provides a "guest_mode" stat. 
> > > > > Userspace can do
> > > > > KVM_GET_STATS_FD on the vCPU FD to get a file handle to the
> > > > > binary stats,
> > > > > and then you wouldn't need to call back into KVM just to query
> > > > > guest_mode.
> > > > > 
> > > > > Ah, and I also forgot that we have kvm_run.flags, so adding
> > > > > KVM_RUN_X86_GUEST_MODE would also be trivial (I almost
> > > > > suggested it
> > > > > earlier, but didn't want to add a new field to kvm_run without
> > > > > a very good
> > > > > reason).
> > > > 
> > > > Thanks for the pointers. This is really helpful.
> > > > 
> > > > I tried the "guest_mode" stat as you suggested and it solves the
> > > > immediate issue we have with VirtualBox/KVM.
> > > 
> > > Note, 
> > 
> > Gah, got distracted.  I was going to say that we should add
> > KVM_RUN_X86_GUEST_MODE,
> > because stats aren't guaranteed ABI[*], i.e. relying on guest_mode
> > could prove
> > problematic in the long run (though that's unlikely).
> > 
> > [*]
> > https://lore.kernel.org/all/CABgObfZ4kqaXLaOAOj4aGB5GAe9GxOmJmOP+7kdke6OqA35HzA@mail.gmail.com
> 
> Allright. I will propose a patch that sets the KVM_RUN_X86_GUEST_MODE
> flag in the next couple of days. Do we also need a new capability for
> this flag so userland can query whether this field is actually updated
> by KVM?

Hmm, yeah, I don't see any way around that.
Sean Christopherson April 18, 2024, 6:36 p.m. UTC | #12
On Thu, Apr 18, 2024, Thomas Prescher wrote:
> You are right. After your pointers and looking at the nesting code
> again, I think I know what to do. Just to make sure I understand this
> correctly: 
> 
> If L0 exits with L2 state, KVM_GET_NESTED_STATE will have
> KVM_STATE_NESTED_RUN_PENDING set in the flags field.

Not necessarily.  KVM_STATE_NESTED_GUEST_MODE is the flag that says "L2 state is
loaded", the NESTED_RUN_PENDING flag is effectively a modifier on top of that.

KVM_STATE_NESTED_RUN_PENDING is set when userspace interrupts KVM in the middle
of nested VM-Enter emulation.  In that case, KVM needs to complete emulation of
the VM-Enter instruction (VMLAUNCH, VMRESUME, or VMRUN) before doing anything.
I.e. KVM has loaded L2 state and is committed to completing VM-Enter, but hasn't
actually done so yet.

In retrospect, KVM probably should have forced userspace to call back into KVM to
complete emulation before allowing KVM_GET_NESTED_STATE to succeed, but it's a
minor blip.

> So when we restore the vCPU state after a vmsave/vmload cycle, we don't need
> to update anything in kvm_run.s.regs because KVM will enter the L2
> immediately.  Is that correct?

No?  Presumably your touching vCPU state, otherwise you wouldn't be doing
vmsave/vmload.  And if you touch vCPU state, then you need to restore the old
state for things to work.

Again, what are you trying to do, at a higher level?  I.e. _why_ are you doing
a save/restore cycle?  If it's for something akin to live migration, where you
need to save and restore *everything*, then stating the obvious, you need to
save and restore everything in KVM too, which includes nested state.
diff mbox series

Patch

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6780313914f8..0d4af00245f3 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3474,7 +3474,11 @@  void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 			hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE);
 	}
 
-	vmcs_writel(CR4_READ_SHADOW, cr4);
+	if (is_guest_mode(vcpu))
+		vmcs_writel(CR4_READ_SHADOW, nested_read_cr4(get_vmcs12(vcpu)));
+	else
+		vmcs_writel(CR4_READ_SHADOW, cr4);
+
 	vmcs_writel(GUEST_CR4, hw_cr4);
 
 	if ((cr4 ^ old_cr4) & (X86_CR4_OSXSAVE | X86_CR4_PKE))