[3/3] Fix TSC MSR read in nested SVM

Message ID	201108021255.p72CtNjD002153@rice.haifa.ibm.com (mailing list archive)
State	New, archived
Headers	show Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by demeter1.kernel.org (8.14.4/8.14.4) with ESMTP id p72CsDto002176 for <patchwork-kvm@patchwork.kernel.org>; Tue, 2 Aug 2011 12:57:21 GMT Date: Tue, 2 Aug 2011 15:55:23 +0300 Message-Id: <201108021255.p72CtNjD002153@rice.haifa.ibm.com> Cc: "Roedel, Joerg" <Joerg.Roedel@amd.com>, Zachary Amsden <zamsden@gmail.com>, Bandan Das <bandan.das@stratus.com>, Marcelo Tosatti <mtosatti@redhat.com>, avi@redhat.com To: kvm@vger.kernel.org From: "Nadav Har'El" <nyh@il.ibm.com> References: <1312289591-nyh@il.ibm.com> Subject: [PATCH 3/3] Fix TSC MSR read in nested SVM Sender: kvm-owner@vger.kernel.org Precedence: bulk

Nadav Har'El Aug. 2, 2011, 12:55 p.m. UTC

When the TSC MSR is read by an L2 guest (when L1 allowed this MSR to be
read without exit), we need to return L2's notion of the TSC, not L1's.

The current code incorrectly returned L1 TSC, because svm_get_msr() was also
used in x86.c where this was assumed, but now that these places call the new
svm_read_l1_tsc(), the MSR read can be fixed.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/svm.c |    4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Zachary Amsden Aug. 3, 2011, 8:34 a.m. UTC | #1

Caution: this requires more care.

Pretty sure this breaks userspace suspend at the cost of supporting a
not-so-reasonable hardware feature.

On Tue, Aug 2, 2011 at 5:55 AM, Nadav Har'El <nyh@il.ibm.com> wrote:
> When the TSC MSR is read by an L2 guest (when L1 allowed this MSR to be
> read without exit), we need to return L2's notion of the TSC, not L1's.
>
> The current code incorrectly returned L1 TSC, because svm_get_msr() was also
> used in x86.c where this was assumed, but now that these places call the new
> svm_read_l1_tsc(), the MSR read can be fixed.
>
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>  arch/x86/kvm/svm.c |    4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> --- .before/arch/x86/kvm/svm.c  2011-08-02 15:51:02.000000000 +0300
> +++ .after/arch/x86/kvm/svm.c   2011-08-02 15:51:02.000000000 +0300
> @@ -2907,9 +2907,7 @@ static int svm_get_msr(struct kvm_vcpu *
>
>        switch (ecx) {
>        case MSR_IA32_TSC: {
> -               struct vmcb *vmcb = get_host_vmcb(svm);
> -
> -               *data = vmcb->control.tsc_offset +
> +               *data = svm->vmcb->control.tsc_offset +
>                        svm_scale_tsc(vcpu, native_read_tsc());
>
>                break;
>

This is correct.  Now you properly return the correct MSR value for
the TSC to the guest in all cases.

However, there is a BIG PROBLEM (yes, it is that bad).  Sorry I did
not think of it before.

When the guest gets suspended and frozen by userspace, to be restarted
later, what qemu is going to do is come along and read all of the MSRs
as part of the saved state.  One of those happens to be the TSC MSR.

Since you can't guarantee suspend will take place when only L1 is
running, you may have mixed L1/L2 TSC MSRs now being returned to
userspace.  Now, when you resume this guest, you will have mixed L1/L2
TSC values written into the MSRs.

Those will almost certainly fail to be matched by the TSC offset
matching code, and thus, with multiple VCPUs, you will end up with
slightly unsynchronized TSCs, and with that, all the problems
associated with unstable TSC come back to haunt you again.  Basically,
all bets for stability are off.

Apart from recording the L1 and L2 TSC through separate MSRs, there
isn't a good way to solve this.  Walking through all the steps again,
I'm pretty sure this is why I thought it would be better for L0
kvm_get_msr() to return the L1 values even if L2 was running.

In the end, it may not be worth the hassle to support this mode of
operation that to our current knowledge, is in fact, unused.  I would
much prefer to see a bad virtualization of a known insecure and unused
guest mode than to have a well used feature such as L1 guest suspend /
resume continue to cause clock de-synchronization.

Zach
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Nadav Har'El Aug. 3, 2011, 11:29 a.m. UTC | #2

Hi,

On Wed, Aug 03, 2011, Zachary Amsden wrote about "Re: [PATCH 3/3] Fix TSC MSR read in nested SVM":
> Pretty sure this breaks userspace suspend at the cost of supporting a
> not-so-reasonable hardware feature.
>...
> This is correct.  Now you properly return the correct MSR value for
> the TSC to the guest in all cases.
> 
> However, there is a BIG PROBLEM (yes, it is that bad).  Sorry I did
> not think of it before.
> 
> When the guest gets suspended and frozen by userspace, to be restarted
> later, what qemu is going to do is come along and read all of the MSRs
> as part of the saved state.  One of those happens to be the TSC MSR.

And just when I thought we were done with this bug :(

Does live migration (or suspend) actually work with nested SVM in the current
code? I certainly don't expect it to work correctly with nested VMX.

Also, I vaguely remember a discussion a while back on this mailing list about
the topic of live migration and nested virtualization, and one of the ideas
raised was that before we can migrate L1, we should force an exit from L2 to
L1, either really (with some real exit reason) or artificially (call the exit
emulation function directly). Then, during the migration we will be sure to
have all the L1 MSRs, in-memory structures, and so on, updated, and,
importantly, we will also be sure to have vmcs12 (the vmcs that L1 keeps for
L2) updated in L1's memory - so that we don't need to send even more internal
KVM state (like vmcs02) during the live migration.

> In the end, it may not be worth the hassle to support this mode of
> operation that to our current knowledge, is in fact, unused.  I would

I do agree that this doesn't sound like a useful mode of operation - but
I also don't like having deliberate mistakes in the code because they
have useful side-effects. I guess that if we can't figure out a way around
this new problem, what I can do is create a patch that:

  1. Always returns L1's TSC for the MSR (as in the original SVM code).

  2. Put a big comment above this function, about it being architecturaly
     *wrong*, but still useful (and explain why).

  3. Check for the case where users might expect the architecturally-correct
     version, not the current "wrong" version. I.e., check if L1 allows L2
     exit-less reads from TSC, using the MSR bitmap; If does, kill the guest,
     or find a way to prevent this setting.

Thanks,
Nadav.

Marcelo Tosatti Aug. 3, 2011, 9 p.m. UTC | #3

On Wed, Aug 03, 2011 at 01:34:58AM -0700, Zachary Amsden wrote:
> Caution: this requires more care.
> 
> Pretty sure this breaks userspace suspend at the cost of supporting a
> not-so-reasonable hardware feature.
> 
> On Tue, Aug 2, 2011 at 5:55 AM, Nadav Har'El <nyh@il.ibm.com> wrote:
> > When the TSC MSR is read by an L2 guest (when L1 allowed this MSR to be
> > read without exit), we need to return L2's notion of the TSC, not L1's.
> >
> > The current code incorrectly returned L1 TSC, because svm_get_msr() was also
> > used in x86.c where this was assumed, but now that these places call the new
> > svm_read_l1_tsc(), the MSR read can be fixed.
> >
> > Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> > ---
> >  arch/x86/kvm/svm.c |    4 +---
> >  1 file changed, 1 insertion(+), 3 deletions(-)
> >
> > --- .before/arch/x86/kvm/svm.c  2011-08-02 15:51:02.000000000 +0300
> > +++ .after/arch/x86/kvm/svm.c   2011-08-02 15:51:02.000000000 +0300
> > @@ -2907,9 +2907,7 @@ static int svm_get_msr(struct kvm_vcpu *
> >
> >        switch (ecx) {
> >        case MSR_IA32_TSC: {
> > -               struct vmcb *vmcb = get_host_vmcb(svm);
> > -
> > -               *data = vmcb->control.tsc_offset +
> > +               *data = svm->vmcb->control.tsc_offset +
> >                        svm_scale_tsc(vcpu, native_read_tsc());
> >
> >                break;
> >
> 
> 
> This is correct.  Now you properly return the correct MSR value for
> the TSC to the guest in all cases.
> 
> However, there is a BIG PROBLEM (yes, it is that bad).  Sorry I did
> not think of it before.
> 
> When the guest gets suspended and frozen by userspace, to be restarted
> later, what qemu is going to do is come along and read all of the MSRs
> as part of the saved state.  One of those happens to be the TSC MSR.
> 
> Since you can't guarantee suspend will take place when only L1 is
> running, you may have mixed L1/L2 TSC MSRs now being returned to
> userspace.  Now, when you resume this guest, you will have mixed L1/L2
> TSC values written into the MSRs.
> 
> Those will almost certainly fail to be matched by the TSC offset
> matching code, and thus, with multiple VCPUs, you will end up with
> slightly unsynchronized TSCs, and with that, all the problems
> associated with unstable TSC come back to haunt you again.  Basically,
> all bets for stability are off.

TSC synchronization is the least of your problems if you attempt to
save/restore state a guest while any vcpu is in L2 mode.

So to keep consistency between the remaining MSRs, i agree with Nadav's
patch. Apparently SVM's original patches to return L1 TSC were aimed at
fixing the problem VMX is facing now, which is fixed by introduction
read_l1_tsc helpers.

> Apart from recording the L1 and L2 TSC through separate MSRs, there
> isn't a good way to solve this.  Walking through all the steps again,
> I'm pretty sure this is why I thought it would be better for L0
> kvm_get_msr() to return the L1 values even if L2 was running.
> 
> In the end, it may not be worth the hassle to support this mode of
> operation that to our current knowledge, is in fact, unused.  I would
> much prefer to see a bad virtualization of a known insecure and unused
> guest mode than to have a well used feature such as L1 guest suspend /
> resume continue to cause clock de-synchronization.
> 
> Zach
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Marcelo Tosatti Aug. 5, 2011, 3:32 p.m. UTC | #4

On Wed, Aug 03, 2011 at 06:00:52PM -0300, Marcelo Tosatti wrote:
> On Wed, Aug 03, 2011 at 01:34:58AM -0700, Zachary Amsden wrote:
> > Caution: this requires more care.
> > 
> > Pretty sure this breaks userspace suspend at the cost of supporting a
> > not-so-reasonable hardware feature.
> > 
> > On Tue, Aug 2, 2011 at 5:55 AM, Nadav Har'El <nyh@il.ibm.com> wrote:
> > > When the TSC MSR is read by an L2 guest (when L1 allowed this MSR to be
> > > read without exit), we need to return L2's notion of the TSC, not L1's.
> > >
> > > The current code incorrectly returned L1 TSC, because svm_get_msr() was also
> > > used in x86.c where this was assumed, but now that these places call the new
> > > svm_read_l1_tsc(), the MSR read can be fixed.
> > >
> > > Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> > > ---
> > >  arch/x86/kvm/svm.c |    4 +---
> > >  1 file changed, 1 insertion(+), 3 deletions(-)
> > >
> > > --- .before/arch/x86/kvm/svm.c  2011-08-02 15:51:02.000000000 +0300
> > > +++ .after/arch/x86/kvm/svm.c   2011-08-02 15:51:02.000000000 +0300
> > > @@ -2907,9 +2907,7 @@ static int svm_get_msr(struct kvm_vcpu *
> > >
> > >        switch (ecx) {
> > >        case MSR_IA32_TSC: {
> > > -               struct vmcb *vmcb = get_host_vmcb(svm);
> > > -
> > > -               *data = vmcb->control.tsc_offset +
> > > +               *data = svm->vmcb->control.tsc_offset +
> > >                        svm_scale_tsc(vcpu, native_read_tsc());
> > >
> > >                break;
> > >
> > 
> > 
> > This is correct.  Now you properly return the correct MSR value for
> > the TSC to the guest in all cases.
> > 
> > However, there is a BIG PROBLEM (yes, it is that bad).  Sorry I did
> > not think of it before.
> > 
> > When the guest gets suspended and frozen by userspace, to be restarted
> > later, what qemu is going to do is come along and read all of the MSRs
> > as part of the saved state.  One of those happens to be the TSC MSR.
> > 
> > Since you can't guarantee suspend will take place when only L1 is
> > running, you may have mixed L1/L2 TSC MSRs now being returned to
> > userspace.  Now, when you resume this guest, you will have mixed L1/L2
> > TSC values written into the MSRs.
> > 
> > Those will almost certainly fail to be matched by the TSC offset
> > matching code, and thus, with multiple VCPUs, you will end up with
> > slightly unsynchronized TSCs, and with that, all the problems
> > associated with unstable TSC come back to haunt you again.  Basically,
> > all bets for stability are off.
> 
> TSC synchronization is the least of your problems if you attempt to
> save/restore state a guest while any vcpu is in L2 mode.
> 
> So to keep consistency between the remaining MSRs, i agree with Nadav's
> patch. Apparently SVM's original patches to return L1 TSC were aimed at
> fixing the problem VMX is facing now, which is fixed by introduction
> read_l1_tsc helpers.

Joerg, can you review and ack Nadav's SVM patch? TIA

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Joerg Roedel Aug. 6, 2011, 12:16 p.m. UTC | #5

On Fri, Aug 05, 2011 at 12:32:24PM -0300, Marcelo Tosatti wrote:
> On Wed, Aug 03, 2011 at 06:00:52PM -0300, Marcelo Tosatti wrote:
> > On Wed, Aug 03, 2011 at 01:34:58AM -0700, Zachary Amsden wrote:
> > > Caution: this requires more care.
> > > 
> > > Pretty sure this breaks userspace suspend at the cost of supporting a
> > > not-so-reasonable hardware feature.
> > > 
> > > On Tue, Aug 2, 2011 at 5:55 AM, Nadav Har'El <nyh@il.ibm.com> wrote:
> > > > When the TSC MSR is read by an L2 guest (when L1 allowed this MSR to be
> > > > read without exit), we need to return L2's notion of the TSC, not L1's.
> > > >
> > > > The current code incorrectly returned L1 TSC, because svm_get_msr() was also
> > > > used in x86.c where this was assumed, but now that these places call the new
> > > > svm_read_l1_tsc(), the MSR read can be fixed.
> > > >
> > > > Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> > > > ---
> > > >  arch/x86/kvm/svm.c |    4 +---
> > > >  1 file changed, 1 insertion(+), 3 deletions(-)
> > > >
> > > > --- .before/arch/x86/kvm/svm.c  2011-08-02 15:51:02.000000000 +0300
> > > > +++ .after/arch/x86/kvm/svm.c   2011-08-02 15:51:02.000000000 +0300
> > > > @@ -2907,9 +2907,7 @@ static int svm_get_msr(struct kvm_vcpu *
> > > >
> > > >        switch (ecx) {
> > > >        case MSR_IA32_TSC: {
> > > > -               struct vmcb *vmcb = get_host_vmcb(svm);
> > > > -
> > > > -               *data = vmcb->control.tsc_offset +
> > > > +               *data = svm->vmcb->control.tsc_offset +
> > > >                        svm_scale_tsc(vcpu, native_read_tsc());
> > > >
> > > >                break;
> > > >
> > > 
> > > 
> > > This is correct.  Now you properly return the correct MSR value for
> > > the TSC to the guest in all cases.
> > > 
> > > However, there is a BIG PROBLEM (yes, it is that bad).  Sorry I did
> > > not think of it before.
> > > 
> > > When the guest gets suspended and frozen by userspace, to be restarted
> > > later, what qemu is going to do is come along and read all of the MSRs
> > > as part of the saved state.  One of those happens to be the TSC MSR.
> > > 
> > > Since you can't guarantee suspend will take place when only L1 is
> > > running, you may have mixed L1/L2 TSC MSRs now being returned to
> > > userspace.  Now, when you resume this guest, you will have mixed L1/L2
> > > TSC values written into the MSRs.
> > > 
> > > Those will almost certainly fail to be matched by the TSC offset
> > > matching code, and thus, with multiple VCPUs, you will end up with
> > > slightly unsynchronized TSCs, and with that, all the problems
> > > associated with unstable TSC come back to haunt you again.  Basically,
> > > all bets for stability are off.
> > 
> > TSC synchronization is the least of your problems if you attempt to
> > save/restore state a guest while any vcpu is in L2 mode.
> > 
> > So to keep consistency between the remaining MSRs, i agree with Nadav's
> > patch. Apparently SVM's original patches to return L1 TSC were aimed at
> > fixing the problem VMX is facing now, which is fixed by introduction
> > read_l1_tsc helpers.
> 
> Joerg, can you review and ack Nadav's SVM patch? TIA

Yes, sorry for the delay. I'll give it a review and test today.

	Joerg

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Joerg Roedel Aug. 7, 2011, 12:04 p.m. UTC | #6

On Fri, Aug 05, 2011 at 12:32:24PM -0300, Marcelo Tosatti wrote:
> > So to keep consistency between the remaining MSRs, i agree with Nadav's
> > patch. Apparently SVM's original patches to return L1 TSC were aimed at
> > fixing the problem VMX is facing now, which is fixed by introduction
> > read_l1_tsc helpers.
> 
> Joerg, can you review and ack Nadav's SVM patch? TIA

Tested-by: Joerg Roedel <joerg.roedel@amd.com>
Acked-by: Joerg Roedel <joerg.roedel@amd.com>

Reviewed and tested (didn't apply cleanly, but was easy to fix that up).
Looks all good so far.

Regards,

	Joerg

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Nadav Har'El Aug. 10, 2011, 12:35 p.m. UTC | #7

On Sun, Aug 07, 2011, Joerg Roedel wrote about "Re: [PATCH 3/3] Fix TSC MSR read in nested SVM":
> > Joerg, can you review and ack Nadav's SVM patch? TIA
> 
> Tested-by: Joerg Roedel <joerg.roedel@amd.com>
> Acked-by: Joerg Roedel <joerg.roedel@amd.com>
> 
> Reviewed and tested (didn't apply cleanly, but was easy to fix that up).
> Looks all good so far.

Hi guys, I'm a bit confused how we want to proceed from here.

The patches which I sent appear to fix the original bug (as confirmed by
the two people who reported it), but Zachary warned that it would break
migration of nested SVM while L2 is running. I asked whether migration
works at all while L2 is running (without exiting to L1 first) - and
Marcelo suggested that it doesn't.

If the problem Zachary pointed to is considered serious, I proposed a second
option - to leave the code to *wrongly* return the L1 TSC MSR always, and
check (and warn) in situations where this value is wrongly returned to the
guest, but this will leave qemu to always read the TSC MSR from L1, even when
L2 is running. While I proposed this as a second option, I don't think I
can recommend it.

So what's the verdict? :-)

Thanks,
Nadav.

Joerg Roedel Aug. 10, 2011, 1:02 p.m. UTC | #8

On Wed, Aug 10, 2011 at 03:35:28PM +0300, Nadav Har'El wrote:
> On Sun, Aug 07, 2011, Joerg Roedel wrote about "Re: [PATCH 3/3] Fix TSC MSR read in nested SVM":
> > > Joerg, can you review and ack Nadav's SVM patch? TIA
> > 
> > Tested-by: Joerg Roedel <joerg.roedel@amd.com>
> > Acked-by: Joerg Roedel <joerg.roedel@amd.com>
> > 
> > Reviewed and tested (didn't apply cleanly, but was easy to fix that up).
> > Looks all good so far.
> 
> Hi guys, I'm a bit confused how we want to proceed from here.
> 
> The patches which I sent appear to fix the original bug (as confirmed by
> the two people who reported it), but Zachary warned that it would break
> migration of nested SVM while L2 is running. I asked whether migration
> works at all while L2 is running (without exiting to L1 first) - and
> Marcelo suggested that it doesn't.

Migration doesn't work today when L2 is running, so don't worry about it
for now.

Regards,

	Joerg

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[3/3] Fix TSC MSR read in nested SVM

Commit Message

Comments

Patch