mbox series

[0/1] KVM: correctly restore the TSC value on nested migration

Message ID 20200917110723.820666-1-mlevitsk@redhat.com (mailing list archive)
Headers show
Series KVM: correctly restore the TSC value on nested migration | expand

Message

Maxim Levitsky Sept. 17, 2020, 11:07 a.m. UTC
This patch is a result of a long investigation I made to understand
why the nested migration more often than not makes the nested guest hang.
Sometimes the nested guest recovers and sometimes it hangs forever.

The root cause of this is that reading MSR_IA32_TSC while nested guest is
running returns its TSC value, that is (assuming no tsc scaling)
host tsc + L1 tsc offset + L2 tsc offset.

This is correct but it is a result of a nice curiosity of X86 VMX
(and apparently SVM too, according to my tests) implementation:

As a rule MSR reads done by the guest should either trap to host, or just
return host value, and therefore kvm_get_msr and friends, should basically
always return the L1 value of any msr.

Well, MSR_IA32_TSC is an exception. Intel's PRM states that when you disable
its interception, then in guest mode the host adds the TSC offset to
the read value.

I haven't found anything like that in AMD's PRM but according to the few
tests I made, it behaves the same.

However, there is no such exception when writing MSR_IA32_TSC, and this
poses a problem for nested migration.

When MSR_IA32_TSC is read, we read L2 value (smaller since L2 is started
after L1), and when we restore it after migration, the value is interpreted
as L1 value, thus resulting in huge TSC jump backward which the guest usually
really doesn't like, especially on AMD with APIC deadline timer, which
usually just doesn't fire afterward sending the guest into endless wait for it.

The proposed patch fixes this by making read of MSR_IA32_TSC depend on
'msr_info->host_initiated'

If guest reads the MSR, we add the TSC offset, but when host's qemu reads
the msr we skip that silly emulation of TSC offset, and return the real value
for the L1 guest which is host tsc + L1 offset.

This patch was tested on both SVM and VMX, and on both it fixes hangs.
On VMX since it uses VMX preemption timer for APIC deadline, the guest seems
to recover after a while without that patch.

To make sure that the nested migration happens I usually used
-overcommit cpu_pm=on but I reproduced this with just running an endless loop
in L2.

This is tested both with and without -invtsc,tsc-frequency=...

The migration was done by saving the migration stream to a file, and then
loading the qemu with '-incoming'

Maxim Levitsky (1):
  KVM: x86: fix MSR_IA32_TSC read for nested migration

 arch/x86/kvm/x86.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)