diff mbox series

[1/2] KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE

Message ID 20201130133559.233242-2-mlevitsk@redhat.com (mailing list archive)
State New, archived
Headers show
Series RFC: Precise TSC migration | expand

Commit Message

Maxim Levitsky Nov. 30, 2020, 1:35 p.m. UTC
These two new ioctls allow to more precisly capture and
restore guest's TSC state.

Both ioctls are meant to be used to accurately migrate guest TSC
even when there is a significant downtime during the migration.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 Documentation/virt/kvm/api.rst | 56 +++++++++++++++++++++++++++
 arch/x86/kvm/x86.c             | 69 ++++++++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h       | 14 +++++++
 3 files changed, 139 insertions(+)

Comments

Paolo Bonzini Nov. 30, 2020, 2:33 p.m. UTC | #1
On 30/11/20 14:35, Maxim Levitsky wrote:
> +		if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
> +			tsc_state.tsc_adjust = vcpu->arch.ia32_tsc_adjust_msr;
> +			tsc_state.flags |= KVM_TSC_STATE_TSC_ADJUST_VALID;
> +		}

This is mostly useful for userspace that doesn't disable the quirk, right?

> +		kvm_get_walltime(&wall_nsec, &host_tsc);
> +		diff = wall_nsec - tsc_state.nsec;
> +
> +		if (diff < 0 || tsc_state.nsec == 0)
> +			diff = 0;
> +

diff < 0 should be okay.  Also why the nsec==0 special case?  What about 
using a flag instead?

Paolo
Maxim Levitsky Nov. 30, 2020, 3:58 p.m. UTC | #2
On Mon, 2020-11-30 at 15:33 +0100, Paolo Bonzini wrote:
> On 30/11/20 14:35, Maxim Levitsky wrote:
> > +		if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
> > +			tsc_state.tsc_adjust = vcpu->arch.ia32_tsc_adjust_msr;
> > +			tsc_state.flags |= KVM_TSC_STATE_TSC_ADJUST_VALID;
> > +		}
> 
> This is mostly useful for userspace that doesn't disable the quirk, right?

Isn't this the opposite? If I understand the original proposal correctly,
the reason that we include the TSC_ADJUST in the new ioctl, is that
we would like to disable the special kvm behavior (that is disable the quirk),
which would mean that tsc will jump on regular host initiated TSC_ADJUST write.

To avoid this, userspace would set TSC_ADJUST through this new interface.

Note that I haven't yet disabled the quirk in the patches I posted to the qemu,
because we need some infrastructure to manage which quirks we want to disable
in qemu
(That is, KVM_ENABLE_CAP is as I understand write only, so I can't just disable
KVM_X86_QUIRK_TSC_HOST_ACCESS, in the code that enables x-precise-tsc in qemu).

> 
> > +		kvm_get_walltime(&wall_nsec, &host_tsc);
> > +		diff = wall_nsec - tsc_state.nsec;
> > +
> > +		if (diff < 0 || tsc_state.nsec == 0)
> > +			diff = 0;
> > +
> 
> diff < 0 should be okay.  Also why the nsec==0 special case?  What about 
> using a flag instead?

In theory diff < 0 should indeed be okay (though this would mean that target,
has unsynchronized clock or time travel happened).

However for example nsec_to_cycles takes unsigned number, and then
pvclock_scale_delta also takes unsigned number, and so on,
so I was thinking why bother with this case.

There is still (mostly?) theoretical issue, if on some vcpus 'diff' is positive 
and on some is negative
(this can happen if the migration was really fast, and target has the clock
   A. that is only slightly ahead of the source).
Do you think that this is an issue? If so I can make the code work with
signed numbers.

About nsec == 0, this is to allow to use this API for VM initialization.
(That is to call KVM_SET_TSC_PRECISE prior to doing KVM_GET_TSC_PRECISE)

This simplifies qemu code, and I don't think 
that this makes the API much worse.

Best regards,
	Maxim Levitsky

> 
> Paolo
>
Paolo Bonzini Nov. 30, 2020, 5:01 p.m. UTC | #3
On 30/11/20 16:58, Maxim Levitsky wrote:
>> This is mostly useful for userspace that doesn't disable the quirk, right?
> Isn't this the opposite? If I understand the original proposal correctly,
> the reason that we include the TSC_ADJUST in the new ioctl, is that
> we would like to disable the special kvm behavior (that is disable the quirk),
> which would mean that tsc will jump on regular host initiated TSC_ADJUST write.
> 
> To avoid this, userspace would set TSC_ADJUST through this new interface.

Yeah, that makes sense.  It removes the need to think "I have to set TSC 
adjust before TSC".

> Do you think that this is an issue? If so I can make the code work with
> signed numbers.

Not sure if it's an issue, but I prefer to make the API "less 
surprising" for userspace.  Who knows how it will be used.

> About nsec == 0, this is to allow to use this API for VM initialization.
> (That is to call KVM_SET_TSC_PRECISE prior to doing KVM_GET_TSC_PRECISE)

I prefer using flags for that purpose.

Paolo
Thomas Gleixner Dec. 1, 2020, 7:43 p.m. UTC | #4
On Mon, Nov 30 2020 at 15:35, Maxim Levitsky wrote:
> +  struct kvm_tsc_info {
> +	__u32 flags;
> +	__u64 nsec;
> +	__u64 tsc;
> +	__u64 tsc_adjust;
> +  };
> +
> +flags values for ``struct kvm_tsc_info``:
> +
> +``KVM_TSC_INFO_TSC_ADJUST_VALID``
> +
> +  ``tsc_adjust`` contains valid IA32_TSC_ADJUST value

Why exposing TSC_ADJUST at all? Just because?

Thanks,

        tglx
Maxim Levitsky Dec. 3, 2020, 11:11 a.m. UTC | #5
On Tue, 2020-12-01 at 20:43 +0100, Thomas Gleixner wrote:
> On Mon, Nov 30 2020 at 15:35, Maxim Levitsky wrote:
> > +  struct kvm_tsc_info {
> > +	__u32 flags;
> > +	__u64 nsec;
> > +	__u64 tsc;
> > +	__u64 tsc_adjust;
> > +  };
> > +
> > +flags values for ``struct kvm_tsc_info``:
> > +
> > +``KVM_TSC_INFO_TSC_ADJUST_VALID``
> > +
> > +  ``tsc_adjust`` contains valid IA32_TSC_ADJUST value
> 
> Why exposing TSC_ADJUST at all? Just because?

It's because we want to reduce the number of cases where
KVM's msr/read write behavior differs between guest and host 
(e.g qemu) writes.
 
TSC and TSC_ADJUST are tied on architectural level, such as
chang
ing one, changes the other.
 
However for the migration to work we must be able 
to set each one separately.

Currently, KVM does this by turning the host write to 
TSC_ADJUST into a special case that bypasses
the actual TSC adjustment, and just sets this MSR.
 
The next patch in this series, will allow to disable
this special behavior, making host TSC_ADJUST write
work the same way as in guest.

Therefore to still allow to set TSC_ADJUST and TSC independently
after migration this ioctl will be used instead.

Best regards,
	Maxim Levitsky

> 
> Thanks,
> 
>         tglx
>
diff mbox series

Patch

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 70254eaa5229f..2f04aa8ecf119 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -4826,6 +4826,62 @@  If a vCPU is in running state while this ioctl is invoked, the vCPU may
 experience inconsistent filtering behavior on MSR accesses.
 
 
+4.127 KVM_GET_TSC_STATE
+----------------------------
+
+:Capability: KVM_CAP_PRECISE_TSC
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_tsc_info
+:Returns: 0 on success, < 0 on error
+
+::
+
+  #define KVM_TSC_INFO_TSC_ADJUST_VALID 1
+  struct kvm_tsc_info {
+	__u32 flags;
+	__u64 nsec;
+	__u64 tsc;
+	__u64 tsc_adjust;
+  };
+
+flags values for ``struct kvm_tsc_info``:
+
+``KVM_TSC_INFO_TSC_ADJUST_VALID``
+
+  ``tsc_adjust`` contains valid IA32_TSC_ADJUST value
+
+This ioctl allows user space to read guest's IA32_TSC, IA32_TSC_ADJUST,
+and the current value of host CLOCK_REALTIME clock in nanoseconds since unix
+epoch.
+
+
+4.128 KVM_SET_TSC_STATE
+----------------------------
+
+:Capability: KVM_CAP_PRECISE_TSC
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_tsc_info
+:Returns: 0 on success, < 0 on error
+
+::
+
+This ioctl allows to reconstruct the guest's IA32_TSC and TSC_ADJUST value
+from the state obtained in the past by KVM_GET_TSC_STATE on the same vCPU.
+
+KVM will adjust the guest TSC value by the time that passed between
+CLOCK_REALTIME timestamp saved in the struct and current value of
+CLOCK_REALTIME, and set guest's TSC to the new value.
+
+TSC_ADJUST is restored as is if KVM_TSC_INFO_TSC_ADJUST_VALID is set.
+
+It is assumed that either both ioctls will be run on the same machine,
+or that source and destination machines have synchronized clocks.
+
+As a special case, it is allowed to leave the timestamp in the struct to zero,
+in which case it will be ignored and the TSC will be restored exactly.
+
 5. The kvm_run structure
 ========================
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a3fdc16cfd6f3..4f0ae9cb14b8a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2438,6 +2438,21 @@  static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
 
 	return gtod_is_based_on_tsc(do_realtime(ts, tsc_timestamp));
 }
+
+
+static void kvm_get_walltime(u64 *walltime_ns, u64 *host_tsc)
+{
+	struct timespec64 ts;
+
+	if (kvm_get_walltime_and_clockread(&ts, host_tsc)) {
+		*walltime_ns = timespec64_to_ns(&ts);
+		return;
+	}
+
+	*host_tsc = rdtsc();
+	*walltime_ns = ktime_get_real_ns();
+}
+
 #endif
 
 /*
@@ -3757,6 +3772,9 @@  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_X86_USER_SPACE_MSR:
 	case KVM_CAP_X86_MSR_FILTER:
 	case KVM_CAP_ENFORCE_PV_FEATURE_CPUID:
+#ifdef CONFIG_X86_64
+	case KVM_CAP_PRECISE_TSC:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_SYNC_REGS:
@@ -4999,6 +5017,57 @@  long kvm_arch_vcpu_ioctl(struct file *filp,
 	case KVM_GET_SUPPORTED_HV_CPUID:
 		r = kvm_ioctl_get_supported_hv_cpuid(vcpu, argp);
 		break;
+#ifdef CONFIG_X86_64
+	case KVM_GET_TSC_STATE: {
+		struct kvm_tsc_state __user *user_tsc_state = argp;
+		struct kvm_tsc_state tsc_state;
+		u64 host_tsc;
+
+		memset(&tsc_state, 0, sizeof(tsc_state));
+
+		kvm_get_walltime(&tsc_state.nsec, &host_tsc);
+		tsc_state.tsc = kvm_read_l1_tsc(vcpu, host_tsc);
+
+		if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
+			tsc_state.tsc_adjust = vcpu->arch.ia32_tsc_adjust_msr;
+			tsc_state.flags |= KVM_TSC_STATE_TSC_ADJUST_VALID;
+		}
+
+		r = -EFAULT;
+		if (copy_to_user(user_tsc_state, &tsc_state, sizeof(tsc_state)))
+			goto out;
+		r = 0;
+		break;
+	}
+	case KVM_SET_TSC_STATE: {
+		struct kvm_tsc_state __user *user_tsc_state = argp;
+		struct kvm_tsc_state tsc_state;
+
+		u64 host_tsc, wall_nsec;
+		s64 diff;
+		u64 new_guest_tsc, new_guest_tsc_offset;
+
+		r = -EFAULT;
+		if (copy_from_user(&tsc_state, user_tsc_state, sizeof(tsc_state)))
+			goto out;
+
+		kvm_get_walltime(&wall_nsec, &host_tsc);
+		diff = wall_nsec - tsc_state.nsec;
+
+		if (diff < 0 || tsc_state.nsec == 0)
+			diff = 0;
+
+		new_guest_tsc = tsc_state.tsc + nsec_to_cycles(vcpu, diff);
+		new_guest_tsc_offset = new_guest_tsc - kvm_scale_tsc(vcpu, host_tsc);
+		kvm_vcpu_write_tsc_offset(vcpu, new_guest_tsc_offset);
+
+		if (tsc_state.flags & KVM_TSC_STATE_TSC_ADJUST_VALID)
+			if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST))
+				vcpu->arch.ia32_tsc_adjust_msr = tsc_state.tsc_adjust;
+		r = 0;
+		break;
+	}
+#endif
 	default:
 		r = -EINVAL;
 	}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 886802b8ffba3..ee1bd5e7da964 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1056,6 +1056,7 @@  struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ENFORCE_PV_FEATURE_CPUID 190
 #define KVM_CAP_SYS_HYPERV_CPUID 191
 #define KVM_CAP_DIRTY_LOG_RING 192
+#define KVM_CAP_PRECISE_TSC 193
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1169,6 +1170,15 @@  struct kvm_clock_data {
 	__u32 pad[9];
 };
 
+
+#define KVM_TSC_STATE_TSC_ADJUST_VALID 1
+struct kvm_tsc_state {
+	__u32 flags;
+	__u64 nsec;
+	__u64 tsc;
+	__u64 tsc_adjust;
+};
+
 /* For KVM_CAP_SW_TLB */
 
 #define KVM_MMU_FSL_BOOKE_NOHV		0
@@ -1563,6 +1573,10 @@  struct kvm_pv_cmd {
 /* Available with KVM_CAP_DIRTY_LOG_RING */
 #define KVM_RESET_DIRTY_RINGS		_IO(KVMIO, 0xc7)
 
+/* Available with KVM_CAP_PRECISE_TSC*/
+#define KVM_SET_TSC_STATE          _IOW(KVMIO,  0xc8, struct kvm_tsc_state)
+#define KVM_GET_TSC_STATE          _IOR(KVMIO,  0xc9, struct kvm_tsc_state)
+
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
 	/* Guest initialization commands */