diff mbox series

[v2,1/3] KVM: x86: implement KVM_{GET|SET}_TSC_STATE

Message ID 20201203171118.372391-2-mlevitsk@redhat.com (mailing list archive)
State New, archived
Headers show
Series RFC: Precise TSC migration | expand

Commit Message

Maxim Levitsky Dec. 3, 2020, 5:11 p.m. UTC
These two new ioctls allow more precise capture and
restoration of the guest's TSC state.

Both ioctls are meant to be used to accurately migrate guest TSC
even when there is a significant downtime during the migration.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 Documentation/virt/kvm/api.rst | 65 ++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c             | 73 ++++++++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h       | 15 +++++++
 3 files changed, 153 insertions(+)
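
An illustrative userspace sketch (not part of the patch; error handling and
the transport of the saved state to the destination are omitted): how a VMM
could drive the two proposed ioctls around a migration. It assumes a kernel
with this series applied; the vcpu_fd handles are placeholders.

  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Source, per vCPU: capture TSC, TSC_ADJUST and a CLOCK_REALTIME stamp. */
  static int save_tsc_state(int vcpu_fd, struct kvm_tsc_state *state)
  {
          return ioctl(vcpu_fd, KVM_GET_TSC_STATE, state);
  }

  /*
   * Destination, per vCPU: with KVM_TSC_STATE_TIMESTAMP_VALID set in flags,
   * KVM advances the saved TSC value by the CLOCK_REALTIME time that has
   * elapsed since the state was captured.
   */
  static int restore_tsc_state(int vcpu_fd, const struct kvm_tsc_state *state)
  {
          return ioctl(vcpu_fd, KVM_SET_TSC_STATE, state);
  }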

Comments

Thomas Gleixner Dec. 6, 2020, 4:19 p.m. UTC | #1
On Thu, Dec 03 2020 at 19:11, Maxim Levitsky wrote:
> +	case KVM_SET_TSC_STATE: {
> +		struct kvm_tsc_state __user *user_tsc_state = argp;
> +		struct kvm_tsc_state tsc_state;
> +		u64 host_tsc, wall_nsec;
> +
> +		u64 new_guest_tsc, new_guest_tsc_offset;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&tsc_state, user_tsc_state, sizeof(tsc_state)))
> +			goto out;
> +
> +		kvm_get_walltime(&wall_nsec, &host_tsc);
> +		new_guest_tsc = tsc_state.tsc;
> +
> +		if (tsc_state.flags & KVM_TSC_STATE_TIMESTAMP_VALID) {
> +			s64 diff = wall_nsec - tsc_state.nsec;
> +			if (diff >= 0)
> +				new_guest_tsc += nsec_to_cycles(vcpu, diff);
> +			else
> +				new_guest_tsc -= nsec_to_cycles(vcpu, -diff);
> +		}
> +
> +		new_guest_tsc_offset = new_guest_tsc - kvm_scale_tsc(vcpu, host_tsc);
> +		kvm_vcpu_write_tsc_offset(vcpu, new_guest_tsc_offset);

From a timekeeping POV and the guests expectation of TSC this is
fundamentally wrong:

      tscguest = scaled(hosttsc) + offset

The TSC has to be viewed systemwide and not per CPU. It's systemwide
used for timekeeping and for that to work it has to be synchronized. 

Why would this be different on virt? Just because it's virt or what? 

Migration is a guest wide thing and you're not migrating single vCPUs.

This hackery just papers over the underlying design fail that KVM looks
at the TSC per vCPU which is the root cause and that needs to be fixed.

Thanks,

        tglx
Maxim Levitsky Dec. 7, 2020, 12:16 p.m. UTC | #2
On Sun, 2020-12-06 at 17:19 +0100, Thomas Gleixner wrote:
> On Thu, Dec 03 2020 at 19:11, Maxim Levitsky wrote:
> > +	case KVM_SET_TSC_STATE: {
> > +		struct kvm_tsc_state __user *user_tsc_state = argp;
> > +		struct kvm_tsc_state tsc_state;
> > +		u64 host_tsc, wall_nsec;
> > +
> > +		u64 new_guest_tsc, new_guest_tsc_offset;
> > +
> > +		r = -EFAULT;
> > +		if (copy_from_user(&tsc_state, user_tsc_state, sizeof(tsc_state)))
> > +			goto out;
> > +
> > +		kvm_get_walltime(&wall_nsec, &host_tsc);
> > +		new_guest_tsc = tsc_state.tsc;
> > +
> > +		if (tsc_state.flags & KVM_TSC_STATE_TIMESTAMP_VALID) {
> > +			s64 diff = wall_nsec - tsc_state.nsec;
> > +			if (diff >= 0)
> > +				new_guest_tsc += nsec_to_cycles(vcpu, diff);
> > +			else
> > +				new_guest_tsc -= nsec_to_cycles(vcpu, -diff);
> > +		}
> > +
> > +		new_guest_tsc_offset = new_guest_tsc - kvm_scale_tsc(vcpu, host_tsc);
> > +		kvm_vcpu_write_tsc_offset(vcpu, new_guest_tsc_offset);
> 
> From a timekeeping POV and the guests expectation of TSC this is
> fundamentally wrong:
> 
>       tscguest = scaled(hosttsc) + offset
> 
> The TSC has to be viewed systemwide and not per CPU. It's systemwide
> used for timekeeping and for that to work it has to be synchronized. 
> 
> Why would this be different on virt? Just because it's virt or what? 
> 
> Migration is a guest wide thing and you're not migrating single vCPUs.
> 
> This hackery just papers over he underlying design fail that KVM looks
> at the TSC per vCPU which is the root cause and that needs to be fixed.

I don't disagree with you.
As far as I know the main reasons that kvm tracks TSC per vCPU are

1. cases when the host tsc is not stable
(hopefully rare now, and I don't mind making
the new API just refuse to work when this is detected, and revert to the old
way of doing things).

2. (theoretical) ability of the guest to introduce a per-core tsc offset
by either using TSC_ADJUST (for which I recently got the idea to stop
advertising this feature to the guest), or writing the TSC directly, which
is allowed by Intel's PRM:

"The RDMSR and WRMSR instructions read and write the time-stamp counter, treating the time-stamp counter as an
ordinary MSR (address 10H). In the Pentium 4, Intel Xeon, and P6 family processors, all 64-bits of the time-stamp
counter are read using RDMSR (just as with RDTSC). When WRMSR is used to write the time-stamp counter on
processors before family [0FH], models [03H, 04H]: only the low-order 32-bits of the time-stamp counter can be
written (the high-order 32 bits are cleared to 0). For family [0FH], models [03H, 04H, 06H]; for family [06H],
model [0EH, 0FH]; for family [06H], DisplayModel [17H, 1AH, 1CH, 1DH]: all 64 bits are writable."


But other than that I don't mind making the TSC offset a global per-VM thing.
Paolo, what do you think about this?

Best regards,
	Maxim Levitsky


> 
> Thanks,
> 
>         tglx
>
Vitaly Kuznetsov Dec. 7, 2020, 1:16 p.m. UTC | #3
Maxim Levitsky <mlevitsk@redhat.com> writes:

>
> But other than that I don't mind making TSC offset global per VM thing.
> Paulo, what do you think about this?
>

Not Paolo here, but personally I'd very much prefer we go this route;
unsynchronized TSCs are, unfortunately, still a thing: I observed them
on an AMD Epyc server just a couple of years ago (cured with a firmware
update). We try to catch such situations in KVM instead of blowing up, but
this may still result in subtle bugs, I believe. Maybe we would be better
off killing all VMs in case the TSC ever gets unsynced (by default).

Another thing in this bucket is kvmclock, which is currently per-cpu. If
we forbid the TSC to un-synchronize (he-he), there is no point in doing
that. We could as well use e.g. the Hyper-V TSC page method, which is
per-VM. Creating another PV clock in KVM may be a hard sell as all
modern x86 CPUs support TSC scaling (in addition to TSC offsetting, which
has been there for a long time), and when it's there we don't really need a PV
clock to make migration possible.
Thomas Gleixner Dec. 7, 2020, 4:38 p.m. UTC | #4
On Mon, Dec 07 2020 at 14:16, Maxim Levitsky wrote:
> On Sun, 2020-12-06 at 17:19 +0100, Thomas Gleixner wrote:
>> From a timekeeping POV and the guests expectation of TSC this is
>> fundamentally wrong:
>> 
>>       tscguest = scaled(hosttsc) + offset
>> 
>> The TSC has to be viewed systemwide and not per CPU. It's systemwide
>> used for timekeeping and for that to work it has to be synchronized. 
>> 
>> Why would this be different on virt? Just because it's virt or what? 
>> 
>> Migration is a guest wide thing and you're not migrating single vCPUs.
>> 
>> This hackery just papers over he underlying design fail that KVM looks
>> at the TSC per vCPU which is the root cause and that needs to be fixed.
>
> I don't disagree with you.
> As far as I know the main reasons that kvm tracks TSC per guest are
>
> 1. cases when host tsc is not stable 
> (hopefully rare now, and I don't mind making
> the new API just refuse to work when this is detected, and revert to old way
> of doing things).

That's a trainwreck to begin with and I really would just not support it
for anything new which aims to be more precise and correct.  TSC has
become pretty reliable over the years.

> 2. (theoretical) ability of the guest to introduce per core tsc offfset
> by either using TSC_ADJUST (for which I got recently an idea to stop
> advertising this feature to the guest), or writing TSC directly which
> is allowed by Intel's PRM:

For anything halfways modern the write to TSC is reflected in TSC_ADJUST
which means you get the precise offset.

The general principle still applies from a system POV.

     TSC base (systemwide view) - The sane case

     TSC CPU  = TSC base + TSC_ADJUST

The guest TSC base is a per guest constant offset to the host TSC.

     TSC guest base = TSC host base + guest base offset

If the guest wants this different per vCPU by writing to the MSR or to
TSC_ADJUST then you still can have a per vCPU offset in TSC_ADJUST which
is the offset to the TSC base of the guest.

    TSC guest CPU = TSC guest base + CPU TSC_ADJUST

==>

    TSC guest CPU = TSC host base + guest base offset + CPU TSC_ADJUST

The normal and sane case is just TSC_ADJUST == 0.

It's very cleanly decomposable.
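
A tiny sketch of the decomposition above; the names are illustrative only and
none of them exist in KVM.

  /* Kernel-style u64 used for illustration. */
  static inline u64 guest_cpu_tsc(u64 host_tsc_base, u64 guest_base_offset,
                                  u64 vcpu_tsc_adjust)
  {
          /* TSC guest CPU = TSC host base + guest base offset + CPU TSC_ADJUST */
          return host_tsc_base + guest_base_offset + vcpu_tsc_adjust;
  }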

Thanks,

        tglx
Andy Lutomirski Dec. 7, 2020, 4:53 p.m. UTC | #5
> On Dec 7, 2020, at 8:38 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Mon, Dec 07 2020 at 14:16, Maxim Levitsky wrote:
>>> On Sun, 2020-12-06 at 17:19 +0100, Thomas Gleixner wrote:
>>> From a timekeeping POV and the guests expectation of TSC this is
>>> fundamentally wrong:
>>> 
>>>      tscguest = scaled(hosttsc) + offset
>>> 
>>> The TSC has to be viewed systemwide and not per CPU. It's systemwide
>>> used for timekeeping and for that to work it has to be synchronized. 
>>> 
>>> Why would this be different on virt? Just because it's virt or what? 
>>> 
>>> Migration is a guest wide thing and you're not migrating single vCPUs.
>>> 
>>> This hackery just papers over he underlying design fail that KVM looks
>>> at the TSC per vCPU which is the root cause and that needs to be fixed.
>> 
>> I don't disagree with you.
>> As far as I know the main reasons that kvm tracks TSC per guest are
>> 
>> 1. cases when host tsc is not stable 
>> (hopefully rare now, and I don't mind making
>> the new API just refuse to work when this is detected, and revert to old way
>> of doing things).
> 
> That's a trainwreck to begin with and I really would just not support it
> for anything new which aims to be more precise and correct.  TSC has
> become pretty reliable over the years.
> 
>> 2. (theoretical) ability of the guest to introduce per core tsc offfset
>> by either using TSC_ADJUST (for which I got recently an idea to stop
>> advertising this feature to the guest), or writing TSC directly which
>> is allowed by Intel's PRM:
> 
> For anything halfways modern the write to TSC is reflected in TSC_ADJUST
> which means you get the precise offset.
> 
> The general principle still applies from a system POV.
> 
>     TSC base (systemwide view) - The sane case
> 
>     TSC CPU  = TSC base + TSC_ADJUST
> 
> The guest TSC base is a per guest constant offset to the host TSC.
> 
>     TSC guest base = TSC host base + guest base offset
> 
> If the guest want's this different per vCPU by writing to the MSR or to
> TSC_ADJUST then you still can have a per vCPU offset in TSC_ADJUST which
> is the offset to the TSC base of the guest.


How about, if the guest wants to write TSC_ADJUST, it can turn off all paravirt features and keep both pieces?
Maxim Levitsky Dec. 7, 2020, 5 p.m. UTC | #6
On Mon, 2020-12-07 at 08:53 -0800, Andy Lutomirski wrote:
> > On Dec 7, 2020, at 8:38 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > 
> > On Mon, Dec 07 2020 at 14:16, Maxim Levitsky wrote:
> > > > On Sun, 2020-12-06 at 17:19 +0100, Thomas Gleixner wrote:
> > > > From a timekeeping POV and the guests expectation of TSC this is
> > > > fundamentally wrong:
> > > > 
> > > >      tscguest = scaled(hosttsc) + offset
> > > > 
> > > > The TSC has to be viewed systemwide and not per CPU. It's systemwide
> > > > used for timekeeping and for that to work it has to be synchronized. 
> > > > 
> > > > Why would this be different on virt? Just because it's virt or what? 
> > > > 
> > > > Migration is a guest wide thing and you're not migrating single vCPUs.
> > > > 
> > > > This hackery just papers over he underlying design fail that KVM looks
> > > > at the TSC per vCPU which is the root cause and that needs to be fixed.
> > > 
> > > I don't disagree with you.
> > > As far as I know the main reasons that kvm tracks TSC per guest are
> > > 
> > > 1. cases when host tsc is not stable 
> > > (hopefully rare now, and I don't mind making
> > > the new API just refuse to work when this is detected, and revert to old way
> > > of doing things).
> > 
> > That's a trainwreck to begin with and I really would just not support it
> > for anything new which aims to be more precise and correct.  TSC has
> > become pretty reliable over the years.
> > 
> > > 2. (theoretical) ability of the guest to introduce per core tsc offfset
> > > by either using TSC_ADJUST (for which I got recently an idea to stop
> > > advertising this feature to the guest), or writing TSC directly which
> > > is allowed by Intel's PRM:
> > 
> > For anything halfways modern the write to TSC is reflected in TSC_ADJUST
> > which means you get the precise offset.
> > 
> > The general principle still applies from a system POV.
> > 
> >     TSC base (systemwide view) - The sane case
> > 
> >     TSC CPU  = TSC base + TSC_ADJUST
> > 
> > The guest TSC base is a per guest constant offset to the host TSC.
> > 
> >     TSC guest base = TSC host base + guest base offset
> > 
> > If the guest want's this different per vCPU by writing to the MSR or to
> > TSC_ADJUST then you still can have a per vCPU offset in TSC_ADJUST which
> > is the offset to the TSC base of the guest.
> 
> How about, if the guest wants to write TSC_ADJUST, it can turn off all paravirt features and keep both pieces?
> 

This is one of the things I had in mind recently.

Even better, we can stop advertising TSC_ADJUST in CPUID to the guest 
and forbid it from writing it at all.

And if the guest insists and writes to the TSC itself, 
then indeed let it keep both pieces (or invoke some fallback code).
 
I have nothing against this solution.


Best regards,
	Maxim Levitsky
Oliver Upton Dec. 7, 2020, 5:29 p.m. UTC | #7
On Thu, Dec 3, 2020 at 11:12 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
>
> These two new ioctls allow to more precisly capture and
> restore guest's TSC state.
>
> Both ioctls are meant to be used to accurately migrate guest TSC
> even when there is a significant downtime during the migration.
>
> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>  Documentation/virt/kvm/api.rst | 65 ++++++++++++++++++++++++++++++
>  arch/x86/kvm/x86.c             | 73 ++++++++++++++++++++++++++++++++++
>  include/uapi/linux/kvm.h       | 15 +++++++
>  3 files changed, 153 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 70254eaa5229f..ebecfe4b414ce 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -4826,6 +4826,71 @@ If a vCPU is in running state while this ioctl is invoked, the vCPU may
>  experience inconsistent filtering behavior on MSR accesses.
>
>
> +4.127 KVM_GET_TSC_STATE
> +----------------------------
> +
> +:Capability: KVM_CAP_PRECISE_TSC
> +:Architectures: x86
> +:Type: vcpu ioctl
> +:Parameters: struct kvm_tsc_state
> +:Returns: 0 on success, < 0 on error
> +
> +::
> +
> +  #define KVM_TSC_STATE_TIMESTAMP_VALID 1
> +  #define KVM_TSC_STATE_TSC_ADJUST_VALID 2
> +  struct kvm_tsc_state {
> +       __u32 flags;
> +       __u64 nsec;
> +       __u64 tsc;
> +       __u64 tsc_adjust;
> +  };
> +
> +flags values for ``struct kvm_tsc_state``:
> +
> +``KVM_TSC_STATE_TIMESTAMP_VALID``
> +
> +  ``nsec`` contains nanoseconds from unix epoch.
> +    Always set by KVM_GET_TSC_STATE, might be omitted in KVM_SET_TSC_STATE
> +
> +``KVM_TSC_STATE_TSC_ADJUST_VALID``
> +
> +  ``tsc_adjust`` contains valid IA32_TSC_ADJUST value
> +
> +
> +This ioctl allows the user space to read the guest's IA32_TSC,IA32_TSC_ADJUST,
> +and the current value of host's CLOCK_REALTIME clock in nanoseconds since unix
> +epoch.
> +
> +
> +4.128 KVM_SET_TSC_STATE
> +----------------------------
> +
> +:Capability: KVM_CAP_PRECISE_TSC
> +:Architectures: x86
> +:Type: vcpu ioctl
> +:Parameters: struct kvm_tsc_state
> +:Returns: 0 on success, < 0 on error
> +
> +::
> +
> +This ioctl allows to reconstruct the guest's IA32_TSC and TSC_ADJUST value
> +from the state obtained in the past by KVM_GET_TSC_STATE on the same vCPU.
> +
> +If 'KVM_TSC_STATE_TIMESTAMP_VALID' is set in flags,
> +KVM will adjust the guest TSC value by the time that passed since the moment
> +CLOCK_REALTIME timestamp was saved in the struct and current value of
> +CLOCK_REALTIME, and set the guest's TSC to the new value.
> +
> +Otherwise KVM will set the guest TSC value to the exact value as given
> +in the struct.
> +
> +if KVM_TSC_STATE_TSC_ADJUST_VALID is set, and guest supports IA32_MSR_TSC_ADJUST,
> +then its value will be set to the given value from the struct.
> +
> +It is assumed that either both ioctls will be run on the same machine,
> +or that source and destination machines have synchronized clocks.
> +
>  5. The kvm_run structure
>  ========================
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a3fdc16cfd6f3..9b8a2fe3a2398 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2438,6 +2438,21 @@ static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
>
>         return gtod_is_based_on_tsc(do_realtime(ts, tsc_timestamp));
>  }
> +
> +
> +static void kvm_get_walltime(u64 *walltime_ns, u64 *host_tsc)
> +{
> +       struct timespec64 ts;
> +
> +       if (kvm_get_walltime_and_clockread(&ts, host_tsc)) {
> +               *walltime_ns = timespec64_to_ns(&ts);
> +               return;
> +       }
> +
> +       *host_tsc = rdtsc();
> +       *walltime_ns = ktime_get_real_ns();
> +}
> +
>  #endif
>
>  /*
> @@ -3757,6 +3772,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>         case KVM_CAP_X86_USER_SPACE_MSR:
>         case KVM_CAP_X86_MSR_FILTER:
>         case KVM_CAP_ENFORCE_PV_FEATURE_CPUID:
> +#ifdef CONFIG_X86_64
> +       case KVM_CAP_PRECISE_TSC:
> +#endif
>                 r = 1;
>                 break;
>         case KVM_CAP_SYNC_REGS:
> @@ -4999,6 +5017,61 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
>         case KVM_GET_SUPPORTED_HV_CPUID:
>                 r = kvm_ioctl_get_supported_hv_cpuid(vcpu, argp);
>                 break;
> +#ifdef CONFIG_X86_64
> +       case KVM_GET_TSC_STATE: {
> +               struct kvm_tsc_state __user *user_tsc_state = argp;
> +               u64 host_tsc;
> +
> +               struct kvm_tsc_state tsc_state = {
> +                       .flags = KVM_TSC_STATE_TIMESTAMP_VALID
> +               };
> +
> +               kvm_get_walltime(&tsc_state.nsec, &host_tsc);
> +               tsc_state.tsc = kvm_read_l1_tsc(vcpu, host_tsc);
> +
> +               if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
> +                       tsc_state.tsc_adjust = vcpu->arch.ia32_tsc_adjust_msr;
> +                       tsc_state.flags |= KVM_TSC_STATE_TSC_ADJUST_VALID;
> +               }

[...]

> +
> +               r = -EFAULT;
> +               if (copy_to_user(user_tsc_state, &tsc_state, sizeof(tsc_state)))
> +                       goto out;
> +               r = 0;
> +               break;
> +       }
> +       case KVM_SET_TSC_STATE: {
> +               struct kvm_tsc_state __user *user_tsc_state = argp;
> +               struct kvm_tsc_state tsc_state;
> +               u64 host_tsc, wall_nsec;
> +
> +               u64 new_guest_tsc, new_guest_tsc_offset;
> +
> +               r = -EFAULT;
> +               if (copy_from_user(&tsc_state, user_tsc_state, sizeof(tsc_state)))
> +                       goto out;
> +
> +               kvm_get_walltime(&wall_nsec, &host_tsc);
> +               new_guest_tsc = tsc_state.tsc;
> +
> +               if (tsc_state.flags & KVM_TSC_STATE_TIMESTAMP_VALID) {
> +                       s64 diff = wall_nsec - tsc_state.nsec;
> +                       if (diff >= 0)
> +                               new_guest_tsc += nsec_to_cycles(vcpu, diff);
> +                       else
> +                               new_guest_tsc -= nsec_to_cycles(vcpu, -diff);
> +               }
> +
> +               new_guest_tsc_offset = new_guest_tsc - kvm_scale_tsc(vcpu, host_tsc);
> +               kvm_vcpu_write_tsc_offset(vcpu, new_guest_tsc_offset);

How would a VMM maintain the phase relationship between guest TSCs
using these ioctls?

For as bugged as the old way of doing things is (i.e. the magic
handling of IA32_TSC), it was at least possible to keep guest TSCs in
sync across a live migration so long as you satisfied the conditions
where KVM decides to fudge the TSCs on your behalf. However, if we
migrate the TSCs by value with an optional timestamp to account for
elapsed time, it would appear that the guest TSC offset is likely to
be inconsistent across vCPUs as the offset calculations above do not
use a fixed host tsc timestamp across all vCPUs.

The reason I'd suggested yielding control of the tsc offset controls
to userspace is that they're unaffected by such variations in per-vCPU
calculations. Not only that, userspace can decide how it wants TSCs to
be handled across a migration explicitly instead of relying on the
above computation being done in the kernel.

> +
> +               if (tsc_state.flags & KVM_TSC_STATE_TSC_ADJUST_VALID)
> +                       if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST))
> +                               vcpu->arch.ia32_tsc_adjust_msr = tsc_state.tsc_adjust;

How is this ioctl's handling of the TSC_ADJUST msr an improvement over
KVM_{GET,SET}_MSRS like before? It seems orthogonal to the rest of the
intended API as it isn't involved in your computation above.

> +               r = 0;
> +               break;
> +       }
> +#endif
>         default:
>                 r = -EINVAL;
>         }
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 886802b8ffba3..bf4c38fd58291 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1056,6 +1056,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_ENFORCE_PV_FEATURE_CPUID 190
>  #define KVM_CAP_SYS_HYPERV_CPUID 191
>  #define KVM_CAP_DIRTY_LOG_RING 192
> +#define KVM_CAP_PRECISE_TSC 193
>
>  #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -1169,6 +1170,16 @@ struct kvm_clock_data {
>         __u32 pad[9];
>  };
>
> +
> +#define KVM_TSC_STATE_TIMESTAMP_VALID 1
> +#define KVM_TSC_STATE_TSC_ADJUST_VALID 2
> +struct kvm_tsc_state {
> +       __u32 flags;
> +       __u64 nsec;
> +       __u64 tsc;
> +       __u64 tsc_adjust;
> +};
> +
>  /* For KVM_CAP_SW_TLB */
>
>  #define KVM_MMU_FSL_BOOKE_NOHV         0
> @@ -1563,6 +1574,10 @@ struct kvm_pv_cmd {
>  /* Available with KVM_CAP_DIRTY_LOG_RING */
>  #define KVM_RESET_DIRTY_RINGS          _IO(KVMIO, 0xc7)
>
> +/* Available with KVM_CAP_PRECISE_TSC*/
> +#define KVM_SET_TSC_STATE          _IOW(KVMIO,  0xc8, struct kvm_tsc_state)
> +#define KVM_GET_TSC_STATE          _IOR(KVMIO,  0xc9, struct kvm_tsc_state)
> +
>  /* Secure Encrypted Virtualization command */
>  enum sev_cmd_id {
>         /* Guest initialization commands */
> --
> 2.26.2
>
Thomas Gleixner Dec. 7, 2020, 5:41 p.m. UTC | #8
On Mon, Dec 07 2020 at 14:16, Vitaly Kuznetsov wrote:
> Maxim Levitsky <mlevitsk@redhat.com> writes:
>> But other than that I don't mind making TSC offset global per VM thing.
>> Paulo, what do you think about this?
>>
>
> Not Paolo here but personally I'd very much prefer we go this route but
> unsynchronized TSCs are, unfortunately, still a thing: I was observing
> it on an AMD Epyc server just a couple years ago (cured with firmware
> update).

Right, this still happens occasionally, but for quite some time this has been
100% firmware silliness and not a fundamental property of the hardware
anymore. Interestingly enough, the number of reports of such wreckage
(observable via TSC_ADJUST) on Intel-based systems has gone down after
we added support for it and yelled prominently. I wish AMD would have
that as well.

> We try to catch such situation in KVM instead of blowing up but
> this may still result in subtle bugs I believe. Maybe we would be better
> off killing all VMs in case TSC ever gets unsynced (by default).

I just ran a guest on an old machine with unsynchronized TSCs and was
able to observe clock monotonic going backwards between two threads
pinned on two vCPUs, which _is_ bad. Getting unsynced clocks reliably
under control is extremely hard.

> Another thing to this bucket is kvmclock which is currently per-cpu. If
> we forbid TSC to un-synchronize (he-he), there is no point in doing
> that. We can as well use e.g. Hyper-V TSC page method which is
> per-VM. Creating another PV clock in KVM may be a hard sell as all
> modern x86 CPUs support TSC scaling (in addition to TSC offsetting which
> is there for a long time) and when it's there we don't really need a PV
> clock to make migration possible.

That should be the long term goal.

Thanks,

        tglx
Andy Lutomirski Dec. 7, 2020, 6:04 p.m. UTC | #9
> On Dec 7, 2020, at 9:00 AM, Maxim Levitsky <mlevitsk@redhat.com> wrote:
> 
> On Mon, 2020-12-07 at 08:53 -0800, Andy Lutomirski wrote:
>>>> On Dec 7, 2020, at 8:38 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> 
>>> On Mon, Dec 07 2020 at 14:16, Maxim Levitsky wrote:
>>>>> On Sun, 2020-12-06 at 17:19 +0100, Thomas Gleixner wrote:
>>>>> From a timekeeping POV and the guests expectation of TSC this is
>>>>> fundamentally wrong:
>>>>> 
>>>>>     tscguest = scaled(hosttsc) + offset
>>>>> 
>>>>> The TSC has to be viewed systemwide and not per CPU. It's systemwide
>>>>> used for timekeeping and for that to work it has to be synchronized. 
>>>>> 
>>>>> Why would this be different on virt? Just because it's virt or what? 
>>>>> 
>>>>> Migration is a guest wide thing and you're not migrating single vCPUs.
>>>>> 
>>>>> This hackery just papers over he underlying design fail that KVM looks
>>>>> at the TSC per vCPU which is the root cause and that needs to be fixed.
>>>> 
>>>> I don't disagree with you.
>>>> As far as I know the main reasons that kvm tracks TSC per guest are
>>>> 
>>>> 1. cases when host tsc is not stable 
>>>> (hopefully rare now, and I don't mind making
>>>> the new API just refuse to work when this is detected, and revert to old way
>>>> of doing things).
>>> 
>>> That's a trainwreck to begin with and I really would just not support it
>>> for anything new which aims to be more precise and correct.  TSC has
>>> become pretty reliable over the years.
>>> 
>>>> 2. (theoretical) ability of the guest to introduce per core tsc offfset
>>>> by either using TSC_ADJUST (for which I got recently an idea to stop
>>>> advertising this feature to the guest), or writing TSC directly which
>>>> is allowed by Intel's PRM:
>>> 
>>> For anything halfways modern the write to TSC is reflected in TSC_ADJUST
>>> which means you get the precise offset.
>>> 
>>> The general principle still applies from a system POV.
>>> 
>>>    TSC base (systemwide view) - The sane case
>>> 
>>>    TSC CPU  = TSC base + TSC_ADJUST
>>> 
>>> The guest TSC base is a per guest constant offset to the host TSC.
>>> 
>>>    TSC guest base = TSC host base + guest base offset
>>> 
>>> If the guest want's this different per vCPU by writing to the MSR or to
>>> TSC_ADJUST then you still can have a per vCPU offset in TSC_ADJUST which
>>> is the offset to the TSC base of the guest.
>> 
>> How about, if the guest wants to write TSC_ADJUST, it can turn off all paravirt features and keep both pieces?
>> 
> 
> This is one of the things I had in mind recently.
> 
> Even better, we can stop advertising TSC_ADJUST in CPUID to the guest 
> and forbid it from writing it at all.

Seems reasonable to me.

It also seems okay for some MSRs to stop working after the guest enabled new PV timekeeping.

I do have a feature request, though: IMO it would be quite nifty if the new kvmclock structure could also expose NTP corrections. In other words, if you could expose enough info to calculate CLOCK_MONOTONIC_RAW, CLOCK_MONOTONIC, and CLOCK_REALTIME, then we could have paravirt NTP.

Bonus points if whatever you do for CLOCK_REALTIME also exposes leap seconds in a race free way :). But I suppose that just exposing TAI and letting the guest deal with the TAI - UTC offset itself would get the job done just fine.
Marcelo Tosatti Dec. 7, 2020, 11:11 p.m. UTC | #10
On Mon, Dec 07, 2020 at 10:04:45AM -0800, Andy Lutomirski wrote:
> 
> > On Dec 7, 2020, at 9:00 AM, Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > 
> > On Mon, 2020-12-07 at 08:53 -0800, Andy Lutomirski wrote:
> >>>> On Dec 7, 2020, at 8:38 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >>> 
> >>> On Mon, Dec 07 2020 at 14:16, Maxim Levitsky wrote:
> >>>>> On Sun, 2020-12-06 at 17:19 +0100, Thomas Gleixner wrote:
> >>>>> From a timekeeping POV and the guests expectation of TSC this is
> >>>>> fundamentally wrong:
> >>>>> 
> >>>>>     tscguest = scaled(hosttsc) + offset
> >>>>> 
> >>>>> The TSC has to be viewed systemwide and not per CPU. It's systemwide
> >>>>> used for timekeeping and for that to work it has to be synchronized. 
> >>>>> 
> >>>>> Why would this be different on virt? Just because it's virt or what? 
> >>>>> 
> >>>>> Migration is a guest wide thing and you're not migrating single vCPUs.
> >>>>> 
> >>>>> This hackery just papers over he underlying design fail that KVM looks
> >>>>> at the TSC per vCPU which is the root cause and that needs to be fixed.
> >>>> 
> >>>> I don't disagree with you.
> >>>> As far as I know the main reasons that kvm tracks TSC per guest are
> >>>> 
> >>>> 1. cases when host tsc is not stable 
> >>>> (hopefully rare now, and I don't mind making
> >>>> the new API just refuse to work when this is detected, and revert to old way
> >>>> of doing things).
> >>> 
> >>> That's a trainwreck to begin with and I really would just not support it
> >>> for anything new which aims to be more precise and correct.  TSC has
> >>> become pretty reliable over the years.
> >>> 
> >>>> 2. (theoretical) ability of the guest to introduce per core tsc offfset
> >>>> by either using TSC_ADJUST (for which I got recently an idea to stop
> >>>> advertising this feature to the guest), or writing TSC directly which
> >>>> is allowed by Intel's PRM:
> >>> 
> >>> For anything halfways modern the write to TSC is reflected in TSC_ADJUST
> >>> which means you get the precise offset.
> >>> 
> >>> The general principle still applies from a system POV.
> >>> 
> >>>    TSC base (systemwide view) - The sane case
> >>> 
> >>>    TSC CPU  = TSC base + TSC_ADJUST
> >>> 
> >>> The guest TSC base is a per guest constant offset to the host TSC.
> >>> 
> >>>    TSC guest base = TSC host base + guest base offset
> >>> 
> >>> If the guest want's this different per vCPU by writing to the MSR or to
> >>> TSC_ADJUST then you still can have a per vCPU offset in TSC_ADJUST which
> >>> is the offset to the TSC base of the guest.
> >> 
> >> How about, if the guest wants to write TSC_ADJUST, it can turn off all paravirt features and keep both pieces?
> >> 
> > 
> > This is one of the things I had in mind recently.
> > 
> > Even better, we can stop advertising TSC_ADJUST in CPUID to the guest 
> > and forbid it from writing it at all.
> 
> Seems reasonable to me.
> 
> It also seems okay for some MSRs to stop working after the guest enabled new PV timekeeping.
> 
> I do have a feature request, though: IMO it would be quite nifty if the new kvmclock structure could also expose NTP corrections. In other words, if you could expose enough info to calculate CLOCK_MONOTONIC_RAW, CLOCK_MONOTONIC, and CLOCK_REALTIME, then we could have paravirt NTP.

Hi Andy,

Any reason why drivers/ptp/ptp_kvm.c does not work for you?

> Bonus points if whatever you do for CLOCK_REALTIME also exposes leap seconds in a race free way :). But I suppose that just exposing TAI and letting the guest deal with the TAI - UTC offset itself would get the job done just fine.
Marcelo Tosatti Dec. 7, 2020, 11:29 p.m. UTC | #11
On Thu, Dec 03, 2020 at 07:11:16PM +0200, Maxim Levitsky wrote:
> These two new ioctls allow to more precisly capture and
> restore guest's TSC state.
> 
> Both ioctls are meant to be used to accurately migrate guest TSC
> even when there is a significant downtime during the migration.
> 
> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>  Documentation/virt/kvm/api.rst | 65 ++++++++++++++++++++++++++++++
>  arch/x86/kvm/x86.c             | 73 ++++++++++++++++++++++++++++++++++
>  include/uapi/linux/kvm.h       | 15 +++++++
>  3 files changed, 153 insertions(+)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 70254eaa5229f..ebecfe4b414ce 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -4826,6 +4826,71 @@ If a vCPU is in running state while this ioctl is invoked, the vCPU may
>  experience inconsistent filtering behavior on MSR accesses.
>  
>  
> +4.127 KVM_GET_TSC_STATE
> +----------------------------
> +
> +:Capability: KVM_CAP_PRECISE_TSC
> +:Architectures: x86
> +:Type: vcpu ioctl
> +:Parameters: struct kvm_tsc_state
> +:Returns: 0 on success, < 0 on error
> +
> +::
> +
> +  #define KVM_TSC_STATE_TIMESTAMP_VALID 1
> +  #define KVM_TSC_STATE_TSC_ADJUST_VALID 2
> +  struct kvm_tsc_state {
> +	__u32 flags;
> +	__u64 nsec;
> +	__u64 tsc;
> +	__u64 tsc_adjust;
> +  };
> +
> +flags values for ``struct kvm_tsc_state``:
> +
> +``KVM_TSC_STATE_TIMESTAMP_VALID``
> +
> +  ``nsec`` contains nanoseconds from unix epoch.
> +    Always set by KVM_GET_TSC_STATE, might be omitted in KVM_SET_TSC_STATE
> +
> +``KVM_TSC_STATE_TSC_ADJUST_VALID``
> +
> +  ``tsc_adjust`` contains valid IA32_TSC_ADJUST value
> +
> +
> +This ioctl allows the user space to read the guest's IA32_TSC,IA32_TSC_ADJUST,
> +and the current value of host's CLOCK_REALTIME clock in nanoseconds since unix
> +epoch.

Why is CLOCK_REALTIME necessary at all? kvmclock uses the host clock as
a time base, but for TSC it should not be necessary.

> +
> +
> +4.128 KVM_SET_TSC_STATE
> +----------------------------
> +
> +:Capability: KVM_CAP_PRECISE_TSC
> +:Architectures: x86
> +:Type: vcpu ioctl
> +:Parameters: struct kvm_tsc_state
> +:Returns: 0 on success, < 0 on error
> +
> +::
> +
> +This ioctl allows to reconstruct the guest's IA32_TSC and TSC_ADJUST value
> +from the state obtained in the past by KVM_GET_TSC_STATE on the same vCPU.
> +
> +If 'KVM_TSC_STATE_TIMESTAMP_VALID' is set in flags,
> +KVM will adjust the guest TSC value by the time that passed since the moment
> +CLOCK_REALTIME timestamp was saved in the struct and current value of
> +CLOCK_REALTIME, and set the guest's TSC to the new value.

This introduces the wraparound bug in Linux timekeeping, doesn't it?

> +
> +Otherwise KVM will set the guest TSC value to the exact value as given
> +in the struct.
> +
> +if KVM_TSC_STATE_TSC_ADJUST_VALID is set, and guest supports IA32_MSR_TSC_ADJUST,
> +then its value will be set to the given value from the struct.
> +
> +It is assumed that either both ioctls will be run on the same machine,
> +or that source and destination machines have synchronized clocks.



>  5. The kvm_run structure
>  ========================
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a3fdc16cfd6f3..9b8a2fe3a2398 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2438,6 +2438,21 @@ static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
>  
>  	return gtod_is_based_on_tsc(do_realtime(ts, tsc_timestamp));
>  }
> +
> +
> +static void kvm_get_walltime(u64 *walltime_ns, u64 *host_tsc)
> +{
> +	struct timespec64 ts;
> +
> +	if (kvm_get_walltime_and_clockread(&ts, host_tsc)) {
> +		*walltime_ns = timespec64_to_ns(&ts);
> +		return;
> +	}
> +
> +	*host_tsc = rdtsc();
> +	*walltime_ns = ktime_get_real_ns();
> +}
> +
>  #endif
>  
>  /*
> @@ -3757,6 +3772,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  	case KVM_CAP_X86_USER_SPACE_MSR:
>  	case KVM_CAP_X86_MSR_FILTER:
>  	case KVM_CAP_ENFORCE_PV_FEATURE_CPUID:
> +#ifdef CONFIG_X86_64
> +	case KVM_CAP_PRECISE_TSC:
> +#endif
>  		r = 1;
>  		break;
>  	case KVM_CAP_SYNC_REGS:
> @@ -4999,6 +5017,61 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
>  	case KVM_GET_SUPPORTED_HV_CPUID:
>  		r = kvm_ioctl_get_supported_hv_cpuid(vcpu, argp);
>  		break;
> +#ifdef CONFIG_X86_64
> +	case KVM_GET_TSC_STATE: {
> +		struct kvm_tsc_state __user *user_tsc_state = argp;
> +		u64 host_tsc;
> +
> +		struct kvm_tsc_state tsc_state = {
> +			.flags = KVM_TSC_STATE_TIMESTAMP_VALID
> +		};
> +
> +		kvm_get_walltime(&tsc_state.nsec, &host_tsc);
> +		tsc_state.tsc = kvm_read_l1_tsc(vcpu, host_tsc);
> +
> +		if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
> +			tsc_state.tsc_adjust = vcpu->arch.ia32_tsc_adjust_msr;
> +			tsc_state.flags |= KVM_TSC_STATE_TSC_ADJUST_VALID;
> +		}
> +
> +		r = -EFAULT;
> +		if (copy_to_user(user_tsc_state, &tsc_state, sizeof(tsc_state)))
> +			goto out;
> +		r = 0;
> +		break;
> +	}
> +	case KVM_SET_TSC_STATE: {
> +		struct kvm_tsc_state __user *user_tsc_state = argp;
> +		struct kvm_tsc_state tsc_state;
> +		u64 host_tsc, wall_nsec;
> +
> +		u64 new_guest_tsc, new_guest_tsc_offset;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&tsc_state, user_tsc_state, sizeof(tsc_state)))
> +			goto out;
> +
> +		kvm_get_walltime(&wall_nsec, &host_tsc);
> +		new_guest_tsc = tsc_state.tsc;
> +
> +		if (tsc_state.flags & KVM_TSC_STATE_TIMESTAMP_VALID) {
> +			s64 diff = wall_nsec - tsc_state.nsec;
> +			if (diff >= 0)
> +				new_guest_tsc += nsec_to_cycles(vcpu, diff);
> +			else
> +				new_guest_tsc -= nsec_to_cycles(vcpu, -diff);
> +		}
> +
> +		new_guest_tsc_offset = new_guest_tsc - kvm_scale_tsc(vcpu, host_tsc);
> +		kvm_vcpu_write_tsc_offset(vcpu, new_guest_tsc_offset);
> +
> +		if (tsc_state.flags & KVM_TSC_STATE_TSC_ADJUST_VALID)
> +			if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST))
> +				vcpu->arch.ia32_tsc_adjust_msr = tsc_state.tsc_adjust;
> +		r = 0;
> +		break;
> +	}
> +#endif
>  	default:
>  		r = -EINVAL;
>  	}
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 886802b8ffba3..bf4c38fd58291 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1056,6 +1056,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_ENFORCE_PV_FEATURE_CPUID 190
>  #define KVM_CAP_SYS_HYPERV_CPUID 191
>  #define KVM_CAP_DIRTY_LOG_RING 192
> +#define KVM_CAP_PRECISE_TSC 193
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> @@ -1169,6 +1170,16 @@ struct kvm_clock_data {
>  	__u32 pad[9];
>  };
>  
> +
> +#define KVM_TSC_STATE_TIMESTAMP_VALID 1
> +#define KVM_TSC_STATE_TSC_ADJUST_VALID 2
> +struct kvm_tsc_state {
> +	__u32 flags;
> +	__u64 nsec;
> +	__u64 tsc;
> +	__u64 tsc_adjust;
> +};
> +
>  /* For KVM_CAP_SW_TLB */
>  
>  #define KVM_MMU_FSL_BOOKE_NOHV		0
> @@ -1563,6 +1574,10 @@ struct kvm_pv_cmd {
>  /* Available with KVM_CAP_DIRTY_LOG_RING */
>  #define KVM_RESET_DIRTY_RINGS		_IO(KVMIO, 0xc7)
>  
> +/* Available with KVM_CAP_PRECISE_TSC*/
> +#define KVM_SET_TSC_STATE          _IOW(KVMIO,  0xc8, struct kvm_tsc_state)
> +#define KVM_GET_TSC_STATE          _IOR(KVMIO,  0xc9, struct kvm_tsc_state)
> +
>  /* Secure Encrypted Virtualization command */
>  enum sev_cmd_id {
>  	/* Guest initialization commands */
> -- 
> 2.26.2
Marcelo Tosatti Dec. 7, 2020, 11:34 p.m. UTC | #12
On Sun, Dec 06, 2020 at 05:19:16PM +0100, Thomas Gleixner wrote:
> On Thu, Dec 03 2020 at 19:11, Maxim Levitsky wrote:
> > +	case KVM_SET_TSC_STATE: {
> > +		struct kvm_tsc_state __user *user_tsc_state = argp;
> > +		struct kvm_tsc_state tsc_state;
> > +		u64 host_tsc, wall_nsec;
> > +
> > +		u64 new_guest_tsc, new_guest_tsc_offset;
> > +
> > +		r = -EFAULT;
> > +		if (copy_from_user(&tsc_state, user_tsc_state, sizeof(tsc_state)))
> > +			goto out;
> > +
> > +		kvm_get_walltime(&wall_nsec, &host_tsc);
> > +		new_guest_tsc = tsc_state.tsc;
> > +
> > +		if (tsc_state.flags & KVM_TSC_STATE_TIMESTAMP_VALID) {
> > +			s64 diff = wall_nsec - tsc_state.nsec;
> > +			if (diff >= 0)
> > +				new_guest_tsc += nsec_to_cycles(vcpu, diff);
> > +			else
> > +				new_guest_tsc -= nsec_to_cycles(vcpu, -diff);
> > +		}
> > +
> > +		new_guest_tsc_offset = new_guest_tsc - kvm_scale_tsc(vcpu, host_tsc);
> > +		kvm_vcpu_write_tsc_offset(vcpu, new_guest_tsc_offset);
> 
> >From a timekeeping POV and the guests expectation of TSC this is
> fundamentally wrong:
> 
>       tscguest = scaled(hosttsc) + offset
> 
> The TSC has to be viewed systemwide and not per CPU. It's systemwide
> used for timekeeping and for that to work it has to be synchronized. 
> 
> Why would this be different on virt? Just because it's virt or what? 
> 
> Migration is a guest wide thing and you're not migrating single vCPUs.
> 
> This hackery just papers over he underlying design fail that KVM looks
> at the TSC per vCPU which is the root cause and that needs to be fixed.

It already does it: The unified TSC offset is kept at kvm->arch.cur_tsc_offset.
Peter Zijlstra Dec. 8, 2020, 9:35 a.m. UTC | #13
On Mon, Dec 07, 2020 at 05:38:36PM +0100, Thomas Gleixner wrote:
> For anything halfways modern the write to TSC is reflected in TSC_ADJUST
> which means you get the precise offset.

IIRC this is true for everything that has TSC_ADJUST.
Peter Zijlstra Dec. 8, 2020, 9:48 a.m. UTC | #14
On Mon, Dec 07, 2020 at 06:41:41PM +0100, Thomas Gleixner wrote:

> Right this happens still occasionally, but for quite some time this is
> 100% firmware sillyness and not a fundamental property of the hardware
> anymore.

Ever since Nehalem (2008) the TSC has been synchronized on <= 2 sockets, and any
observed deviation is firmware fail. I don't remember exactly when 4-socket
and up got reliable.

(there's the physical hotplug case, but let's not make this complicated)

AMD has had Constant TSC since Barcelona which is roughly the same
timeframe IIRC.

So basically every TSC fail in the last decade is due to firmware being
shit.
Maxim Levitsky Dec. 8, 2020, 11:13 a.m. UTC | #15
On Mon, 2020-12-07 at 11:29 -0600, Oliver Upton wrote:
> On Thu, Dec 3, 2020 at 11:12 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > These two new ioctls allow to more precisly capture and
> > restore guest's TSC state.
> > 
> > Both ioctls are meant to be used to accurately migrate guest TSC
> > even when there is a significant downtime during the migration.
> > 
> > Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > ---
> >  Documentation/virt/kvm/api.rst | 65 ++++++++++++++++++++++++++++++
> >  arch/x86/kvm/x86.c             | 73 ++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/kvm.h       | 15 +++++++
> >  3 files changed, 153 insertions(+)
> > 
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 70254eaa5229f..ebecfe4b414ce 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -4826,6 +4826,71 @@ If a vCPU is in running state while this ioctl is invoked, the vCPU may
> >  experience inconsistent filtering behavior on MSR accesses.
> > 
> > 
> > +4.127 KVM_GET_TSC_STATE
> > +----------------------------
> > +
> > +:Capability: KVM_CAP_PRECISE_TSC
> > +:Architectures: x86
> > +:Type: vcpu ioctl
> > +:Parameters: struct kvm_tsc_state
> > +:Returns: 0 on success, < 0 on error
> > +
> > +::
> > +
> > +  #define KVM_TSC_STATE_TIMESTAMP_VALID 1
> > +  #define KVM_TSC_STATE_TSC_ADJUST_VALID 2
> > +  struct kvm_tsc_state {
> > +       __u32 flags;
> > +       __u64 nsec;
> > +       __u64 tsc;
> > +       __u64 tsc_adjust;
> > +  };
> > +
> > +flags values for ``struct kvm_tsc_state``:
> > +
> > +``KVM_TSC_STATE_TIMESTAMP_VALID``
> > +
> > +  ``nsec`` contains nanoseconds from unix epoch.
> > +    Always set by KVM_GET_TSC_STATE, might be omitted in KVM_SET_TSC_STATE
> > +
> > +``KVM_TSC_STATE_TSC_ADJUST_VALID``
> > +
> > +  ``tsc_adjust`` contains valid IA32_TSC_ADJUST value
> > +
> > +
> > +This ioctl allows the user space to read the guest's IA32_TSC,IA32_TSC_ADJUST,
> > +and the current value of host's CLOCK_REALTIME clock in nanoseconds since unix
> > +epoch.
> > +
> > +
> > +4.128 KVM_SET_TSC_STATE
> > +----------------------------
> > +
> > +:Capability: KVM_CAP_PRECISE_TSC
> > +:Architectures: x86
> > +:Type: vcpu ioctl
> > +:Parameters: struct kvm_tsc_state
> > +:Returns: 0 on success, < 0 on error
> > +
> > +::
> > +
> > +This ioctl allows to reconstruct the guest's IA32_TSC and TSC_ADJUST value
> > +from the state obtained in the past by KVM_GET_TSC_STATE on the same vCPU.
> > +
> > +If 'KVM_TSC_STATE_TIMESTAMP_VALID' is set in flags,
> > +KVM will adjust the guest TSC value by the time that passed since the moment
> > +CLOCK_REALTIME timestamp was saved in the struct and current value of
> > +CLOCK_REALTIME, and set the guest's TSC to the new value.
> > +
> > +Otherwise KVM will set the guest TSC value to the exact value as given
> > +in the struct.
> > +
> > +if KVM_TSC_STATE_TSC_ADJUST_VALID is set, and guest supports IA32_MSR_TSC_ADJUST,
> > +then its value will be set to the given value from the struct.
> > +
> > +It is assumed that either both ioctls will be run on the same machine,
> > +or that source and destination machines have synchronized clocks.
> > +
> >  5. The kvm_run structure
> >  ========================
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index a3fdc16cfd6f3..9b8a2fe3a2398 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -2438,6 +2438,21 @@ static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
> > 
> >         return gtod_is_based_on_tsc(do_realtime(ts, tsc_timestamp));
> >  }
> > +
> > +
> > +static void kvm_get_walltime(u64 *walltime_ns, u64 *host_tsc)
> > +{
> > +       struct timespec64 ts;
> > +
> > +       if (kvm_get_walltime_and_clockread(&ts, host_tsc)) {
> > +               *walltime_ns = timespec64_to_ns(&ts);
> > +               return;
> > +       }
> > +
> > +       *host_tsc = rdtsc();
> > +       *walltime_ns = ktime_get_real_ns();
> > +}
> > +
> >  #endif
> > 
> >  /*
> > @@ -3757,6 +3772,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >         case KVM_CAP_X86_USER_SPACE_MSR:
> >         case KVM_CAP_X86_MSR_FILTER:
> >         case KVM_CAP_ENFORCE_PV_FEATURE_CPUID:
> > +#ifdef CONFIG_X86_64
> > +       case KVM_CAP_PRECISE_TSC:
> > +#endif
> >                 r = 1;
> >                 break;
> >         case KVM_CAP_SYNC_REGS:
> > @@ -4999,6 +5017,61 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
> >         case KVM_GET_SUPPORTED_HV_CPUID:
> >                 r = kvm_ioctl_get_supported_hv_cpuid(vcpu, argp);
> >                 break;
> > +#ifdef CONFIG_X86_64
> > +       case KVM_GET_TSC_STATE: {
> > +               struct kvm_tsc_state __user *user_tsc_state = argp;
> > +               u64 host_tsc;
> > +
> > +               struct kvm_tsc_state tsc_state = {
> > +                       .flags = KVM_TSC_STATE_TIMESTAMP_VALID
> > +               };
> > +
> > +               kvm_get_walltime(&tsc_state.nsec, &host_tsc);
> > +               tsc_state.tsc = kvm_read_l1_tsc(vcpu, host_tsc);
> > +
> > +               if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
> > +                       tsc_state.tsc_adjust = vcpu->arch.ia32_tsc_adjust_msr;
> > +                       tsc_state.flags |= KVM_TSC_STATE_TSC_ADJUST_VALID;
> > +               }
> 
> [...]
> 
> > +
> > +               r = -EFAULT;
> > +               if (copy_to_user(user_tsc_state, &tsc_state, sizeof(tsc_state)))
> > +                       goto out;
> > +               r = 0;
> > +               break;
> > +       }
> > +       case KVM_SET_TSC_STATE: {
> > +               struct kvm_tsc_state __user *user_tsc_state = argp;
> > +               struct kvm_tsc_state tsc_state;
> > +               u64 host_tsc, wall_nsec;
> > +
> > +               u64 new_guest_tsc, new_guest_tsc_offset;
> > +
> > +               r = -EFAULT;
> > +               if (copy_from_user(&tsc_state, user_tsc_state, sizeof(tsc_state)))
> > +                       goto out;
> > +
> > +               kvm_get_walltime(&wall_nsec, &host_tsc);
> > +               new_guest_tsc = tsc_state.tsc;
> > +
> > +               if (tsc_state.flags & KVM_TSC_STATE_TIMESTAMP_VALID) {
> > +                       s64 diff = wall_nsec - tsc_state.nsec;
> > +                       if (diff >= 0)
> > +                               new_guest_tsc += nsec_to_cycles(vcpu, diff);
> > +                       else
> > +                               new_guest_tsc -= nsec_to_cycles(vcpu, -diff);
> > +               }
> > +
> > +               new_guest_tsc_offset = new_guest_tsc - kvm_scale_tsc(vcpu, host_tsc);
> > +               kvm_vcpu_write_tsc_offset(vcpu, new_guest_tsc_offset);
> 
> How would a VMM maintain the phase relationship between guest TSCs
> using these ioctls?

By using the nanosecond timestamp. 
 
While I did make it optional in V2, that was done for the sole sake of being
able to set the TSC to 0 on (re)boot from qemu, and for cases when qemu migrates
from a VM where the feature is not enabled.
In those cases the TSC is set to the given value exactly, just like you
can do today with KVM_SET_MSRS.
In all other cases the nanosecond timestamp will be given.

When userspace uses the nanosecond timestamp, the phase relationship
is not only maintained but exact, even if the TSC reads were not
synchronized and even if their restore on the target wasn't synchronized either.
 
Here is an example:
 
Let's assume that TSC on source/target is synchronized, and that the guest TSC
is synchronized as well.

Let's call the guest TSC frequency F (the guest TSC increments by F each second).

We do KVM_GET_TSC_STATE on vcpu0 and receive (t0, tsc0).
We do KVM_GET_TSC_STATE on vcpu1 one second later (exaggerated)
and receive (t0 + 1s, tsc0 + F).


We do KVM_SET_TSC_STATE at t0 + 10s on vcpu0 after migration,
and vcpu0's guest TSC is set to tsc0 + F*[(t0 + 10s) - t0] = tsc0 + 10*F.

We do KVM_SET_TSC_STATE at t0 + 12s on vcpu1 (also exaggerated)
and get [tsc0 + F] + F*[(t0 + 12s) - (t0 + 1s)] = tsc0 + 12*F.

Since two more seconds have passed by then, both vCPUs now have their TSC at tsc0 + 12*F.
 
I use KVM's own functions to read CLOCK_REALTIME, which are done
in such a way that the host TSC is read once and then converted to
nanoseconds by scaling/offsetting it as the kernel would, thus
no random error is introduced here.

So except for numerical errors
(which are unavoidable anyway, and should be negligible) this algorithm should
both keep the TSCs in sync, and even keep their absolute time reference
as accurate as the clock synchronization between the source and the target is.

(An offset between source and destination clocks will affect
all the TSCs in the same manner, as long as both
source and destination clocks are stable.)
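
A compact sketch of the arithmetic above (illustrative only, not kernel code;
it assumes the restore-side timestamp is not earlier than the save-side one,
and uses a 128-bit intermediate to avoid overflow):

  #include <stdint.h>

  #define NSEC_PER_SEC 1000000000ULL

  /* New guest TSC as computed by KVM_SET_TSC_STATE when the timestamp is valid. */
  static uint64_t restored_guest_tsc(uint64_t saved_tsc, uint64_t saved_nsec,
                                     uint64_t now_nsec, uint64_t tsc_hz)
  {
          unsigned __int128 delta_ns = now_nsec - saved_nsec;

          return saved_tsc + (uint64_t)(delta_ns * tsc_hz / NSEC_PER_SEC);
  }

  /*
   * With F = tsc_hz, as in the example:
   *   vcpu0: saved (t0,      tsc0),     restored at t0 + 10s -> tsc0 + 10*F
   *   vcpu1: saved (t0 + 1s, tsc0 + F), restored at t0 + 12s -> tsc0 + 12*F
   * By t0 + 12s vcpu0 has itself advanced by another 2*F, so both read
   * tsc0 + 12*F.
   */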


> 
> For as bugged as the old way of doing things is (i.e. the magic
> handling of IA32_TSC), it was at least possible to keep guest TSCs in
> sync across a live migration so long as you satisfied the conditions
> where KVM decides to fudge the TSCs on your behalf. However, if we
> migrate the TSCs by value with an optional timestamp to account for
> elapsed time, it would appear that the guest TSC offset is likely to
> be inconsistent across vCPUs as the offset calculations above do not
> use a fixed host tsc timestamp across all vCPUs.

> 
> The reason I'd suggested yielding control of the tsc offset controls
> to userspace is that they're unaffected by such variations in per-vCPU
> calculations. Not only that, userspace can decide how it wants TSCs to
> be handled across a migration explicitly instead of relying on the
> above computation being done in the kernel.
> 
> > +
> > +               if (tsc_state.flags & KVM_TSC_STATE_TSC_ADJUST_VALID)
> > +                       if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST))
> > +                               vcpu->arch.ia32_tsc_adjust_msr = tsc_state.tsc_adjust;
> 
> How is this ioctl's handling of the TSC_ADJUST msr an improvement over
> KVM_{GET,SET}_MSRS like before? It seems orthogonal to the rest of the
> intended API as it isn't involved your computation above.

It's more of a refactoring thing. The goal is to avoid 'magic' handling
of host accesses in KVM_{GET,SET}_MSRS and instead make them
behave the same way as if the guest itself read or wrote that MSR.
That can be useful for debugging and such.

The second patch adds a KVM quirk, which should be disabled
when the new API is used.

When the quirk is disabled, it becomes hard to use KVM_{GET,SET}_MSRS
to set both TSC and TSC_ADJUST to given values at the same time,
since these MSRs are tied to each other when the guest writes them,
and the quirk disables the special (untied) write we had for host writes
to these MSRs.


Think of these new ioctls as a way of saving and restoring
the internal TSC state, without even bothering to think about what is inside.
Kind of like how we save/restore the nested state.

Best regards,
	Maxim Levitsky


> 
> > +               r = 0;
> > +               break;
> > +       }
> > +#endif
> >         default:
> >                 r = -EINVAL;
> >         }
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 886802b8ffba3..bf4c38fd58291 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -1056,6 +1056,7 @@ struct kvm_ppc_resize_hpt {
> >  #define KVM_CAP_ENFORCE_PV_FEATURE_CPUID 190
> >  #define KVM_CAP_SYS_HYPERV_CPUID 191
> >  #define KVM_CAP_DIRTY_LOG_RING 192
> > +#define KVM_CAP_PRECISE_TSC 193
> > 
> >  #ifdef KVM_CAP_IRQ_ROUTING
> > 
> > @@ -1169,6 +1170,16 @@ struct kvm_clock_data {
> >         __u32 pad[9];
> >  };
> > 
> > +
> > +#define KVM_TSC_STATE_TIMESTAMP_VALID 1
> > +#define KVM_TSC_STATE_TSC_ADJUST_VALID 2
> > +struct kvm_tsc_state {
> > +       __u32 flags;
> > +       __u64 nsec;
> > +       __u64 tsc;
> > +       __u64 tsc_adjust;
> > +};
> > +
> >  /* For KVM_CAP_SW_TLB */
> > 
> >  #define KVM_MMU_FSL_BOOKE_NOHV         0
> > @@ -1563,6 +1574,10 @@ struct kvm_pv_cmd {
> >  /* Available with KVM_CAP_DIRTY_LOG_RING */
> >  #define KVM_RESET_DIRTY_RINGS          _IO(KVMIO, 0xc7)
> > 
> > +/* Available with KVM_CAP_PRECISE_TSC*/
> > +#define KVM_SET_TSC_STATE          _IOW(KVMIO,  0xc8, struct kvm_tsc_state)
> > +#define KVM_GET_TSC_STATE          _IOR(KVMIO,  0xc9, struct kvm_tsc_state)
> > +
> >  /* Secure Encrypted Virtualization command */
> >  enum sev_cmd_id {
> >         /* Guest initialization commands */
> > --
> > 2.26.2
> >
Maxim Levitsky Dec. 8, 2020, 11:24 a.m. UTC | #16
On Mon, 2020-12-07 at 10:04 -0800, Andy Lutomirski wrote:
> > On Dec 7, 2020, at 9:00 AM, Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > 
> > On Mon, 2020-12-07 at 08:53 -0800, Andy Lutomirski wrote:
> > > > > On Dec 7, 2020, at 8:38 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > > > 
> > > > On Mon, Dec 07 2020 at 14:16, Maxim Levitsky wrote:
> > > > > > On Sun, 2020-12-06 at 17:19 +0100, Thomas Gleixner wrote:
> > > > > > From a timekeeping POV and the guests expectation of TSC this is
> > > > > > fundamentally wrong:
> > > > > > 
> > > > > >     tscguest = scaled(hosttsc) + offset
> > > > > > 
> > > > > > The TSC has to be viewed systemwide and not per CPU. It's systemwide
> > > > > > used for timekeeping and for that to work it has to be synchronized. 
> > > > > > 
> > > > > > Why would this be different on virt? Just because it's virt or what? 
> > > > > > 
> > > > > > Migration is a guest wide thing and you're not migrating single vCPUs.
> > > > > > 
> > > > > > This hackery just papers over he underlying design fail that KVM looks
> > > > > > at the TSC per vCPU which is the root cause and that needs to be fixed.
> > > > > 
> > > > > I don't disagree with you.
> > > > > As far as I know the main reasons that kvm tracks TSC per guest are
> > > > > 
> > > > > 1. cases when host tsc is not stable 
> > > > > (hopefully rare now, and I don't mind making
> > > > > the new API just refuse to work when this is detected, and revert to old way
> > > > > of doing things).
> > > > 
> > > > That's a trainwreck to begin with and I really would just not support it
> > > > for anything new which aims to be more precise and correct.  TSC has
> > > > become pretty reliable over the years.
> > > > 
> > > > > 2. (theoretical) ability of the guest to introduce per core tsc offfset
> > > > > by either using TSC_ADJUST (for which I got recently an idea to stop
> > > > > advertising this feature to the guest), or writing TSC directly which
> > > > > is allowed by Intel's PRM:
> > > > 
> > > > For anything halfways modern the write to TSC is reflected in TSC_ADJUST
> > > > which means you get the precise offset.
> > > > 
> > > > The general principle still applies from a system POV.
> > > > 
> > > >    TSC base (systemwide view) - The sane case
> > > > 
> > > >    TSC CPU  = TSC base + TSC_ADJUST
> > > > 
> > > > The guest TSC base is a per guest constant offset to the host TSC.
> > > > 
> > > >    TSC guest base = TSC host base + guest base offset
> > > > 
> > > > If the guest want's this different per vCPU by writing to the MSR or to
> > > > TSC_ADJUST then you still can have a per vCPU offset in TSC_ADJUST which
> > > > is the offset to the TSC base of the guest.
> > > 
> > > How about, if the guest wants to write TSC_ADJUST, it can turn off all paravirt features and keep both pieces?
> > > 
> > 
> > This is one of the things I had in mind recently.
> > 
> > Even better, we can stop advertising TSC_ADJUST in CPUID to the guest 
> > and forbid it from writing it at all.
> 
> Seems reasonable to me.
> 
> It also seems okay for some MSRs to stop working after the guest enabled new PV timekeeping.
This is a very good idea!

> 
> I do have a feature request, though: IMO it would be quite nifty if the new kvmclock structure could also expose NTP corrections. In other words, if you could expose enough info to calculate CLOCK_MONOTONIC_RAW, CLOCK_MONOTONIC, and CLOCK_REALTIME, then we could have paravirt NTP.
> 
> Bonus points if whatever you do for CLOCK_REALTIME also exposes leap seconds in a race free way :). But I suppose that just exposing TAI and letting the guest deal with the TAI - UTC offset itself would get the job done just fine.

This is a good idea too.
As I understand it, this gives a justification for a new kvmclock purpose:
it would no longer be focused on correcting the TSC's shortcomings
(an unstable or unscalable TSC), but rather on exposing the kind of
information described above.
I like that idea.

Best regards,
	Maxim Levitsky
Maxim Levitsky Dec. 8, 2020, 2:50 p.m. UTC | #17
On Mon, 2020-12-07 at 20:29 -0300, Marcelo Tosatti wrote:
> On Thu, Dec 03, 2020 at 07:11:16PM +0200, Maxim Levitsky wrote:
> > These two new ioctls allow to more precisly capture and
> > restore guest's TSC state.
> > 
> > Both ioctls are meant to be used to accurately migrate guest TSC
> > even when there is a significant downtime during the migration.
> > 
> > Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > ---
> >  Documentation/virt/kvm/api.rst | 65 ++++++++++++++++++++++++++++++
> >  arch/x86/kvm/x86.c             | 73 ++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/kvm.h       | 15 +++++++
> >  3 files changed, 153 insertions(+)
> > 
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 70254eaa5229f..ebecfe4b414ce 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -4826,6 +4826,71 @@ If a vCPU is in running state while this ioctl is invoked, the vCPU may
> >  experience inconsistent filtering behavior on MSR accesses.
> >  
> >  
> > +4.127 KVM_GET_TSC_STATE
> > +----------------------------
> > +
> > +:Capability: KVM_CAP_PRECISE_TSC
> > +:Architectures: x86
> > +:Type: vcpu ioctl
> > +:Parameters: struct kvm_tsc_state
> > +:Returns: 0 on success, < 0 on error
> > +
> > +::
> > +
> > +  #define KVM_TSC_STATE_TIMESTAMP_VALID 1
> > +  #define KVM_TSC_STATE_TSC_ADJUST_VALID 2
> > +  struct kvm_tsc_state {
> > +	__u32 flags;
> > +	__u64 nsec;
> > +	__u64 tsc;
> > +	__u64 tsc_adjust;
> > +  };
> > +
> > +flags values for ``struct kvm_tsc_state``:
> > +
> > +``KVM_TSC_STATE_TIMESTAMP_VALID``
> > +
> > +  ``nsec`` contains nanoseconds from unix epoch.
> > +    Always set by KVM_GET_TSC_STATE, might be omitted in KVM_SET_TSC_STATE
> > +
> > +``KVM_TSC_STATE_TSC_ADJUST_VALID``
> > +
> > +  ``tsc_adjust`` contains valid IA32_TSC_ADJUST value
> > +
> > +
> > +This ioctl allows the user space to read the guest's IA32_TSC,IA32_TSC_ADJUST,
> > +and the current value of host's CLOCK_REALTIME clock in nanoseconds since unix
> > +epoch.
> 
> Why is CLOCK_REALTIME necessary at all? kvmclock uses the host clock as
> a time base, but for TSC it should not be necessary.


CLOCK_REALTIME is used as an absolute time reference that should match
on both computers. I could have used CLOCK_TAI instead, for example.

The reference allows accounting for the time that passes between saving and
restoring the TSC, as explained above.
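
As a minimal sketch of the intended flow (the struct, flag and ioctl
numbers below come from this RFC and are not in mainline headers, so the
VMM would carry its own copies; error handling omitted):

#include <linux/kvm.h>
#include <linux/types.h>
#include <sys/ioctl.h>

/* definitions taken from this RFC, not from mainline <linux/kvm.h> */
#define KVM_TSC_STATE_TIMESTAMP_VALID 1
struct kvm_tsc_state {
	__u32 flags;
	__u64 nsec;
	__u64 tsc;
	__u64 tsc_adjust;
};
#define KVM_SET_TSC_STATE _IOW(KVMIO, 0xc8, struct kvm_tsc_state)
#define KVM_GET_TSC_STATE _IOR(KVMIO, 0xc9, struct kvm_tsc_state)

/* source side, per vCPU, while the guest is paused */
static void save_vcpu_tsc(int vcpu_fd, struct kvm_tsc_state *state)
{
	ioctl(vcpu_fd, KVM_GET_TSC_STATE, state);
}

/*
 * Destination side: KVM_GET_TSC_STATE sets KVM_TSC_STATE_TIMESTAMP_VALID,
 * so KVM itself adds the CLOCK_REALTIME delta between the saved nsec
 * value and "now"; the VMM just hands the saved state back.
 */
static void restore_vcpu_tsc(int vcpu_fd, struct kvm_tsc_state *state)
{
	ioctl(vcpu_fd, KVM_SET_TSC_STATE, state);
}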


> 
> > +
> > +
> > +4.128 KVM_SET_TSC_STATE
> > +----------------------------
> > +
> > +:Capability: KVM_CAP_PRECISE_TSC
> > +:Architectures: x86
> > +:Type: vcpu ioctl
> > +:Parameters: struct kvm_tsc_state
> > +:Returns: 0 on success, < 0 on error
> > +
> > +::
> > +
> > +This ioctl allows to reconstruct the guest's IA32_TSC and TSC_ADJUST value
> > +from the state obtained in the past by KVM_GET_TSC_STATE on the same vCPU.
> > +
> > +If 'KVM_TSC_STATE_TIMESTAMP_VALID' is set in flags,
> > +KVM will adjust the guest TSC value by the time that passed since the moment
> > +CLOCK_REALTIME timestamp was saved in the struct and current value of
> > +CLOCK_REALTIME, and set the guest's TSC to the new value.
> 
> This introduces the wraparound bug in Linux timekeeping, doesnt it?

It does.
Could you prepare a reproducer for this bug so I get a better idea of
what you are talking about?

I assume you need a very long jump (days' worth) to trigger this bug,
and in that case we can either work around it in qemu / the kernel
or fix it in the guest kernel; I strongly prefer the latter.

Thomas, what do you think about it?

Best regards,
	Maxim Levitsky

> 

> > +
> > +Otherwise KVM will set the guest TSC value to the exact value as given
> > +in the struct.
> > +
> > +if KVM_TSC_STATE_TSC_ADJUST_VALID is set, and guest supports IA32_MSR_TSC_ADJUST,
> > +then its value will be set to the given value from the struct.
> > +
> > +It is assumed that either both ioctls will be run on the same machine,
> > +or that source and destination machines have synchronized clocks.
> 
> 
> >  5. The kvm_run structure
> >  ========================
> >  
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index a3fdc16cfd6f3..9b8a2fe3a2398 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -2438,6 +2438,21 @@ static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
> >  
> >  	return gtod_is_based_on_tsc(do_realtime(ts, tsc_timestamp));
> >  }
> > +
> > +
> > +static void kvm_get_walltime(u64 *walltime_ns, u64 *host_tsc)
> > +{
> > +	struct timespec64 ts;
> > +
> > +	if (kvm_get_walltime_and_clockread(&ts, host_tsc)) {
> > +		*walltime_ns = timespec64_to_ns(&ts);
> > +		return;
> > +	}
> > +
> > +	*host_tsc = rdtsc();
> > +	*walltime_ns = ktime_get_real_ns();
> > +}
> > +
> >  #endif
> >  
> >  /*
> > @@ -3757,6 +3772,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >  	case KVM_CAP_X86_USER_SPACE_MSR:
> >  	case KVM_CAP_X86_MSR_FILTER:
> >  	case KVM_CAP_ENFORCE_PV_FEATURE_CPUID:
> > +#ifdef CONFIG_X86_64
> > +	case KVM_CAP_PRECISE_TSC:
> > +#endif
> >  		r = 1;
> >  		break;
> >  	case KVM_CAP_SYNC_REGS:
> > @@ -4999,6 +5017,61 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
> >  	case KVM_GET_SUPPORTED_HV_CPUID:
> >  		r = kvm_ioctl_get_supported_hv_cpuid(vcpu, argp);
> >  		break;
> > +#ifdef CONFIG_X86_64
> > +	case KVM_GET_TSC_STATE: {
> > +		struct kvm_tsc_state __user *user_tsc_state = argp;
> > +		u64 host_tsc;
> > +
> > +		struct kvm_tsc_state tsc_state = {
> > +			.flags = KVM_TSC_STATE_TIMESTAMP_VALID
> > +		};
> > +
> > +		kvm_get_walltime(&tsc_state.nsec, &host_tsc);
> > +		tsc_state.tsc = kvm_read_l1_tsc(vcpu, host_tsc);
> > +
> > +		if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
> > +			tsc_state.tsc_adjust = vcpu->arch.ia32_tsc_adjust_msr;
> > +			tsc_state.flags |= KVM_TSC_STATE_TSC_ADJUST_VALID;
> > +		}
> > +
> > +		r = -EFAULT;
> > +		if (copy_to_user(user_tsc_state, &tsc_state, sizeof(tsc_state)))
> > +			goto out;
> > +		r = 0;
> > +		break;
> > +	}
> > +	case KVM_SET_TSC_STATE: {
> > +		struct kvm_tsc_state __user *user_tsc_state = argp;
> > +		struct kvm_tsc_state tsc_state;
> > +		u64 host_tsc, wall_nsec;
> > +
> > +		u64 new_guest_tsc, new_guest_tsc_offset;
> > +
> > +		r = -EFAULT;
> > +		if (copy_from_user(&tsc_state, user_tsc_state, sizeof(tsc_state)))
> > +			goto out;
> > +
> > +		kvm_get_walltime(&wall_nsec, &host_tsc);
> > +		new_guest_tsc = tsc_state.tsc;
> > +
> > +		if (tsc_state.flags & KVM_TSC_STATE_TIMESTAMP_VALID) {
> > +			s64 diff = wall_nsec - tsc_state.nsec;
> > +			if (diff >= 0)
> > +				new_guest_tsc += nsec_to_cycles(vcpu, diff);
> > +			else
> > +				new_guest_tsc -= nsec_to_cycles(vcpu, -diff);
> > +		}
> > +
> > +		new_guest_tsc_offset = new_guest_tsc - kvm_scale_tsc(vcpu, host_tsc);
> > +		kvm_vcpu_write_tsc_offset(vcpu, new_guest_tsc_offset);
> > +
> > +		if (tsc_state.flags & KVM_TSC_STATE_TSC_ADJUST_VALID)
> > +			if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST))
> > +				vcpu->arch.ia32_tsc_adjust_msr = tsc_state.tsc_adjust;
> > +		r = 0;
> > +		break;
> > +	}
> > +#endif
> >  	default:
> >  		r = -EINVAL;
> >  	}
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 886802b8ffba3..bf4c38fd58291 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -1056,6 +1056,7 @@ struct kvm_ppc_resize_hpt {
> >  #define KVM_CAP_ENFORCE_PV_FEATURE_CPUID 190
> >  #define KVM_CAP_SYS_HYPERV_CPUID 191
> >  #define KVM_CAP_DIRTY_LOG_RING 192
> > +#define KVM_CAP_PRECISE_TSC 193
> >  
> >  #ifdef KVM_CAP_IRQ_ROUTING
> >  
> > @@ -1169,6 +1170,16 @@ struct kvm_clock_data {
> >  	__u32 pad[9];
> >  };
> >  
> > +
> > +#define KVM_TSC_STATE_TIMESTAMP_VALID 1
> > +#define KVM_TSC_STATE_TSC_ADJUST_VALID 2
> > +struct kvm_tsc_state {
> > +	__u32 flags;
> > +	__u64 nsec;
> > +	__u64 tsc;
> > +	__u64 tsc_adjust;
> > +};
> > +
> >  /* For KVM_CAP_SW_TLB */
> >  
> >  #define KVM_MMU_FSL_BOOKE_NOHV		0
> > @@ -1563,6 +1574,10 @@ struct kvm_pv_cmd {
> >  /* Available with KVM_CAP_DIRTY_LOG_RING */
> >  #define KVM_RESET_DIRTY_RINGS		_IO(KVMIO, 0xc7)
> >  
> > +/* Available with KVM_CAP_PRECISE_TSC*/
> > +#define KVM_SET_TSC_STATE          _IOW(KVMIO,  0xc8, struct kvm_tsc_state)
> > +#define KVM_GET_TSC_STATE          _IOR(KVMIO,  0xc9, struct kvm_tsc_state)
> > +
> >  /* Secure Encrypted Virtualization command */
> >  enum sev_cmd_id {
> >  	/* Guest initialization commands */
> > -- 
> > 2.26.2
Oliver Upton Dec. 8, 2020, 3:57 p.m. UTC | #18
On Tue, Dec 8, 2020 at 5:13 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
>
> On Mon, 2020-12-07 at 11:29 -0600, Oliver Upton wrote:
> > On Thu, Dec 3, 2020 at 11:12 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > > These two new ioctls allow to more precisly capture and
> > > restore guest's TSC state.
> > >
> > > Both ioctls are meant to be used to accurately migrate guest TSC
> > > even when there is a significant downtime during the migration.
> > >
> > > Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> > > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > > ---
> > >  Documentation/virt/kvm/api.rst | 65 ++++++++++++++++++++++++++++++
> > >  arch/x86/kvm/x86.c             | 73 ++++++++++++++++++++++++++++++++++
> > >  include/uapi/linux/kvm.h       | 15 +++++++
> > >  3 files changed, 153 insertions(+)
> > >
> > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > > index 70254eaa5229f..ebecfe4b414ce 100644
> > > --- a/Documentation/virt/kvm/api.rst
> > > +++ b/Documentation/virt/kvm/api.rst
> > > @@ -4826,6 +4826,71 @@ If a vCPU is in running state while this ioctl is invoked, the vCPU may
> > >  experience inconsistent filtering behavior on MSR accesses.
> > >
> > >
> > > +4.127 KVM_GET_TSC_STATE
> > > +----------------------------
> > > +
> > > +:Capability: KVM_CAP_PRECISE_TSC
> > > +:Architectures: x86
> > > +:Type: vcpu ioctl
> > > +:Parameters: struct kvm_tsc_state
> > > +:Returns: 0 on success, < 0 on error
> > > +
> > > +::
> > > +
> > > +  #define KVM_TSC_STATE_TIMESTAMP_VALID 1
> > > +  #define KVM_TSC_STATE_TSC_ADJUST_VALID 2
> > > +  struct kvm_tsc_state {
> > > +       __u32 flags;
> > > +       __u64 nsec;
> > > +       __u64 tsc;
> > > +       __u64 tsc_adjust;
> > > +  };
> > > +
> > > +flags values for ``struct kvm_tsc_state``:
> > > +
> > > +``KVM_TSC_STATE_TIMESTAMP_VALID``
> > > +
> > > +  ``nsec`` contains nanoseconds from unix epoch.
> > > +    Always set by KVM_GET_TSC_STATE, might be omitted in KVM_SET_TSC_STATE
> > > +
> > > +``KVM_TSC_STATE_TSC_ADJUST_VALID``
> > > +
> > > +  ``tsc_adjust`` contains valid IA32_TSC_ADJUST value
> > > +
> > > +
> > > +This ioctl allows the user space to read the guest's IA32_TSC,IA32_TSC_ADJUST,
> > > +and the current value of host's CLOCK_REALTIME clock in nanoseconds since unix
> > > +epoch.
> > > +
> > > +
> > > +4.128 KVM_SET_TSC_STATE
> > > +----------------------------
> > > +
> > > +:Capability: KVM_CAP_PRECISE_TSC
> > > +:Architectures: x86
> > > +:Type: vcpu ioctl
> > > +:Parameters: struct kvm_tsc_state
> > > +:Returns: 0 on success, < 0 on error
> > > +
> > > +::
> > > +
> > > +This ioctl allows to reconstruct the guest's IA32_TSC and TSC_ADJUST value
> > > +from the state obtained in the past by KVM_GET_TSC_STATE on the same vCPU.
> > > +
> > > +If 'KVM_TSC_STATE_TIMESTAMP_VALID' is set in flags,
> > > +KVM will adjust the guest TSC value by the time that passed since the moment
> > > +CLOCK_REALTIME timestamp was saved in the struct and current value of
> > > +CLOCK_REALTIME, and set the guest's TSC to the new value.
> > > +
> > > +Otherwise KVM will set the guest TSC value to the exact value as given
> > > +in the struct.
> > > +
> > > +if KVM_TSC_STATE_TSC_ADJUST_VALID is set, and guest supports IA32_MSR_TSC_ADJUST,
> > > +then its value will be set to the given value from the struct.
> > > +
> > > +It is assumed that either both ioctls will be run on the same machine,
> > > +or that source and destination machines have synchronized clocks.
> > > +
> > >  5. The kvm_run structure
> > >  ========================
> > >
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index a3fdc16cfd6f3..9b8a2fe3a2398 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -2438,6 +2438,21 @@ static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
> > >
> > >         return gtod_is_based_on_tsc(do_realtime(ts, tsc_timestamp));
> > >  }
> > > +
> > > +
> > > +static void kvm_get_walltime(u64 *walltime_ns, u64 *host_tsc)
> > > +{
> > > +       struct timespec64 ts;
> > > +
> > > +       if (kvm_get_walltime_and_clockread(&ts, host_tsc)) {
> > > +               *walltime_ns = timespec64_to_ns(&ts);
> > > +               return;
> > > +       }
> > > +
> > > +       *host_tsc = rdtsc();
> > > +       *walltime_ns = ktime_get_real_ns();
> > > +}
> > > +
> > >  #endif
> > >
> > >  /*
> > > @@ -3757,6 +3772,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> > >         case KVM_CAP_X86_USER_SPACE_MSR:
> > >         case KVM_CAP_X86_MSR_FILTER:
> > >         case KVM_CAP_ENFORCE_PV_FEATURE_CPUID:
> > > +#ifdef CONFIG_X86_64
> > > +       case KVM_CAP_PRECISE_TSC:
> > > +#endif
> > >                 r = 1;
> > >                 break;
> > >         case KVM_CAP_SYNC_REGS:
> > > @@ -4999,6 +5017,61 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
> > >         case KVM_GET_SUPPORTED_HV_CPUID:
> > >                 r = kvm_ioctl_get_supported_hv_cpuid(vcpu, argp);
> > >                 break;
> > > +#ifdef CONFIG_X86_64
> > > +       case KVM_GET_TSC_STATE: {
> > > +               struct kvm_tsc_state __user *user_tsc_state = argp;
> > > +               u64 host_tsc;
> > > +
> > > +               struct kvm_tsc_state tsc_state = {
> > > +                       .flags = KVM_TSC_STATE_TIMESTAMP_VALID
> > > +               };
> > > +
> > > +               kvm_get_walltime(&tsc_state.nsec, &host_tsc);
> > > +               tsc_state.tsc = kvm_read_l1_tsc(vcpu, host_tsc);
> > > +
> > > +               if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
> > > +                       tsc_state.tsc_adjust = vcpu->arch.ia32_tsc_adjust_msr;
> > > +                       tsc_state.flags |= KVM_TSC_STATE_TSC_ADJUST_VALID;
> > > +               }
> >
> > [...]
> >
> > > +
> > > +               r = -EFAULT;
> > > +               if (copy_to_user(user_tsc_state, &tsc_state, sizeof(tsc_state)))
> > > +                       goto out;
> > > +               r = 0;
> > > +               break;
> > > +       }
> > > +       case KVM_SET_TSC_STATE: {
> > > +               struct kvm_tsc_state __user *user_tsc_state = argp;
> > > +               struct kvm_tsc_state tsc_state;
> > > +               u64 host_tsc, wall_nsec;
> > > +
> > > +               u64 new_guest_tsc, new_guest_tsc_offset;
> > > +
> > > +               r = -EFAULT;
> > > +               if (copy_from_user(&tsc_state, user_tsc_state, sizeof(tsc_state)))
> > > +                       goto out;
> > > +
> > > +               kvm_get_walltime(&wall_nsec, &host_tsc);
> > > +               new_guest_tsc = tsc_state.tsc;
> > > +
> > > +               if (tsc_state.flags & KVM_TSC_STATE_TIMESTAMP_VALID) {
> > > +                       s64 diff = wall_nsec - tsc_state.nsec;
> > > +                       if (diff >= 0)
> > > +                               new_guest_tsc += nsec_to_cycles(vcpu, diff);
> > > +                       else
> > > +                               new_guest_tsc -= nsec_to_cycles(vcpu, -diff);
> > > +               }
> > > +
> > > +               new_guest_tsc_offset = new_guest_tsc - kvm_scale_tsc(vcpu, host_tsc);
> > > +               kvm_vcpu_write_tsc_offset(vcpu, new_guest_tsc_offset);
> >
> > How would a VMM maintain the phase relationship between guest TSCs
> > using these ioctls?
>
> By using the nanosecond timestamp.
>
> While I did made it optional in the V2 it was done for the sole sake of being
> able to set TSC on (re)boot to 0 from qemu, and for cases when qemu migrates
> from a VM where the feature is not enabled.
> In this case the tsc is set to the given value exactly, just like you
> can do today with KVM_SET_MSRS.
> In all other cases the nanosecond timestamp will be given.
>
> When the userspace uses the nanosecond timestamp, the phase relationship
> would not only be maintained but be exact, even if TSC reads were not
> synchronized and even if their restore on the target wasn't synchronized as well.
>
> Here is an example:
>
> Let's assume that TSC on source/target is synchronized, and that the guest TSC
> is synchronized as well.

Can this assumption be reasonably made though?

NTP could very well step or scale CLOCK_REALTIME when we are in the
middle of saving or restoring TSCs, which could possibly result in
observable drift between vCPUs. Calculating elapsed time between
save/restore once per VM would avoid this issue altogether.
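
Purely as an illustration of what I mean (made-up userspace pseudo-code;
the snapshot layout and set_vcpu_tsc() helper are hypothetical), the
wall-clock reading would be taken once for the whole VM and the same
delta applied to every vCPU:

#include <stdint.h>
#include <time.h>

#define MAX_VCPUS 288

/* hypothetical per-VM snapshot taken while the guest is paused */
struct vm_tsc_snapshot {
	uint64_t nsec;			/* one CLOCK_REALTIME stamp per VM */
	uint64_t tsc[MAX_VCPUS];	/* per-vCPU TSC values around it */
};

void set_vcpu_tsc(int vcpu, uint64_t tsc);	/* hypothetical helper */

static uint64_t realtime_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_REALTIME, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/*
 * Every vCPU is advanced by the same elapsed wall time, so an NTP step
 * in the middle of the loop cannot skew one vCPU against another.
 * (Precision and overflow handling omitted.)
 */
static void restore_vm_tscs(struct vm_tsc_snapshot *snap, uint64_t tsc_khz,
			    int nr_vcpus)
{
	uint64_t elapsed_ns = realtime_ns() - snap->nsec;
	uint64_t elapsed_cycles = elapsed_ns / 1000000 * tsc_khz;
	int i;

	for (i = 0; i < nr_vcpus; i++)
		set_vcpu_tsc(i, snap->tsc[i] + elapsed_cycles);
}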

> Let's call the guest TSC frequency F (guest TSC increments by F each second)
>
> We do KVM_GET_TSC_STATE on vcpu0 and receive (t0,tsc0).
> We do KVM_GET_TSC_STATE on vcpu1 after 1 second passed (exaggerated)
> and receive (t0 + 1s, tsc0 + F)
>
>
> We do KVM_SET_TSC_STATE at t0 + 10s on vcpu0 after migration,
> and vcpu0's guest tsc is set to tsc0 + F[(t0 + 10s) - t0] = tsc0 + 10*F
>
> We do KVM_SET_TSC_STATE at nsec0 + 12s on vcpu1 (also exaggerated)
> and  get [tsc0+F] + F[(t0 + 12s) - (t0+1s)] = tsc0 + 12*F
>
> Since 2 seconds passed by, both vCPUs have now their TSC set to tsc0 + 12*F.
>
> I use kvm's own functions to read the CLOCK_REALTIME, which are done
> in such a way that you first read host TSC once and then convert it to
> nanoseconds by scaling/offsetting it as the kernel would, thus
> there is no random error introduced here.

Agreed. In fact, my suggestion of yielding TSC offset controls to
userspace falls short in this regard, since userspace can't make the
same guarantee that the clockread was derived from its paired TSC
value.

> So except numerical errors,
> (which are unavoidable anyway, and should be neglectable) this algorithm should
> both keep the TSC in sync, and even keep its absolute time reference
> as accurate as the clock synchronization between the host and the target is.
>
> (an offset between source and destination clocks will affect
> all the TSCs in the same manner, as long as both
> source and destination clocks are stable)
>
>
> >
> > For as bugged as the old way of doing things is (i.e. the magic
> > handling of IA32_TSC), it was at least possible to keep guest TSCs in
> > sync across a live migration so long as you satisfied the conditions
> > where KVM decides to fudge the TSCs on your behalf. However, if we
> > migrate the TSCs by value with an optional timestamp to account for
> > elapsed time, it would appear that the guest TSC offset is likely to
> > be inconsistent across vCPUs as the offset calculations above do not
> > use a fixed host tsc timestamp across all vCPUs.
>
> >
> > The reason I'd suggested yielding control of the tsc offset controls
> > to userspace is that they're unaffected by such variations in per-vCPU
> > calculations. Not only that, userspace can decide how it wants TSCs to
> > be handled across a migration explicitly instead of relying on the
> > above computation being done in the kernel.
> >
> > > +
> > > +               if (tsc_state.flags & KVM_TSC_STATE_TSC_ADJUST_VALID)
> > > +                       if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST))
> > > +                               vcpu->arch.ia32_tsc_adjust_msr = tsc_state.tsc_adjust;
> >
> > How is this ioctl's handling of the TSC_ADJUST msr an improvement over
> > KVM_{GET,SET}_MSRS like before? It seems orthogonal to the rest of the
> > intended API as it isn't involved your computation above.
>
> It's more a refactoring thing. The goal is to avoid 'magic' handling
> of host accesses in KVM_{GET,SET}_MSRS and instead make them
> behave the same way as if the guest read that msr.
> That can be useful for debug and such.
>
> The second patch adds a KVM quirk, which should be disabled
> when the new API is used.
>
> When disabled, it makes it hard to use the KVM_{GET,SET}_MSRS
> to set both TSC and TSC_ADJUST at the same time to given values,
> since these msrs are tied to each other when guest writes them,
> and the quirk disables the special (untied) write we had for host writes
> to these msrs.
>
>
> Think of these new ioctls as a way to saving and restoring
> the internal TSC state, without bothering even to think what is inside.
> Kind of like we save/restore the nested state.

I agree that the quirk is useful for the guest touching TSC and
TSC_ADJUST, but host writes to the TSC_ADJUST MSR are unaffected by
any sync issues. As such, the existing KVM_{GET,SET}_MSRS plumbing that
VMMs are already using seems sufficient.
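
For reference, a minimal sketch of that existing path (error handling
omitted; 0x3b is the architectural IA32_TSC_ADJUST index):

#include <linux/kvm.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

#define MSR_IA32_TSC_ADJUST 0x0000003b

static int set_tsc_adjust(int vcpu_fd, uint64_t value)
{
	/* kvm_msrs has a flexible array of entries; one is enough here */
	struct {
		struct kvm_msrs hdr;
		struct kvm_msr_entry entry;
	} msrs;

	memset(&msrs, 0, sizeof(msrs));
	msrs.hdr.nmsrs = 1;
	msrs.entry.index = MSR_IA32_TSC_ADJUST;
	msrs.entry.data = value;

	/* returns the number of MSRs processed, 1 on success */
	return ioctl(vcpu_fd, KVM_SET_MSRS, &msrs);
}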

> Best regards,
>         Maxim Levitsky
>
>
> >
> > > +               r = 0;
> > > +               break;
> > > +       }
> > > +#endif
> > >         default:
> > >                 r = -EINVAL;
> > >         }
> > > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > > index 886802b8ffba3..bf4c38fd58291 100644
> > > --- a/include/uapi/linux/kvm.h
> > > +++ b/include/uapi/linux/kvm.h
> > > @@ -1056,6 +1056,7 @@ struct kvm_ppc_resize_hpt {
> > >  #define KVM_CAP_ENFORCE_PV_FEATURE_CPUID 190
> > >  #define KVM_CAP_SYS_HYPERV_CPUID 191
> > >  #define KVM_CAP_DIRTY_LOG_RING 192
> > > +#define KVM_CAP_PRECISE_TSC 193
> > >
> > >  #ifdef KVM_CAP_IRQ_ROUTING
> > >
> > > @@ -1169,6 +1170,16 @@ struct kvm_clock_data {
> > >         __u32 pad[9];
> > >  };
> > >
> > > +
> > > +#define KVM_TSC_STATE_TIMESTAMP_VALID 1
> > > +#define KVM_TSC_STATE_TSC_ADJUST_VALID 2
> > > +struct kvm_tsc_state {
> > > +       __u32 flags;
> > > +       __u64 nsec;
> > > +       __u64 tsc;
> > > +       __u64 tsc_adjust;
> > > +};
> > > +
> > >  /* For KVM_CAP_SW_TLB */
> > >
> > >  #define KVM_MMU_FSL_BOOKE_NOHV         0
> > > @@ -1563,6 +1574,10 @@ struct kvm_pv_cmd {
> > >  /* Available with KVM_CAP_DIRTY_LOG_RING */
> > >  #define KVM_RESET_DIRTY_RINGS          _IO(KVMIO, 0xc7)
> > >
> > > +/* Available with KVM_CAP_PRECISE_TSC*/
> > > +#define KVM_SET_TSC_STATE          _IOW(KVMIO,  0xc8, struct kvm_tsc_state)
> > > +#define KVM_GET_TSC_STATE          _IOR(KVMIO,  0xc9, struct kvm_tsc_state)
> > > +
> > >  /* Secure Encrypted Virtualization command */
> > >  enum sev_cmd_id {
> > >         /* Guest initialization commands */
> > > --
> > > 2.26.2
> > >
>
>
Oliver Upton Dec. 8, 2020, 3:58 p.m. UTC | #19
+cc Sean's new handle

On Tue, Dec 8, 2020 at 9:57 AM Oliver Upton <oupton@google.com> wrote:
>
> On Tue, Dec 8, 2020 at 5:13 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> >
> > On Mon, 2020-12-07 at 11:29 -0600, Oliver Upton wrote:
> > > On Thu, Dec 3, 2020 at 11:12 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > > > These two new ioctls allow to more precisly capture and
> > > > restore guest's TSC state.
> > > >
> > > > Both ioctls are meant to be used to accurately migrate guest TSC
> > > > even when there is a significant downtime during the migration.
> > > >
> > > > Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> > > > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > > > ---
> > > >  Documentation/virt/kvm/api.rst | 65 ++++++++++++++++++++++++++++++
> > > >  arch/x86/kvm/x86.c             | 73 ++++++++++++++++++++++++++++++++++
> > > >  include/uapi/linux/kvm.h       | 15 +++++++
> > > >  3 files changed, 153 insertions(+)
> > > >
> > > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > > > index 70254eaa5229f..ebecfe4b414ce 100644
> > > > --- a/Documentation/virt/kvm/api.rst
> > > > +++ b/Documentation/virt/kvm/api.rst
> > > > @@ -4826,6 +4826,71 @@ If a vCPU is in running state while this ioctl is invoked, the vCPU may
> > > >  experience inconsistent filtering behavior on MSR accesses.
> > > >
> > > >
> > > > +4.127 KVM_GET_TSC_STATE
> > > > +----------------------------
> > > > +
> > > > +:Capability: KVM_CAP_PRECISE_TSC
> > > > +:Architectures: x86
> > > > +:Type: vcpu ioctl
> > > > +:Parameters: struct kvm_tsc_state
> > > > +:Returns: 0 on success, < 0 on error
> > > > +
> > > > +::
> > > > +
> > > > +  #define KVM_TSC_STATE_TIMESTAMP_VALID 1
> > > > +  #define KVM_TSC_STATE_TSC_ADJUST_VALID 2
> > > > +  struct kvm_tsc_state {
> > > > +       __u32 flags;
> > > > +       __u64 nsec;
> > > > +       __u64 tsc;
> > > > +       __u64 tsc_adjust;
> > > > +  };
> > > > +
> > > > +flags values for ``struct kvm_tsc_state``:
> > > > +
> > > > +``KVM_TSC_STATE_TIMESTAMP_VALID``
> > > > +
> > > > +  ``nsec`` contains nanoseconds from unix epoch.
> > > > +    Always set by KVM_GET_TSC_STATE, might be omitted in KVM_SET_TSC_STATE
> > > > +
> > > > +``KVM_TSC_STATE_TSC_ADJUST_VALID``
> > > > +
> > > > +  ``tsc_adjust`` contains valid IA32_TSC_ADJUST value
> > > > +
> > > > +
> > > > +This ioctl allows the user space to read the guest's IA32_TSC,IA32_TSC_ADJUST,
> > > > +and the current value of host's CLOCK_REALTIME clock in nanoseconds since unix
> > > > +epoch.
> > > > +
> > > > +
> > > > +4.128 KVM_SET_TSC_STATE
> > > > +----------------------------
> > > > +
> > > > +:Capability: KVM_CAP_PRECISE_TSC
> > > > +:Architectures: x86
> > > > +:Type: vcpu ioctl
> > > > +:Parameters: struct kvm_tsc_state
> > > > +:Returns: 0 on success, < 0 on error
> > > > +
> > > > +::
> > > > +
> > > > +This ioctl allows to reconstruct the guest's IA32_TSC and TSC_ADJUST value
> > > > +from the state obtained in the past by KVM_GET_TSC_STATE on the same vCPU.
> > > > +
> > > > +If 'KVM_TSC_STATE_TIMESTAMP_VALID' is set in flags,
> > > > +KVM will adjust the guest TSC value by the time that passed since the moment
> > > > +CLOCK_REALTIME timestamp was saved in the struct and current value of
> > > > +CLOCK_REALTIME, and set the guest's TSC to the new value.
> > > > +
> > > > +Otherwise KVM will set the guest TSC value to the exact value as given
> > > > +in the struct.
> > > > +
> > > > +if KVM_TSC_STATE_TSC_ADJUST_VALID is set, and guest supports IA32_MSR_TSC_ADJUST,
> > > > +then its value will be set to the given value from the struct.
> > > > +
> > > > +It is assumed that either both ioctls will be run on the same machine,
> > > > +or that source and destination machines have synchronized clocks.
> > > > +
> > > >  5. The kvm_run structure
> > > >  ========================
> > > >
> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index a3fdc16cfd6f3..9b8a2fe3a2398 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -2438,6 +2438,21 @@ static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
> > > >
> > > >         return gtod_is_based_on_tsc(do_realtime(ts, tsc_timestamp));
> > > >  }
> > > > +
> > > > +
> > > > +static void kvm_get_walltime(u64 *walltime_ns, u64 *host_tsc)
> > > > +{
> > > > +       struct timespec64 ts;
> > > > +
> > > > +       if (kvm_get_walltime_and_clockread(&ts, host_tsc)) {
> > > > +               *walltime_ns = timespec64_to_ns(&ts);
> > > > +               return;
> > > > +       }
> > > > +
> > > > +       *host_tsc = rdtsc();
> > > > +       *walltime_ns = ktime_get_real_ns();
> > > > +}
> > > > +
> > > >  #endif
> > > >
> > > >  /*
> > > > @@ -3757,6 +3772,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> > > >         case KVM_CAP_X86_USER_SPACE_MSR:
> > > >         case KVM_CAP_X86_MSR_FILTER:
> > > >         case KVM_CAP_ENFORCE_PV_FEATURE_CPUID:
> > > > +#ifdef CONFIG_X86_64
> > > > +       case KVM_CAP_PRECISE_TSC:
> > > > +#endif
> > > >                 r = 1;
> > > >                 break;
> > > >         case KVM_CAP_SYNC_REGS:
> > > > @@ -4999,6 +5017,61 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
> > > >         case KVM_GET_SUPPORTED_HV_CPUID:
> > > >                 r = kvm_ioctl_get_supported_hv_cpuid(vcpu, argp);
> > > >                 break;
> > > > +#ifdef CONFIG_X86_64
> > > > +       case KVM_GET_TSC_STATE: {
> > > > +               struct kvm_tsc_state __user *user_tsc_state = argp;
> > > > +               u64 host_tsc;
> > > > +
> > > > +               struct kvm_tsc_state tsc_state = {
> > > > +                       .flags = KVM_TSC_STATE_TIMESTAMP_VALID
> > > > +               };
> > > > +
> > > > +               kvm_get_walltime(&tsc_state.nsec, &host_tsc);
> > > > +               tsc_state.tsc = kvm_read_l1_tsc(vcpu, host_tsc);
> > > > +
> > > > +               if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
> > > > +                       tsc_state.tsc_adjust = vcpu->arch.ia32_tsc_adjust_msr;
> > > > +                       tsc_state.flags |= KVM_TSC_STATE_TSC_ADJUST_VALID;
> > > > +               }
> > >
> > > [...]
> > >
> > > > +
> > > > +               r = -EFAULT;
> > > > +               if (copy_to_user(user_tsc_state, &tsc_state, sizeof(tsc_state)))
> > > > +                       goto out;
> > > > +               r = 0;
> > > > +               break;
> > > > +       }
> > > > +       case KVM_SET_TSC_STATE: {
> > > > +               struct kvm_tsc_state __user *user_tsc_state = argp;
> > > > +               struct kvm_tsc_state tsc_state;
> > > > +               u64 host_tsc, wall_nsec;
> > > > +
> > > > +               u64 new_guest_tsc, new_guest_tsc_offset;
> > > > +
> > > > +               r = -EFAULT;
> > > > +               if (copy_from_user(&tsc_state, user_tsc_state, sizeof(tsc_state)))
> > > > +                       goto out;
> > > > +
> > > > +               kvm_get_walltime(&wall_nsec, &host_tsc);
> > > > +               new_guest_tsc = tsc_state.tsc;
> > > > +
> > > > +               if (tsc_state.flags & KVM_TSC_STATE_TIMESTAMP_VALID) {
> > > > +                       s64 diff = wall_nsec - tsc_state.nsec;
> > > > +                       if (diff >= 0)
> > > > +                               new_guest_tsc += nsec_to_cycles(vcpu, diff);
> > > > +                       else
> > > > +                               new_guest_tsc -= nsec_to_cycles(vcpu, -diff);
> > > > +               }
> > > > +
> > > > +               new_guest_tsc_offset = new_guest_tsc - kvm_scale_tsc(vcpu, host_tsc);
> > > > +               kvm_vcpu_write_tsc_offset(vcpu, new_guest_tsc_offset);
> > >
> > > How would a VMM maintain the phase relationship between guest TSCs
> > > using these ioctls?
> >
> > By using the nanosecond timestamp.
> >
> > While I did made it optional in the V2 it was done for the sole sake of being
> > able to set TSC on (re)boot to 0 from qemu, and for cases when qemu migrates
> > from a VM where the feature is not enabled.
> > In this case the tsc is set to the given value exactly, just like you
> > can do today with KVM_SET_MSRS.
> > In all other cases the nanosecond timestamp will be given.
> >
> > When the userspace uses the nanosecond timestamp, the phase relationship
> > would not only be maintained but be exact, even if TSC reads were not
> > synchronized and even if their restore on the target wasn't synchronized as well.
> >
> > Here is an example:
> >
> > Let's assume that TSC on source/target is synchronized, and that the guest TSC
> > is synchronized as well.
>
> Can this assumption be reasonably made though?
>
> NTP could very well step or scale CLOCK_REALTIME when we are in the
> middle of saving or restoring TSCs, which could possibly result in
> observable drift between vCPUs. Calculating elapsed time between
> save/restore once per VM would avoid this issue altogether.
>
> > Let's call the guest TSC frequency F (guest TSC increments by F each second)
> >
> > We do KVM_GET_TSC_STATE on vcpu0 and receive (t0,tsc0).
> > We do KVM_GET_TSC_STATE on vcpu1 after 1 second passed (exaggerated)
> > and receive (t0 + 1s, tsc0 + F)
> >
> >
> > We do KVM_SET_TSC_STATE at t0 + 10s on vcpu0 after migration,
> > and vcpu0's guest tsc is set to tsc0 + F[(t0 + 10s) - t0] = tsc0 + 10*F
> >
> > We do KVM_SET_TSC_STATE at nsec0 + 12s on vcpu1 (also exaggerated)
> > and  get [tsc0+F] + F[(t0 + 12s) - (t0+1s)] = tsc0 + 12*F
> >
> > Since 2 seconds passed by, both vCPUs have now their TSC set to tsc0 + 12*F.
> >
> > I use kvm's own functions to read the CLOCK_REALTIME, which are done
> > in such a way that you first read host TSC once and then convert it to
> > nanoseconds by scaling/offsetting it as the kernel would, thus
> > there is no random error introduced here.
>
> Agreed. In fact, my suggestion of yielding TSC offset controls to
> userspace falls short in this regard, since userspace can't make the
> same guarantee that the clockread was derived from its paired TSC
> value.
>
> > So except numerical errors,
> > (which are unavoidable anyway, and should be neglectable) this algorithm should
> > both keep the TSC in sync, and even keep its absolute time reference
> > as accurate as the clock synchronization between the host and the target is.
> >
> > (an offset between source and destination clocks will affect
> > all the TSCs in the same manner, as long as both
> > source and destination clocks are stable)
> >
> >
> > >
> > > For as bugged as the old way of doing things is (i.e. the magic
> > > handling of IA32_TSC), it was at least possible to keep guest TSCs in
> > > sync across a live migration so long as you satisfied the conditions
> > > where KVM decides to fudge the TSCs on your behalf. However, if we
> > > migrate the TSCs by value with an optional timestamp to account for
> > > elapsed time, it would appear that the guest TSC offset is likely to
> > > be inconsistent across vCPUs as the offset calculations above do not
> > > use a fixed host tsc timestamp across all vCPUs.
> >
> > >
> > > The reason I'd suggested yielding control of the tsc offset controls
> > > to userspace is that they're unaffected by such variations in per-vCPU
> > > calculations. Not only that, userspace can decide how it wants TSCs to
> > > be handled across a migration explicitly instead of relying on the
> > > above computation being done in the kernel.
> > >
> > > > +
> > > > +               if (tsc_state.flags & KVM_TSC_STATE_TSC_ADJUST_VALID)
> > > > +                       if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST))
> > > > +                               vcpu->arch.ia32_tsc_adjust_msr = tsc_state.tsc_adjust;
> > >
> > > How is this ioctl's handling of the TSC_ADJUST msr an improvement over
> > > KVM_{GET,SET}_MSRS like before? It seems orthogonal to the rest of the
> > > intended API as it isn't involved your computation above.
> >
> > It's more a refactoring thing. The goal is to avoid 'magic' handling
> > of host accesses in KVM_{GET,SET}_MSRS and instead make them
> > behave the same way as if the guest read that msr.
> > That can be useful for debug and such.
> >
> > The second patch adds a KVM quirk, which should be disabled
> > when the new API is used.
> >
> > When disabled, it makes it hard to use the KVM_{GET,SET}_MSRS
> > to set both TSC and TSC_ADJUST at the same time to given values,
> > since these msrs are tied to each other when guest writes them,
> > and the quirk disables the special (untied) write we had for host writes
> > to these msrs.
> >
> >
> > Think of these new ioctls as a way to saving and restoring
> > the internal TSC state, without bothering even to think what is inside.
> > Kind of like we save/restore the nested state.
>
> I agree that the quirk is useful for the guest touching TSC and
> TSC_ADJUST, but host writes to the TSC_ADJUST MSR are unaffected by
> any sync issues. As such, it seems the existing plumbing for
> KVM_{GET,SET}_MSRS VMMs are using seems sufficient.
>
> > Best regards,
> >         Maxim Levitsky
> >
> >
> > >
> > > > +               r = 0;
> > > > +               break;
> > > > +       }
> > > > +#endif
> > > >         default:
> > > >                 r = -EINVAL;
> > > >         }
> > > > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > > > index 886802b8ffba3..bf4c38fd58291 100644
> > > > --- a/include/uapi/linux/kvm.h
> > > > +++ b/include/uapi/linux/kvm.h
> > > > @@ -1056,6 +1056,7 @@ struct kvm_ppc_resize_hpt {
> > > >  #define KVM_CAP_ENFORCE_PV_FEATURE_CPUID 190
> > > >  #define KVM_CAP_SYS_HYPERV_CPUID 191
> > > >  #define KVM_CAP_DIRTY_LOG_RING 192
> > > > +#define KVM_CAP_PRECISE_TSC 193
> > > >
> > > >  #ifdef KVM_CAP_IRQ_ROUTING
> > > >
> > > > @@ -1169,6 +1170,16 @@ struct kvm_clock_data {
> > > >         __u32 pad[9];
> > > >  };
> > > >
> > > > +
> > > > +#define KVM_TSC_STATE_TIMESTAMP_VALID 1
> > > > +#define KVM_TSC_STATE_TSC_ADJUST_VALID 2
> > > > +struct kvm_tsc_state {
> > > > +       __u32 flags;
> > > > +       __u64 nsec;
> > > > +       __u64 tsc;
> > > > +       __u64 tsc_adjust;
> > > > +};
> > > > +
> > > >  /* For KVM_CAP_SW_TLB */
> > > >
> > > >  #define KVM_MMU_FSL_BOOKE_NOHV         0
> > > > @@ -1563,6 +1574,10 @@ struct kvm_pv_cmd {
> > > >  /* Available with KVM_CAP_DIRTY_LOG_RING */
> > > >  #define KVM_RESET_DIRTY_RINGS          _IO(KVMIO, 0xc7)
> > > >
> > > > +/* Available with KVM_CAP_PRECISE_TSC*/
> > > > +#define KVM_SET_TSC_STATE          _IOW(KVMIO,  0xc8, struct kvm_tsc_state)
> > > > +#define KVM_GET_TSC_STATE          _IOR(KVMIO,  0xc9, struct kvm_tsc_state)
> > > > +
> > > >  /* Secure Encrypted Virtualization command */
> > > >  enum sev_cmd_id {
> > > >         /* Guest initialization commands */
> > > > --
> > > > 2.26.2
> > > >
> >
> >
Thomas Gleixner Dec. 8, 2020, 4:02 p.m. UTC | #20
On Tue, Dec 08 2020 at 16:50, Maxim Levitsky wrote:
> On Mon, 2020-12-07 at 20:29 -0300, Marcelo Tosatti wrote:
>> > +This ioctl allows to reconstruct the guest's IA32_TSC and TSC_ADJUST value
>> > +from the state obtained in the past by KVM_GET_TSC_STATE on the same vCPU.
>> > +
>> > +If 'KVM_TSC_STATE_TIMESTAMP_VALID' is set in flags,
>> > +KVM will adjust the guest TSC value by the time that passed since the moment
>> > +CLOCK_REALTIME timestamp was saved in the struct and current value of
>> > +CLOCK_REALTIME, and set the guest's TSC to the new value.
>> 
>> This introduces the wraparound bug in Linux timekeeping, doesnt it?

Which bug?

> It does.
> Could you prepare a reproducer for this bug so I get a better idea about
> what are you talking about?
>
> I assume you need very long (like days worth) jump to trigger this bug
> and for such case we can either work around it in qemu / kernel 
> or fix it in the guest kernel and I strongly prefer the latter.
>
> Thomas, what do you think about it?

For one, I have no idea which bug you are talking about, and if the bug is
caused by the VMM then why would you "fix" it in the guest kernel?

Aside from that, I think I made it pretty clear what the right thing to do
is.

Thanks,

        tglx
Maxim Levitsky Dec. 8, 2020, 4:25 p.m. UTC | #21
On Tue, 2020-12-08 at 17:02 +0100, Thomas Gleixner wrote:
> On Tue, Dec 08 2020 at 16:50, Maxim Levitsky wrote:
> > On Mon, 2020-12-07 at 20:29 -0300, Marcelo Tosatti wrote:
> > > > +This ioctl allows to reconstruct the guest's IA32_TSC and TSC_ADJUST value
> > > > +from the state obtained in the past by KVM_GET_TSC_STATE on the same vCPU.
> > > > +
> > > > +If 'KVM_TSC_STATE_TIMESTAMP_VALID' is set in flags,
> > > > +KVM will adjust the guest TSC value by the time that passed since the moment
> > > > +CLOCK_REALTIME timestamp was saved in the struct and current value of
> > > > +CLOCK_REALTIME, and set the guest's TSC to the new value.
> > > 
> > > This introduces the wraparound bug in Linux timekeeping, doesnt it?
> 
> Which bug?
> 
> > It does.
> > Could you prepare a reproducer for this bug so I get a better idea about
> > what are you talking about?
> > 
> > I assume you need very long (like days worth) jump to trigger this bug
> > and for such case we can either work around it in qemu / kernel 
> > or fix it in the guest kernel and I strongly prefer the latter.
> > 
> > Thomas, what do you think about it?
> 
> For one I have no idea which bug you are talking about and if the bug is
> caused by the VMM then why would you "fix" it in the guest kernel.

The "bug" is that if VMM moves a hardware time counter (tsc or anything else) 
forward by large enough value in one go, 
then the guest kernel will supposingly have an overflow in the time code.
I don't consider this to be a buggy VMM behavior, but rather a kernel
bug that should be fixed (if this bug actually exists)
 
Purely in theory this can even happen on real hardware if for example SMM handler
blocks a CPU from running for a long duration, or hardware debugging
interface does, or some other hardware transparent sleep mechanism kicks in
and blocks a CPU from running.
(We do handle this gracefully for S3/S4)
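
To make the concern concrete, here is a rough illustration (not a
reproducer; the mult/shift values are made-up stand-ins for a ~3 GHz TSC
clocksource) of how a single huge forward jump can overflow the 64-bit
cycles-to-nanoseconds conversion:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* hypothetical clocksource parameters, roughly 1/3 ns per cycle */
	uint32_t mult = 5592405;
	uint32_t shift = 24;

	/* a three-day jump expressed in cycles of a 3 GHz TSC */
	uint64_t delta = 3000000000ull * 60 * 60 * 24 * 3;

	/* clocksource math is (delta * mult) >> shift; the product wraps */
	uint64_t ns = (delta * mult) >> shift;

	printf("delta=%llu cycles -> %llu ns (wrapped)\n",
	       (unsigned long long)delta, (unsigned long long)ns);
	return 0;
}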

> 
> Aside of that I think I made it pretty clear what the right thing to do
> is.

This is orthogonal to the issue of the 'bug'.
Here we are not talking about per-vCPU TSC offsets, which, as I said,
I do agree with you would be very nice to get rid of.

We are talking about the fact that the TSC can jump forward by an
arbitrarily large value if the migration took an arbitrary amount of time,
which (assuming that the bug is real) can crash the guest kernel.

This will happen even if we use a per-VM global TSC offset.

So what do you think?

Best regards,
	Maxim Levitsky

> 
> Thanks,
> 
>         tglx
>
Thomas Gleixner Dec. 8, 2020, 4:40 p.m. UTC | #22
On Tue, Dec 08 2020 at 13:13, Maxim Levitsky wrote:
> On Mon, 2020-12-07 at 11:29 -0600, Oliver Upton wrote:
>> 
>> How would a VMM maintain the phase relationship between guest TSCs
>> using these ioctls?
>
> By using the nanosecond timestamp. 
>  
> While I did made it optional in the V2 it was done for the sole sake of being 
> able to set TSC on (re)boot to 0 from qemu, and for cases when qemu migrates 
> from a VM where the feature is not enabled.
> In this case the tsc is set to the given value exactly, just like you
> can do today with KVM_SET_MSRS.
> In all other cases the nanosecond timestamp will be given.
>  
> When the userspace uses the nanosecond timestamp, the phase relationship
> would not only be maintained but be exact, even if TSC reads were not
> synchronized and even if their restore on the target wasn't synchronized as well.
>  
> Here is an example:
>  
> Let's assume that TSC on source/target is synchronized, and that the guest TSC
> is synchronized as well.
>
> Let's call the guest TSC frequency F (guest TSC increments by F each second)
>  
> We do KVM_GET_TSC_STATE on vcpu0 and receive (t0,tsc0).
> We do KVM_GET_TSC_STATE on vcpu1 after 1 second passed (exaggerated) 
> and receive (t0 + 1s, tsc0 + F)

Why?

You freeze the VM and store the realtime timestamp of doing that. At
that point, assuming a fully synchronized host system, the only interesting
thing to store is the guest offset, which is the same on all vCPUs and is
known already.

So on restore the only thing which needs to be adjusted is the guest
wide offset.

     newoffset = oldoffset + (now - tfreeze)

Then set newoffset for all vCPUs. Anything else is complexity for no
value and bound to fall apart in hard to debug ways.

The offset is still the same for all vCPUs whether you can restore them
in the same nanosecond or whether you need 3 minutes for each one. It
does not matter because when you restore vCPU1 3 minutes after vCPU0
then TSC has advanced 3 minutes as well. It's still correct from the
guest POV.

Even if you support TSCADJUST and let the guest write to it, that does not
change the per-guest offset at all. TSCADJUST is per [v]CPU and adds on
top:

    tscvcpu = tsc_host + guest_offset + TSC_ADJUST

Scaling is just orthogonal and does not change any of this.
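
A minimal sketch of that restore step (illustrative only, not existing
KVM code; it assumes the realtime delta has already been converted from
nanoseconds to guest TSC cycles):

/* mirrors "newoffset = oldoffset + (now - tfreeze)" above */
static void restore_guest_tsc_offset(struct kvm *kvm, u64 old_offset,
				     u64 frozen_cycles)
{
	struct kvm_vcpu *vcpu;
	u64 new_offset = old_offset + frozen_cycles;
	int i;

	/* one guest-wide value, programmed into every vCPU */
	kvm_for_each_vcpu(i, vcpu, kvm)
		kvm_vcpu_write_tsc_offset(vcpu, new_offset);
}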

Thanks,

        tglx
Maxim Levitsky Dec. 8, 2020, 5:08 p.m. UTC | #23
On Tue, 2020-12-08 at 17:40 +0100, Thomas Gleixner wrote:
> On Tue, Dec 08 2020 at 13:13, Maxim Levitsky wrote:
> > On Mon, 2020-12-07 at 11:29 -0600, Oliver Upton wrote:
> > > How would a VMM maintain the phase relationship between guest TSCs
> > > using these ioctls?
> > 
> > By using the nanosecond timestamp. 
> >  
> > While I did made it optional in the V2 it was done for the sole sake of being 
> > able to set TSC on (re)boot to 0 from qemu, and for cases when qemu migrates 
> > from a VM where the feature is not enabled.
> > In this case the tsc is set to the given value exactly, just like you
> > can do today with KVM_SET_MSRS.
> > In all other cases the nanosecond timestamp will be given.
> >  
> > When the userspace uses the nanosecond timestamp, the phase relationship
> > would not only be maintained but be exact, even if TSC reads were not
> > synchronized and even if their restore on the target wasn't synchronized as well.
> >  
> > Here is an example:
> >  
> > Let's assume that TSC on source/target is synchronized, and that the guest TSC
> > is synchronized as well.
> > 
> > Let's call the guest TSC frequency F (guest TSC increments by F each second)
> >  
> > We do KVM_GET_TSC_STATE on vcpu0 and receive (t0,tsc0).
> > We do KVM_GET_TSC_STATE on vcpu1 after 1 second passed (exaggerated) 
> > and receive (t0 + 1s, tsc0 + F)
> 
> Why?
> 
> You freeeze the VM and store the realtime timestamp of doing that. At
> that point assuming a full sync host system the only interesting thing
> to store is the guest offset which is the same on all vCPUs and it is
> known already.
> 
> So on restore the only thing which needs to be adjusted is the guest
> wide offset.
> 
>      newoffset = oldoffset + (now - tfreeze)
> 
> Then set newoffset for all vCPUs. Anything else is complexity for no
> value and bound to fall apart in hard to debug ways.
> 
> The offset is still the same for all vCPUs whether you can restore them
> in the same nanosecond or whether you need 3 minutes for each one. It
> does not matter because when you restore vCPU1 3 minutes after vCPU0
> then TSC has advanced 3 minutes as well. It's still correct from the
> guest POV.
> 
> Even if you support TSCADJUST and let the guest write to it does not
> change the per guest offset at all. TSCADJUST is per [v]CPU and adds on
> top:
> 
>     tscvcpu = tsc_host + guest_offset + TSC_ADJUST
> 
> Scaling is just orthogonal and does not change any of this.

I agree with this, and I think that this is what we will end up doing.
Paolo, what do you think about this?

Best regards,
	Maxim Levitsky

> 
> Thanks,
> 
>         tglx
>
Maxim Levitsky Dec. 8, 2020, 5:10 p.m. UTC | #24
On Tue, 2020-12-08 at 09:58 -0600, Oliver Upton wrote:
> +cc Sean's new handle
> 
> On Tue, Dec 8, 2020 at 9:57 AM Oliver Upton <oupton@google.com> wrote:
> > On Tue, Dec 8, 2020 at 5:13 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > > On Mon, 2020-12-07 at 11:29 -0600, Oliver Upton wrote:
> > > > On Thu, Dec 3, 2020 at 11:12 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > > > > These two new ioctls allow to more precisly capture and
> > > > > restore guest's TSC state.
> > > > > 
> > > > > Both ioctls are meant to be used to accurately migrate guest TSC
> > > > > even when there is a significant downtime during the migration.
> > > > > 
> > > > > Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> > > > > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > > > > ---
> > > > >  Documentation/virt/kvm/api.rst | 65 ++++++++++++++++++++++++++++++
> > > > >  arch/x86/kvm/x86.c             | 73 ++++++++++++++++++++++++++++++++++
> > > > >  include/uapi/linux/kvm.h       | 15 +++++++
> > > > >  3 files changed, 153 insertions(+)
> > > > > 
> > > > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > > > > index 70254eaa5229f..ebecfe4b414ce 100644
> > > > > --- a/Documentation/virt/kvm/api.rst
> > > > > +++ b/Documentation/virt/kvm/api.rst
> > > > > @@ -4826,6 +4826,71 @@ If a vCPU is in running state while this ioctl is invoked, the vCPU may
> > > > >  experience inconsistent filtering behavior on MSR accesses.
> > > > > 
> > > > > 
> > > > > +4.127 KVM_GET_TSC_STATE
> > > > > +----------------------------
> > > > > +
> > > > > +:Capability: KVM_CAP_PRECISE_TSC
> > > > > +:Architectures: x86
> > > > > +:Type: vcpu ioctl
> > > > > +:Parameters: struct kvm_tsc_state
> > > > > +:Returns: 0 on success, < 0 on error
> > > > > +
> > > > > +::
> > > > > +
> > > > > +  #define KVM_TSC_STATE_TIMESTAMP_VALID 1
> > > > > +  #define KVM_TSC_STATE_TSC_ADJUST_VALID 2
> > > > > +  struct kvm_tsc_state {
> > > > > +       __u32 flags;
> > > > > +       __u64 nsec;
> > > > > +       __u64 tsc;
> > > > > +       __u64 tsc_adjust;
> > > > > +  };
> > > > > +
> > > > > +flags values for ``struct kvm_tsc_state``:
> > > > > +
> > > > > +``KVM_TSC_STATE_TIMESTAMP_VALID``
> > > > > +
> > > > > +  ``nsec`` contains nanoseconds from unix epoch.
> > > > > +    Always set by KVM_GET_TSC_STATE, might be omitted in KVM_SET_TSC_STATE
> > > > > +
> > > > > +``KVM_TSC_STATE_TSC_ADJUST_VALID``
> > > > > +
> > > > > +  ``tsc_adjust`` contains valid IA32_TSC_ADJUST value
> > > > > +
> > > > > +
> > > > > +This ioctl allows the user space to read the guest's IA32_TSC,IA32_TSC_ADJUST,
> > > > > +and the current value of host's CLOCK_REALTIME clock in nanoseconds since unix
> > > > > +epoch.
> > > > > +
> > > > > +
> > > > > +4.128 KVM_SET_TSC_STATE
> > > > > +----------------------------
> > > > > +
> > > > > +:Capability: KVM_CAP_PRECISE_TSC
> > > > > +:Architectures: x86
> > > > > +:Type: vcpu ioctl
> > > > > +:Parameters: struct kvm_tsc_state
> > > > > +:Returns: 0 on success, < 0 on error
> > > > > +
> > > > > +::
> > > > > +
> > > > > +This ioctl allows to reconstruct the guest's IA32_TSC and TSC_ADJUST value
> > > > > +from the state obtained in the past by KVM_GET_TSC_STATE on the same vCPU.
> > > > > +
> > > > > +If 'KVM_TSC_STATE_TIMESTAMP_VALID' is set in flags,
> > > > > +KVM will adjust the guest TSC value by the time that passed since the moment
> > > > > +CLOCK_REALTIME timestamp was saved in the struct and current value of
> > > > > +CLOCK_REALTIME, and set the guest's TSC to the new value.
> > > > > +
> > > > > +Otherwise KVM will set the guest TSC value to the exact value as given
> > > > > +in the struct.
> > > > > +
> > > > > +if KVM_TSC_STATE_TSC_ADJUST_VALID is set, and guest supports IA32_MSR_TSC_ADJUST,
> > > > > +then its value will be set to the given value from the struct.
> > > > > +
> > > > > +It is assumed that either both ioctls will be run on the same machine,
> > > > > +or that source and destination machines have synchronized clocks.
> > > > > +
> > > > >  5. The kvm_run structure
> > > > >  ========================
> > > > > 
> > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > index a3fdc16cfd6f3..9b8a2fe3a2398 100644
> > > > > --- a/arch/x86/kvm/x86.c
> > > > > +++ b/arch/x86/kvm/x86.c
> > > > > @@ -2438,6 +2438,21 @@ static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
> > > > > 
> > > > >         return gtod_is_based_on_tsc(do_realtime(ts, tsc_timestamp));
> > > > >  }
> > > > > +
> > > > > +
> > > > > +static void kvm_get_walltime(u64 *walltime_ns, u64 *host_tsc)
> > > > > +{
> > > > > +       struct timespec64 ts;
> > > > > +
> > > > > +       if (kvm_get_walltime_and_clockread(&ts, host_tsc)) {
> > > > > +               *walltime_ns = timespec64_to_ns(&ts);
> > > > > +               return;
> > > > > +       }
> > > > > +
> > > > > +       *host_tsc = rdtsc();
> > > > > +       *walltime_ns = ktime_get_real_ns();
> > > > > +}
> > > > > +
> > > > >  #endif
> > > > > 
> > > > >  /*
> > > > > @@ -3757,6 +3772,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> > > > >         case KVM_CAP_X86_USER_SPACE_MSR:
> > > > >         case KVM_CAP_X86_MSR_FILTER:
> > > > >         case KVM_CAP_ENFORCE_PV_FEATURE_CPUID:
> > > > > +#ifdef CONFIG_X86_64
> > > > > +       case KVM_CAP_PRECISE_TSC:
> > > > > +#endif
> > > > >                 r = 1;
> > > > >                 break;
> > > > >         case KVM_CAP_SYNC_REGS:
> > > > > @@ -4999,6 +5017,61 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
> > > > >         case KVM_GET_SUPPORTED_HV_CPUID:
> > > > >                 r = kvm_ioctl_get_supported_hv_cpuid(vcpu, argp);
> > > > >                 break;
> > > > > +#ifdef CONFIG_X86_64
> > > > > +       case KVM_GET_TSC_STATE: {
> > > > > +               struct kvm_tsc_state __user *user_tsc_state = argp;
> > > > > +               u64 host_tsc;
> > > > > +
> > > > > +               struct kvm_tsc_state tsc_state = {
> > > > > +                       .flags = KVM_TSC_STATE_TIMESTAMP_VALID
> > > > > +               };
> > > > > +
> > > > > +               kvm_get_walltime(&tsc_state.nsec, &host_tsc);
> > > > > +               tsc_state.tsc = kvm_read_l1_tsc(vcpu, host_tsc);
> > > > > +
> > > > > +               if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
> > > > > +                       tsc_state.tsc_adjust = vcpu->arch.ia32_tsc_adjust_msr;
> > > > > +                       tsc_state.flags |= KVM_TSC_STATE_TSC_ADJUST_VALID;
> > > > > +               }
> > > > 
> > > > [...]
> > > > 
> > > > > +
> > > > > +               r = -EFAULT;
> > > > > +               if (copy_to_user(user_tsc_state, &tsc_state, sizeof(tsc_state)))
> > > > > +                       goto out;
> > > > > +               r = 0;
> > > > > +               break;
> > > > > +       }
> > > > > +       case KVM_SET_TSC_STATE: {
> > > > > +               struct kvm_tsc_state __user *user_tsc_state = argp;
> > > > > +               struct kvm_tsc_state tsc_state;
> > > > > +               u64 host_tsc, wall_nsec;
> > > > > +
> > > > > +               u64 new_guest_tsc, new_guest_tsc_offset;
> > > > > +
> > > > > +               r = -EFAULT;
> > > > > +               if (copy_from_user(&tsc_state, user_tsc_state, sizeof(tsc_state)))
> > > > > +                       goto out;
> > > > > +
> > > > > +               kvm_get_walltime(&wall_nsec, &host_tsc);
> > > > > +               new_guest_tsc = tsc_state.tsc;
> > > > > +
> > > > > +               if (tsc_state.flags & KVM_TSC_STATE_TIMESTAMP_VALID) {
> > > > > +                       s64 diff = wall_nsec - tsc_state.nsec;
> > > > > +                       if (diff >= 0)
> > > > > +                               new_guest_tsc += nsec_to_cycles(vcpu, diff);
> > > > > +                       else
> > > > > +                               new_guest_tsc -= nsec_to_cycles(vcpu, -diff);
> > > > > +               }
> > > > > +
> > > > > +               new_guest_tsc_offset = new_guest_tsc - kvm_scale_tsc(vcpu, host_tsc);
> > > > > +               kvm_vcpu_write_tsc_offset(vcpu, new_guest_tsc_offset);
> > > > 
> > > > How would a VMM maintain the phase relationship between guest TSCs
> > > > using these ioctls?
> > > 
> > > By using the nanosecond timestamp.
> > > 
> > > While I did made it optional in the V2 it was done for the sole sake of being
> > > able to set TSC on (re)boot to 0 from qemu, and for cases when qemu migrates
> > > from a VM where the feature is not enabled.
> > > In this case the tsc is set to the given value exactly, just like you
> > > can do today with KVM_SET_MSRS.
> > > In all other cases the nanosecond timestamp will be given.
> > > 
> > > When userspace uses the nanosecond timestamp, the phase relationship
> > > would not only be maintained but be exact, even if TSC reads were not
> > > synchronized and even if their restore on the target wasn't synchronized either.
> > > 
> > > Here is an example:
> > > 
> > > Let's assume that TSC on source/target is synchronized, and that the guest TSC
> > > is synchronized as well.
> > 
> > Can this assumption be reasonably made though?
> > 
> > NTP could very well step or scale CLOCK_REALTIME when we are in the
> > middle of saving or restoring TSCs, which could possibly result in
> > observable drift between vCPUs. Calculating elapsed time between
> > save/restore once per VM would avoid this issue altogether.

You raise a good point, which adds even more justification
to the solution that Thomas is proposing.

Thanks!

Best regards,
	Maxim Levitsky


> > 
> > > Let's call the guest TSC frequency F (guest TSC increments by F each second)
> > > 
> > > We do KVM_GET_TSC_STATE on vcpu0 and receive (t0,tsc0).
> > > We do KVM_GET_TSC_STATE on vcpu1 after 1 second passed (exaggerated)
> > > and receive (t0 + 1s, tsc0 + F)
> > > 
> > > 
> > > We do KVM_SET_TSC_STATE at t0 + 10s on vcpu0 after migration,
> > > and vcpu0's guest tsc is set to tsc0 + F[(t0 + 10s) - t0] = tsc0 + 10*F
> > > 
> > > We do KVM_SET_TSC_STATE at t0 + 12s on vcpu1 (also exaggerated)
> > > and get [tsc0+F] + F[(t0 + 12s) - (t0+1s)] = tsc0 + 12*F
> > > 
> > > Since 2 seconds passed by, both vCPUs have now their TSC set to tsc0 + 12*F.
> > > 
> > > I use kvm's own functions to read the CLOCK_REALTIME, which are done
> > > in such a way that you first read host TSC once and then convert it to
> > > nanoseconds by scaling/offsetting it as the kernel would, thus
> > > there is no random error introduced here.
> > 
> > Agreed. In fact, my suggestion of yielding TSC offset controls to
> > userspace falls short in this regard, since userspace can't make the
> > same guarantee that the clockread was derived from its paired TSC
> > value.
> > 
> > > So except for numerical errors
> > > (which are unavoidable anyway, and should be negligible) this algorithm should
> > > both keep the TSC in sync, and even keep its absolute time reference
> > > as accurate as the clock synchronization between the host and the target is.
> > > 
> > > (an offset between source and destination clocks will affect
> > > all the TSCs in the same manner, as long as both
> > > source and destination clocks are stable)
> > > 
> > > 
> > > > For as bugged as the old way of doing things is (i.e. the magic
> > > > handling of IA32_TSC), it was at least possible to keep guest TSCs in
> > > > sync across a live migration so long as you satisfied the conditions
> > > > where KVM decides to fudge the TSCs on your behalf. However, if we
> > > > migrate the TSCs by value with an optional timestamp to account for
> > > > elapsed time, it would appear that the guest TSC offset is likely to
> > > > be inconsistent across vCPUs as the offset calculations above do not
> > > > use a fixed host tsc timestamp across all vCPUs.
> > > > The reason I'd suggested yielding control of the tsc offset controls
> > > > to userspace is that they're unaffected by such variations in per-vCPU
> > > > calculations. Not only that, userspace can decide how it wants TSCs to
> > > > be handled across a migration explicitly instead of relying on the
> > > > above computation being done in the kernel.
> > > > 
> > > > > +
> > > > > +               if (tsc_state.flags & KVM_TSC_STATE_TSC_ADJUST_VALID)
> > > > > +                       if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST))
> > > > > +                               vcpu->arch.ia32_tsc_adjust_msr = tsc_state.tsc_adjust;
> > > > 
> > > > How is this ioctl's handling of the TSC_ADJUST msr an improvement over
> > > > KVM_{GET,SET}_MSRS like before? It seems orthogonal to the rest of the
> > > > intended API as it isn't involved your computation above.
> > > 
> > > It's more a refactoring thing. The goal is to avoid 'magic' handling
> > > of host accesses in KVM_{GET,SET}_MSRS and instead make them
> > > behave the same way as if the guest read that msr.
> > > That can be useful for debug and such.
> > > 
> > > The second patch adds a KVM quirk, which should be disabled
> > > when the new API is used.
> > > 
> > > When disabled, it makes it hard to use the KVM_{GET,SET}_MSRS
> > > to set both TSC and TSC_ADJUST at the same time to given values,
> > > since these msrs are tied to each other when the guest writes them,
> > > and the quirk disables the special (untied) write we had for host writes
> > > to these msrs.
> > > 
> > > 
> > > Think of these new ioctls as a way of saving and restoring
> > > the internal TSC state, without bothering even to think what is inside.
> > > Kind of like we save/restore the nested state.
> > 
> > I agree that the quirk is useful for the guest touching TSC and
> > TSC_ADJUST, but host writes to the TSC_ADJUST MSR are unaffected by
> > any sync issues. As such, the existing KVM_{GET,SET}_MSRS plumbing
> > that VMMs are using seems sufficient.
> > 
> > > Best regards,
> > >         Maxim Levitsky
> > > 
> > > 
> > > > > +               r = 0;
> > > > > +               break;
> > > > > +       }
> > > > > +#endif
> > > > >         default:
> > > > >                 r = -EINVAL;
> > > > >         }
> > > > > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > > > > index 886802b8ffba3..bf4c38fd58291 100644
> > > > > --- a/include/uapi/linux/kvm.h
> > > > > +++ b/include/uapi/linux/kvm.h
> > > > > @@ -1056,6 +1056,7 @@ struct kvm_ppc_resize_hpt {
> > > > >  #define KVM_CAP_ENFORCE_PV_FEATURE_CPUID 190
> > > > >  #define KVM_CAP_SYS_HYPERV_CPUID 191
> > > > >  #define KVM_CAP_DIRTY_LOG_RING 192
> > > > > +#define KVM_CAP_PRECISE_TSC 193
> > > > > 
> > > > >  #ifdef KVM_CAP_IRQ_ROUTING
> > > > > 
> > > > > @@ -1169,6 +1170,16 @@ struct kvm_clock_data {
> > > > >         __u32 pad[9];
> > > > >  };
> > > > > 
> > > > > +
> > > > > +#define KVM_TSC_STATE_TIMESTAMP_VALID 1
> > > > > +#define KVM_TSC_STATE_TSC_ADJUST_VALID 2
> > > > > +struct kvm_tsc_state {
> > > > > +       __u32 flags;
> > > > > +       __u64 nsec;
> > > > > +       __u64 tsc;
> > > > > +       __u64 tsc_adjust;
> > > > > +};
> > > > > +
> > > > >  /* For KVM_CAP_SW_TLB */
> > > > > 
> > > > >  #define KVM_MMU_FSL_BOOKE_NOHV         0
> > > > > @@ -1563,6 +1574,10 @@ struct kvm_pv_cmd {
> > > > >  /* Available with KVM_CAP_DIRTY_LOG_RING */
> > > > >  #define KVM_RESET_DIRTY_RINGS          _IO(KVMIO, 0xc7)
> > > > > 
> > > > > +/* Available with KVM_CAP_PRECISE_TSC*/
> > > > > +#define KVM_SET_TSC_STATE          _IOW(KVMIO,  0xc8, struct kvm_tsc_state)
> > > > > +#define KVM_GET_TSC_STATE          _IOR(KVMIO,  0xc9, struct kvm_tsc_state)
> > > > > +
> > > > >  /* Secure Encrypted Virtualization command */
> > > > >  enum sev_cmd_id {
> > > > >         /* Guest initialization commands */
> > > > > --
> > > > > 2.26.2
> > > > >
Andy Lutomirski Dec. 8, 2020, 5:33 p.m. UTC | #25
On Tue, Dec 8, 2020 at 8:26 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
>
> On Tue, 2020-12-08 at 17:02 +0100, Thomas Gleixner wrote:
> > On Tue, Dec 08 2020 at 16:50, Maxim Levitsky wrote:
> > > On Mon, 2020-12-07 at 20:29 -0300, Marcelo Tosatti wrote:
> > > > > +This ioctl allows to reconstruct the guest's IA32_TSC and TSC_ADJUST value
> > > > > +from the state obtained in the past by KVM_GET_TSC_STATE on the same vCPU.
> > > > > +
> > > > > +If 'KVM_TSC_STATE_TIMESTAMP_VALID' is set in flags,
> > > > > +KVM will adjust the guest TSC value by the time that passed since the moment
> > > > > +CLOCK_REALTIME timestamp was saved in the struct and current value of
> > > > > +CLOCK_REALTIME, and set the guest's TSC to the new value.
> > > >
> > > > This introduces the wraparound bug in Linux timekeeping, doesn't it?
> >
> > Which bug?
> >
> > > It does.
> > > Could you prepare a reproducer for this bug so I get a better idea about
> > > what you are talking about?
> > >
> > > I assume you need very long (like days worth) jump to trigger this bug
> > > and for such case we can either work around it in qemu / kernel
> > > or fix it in the guest kernel and I strongly prefer the latter.
> > >
> > > Thomas, what do you think about it?
> >
> > For one I have no idea which bug you are talking about and if the bug is
> > caused by the VMM then why would you "fix" it in the guest kernel.
>
> The "bug" is that if VMM moves a hardware time counter (tsc or anything else)
> forward by large enough value in one go,
> then the guest kernel will supposingly have an overflow in the time code.
> I don't consider this to be a buggy VMM behavior, but rather a kernel
> bug that should be fixed (if this bug actually exists)
>
> Purely in theory this can even happen on real hardware if for example SMM handler
> blocks a CPU from running for a long duration, or hardware debugging
> interface does, or some other hardware transparent sleep mechanism kicks in
> and blocks a CPU from running.
> (We do handle this gracefully for S3/S4)

IIRC we introduced mul_u64_u32_shift() for roughly this reason, but we
don't seem to be using it in the relevant code paths.  We should be
able to use the same basic math with wider intermediates to allow very
large intervals between updates.
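
Roughly along these lines (a standalone userspace sketch, not the kernel's
actual helpers; the 4 GHz frequency and the shift value are made-up numbers):

#include <stdint.h>
#include <stdio.h>

/* Convert a cycle delta to nanoseconds with a 128-bit intermediate so
 * the multiplication cannot overflow even for huge deltas. */
static uint64_t cyc2ns_wide(uint64_t cycles, uint32_t mult, uint32_t shift)
{
	return (uint64_t)(((unsigned __int128)cycles * mult) >> shift);
}

int main(void)
{
	/* made-up mult/shift for a 4 GHz counter (0.25 ns per cycle) */
	uint32_t shift = 32;
	uint32_t mult  = (uint32_t)((1000000000ULL << shift) / 4000000000ULL);
	/* a week's worth of cycles - a 64-bit intermediate would have
	 * wrapped long before this */
	uint64_t delta = 7ULL * 24 * 3600 * 4000000000ULL;

	printf("ns after one week: %llu\n",
	       (unsigned long long)cyc2ns_wide(delta, mult, shift));
	return 0;
}

The wider intermediate is what keeps the product from wrapping for
multi-day deltas.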
Marcelo Tosatti Dec. 8, 2020, 5:35 p.m. UTC | #26
On Tue, Dec 08, 2020 at 04:50:53PM +0200, Maxim Levitsky wrote:
> On Mon, 2020-12-07 at 20:29 -0300, Marcelo Tosatti wrote:
> > On Thu, Dec 03, 2020 at 07:11:16PM +0200, Maxim Levitsky wrote:
> > > These two new ioctls allow to more precisly capture and
> > > restore guest's TSC state.
> > > 
> > > Both ioctls are meant to be used to accurately migrate guest TSC
> > > even when there is a significant downtime during the migration.
> > > 
> > > Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> > > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > > ---
> > >  Documentation/virt/kvm/api.rst | 65 ++++++++++++++++++++++++++++++
> > >  arch/x86/kvm/x86.c             | 73 ++++++++++++++++++++++++++++++++++
> > >  include/uapi/linux/kvm.h       | 15 +++++++
> > >  3 files changed, 153 insertions(+)
> > > 
> > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > > index 70254eaa5229f..ebecfe4b414ce 100644
> > > --- a/Documentation/virt/kvm/api.rst
> > > +++ b/Documentation/virt/kvm/api.rst
> > > @@ -4826,6 +4826,71 @@ If a vCPU is in running state while this ioctl is invoked, the vCPU may
> > >  experience inconsistent filtering behavior on MSR accesses.
> > >  
> > >  
> > > +4.127 KVM_GET_TSC_STATE
> > > +----------------------------
> > > +
> > > +:Capability: KVM_CAP_PRECISE_TSC
> > > +:Architectures: x86
> > > +:Type: vcpu ioctl
> > > +:Parameters: struct kvm_tsc_state
> > > +:Returns: 0 on success, < 0 on error
> > > +
> > > +::
> > > +
> > > +  #define KVM_TSC_STATE_TIMESTAMP_VALID 1
> > > +  #define KVM_TSC_STATE_TSC_ADJUST_VALID 2
> > > +  struct kvm_tsc_state {
> > > +	__u32 flags;
> > > +	__u64 nsec;
> > > +	__u64 tsc;
> > > +	__u64 tsc_adjust;
> > > +  };
> > > +
> > > +flags values for ``struct kvm_tsc_state``:
> > > +
> > > +``KVM_TSC_STATE_TIMESTAMP_VALID``
> > > +
> > > +  ``nsec`` contains nanoseconds from unix epoch.
> > > +    Always set by KVM_GET_TSC_STATE, might be omitted in KVM_SET_TSC_STATE
> > > +
> > > +``KVM_TSC_STATE_TSC_ADJUST_VALID``
> > > +
> > > +  ``tsc_adjust`` contains valid IA32_TSC_ADJUST value
> > > +
> > > +
> > > +This ioctl allows the user space to read the guest's IA32_TSC,IA32_TSC_ADJUST,
> > > +and the current value of host's CLOCK_REALTIME clock in nanoseconds since unix
> > > +epoch.
> > 
> > Why is CLOCK_REALTIME necessary at all? kvmclock uses the host clock as
> > a time base, but for TSC it should not be necessary.
> 
> 
> CLOCK_REALTIME is used as an absolute time reference that should match
> on both computers. I could have used CLOCK_TAI instead for example.
> 
> The reference allows to account for time passed between saving and restoring
> the TSC as explained above.

As mentioned we don't want this due to the overflow. 

Again, I think a higher priority is to allow enabling invariant TSC
by default (to disable kvmclock).

> > > +
> > > +
> > > +4.128 KVM_SET_TSC_STATE
> > > +----------------------------
> > > +
> > > +:Capability: KVM_CAP_PRECISE_TSC
> > > +:Architectures: x86
> > > +:Type: vcpu ioctl
> > > +:Parameters: struct kvm_tsc_state
> > > +:Returns: 0 on success, < 0 on error
> > > +
> > > +::
> > > +
> > > +This ioctl allows to reconstruct the guest's IA32_TSC and TSC_ADJUST value
> > > +from the state obtained in the past by KVM_GET_TSC_STATE on the same vCPU.
> > > +
> > > +If 'KVM_TSC_STATE_TIMESTAMP_VALID' is set in flags,
> > > +KVM will adjust the guest TSC value by the time that passed since the moment
> > > +CLOCK_REALTIME timestamp was saved in the struct and current value of
> > > +CLOCK_REALTIME, and set the guest's TSC to the new value.
> > 
> > This introduces the wraparound bug in Linux timekeeping, doesn't it?
> 
> It does.
> Could you prepare a reproducer for this bug so I get a better idea about
> what you are talking about?

Enable CONFIG_DEBUG_TIMEKEEPING, check what max_cycles is from the TSC
clocksource:

#ifdef CONFIG_DEBUG_TIMEKEEPING
#define WARNING_FREQ (HZ*300) /* 5 minute rate-limiting */

static void timekeeping_check_update(struct timekeeper *tk, u64 offset)
{

        u64 max_cycles = tk->tkr_mono.clock->max_cycles;
        const char *name = tk->tkr_mono.clock->name;

        if (offset > max_cycles) {
                printk_deferred("WARNING: timekeeping: Cycle offset (%lld) is larger than allowed by the '%s' clock's max_cycles value (%lld): time overflow danger\n",
                                offset, name, max_cycles);
                printk_deferred("         timekeeping: Your kernel is sick, but tries to cope by capping time updates\n");
        } else {
                if (offset > (max_cycles >> 1)) {
                        printk_deferred("INFO: timekeeping: Cycle offset (%lld) is larger than the '%s' clock's 50%% safety margin (%lld)\n",
                                        offset, name, max_cycles >> 1);
                        printk_deferred("      timekeeping: Your kernel is still fine, but is feeling a bit nervous\n");
                }
        }

        if (tk->underflow_seen) {
                if (jiffies - tk->last_warning > WARNING_FREQ) {
                        printk_deferred("WARNING: Underflow in clocksource '%s' observed, time update ignored.\n", name);
                        printk_deferred("         Please report this, consider using a different clocksource, if possible.\n");
                        printk_deferred("         Your kernel is probably still fine.\n");
                        tk->last_warning = jiffies;
                }
                tk->underflow_seen = 0;
        }

> I assume you need very long (like days worth) jump to trigger this bug

Exactly, max_cycles worth (for kvmclock, one or two days between
vmstop and vmstart was sufficient to trigger the bug).
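
For a rough feel of the numbers, a back-of-the-envelope sketch (this is not
clocks_calc_max_nsecs(); the mult/shift below are only illustrative for a
~4 GHz TSC):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Illustrative mult/shift for a ~4 GHz TSC; the kernel picks these
	 * so the usable range is a few hundred seconds. */
	uint32_t shift = 26;
	uint32_t mult  = (uint32_t)((1000000000ULL << shift) / 4000000000ULL);

	/* Largest cycle delta for which cycles * mult still fits in 64 bits */
	uint64_t max_cycles = UINT64_MAX / mult;
	uint64_t max_secs   = max_cycles / 4000000000ULL;

	/* A single TSC jump larger than this trips the capping path quoted
	 * above; half of it already triggers the 50% margin warning. */
	printf("max_cycles: %llu (~%llu seconds at 4 GHz)\n",
	       (unsigned long long)max_cycles, (unsigned long long)max_secs);
	return 0;
}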

> and for such case we can either work around it in qemu / kernel 
> or fix it in the guest kernel and I strongly prefer the latter.

Well, what about older kernels? Can't fix those in the guest kernel. 

Moreover:

https://patchwork.kernel.org/project/kvm/patch/20130618233825.GA19042@amt.cnet/

2) Users rely on CLOCK_MONOTONIC to count run time, that is,
time which OS has been in a runnable state (see CLOCK_BOOTTIME).

I think the current 100ms delta (on migration) can be reduced without 
checking the clock delta between source and destination hosts.

So to reiterate: the idea to pass a tuple (tsc, tsc_adjust) is
good because you can fix the issues introduced by writing the values
separately.

However, IMHO the patchset lacks a clear problem (or set of problems) 
that it is addressing.

> Thomas, what do you think about it?
> 
> Best regards,
> 	Maxim Levitsky
> 
> > 
> 
> > > +
> > > +Otherwise KVM will set the guest TSC value to the exact value as given
> > > +in the struct.
> > > +
> > > +if KVM_TSC_STATE_TSC_ADJUST_VALID is set, and guest supports IA32_MSR_TSC_ADJUST,
> > > +then its value will be set to the given value from the struct.
> > > +
> > > +It is assumed that either both ioctls will be run on the same machine,
> > > +or that source and destination machines have synchronized clocks.
> > 
> > 
> > >  5. The kvm_run structure
> > >  ========================
> > >  
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index a3fdc16cfd6f3..9b8a2fe3a2398 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -2438,6 +2438,21 @@ static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
> > >  
> > >  	return gtod_is_based_on_tsc(do_realtime(ts, tsc_timestamp));
> > >  }
> > > +
> > > +
> > > +static void kvm_get_walltime(u64 *walltime_ns, u64 *host_tsc)
> > > +{
> > > +	struct timespec64 ts;
> > > +
> > > +	if (kvm_get_walltime_and_clockread(&ts, host_tsc)) {
> > > +		*walltime_ns = timespec64_to_ns(&ts);
> > > +		return;
> > > +	}
> > > +
> > > +	*host_tsc = rdtsc();
> > > +	*walltime_ns = ktime_get_real_ns();
> > > +}
> > > +
> > >  #endif
> > >  
> > >  /*
> > > @@ -3757,6 +3772,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> > >  	case KVM_CAP_X86_USER_SPACE_MSR:
> > >  	case KVM_CAP_X86_MSR_FILTER:
> > >  	case KVM_CAP_ENFORCE_PV_FEATURE_CPUID:
> > > +#ifdef CONFIG_X86_64
> > > +	case KVM_CAP_PRECISE_TSC:
> > > +#endif
> > >  		r = 1;
> > >  		break;
> > >  	case KVM_CAP_SYNC_REGS:
> > > @@ -4999,6 +5017,61 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
> > >  	case KVM_GET_SUPPORTED_HV_CPUID:
> > >  		r = kvm_ioctl_get_supported_hv_cpuid(vcpu, argp);
> > >  		break;
> > > +#ifdef CONFIG_X86_64
> > > +	case KVM_GET_TSC_STATE: {
> > > +		struct kvm_tsc_state __user *user_tsc_state = argp;
> > > +		u64 host_tsc;
> > > +
> > > +		struct kvm_tsc_state tsc_state = {
> > > +			.flags = KVM_TSC_STATE_TIMESTAMP_VALID
> > > +		};
> > > +
> > > +		kvm_get_walltime(&tsc_state.nsec, &host_tsc);
> > > +		tsc_state.tsc = kvm_read_l1_tsc(vcpu, host_tsc);
> > > +
> > > +		if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
> > > +			tsc_state.tsc_adjust = vcpu->arch.ia32_tsc_adjust_msr;
> > > +			tsc_state.flags |= KVM_TSC_STATE_TSC_ADJUST_VALID;
> > > +		}
> > > +
> > > +		r = -EFAULT;
> > > +		if (copy_to_user(user_tsc_state, &tsc_state, sizeof(tsc_state)))
> > > +			goto out;
> > > +		r = 0;
> > > +		break;
> > > +	}
> > > +	case KVM_SET_TSC_STATE: {
> > > +		struct kvm_tsc_state __user *user_tsc_state = argp;
> > > +		struct kvm_tsc_state tsc_state;
> > > +		u64 host_tsc, wall_nsec;
> > > +
> > > +		u64 new_guest_tsc, new_guest_tsc_offset;
> > > +
> > > +		r = -EFAULT;
> > > +		if (copy_from_user(&tsc_state, user_tsc_state, sizeof(tsc_state)))
> > > +			goto out;
> > > +
> > > +		kvm_get_walltime(&wall_nsec, &host_tsc);
> > > +		new_guest_tsc = tsc_state.tsc;
> > > +
> > > +		if (tsc_state.flags & KVM_TSC_STATE_TIMESTAMP_VALID) {
> > > +			s64 diff = wall_nsec - tsc_state.nsec;
> > > +			if (diff >= 0)
> > > +				new_guest_tsc += nsec_to_cycles(vcpu, diff);
> > > +			else
> > > +				new_guest_tsc -= nsec_to_cycles(vcpu, -diff);
> > > +		}
> > > +
> > > +		new_guest_tsc_offset = new_guest_tsc - kvm_scale_tsc(vcpu, host_tsc);
> > > +		kvm_vcpu_write_tsc_offset(vcpu, new_guest_tsc_offset);
> > > +
> > > +		if (tsc_state.flags & KVM_TSC_STATE_TSC_ADJUST_VALID)
> > > +			if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST))
> > > +				vcpu->arch.ia32_tsc_adjust_msr = tsc_state.tsc_adjust;
> > > +		r = 0;
> > > +		break;
> > > +	}
> > > +#endif
> > >  	default:
> > >  		r = -EINVAL;
> > >  	}
> > > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > > index 886802b8ffba3..bf4c38fd58291 100644
> > > --- a/include/uapi/linux/kvm.h
> > > +++ b/include/uapi/linux/kvm.h
> > > @@ -1056,6 +1056,7 @@ struct kvm_ppc_resize_hpt {
> > >  #define KVM_CAP_ENFORCE_PV_FEATURE_CPUID 190
> > >  #define KVM_CAP_SYS_HYPERV_CPUID 191
> > >  #define KVM_CAP_DIRTY_LOG_RING 192
> > > +#define KVM_CAP_PRECISE_TSC 193
> > >  
> > >  #ifdef KVM_CAP_IRQ_ROUTING
> > >  
> > > @@ -1169,6 +1170,16 @@ struct kvm_clock_data {
> > >  	__u32 pad[9];
> > >  };
> > >  
> > > +
> > > +#define KVM_TSC_STATE_TIMESTAMP_VALID 1
> > > +#define KVM_TSC_STATE_TSC_ADJUST_VALID 2
> > > +struct kvm_tsc_state {
> > > +	__u32 flags;
> > > +	__u64 nsec;
> > > +	__u64 tsc;
> > > +	__u64 tsc_adjust;
> > > +};
> > > +
> > >  /* For KVM_CAP_SW_TLB */
> > >  
> > >  #define KVM_MMU_FSL_BOOKE_NOHV		0
> > > @@ -1563,6 +1574,10 @@ struct kvm_pv_cmd {
> > >  /* Available with KVM_CAP_DIRTY_LOG_RING */
> > >  #define KVM_RESET_DIRTY_RINGS		_IO(KVMIO, 0xc7)
> > >  
> > > +/* Available with KVM_CAP_PRECISE_TSC*/
> > > +#define KVM_SET_TSC_STATE          _IOW(KVMIO,  0xc8, struct kvm_tsc_state)
> > > +#define KVM_GET_TSC_STATE          _IOR(KVMIO,  0xc9, struct kvm_tsc_state)
> > > +
> > >  /* Secure Encrypted Virtualization command */
> > >  enum sev_cmd_id {
> > >  	/* Guest initialization commands */
> > > -- 
> > > 2.26.2
>
Andy Lutomirski Dec. 8, 2020, 5:43 p.m. UTC | #27
On Tue, Dec 8, 2020 at 6:23 AM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> On Mon, Dec 07, 2020 at 10:04:45AM -0800, Andy Lutomirski wrote:
> >
> >
> > I do have a feature request, though: IMO it would be quite nifty if the new kvmclock structure could also expose NTP corrections. In other words, if you could expose enough info to calculate CLOCK_MONOTONIC_RAW, CLOCK_MONOTONIC, and CLOCK_REALTIME, then we could have paravirt NTP.
>
> Hi Andy,
>
> Any reason why drivers/ptp/ptp_kvm.c does not work for you?
>

It looks like it tries to accomplish the right goal, but in a rather
roundabout way.  The host knows how to convert from TSC to
CLOCK_REALTIME, and ptp_kvm.c exposes this data to the guest.  But,
rather than just making the guest use the same CLOCK_REALTIME data as
the host, ptp_kvm.c seems to expose information to usermode that a
user daemon could use to attempt (with some degree of error?) to use
to make the guest kernel track CLOCK_REALTIME.  This seems inefficient
and dubiously accurate.

My feature request is for this to be fully automatic and completely
coherent.  I would like for a host user program and a guest user
program to be able to share memory, run concurrently, and use the
shared memory to exchange CLOCK_REALTIME values without ever observing
the clock going backwards.  This ought to be doable.  Ideally the
result should even be usable for Spanner-style synchronization
assuming the host clock is good enough.  Also, this whole thing should
work without needing to periodically wake the guest to remain
synchronized.  If the guest sleeps for two minutes (full nohz-idle, no
guest activity at all), the host makes a small REALTIME frequency
adjustment, and then the guest runs user code that reads
CLOCK_REALTIME, the guest clock should still be fully synchronized
with the host.  I don't think that ptp_kvm.c-style synchronization can
do this.

tglx etc, I think that doing this really really nicely might involve
promoting something like the current vDSO data structures to ABI -- a
straightforward-ish implementation would be for the KVM host to export
its vvar clock data to the guest and for the guest to use it, possibly
with an offset applied.  The offset could work a lot like timens works
today.
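
To make that a bit more concrete, a rough sketch of what such an exported
page could carry (none of these names are existing ABI; the per-guest
offsets are the timens-like part):

#include <stdint.h>

/* Hypothetical host-exported clock page, loosely modeled on the vDSO
 * data.  The host updates it from its timekeeping path; the guest maps
 * it read-only and applies its own offsets on top. */
struct guest_clock_page {
	uint32_t seq;			/* odd while the host updates it */
	uint32_t flags;

	uint64_t cycle_last;		/* host TSC value the bases refer to */
	uint32_t mult;			/* host cycles -> ns */
	uint32_t shift;

	uint64_t mono_raw_base_ns;	/* host clock bases at cycle_last */
	uint64_t mono_base_ns;
	uint64_t realtime_base_ns;
	uint64_t tai_base_ns;

	/* per-guest, timens-style part */
	int64_t  tsc_offset;		/* guest_tsc = host_tsc + tsc_offset */
	int64_t  mono_offset_ns;
	int64_t  boottime_offset_ns;
};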

--Andy
Marcelo Tosatti Dec. 8, 2020, 6:11 p.m. UTC | #28
On Tue, Dec 08, 2020 at 05:02:07PM +0100, Thomas Gleixner wrote:
> On Tue, Dec 08 2020 at 16:50, Maxim Levitsky wrote:
> > On Mon, 2020-12-07 at 20:29 -0300, Marcelo Tosatti wrote:
> >> > +This ioctl allows to reconstruct the guest's IA32_TSC and TSC_ADJUST value
> >> > +from the state obtained in the past by KVM_GET_TSC_STATE on the same vCPU.
> >> > +
> >> > +If 'KVM_TSC_STATE_TIMESTAMP_VALID' is set in flags,
> >> > +KVM will adjust the guest TSC value by the time that passed since the moment
> >> > +CLOCK_REALTIME timestamp was saved in the struct and current value of
> >> > +CLOCK_REALTIME, and set the guest's TSC to the new value.
> >> 
> >> This introduces the wraparound bug in Linux timekeeping, doesn't it?
> 
> Which bug?

max_cycles overflow. Sent a message to Maxim describing it.

> 
> > It does.
> > Could you prepare a reproducer for this bug so I get a better idea about
> > what you are talking about?
> >
> > I assume you need very long (like days worth) jump to trigger this bug
> > and for such case we can either work around it in qemu / kernel 
> > or fix it in the guest kernel and I strongly prefer the latter.
> >
> > Thomas, what do you think about it?
> 
> For one I have no idea which bug you are talking about and if the bug is
> caused by the VMM then why would you "fix" it in the guest kernel.

1) Stop guest, save TSC value of cpu-0 = V.
2) Wait for some amount of time = W.
3) Start guest, load TSC value with V+W.

Can cause an overflow on Linux timekeeping.
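
For the record, the sequence in host-side pseudo-code (a bare sketch using
the existing KVM_{GET,SET}_MSRS path on IA32_TSC; vCPU setup, error
handling and the guest image are omitted, and vcpu_fd / tsc_hz are assumed
to exist):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#define MSR_IA32_TSC	0x00000010

static uint64_t get_guest_tsc(int vcpu_fd)
{
	struct {
		struct kvm_msrs hdr;
		struct kvm_msr_entry entry;
	} m = { .hdr.nmsrs = 1, .entry.index = MSR_IA32_TSC };

	ioctl(vcpu_fd, KVM_GET_MSRS, &m.hdr);
	return m.entry.data;
}

static void set_guest_tsc(int vcpu_fd, uint64_t val)
{
	struct {
		struct kvm_msrs hdr;
		struct kvm_msr_entry entry;
	} m = { .hdr.nmsrs = 1, .entry.index = MSR_IA32_TSC, .entry.data = val };

	ioctl(vcpu_fd, KVM_SET_MSRS, &m.hdr);
}

/*
 * 1) stop the vCPUs and read V = get_guest_tsc(vcpu_fd)
 * 2) wait (or just pretend to have waited) W seconds
 * 3) set_guest_tsc(vcpu_fd, V + W * tsc_hz) and resume
 *
 * Once W * tsc_hz exceeds the clocksource's max_cycles, the guest's
 * timekeeping_check_update() warns and caps the update as quoted above.
 */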

> Aside of that I think I made it pretty clear what the right thing to do
> is.

Sure: the notion of a "unique TSC offset" already exists (it is detected
by write TSC logic, and not explicit in the interface, though).

But AFAIK it works pretty well.

Exposing a single TSC value on the interface level seems alright to
me...
Marcelo Tosatti Dec. 8, 2020, 6:12 p.m. UTC | #29
On Tue, Dec 08, 2020 at 06:25:13PM +0200, Maxim Levitsky wrote:
> On Tue, 2020-12-08 at 17:02 +0100, Thomas Gleixner wrote:
> > On Tue, Dec 08 2020 at 16:50, Maxim Levitsky wrote:
> > > On Mon, 2020-12-07 at 20:29 -0300, Marcelo Tosatti wrote:
> > > > > +This ioctl allows to reconstruct the guest's IA32_TSC and TSC_ADJUST value
> > > > > +from the state obtained in the past by KVM_GET_TSC_STATE on the same vCPU.
> > > > > +
> > > > > +If 'KVM_TSC_STATE_TIMESTAMP_VALID' is set in flags,
> > > > > +KVM will adjust the guest TSC value by the time that passed since the moment
> > > > > +CLOCK_REALTIME timestamp was saved in the struct and current value of
> > > > > +CLOCK_REALTIME, and set the guest's TSC to the new value.
> > > > 
> > > > This introduces the wraparound bug in Linux timekeeping, doesn't it?
> > 
> > Which bug?
> > 
> > > It does.
> > > Could you prepare a reproducer for this bug so I get a better idea about
> > > what you are talking about?
> > > 
> > > I assume you need very long (like days worth) jump to trigger this bug
> > > and for such case we can either work around it in qemu / kernel 
> > > or fix it in the guest kernel and I strongly prefer the latter.
> > > 
> > > Thomas, what do you think about it?
> > 
> > For one I have no idea which bug you are talking about and if the bug is
> > caused by the VMM then why would you "fix" it in the guest kernel.
> 
> The "bug" is that if VMM moves a hardware time counter (tsc or anything else) 
> forward by large enough value in one go, 
> then the guest kernel will supposingly have an overflow in the time code.
> I don't consider this to be a buggy VMM behavior, but rather a kernel
> bug that should be fixed (if this bug actually exists)

It exists.

> Purely in theory this can even happen on real hardware if for example SMM handler
> blocks a CPU from running for a long duration, or hardware debugging
> interface does, or some other hardware transparent sleep mechanism kicks in
> and blocks a CPU from running.
> (We do handle this gracefully for S3/S4)
> 
> > 
> > Aside of that I think I made it pretty clear what the right thing to do
> > is.
> 
> This is orthogonal to this issue of the 'bug'. 
> Here we are not talking about per-vcpu TSC offsets, something that, as I said,
> I do agree with you would be very nice to get rid of.
>  
> We are talking about the fact that the TSC can jump forward by an arbitrarily large
> value if the migration took an arbitrary amount of time, which
> (assuming that the bug is real) can crash the guest kernel.

QE reproduced it.

> This will happen even if we use per VM global tsc offset.
> 
> So what do you think?
> 
> Best regards,
> 	Maxim Levitsky
> 
> > 
> > Thanks,
> > 
> >         tglx
> > 
>
Thomas Gleixner Dec. 8, 2020, 7:24 p.m. UTC | #30
On Tue, Dec 08 2020 at 09:43, Andy Lutomirski wrote:
> On Tue, Dec 8, 2020 at 6:23 AM Marcelo Tosatti <mtosatti@redhat.com> wrote:
> It looks like it tries to accomplish the right goal, but in a rather
> roundabout way.  The host knows how to convert from TSC to
> CLOCK_REALTIME, and ptp_kvm.c exposes this data to the guest.  But,
> rather than just making the guest use the same CLOCK_REALTIME data as
> the host, ptp_kvm.c seems to expose information to usermode that a
> user daemon could use to attempt (with some degree of error?) to use
> to make the guest kernel track CLOCK_REALTIME.  This seems inefficient
> and dubiously accurate.
>
> My feature request is for this to be fully automatic and completely
> coherent.  I would like for a host user program and a guest user
> program to be able to share memory, run concurrently, and use the
> shared memory to exchange CLOCK_REALTIME values without ever observing
> the clock going backwards.  This ought to be doable.  Ideally the
> result should even be usable for Spanner-style synchronization
> assuming the host clock is good enough.  Also, this whole thing should
> work without needing to periodically wake the guest to remain
> synchronized.  If the guest sleeps for two minutes (full nohz-idle, no
> guest activity at all), the host makes a small REALTIME frequency
> adjustment, and then the guest runs user code that reads
> CLOCK_REALTIME, the guest clock should still be fully synchronized
> with the host.  I don't think that ptp_kvm.c-style synchronization can
> do this.

One issue here is that guests might want to run their own NTP/PTP. One
reason to do that is that some people prefer the leap second smearing
NTP servers. 

> tglx etc, I think that doing this really really nicely might involve
> promoting something like the current vDSO data structures to ABI -- a
> straightforward-ish implementation would be for the KVM host to export
> its vvar clock data to the guest and for the guest to use it, possibly
> with an offset applied.  The offset could work a lot like timens works
> today.

Works nicely if the guest TSC is not scaled. But that means that on
migration the raw TSC usage in the guest is borked because the new host
might have a different TSC frequency.

If you use TSC scaling then the conversion needs to take TSC scaling
into account which needs some thought. And the guest would need to read
the host conversion from 'vdso data' and the scaling from the next page
(per guest) and then still has to support timens. Doable but adds extra
overhead on every time read operation.
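
For illustration, the extra step a scaled guest would need on every read
(all names are made up; the 48-bit fraction is the VMX flavour of TSC
scaling, SVM uses a different fixed-point format):

#include <stdint.h>

/* Undo the per-guest scaling/offset to get back to host cycles, then
 * apply the host's mult/shift.  Both steps need 128-bit intermediates,
 * which is exactly the extra per-read overhead mentioned above. */
static uint64_t guest_tsc_to_host_ns(uint64_t guest_tsc,
				     uint64_t tsc_ratio,	/* 16.48 fixed point, per guest */
				     int64_t  tsc_offset,	/* per guest */
				     uint64_t host_cycle_last,
				     uint64_t host_base_ns,
				     uint32_t host_mult,
				     uint32_t host_shift)
{
	/* on the way in: guest_tsc = ((host_tsc * ratio) >> 48) + offset */
	unsigned __int128 host_tsc =
		((unsigned __int128)(guest_tsc - tsc_offset) << 48) / tsc_ratio;
	uint64_t delta = (uint64_t)host_tsc - host_cycle_last;

	return host_base_ns +
	       (uint64_t)(((unsigned __int128)delta * host_mult) >> host_shift);
}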

If you want to avoid that you are back to the point where you need to
chase all guest data when the host NTP/PTP adjusts the host side.
Chasing and updating all this stuff in the tick was the reason why I was
fighting the idea of clock realtime in namespaces.

Thanks,

        tglx
Andy Lutomirski Dec. 8, 2020, 8:32 p.m. UTC | #31
> On Dec 8, 2020, at 11:25 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Tue, Dec 08 2020 at 09:43, Andy Lutomirski wrote:
>> On Tue, Dec 8, 2020 at 6:23 AM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>> It looks like it tries to accomplish the right goal, but in a rather
>> roundabout way.  The host knows how to convert from TSC to
>> CLOCK_REALTIME, and ptp_kvm.c exposes this data to the guest.  But,
>> rather than just making the guest use the same CLOCK_REALTIME data as
>> the host, ptp_kvm.c seems to expose information to usermode that a
>> user daemon could use to attempt (with some degree of error?) to use
>> to make the guest kernel track CLOCK_REALTIME.  This seems inefficient
>> and dubiously accurate.
>> 
>> My feature request is for this to be fully automatic and completely
>> coherent.  I would like for a host user program and a guest user
>> program to be able to share memory, run concurrently, and use the
>> shared memory to exchange CLOCK_REALTIME values without ever observing
>> the clock going backwards.  This ought to be doable.  Ideally the
>> result should even be usable for Spanner-style synchronization
>> assuming the host clock is good enough.  Also, this whole thing should
>> work without needing to periodically wake the guest to remain
>> synchronized.  If the guest sleeps for two minutes (full nohz-idle, no
>> guest activity at all), the host makes a small REALTIME frequency
>> adjustment, and then the guest runs user code that reads
>> CLOCK_REALTIME, the guest clock should still be fully synchronized
>> with the host.  I don't think that ptp_kvm.c-style synchronization can
>> do this.
> 
> One issue here is that guests might want to run their own NTP/PTP. One
> reason to do that is that some people prefer the leap second smearing
> NTP servers. 

I would hope that using this part would be optional on the guest’s part. Guests should be able to use just the CLOCK_MONOTONIC_RAW part or fancier stuff at their option.

(Hmm, it would, in principle, be possible for a guest to use the host’s TAI but still smear leap seconds. Even without virt, smearing could be a per-timens option.)

> 
>> tglx etc, I think that doing this really really nicely might involve
>> promoting something like the current vDSO data structures to ABI -- a
>> straightforward-ish implementation would be for the KVM host to export
>> its vvar clock data to the guest and for the guest to use it, possibly
>> with an offset applied.  The offset could work a lot like timens works
>> today.
> 
> Works nicely if the guest TSC is not scaled. But that means that on
> migration the raw TSC usage in the guest is borked because the new host
> might have a different TSC frequency.
> 
> If you use TSC scaling then the conversion needs to take TSC scaling
> into account which needs some thought. And the guest would need to read
> the host conversion from 'vdso data' and the scaling from the next page
> (per guest) and then still has to support timens. Doable but adds extra
> overhead on every time read operation.

Is the issue that scaling would result in a different guest vs host frequency?  Perhaps we could limit each physical machine to exactly two modes: unscaled (use TSC ticks, convert in software) and scaled to nanoseconds (CLOCK_MONOTONIC_RAW is RDTSC + possible offset).  Then the host could produce its data structures in exactly those two formats and export them as appropriate. 

> 
> If you want to avoid that you are back to the point where you need to
> chase all guest data when the host NTP/PTP adjusts the host side.
> Chasing and updating all this stuff in the tick was the reason why I was
> fighting the idea of clock realtime in namespaces.

I think that, if we can arrange for a small, bounded number of pages generated by the host, then this problem isn’t so bad.

Hmm, leap second smearing is just a different linear mapping. I’m not sure how leap second smearing should interact with timens, but it seems to be that the host should be able to produce four data pages (scaled vs unscaled and smeared vs unsmeared) and one per-guest/timens offset page (where offset applies to MONOTONIC and MONOTONIC_RAW only) and cover all bases.  Or do people actually want to offset their TAI and/or REALTIME, and what would that even mean if the offset crosses a leap second?
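
Something like this, purely illustrative (none of it is existing ABI):

#include <stdint.h>

/* Hypothetical: the host publishes one clock data page per
 * (scaled?, smeared?) combination and the guest picks one at
 * boot/migration time; MONOTONIC/MONOTONIC_RAW offsets live in a
 * separate per-guest page. */
enum host_clock_page_id {
	HOST_PAGE_UNSCALED_SANE,	/* raw TSC ticks, unsmeared REALTIME */
	HOST_PAGE_UNSCALED_SMEARED,
	HOST_PAGE_SCALED_SANE,		/* 1 GHz "nanosecond" ticks */
	HOST_PAGE_SCALED_SMEARED,
	NR_HOST_CLOCK_PAGES,
};

struct guest_offset_page {
	int64_t mono_offset_ns;		/* MONOTONIC only */
	int64_t mono_raw_offset_ns;	/* MONOTONIC_RAW only */
};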

(I haven’t thought about the interaction of any of this with ART.)
Thomas Gleixner Dec. 8, 2020, 9:20 p.m. UTC | #32
On Tue, Dec 08 2020 at 18:25, Maxim Levitsky wrote:
> On Tue, 2020-12-08 at 17:02 +0100, Thomas Gleixner wrote:
>> For one I have no idea which bug you are talking about and if the bug is
>> caused by the VMM then why would you "fix" it in the guest kernel.
>
> The "bug" is that if VMM moves a hardware time counter (tsc or anything else) 
> forward by large enough value in one go, 
> then the guest kernel will supposingly have an overflow in the time code.
> I don't consider this to be a buggy VMM behavior, but rather a kernel
> bug that should be fixed (if this bug actually exists)

Well, that's debatable. The kernel has a safeguard in place for each
clocksource which calculates the maximum time before an update needs to
take place. That limit comes from:

 1) Hardware counter wraparound time
 2) Math limitation

#1 is a non-issue on TSC, but it is on pm-timer, hpet and lots of other
   non-x86 devices

#2 The overflow surely can happen if you're long enough out. For TSC
   it's ~ 800 / f [seconds/GHz TSC frequency], i.e. 200 seconds for a
   4GHz TSC.

> Purely in theory this can even happen on real hardware if for example
> SMM handler blocks a CPU from running for a long duration, or hardware
> debugging interface does, or some other hardware transparent sleep
> mechanism kicks in and blocks a CPU from running.  (We do handle this
> gracefully for S3/S4)

We had this discussion before. People got upset the stuff didn't work
when they resumed debugging after leaving the box in the breakpoint over
the weekend. *Shrug*

If SMM goes out for lunch for > 200 seconds it's broken. End of story,
really. There are bigger problems than timekeeping when that happens.

Hardware transparent sleep mechanisms which are doing this behind the
kernels back without giving it a mechanism to configure it is pretty
much like SMM: It's broken.

So now live migration comes a long time after timekeeping had set the
limits and just because it's virt it expects that everything works and it
can just ignore these limits.

TBH. That's not any different than SMM or hard/firmware taking the
machine out for lunch. It's exactly the same: It's broken.

And of course since that migration muck started _nobody_ bothered until
today to talk to me about that.

It's not a kernel bug. The kernel works as designed for the purpose and
the design clearly had these goals:

    1) Correctness
    2) Performance
    3) Scalability

and for that we introduced limitations which were perfectly reasonable
at the time because SMM and hardware/firmware wreckage definitely cannot
be the limiting factor and for the fast wrapping stuff there is no
design at all. These limitations are still reasonable because lifting
them hurts performance and depending on the length has effects on
correctness as well. Timekeeping is a complex problem.

It's a virt bug caused by pure ignorance of the underlying and already
existing technology and the unwillingness to talk to people who actually
understand it. I don't even want to know what kind of magic workarounds
VMMs have dreamed up for that. I'm seriously grumpy that more than 10
years after I reported that time can be observed going backwards this is
still not fixed and that has absolutely nothing to do with guest
migration.  Ignoring the simple and trivial requirement for timekeeping
correctness in the first place and then having the chutzpah to claim that
the kernel is buggy because virt decided it can do what it wants is
beyond my comprehension and yet another proof for the theorem that virt
creates more problems than it solves.

</rant>

The question how it can be made work is a different problem. I carefully
said 'made work' because you can't 'fix' it.

  - It can't be fixed at the VMM side at all

  - It can't be fixed for fast wrapping clock sources by just fiddling
    with the timekeeping and time accessor code at all.

  - Even for TSC it can't be just fixed without imposing overhead on
    every time read including VDSO. And just fixing it for x86 and TSC
    does not cut it. There is a whole world outside of x86 and we are
    not going to impose any x86/TSC specific insanity on everybody
    else. We are neither going to make generic code have TSC specific
    hoops and loops just to deal with that.

This needs orchestration and collaboration from both the VMM and the
guest kernel to make this work proper and reliably.

There are two ways to do that:

   1) Suspend / resume the guest kernel

   2) Have a protocol which is safe under all circumstances.

If #2 is not there then #1 is the only correct option unless the VMM can
guarantee that the guest is restarted _before_ time goes south.

Doing #2 correctly is not rocket science either. The kernel has
mechanisms to deal with such problems already. All it requires is to
expose and utilize them.

The only requirement there is to bring the kernel into a state where no
CPU can observe that the time goes backwards. The kernel has two
mechanisms for that:

   1) Suspend / resume. Trivial because all CPUs except the (usually)
      boot CPU are unplugged or in a state which does not matter

   2) Switching clocksources. A runtime operation which is safe and
      correct utilizing stop_machine()

If you really think about it then this migration problem is nothing else
than switching the underlying clocksource. The only difference is that
in case of a regular clocksource switch the previous clocksource is
still accessible up to the point where the switchover happens which is
obviously not the case for VM migration.

But instead of switching the clocksource via stop machine we can just
use the same mechanism to update the current clocksource so that the
time jump does not matter and cannot be observed.

That needs a few things to be done:

VMM:
        - Disable NMI delivery to the guest
        - Inject a magic VMM to guest IPI which starts the operation

Guest:
        - Schedule work from the IPI

        - work handles nested VMs if necessary        
        
        - work invokes stop_machine(update_guest_clocksource, NULL, NULL)

        - When all vCPUs have rendezvoused, the one vCPU which actually
          runs update_guest_clocksource() and reports to the VMM via
          hypercall or whatever that it reached the state and spin waits
          on a hyperpage shared between guest and VMM to wait for the
          VMM to signal to proceed.

          At that point it's 100% safe to freeze it. Probably not only
          from a timekeeping POV.

VMM:
        - Freeze VM and send image to destination VMM

New VMM:

        - Install the image

        - Setup the vCPU TSC muck

        - Store necessary information in the hyperpage and then flip the
          bit which makes the guest waiting in update_guest_clocksource()
          proceed.

        - Schedule the guest vCPUs

Guest:

        - The one vCPU waiting in update_guest_clocksource() observes the
          GO bit, updates timekeeping and returns.

        - All CPUs leave stop_machine() and everything is fine

        - work resumes and handles nested VMs

        - work tells VMM that everything is done

All the bits and pieces are there already except for the VMM/guest
contract and the extra 20 lines of code in the timekeeping core.
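
Guest side, roughly (a pure sketch: chosen_cpu, the hyperpage, the
hypercall and timekeeping_resync() are placeholders for the contract and
those extra lines, none of this exists today):

#include <linux/smp.h>
#include <linux/stop_machine.h>
#include <linux/workqueue.h>

/* All of these are placeholders for the VMM/guest contract: */
extern int chosen_cpu;				/* vCPU that talks to the VMM */
extern struct migration_hyperpage {
	u32 go;					/* set by the VMM on the new host */
	s64 new_offset_ns;
} *hyperpage;
extern void hypercall_ready_for_freeze(void);
extern void timekeeping_resync(s64 offset_ns);	/* the ~20 extra core lines */

static int update_guest_clocksource(void *unused)
{
	/* every vCPU ends up here via stop_machine(); only one proceeds */
	if (smp_processor_id() != chosen_cpu)
		return 0;

	hypercall_ready_for_freeze();		/* "safe to freeze me now" */

	while (!READ_ONCE(hyperpage->go))	/* flipped by the new VMM */
		cpu_relax();

	/* fold the jump into the timekeeper so no CPU can observe time
	 * going backwards or a delta larger than max_cycles */
	timekeeping_resync(hyperpage->new_offset_ns);
	return 0;
}

static void migration_work_fn(struct work_struct *work)
{
	/* quiesce nested VMs first */
	stop_machine(update_guest_clocksource, NULL, NULL);
	/* resume nested VMs, then tell the VMM that everything is done */
}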

There is one caveat vs. the NMI safe time keeper, but we have the
mechanism to deal with that in place already for suspend so that's just
another 5 lines of code to deal with at the core side.

Now you combine this with a proper mechanism to deal with the TSC offset
as I outlined before and your problems are pretty much gone in a very
clean and understandable way.

Thanks,

        tglx
Thomas Gleixner Dec. 8, 2020, 9:25 p.m. UTC | #33
On Tue, Dec 08 2020 at 09:33, Andy Lutomirski wrote:
> On Tue, Dec 8, 2020 at 8:26 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
>
> IIRC we introduced mul_u64_u32_shift() for roughly this reason, but we
> don't seem to be using it in the relevant code paths.  We should be
> able to use the same basic math with wider intermediates to allow very
> large intervals between updates.

That's 128 bit math and _NO_ we are not going there for various
reasons.

Hint1: There is a world outside of x8664 and TSC.
Hint2: Even on x8664 and TSC it comes with performance overhead

Thanks,

        tglx
Thomas Gleixner Dec. 8, 2020, 9:33 p.m. UTC | #34
On Tue, Dec 08 2020 at 15:11, Marcelo Tosatti wrote:
> On Tue, Dec 08, 2020 at 05:02:07PM +0100, Thomas Gleixner wrote:
>> On Tue, Dec 08 2020 at 16:50, Maxim Levitsky wrote:
>> > On Mon, 2020-12-07 at 20:29 -0300, Marcelo Tosatti wrote:
>> >> > +This ioctl allows to reconstruct the guest's IA32_TSC and TSC_ADJUST value
>> >> > +from the state obtained in the past by KVM_GET_TSC_STATE on the same vCPU.
>> >> > +
>> >> > +If 'KVM_TSC_STATE_TIMESTAMP_VALID' is set in flags,
>> >> > +KVM will adjust the guest TSC value by the time that passed since the moment
>> >> > +CLOCK_REALTIME timestamp was saved in the struct and current value of
>> >> > +CLOCK_REALTIME, and set the guest's TSC to the new value.
>> >> 
>> >> This introduces the wraparound bug in Linux timekeeping, doesn't it?
>> 
>> Which bug?
>
> max_cycles overflow. Sent a message to Maxim describing it.

Truly helpful. Why the hell did you not talk to me when you ran into
that the first time?

>> For one I have no idea which bug you are talking about and if the bug is
>> caused by the VMM then why would you "fix" it in the guest kernel.
>
> 1) Stop guest, save TSC value of cpu-0 = V.
> 2) Wait for some amount of time = W.
> 3) Start guest, load TSC value with V+W.
>
> Can cause an overflow on Linux timekeeping.

Yes, because you violate the basic assumption which Linux timekeeping
makes. See the other mail in this thread.

Thanks,

        tglx
Thomas Gleixner Dec. 8, 2020, 9:35 p.m. UTC | #35
On Tue, Dec 08 2020 at 15:12, Marcelo Tosatti wrote:
> On Tue, Dec 08, 2020 at 06:25:13PM +0200, Maxim Levitsky wrote:
>> On Tue, 2020-12-08 at 17:02 +0100, Thomas Gleixner wrote:
>> The "bug" is that if VMM moves a hardware time counter (tsc or anything else) 
>> forward by large enough value in one go, 
>> then the guest kernel will supposingly have an overflow in the time code.
>> I don't consider this to be a buggy VMM behavior, but rather a kernel
>> bug that should be fixed (if this bug actually exists)
>
> It exists.

In the VMM. 

>> We are talking about the fact that the TSC can jump forward by an arbitrarily large
>> value if the migration took an arbitrary amount of time, which
>> (assuming that the bug is real) can crash the guest kernel.
>
> QE reproduced it.

Sure, that's what QE is about. Just your conclusion is wrong.

Thanks,

        tglx
Thomas Gleixner Dec. 9, 2020, 12:19 a.m. UTC | #36
On Tue, Dec 08 2020 at 12:32, Andy Lutomirski wrote:
>> On Dec 8, 2020, at 11:25 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> One issue here is that guests might want to run their own NTP/PTP. One
>> reason to do that is that some people prefer the leap second smearing
>> NTP servers. 
>
> I would hope that using this part would be optional on the guest’s
> part. Guests should be able to use just the CLOCK_MONOTONIC_RAW part
> or fancier stuff at their option.
>
> (Hmm, it would, in principle, be possible for a guest to use the
> host’s TAI but still smear leap seconds. Even without virt, smearing
> could be a per-timens option.)

No. Don't even think about it. Read the thread:

  https://lore.kernel.org/r/20201030110229.43f0773b@jawa

all the way through the end and then come up with a real proposal which
solves all of the issues mentioned there.

I might be missing the obvious, but please before you make proposals
about time keeping out of thin air, please do your homework. You would
not ask these questions otherwise.

If it would be that simple we wouldn't be discussing this in the first
place.

Sorry for being blunt, but this has been discussed to death already.

It can be solved on the theory level, but it's not practical.

You _cannot_ make leap second smearing or different notions of clock
realtime an isolated problem. We have

   CLOCK_MONOTONIC
   CLOCK_BOOTTIME
   CLOCK_REALTIME
   CLOCK_TAI

They share one fundamental property:

     All frequency adjustments done by NTP/PTP/PPS or whatever affect
     _ALL_ of them in the exactly same way.

I.e. leap second smearing whether you like it or not is affecting all of
them and there is nothing we can do about that. Why?

 1) Because it's wrong to begin with. It creates a separate universe of
    CLOCK_REALTIME and therefore of CLOCK_TAI because they are strictly
    coupled by definition.

 2) It's user space ABI. adjtimex() can make the kernel do random
    crap. So if you extend that to time namespaces (which is pretty much
    the same as a guest time space) then you have to solve the following
    problems:

    A) The tick based gradual adjustment of adjtimex() to avoid time
       jumping around has to be done for every time universe which has
       different parameters than the host.

       Arguably this can be solved by having a seqcount based magic hack
       which forces the first time name space consumer into updating the
       per time name space notion of time instead of doing it from the
       host tick, but that requires to have fully synchronized nested
       sequence counts and if you extend this to nested virt it creates
       an exponential problem.

    B) What to do about enforced time jumps on the host (settimeofday,
       adjtimex)? 

    C) Once you have solved #1 and #2 explain how timers (nanosleep,
       interval timers, ....) which are all user space ABI and have the
       fundamental guarantee to not expire early can be handled in a sane
       and scalable way.

Once you have a coherent answer for all of the above I'm happy to step
down and hand over. In that case I'm more than happy not to have to deal
with the inevitable regression reports.

>>> tglx etc, I think that doing this really really nicely might involve
>>> promoting something like the current vDSO data structures to ABI -- a
>>> straightforward-ish implementation would be for the KVM host to export
>>> its vvar clock data to the guest and for the guest to use it, possibly
>>> with an offset applied.  The offset could work a lot like timens works
>>> today.
>> 
>> Works nicely if the guest TSC is not scaled. But that means that on
>> migration the raw TSC usage in the guest is borked because the new host
>> might have a different TSC frequency.
>> 
>> If you use TSC scaling then the conversion needs to take TSC scaling
>> into account which needs some thought. And the guest would need to read
>> the host conversion from 'vdso data' and the scaling from the next page
>> (per guest) and then still has to support timens. Doable but adds extra
>> overhead on every time read operation.
>
> Is the issue that scaling would result in a different guest vs host
> frequency?  Perhaps we could limit each physical machine to exactly
> two modes: unscaled (use TSC ticks, convert in software) and scaled to
> nanoseconds (CLOCK_MONOTONIC_RAW is RDTSC + possible offset).  Then
> the host could produce its data structures in exactly those two
> formats and export them as appropriate.

The latter - nanoseconds scaling - is the only reasonable solution but
then _ALL_ involved hosts must agree on that.

>> If you want to avoid that you are back to the point where you need to
>> chase all guest data when the host NTP/PTP adjusts the host side.
>> Chasing and updating all this stuff in the tick was the reason why I was
>> fighting the idea of clock realtime in namespaces.
>
> I think that, if we can arrange for a small, bounded number of pages
> generated by the host, then this problem isn’t so bad.

Wishful thinking unless we have a very strict contract about

 - scaling to 1 GHz, aka nanoseconds,
 - all guests agree with the host defined management of clock
   REALTIME and TAI

> Hmm, leap second smearing is just a different linear mapping. I’m not
> sure how leap second smearing should interact with timens, but it
> seems to be that the host should be able to produce four data pages
> (scaled vs unscaled and smeared vs unsmeared) and one per-guest/timens
> offset page (where offset applies to MONOTONIC and MONOTONIC_RAW only)
> and cover all bases.  Or do people actually want to offset their TAI
> and/or REALTIME, and what would that even mean if the offset crosses a
> leap second?
>

See above.

And yes people want to run different time universes where TAI != TAI.

> (I haven’t thought about the interaction of any of this with ART.)

Obviously not:) And I'm pretty sure that nobody else did, but ART is the
least of my worries so far.

The current VM migration approach is: Get it done no matter what, we
deal with the fallout later. Which means endless tinkering because you
can't fix essential design fails after the fact.

Thanks,

        tglx
Andy Lutomirski Dec. 9, 2020, 4:08 a.m. UTC | #37
On Tue, Dec 8, 2020 at 4:19 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Tue, Dec 08 2020 at 12:32, Andy Lutomirski wrote:
> >> On Dec 8, 2020, at 11:25 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >> One issue here is that guests might want to run their own NTP/PTP. One
> >> reason to do that is that some people prefer the leap second smearing
> >> NTP servers.
> >
> > I would hope that using this part would be optional on the guest’s
> > part. Guests should be able to use just the CLOCK_MONOTONIC_RAW part
> > or fancier stuff at their option.
> >
> > (Hmm, it would, in principle, be possible for a guest to use the
> > host’s TAI but still smear leap seconds. Even without virt, smearing
> > could be a per-timens option.)
>
> No. Don't even think about it. Read the thread:
>
>   https://lore.kernel.org/r/20201030110229.43f0773b@jawa
>
> all the way through the end and then come up with a real proposal which
> solves all of the issues mentioned there.

You're misunderstanding me, which is entirely reasonable, since my
description was crap.  In particular, what I meant by smearing is not
at all what's done today.  Let me try again.  The thing below is my
proposal, not necessarily a description of exactly what happens now.

(I read most of that thread, and I read most of this thread, and I've
hacked on the time code, cursed at the KVM code, modified the KVM
code, cursed at the KVM code some more, etc.  None of which is to say
that I have a full understanding of every possible timekeeping nuance,
but I'm pretty sure I can at least pretend to understand some of it.)

We have some time source that we can read (e.g. TSC).  Let's call it
read_time().  It returns an integer (64-bits would be nice, but we'll
take what we can get).  From the output of read_time(), Linux user
programs, and the kernel itself (and guests perhaps, see below) would
like to produce various outputs.  Each of them is protected by a
seqlock that I'll omit in the descriptions below.  The operations
below are all protected by a seqlock retry loop.  Also, when I say *
below, I mean the usual calculation with a multiplication and a shift.

All of these are only valid if t_start <= read_time() <= t_end,
and they all assume that read_time() hasn't wrapped and gotten into
that interval again.  There is nothing at all we can do in software if
we wrap like this.  t_end isn't necessarily something we compute
explicitly --- it might just be the case that, if read_time() > t_end,
our arithmetic overflows and we return garbage.  But t_end might be a
real thing on architectures where vdso_cycles_ok() actually does
something (sigh, x86).

CLOCK_MONOTONIC_RAW: not affected by NTP, adjtimex, etc.
return mult[monotonic_raw] * (read_time() - t_start) + offset[monotonic_raw];

CLOCK_MONOTONIC:  This is never affected by leap-second smearing.  If
userspace tries to smear it in the new mode, userspace gets to keep
all the pieces.
return mult[monotonic] * (read_time() - t_start) + offset[monotonic];

CLOCK_TAI:  This is not smeared.
return mult[tai] * (read_time() - t_start) + offset[tai];

CLOCK_SANE_REALTIME: This is not smeared either.
return mult[sane_realtime] * (read_time() - t_start) + offset[sane_realtime];

And yes, we require that mult[monotonic] == mult[tai] == mult[sane_realtime].

CLOCK_SMEARED_REALTIME:
return mult[smeared_realtime] * (read_time() - t_start) + offset[smeared_realtime];
This is a leap-second-smeared variant of CLOCK_SANE_REALTIME.

CLOCK_REALTIME: maps to CLOCK_SANE_REALTIME or CLOCK_SMEARED_REALTIME
depending on user preference.  Doing this without an extra branch
somewhere might take a bit of thought.

If t > t_end, then we fall back to a syscall if we're in user mode and
we fall back to hypercall or we just spin if we're in the kernel.  But
see below.
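
In C-ish pseudo-code, the read side would look roughly like this (names are
made up and the necessary read barriers are omitted):

#include <stdint.h>

enum { CLK_MONO_RAW, CLK_MONO, CLK_TAI, CLK_REAL, NR_CLOCKS };

/* stand-in for RDTSC or whatever read_time() ends up being */
extern uint64_t read_time(void);

struct clock_data {
	uint32_t seq;		/* odd while the writer updates */
	uint32_t shift;
	uint64_t t_start, t_end;
	uint64_t mult[NR_CLOCKS];
	uint64_t offset[NR_CLOCKS];
};

/* Returns 0 and fills *ns, or -1 if read_time() falls outside
 * [t_start, t_end] and the caller has to fall back to a syscall or
 * hypercall. */
static int try_gettime(const volatile struct clock_data *d, int clk, uint64_t *ns)
{
	for (;;) {
		uint32_t seq = d->seq;
		uint64_t t;

		if (seq & 1)
			continue;	/* writer active, retry */

		t = read_time();
		if (t < d->t_start || t > d->t_end)
			return -1;

		*ns = (uint64_t)(((unsigned __int128)(t - d->t_start) * d->mult[clk])
				 >> d->shift) + d->offset[clk];

		if (seq == d->seq)
			return 0;
	}
}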

As far as I can tell, if the kernel were to do something utterly
asinine like adding some arbitrary value to TSC_ADJUST on all CPUs,
the kernel could do so correctly by taking the seqlock, making the
change, updating everything, and releasing the seqlock.  This would be
nuts, but it's more or less the same thing that happens when a VM
migrates.  So I think a VM could migrate a guest without any
particular magic, except that there's a potential race if the old and
new systems happen to have close enough seqlock values that the guest
might start reading on the old host, finish on the new host, see the
same seqlock value, and end up with utter garbage.  One way to
mitigate this would be, in paravirt mode, to have an extra per-guest
page that contains a count of how many times the guest has migrated.
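
The per-guest migration count could then simply be folded into the read
loop's exit condition.  Rough sketch, purely illustrative (the page
layout, the names, and the barriers would obviously need real thought):

        /* Hypothetical layout of the extra per-guest page mentioned above. */
        struct migration_info {
                u64 migration_count;    /* bumped by the host on every migration */
        };

        extern struct migration_info *mig_info;
        extern bool try_read_clock(const struct clock_data *d, int clk, u64 *ns);

        static bool read_clock_migration_safe(const struct clock_data *d, int clk,
                                              u64 *ns)
        {
                u64 mig;
                bool ok;

                do {
                        mig = READ_ONCE(mig_info->migration_count);
                        smp_rmb();
                        ok = try_read_clock(d, clk, ns);
                        smp_rmb();
                } while (READ_ONCE(mig_info->migration_count) != mig);

                return ok;
        }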

Timens would work a lot like it does today, but the mechanism that
tells the vdso code to use timens might need tweaking.

I could easily be missing something that prevents this from working,
but I'm not seeing any fundamental problems.

If we want to get fancy, we can make a change that I've contemplated
for a while -- we could make t_end explicit and have two copies of all
these data structures.  The reader would use one copy if t < t_change
and a different copy if t >= t_change.  This would allow NTP-like code
in usermode to schedule a frequency shift to start at a specific time.
With some care, it would also allow the timekeeping code to update the
data structures without causing clock_gettime() to block while the
timekeeping code is running on a different CPU.

One other thing that might be worth noting: there's another thread
about "vmgenid".  It's plausible that it's worth considering stopping
the guest or perhaps interrupting all vCPUs to allow it to take some
careful actions on migration for reasons that have nothing to do with
timekeeping.
Thomas Gleixner Dec. 9, 2020, 10:14 a.m. UTC | #38
Andy,

On Tue, Dec 08 2020 at 20:08, Andy Lutomirski wrote:
> On Tue, Dec 8, 2020 at 4:19 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> On Tue, Dec 08 2020 at 12:32, Andy Lutomirski wrote:
>> all the way through the end and then come up with a real proposal which
>> solves all of the issues mentioned there.
>
> You're misunderstanding me, which is entirely reasonable, since my
> description was crap.  In particular, what I meant by smearing is not
> at all what's done today.  Let me try again.  The thing below is my
> proposal, not necessarily a description of exactly what happens now.

Fair enough. /me rewinds grump and starts over.

> All of these are only valid if t_start <= read_time() <= t_end and,
> and they all assume that read_time() hasn't wrapped and gotten into
> that interval again.  There is nothing at all we can do in software if
> we wrap like this.  t_end isn't necessarily something we compute
> explicitly --- it might just be the case that, if read_time() > t_end,
> our arithmetic overflows and we return garbage.  But t_end might be a
> real thing on architectures where vdso_cycles_ok() actually does
> something (sigh, x86).
>
> If t > t_end, then we fall back to a syscall if we're in user mode and
> we fall back to hypercall or we just spin if we're in the kernel.  But
> see below.

Yes, we could do that.

> CLOCK_SMEARED_REALTIME:
> return mult[smeared_realtime] * (read_time() - t_start) +
> offset[smeared_realtime]
> This is a leap-second-smeared variant of CLOCK_SANE_REALTIME.
>
> CLOCK_REALTIME: maps to CLOCK_SANE_REALTIME or CLOCK_SMEARED_REALTIME
> depending on user preference.  Doing this without an extra branch
> somewhere might take a bit of thought.

Plus adding support for clock_*(CLOCK_DISTORTED_TIME) and making that work
correctly. Plus dealing with all the other interesting problems vs. file
time stamps and whatever. Time is all over the place and not just in
clock_gettime(CLOCK*).

But what's more problematic is the basic requirement that time all over
the place has to be consistent.

On machines which use DISTORTED_REALTIME everything _IS_ consistent
within the distorted universe they created. It's still inconsistent
vs. the outside, but that's unsolvable and none of our problems.

TLDR: Do not even think about opening pandoras box.

> As far as I can tell, if the kernel were to do something utterly
> asinine like adding some arbitrary value to TSC_ADJUST on all CPUs,
> the kernel could do so correctly by taking the seqlock, making the
> change, updating everything, and releasing the seqlock.

Plus a few other things, but yes that's similar to the scheme I
outlined. Using stomp_machine() ensures that _all_ possible ways to
wreck things are covered.

> This would be nuts, but it's more or less the same thing that happens
> when a VM migrates.  So I think a VM could migrate a guest without any
> particular magic, except that there's a potential race if the old and
> new systems happen to have close enough seqlock values that the guest
> might start reading on the old host, finish on the new host, see the
> same seqlock value, and end up with utter garbage.  One way to
> mitigate this would be, in paravirt mode, to have an extra per-guest
> page that contains a count of how many times the guest has migrated.
>
> Timens would work a lot like it does today, but the mechanism that
> tells the vdso code to use timens might need tweaking.
>
> I could easily be missing something that prevents this from working,
> but I'm not seeing any fundamental problems.

It can be made to work.

> If we want to get fancy, we can make a change that I've contemplated
> for a while -- we could make t_end explicit and have two copies of all
> these data structures.  The reader would use one copy if t < t_change
> and a different copy if t >= t_change.

See below.

> This would allow NTP-like code in usermode to schedule a frequency
> shift to start at a specific time.

That's an orthogonal problem and can be done without changing the
reader side.

> With some care, it would also allow the timekeeping code to update the
> data structures without causing clock_gettime() to block while the
> timekeeping code is running on a different CPU.

It still has to block, i.e. retry, because the data set becomes invalid
when t_end is reached. So the whole thing would be:

       do {
               seq = read_seqcount_latch();
               data = select_data(seq);
               delta = read_clocksource() - data->base;
               if (delta >= data->max_delta)
                       continue;
               ....
       } while (read_seqcount_latch_retry());

TBH, I like the idea for exactly one reason: It increases robustness.

For X86 we already have the comparison for dealing with TSC < base
which would be covered by

                if (delta >= data->max_delta)
                	continue;

automatically. Non X86 gains this extra conditional, but I think it's
worth paying that price.

It won't solve the VM migration problem on its own though. You still
have to be careful about the inner workings of everything related to
timekeeping itself.

> One other thing that might be worth noting: there's another thread
> about "vmgenid".  It's plausible that it's worth considering stopping
> the guest or perhaps interrupting all vCPUs to allow it to take some
> careful actions on migration for reasons that have nothing to do with
> timekeeping.

How surprising. Who could have thought about that?

OMG, virtualization seems to have gone off into a virtual reality long
ago.

Thanks,

        tglx
Marcelo Tosatti Dec. 9, 2020, 4:34 p.m. UTC | #39
On Tue, Dec 08, 2020 at 10:33:15PM +0100, Thomas Gleixner wrote:
> On Tue, Dec 08 2020 at 15:11, Marcelo Tosatti wrote:
> > On Tue, Dec 08, 2020 at 05:02:07PM +0100, Thomas Gleixner wrote:
> >> On Tue, Dec 08 2020 at 16:50, Maxim Levitsky wrote:
> >> > On Mon, 2020-12-07 at 20:29 -0300, Marcelo Tosatti wrote:
> >> >> > +This ioctl allows to reconstruct the guest's IA32_TSC and TSC_ADJUST value
> >> >> > +from the state obtained in the past by KVM_GET_TSC_STATE on the same vCPU.
> >> >> > +
> >> >> > +If 'KVM_TSC_STATE_TIMESTAMP_VALID' is set in flags,
> >> >> > +KVM will adjust the guest TSC value by the time that passed since the moment
> >> >> > +CLOCK_REALTIME timestamp was saved in the struct and current value of
> >> >> > +CLOCK_REALTIME, and set the guest's TSC to the new value.
> >> >> 
> >> >> This introduces the wraparound bug in Linux timekeeping, doesnt it?
> >> 
> >> Which bug?
> >
> > max_cycles overflow. Sent a message to Maxim describing it.
> 
> Truly helpful. Why the hell did you not talk to me when you ran into
> that the first time?

Because 

1) Users wanted CLOCK_BOOTTIME to stop counting while the VM 
is paused (so we wanted to stop the guest clock when the VM is paused anyway).

2) The solution of injecting NMIs into the guest seemed overly
complicated.

> >> For one I have no idea which bug you are talking about and if the bug is
> >> caused by the VMM then why would you "fix" it in the guest kernel.
> >
> > 1) Stop guest, save TSC value of cpu-0 = V.
> > 2) Wait for some amount of time = W.
> > 3) Start guest, load TSC value with V+W.
> >
> > Can cause an overflow on Linux timekeeping.
> 
> Yes, because you violate the basic assumption which Linux timekeeping
> makes. See the other mail in this thread.
> 
> Thanks,
> 
>         tglx
Thomas Gleixner Dec. 9, 2020, 8:58 p.m. UTC | #40
Marcelo,

On Wed, Dec 09 2020 at 13:34, Marcelo Tosatti wrote:
> On Tue, Dec 08, 2020 at 10:33:15PM +0100, Thomas Gleixner wrote:
>> On Tue, Dec 08 2020 at 15:11, Marcelo Tosatti wrote:
>> > max_cycles overflow. Sent a message to Maxim describing it.
>> 
>> Truly helpful. Why the hell did you not talk to me when you ran into
>> that the first time?
>
> Because 
>
> 1) Users wanted CLOCK_BOOTTIME to stop counting while the VM 
> is paused (so we wanted to stop guest clock when VM is paused anyway).

How is that supposed to work w/o the guest kernel's help if you have to
keep clock realtime up to date? 

> 2) The solution to inject NMIs to the guest seemed overly
> complicated.

Why do you need NMIs?

All you need is a way to communicate to the guest that it should prepare
for clock madness to happen. Whether that's an IPI or a bit in a
hyperpage which gets checked during the update of the guest timekeeping
does not matter at all.

But you certainly do not need an NMI because there is nothing useful you
can do within an NMI.

Thanks,

        tglx
Paolo Bonzini Dec. 10, 2020, 11:42 a.m. UTC | #41
On 07/12/20 18:41, Thomas Gleixner wrote:
> Right this happens still occasionally, but for quite some time this is
> 100% firmware sillyness and not a fundamental property of the hardware
> anymore.

It's still a fundamental property of old hardware.  Last time I tried to 
kill support for processors earlier than Core 2, I had to revert it. 
That's older than Nehalem.

>> We try to catch such situation in KVM instead of blowing up but
>> this may still result in subtle bugs I believe. Maybe we would be better
>> off killing all VMs in case TSC ever gets unsynced (by default).
> 
> I just ran a guest on an old machine with unsynchronized TSCs and was
> able to observe clock monotonic going backwards between two threads
> pinned on two vCPUs, which _is_ bad. Getting unsynced clocks reliably
> under control is extremly hard.

Using kvmclock?  (Half serious: perhaps a good reason to have per-vCPU 
offsets is to be able to test what happens with unsynchronized TSCs...).

Paolo
Paolo Bonzini Dec. 10, 2020, 11:48 a.m. UTC | #42
On 08/12/20 18:08, Maxim Levitsky wrote:
>> Even if you support TSCADJUST and let the guest write to it does not
>> change the per guest offset at all. TSCADJUST is per [v]CPU and adds on
>> top:
>>
>>      tscvcpu = tsc_host + guest_offset + TSC_ADJUST
>>
>> Scaling is just orthogonal and does not change any of this.
>
> I agree with this, and I think that this is what we will end up doing.
> Paulo, what do you think about this?

Yes, you can have a VM ioctl that saves/restores cur_tsc_nsec and 
cur_tsc_write.  The restore side would loop on all vcpus.

However, it is not so easy: 1) it would have to be usable only if 
KVM_X86_QUIRK_TSC_HOST_ACCESS is false, 2) it would fail if 
kvm->arch.nr_vcpus_matched_tsc == kvm->online_vcpus (which basically 
means that userspace didn't mess up the TSC configuration).  If not, it 
would return -EINVAL.

Also, while at it let's burn and pour salt on the support for 
KVM_SET_TSC_KHZ unless TSC scaling is supported, together with 
vcpu->tsc_catchup and all the "tolerance" crap that is in 
kvm_set_tsc_khz.  And initialize vcpu->arch.virtual_tsc_khz to 
kvm->arch.last_tsc_khz before calling kvm_synchronize_tsc.

Paolo
Paolo Bonzini Dec. 10, 2020, 11:48 a.m. UTC | #43
On 08/12/20 22:20, Thomas Gleixner wrote:
> 
> So now life migration comes a long time after timekeeping had set the
> limits and just because it's virt it expects that everything works and it
> just can ignore these limits.
> 
> TBH. That's not any different than SMM or hard/firmware taking the
> machine out for lunch. It's exactly the same: It's broken.

I agree.  If *live* migration stops the VM for 200 seconds, it's broken.

Sure, there's the case of snapshotting the VM over the weekend.  My 
favorite solution would be to just put it in S3 before doing that.  *Do 
what bare metal does* and you can't go that wrong.

In general it's userspace policy whether to keep the TSC value the same 
across live migration.  There's pros and cons to both approaches, so KVM 
should provide the functionality to keep the TSC running (which the 
guest will see as a very long, but not extreme SMI), and this is what 
this series does.  Maxim will change it to operate per-VM.  Thanks 
Thomas, Oliver and everyone else for the input.

Paolo
Peter Zijlstra Dec. 10, 2020, 12:14 p.m. UTC | #44
On Thu, Dec 10, 2020 at 12:42:36PM +0100, Paolo Bonzini wrote:
> On 07/12/20 18:41, Thomas Gleixner wrote:
> > Right this happens still occasionally, but for quite some time this is
> > 100% firmware sillyness and not a fundamental property of the hardware
> > anymore.
> 
> It's still a fundamental property of old hardware.  Last time I tried to
> kill support for processors earlier than Core 2, I had to revert it. That's
> older than Nehalem.

Core2 doesn't use TSC for timekeeping anyway. KVM shouldn't either.
Paolo Bonzini Dec. 10, 2020, 12:22 p.m. UTC | #45
On 10/12/20 13:14, Peter Zijlstra wrote:
> On Thu, Dec 10, 2020 at 12:42:36PM +0100, Paolo Bonzini wrote:
>> On 07/12/20 18:41, Thomas Gleixner wrote:
>>> Right this happens still occasionally, but for quite some time this is
>>> 100% firmware sillyness and not a fundamental property of the hardware
>>> anymore.
>>
>> It's still a fundamental property of old hardware.  Last time I tried to
>> kill support for processors earlier than Core 2, I had to revert it. That's
>> older than Nehalem.
> 
> Core2 doesn't use TSC for timekeeping anyway. KVM shouldn't either.

On Core2, KVM guests pass TSC through kvmclock in order to get something 
usable and not incredibly slow.

Paolo
Peter Zijlstra Dec. 10, 2020, 1:01 p.m. UTC | #46
On Thu, Dec 10, 2020 at 01:22:02PM +0100, Paolo Bonzini wrote:
> On 10/12/20 13:14, Peter Zijlstra wrote:
> > On Thu, Dec 10, 2020 at 12:42:36PM +0100, Paolo Bonzini wrote:
> > > On 07/12/20 18:41, Thomas Gleixner wrote:
> > > > Right this happens still occasionally, but for quite some time this is
> > > > 100% firmware sillyness and not a fundamental property of the hardware
> > > > anymore.
> > > 
> > > It's still a fundamental property of old hardware.  Last time I tried to
> > > kill support for processors earlier than Core 2, I had to revert it. That's
> > > older than Nehalem.
> > 
> > Core2 doesn't use TSC for timekeeping anyway. KVM shouldn't either.
> 
> On Core2, KVM guests pass TSC through kvmclock in order to get something
> usable and not incredibly slow.

Which is incredibly wrong.
Maxim Levitsky Dec. 10, 2020, 2:25 p.m. UTC | #47
On Thu, 2020-12-10 at 12:48 +0100, Paolo Bonzini wrote:
> On 08/12/20 18:08, Maxim Levitsky wrote:
> > > Even if you support TSCADJUST and let the guest write to it does not
> > > change the per guest offset at all. TSCADJUST is per [v]CPU and adds on
> > > top:
> > > 
> > >      tscvcpu = tsc_host + guest_offset + TSC_ADJUST
> > > 
> > > Scaling is just orthogonal and does not change any of this.
> > 
> > I agree with this, and I think that this is what we will end up doing.
> > Paulo, what do you think about this?
> 
> Yes, you can have a VM ioctl that saves/restores cur_tsc_nsec and 
> cur_tsc_write.  The restore side would loop on all vcpus.

Why would the restore need to run on all VCPUs though?

The picture that I have in mind is that we will restore the 
cur_tsc_nsec/cur_tsc_write pair once per VM, and then restore the 
TSC_ADJUST value on all vCPUs as if the guest wrote it (with the quirk disabled), 
which will restore all the horrible things that the guest did to its TSC.

Since TSC adjust is enabled by default, if the user disables it in the CPUID, 
we might as well just disable the whole thing, 
make TSC readonly or something similar.

This way we don't need to worry about TSC writes as 
those will be reflected in the TSC_ADJUST.
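
Concretely, the restore sequence on the destination would then look
roughly like this from userspace (a sketch only: KVM_SET_TSC_STATE is
the proposed per-VM ioctl, the quirk is assumed to be disabled,
MSR_IA32_TSC_ADJUST (0x3b) would need a local define, and
saved_tsc_adjust[] holds the per-vCPU values read on the source):

        struct kvm_msrs *msrs;
        int i;

        /* Once per VM: restore the nsec/TSC pair. */
        ioctl(vm_fd, KVM_SET_TSC_STATE, &tsc_state);

        /* Then replay TSC_ADJUST on every vCPU as if the guest wrote it. */
        msrs = calloc(1, sizeof(*msrs) + sizeof(struct kvm_msr_entry));
        msrs->nmsrs = 1;
        msrs->entries[0].index = MSR_IA32_TSC_ADJUST;

        for (i = 0; i < nr_vcpus; i++) {
                msrs->entries[0].data = saved_tsc_adjust[i];
                ioctl(vcpu_fd[i], KVM_SET_MSRS, msrs);
        }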

> 
> However, it is not so easy: 1) it would have to be usable only if 
> KVM_X86_QUIRK_TSC_HOST_ACCESS is false, 2) it would fail if 
> kvm->arch.nr_vcpus_matched_tsc == kvm->online_vcpus (which basically 
> means that userspace didn't mess up the TSC configuration).  If not, it 
> would return -EINVAL.

Note though that we don't track the guest tsc/tsc_adjust writes anymore
via the tsc sync code, and with the quirk disabled we don't track them
even for the host writes, thus the (2) condition will always be true
(the only call left to kvm_synchronize_tsc is in the kvm_arch_vcpu_postcreate)

However I indeed see no reason to allow the new per-VM API when the masterclock is
disabled, and therefore using cur_tsc_nsec/cur_tsc_write is reasonable.

The cur_tsc_nsec should IMHO be converted to absolute time (CLOCK_REALTIME
or similar), or should the conversion be done in userspace?
I don't know yet if I can convert between different POSIX clocks
in a race/error-free manner (e.g. get the offset between them).


> 
> Also, while at it let's burn and pour salt on the support for 
> KVM_SET_TSC_KHZ unless TSC scaling is supported, together with 
> vcpu->tsc_catchup and all the "tolerance" crap that is in 
> kvm_set_tsc_khz.  And initialize vcpu->arch.virtual_tsc_khz to 
> kvm->arch.last_tsc_khz before calling kvm_synchronize_tsc.

The tsc_catchup is enabled when the host TSC is unstable, so that the guest
TSC at least roughly follows real time (which the host kernel gets
by other means).

We push the guest tsc forward roughly each time a VM entry from userspace
happens:

(On each kvm_arch_vcpu_load, we raise a KVM_REQ_GLOBAL_CLOCK_UPDATE
request which (with a small delay) makes all vcpus go through
kvm_guest_time_update, which pushes the guest tsc forward.)

However we also have the 'tsc_always_catchup' mode which is indeed 
something I would like to remove.

It is a poor man's simulation of TSC scaling which is enabled
when the host doesn't support TSC scaling, but we were asked
to run the guest at a frequency faster than the host TSC frequency.

This way the guest tsc runs slower than it should, but on each VM exit
we push the guest tsc offset forward so that on average the guest tsc
doesn't lag behind.

Unless backward compatibility is an issue, I have no objections
to removing that code.

The tolerance thing might be needed. On many systems 
(including my new 3970X), the hardware doesn't have a means to report
the unscaled host TSC frequency, thus the kernel has to 
measure it, and since the measurement is not 100% accurate, a slightly
different value is used every time the host boots.

Thus without a small tolerance, we will always have to use TSC scaling,
when migrating even between two identical systems.
I don't know if this is an issue.


Best regards,
	Maxim Levitsky


> 
> Paolo
>
Maxim Levitsky Dec. 10, 2020, 2:52 p.m. UTC | #48
On Thu, 2020-12-10 at 12:48 +0100, Paolo Bonzini wrote:
> On 08/12/20 22:20, Thomas Gleixner wrote:
> > So now life migration comes a long time after timekeeping had set the
> > limits and just because it's virt it expects that everything works and it
> > just can ignore these limits.
> > 
> > TBH. That's not any different than SMM or hard/firmware taking the
> > machine out for lunch. It's exactly the same: It's broken.
> 
> I agree.  If *live* migration stops the VM for 200 seconds, it's broken.
> 
> Sure, there's the case of snapshotting the VM over the weekend.  My 
> favorite solution would be to just put it in S3 before doing that.  *Do 
> what bare metal does* and you can't go that wrong.

Note though that qemu has a couple of issues with s3, and it is disabled 
by default in libvirt. 
I would be very happy to work on improving this if there is a need for that.


> 
> In general it's userspace policy whether to keep the TSC value the same 
> across live migration.  There's pros and cons to both approaches, so KVM 
> should provide the functionality to keep the TSC running (which the 
> guest will see as a very long, but not extreme SMI), and this is what 
> this series does.  Maxim will change it to operate per-VM.  Thanks 
> Thomas, Oliver and everyone else for the input.

I agree with that.
 
I still think though that we should have a discussion on the feasibility
of making the kernel time code deal with large *forward* tsc jumps 
without crashing.

If that is indeed hard to do, or will cause performance issues,
then I agree that we might indeed inform the guest of time jumps instead.
 
In fact kvmclock already has such a mechanism (the KVM_KVMCLOCK_CTRL ioctl, which sets
the PVCLOCK_GUEST_STOPPED bit in the PV clock struct).
That informs the guest that it was stopped (the guest clears this bit),
and currently that makes the guest touch various watchdogs.
 
I think that the guest uses it only when kvmclock is used, but
we can think about extending this to make the guest use it
even when the bare tsc is used, and also implement whatever logic is
needed to jump the guest clock forward when this bit is set.
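
For reference, the guest side of such a check could look roughly like
the existing PVCLOCK_GUEST_STOPPED handling (the struct and flag come
from the pvclock ABI; the "jump the clock forward" part is the piece
that would have to be invented):

        #include <asm/pvclock-abi.h>

        static bool guest_was_stopped(struct pvclock_vcpu_time_info *src)
        {
                if (!(src->flags & PVCLOCK_GUEST_STOPPED))
                        return false;

                /* The guest owns clearing the bit, as it does today. */
                src->flags &= ~PVCLOCK_GUEST_STOPPED;

                /* Today: touch watchdogs.  Possibly: adjust the clocks. */
                return true;
        }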

What do you think?

Best regards,
	Maxim Levitsky

> 
> Paolo
>
Andy Lutomirski Dec. 10, 2020, 3:16 p.m. UTC | #49
> On Dec 10, 2020, at 6:52 AM, Maxim Levitsky <mlevitsk@redhat.com> wrote:
> 
> On Thu, 2020-12-10 at 12:48 +0100, Paolo Bonzini wrote:
>>> On 08/12/20 22:20, Thomas Gleixner wrote:
>>> So now life migration comes a long time after timekeeping had set the
>>> limits and just because it's virt it expects that everything works and it
>>> just can ignore these limits.
>>> 
>>> TBH. That's not any different than SMM or hard/firmware taking the
>>> machine out for lunch. It's exactly the same: It's broken.
>> 
>> I agree.  If *live* migration stops the VM for 200 seconds, it's broken.
>> 
>> Sure, there's the case of snapshotting the VM over the weekend.  My 
>> favorite solution would be to just put it in S3 before doing that.  *Do 
>> what bare metal does* and you can't go that wrong.
> 
> Note though that qemu has a couple of issues with s3, and it is disabled 
> by default in libvirt. 
> I would be very happy to work on improving this if there is a need for that.

There’s also the case where someone has a VM running on a laptop and someone closes the lid. The host QEMU might not have a chance to convince the guest to enter S3.

> 
> 
>> 
>> In general it's userspace policy whether to keep the TSC value the same 
>> across live migration.  There's pros and cons to both approaches, so KVM 
>> should provide the functionality to keep the TSC running (which the 
>> guest will see as a very long, but not extreme SMI), and this is what 
>> this series does.  Maxim will change it to operate per-VM.  Thanks 
>> Thomas, Oliver and everyone else for the input.
> 
> I agree with that.
> 
> I still think though that we should have a discussion on feasibility
> of making the kernel time code deal with large *forward* tsc jumps 
> without crashing.
> 
> If that is indeed hard to do, or will cause performance issues,
> then I agree that we might indeed inform the guest of time jumps instead.
> 

Tglx, even without fancy shared host/guest timekeeping, could the guest kernel manage to update its timekeeping if the host sent the guest an interrupt or NMI on all CPUs synchronously on resume?

Alternatively, if we had the explicit “max TSC value that makes sense right now” in the timekeeping data, the guest would reliably notice the large jump and could at least do something intelligent about it instead of overflowing its internal calculation.
Marcelo Tosatti Dec. 10, 2020, 3:26 p.m. UTC | #50
On Wed, Dec 09, 2020 at 09:58:23PM +0100, Thomas Gleixner wrote:
> Marcelo,
> 
> On Wed, Dec 09 2020 at 13:34, Marcelo Tosatti wrote:
> > On Tue, Dec 08, 2020 at 10:33:15PM +0100, Thomas Gleixner wrote:
> >> On Tue, Dec 08 2020 at 15:11, Marcelo Tosatti wrote:
> >> > max_cycles overflow. Sent a message to Maxim describing it.
> >> 
> >> Truly helpful. Why the hell did you not talk to me when you ran into
> >> that the first time?
> >
> > Because 
> >
> > 1) Users wanted CLOCK_BOOTTIME to stop counting while the VM 
> > is paused (so we wanted to stop guest clock when VM is paused anyway).
> 
> How is that supposed to work w/o the guest kernels help if you have to
> keep clock realtime up to date? 

Upon VM resume, we notify the NTP daemon in the guest to sync the realtime
clock.
> 
> > 2) The solution to inject NMIs to the guest seemed overly
> > complicated.
> 
> Why do you need NMIs?
> 
> All you need is a way to communicate to the guest that it should prepare
> for clock madness to happen. Whether that's an IPI or a bit in a
> hyperpage which gets checked during the update of the guest timekeeping
> does not matter at all.
> 
> But you certainly do not need an NMI because there is nothing useful you
> can do within an NMI.
> 
> Thanks,
> 
>         tglx
Oliver Upton Dec. 10, 2020, 5:59 p.m. UTC | #51
On Thu, Dec 10, 2020 at 9:16 AM Andy Lutomirski <luto@amacapital.net> wrote:
>
>
>
> > On Dec 10, 2020, at 6:52 AM, Maxim Levitsky <mlevitsk@redhat.com> wrote:
> >
> > On Thu, 2020-12-10 at 12:48 +0100, Paolo Bonzini wrote:
> >>> On 08/12/20 22:20, Thomas Gleixner wrote:
> >>> So now life migration comes a long time after timekeeping had set the
> >>> limits and just because it's virt it expects that everything works and it
> >>> just can ignore these limits.
> >>>
> >>> TBH. That's not any different than SMM or hard/firmware taking the
> >>> machine out for lunch. It's exactly the same: It's broken.
> >>
> >> I agree.  If *live* migration stops the VM for 200 seconds, it's broken.
> >>
> >> Sure, there's the case of snapshotting the VM over the weekend.  My
> >> favorite solution would be to just put it in S3 before doing that.  *Do
> >> what bare metal does* and you can't go that wrong.
> >
> > Note though that qemu has a couple of issues with s3, and it is disabled
> > by default in libvirt.
> > I would be very happy to work on improving this if there is a need for that.
>
> There’s also the case where someone has a VM running on a laptop and someone closes the lid. The host QEMU might not have a chance to convince the guest to enter S3.
>
> >
> >
> >>
> >> In general it's userspace policy whether to keep the TSC value the same
> >> across live migration.  There's pros and cons to both approaches, so KVM
> >> should provide the functionality to keep the TSC running (which the
> >> guest will see as a very long, but not extreme SMI), and this is what
> >> this series does.  Maxim will change it to operate per-VM.  Thanks
> >> Thomas, Oliver and everyone else for the input.

So, to be clear, this per-VM ioctl will work something like the following:

static u64 kvm_read_tsc_base(struct kvm *kvm, u64 host_tsc)
{
        return kvm_scale_tsc(kvm, host_tsc) + kvm->host_tsc_offset;
}

case KVM_GET_TSC_BASE:
        struct kvm_tsc_base base = {
                .flags = KVM_TSC_BASE_TIMESTAMP_VALID,
        };
        u64 host_tsc;

        kvm_get_walltime(&base.nsec, &host_tsc);
        base.tsc = kvm_read_tsc_base(kvm, host_tsc);

        copy_to_user(...);

[...]

case KVM_SET_TSC_BASE:
        struct kvm_tsc_base base;
        u64 host_tsc, nsec;
        s64 delta = 0;

        copy_from_user(...);

        kvm_get_walltime(&nsec, &host_tsc);
        delta += base.tsc - kvm_read_tsc_base(kvm, host_tsc);

        if (base.flags & KVM_TSC_BASE_TIMESTAMP_VALID) {
                s64 delta_nsec = nsec - base.nsec;

                if (delta_nsec >= 0)
                        delta += nsec_to_cycles(kvm, delta_nsec);
                else
                        delta -= nsec_to_cycles(kvm, -delta_nsec);
        }

        kvm->host_tsc_offset += delta;
        /* plumb host_tsc_offset through to each vcpu */

However, I don't believe we can assume the guest's TSCs to be synchronized,
even if sane guests will never touch them. In this case, I think a per-vCPU
ioctl is still warranted, allowing userspace to get at the guest CPU adjust
component of Thomas' equation below (paraphrased):

        TSC guest CPU = host tsc base + guest base offset + guest CPU adjust

Alternatively, a write from userspace to the guest's IA32_TSC_ADJUST with
KVM_X86_QUIRK_TSC_HOST_ACCESS could have the same effect, but that seems to be
problematic for a couple reasons. First, depending on the guest's CPUID the
TSC_ADJUST MSR may not even be available, meaning that the guest could've used
IA32_TSC to adjust the TSC (eww). Second, userspace replaying writes to IA32_TSC
(in the case IA32_TSC_ADJUST doesn't exist for the guest) seems _very_
unlikely to work given all the magic handling that KVM does for
writes to it.

Is this roughly where we are or have I entirely missed the mark? :-)

--
Thanks,
Oliver

> >
> > I agree with that.
> >
> > I still think though that we should have a discussion on feasibility
> > of making the kernel time code deal with large *forward* tsc jumps
> > without crashing.
> >
> > If that is indeed hard to do, or will cause performance issues,
> > then I agree that we might indeed inform the guest of time jumps instead.
> >
>
> Tglx, even without fancy shared host/guest timekeeping, count the guest kernel manage to update its timekeeping if the host sent the guest an interrupt or NMI on all CPUs synchronously on resume?
>
> Alternatively, if we had the explicit “max TSC value that makes sense right now” in the timekeeping data, the guest would reliably notice the large jump and could at least do something intelligent about it instead of overflowing its internal calculation.
Paolo Bonzini Dec. 10, 2020, 6:05 p.m. UTC | #52
On 10/12/20 18:59, Oliver Upton wrote:
> However, I don't believe we can assume the guest's TSCs to be synchronized,
> even if sane guests will never touch them. In this case, I think a per-vCPU
> ioctl is still warranted, allowing userspace to get at the guest CPU adjust
> component of Thomas' equation below (paraphrased):
> 
>          TSC guest CPU = host tsc base + guest base offset + guest CPU adjust

Right now that would be:

- KVM_GET_TSC_STATE (vm) returns host tsc base + guest base offset (plus 
the associated time)

- KVM_GET_MSR *without* KVM_X86_QUIRK_TSC_HOST_ACCESS for guest CPU adjust

and the corresponding SET ioctls.  What am *I* missing?

> Alternatively, a write from userspace to the guest's IA32_TSC_ADJUST with
> KVM_X86_QUIRK_TSC_HOST_ACCESS could have the same effect, but that seems to be
> problematic for a couple reasons. First, depending on the guest's CPUID the
> TSC_ADJUST MSR may not even be available, meaning that the guest could've used
> IA32_TSC to adjust the TSC (eww).

Indeed, the host should always be able to read/write IA32_TSC and 
IA32_TSC_ADJUST.

Thanks,

Paolo

> Second, userspace replaying writes to IA32_TSC
> (in the case IA32_TSC_ADJUST doesn't exist for the guest) seems_very_
> unlikely to work given all the magic handling that KVM does for
> writes to it.
> 
> Is this roughly where we are or have I entirely missed the mark? :-)
Oliver Upton Dec. 10, 2020, 6:13 p.m. UTC | #53
On Thu, Dec 10, 2020 at 12:05 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 10/12/20 18:59, Oliver Upton wrote:
> > However, I don't believe we can assume the guest's TSCs to be synchronized,
> > even if sane guests will never touch them. In this case, I think a per-vCPU
> > ioctl is still warranted, allowing userspace to get at the guest CPU adjust
> > component of Thomas' equation below (paraphrased):
> >
> >          TSC guest CPU = host tsc base + guest base offset + guest CPU adjust
>
> Right now that would be:
>
> - KVM_GET_TSC_STATE (vm) returns host tsc base + guest base offset (plus
> the associated time)
>
> - KVM_GET_MSR *without* KVM_X86_QUIRK_TSC_HOST_ACCESS for guest CPU adjust
>
> and the corresponding SET ioctls.  What am *I* missing?
>
> > Alternatively, a write from userspace to the guest's IA32_TSC_ADJUST with
> > KVM_X86_QUIRK_TSC_HOST_ACCESS could have the same effect, but that seems to be
> > problematic for a couple reasons. First, depending on the guest's CPUID the
> > TSC_ADJUST MSR may not even be available, meaning that the guest could've used
> > IA32_TSC to adjust the TSC (eww).
>
> Indeed, the host should always be able to read/write IA32_TSC and
> IA32_TSC_ADJUST.

So long as it is guaranteed that guest manipulations of IA32_TSC are
reflected in IA32_TSC_ADJUST even if it isn't in the guest's CPUID,
then this seems OK. I think having clear documentation on this subject
is also necessary, as we're going to rely on the combination of
KVM_{GET,SET}_TSC_STATE, disabling KVM_X86_QUIRK_TSC_HOST_ACCESS, and
userspace reading/writing a possibly hidden MSR to pull this off
right.

--
Thanks,
Oliver

> Thanks,
>
> Paolo
>
> > Second, userspace replaying writes to IA32_TSC
> > (in the case IA32_TSC_ADJUST doesn't exist for the guest) seems _very_
> > unlikely to work given all the magic handling that KVM does for
> > writes to it.
> >
> > Is this roughly where we are or have I entirely missed the mark? :-)
>
Thomas Gleixner Dec. 10, 2020, 8:20 p.m. UTC | #54
On Thu, Dec 10 2020 at 14:01, Peter Zijlstra wrote:
> On Thu, Dec 10, 2020 at 01:22:02PM +0100, Paolo Bonzini wrote:
>> On 10/12/20 13:14, Peter Zijlstra wrote:
>> > On Thu, Dec 10, 2020 at 12:42:36PM +0100, Paolo Bonzini wrote:
>> > > On 07/12/20 18:41, Thomas Gleixner wrote:
>> > > > Right this happens still occasionally, but for quite some time this is
>> > > > 100% firmware sillyness and not a fundamental property of the hardware
>> > > > anymore.
>> > > 
>> > > It's still a fundamental property of old hardware.  Last time I tried to
>> > > kill support for processors earlier than Core 2, I had to revert it. That's
>> > > older than Nehalem.
>> > 
>> > Core2 doesn't use TSC for timekeeping anyway. KVM shouldn't either.
>> 
>> On Core2, KVM guests pass TSC through kvmclock in order to get something
>> usable and not incredibly slow.
>
> Which is incredibly wrong.

Core2 is really not something which should prevent making all of this
correct and robust. That'd be not only wrong, that'd be outright insane.

Thanks,

        tglx
Thomas Gleixner Dec. 10, 2020, 9:25 p.m. UTC | #55
Andy,

On Thu, Dec 10 2020 at 07:16, Andy Lutomirski wrote:
>> On Dec 10, 2020, at 6:52 AM, Maxim Levitsky <mlevitsk@redhat.com> wrote:
>> On Thu, 2020-12-10 at 12:48 +0100, Paolo Bonzini wrote:
>>>> On 08/12/20 22:20, Thomas Gleixner wrote:
>>>> So now life migration comes a long time after timekeeping had set the
>>>> limits and just because it's virt it expects that everything works and it
>>>> just can ignore these limits.
>>>> 
>>>> TBH. That's not any different than SMM or hard/firmware taking the
>>>> machine out for lunch. It's exactly the same: It's broken.
>>> 
>>> I agree.  If *live* migration stops the VM for 200 seconds, it's broken.

I'm glad we are on the same page here.

>>> Sure, there's the case of snapshotting the VM over the weekend.  My 
>>> favorite solution would be to just put it in S3 before doing that.  *Do 
>>> what bare metal does* and you can't go that wrong.

:)

>> Note though that qemu has a couple of issues with s3, and it is disabled 
>> by default in libvirt. 
>> I would be very happy to work on improving this if there is a need for that.
>
> There’s also the case where someone has a VM running on a laptop and
> someone closes the lid. The host QEMU might not have a chance to
> convince the guest to enter S3.

But the host kernel can do something sensible before going off into lala
land. It knows that it is about to do that and it knows that there are
guests running.

>> I still think though that we should have a discussion on feasibility
>> of making the kernel time code deal with large *forward* tsc jumps 
>> without crashing.

I'm not opposed against that as I said before.
 
>> If that is indeed hard to do, or will cause performance issues,
>> then I agree that we might indeed inform the guest of time jumps instead.
>> 
>
> Tglx, even without fancy shared host/guest timekeeping, count the
> guest kernel manage to update its timekeeping if the host sent the
> guest an interrupt or NMI on all CPUs synchronously on resume?

Telling it before it takes a nap is simpler and does not require an NMI,
which is horrible anyway because we can't do much in the NMI and
scheduling irq_work from NMI does not help either because the guest
could be in the middle of ... timekeeping. See below.

> Alternatively, if we had the explicit “max TSC value that makes sense
> right now” in the timekeeping data, the guest would reliably notice
> the large jump and could at least do something intelligent about it
> instead of overflowing its internal calculation.

Yes. We can do that and we should do that for robustness sake.

But, there is more than the robustness problem on the reader side, which
is trivial as we discussed in the other part of this thread already.

There are also problems in the timekeeping core itself, various aspects of
which need a very close look.

So there are various things to solve:

   1) Readerside delta limit

      Trivial to provide and trivial to implement for the VDSO because
      the VDSO can just loop forever.

      Not so trivial for kernel side usage due to the fact that being
      caught in a read can prevent timekeeping from being updated.

      Hint: NOHZ entry path. That would simply livelock and never reach
      the timekeeping code. Also any interrupt disabled region can cause
      that.

      I looked into using 128 bit math as well, but that only works for
      wide clock sources like TSC and needs to be conditional on 64bit
      as 32bit would really suffer badly in the hotpath and even on
      64bit it's measurable.

      So we could keep 64bit math, use the limit and if the delta is
      larger than the limit take a slowpath which does wider math (a
      rough sketch follows after this list).

      But that still needs thought about clocksources with a smaller
      counter width and therefore a fast wraparound time.

      There is another issue with larger deltas. The extrapolation might
      be off and then cause other side effects like observable time
      going backwards etc.

      That has to be analyzed together with the writer side because
      that's where we need to ensure the continuity/monotonicity etc.

   2) Writer side issues

      The core timekeeping code is pretty robust against large deltas
      already, but there are some limitations nevertheless and it was
      obviously not designed to be taken out in the middle of the
      updates. Haven't wrapped my head around that yet.

      But there is more than the timekeeper update, there are other
      things like NTP, PTP and consumers of the more raw timekeeping
      internals which might get surprised by large deltas and run into
      similar problems because they were not designed to deal with that.
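
Coming back to #1, the 64bit fast path with a wide-math slowpath would
roughly look like this (names made up for illustration; the slowpath is
obviously 64bit only):

        static inline u64 delta_to_ns(u64 delta, u32 mult, u32 shift,
                                      u64 max_delta)
        {
                if (likely(delta < max_delta))
                        return (delta * mult) >> shift;

                /* Slowpath: do the multiplication in 128 bits */
                return (u64)(((unsigned __int128)delta * mult) >> shift);
        }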

So yes, it can and should be done, but it's not a project for Friday
afternoon and it's going to be hard to backport. I know that distro
people do not care because another 500 patches on their frankenkernels
are just noise, but it leaves everybody else out in the dark and I have
zero interest in proliferating that.

I'm still convinced that a notification about 'we take a nap' will be
more robust, less complex and more trivial to backport.

Both live migration and suspend on the host know it upfront, which means
that not using this knowledge and instead trying to cure the
symptom is violating basic engineering principles and is TBH outright
stupid.

As I said before, we have most of the bits and pieces in place and I'm
sure we can come up with an even simpler solution than the one I outlined
before, and once that is solved (or in parallel) make the timekeeping
more robust.

Thanks,

        tglx
Thomas Gleixner Dec. 10, 2020, 9:48 p.m. UTC | #56
On Thu, Dec 10 2020 at 12:26, Marcelo Tosatti wrote:
> On Wed, Dec 09, 2020 at 09:58:23PM +0100, Thomas Gleixner wrote:
>> Marcelo,
>> 
>> On Wed, Dec 09 2020 at 13:34, Marcelo Tosatti wrote:
>> > On Tue, Dec 08, 2020 at 10:33:15PM +0100, Thomas Gleixner wrote:
>> >> On Tue, Dec 08 2020 at 15:11, Marcelo Tosatti wrote:
>> >> > max_cycles overflow. Sent a message to Maxim describing it.
>> >> 
>> >> Truly helpful. Why the hell did you not talk to me when you ran into
>> >> that the first time?
>> >
>> > Because 
>> >
>> > 1) Users wanted CLOCK_BOOTTIME to stop counting while the VM 
>> > is paused (so we wanted to stop guest clock when VM is paused anyway).
>> 
>> How is that supposed to work w/o the guest kernels help if you have to
>> keep clock realtime up to date? 
>
> Upon VM resume, we notify NTP daemon in the guest to sync realtime
> clock.

Brilliant. What happens if there is no NTP daemon? What happens if the
NTP daemon is not part of the virt orchestration magic and cannot be
notified? Then it will notice the time jump only after the next update
interval.

What about correctness?

ALL CLOCK_* clocks stop and, when the VM is resumed, resume at the point
where they stopped.

So up to the point where NTP catches up and corrects clock realtime and
TAI other processes can observe that time jumped in the outside world,
e.g. via a network packet or whatever, but there is no reason why time
should have jumped outside vs. the local one.

You really all live in a separate universe, creating your own rules for how
things which other people work hard to get correct can be screwed
over.

Of course this all is nowhere documented in detail. At least a quick
search with about 10 different keyword combinations revealed absolutely
nothing.

This features-first, correctness-later frenzy is insane and it had better
stop now before you pile even more crap on the existing steaming pile
of insanities.

Thanks,

        tglx
Andy Lutomirski Dec. 10, 2020, 10:01 p.m. UTC | #57
On Thu, Dec 10, 2020 at 1:25 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Andy,
>

>
> I'm still convinced that a notification about 'we take a nap' will be
> more robust, less complex and more trivial to backport.

What do you have in mind?  Suppose the host kernel sends the guest an
interrupt on all vCPUs saying "I'm about to take a nap".  What happens
if the guest is busy with IRQs off for a little bit?  Does the host
guarantee the guest a certain amount of time to try to get the
interrupt delivered before allowing the host to enter S3?  How about
if the host wants to reboot for a security fix -- how long is a guest
allowed to delay the process?

I'm sure this can all be made to work 99% of time, but I'm a bit
concerned about that last 1%.

--Andy
Thomas Gleixner Dec. 10, 2020, 10:28 p.m. UTC | #58
On Thu, Dec 10 2020 at 14:01, Andy Lutomirski wrote:
> On Thu, Dec 10, 2020 at 1:25 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> I'm still convinced that a notification about 'we take a nap' will be
>> more robust, less complex and more trivial to backport.
>
> What do you have in mind?  Suppose the host kernel sends the guest an
> interrupt on all vCPUs saying "I'm about to take a nap".  What happens
> if the guest is busy with IRQs off for a little bit?  Does the host
> guarantee the guest a certain about of time to try to get the
> interrupt delivered before allowing the host to enter S3?  How about
> if the host wants to reboot for a security fix -- how long is a guest
> allowed to delay the process?
>
> I'm sure this can all be made to work 99% of time, but I'm a bit
> concerned about that last 1%.

Seriously?

If the guest has interrupts disabled for ages, i.e. it went out for
lunch on its own, then surely the hypervisor can just pull the plug and
wreck it. It's like you hit the reset button or pull the power plug of
a machine which is not responding anymore.

Reboot already waits today for guests to shut down/hibernate/suspend or
whatever they are supposed to do. systemd sits there and waits for
minutes until it decides to kill them. Just crash a guest kernel and
forget to reset or force power off the guest before you reboot the
host. Twiddle thumbs for a while and watch the incomprehensible time
display.

If your security fix reboot is so urgent that it can't wait then just
pull the plug and be done with it, i.e. kill the guest which makes it
start from a known state which is a gazillion times better than bringing
it into a state which it can't handle anymore.

Again, that's not any different than hitting the reset button on the
host or pulling and reinserting the host power plug, which you would do
anyway in an emergency case.

Can we please focus on real problems instead of making up new ones?

Correctness of time is a real problem despite the belief of virt folks
that it can be ignored or duct-taped to death.

Thanks,

        tglx
Andy Lutomirski Dec. 10, 2020, 11:19 p.m. UTC | #59
> On Dec 10, 2020, at 2:28 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Thu, Dec 10 2020 at 14:01, Andy Lutomirski wrote:
>>> On Thu, Dec 10, 2020 at 1:25 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>>> I'm still convinced that a notification about 'we take a nap' will be
>>> more robust, less complex and more trivial to backport.
>> 
>> What do you have in mind?  Suppose the host kernel sends the guest an
>> interrupt on all vCPUs saying "I'm about to take a nap".  What happens
>> if the guest is busy with IRQs off for a little bit?  Does the host
>> guarantee the guest a certain about of time to try to get the
>> interrupt delivered before allowing the host to enter S3?  How about
>> if the host wants to reboot for a security fix -- how long is a guest
>> allowed to delay the process?
>> 
>> I'm sure this can all be made to work 99% of time, but I'm a bit
>> concerned about that last 1%.
> 
> Seriously?
> 
> If the guest has interrupts disabled for ages, i.e. it went for out for
> lunch on its own, then surely the hypervisor can just pull the plug and
> wreckage it. It's like you hit the reset button or pull the powerplug of
> the machine which is not responding anymore.
> 
> Reboot waits already today for guests to shut down/hibernate/supsend or
> whatever they are supposed to do. systemd sits there and waits for
> minutes until it decides to kill them. Just crash a guest kernel and
> forget to reset or force power off the guest before you reboot the
> host. Twiddle thumbs for a while and watch the incomprehensible time
> display.
> 
> If your security fix reboot is so urgent that it can't wait then just
> pull the plug and be done with it, i.e. kill the guest which makes it
> start from a known state which is a gazillion times better than bringing
> it into a state which it can't handle anymore.
> 
> Again, that's not any different than hitting the reset button on the
> host or pulling and reinserting the host powerplug which you would do
> anyway in an emergency case.
> 
> Can we please focus on real problems instead of making up new ones?
> 
> Correctness of time is a real problem despite the believe of virt folks
> that it can be ignored or duct taped to death.
> 

I’m fine with this as long as it’s intentional. If we say “guest timekeeping across host suspend is correct because we notify the guest”, then we have a hole. But if we say “the host will try to notify the guest, and if the guest is out to lunch then the host reserves the right to suspend without waiting, and the guest should deal with this”, then okay.
Andy Lutomirski Dec. 10, 2020, 11:42 p.m. UTC | #60
> On Dec 9, 2020, at 2:14 AM, Thomas Gleixner <tglx@linutronix.de> wrote:


>
> But what's more problematic is the basic requirement that time all over
> the place has to be consistent.
>
> On machines which use DISTORTED_REALTIME everything _IS_ consistent
> within the distorted universe they created. It's still inconsistent
> vs. the outside, but that's unsolvable and none of our problems.
>
> TLDR: Do not even think about opening pandoras box.

This could be a per-machine/per-vm setting that nonetheless uses the
same underlying implementation. There are, sadly, lots of people using
smeared time, and there will probably be VM hosts that simultaneously
have both styles of guest. Supporting full PV time for both would be
nice.  Obviously this gets a bit screwy if they are using a paravirt
fs, but it’s also a problem with NFS, etc. So maybe the nasty corners
could be narrow enough to just say “don’t do that”.

>
>> If we want to get fancy, we can make a change that I've contemplated
>> for a while -- we could make t_end explicit and have two copies of all
>> these data structures.  The reader would use one copy if t < t_change
>> and a different copy if t >= t_change.
>
> See below.
>
>> This would allow NTP-like code in usermode to schedule a frequency
>> shift to start at a specific time.
>
> That's an orthogonal problem and can be done without changing the
> reader side.

Really?  Right now, the switch happens whenever the kernel takes the
seqlock, which it doesn’t have exact control over. But I meant
something a little different:

>
>> With some care, it would also allow the timekeeping code to update the
>> data structures without causing clock_gettime() to block while the
>> timekeeping code is running on a different CPU.
>
> It still has to block, i.e. retry, because the data set becomes invalid
> when t_end is reached. So the whole thing would be:
>
>       do {
>               seq = read_seqcount_latch();
>                data = select_data(seq);
>                delta = read_clocksource() - data->base;
>                if (delta >= data->max_delta)
>                    continue;
>                ....
>      } while (read_seqcount_latch_retry());
>
> TBH, I like the idea for exactly one reason: It increases robustness.

I like the max_delta for robustness, too.

What do you have in mind for select_data()?  Is the idea that
select_data() returns something like &data[seq & 1]?

But I actually meant something a little bit different: you would use
delta >= data->max_delta as an indication that you need to look at the
other copy.  Perhaps the lower two bits of the seqcount would work
like:

00: both copies are valid, but start with the first copy.
10: only the first copy is valid.
01: both copies are valid, but start with the second copy.
11: only the second copy is valid

You'd read it like this (with all the bugs that I surely have fixed):

do {
  seq = read_seqcount();
  data = &data_array[seq & 1];
  clock = read_clocksource();
  delta = clock - data->base;
  if (delta >= data->max_delta) {
    if (seq & 2)
      continue;
    data = &data_array[(seq + 1) & 1];  // <-- the other copy
    delta = clock - data->base;
    if (delta >= data->max_delta)
      continue;
  }
  ...;
} while (seq != read_seqcount());

This has two main benefits.  First, it allows the timekeeping code to
run concurrently with readers, which is nice for tail latency --
readers would only ever need to spin if the timekeeper falls behind,
intentionally disables both copies, or somehow manages to run one
iteration for each reader attempt and livelocks the reader.  The
latter is very unlikely.  Second, it allows the timekeeping code to
literally schedule an update to occur at a precise clocksource tick,
which seems to be like it could make the timekeeping code simpler and
more robust.

(If the timekeeper wants to simultaneously disable both copies, it
sets one copy's max_delta to zero and uses seq to disable the other
copy.)

--Andy






>
> For X86 we already have the comparison for dealing with TSC < base
> which would be covered by
>
>                if (delta >= data->max_delta)
>                    continue;
>
> automatically. Non X86 gains this extra conditional, but I think it's
> worth to pay that price.
>
> It won't solve the VM migration problem on it's own though. You still
> have to be careful about the inner workings of everything related to
> timekeeping itself.
>
>> One other thing that might be worth noting: there's another thread
>> about "vmgenid".  It's plausible that it's worth considering stopping
>> the guest or perhaps interrupting all vCPUs to allow it to take some
>> careful actions on migration for reasons that have nothing to do with
>> timekeeping.
>
> How surprising. Who could have thought about that?
>
> OMG, virtualization seems to have gone off into a virtual reality long
> ago.
>
> Thanks,
>
>        tglx
>
Thomas Gleixner Dec. 11, 2020, 12:03 a.m. UTC | #61
On Thu, Dec 10 2020 at 15:19, Andy Lutomirski wrote:
>> On Dec 10, 2020, at 2:28 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> Can we please focus on real problems instead of making up new ones?
>> 
>> Correctness of time is a real problem despite the believe of virt folks
>> that it can be ignored or duct taped to death.
>> 
> I’m fine with this as long as it’s intentional. If we say “guest
> timekeeping across host suspend is correct because we notify the
> guest”, then we have a hole. But if we say “the host will try to
> notify the guest, and if the guest is out to lunch then the host
> reserves the right to suspend without waiting, and the guest should
> deal with this”, then okay.

Yes of course this would be intentional and documented behaviour, which
is an infinite improvement over the current situation.

Thanks,

        tglx
Marcelo Tosatti Dec. 11, 2020, 12:27 a.m. UTC | #62
On Thu, Dec 10, 2020 at 10:48:10PM +0100, Thomas Gleixner wrote:
> On Thu, Dec 10 2020 at 12:26, Marcelo Tosatti wrote:
> > On Wed, Dec 09, 2020 at 09:58:23PM +0100, Thomas Gleixner wrote:
> >> Marcelo,
> >> 
> >> On Wed, Dec 09 2020 at 13:34, Marcelo Tosatti wrote:
> >> > On Tue, Dec 08, 2020 at 10:33:15PM +0100, Thomas Gleixner wrote:
> >> >> On Tue, Dec 08 2020 at 15:11, Marcelo Tosatti wrote:
> >> >> > max_cycles overflow. Sent a message to Maxim describing it.
> >> >> 
> >> >> Truly helpful. Why the hell did you not talk to me when you ran into
> >> >> that the first time?
> >> >
> >> > Because 
> >> >
> >> > 1) Users wanted CLOCK_BOOTTIME to stop counting while the VM 
> >> > is paused (so we wanted to stop guest clock when VM is paused anyway).
> >> 
> >> How is that supposed to work w/o the guest kernels help if you have to
> >> keep clock realtime up to date? 
> >
> > Upon VM resume, we notify NTP daemon in the guest to sync realtime
> > clock.
> 
> Brilliant. What happens if there is no NTP daemon? What happens if the
> NTP daemon is not part of the virt orchestration magic and cannot be
> notified, then it will notice the time jump after the next update
> interval.
> 
> What about correctness?
> 
> ALL CLOCK_* stop and resume when the VM is resumed at the point where
> they stopped.
> 
> So up to the point where NTP catches up and corrects clock realtime and
> TAI other processes can observe that time jumped in the outside world,
> e.g. via a network packet or whatever, but there is no reason why time
> should have jumped outside vs. the local one.
> 
> You really all live in a seperate universe creating your own rules how
> things which other people work hard on to get it correct can be screwed
> over.

	1. T = read timestamp.
	2. migrate (VM stops for a certain period).
	3. use timestamp T.

> Of course this all is nowhere documented in detail. At least a quick
> search with about 10 different keyword combinations revealed absolutely
> nothing.
> 
> This features first, correctness later frenzy is insane and it better
> stops now before you pile even more crap on the existing steaming pile
> of insanities.

Sure.
Thomas Gleixner Dec. 11, 2020, 1:30 p.m. UTC | #63
On Thu, Dec 10 2020 at 21:27, Marcelo Tosatti wrote:
> On Thu, Dec 10, 2020 at 10:48:10PM +0100, Thomas Gleixner wrote:
>> You really all live in a seperate universe creating your own rules how
>> things which other people work hard on to get it correct can be screwed
>> over.
>
> 	1. T = read timestamp.
> 	2. migrate (VM stops for a certain period).
> 	3. use timestamp T.

This is exactly the problem. Time stops at pause and continues where it
stopped on resume.

But CLOCK_REALTIME and CLOCK_TAI advanced in reality. So up to the point
where NTP fixes this - if there is NTP at all - the guest CLOCK_REALTIME
and CLOCK_TAI are off by tpause.

Now the application gets a packet from the outside world with a
CLOCK_REALTIME timestamp which is suddenly ahead of the value it reads
from clock_gettime(CLOCK_REALTIME) by tpause. So what is it supposed to
do with that? Make the stupid assumption that the other end screwed up
timekeeping, or throw an error that the system it is running on screwed
up timekeeping? And a second later, when NTP catches up, it gets the
next surprise because the system's CLOCK_REALTIME jumped forward
unexpectedly; or, if there is no NTP, it stays confused forever.

How can you even assume that this is correct?

It is exactly the same problem as we had many years ago with hardware
clocks that suddenly stopped ticking, which caused quite some stuff to
go belly up.

In a proper suspend/resume scenario CLOCK_REALTIME/TAI are advanced
(with a certain degree of accuracy) to compensate for the sleep time, so
the other end of a communication is at least in the same ballpark, but
not 50 seconds off.
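
A minimal sketch of the application-side check described above, assuming
a POSIX environment; the 100 ms tolerance and the helper names are made
up for the illustration and are not from this thread:

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Arbitrary tolerance between local CLOCK_REALTIME and the peer's
 * wall-clock timestamp carried in a packet (illustration only). */
#define MAX_SKEW_NS	100000000LL	/* 100 ms */

static int64_t realtime_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_REALTIME, &ts);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* pkt_realtime_ns: CLOCK_REALTIME timestamp taken by the peer. */
static void check_packet(int64_t pkt_realtime_ns)
{
	int64_t skew = pkt_realtime_ns - realtime_ns();

	/*
	 * After a resume that did not advance CLOCK_REALTIME, skew is
	 * roughly +tpause until NTP steps the clock, so the peer's
	 * timestamp looks like it comes from the future.
	 */
	if (skew > MAX_SKEW_NS || skew < -MAX_SKEW_NS)
		fprintf(stderr, "peer timestamp off by %lld ns\n",
			(long long)skew);
}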

>> This features first, correctness later frenzy is insane and it better
>> stops now before you pile even more crap on the existing steaming pile
>> of insanities.
>
> Sure.

I wish that were true. OS people - you should know that - have been
fighting forever with hardware people over feature madness and the
attitude of 'we can fix that in software', which often enough turns out
to be wrong.

Now sadly enough people who suffered from that madness work on
virtualization and instead of trying to avoid the same problem they go
off and make it even worse.

It's the same problem again as with hardware people. Not talking to the
other people _before_ making uninformed assumptions and decisions.

"We did it that way because a big customer asked for it" is not a
justification for inflicting this on everybody else and thereby
violating correctness. "Works for me and my big customer" is not a
proof of correctness either.

It's another proof that this industry just "works" by chance.

Thanks,

        tglx
Paolo Bonzini Dec. 11, 2020, 1:37 p.m. UTC | #64
On 11/12/20 01:27, Marcelo Tosatti wrote:
> This features first, correctness later frenzy is insane and it better
> stops now before you pile even more crap on the existing steaming pile
> of insanities.

FWIW I agree; in fact I wish I knew exactly where to start simplifying
it. The timekeeping code in KVM is perhaps the last remaining bastion
that I'm afraid even to look at.

At least I know there are a couple of features that IMHO are completely
useless, and that's even before you reach the ones that might "only" be
obsolete.  So perhaps that's where to start, together with the messiest
userspace interfaces.

Paolo
Marcelo Tosatti Dec. 11, 2020, 2:18 p.m. UTC | #65
On Fri, Dec 11, 2020 at 02:30:34PM +0100, Thomas Gleixner wrote:
> On Thu, Dec 10 2020 at 21:27, Marcelo Tosatti wrote:
> > On Thu, Dec 10, 2020 at 10:48:10PM +0100, Thomas Gleixner wrote:
> >> You really all live in a separate universe creating your own rules how
> >> things which other people work hard on to get it correct can be screwed
> >> over.
> >
> > 	1. T = read timestamp.
> > 	2. migrate (VM stops for a certain period).
> > 	3. use timestamp T.
> 
> This is exactly the problem. Time stops at pause and continues where it
> stopped on resume.
> 
> But CLOCK_REALTIME and CLOCK_TAI advanced in reality. So up to the point
> where NTP fixes this - if there is NTP at all - the guest CLOCK_REALTIME
> and CLOCK_TAI are off by tpause.
> 
> Now the application gets a packet from the outside world with a
> CLOCK_REALTIME timestamp which is suddenly ahead of the value it reads
> from clock_gettime(CLOCK_REALTIME) by tpause. So what is it supposed to
> do with that? Make stupid assumptions that the other end screwed up
> timekeeping, throw an error that the system it is running on screwed up
> timekeeping? And a second later when NTP catches up it gets the next
> surprise because the system's CLOCK_REALTIME jumped forward unexpectedly
> or if there is no NTP it's confused forever.

This can happen even with a "perfect" solution that syncs time
instantly on the migration destination. See steps 1,2,3.

Unless you notify applications to invalidate their time reads,
I can't see a way to fix this.

Therefore if you use VM migration in the first place, a certain amount of
timestamp accuracy error must be tolerated.

> How can you even assume that this is correct?

As noted above, even without a window of unsynchronized time (due to
delay for NTP to sync time), time reads can be stale.

> It is exactly the same problem as we had many years ago with hardware
> clocks suddenly stopping to tick which caused quite some stuff to go
> belly up.

Customers complained when it was 5 seconds off; now it's 0.1ms (and
people seem happy).

> In a proper suspend/resume scenario CLOCK_REALTIME/TAI are advanced
> (with a certain degree of accuracy) to compensate for the sleep time, so
> the other end of a communication is at least in the same ballpark, but
> not 50 seconds off.

It's 100ms off with migration, and that can be reduced further (customers
complained about 5 seconds but seem happy with 0.1ms).

> >> This features first, correctness later frenzy is insane and it better
> >> stops now before you pile even more crap on the existing steaming pile
> >> of insanities.
> >
> > Sure.
> 
> I wish that would be true. OS people - you should know that - are
> fighting forever with hardware people over feature madness and the
> attitude of 'we can fix that in software' which turns often enough out
> to be wrong.
> 
> Now sadly enough people who suffered from that madness work on
> virtualization and instead of trying to avoid the same problem they go
> off and make it even worse.

So you think it's important to reduce the 100ms offset?

> It's the same problem again as with hardware people. Not talking to the
> other people _before_ making uninformed assumptions and decisions.
> 
> We did it that way because big customer asked for it is not a
> justification for inflicting this on everybody else and thereby
> violating correctness. Works for me and my big customer is not a proof
> of correctness either.
> 
> It's another proof that this industry just "works" by chance.
> 
> Thanks,
> 
>         tglx

OK, makes sense; then reducing the 0.1ms window even further
is a useful thing to do. What would be an acceptable
CLOCK_REALTIME accuracy error on migration?
Thomas Gleixner Dec. 11, 2020, 9:04 p.m. UTC | #66
On Fri, Dec 11 2020 at 11:18, Marcelo Tosatti wrote:
> On Fri, Dec 11, 2020 at 02:30:34PM +0100, Thomas Gleixner wrote:
> Unless you notify applications to invalidate their time reads,
> i can't see a way to fix this.

This is just wrong. Suspend/resume handles that fine and the system is
guaranteed to come back with time which is very close to reality.

And for suspend/resume everything from kernel to userspace can have a
notification before suspend and post resume. So applications or those
parts of the kernel which are e.g. time sensitive can prepare upfront
for the disruption and mop up on resume.
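
As an aside on the "mop up" part: one mechanism userspace already has to
at least detect that CLOCK_REALTIME was stepped underneath it (whether by
the resume path or by a late NTP correction) is a timerfd armed with
TFD_TIMER_CANCEL_ON_SET. A rough sketch, not tied to this series:

#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/timerfd.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	struct itimerspec its = { 0 };
	uint64_t expirations;
	int fd = timerfd_create(CLOCK_REALTIME, 0);

	if (fd < 0)
		return 1;

	/* Absolute expiry far in the future; only the cancellation matters. */
	clock_gettime(CLOCK_REALTIME, &its.it_value);
	its.it_value.tv_sec += 10 * 365 * 24 * 3600;

	if (timerfd_settime(fd, TFD_TIMER_ABSTIME | TFD_TIMER_CANCEL_ON_SET,
			    &its, NULL) < 0)
		return 1;

	/* read() fails with ECANCELED once CLOCK_REALTIME is set/stepped. */
	if (read(fd, &expirations, sizeof(expirations)) < 0 &&
	    errno == ECANCELED)
		printf("CLOCK_REALTIME jumped, revalidate cached timestamps\n");

	close(fd);
	return 0;
}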

> Therefore if you use VM migration in the first place, a certain amount of
> timestamp accuracy error must be tolerated.

That's just because it was never designed in the right way. And you
simply declared that all applications have to deal with that.

Again, where is this documented? VMs are subject to migration whether
the customer who pays for it wants it or not. None of the virt tool docs
mentions that pausing a VM for a long time makes timekeeping go
south.

I still have no sensible explanation WHY time should not advance across
a migration. All you told me is that customers complained. Which
customers? The ones running the hosts or the ones paying for the VM?

It's all just decided by some folks to "fix" a problem with the pause/
migration mechanism they designed instead of fixing the design fail.

>> How can you even assume that this is correct?
>
> As noted above, even without a window of unsynchronized time (due to
> delay for NTP to sync time), time reads can be stale.

So with suspend/resume we have:

app:
   t = clock_gettime()
        <---------------- tsuspend
        <-----------------tresume
        So t is now behind reality by tresume - tsuspend

  packet -> check timestamp .... ooops recheck
  t = clock_gettime()
  and t and timestamp are in the same ballpark again

Now with your thing:

app:
   t = clock_gettime()
        <---------------- tpause
        <-----------------tresume
        So t is now behind reality by tresume - tpause

  packet -> check timestamp .... ooops recheck
  t = clock_gettime()
  and t and timestamp are still apart by ~ (tresume - tpause)

this persists until NTP kicks in, if and only if NTP is running.

Can you spot the difference?

>> It is exactly the same problem as we had many years ago with hardware
>> clocks suddenly stopping to tick which caused quite some stuff to go
>> belly up.
>
> Customers complained when it was 5 seconds off, now its 0.1ms (and
> people seem happy).

And because customers complained, you decided to create a scenario which
is completely different from all other scenarios and which, from a
timekeeping POV, makes no sense at all.

>> In a proper suspend/resume scenario CLOCK_REALTIME/TAI are advanced
>> (with a certain degree of accuracy) to compensate for the sleep time, so
>> the other end of a communication is at least in the same ballpark, but
>> not 50 seconds off.
>
> Its 100ms off with migration, and can be reduced further (customers
> complained about 5 seconds but seem happy with 0.1ms).

What is 100ms? Guaranteed maximum migration time?

CLOCK_REALTIME and CLOCK_TAI are off by the time the VM is paused and
this state persists up to the point where NTP corrects it with a time
jump.

So if migration takes 5 seconds then CLOCK_REALTIME is not off by 100ms
it's off by 5 seconds.

CLOCK_MONOTONIC/BOOTTIME might be off by 100ms between pause and resume.

> OK, makes sense, then reducing the 0.1ms window even further
> is a useful thing to do. What would be an acceptable 
> CLOCK_REALTIME accuracy error, on migration?

Can you please explain how you can achieve 0.1ms accuracy when migration
time is more than that and the guest TSC is just restored to the value
at which it was stopped?

Then ALL clocks including CLOCK_REALTIME and CLOCK_TAI continue from the
point at which they were stopped. Ergo:

      t(CLOCK_REALTIME) = t(REALITY) - t(STOPPED)

CLOCK_REALTIME and CLOCK_TAI are global clocks and they have rules which
have to be respected in order to make stuff work.

CLOCK_MONOTONIC and CLOCK_BOOTTIME are local to a system (host, guests).
So manipulating them is a completely different story, although the kernel
has explicit guarantees for the relationship between CLOCK_MONOTONIC,
CLOCK_BOOTTIME and CLOCK_REALTIME/TAI.

If you could guarantee t(STOPPED) < 100ms and therefore

   t(REALITY) - t(CLOCK_REALTIME) < 100ms

under _all_ circumstances then we would not even have that discussion.

Even < 1000ms might be acceptable. That's the margin of error which also
occurs across bare-metal suspend/resume in the case that the sleep time
has to be read from the RTC because the TSC stops across suspend.

But suspend/resume is still substantially different because everything
from kernel to userspace can have a notification before suspend. So
applications or those parts of the kernel which are e.g. time sensitive
can prepare upfront for the disruption. In your case, not at all. They
just have to cope with the fallout. Brilliant.

Your design, or the lack of it, just decides that everything has to cope
with whatever you decided is the right thing to do, for however long it
takes.

Yes it "works" by some definition of works, but that does not mean it
works correctly.

Thanks,

        tglx
Paolo Bonzini Dec. 11, 2020, 9:59 p.m. UTC | #67
On 11/12/20 22:04, Thomas Gleixner wrote:
>> Its 100ms off with migration, and can be reduced further (customers
>> complained about 5 seconds but seem happy with 0.1ms).
> What is 100ms? Guaranteed maximum migration time?

I suppose it's the length between the time from KVM_GET_CLOCK and 
KVM_GET_MSR(IA32_TSC) to KVM_SET_CLOCK and KVM_SET_MSR(IA32_TSC).  But 
the VM is paused for much longer, the sequence for the non-live part of 
the migration (aka brownout) is as follows:

     pause
     finish sending RAM            receive RAM               ~1 sec
     send paused-VM state          finish receiving RAM     \
                                   receive paused-VM state   ) 0.1 sec
                                   restart                  /

The nanosecond and TSC times are sent as part of the paused-VM state at 
the very end of the live migration process.

So it's still true that the time advances during live migration 
brownout; 0.1 seconds is just the final part of the live migration 
process.  But for _live_ migration there is no need to design things 
according to "people are happy if their clock is off by 0.1 seconds 
only".  Again, save-to-disk, reverse debugging and the like are a 
different story, which is why KVM should delegate policy to userspace 
(while documenting how to do it right).
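
As a rough sketch (not part of this series) of the source-side capture at
the start of that brownout window, using the existing KVM_GET_CLOCK and
KVM_GET_MSRS ioctls; vm_fd/vcpu_fd are placeholders for already-open VM
and vCPU descriptors, and MSR_IA32_TSC is defined locally because the
uapi headers don't provide it. The destination mirrors this with
KVM_SET_CLOCK and KVM_SET_MSRS just before restart.

#include <linux/kvm.h>
#include <stdint.h>
#include <sys/ioctl.h>

#define MSR_IA32_TSC	0x00000010	/* architectural TSC MSR index */

/* Capture kvmclock state for the VM and the raw TSC of one vCPU. */
static int capture_clock(int vm_fd, int vcpu_fd,
			 struct kvm_clock_data *clock, uint64_t *guest_tsc)
{
	struct {
		struct kvm_msrs hdr;
		struct kvm_msr_entry entries[1];
	} msrs = {
		.hdr.nmsrs = 1,
		.entries[0].index = MSR_IA32_TSC,
	};

	if (ioctl(vm_fd, KVM_GET_CLOCK, clock) < 0)
		return -1;

	/* KVM_GET_MSRS returns the number of MSRs successfully read. */
	if (ioctl(vcpu_fd, KVM_GET_MSRS, &msrs) != 1)
		return -1;

	*guest_tsc = msrs.entries[0].data;
	return 0;
}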

Paolo

> CLOCK_REALTIME and CLOCK_TAI are off by the time the VM is paused and
> this state persists up to the point where NTP corrects it with a time
> jump.
> 
> So if migration takes 5 seconds then CLOCK_REALTIME is not off by 100ms
> it's off by 5 seconds.
> 
> CLOCK_MONOTONIC/BOOTTIME might be off by 100ms between pause and resume.
>
Thomas Gleixner Dec. 12, 2020, 1:03 p.m. UTC | #68
On Fri, Dec 11 2020 at 22:59, Paolo Bonzini wrote:
> On 11/12/20 22:04, Thomas Gleixner wrote:
> The nanosecond and TSC times are sent as part of the paused-VM state at 
> the very end of the live migration process.
>
> So it's still true that the time advances during live migration 
> brownout; 0.1 seconds is just the final part of the live migration 
> process.  But for _live_ migration there is no need to design things 
> according to "people are happy if their clock is off by 0.1 seconds 
> only".

The problem is not the 0.1 second jump of the TSC. That's just a minor
nuisance. The problem is the incorrectness of CLOCK_REALTIME after this
operation, which is far larger than 0.1s, and that needs to be fixed.

> Again, save-to-disk, reverse debugging and the like are a different
> story, which is why KVM should delegate policy to userspace (while
> documenting how to do it right).

One ready-to-use option would be suspend-to-idle. It's fast and readily
available, including all the notifications and the kernel-side handling
of time and whatever else is needed.

Thanks,

        tglx
Marcelo Tosatti Dec. 15, 2020, 10:59 a.m. UTC | #69
On Fri, Dec 11, 2020 at 10:59:59PM +0100, Paolo Bonzini wrote:
> On 11/12/20 22:04, Thomas Gleixner wrote:
> > > Its 100ms off with migration, and can be reduced further (customers
> > > complained about 5 seconds but seem happy with 0.1ms).
> > What is 100ms? Guaranteed maximum migration time?
> 
> I suppose it's the length between the time from KVM_GET_CLOCK and
> KVM_GET_MSR(IA32_TSC) to KVM_SET_CLOCK and KVM_SET_MSR(IA32_TSC).  But the
> VM is paused for much longer, the sequence for the non-live part of the
> migration (aka brownout) is as follows:
> 
>     pause
>     finish sending RAM            receive RAM               ~1 sec
>     send paused-VM state          finish receiving RAM     \
>                                   receive paused-VM state   ) 0.1 sec
>                                   restart                  /
> 
> The nanosecond and TSC times are sent as part of the paused-VM state at the
> very end of the live migration process.
> 
> So it's still true that the time advances during live migration brownout;
> 0.1 seconds is just the final part of the live migration process.  But for
> _live_ migration there is no need to design things according to "people are
> happy if their clock is off by 0.1 seconds only".  

Agree. What would be a good way to fix this? 

It seems to me that using CLOCK_REALTIME, as in the interface Maxim is
proposing, is prone to differences in CLOCK_REALTIME itself between the
source and destination hosts.

Perhaps there is another way to measure that 0.1 sec which is
independent of the clock values of the source and destination hosts
(say by sending a packet once the clock stops counting).

Then, on the destination, measure delta = clock_restart_time - packet_receipt_time
and increase the clock by that amount.



> Again, save-to-disk,
> reverse debugging and the like are a different story, which is why KVM
> should delegate policy to userspace (while documenting how to do it right).
> 
> Paolo
> 
> > CLOCK_REALTIME and CLOCK_TAI are off by the time the VM is paused and
> > this state persists up to the point where NTP corrects it with a time
> > jump.
> > 
> > So if migration takes 5 seconds then CLOCK_REALTIME is not off by 100ms
> > it's off by 5 seconds.
> > 
> > CLOCK_MONOTONIC/BOOTTIME might be off by 100ms between pause and resume.
> >
Andy Lutomirski Dec. 15, 2020, 4:55 p.m. UTC | #70
On Tue, Dec 15, 2020 at 3:35 AM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> On Fri, Dec 11, 2020 at 10:59:59PM +0100, Paolo Bonzini wrote:
> > On 11/12/20 22:04, Thomas Gleixner wrote:
> > > > Its 100ms off with migration, and can be reduced further (customers
> > > > complained about 5 seconds but seem happy with 0.1ms).
> > > What is 100ms? Guaranteed maximum migration time?
> >
> > I suppose it's the length between the time from KVM_GET_CLOCK and
> > KVM_GET_MSR(IA32_TSC) to KVM_SET_CLOCK and KVM_SET_MSR(IA32_TSC).  But the
> > VM is paused for much longer, the sequence for the non-live part of the
> > migration (aka brownout) is as follows:
> >
> >     pause
> >     finish sending RAM            receive RAM               ~1 sec
> >     send paused-VM state          finish receiving RAM     \
> >                                   receive paused-VM state   ) 0.1 sec
> >                                   restart                  /
> >
> > The nanosecond and TSC times are sent as part of the paused-VM state at the
> > very end of the live migration process.
> >
> > So it's still true that the time advances during live migration brownout;
> > 0.1 seconds is just the final part of the live migration process.  But for
> > _live_ migration there is no need to design things according to "people are
> > happy if their clock is off by 0.1 seconds only".
>
> Agree. What would be a good way to fix this?
>

Could you implement the Hyper-V clock interface?  It's much, much
simpler than the kvmclock interface.  It has the downside that
CLOCK_BOOTTIME won't do what you want, but I'm not really convinced
that's a problem, and you could come up with a minimal extension to
fix that.
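
For context, the guest-side read of the Hyper-V reference TSC page boils
down to the sketch below; the field layout follows the TLFS and Linux's
struct ms_hyperv_tsc_page, the result is in 100 ns units, and real code
would use proper READ_ONCE/barrier primitives instead of volatile:

#include <stdint.h>

/* Subset of the Hyper-V reference TSC page layout. */
struct hv_ref_tsc_page {
	volatile uint32_t tsc_sequence;
	uint32_t reserved1;
	volatile uint64_t tsc_scale;
	volatile int64_t tsc_offset;
};

static inline uint64_t read_tsc(void)
{
	uint32_t lo, hi;

	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

/*
 * Partition reference time in 100 ns units, or 0 if the page is disabled
 * and the caller has to fall back to the reference time MSR.
 */
static uint64_t hv_read_reference_time(const struct hv_ref_tsc_page *pg)
{
	uint64_t seq, scale, tsc;
	int64_t offset;

	do {
		seq = pg->tsc_sequence;
		if (seq == 0)
			return 0;

		scale = pg->tsc_scale;
		offset = pg->tsc_offset;
		tsc = read_tsc();

		/* Retry if the hypervisor updated the page meanwhile. */
	} while (pg->tsc_sequence != seq);

	return (uint64_t)(((unsigned __int128)tsc * scale) >> 64) + offset;
}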
Thomas Gleixner Dec. 15, 2020, 10:34 p.m. UTC | #71
On Tue, Dec 15 2020 at 07:59, Marcelo Tosatti wrote:
> On Fri, Dec 11, 2020 at 10:59:59PM +0100, Paolo Bonzini wrote:
>> So it's still true that the time advances during live migration brownout;
>> 0.1 seconds is just the final part of the live migration process.  But for
>> _live_ migration there is no need to design things according to "people are
>> happy if their clock is off by 0.1 seconds only".  
>
> Agree. What would be a good way to fix this?

None of what's proposed is fixing the underlying problem of wrecking
CLOCK_REALTIME.

Stop proliferating broken behaviour and please answer the questions I
have asked several times now:

   1) Why does the TSC have to restart at the same value?

   2) What is the technical argument that it is correct and acceptable
      to wreck CLOCK_REALTIME by doing #1?

Thanks,

        tglx
diff mbox series

Patch

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 70254eaa5229f..ebecfe4b414ce 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -4826,6 +4826,71 @@  If a vCPU is in running state while this ioctl is invoked, the vCPU may
 experience inconsistent filtering behavior on MSR accesses.
 
 
+4.127 KVM_GET_TSC_STATE
+----------------------------
+
+:Capability: KVM_CAP_PRECISE_TSC
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_tsc_state
+:Returns: 0 on success, < 0 on error
+
+::
+
+  #define KVM_TSC_STATE_TIMESTAMP_VALID 1
+  #define KVM_TSC_STATE_TSC_ADJUST_VALID 2
+  struct kvm_tsc_state {
+	__u32 flags;
+	__u64 nsec;
+	__u64 tsc;
+	__u64 tsc_adjust;
+  };
+
+flags values for ``struct kvm_tsc_state``:
+
+``KVM_TSC_STATE_TIMESTAMP_VALID``
+
+  ``nsec`` contains nanoseconds from unix epoch.
+    Always set by KVM_GET_TSC_STATE, might be omitted in KVM_SET_TSC_STATE
+
+``KVM_TSC_STATE_TSC_ADJUST_VALID``
+
+  ``tsc_adjust`` contains valid IA32_TSC_ADJUST value
+
+
+This ioctl allows the user space to read the guest's IA32_TSC, IA32_TSC_ADJUST,
+and the current value of host's CLOCK_REALTIME clock in nanoseconds since unix
+epoch.
+
+
+4.128 KVM_SET_TSC_STATE
+----------------------------
+
+:Capability: KVM_CAP_PRECISE_TSC
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_tsc_state
+:Returns: 0 on success, < 0 on error
+
+::
+
+This ioctl allows the user space to reconstruct the guest's IA32_TSC and TSC_ADJUST values
+from the state obtained in the past by KVM_GET_TSC_STATE on the same vCPU.
+
+If 'KVM_TSC_STATE_TIMESTAMP_VALID' is set in flags,
+KVM will adjust the guest TSC value by the time that passed between the
+moment the CLOCK_REALTIME timestamp was saved in the struct and the
+current value of CLOCK_REALTIME, and set the guest's TSC to the new value.
+
+Otherwise KVM will set the guest TSC value to the exact value as given
+in the struct.
+
+If KVM_TSC_STATE_TSC_ADJUST_VALID is set, and the guest supports IA32_MSR_TSC_ADJUST,
+then its value will be set to the given value from the struct.
+
+It is assumed that either both ioctls will be run on the same machine,
+or that source and destination machines have synchronized clocks.
+
 5. The kvm_run structure
 ========================
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a3fdc16cfd6f3..9b8a2fe3a2398 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2438,6 +2438,21 @@  static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
 
 	return gtod_is_based_on_tsc(do_realtime(ts, tsc_timestamp));
 }
+
+
+static void kvm_get_walltime(u64 *walltime_ns, u64 *host_tsc)
+{
+	struct timespec64 ts;
+
+	if (kvm_get_walltime_and_clockread(&ts, host_tsc)) {
+		*walltime_ns = timespec64_to_ns(&ts);
+		return;
+	}
+
+	*host_tsc = rdtsc();
+	*walltime_ns = ktime_get_real_ns();
+}
+
 #endif
 
 /*
@@ -3757,6 +3772,9 @@  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_X86_USER_SPACE_MSR:
 	case KVM_CAP_X86_MSR_FILTER:
 	case KVM_CAP_ENFORCE_PV_FEATURE_CPUID:
+#ifdef CONFIG_X86_64
+	case KVM_CAP_PRECISE_TSC:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_SYNC_REGS:
@@ -4999,6 +5017,61 @@  long kvm_arch_vcpu_ioctl(struct file *filp,
 	case KVM_GET_SUPPORTED_HV_CPUID:
 		r = kvm_ioctl_get_supported_hv_cpuid(vcpu, argp);
 		break;
+#ifdef CONFIG_X86_64
+	case KVM_GET_TSC_STATE: {
+		struct kvm_tsc_state __user *user_tsc_state = argp;
+		u64 host_tsc;
+
+		struct kvm_tsc_state tsc_state = {
+			.flags = KVM_TSC_STATE_TIMESTAMP_VALID
+		};
+
+		kvm_get_walltime(&tsc_state.nsec, &host_tsc);
+		tsc_state.tsc = kvm_read_l1_tsc(vcpu, host_tsc);
+
+		if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
+			tsc_state.tsc_adjust = vcpu->arch.ia32_tsc_adjust_msr;
+			tsc_state.flags |= KVM_TSC_STATE_TSC_ADJUST_VALID;
+		}
+
+		r = -EFAULT;
+		if (copy_to_user(user_tsc_state, &tsc_state, sizeof(tsc_state)))
+			goto out;
+		r = 0;
+		break;
+	}
+	case KVM_SET_TSC_STATE: {
+		struct kvm_tsc_state __user *user_tsc_state = argp;
+		struct kvm_tsc_state tsc_state;
+		u64 host_tsc, wall_nsec;
+
+		u64 new_guest_tsc, new_guest_tsc_offset;
+
+		r = -EFAULT;
+		if (copy_from_user(&tsc_state, user_tsc_state, sizeof(tsc_state)))
+			goto out;
+
+		kvm_get_walltime(&wall_nsec, &host_tsc);
+		new_guest_tsc = tsc_state.tsc;
+
+		if (tsc_state.flags & KVM_TSC_STATE_TIMESTAMP_VALID) {
+			s64 diff = wall_nsec - tsc_state.nsec;
+			if (diff >= 0)
+				new_guest_tsc += nsec_to_cycles(vcpu, diff);
+			else
+				new_guest_tsc -= nsec_to_cycles(vcpu, -diff);
+		}
+
+		new_guest_tsc_offset = new_guest_tsc - kvm_scale_tsc(vcpu, host_tsc);
+		kvm_vcpu_write_tsc_offset(vcpu, new_guest_tsc_offset);
+
+		if (tsc_state.flags & KVM_TSC_STATE_TSC_ADJUST_VALID)
+			if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST))
+				vcpu->arch.ia32_tsc_adjust_msr = tsc_state.tsc_adjust;
+		r = 0;
+		break;
+	}
+#endif
 	default:
 		r = -EINVAL;
 	}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 886802b8ffba3..bf4c38fd58291 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1056,6 +1056,7 @@  struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ENFORCE_PV_FEATURE_CPUID 190
 #define KVM_CAP_SYS_HYPERV_CPUID 191
 #define KVM_CAP_DIRTY_LOG_RING 192
+#define KVM_CAP_PRECISE_TSC 193
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1169,6 +1170,16 @@  struct kvm_clock_data {
 	__u32 pad[9];
 };
 
+
+#define KVM_TSC_STATE_TIMESTAMP_VALID 1
+#define KVM_TSC_STATE_TSC_ADJUST_VALID 2
+struct kvm_tsc_state {
+	__u32 flags;
+	__u64 nsec;
+	__u64 tsc;
+	__u64 tsc_adjust;
+};
+
 /* For KVM_CAP_SW_TLB */
 
 #define KVM_MMU_FSL_BOOKE_NOHV		0
@@ -1563,6 +1574,10 @@  struct kvm_pv_cmd {
 /* Available with KVM_CAP_DIRTY_LOG_RING */
 #define KVM_RESET_DIRTY_RINGS		_IO(KVMIO, 0xc7)
 
+/* Available with KVM_CAP_PRECISE_TSC*/
+#define KVM_SET_TSC_STATE          _IOW(KVMIO,  0xc8, struct kvm_tsc_state)
+#define KVM_GET_TSC_STATE          _IOR(KVMIO,  0xc9, struct kvm_tsc_state)
+
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
 	/* Guest initialization commands */