[Bug,217423] New: TSC synchronization issue in VM restore

Message ID	bug-217423-28872@https.bugzilla.kernel.org/ (mailing list archive)
State	New, archived
Headers	show Return-Path: <kvm-owner@vger.kernel.org> From: bugzilla-daemon@kernel.org To: kvm@vger.kernel.org Subject: [Bug 217423] New: TSC synchronization issue in VM restore Date: Tue, 09 May 2023 14:01:29 +0000 Message-ID: <bug-217423-28872@https.bugzilla.kernel.org/> Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Auto-Submitted: auto-generated MIME-Version: 1.0 Precedence: bulk
Series	[Bug,217423] New: TSC synchronization issue in VM restore \| expand [Bug,217423] New: TSC synchronization issue in VM restore

bugzilla-daemon@kernel.org May 9, 2023, 2:01 p.m. UTC

https://bugzilla.kernel.org/show_bug.cgi?id=217423

            Bug ID: 217423
           Summary: TSC synchronization issue in VM restore
           Product: Virtualization
           Version: unspecified
          Hardware: All
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P3
         Component: kvm
          Assignee: virtualization_kvm@kernel-bugs.osdl.org
          Reporter: zhuangel570@gmail.com
        Regression: No

Hi

We are using lightweight VM with snapshot feature, the VM will be saved with
100ms+, and we found restore such VM will not get correct TSC, which will make
the VM world stop about 100ms+ after restore (the stop time is same as time
when VM saved).

After Investigation, we found the issue caused by TSC synchronization in
setting MSR_IA32_TSC. In VM save, VMM (cloud-hypervisor) will record TSC of
each
VCPU, then restore the TSC of VCPU in VM restore (about 100ms+ in guest time).
But in KVM, setting a TSC within 1 second is identified as TSC synchronization,
and the TSC offset will not be updated in stable TSC environment, this will
cause the lapic set up a hrtimer expires after 100ms+, the restored VM now will
in stop state about 100ms+, if no other event to wake guest kernel in NO_HZ
mode.

More investigation show, the MSR_IA32_TSC set from guest side has disabled TSC
synchronization in commit 0c899c25d754 (KVM: x86: do not attempt TSC
synchronization on guest writes), now host side will do TSC synchronization
when
setting MSR_IA32_TSC.

I think setting MSR_IA32_TSC within 1 second from host side should not be
identified as TSC synchronization, like above case, VMM set TSC from host side
always should be updated as user want.

The MSR_IA32_TSC set code is complicated and with a long history, so I come
here
to try to get help about whether my thought is correct. Here is my fix to solve
the issue, any comments are welcomed:


        }

bugzilla-daemon@kernel.org May 9, 2023, 4:27 p.m. UTC | #1

https://bugzilla.kernel.org/show_bug.cgi?id=217423

--- Comment #1 from Artem S. Tashkinov (aros@gmx.com) ---
*** Bug 217424 has been marked as a duplicate of this bug. ***

Robert Hoo May 17, 2023, 6:56 a.m. UTC | #2

On 5/9/2023 10:01 PM, bugzilla-daemon@kernel.org wrote:
> Hi
> 
> We are using lightweight VM with snapshot feature, the VM will be saved with
> 100ms+, and we found restore such VM will not get correct TSC, which will make
> the VM world stop about 100ms+ after restore (the stop time is same as time
> when VM saved).
> 
> After Investigation, we found the issue caused by TSC synchronization in
> setting MSR_IA32_TSC. In VM save, VMM (cloud-hypervisor) will record TSC of
> each
> VCPU, then restore the TSC of VCPU in VM restore (about 100ms+ in guest time).
> But in KVM, setting a TSC within 1 second is identified as TSC synchronization,
> and the TSC offset will not be updated in stable TSC environment, this will
> cause the lapic set up a hrtimer expires after 100ms+, 

Can elaborate more on this hrtimer issue/code path?

> the restored VM now will
> in stop state about 100ms+, if no other event to wake guest kernel in NO_HZ
> mode.
> 
> More investigation show, the MSR_IA32_TSC set from guest side has disabled TSC
> synchronization in commit 0c899c25d754 (KVM: x86: do not attempt TSC
> synchronization on guest writes), now host side will do TSC synchronization
> when
> setting MSR_IA32_TSC.
> 
> I think setting MSR_IA32_TSC within 1 second from host side should not be
> identified as TSC synchronization, like above case, VMM set TSC from host side
> always should be updated as user want.

This is heuristics, I think; at the very beginning, it was 5 seconds.
Perhaps nowadays, can we have some deterministic approach to identify a 
synchronization? e.g. add a new VM ioctl?
> 
> The MSR_IA32_TSC set code is complicated and with a long history, so I come
> here
> to try to get help about whether my thought is correct. Here is my fix to solve
> the issue, any comments are welcomed:
> 
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ceb7c5e9cf9e..9380a88b9c1f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2722,17 +2722,6 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu,
> u64 data)
>                           * kvm_clock stable after CPU hotplug
>                           */
>                          synchronizing = true;
> -               } else {
> -                       u64 tsc_exp = kvm->arch.last_tsc_write +
> -                                               nsec_to_cycles(vcpu, elapsed);
> -                       u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
> -                       /*
> -                        * Special case: TSC write with a small delta (1
> second)
> -                        * of virtual cycle time against real time is
> -                        * interpreted as an attempt to synchronize the CPU.
> -                        */
> -                       synchronizing = data < tsc_exp + tsc_hz &&
> -                                       data + tsc_hz > tsc_exp;
>                  }
>          }
> 
This hunk of code is indeed historic and heuristic. But simply removing it 
isn't the way.
Is the interval between your "save VM" and "restore VM" less than 1s?

An alternative, I think, is to bypass this directly write IA32_MSR_TSC way 
to set/sync TSC offsets, but follow new approach introduced in your VMM by

commit 828ca89628bfcb1b8f27535025f69dd00eb55207
Author: Oliver Upton <oliver.upton@linux.dev>
Date:   Thu Sep 16 18:15:38 2021 +0000

     KVM: x86: Expose TSC offset controls to userspace

...

Documentation/virt/kvm/devices/vcpu.rst:

4.1 ATTRIBUTE: KVM_VCPU_TSC_OFFSET

:Parameters: 64-bit unsigned TSC offset

...

Specifies the guest's TSC offset relative to the host's TSC. The guest's
TSC is then derived by the following equation:

   guest_tsc = host_tsc + KVM_VCPU_TSC_OFFSET

The following describes a possible algorithm to use for this purpose
...

bugzilla-daemon@kernel.org May 17, 2023, 6:56 a.m. UTC | #3

https://bugzilla.kernel.org/show_bug.cgi?id=217423

--- Comment #2 from robert.hoo.linux@gmail.com ---
On 5/9/2023 10:01 PM, bugzilla-daemon@kernel.org wrote:
> Hi
> 
> We are using lightweight VM with snapshot feature, the VM will be saved with
> 100ms+, and we found restore such VM will not get correct TSC, which will
> make
> the VM world stop about 100ms+ after restore (the stop time is same as time
> when VM saved).
> 
> After Investigation, we found the issue caused by TSC synchronization in
> setting MSR_IA32_TSC. In VM save, VMM (cloud-hypervisor) will record TSC of
> each
> VCPU, then restore the TSC of VCPU in VM restore (about 100ms+ in guest
> time).
> But in KVM, setting a TSC within 1 second is identified as TSC
> synchronization,
> and the TSC offset will not be updated in stable TSC environment, this will
> cause the lapic set up a hrtimer expires after 100ms+, 

Can elaborate more on this hrtimer issue/code path?

> the restored VM now will
> in stop state about 100ms+, if no other event to wake guest kernel in NO_HZ
> mode.
> 
> More investigation show, the MSR_IA32_TSC set from guest side has disabled
> TSC
> synchronization in commit 0c899c25d754 (KVM: x86: do not attempt TSC
> synchronization on guest writes), now host side will do TSC synchronization
> when
> setting MSR_IA32_TSC.
> 
> I think setting MSR_IA32_TSC within 1 second from host side should not be
> identified as TSC synchronization, like above case, VMM set TSC from host
> side
> always should be updated as user want.

This is heuristics, I think; at the very beginning, it was 5 seconds.
Perhaps nowadays, can we have some deterministic approach to identify a 
synchronization? e.g. add a new VM ioctl?
> 
> The MSR_IA32_TSC set code is complicated and with a long history, so I come
> here
> to try to get help about whether my thought is correct. Here is my fix to
> solve
> the issue, any comments are welcomed:
> 
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ceb7c5e9cf9e..9380a88b9c1f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2722,17 +2722,6 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu,
> u64 data)
>                           * kvm_clock stable after CPU hotplug
>                           */
>                          synchronizing = true;
> -               } else {
> -                       u64 tsc_exp = kvm->arch.last_tsc_write +
> -                                               nsec_to_cycles(vcpu,
> elapsed);
> -                       u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
> -                       /*
> -                        * Special case: TSC write with a small delta (1
> second)
> -                        * of virtual cycle time against real time is
> -                        * interpreted as an attempt to synchronize the CPU.
> -                        */
> -                       synchronizing = data < tsc_exp + tsc_hz &&
> -                                       data + tsc_hz > tsc_exp;
>                  }
>          }
> 
This hunk of code is indeed historic and heuristic. But simply removing it 
isn't the way.
Is the interval between your "save VM" and "restore VM" less than 1s?

An alternative, I think, is to bypass this directly write IA32_MSR_TSC way 
to set/sync TSC offsets, but follow new approach introduced in your VMM by

commit 828ca89628bfcb1b8f27535025f69dd00eb55207
Author: Oliver Upton <oliver.upton@linux.dev>
Date:   Thu Sep 16 18:15:38 2021 +0000

     KVM: x86: Expose TSC offset controls to userspace

...

Documentation/virt/kvm/devices/vcpu.rst:

4.1 ATTRIBUTE: KVM_VCPU_TSC_OFFSET

:Parameters: 64-bit unsigned TSC offset

...

Specifies the guest's TSC offset relative to the host's TSC. The guest's
TSC is then derived by the following equation:

   guest_tsc = host_tsc + KVM_VCPU_TSC_OFFSET

The following describes a possible algorithm to use for this purpose
...

bugzilla-daemon@kernel.org May 18, 2023, 12:48 p.m. UTC | #4

https://bugzilla.kernel.org/show_bug.cgi?id=217423

--- Comment #3 from zhuangel (zhuangel570@gmail.com) ---
(In reply to robert.hoo.linux from comment #2)
> On 5/9/2023 10:01 PM, bugzilla-daemon@kernel.org wrote:
> > Hi
> > 
> > We are using lightweight VM with snapshot feature, the VM will be saved
> with
> > 100ms+, and we found restore such VM will not get correct TSC, which will
> > make
> > the VM world stop about 100ms+ after restore (the stop time is same as time
> > when VM saved).
> > 
> > After Investigation, we found the issue caused by TSC synchronization in
> > setting MSR_IA32_TSC. In VM save, VMM (cloud-hypervisor) will record TSC of
> > each
> > VCPU, then restore the TSC of VCPU in VM restore (about 100ms+ in guest
> > time).
> > But in KVM, setting a TSC within 1 second is identified as TSC
> > synchronization,
> > and the TSC offset will not be updated in stable TSC environment, this will
> > cause the lapic set up a hrtimer expires after 100ms+, 
> 
> Can elaborate more on this hrtimer issue/code path?

Below are the steps in detail, I traced them via bpftrace, to simplify the
analysis, the preemption timer on host is disabled, guest is running with
TSC timer deadline mode.

TSC changes before save VM:
1 create VM/VCPU, guest TSC start from 0 (VCPU initial value)
  host_tsc0 = 0 + offset0
2 pause VM after guest start finished (about 200ms)
  host_tsc1 = guest_tsc1 + offset0
  guest_tsc1_deadline = guest_tsc1 + expire1
3 save VM state
  save guest_tsc1 by reading MSR_IA32_TSC
  save guest_tsc1_deadline by reading MSR_IA32_TSC_DEADLINE

TSC changes in restore VM (to simplify the analysis, step 4
and step 5 ignore the host TSC changes in restore process):
4 create VM/VCPU, guest TSC start from 0 (VCPU initial value)
  host_tsc3 = 0 + offset1
5 restore VM state
  set MSR_IA32_TSC by guest_tsc1
  set MSR_IA32_TSC_DEADLINE by guest_tsc1_deadline
6 start VM
  VCPU_RUN

In step 5 setting MSR_IA32_TSC, because the guest_tsc1 is within 1 second,
KVM will take this update as TSC synchronize, then skip update offset1.
This means the guest TSC is still at 0 (initialize value).

In step 5 setting MSR_IA32_TSC_DEADLINE, VMM just want setup hrtimer after
expire1, but KVM will get the current guest TSC is 0, and then calculate
the expire time as (guest_tsc1_deadline - 0), then the hrtimer introduce
guest_tsc1 latency (the guest kernel stopped if it will only kick by loapic
timer).

> 
> > the restored VM now will
> > in stop state about 100ms+, if no other event to wake guest kernel in NO_HZ
> > mode.
> > 
> > More investigation show, the MSR_IA32_TSC set from guest side has disabled
> > TSC
> > synchronization in commit 0c899c25d754 (KVM: x86: do not attempt TSC
> > synchronization on guest writes), now host side will do TSC synchronization
> > when
> > setting MSR_IA32_TSC.
> > 
> > I think setting MSR_IA32_TSC within 1 second from host side should not be
> > identified as TSC synchronization, like above case, VMM set TSC from host
> > side
> > always should be updated as user want.
> 
> This is heuristics, I think; at the very beginning, it was 5 seconds.
> Perhaps nowadays, can we have some deterministic approach to identify a 
> synchronization? e.g. add a new VM ioctl?

The 5 seconds was original introduced in commit f38e098ff3a3 ("KVM: x86:
TSC reset compensation"), when setting TSC into same value in 5 second. And
in 46543ba45fc49 ("KVM: x86: Robust TSC compensation"), remove the same value
check.

From changes in these 2 commits, I think the TSC synchronize first introduced
to handle synchronize from guest side (kernel boot), then changed to migration
setting from host side.

In commit 0c899c25d754 ("KVM: x86: do not attempt TSC synchronization
on guest writes"), Paolo has split the guest TSC write from this complicated
synchronization.

Now just migration scenarios left, in migration the TSC restore should be set
before VCPU start to run, one hack is to skip checking whether the VCPU starts
running, such as check last_vmentry_cpu is -1.

> > 
> > The MSR_IA32_TSC set code is complicated and with a long history, so I come
> > here
> > to try to get help about whether my thought is correct. Here is my fix to
> > solve
> > the issue, any comments are welcomed:
> > 
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index ceb7c5e9cf9e..9380a88b9c1f 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -2722,17 +2722,6 @@ static void kvm_synchronize_tsc(struct kvm_vcpu
> *vcpu,
> > u64 data)
> >                           * kvm_clock stable after CPU hotplug
> >                           */
> >                          synchronizing = true;
> > -               } else {
> > -                       u64 tsc_exp = kvm->arch.last_tsc_write +
> > -                                               nsec_to_cycles(vcpu,
> > elapsed);
> > -                       u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
> > -                       /*
> > -                        * Special case: TSC write with a small delta (1
> > second)
> > -                        * of virtual cycle time against real time is
> > -                        * interpreted as an attempt to synchronize the
> CPU.
> > -                        */
> > -                       synchronizing = data < tsc_exp + tsc_hz &&
> > -                                       data + tsc_hz > tsc_exp;
> >                  }
> >          }
> > 
> This hunk of code is indeed historic and heuristic. But simply removing it 
> isn't the way.
> Is the interval between your "save VM" and "restore VM" less than 1s?
> 
> An alternative, I think, is to bypass this directly write IA32_MSR_TSC way 
> to set/sync TSC offsets, but follow new approach introduced in your VMM by
> 
> commit 828ca89628bfcb1b8f27535025f69dd00eb55207
> Author: Oliver Upton <oliver.upton@linux.dev>
> Date:   Thu Sep 16 18:15:38 2021 +0000
> 
>      KVM: x86: Expose TSC offset controls to userspace
> 
> ...
> 
> Documentation/virt/kvm/devices/vcpu.rst:
> 
> 4.1 ATTRIBUTE: KVM_VCPU_TSC_OFFSET
> 
> :Parameters: 64-bit unsigned TSC offset
> 
> ...
> 
> Specifies the guest's TSC offset relative to the host's TSC. The guest's
> TSC is then derived by the following equation:
> 
>    guest_tsc = host_tsc + KVM_VCPU_TSC_OFFSET
> 
> The following describes a possible algorithm to use for this purpose
> ...

"TSC counts the time during which the VM was paused.", This new feature works
for live migration. But if we save/restore VM with snapshot, the TSC should be
paused either?

Robert Hoo June 23, 2023, 2:40 p.m. UTC | #5

On 5/18/2023 8:48 PM, bugzilla-daemon@kernel.org wrote:
[...]
(Sorry for late response)
>> Can elaborate more on this hrtimer issue/code path?
> 
> Below are the steps in detail, I traced them via bpftrace, to simplify the
> analysis, the preemption timer on host is disabled, guest is running with
> TSC timer deadline mode.
> 
> TSC changes before save VM:
> 1 create VM/VCPU, guest TSC start from 0 (VCPU initial value)
>    host_tsc0 = 0 + offset0
> 2 pause VM after guest start finished (about 200ms)
>    host_tsc1 = guest_tsc1 + offset0
>    guest_tsc1_deadline = guest_tsc1 + expire1
> 3 save VM state
>    save guest_tsc1 by reading MSR_IA32_TSC
>    save guest_tsc1_deadline by reading MSR_IA32_TSC_DEADLINE
> 
> TSC changes in restore VM (to simplify the analysis, step 4
> and step 5 ignore the host TSC changes in restore process):
> 4 create VM/VCPU, guest TSC start from 0 (VCPU initial value)
>    host_tsc3 = 0 + offset1
> 5 restore VM state
>    set MSR_IA32_TSC by guest_tsc1
>    set MSR_IA32_TSC_DEADLINE by guest_tsc1_deadline
> 6 start VM
>    VCPU_RUN
> 
> In step 5 setting MSR_IA32_TSC, because the guest_tsc1 is within 1 second,
> KVM will take this update as TSC synchronize, then skip update offset1.
> This means the guest TSC is still at 0 (initialize value).

IIUC, here no matter synchronizing = true or false, offset will always be 
updated, i.e. __kvm_synchronize_tsc() will be called. But the offset value will 
differ.

I guess your environment is tsc_stable, then offset = kvm->arch.cur_tsc_offset, 
which is 0. That is to say, the elapsed time isn't counted in by the heuristics 
method in current code, that's the culprit.

static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
{
	...
	offset = kvm_compute_l1_tsc_offset(vcpu, data);
	...

	/*
	 * For a reliable TSC, we can match TSC offsets, and for an unstable
	 * TSC, we add elapsed time in this computation.  We could let the
	 * compensation code attempt to catch up if we fall behind, but
	 * it's better to try to match offsets from the beginning.
          */
	if (synchronizing &&
	    vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) {
		if (!kvm_check_tsc_unstable()) {
			offset = kvm->arch.cur_tsc_offset;
		} else {
			u64 delta = nsec_to_cycles(vcpu, elapsed);
			data += delta;
			offset = kvm_compute_l1_tsc_offset(vcpu, data);
		}
		matched = true;
	}

	__kvm_synchronize_tsc(vcpu, offset, data, ns, matched);
	raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
}


>> An alternative, I think, is to bypass this directly write IA32_MSR_TSC way
>> to set/sync TSC offsets, but follow new approach introduced in your VMM by
>>
>> commit 828ca89628bfcb1b8f27535025f69dd00eb55207
>> Author: Oliver Upton <oliver.upton@linux.dev>
>> Date:   Thu Sep 16 18:15:38 2021 +0000
>>
>>       KVM: x86: Expose TSC offset controls to userspace
>>
>> ...
>>
>> Documentation/virt/kvm/devices/vcpu.rst:
>>
>> 4.1 ATTRIBUTE: KVM_VCPU_TSC_OFFSET
>>
>> :Parameters: 64-bit unsigned TSC offset
>>
>> ...
>>
>> Specifies the guest's TSC offset relative to the host's TSC. The guest's
>> TSC is then derived by the following equation:
>>
>>     guest_tsc = host_tsc + KVM_VCPU_TSC_OFFSET
>>
>> The following describes a possible algorithm to use for this purpose
>> ...
> 
> "TSC counts the time during which the VM was paused.", This new feature works
> for live migration. But if we save/restore VM with snapshot, the TSC should be
> paused either?
> 
Not sure what's host's TSC situation when host is, say, suspended/hibernated. VM 
Save/Restore can refer to that.
But, the key point of this new approach is to use OFFSET rather than direct TSC 
value, this is like x86 TSC_ADJUST was introduced, and is preferred.
Via this new interface,
"... Ensure that the KVM_CLOCK_REALTIME flag is set in the provided structure.
KVM will advance the VM's kvmclock to account for elapsed time since recording 
the clock values.", therefore I think it can solve your problem, rather than 
modify the ancient and heuristics code at high risk.

bugzilla-daemon@kernel.org June 23, 2023, 2:41 p.m. UTC | #6

https://bugzilla.kernel.org/show_bug.cgi?id=217423

--- Comment #4 from robert.hoo.linux@gmail.com ---
On 5/18/2023 8:48 PM, bugzilla-daemon@kernel.org wrote:
[...]
(Sorry for late response)
>> Can elaborate more on this hrtimer issue/code path?
> 
> Below are the steps in detail, I traced them via bpftrace, to simplify the
> analysis, the preemption timer on host is disabled, guest is running with
> TSC timer deadline mode.
> 
> TSC changes before save VM:
> 1 create VM/VCPU, guest TSC start from 0 (VCPU initial value)
>    host_tsc0 = 0 + offset0
> 2 pause VM after guest start finished (about 200ms)
>    host_tsc1 = guest_tsc1 + offset0
>    guest_tsc1_deadline = guest_tsc1 + expire1
> 3 save VM state
>    save guest_tsc1 by reading MSR_IA32_TSC
>    save guest_tsc1_deadline by reading MSR_IA32_TSC_DEADLINE
> 
> TSC changes in restore VM (to simplify the analysis, step 4
> and step 5 ignore the host TSC changes in restore process):
> 4 create VM/VCPU, guest TSC start from 0 (VCPU initial value)
>    host_tsc3 = 0 + offset1
> 5 restore VM state
>    set MSR_IA32_TSC by guest_tsc1
>    set MSR_IA32_TSC_DEADLINE by guest_tsc1_deadline
> 6 start VM
>    VCPU_RUN
> 
> In step 5 setting MSR_IA32_TSC, because the guest_tsc1 is within 1 second,
> KVM will take this update as TSC synchronize, then skip update offset1.
> This means the guest TSC is still at 0 (initialize value).

IIUC, here no matter synchronizing = true or false, offset will always be 
updated, i.e. __kvm_synchronize_tsc() will be called. But the offset value will 
differ.

I guess your environment is tsc_stable, then offset = kvm->arch.cur_tsc_offset, 
which is 0. That is to say, the elapsed time isn't counted in by the heuristics 
method in current code, that's the culprit.

static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
{
        ...
        offset = kvm_compute_l1_tsc_offset(vcpu, data);
        ...

        /*
         * For a reliable TSC, we can match TSC offsets, and for an unstable
         * TSC, we add elapsed time in this computation.  We could let the
         * compensation code attempt to catch up if we fall behind, but
         * it's better to try to match offsets from the beginning.
          */
        if (synchronizing &&
            vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) {
                if (!kvm_check_tsc_unstable()) {
                        offset = kvm->arch.cur_tsc_offset;
                } else {
                        u64 delta = nsec_to_cycles(vcpu, elapsed);
                        data += delta;
                        offset = kvm_compute_l1_tsc_offset(vcpu, data);
                }
                matched = true;
        }

        __kvm_synchronize_tsc(vcpu, offset, data, ns, matched);
        raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
}


>> An alternative, I think, is to bypass this directly write IA32_MSR_TSC way
>> to set/sync TSC offsets, but follow new approach introduced in your VMM by
>>
>> commit 828ca89628bfcb1b8f27535025f69dd00eb55207
>> Author: Oliver Upton <oliver.upton@linux.dev>
>> Date:   Thu Sep 16 18:15:38 2021 +0000
>>
>>       KVM: x86: Expose TSC offset controls to userspace
>>
>> ...
>>
>> Documentation/virt/kvm/devices/vcpu.rst:
>>
>> 4.1 ATTRIBUTE: KVM_VCPU_TSC_OFFSET
>>
>> :Parameters: 64-bit unsigned TSC offset
>>
>> ...
>>
>> Specifies the guest's TSC offset relative to the host's TSC. The guest's
>> TSC is then derived by the following equation:
>>
>>     guest_tsc = host_tsc + KVM_VCPU_TSC_OFFSET
>>
>> The following describes a possible algorithm to use for this purpose
>> ...
> 
> "TSC counts the time during which the VM was paused.", This new feature works
> for live migration. But if we save/restore VM with snapshot, the TSC should
> be
> paused either?
> 
Not sure what's host's TSC situation when host is, say, suspended/hibernated.
VM 
Save/Restore can refer to that.
But, the key point of this new approach is to use OFFSET rather than direct TSC 
value, this is like x86 TSC_ADJUST was introduced, and is preferred.
Via this new interface,
"... Ensure that the KVM_CLOCK_REALTIME flag is set in the provided structure.
KVM will advance the VM's kvmclock to account for elapsed time since recording 
the clock values.", therefore I think it can solve your problem, rather than 
modify the ancient and heuristics code at high risk.

[Bug,217423] New: TSC synchronization issue in VM restore

Commit Message

Comments

Patch