diff mbox

[RFC,0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.

Message ID 1442591670-5216-1-git-send-email-rkrcmar@redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Radim Krčmář Sept. 18, 2015, 3:54 p.m. UTC
This patch series will be disabling PVCLOCK_COUNTS_FROM_ZERO flag and is
RFC because I haven't explored many potential problems or tested it.

[1/2] uses a different algorithm in the guest to start counting from 0.
[2/2] stops exposing PVCLOCK_COUNTS_FROM_ZERO in the hypervisor.

A viable alternative would be to implement opt-in features in kvm clock.

And because we probably only broke one old user (the infamous SLES 10), a
workaround like this is also possible: (but I'd rather not do that)



Radim Kr?má? (2):
  x86: kvmclock: abolish PVCLOCK_COUNTS_FROM_ZERO
  Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock
    system MSR"

 arch/x86/include/asm/pvclock-abi.h |  1 +
 arch/x86/kernel/kvmclock.c         | 46 +++++++++++++++++++++++++++++---------
 arch/x86/kvm/x86.c                 |  4 ----
 3 files changed, 36 insertions(+), 15 deletions(-)

Comments

Marcelo Tosatti Sept. 20, 2015, 10:57 p.m. UTC | #1
On Fri, Sep 18, 2015 at 05:54:28PM +0200, Radim Kr?má? wrote:
> This patch series will be disabling PVCLOCK_COUNTS_FROM_ZERO flag and is
> RFC because I haven't explored many potential problems or tested it.

The justification to disable PVCLOCK_COUNTS_FROM_ZERO is because you
haven't explored potential problems or tested it? Sorry can't parse it.

> 
> [1/2] uses a different algorithm in the guest to start counting from 0.
> [2/2] stops exposing PVCLOCK_COUNTS_FROM_ZERO in the hypervisor.
> 
> A viable alternative would be to implement opt-in features in kvm clock.
> 
> And because we probably only broke one old user (the infamous SLES 10), a
> workaround like this is also possible: (but I'd rather not do that)

Please describe why SLES 10 breaks in detail: the state of the guest and
the host before the patch, the state of the guest and host after the
patch.

What does SLES10 expect?
Is it counting from zero that breaks SLES10? 

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a60bdbccff51..ae9049248aaf 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2007,7 +2007,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  
>  			ka->boot_vcpu_runs_old_kvmclock = tmp;
>  
> -			ka->kvmclock_offset = -get_kernel_ns();
> +			if (!ka->boot_vcpu_runs_old_kvmclock)
> +				ka->kvmclock_offset = -get_kernel_ns();
>  		}
>  
>  		vcpu->arch.time = data;
> 
> 
> Radim Kr?má? (2):
>   x86: kvmclock: abolish PVCLOCK_COUNTS_FROM_ZERO
>   Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock
>     system MSR"
> 
>  arch/x86/include/asm/pvclock-abi.h |  1 +
>  arch/x86/kernel/kvmclock.c         | 46 +++++++++++++++++++++++++++++---------
>  arch/x86/kvm/x86.c                 |  4 ----
>  3 files changed, 36 insertions(+), 15 deletions(-)
> 
> -- 
> 2.5.2
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Radim Krčmář Sept. 21, 2015, 3:12 p.m. UTC | #2
2015-09-20 19:57-0300, Marcelo Tosatti:
> On Fri, Sep 18, 2015 at 05:54:28PM +0200, Radim Kr?má? wrote:
>> This patch series will be disabling PVCLOCK_COUNTS_FROM_ZERO flag and is
>> RFC because I haven't explored many potential problems or tested it.
> 
> The justification to disable PVCLOCK_COUNTS_FROM_ZERO is because you
> haven't explored potential problems or tested it? Sorry can't parse it.
> 
>> 
>> [1/2] uses a different algorithm in the guest to start counting from 0.
>> [2/2] stops exposing PVCLOCK_COUNTS_FROM_ZERO in the hypervisor.
>> 
>> A viable alternative would be to implement opt-in features in kvm clock.
>> 
>> And because we probably only broke one old user (the infamous SLES 10), a
>> workaround like this is also possible: (but I'd rather not do that)
> 
> Please describe why SLES 10 breaks in detail: the state of the guest and
> the host before the patch, the state of the guest and host after the
> patch.

1) The guest periodically receives an interrupt that is handled by
   main_timer_handler():
   a) get time using the kvm clock:
      1) write the address to MSR_KVM_SYSTEM_TIME
      2) read tsc and pvclock (tsc_offset, system_time)
      3) time = tsc - tsc_offset + system_time
   b) compute time since the last main_timer_handler()
   c) bump jiffies if enough time has elapsed
2) the guest wants to calibrate loops per jiffy [1]:
   a) read tsc
   b) loop till jiffies increase
   c) compute lpj

Because (1a1) always resets the system_time to 0, we read the same value
over and over so the condition for (1c) is never true and jiffies remain
constant.  This is the problem.  A hang happens in (2b) as it is the
first place that depends on jiffies.

> What does SLES10 expect?

That a write to MSR_KVM_SYSTEM_TIME does not reset the system time.

> Is it counting from zero that breaks SLES10?

Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initializer did.
The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to,
while still keeping system time;  we used to allow that, which means an
ABI breakage.  (And we can't even say that guest's behaviour is against
the spec ...)


---
1: I also did diassembly, but the reproducer is easier to paste
   (couldn't find debuginfo)
   # qemu-kvm -nographic -kernel vmlinuz-2.6.16.60-0.85.1-default \
    -serial stdio -monitor /dev/null -append 'console=ttyS0'
  
  and you can get a bit further when setting loops per jiffy manually,
    -serial stdio -monitor /dev/null -append 'console=ttyS0 lpj=12345678'

  The dmesg for failing run is
    Initializing CPU#0
    PID hash table entries: 512 (order: 9, 16384 bytes)
    kvm-clock: cpu 0, msr 0:3f6041, boot clock
    kvm_get_tsc_khz: cpu 0, msr 0:e001
    time.c: Using tsc for timekeeping HZ 250
    time.c: Using 100.000000 MHz WALL KVM GTOD KVM timer.
    time.c: Detected 2591.580 MHz processor.
    Console: colour VGA+ 80x25
    Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
    Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
    Checking aperture...
    Nosave address range: 000000000009f000 - 00000000000a0000
    Nosave address range: 00000000000a0000 - 00000000000f0000
    Nosave address range: 00000000000f0000 - 0000000000100000
    Memory: 124884k/130944k available (1856k kernel code, 5544k reserved, 812k data, 188k init)
    [Infinitely querying kvm clock here ...]

  With '-cpu kvm64,-kvmclock', the next line is
    Calibrating delay using timer specific routine.. 5199.75 BogoMIPS (lpj=10399519)

  With 'lpj=10399519',
    Calibrating delay loop (skipped)... 5199.75 BogoMIPS preset
    [Manages to get stuck later, in default_idle.]
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Radim Krčmář Sept. 21, 2015, 3:43 p.m. UTC | #3
2015-09-21 17:12+0200, Radim Kr?má?:
> 2015-09-20 19:57-0300, Marcelo Tosatti:
> > On Fri, Sep 18, 2015 at 05:54:28PM +0200, Radim Kr?má? wrote:
>>> This patch series will be disabling PVCLOCK_COUNTS_FROM_ZERO flag and is
>>> RFC because I haven't explored many potential problems or tested it.
>> 
>> The justification to disable PVCLOCK_COUNTS_FROM_ZERO is because you
>> haven't explored potential problems or tested it? Sorry can't parse it.

I missed that one, sorry.
PVCLOCK_COUNTS_FROM_ZERO is disabled because it breaks ABI.
Not testing nor ruling out all problems is a justification for RFC.

(This patch series
 - will be disabling PVCLOCK_COUNTS_FROM_ZERO flag and
 - is RFC because I haven't explored many potential problems or tested
   it [this patch series])
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marcelo Tosatti Sept. 21, 2015, 3:52 p.m. UTC | #4
On Mon, Sep 21, 2015 at 05:12:10PM +0200, Radim Kr?má? wrote:
> 2015-09-20 19:57-0300, Marcelo Tosatti:
> > On Fri, Sep 18, 2015 at 05:54:28PM +0200, Radim Kr?má? wrote:
> >> This patch series will be disabling PVCLOCK_COUNTS_FROM_ZERO flag and is
> >> RFC because I haven't explored many potential problems or tested it.
> > 
> > The justification to disable PVCLOCK_COUNTS_FROM_ZERO is because you
> > haven't explored potential problems or tested it? Sorry can't parse it.
> > 
> >> 
> >> [1/2] uses a different algorithm in the guest to start counting from 0.
> >> [2/2] stops exposing PVCLOCK_COUNTS_FROM_ZERO in the hypervisor.
> >> 
> >> A viable alternative would be to implement opt-in features in kvm clock.
> >> 
> >> And because we probably only broke one old user (the infamous SLES 10), a
> >> workaround like this is also possible: (but I'd rather not do that)
> > 
> > Please describe why SLES 10 breaks in detail: the state of the guest and
> > the host before the patch, the state of the guest and host after the
> > patch.
> 
> 1) The guest periodically receives an interrupt that is handled by
>    main_timer_handler():
>    a) get time using the kvm clock:
>       1) write the address to MSR_KVM_SYSTEM_TIME
>       2) read tsc and pvclock (tsc_offset, system_time)
>       3) time = tsc - tsc_offset + system_time
>    b) compute time since the last main_timer_handler()
>    c) bump jiffies if enough time has elapsed
> 2) the guest wants to calibrate loops per jiffy [1]:
>    a) read tsc
>    b) loop till jiffies increase
>    c) compute lpj
> 
> Because (1a1) always resets the system_time to 0, we read the same value
> over and over so the condition for (1c) is never true and jiffies remain
> constant.  This is the problem.  A hang happens in (2b) as it is the
> first place that depends on jiffies.
> 
> > What does SLES10 expect?
> 
> That a write to MSR_KVM_SYSTEM_TIME does not reset the system time.
> 
> > Is it counting from zero that breaks SLES10?
> 
> Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initializer did.
> The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to,
> while still keeping system time;  we used to allow that, which means an
> ABI breakage.  (And we can't even say that guest's behaviour is against
> the spec ...)

Because this behaviour was not defined.

Can't you just condition PVCLOCK_COUNTS_FROM_ZERO behaviour on
boot_vcpu_runs_old_kvmclock == false? 
The patch would be much simpler.

The problem is, "selecting one read as the initial point" is inherently
racy: that delta is relative to one moment (kvmclock read) at one vcpu,
but must be applied to all vcpus.

Besides:

	1) Stable sched clock in guest does not depend on
	   KVM_FEATURE_CLOCKSOURCE_STABLE_BIT.
	2) You rely on monotonicity across vcpus to perform 
 	   the 'minus delta that was read on vcpu0' calculation, but 
	   monotonicity across vcpus can fail during runtime
           (say host clocksource goes tsc->hpet due to tsc instability).


> 
> 
> ---
> 1: I also did diassembly, but the reproducer is easier to paste
>    (couldn't find debuginfo)
>    # qemu-kvm -nographic -kernel vmlinuz-2.6.16.60-0.85.1-default \
>     -serial stdio -monitor /dev/null -append 'console=ttyS0'
>   
>   and you can get a bit further when setting loops per jiffy manually,
>     -serial stdio -monitor /dev/null -append 'console=ttyS0 lpj=12345678'
> 
>   The dmesg for failing run is
>     Initializing CPU#0
>     PID hash table entries: 512 (order: 9, 16384 bytes)
>     kvm-clock: cpu 0, msr 0:3f6041, boot clock
>     kvm_get_tsc_khz: cpu 0, msr 0:e001
>     time.c: Using tsc for timekeeping HZ 250
>     time.c: Using 100.000000 MHz WALL KVM GTOD KVM timer.
>     time.c: Detected 2591.580 MHz processor.
>     Console: colour VGA+ 80x25
>     Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
>     Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
>     Checking aperture...
>     Nosave address range: 000000000009f000 - 00000000000a0000
>     Nosave address range: 00000000000a0000 - 00000000000f0000
>     Nosave address range: 00000000000f0000 - 0000000000100000
>     Memory: 124884k/130944k available (1856k kernel code, 5544k reserved, 812k data, 188k init)
>     [Infinitely querying kvm clock here ...]
> 
>   With '-cpu kvm64,-kvmclock', the next line is
>     Calibrating delay using timer specific routine.. 5199.75 BogoMIPS (lpj=10399519)
> 
>   With 'lpj=10399519',
>     Calibrating delay loop (skipped)... 5199.75 BogoMIPS preset
>     [Manages to get stuck later, in default_idle.]
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Radim Krčmář Sept. 21, 2015, 8 p.m. UTC | #5
2015-09-21 12:52-0300, Marcelo Tosatti:
> On Mon, Sep 21, 2015 at 05:12:10PM +0200, Radim Kr?má? wrote:
>> 2015-09-20 19:57-0300, Marcelo Tosatti:
>>> Is it counting from zero that breaks SLES10?
>> 
>> Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initializer did.
>> The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to,
>> while still keeping system time;  we used to allow that, which means an
>> ABI breakage.  (And we can't even say that guest's behaviour is against
>> the spec ...)
> 
> Because this behaviour was not defined.

It is defined by implementation.

> Can't you just condition PVCLOCK_COUNTS_FROM_ZERO behaviour on
> boot_vcpu_runs_old_kvmclock == false? 
> The patch would be much simpler.


If you mean the hunk in cover letter, I don't like it because we presume
that no other guests were broken.

I really don't like it so I thought about other problems with
PVCLOCK_COUNTS_FROM_ZERO ... have you tried to hot-replug VCPU 0?
It doesn't work well ;)

We don't want to guess what the guest wants so I'd go for the opt-in
paravirt feature unless counting from zero can be done in guest alone.

> The problem is, "selecting one read as the initial point" is inherently
> racy: that delta is relative to one moment (kvmclock read) at one vcpu,
> but must be applied to all vcpus.

I don't think that is a problem.

Kvmclock has a notion of a global system_time in nanoseconds (one value
that defines the time, assigned with VM ioctl KVM_SET_CLOCK) and tries
to propagate system_time into guest VCPUs as precisely as possible with
the help of TSC.

sched_clock uses kvmclock to get nanoseconds since the system was
brought up and [1/2] only works with this abstracted ns count.
A poorly synchronized initial read is irrelevant because all VCPUs will
be using the same constant offset.
(We can never know the precise time anyway.)

> Besides:
> 
> 	1) Stable sched clock in guest does not depend on
> 	   KVM_FEATURE_CLOCKSOURCE_STABLE_BIT.

Yes, thanks, I will remove that requirement in v1.  (We'd need to
improve a loss of PVCLOCK_TSC_STABLE_BIT otherwise anyway.)

Because the clutchy dependency on PVCLOCK_TSC_STABLE_BIT is going away,
there is now one possible unsigned overflow: in case the clock was very
imprecise and VCPU1 managed to get smaller system_time than VCPU0 at the
time of initialization.  It's very unlikely that we'll ever reach legal
overflow so I can add a condition there.

> 	2) You rely on monotonicity across vcpus to perform 
> 	   the 'minus delta that was read on vcpu0' calculation, but 
> 	   monotonicity across vcpus can fail during runtime
>            (say host clocksource goes tsc->hpet due to tsc instability).

That could be a problem, but I'm adding a VCPU independent constant to
all reads -- does the new code rely on monoticity in places where the
old one didn't?
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marcelo Tosatti Sept. 21, 2015, 8:53 p.m. UTC | #6
On Mon, Sep 21, 2015 at 10:00:27PM +0200, Radim Kr?má? wrote:
> 2015-09-21 12:52-0300, Marcelo Tosatti:
> > On Mon, Sep 21, 2015 at 05:12:10PM +0200, Radim Kr?má? wrote:
> >> 2015-09-20 19:57-0300, Marcelo Tosatti:
> >>> Is it counting from zero that breaks SLES10?
> >> 
> >> Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initializer did.
> >> The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to,
> >> while still keeping system time;  we used to allow that, which means an
> >> ABI breakage.  (And we can't even say that guest's behaviour is against
> >> the spec ...)
> > 
> > Because this behaviour was not defined.
> 
> It is defined by implementation.
> 
> > Can't you just condition PVCLOCK_COUNTS_FROM_ZERO behaviour on
> > boot_vcpu_runs_old_kvmclock == false? 
> > The patch would be much simpler.
> 
> 
> If you mean the hunk in cover letter, I don't like it because we presume
> that no other guests were broken.

Yes patch 1.

Do you have an example of another guest that is broken? 

Really, assuming things about the value of the msr read when you
write to the MSR is very awkward. That SuSE code is incorrect, it fails
on other occasions as well (see
54750f2cf042c42b4223d67b1bd20138464bde0e).

> I really don't like it 

Because it changes behaviour that was previously unspecified?

> so I thought about other problems with
> PVCLOCK_COUNTS_FROM_ZERO ... have you tried to hot-replug VCPU 0?

Can't unplug VCPU 0 i think.

> It doesn't work well ;)

Can you hot-unplug vcpu 0? 

> We don't want to guess what the guest wants so I'd go for the opt-in
> paravirt feature unless counting from zero can be done in guest alone.

The problem it is tricky (to do the counting inside the guest).

Its much cleaner and less complicated if the host starts counting from
zero.

> 
> > The problem is, "selecting one read as the initial point" is inherently
> > racy: that delta is relative to one moment (kvmclock read) at one vcpu,
> > but must be applied to all vcpus.
> 
> I don't think that is a problem.
> 
> Kvmclock has a notion of a global system_time in nanoseconds (one value
> that defines the time, assigned with VM ioctl KVM_SET_CLOCK) and tries
> to propagate system_time into guest VCPUs as precisely as possible with
> the help of TSC.

Different pairs of values (system_time, tsc reads) are fundamentally
broken. This is why 

commit 489fb490dbf8dab0249ad82b56688ae3842a79e8
    x86, paravirt: Add a global synchronization point for pvclock

Exists.

Then to guarantee monotonicity you need to stop those updates
(or perform the pair update in lock step):

commit d828199e84447795c6669ff0e6c6d55eb9beeff6
    KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag


> sched_clock uses kvmclock to get nanoseconds since the system was
> brought up and [1/2] only works with this abstracted ns count.
> A poorly synchronized initial read is irrelevant because all VCPUs will
> be using the same constant offset.
> (We can never know the precise time anyway.)
> 
> > Besides:
> > 
> > 	1) Stable sched clock in guest does not depend on
> > 	   KVM_FEATURE_CLOCKSOURCE_STABLE_BIT.
> 
> Yes, thanks, I will remove that requirement in v1.  (We'd need to
> improve a loss of PVCLOCK_TSC_STABLE_BIT otherwise anyway.)
> 
> Because the clutchy dependency on PVCLOCK_TSC_STABLE_BIT is going away,
> there is now one possible unsigned overflow: in case the clock was very
> imprecise and VCPU1 managed to get smaller system_time than VCPU0 at the
> time of initialization.  It's very unlikely that we'll ever reach legal
> overflow so I can add a condition there.
> 
> > 	2) You rely on monotonicity across vcpus to perform 
> > 	   the 'minus delta that was read on vcpu0' calculation, but 
> > 	   monotonicity across vcpus can fail during runtime
> >            (say host clocksource goes tsc->hpet due to tsc instability).
> 
> That could be a problem, but I'm adding a VCPU independent constant to
> all reads -- does the new code rely on monoticity in places where the
> old one didn't?

The problem is overflow:
kvmclock non-monotonic across vcpus AND delta subtraction:

ptime | vcpu0kvmclock time | vcpu0sched_clock | vcpu1kvmclock time | vcpu1sched_clock
1	     1			   1 		        
2            2		           2			1		
3	     3			   0			2           -1
4	     4			   1			3	     0
5	     5			   2			4	     1
6	     6			   3			5	     2
7	     7			   4			6	     3

ptime is "physical time".

I prefer that the host counts from zero (there are less problems to
handle).

Again, that SuSE patch is incorrect and a huge hack.

An alternative is to enable PVCLOCK_COUNTS_FROM_ZERO _if_ the guest
requests the feature.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Radim Krčmář Sept. 21, 2015, 10 p.m. UTC | #7
2015-09-21 17:53-0300, Marcelo Tosatti:
> On Mon, Sep 21, 2015 at 10:00:27PM +0200, Radim Kr?má? wrote:
>> 2015-09-21 12:52-0300, Marcelo Tosatti:
>> > On Mon, Sep 21, 2015 at 05:12:10PM +0200, Radim Kr?má? wrote:
>> >> 2015-09-20 19:57-0300, Marcelo Tosatti:
>> >>> Is it counting from zero that breaks SLES10?
>> >> 
>> >> Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initializer did.
>> >> The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to,
>> >> while still keeping system time;  we used to allow that, which means an
>> >> ABI breakage.  (And we can't even say that guest's behaviour is against
>> >> the spec ...)
>> > 
>> > Because this behaviour was not defined.
>> 
>> It is defined by implementation.
>> 
>> > Can't you just condition PVCLOCK_COUNTS_FROM_ZERO behaviour on
>> > boot_vcpu_runs_old_kvmclock == false? 
>> > The patch would be much simpler.
>> 
>> If you mean the hunk in cover letter, I don't like it because we presume
>> that no other guests were broken.
> 
> Yes patch 1.

(I'd call them: a hunk in cover letter [0/2], patch 1 = [1/2], and
 patch 2 = [2/2].)

> Do you have an example of another guest that is broken? 

Yes, the hot-plug in any relatively recent Linux guest.

> Really, assuming things about the value of the msr read when you
> write to the MSR is very awkward. That SuSE code is incorrect, it fails
> on other occasions as well (see
> 54750f2cf042c42b4223d67b1bd20138464bde0e).

Yeah, I even read the SUSE implementation, sad times.

>> I really don't like it 
> 
> Because it changes behaviour that was previously unspecified?

No, because it adds another exemption to already ugly code.
(Talking about the hunk in [0/2].)

>> so I thought about other problems with
>> PVCLOCK_COUNTS_FROM_ZERO ... have you tried to hot-replug VCPU 0?
> 
> Can't unplug VCPU 0 i think.

You can.

>> It doesn't work well ;)
> 
> Can you hot-unplug vcpu 0? 

I can, I did, hot-plug bugged.

>> > The problem is, "selecting one read as the initial point" is inherently
>> > racy: that delta is relative to one moment (kvmclock read) at one vcpu,
>> > but must be applied to all vcpus.
>> 
>> I don't think that is a problem.
>> 
>> Kvmclock has a notion of a global system_time in nanoseconds (one value
>> that defines the time, assigned with VM ioctl KVM_SET_CLOCK) and tries
>> to propagate system_time into guest VCPUs as precisely as possible with
>> the help of TSC.
>
>Different pairs of values (system_time, tsc reads) are fundamentally
>broken. This is why 
>
>commit 489fb490dbf8dab0249ad82b56688ae3842a79e8
>    x86, paravirt: Add a global synchronization point for pvclock
>
>Exists.
>
>Then to guarantee monotonicity you need to stop those updates
>(or perform the pair update in lock step):
>
>commit d828199e84447795c6669ff0e6c6d55eb9beeff6
>    KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag

(Thanks for the commits, split values are the core design that avoids
 modifying rdtsc, so I'll be grad to read its details.)

>> > 	2) You rely on monotonicity across vcpus to perform 
>> > 	   the 'minus delta that was read on vcpu0' calculation, but 
>> > 	   monotonicity across vcpus can fail during runtime
>> >            (say host clocksource goes tsc->hpet due to tsc instability).
>> 
>> That could be a problem, but I'm adding a VCPU independent constant to
>> all reads -- does the new code rely on monoticity in places where the
>> old one didn't?
> 
> The problem is overflow:
> kvmclock non-monotonic across vcpus AND delta subtraction:
> 
> ptime | vcpu0kvmclock time | vcpu0sched_clock | vcpu1kvmclock time | vcpu1sched_clock
> 1          1                      1                         
> 2          2                      2                        1                

KVM sched clock not used before this point (I even moved the code to
make sure), so there is no problem with monotonicity.

> 3          3                      0                        2            -1
                                                                          ^^^
-1 is the overflow.  Very unlikely to ever happen in Linux, as we now
boot other VCPUs much later than the KVM clock configuration and it can
only happen if VCPU1 is running as the reconfiguration takes place, but
a simple

  if (vcpu[x].sched_clock <= global_sched_clock_offset)
  	return 0;

would take care of it.  The time would stand still for a while, which is
not a huge problem for boot-only scenario.  Other VCPUs couldn't be
reading KVM sched before it was configured, so only operations that
happen before vcpu1sched_clock goes positive are affected.
(We have other problems if the window can be long.)

> 4          4                      1                        3             0
> 5          5                      2                        4             1
> 6          6                      3                        5             2
> 7          7                      4                        6             3
> 
> ptime is "physical time".
> 
> I prefer that the host counts from zero (there are less problems to
> handle).

The main advantage of a hypervisor solution for me is one saved
subtraction and comparison on every sched_clock().

> Again, that SuSE patch is incorrect and a huge hack.

I'm not disputing that.  (Which doesn't justify the breakage.)

> An alternative is to enable PVCLOCK_COUNTS_FROM_ZERO _if_ the guest
> requests the feature.

Yes, and the alternative needs new paravirt interface.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marcelo Tosatti Sept. 21, 2015, 10:37 p.m. UTC | #8
On Tue, Sep 22, 2015 at 12:00:39AM +0200, Radim Kr?má? wrote:
> 2015-09-21 17:53-0300, Marcelo Tosatti:
> > On Mon, Sep 21, 2015 at 10:00:27PM +0200, Radim Kr?má? wrote:
> >> 2015-09-21 12:52-0300, Marcelo Tosatti:
> >> > On Mon, Sep 21, 2015 at 05:12:10PM +0200, Radim Kr?má? wrote:
> >> >> 2015-09-20 19:57-0300, Marcelo Tosatti:
> >> >>> Is it counting from zero that breaks SLES10?
> >> >> 
> >> >> Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initializer did.
> >> >> The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to,
> >> >> while still keeping system time;  we used to allow that, which means an
> >> >> ABI breakage.  (And we can't even say that guest's behaviour is against
> >> >> the spec ...)
> >> > 
> >> > Because this behaviour was not defined.
> >> 
> >> It is defined by implementation.
> >> 
> >> > Can't you just condition PVCLOCK_COUNTS_FROM_ZERO behaviour on
> >> > boot_vcpu_runs_old_kvmclock == false? 
> >> > The patch would be much simpler.
> >> 
> >> If you mean the hunk in cover letter, I don't like it because we presume
> >> that no other guests were broken.
> > 
> > Yes patch 1.
> 
> (I'd call them: a hunk in cover letter [0/2], patch 1 = [1/2], and
>  patch 2 = [2/2].)
> 
> > Do you have an example of another guest that is broken? 
> 
> Yes, the hot-plug in any relatively recent Linux guest.
> 
> > Really, assuming things about the value of the msr read when you
> > write to the MSR is very awkward. That SuSE code is incorrect, it fails
> > on other occasions as well (see
> > 54750f2cf042c42b4223d67b1bd20138464bde0e).
> 
> Yeah, I even read the SUSE implementation, sad times.
> 
> >> I really don't like it 
> > 
> > Because it changes behaviour that was previously unspecified?
> 
> No, because it adds another exemption to already ugly code.
> (Talking about the hunk in [0/2].)
> 
> >> so I thought about other problems with
> >> PVCLOCK_COUNTS_FROM_ZERO ... have you tried to hot-replug VCPU 0?
> > 
> > Can't unplug VCPU 0 i think.
> 
> You can.
> 
> >> It doesn't work well ;)
> > 
> > Can you hot-unplug vcpu 0? 
> 
> I can, I did, hot-plug bugged.

Due to PVCLOCK_COUNTS_FROM_ZERO patch?  

> >> > The problem is, "selecting one read as the initial point" is inherently
> >> > racy: that delta is relative to one moment (kvmclock read) at one vcpu,
> >> > but must be applied to all vcpus.
> >> 
> >> I don't think that is a problem.
> >> 
> >> Kvmclock has a notion of a global system_time in nanoseconds (one value
> >> that defines the time, assigned with VM ioctl KVM_SET_CLOCK) and tries
> >> to propagate system_time into guest VCPUs as precisely as possible with
> >> the help of TSC.
> >
> >Different pairs of values (system_time, tsc reads) are fundamentally
> >broken. This is why 
> >
> >commit 489fb490dbf8dab0249ad82b56688ae3842a79e8
> >    x86, paravirt: Add a global synchronization point for pvclock
> >
> >Exists.
> >
> >Then to guarantee monotonicity you need to stop those updates
> >(or perform the pair update in lock step):
> >
> >commit d828199e84447795c6669ff0e6c6d55eb9beeff6
> >    KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag
> 
> (Thanks for the commits, split values are the core design that avoids
>  modifying rdtsc, 

Which is broken (thats the point).

> so I'll be grad to read its details.)
> 
> >> > 	2) You rely on monotonicity across vcpus to perform 
> >> > 	   the 'minus delta that was read on vcpu0' calculation, but 
> >> > 	   monotonicity across vcpus can fail during runtime
> >> >            (say host clocksource goes tsc->hpet due to tsc instability).
> >> 
> >> That could be a problem, but I'm adding a VCPU independent constant to
> >> all reads -- does the new code rely on monoticity in places where the
> >> old one didn't?
> > 
> > The problem is overflow:
> > kvmclock non-monotonic across vcpus AND delta subtraction:
> > 
> > ptime | vcpu0kvmclock time | vcpu0sched_clock | vcpu1kvmclock time | vcpu1sched_clock
> > 1          1                      1                         
> > 2          2                      2                        1                
> 
> KVM sched clock not used before this point (I even moved the code to
> make sure), so there is no problem with monotonicity.
> 
> > 3          3                      0                        2            -1
>                                                                           ^^^
> -1 is the overflow.  Very unlikely to ever happen in Linux, as we now
> boot other VCPUs much later than the KVM clock configuration and it can
> only happen if VCPU1 is running as the reconfiguration takes place, but
> a simple
> 
>   if (vcpu[x].sched_clock <= global_sched_clock_offset)
>   	return 0;
> 
> would take care of it.  The time would stand still for a while, which is
> not a huge problem for boot-only scenario.  

Look at 

commit 489fb490dbf8dab0249ad82b56688ae3842a79e8
    x86, paravirt: Add a global synchronization point for pvclock

And think of what an overflow does.

Note: i tried your solution before (to add offset) and saw this issue
in practice.

> Other VCPUs couldn't be
> reading KVM sched before it was configured, so only operations that
> happen before vcpu1sched_clock goes positive are affected.
> (We have other problems if the window can be long.)

The point where vcpu1sched_clock is negative is after kvmclock is
initialized.

> 
> > 4          4                      1                        3             0
> > 5          5                      2                        4             1
> > 6          6                      3                        5             2
> > 7          7                      4                        6             3
> > 
> > ptime is "physical time".
> > 
> > I prefer that the host counts from zero (there are less problems to
> > handle).
> 
> The main advantage of a hypervisor solution for me is one saved
> subtraction and comparison on every sched_clock().

To me are such complications as handling overflows.

> > Again, that SuSE patch is incorrect and a huge hack.
> 
> I'm not disputing that.  (Which doesn't justify the breakage.)
> 
> > An alternative is to enable PVCLOCK_COUNTS_FROM_ZERO _if_ the guest
> > requests the feature.
> 
> Yes, and the alternative needs new paravirt interface.

So either:

Proceed with guest solution:
-> Make sure the overflow can't happen (and write down why not in the
code). Don't assume a small delta between kvmclock values of vcpus.
-> Handle stable -> non-stable kvmclock transition.
-> kvmclock counts from zero should not depend on stable kvmclock
(because nohz_full should work on hpet host systems).

Enable counts-from-zero on MSR_KVM_SYSTEM_TIME_NEW:
-> Figure out whats wrong with different kvmclock values on hotplug, 
and fix it.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marcelo Tosatti Sept. 22, 2015, 12:40 a.m. UTC | #9
> So either:
> 
> Proceed with guest solution:
> -> Make sure the overflow can't happen (and write down why not in the
> code). Don't assume a small delta between kvmclock values of vcpus.
> -> Handle stable -> non-stable kvmclock transition.
> -> kvmclock counts from zero should not depend on stable kvmclock
> (because nohz_full should work on hpet host systems).
> 
> Enable counts-from-zero on MSR_KVM_SYSTEM_TIME_NEW:
> -> Figure out whats wrong with different kvmclock values on hotplug, 
> and fix it.

Find data which allows you to differentiate between hotplug of pCPU-0
and system initialization.

Easy one: whether MSR_KVM_SYSTEM_TIME_NEW contains valid data (that is
kvmclock is enabled) for vCPUs other than vCPU-0.

This can't be the case on system initialization (otherwise the host will
be corrupting guest memory), and must be the case when hotplugging
vCPU-0.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Radim Krčmář Sept. 22, 2015, 2:33 p.m. UTC | #10
2015-09-21 19:37-0300, Marcelo Tosatti:
> On Tue, Sep 22, 2015 at 12:00:39AM +0200, Radim Kr?má? wrote:
>> 2015-09-21 17:53-0300, Marcelo Tosatti:
>> > On Mon, Sep 21, 2015 at 10:00:27PM +0200, Radim Kr?má? wrote:
>> >> so I thought about other problems with
>> >> PVCLOCK_COUNTS_FROM_ZERO ... have you tried to hot-replug VCPU 0?
>> >> It doesn't work well ;)
>> > 
>> > Can you hot-unplug vcpu 0? 
>> 
>> I can, I did, hot-plug bugged.
> 
> Due to PVCLOCK_COUNTS_FROM_ZERO patch?  

Yes, VCPU 0 writes MSR_KVM_SYSTEM_TIME_NEW to access the kvm clock, but
that very operation resets system_time.  The result is a long sleep,
because at least Linux doesn't handle that shift.
(Just in case, works fine after reverting the one host patch.)

With more thinking, PVCLOCK_COUNTS_FROM_ZERO also breaks suspend and
resume as we write the MSR there.  Testing shows that is isn't as bad as
the hotplug, because `sleep 1` returns in a second and `date` is fine,
but having multiple 0 points in `dmesg` isn't very nice either.

There are too many problems with PVCLOCK_COUNTS_FROM_ZERO, I'll send the
revert with cc:stable soon.  (Without any guest changes to sched_clock.)

>> >> > The problem is, "selecting one read as the initial point" is inherently
>> >> > racy: that delta is relative to one moment (kvmclock read) at one vcpu,
>> >> > but must be applied to all vcpus.
>> >> 
>> >> I don't think that is a problem.
>> >> 
>> >> Kvmclock has a notion of a global system_time in nanoseconds (one value
>> >> that defines the time, assigned with VM ioctl KVM_SET_CLOCK) and tries
>> >> to propagate system_time into guest VCPUs as precisely as possible with
>> >> the help of TSC.
>> >
>> >Different pairs of values (system_time, tsc reads) are fundamentally
>> >broken. This is why 
>> >
>> >commit 489fb490dbf8dab0249ad82b56688ae3842a79e8
>> >    x86, paravirt: Add a global synchronization point for pvclock
>> >
>> >Exists.
>> >
>> >Then to guarantee monotonicity you need to stop those updates
>> >(or perform the pair update in lock step):
>> >
>> >commit d828199e84447795c6669ff0e6c6d55eb9beeff6
>> >    KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag
>> 
>> (Thanks for the commits, split values are the core design that avoids
>>  modifying rdtsc, 
> 
> Which is broken (thats the point).

Yes, but I don't think we could use tsc any better without having a
correct guest's time in the host.

We chose to provide kvmclock without a host_tsc -> host_ns function in
the host.  It is just impossible to give the guest a precise (tsc, ns)
tuple if the host is not using TSC for its clock.  (We don't have
control over the machine and using two reads of time can't be precise.)

>> > 3          3                      0                        2            -1
>>                                                                           ^^^
>> -1 is the overflow.  Very unlikely to ever happen in Linux, as we now
>> boot other VCPUs much later than the KVM clock configuration and it can
>> only happen if VCPU1 is running as the reconfiguration takes place, but
>> a simple
>> 
>>   if (vcpu[x].sched_clock <= global_sched_clock_offset)
>>   	return 0;
>> 
>> would take care of it.  The time would stand still for a while, which is
>> not a huge problem for boot-only scenario.  
> 
> Look at 
> 
> commit 489fb490dbf8dab0249ad82b56688ae3842a79e8
>     x86, paravirt: Add a global synchronization point for pvclock

This patch shows the place where hotplug likely fails -- system_time
goes back to 0 so time is frozen until system_time reaches the original
value, again.
It also makes the result of kvm_clock_read monotonic across all vcpus,
so [1/2] doesn't need to add code to handle initial overflow.

> And think of what an overflow does.

If the clock was slightly negative (unsigned) and then overflowed,
sched_clock would freeze because it would never reach that high value
again.  Offset in [1/2] is applied later and can't reach negative.

> Note: i tried your solution before (to add offset) and saw this issue
> in practice.

Great, thanks, I will thoroughly test it to see where it fails.

>> Other VCPUs couldn't be
>> reading KVM sched before it was configured, so only operations that
>> happen before vcpu1sched_clock goes positive are affected.
>> (We have other problems if the window can be long.)
> 
> The point where vcpu1sched_clock is negative is after kvmclock is
> initialized.

Existing code takes care of that -- vcpu0 reads a value before any user
of sched_clock could have so later reads must be larger.

>> > An alternative is to enable PVCLOCK_COUNTS_FROM_ZERO _if_ the guest
>> > requests the feature.
>> 
>> Yes, and the alternative needs new paravirt interface.
> 
> So either:
> 
> Proceed with guest solution:
> -> Make sure the overflow can't happen (and write down why not in the
> code). Don't assume a small delta between kvmclock values of vcpus.

I will comment that 489fb490dbf8 ("x86, paravirt: Add a global
synchronization point for pvclock") gives all guarantees that prevent
the overflow on secondary VCPU bringup.

> -> Handle stable -> non-stable kvmclock transition.

No need to do anything here either.  It will behave like before.

> -> kvmclock counts from zero should not depend on stable kvmclock
> (because nohz_full should work on hpet host systems).

Yes, I will do that.

> Enable counts-from-zero on MSR_KVM_SYSTEM_TIME_NEW:
> -> Figure out whats wrong with different kvmclock values on hotplug, 
> and fix it.

It's wrong to reset the time unconditionally on an operation that can be
used anytime (and sometimes must be used even if we don't want to reset
the time), revert is the best fix.

We can do guest-only or paravirtualized zeroing.
Paravirtualized zeroing has more complicated code, but simpler context.
I mainly don't like the idea of adding an interface for a feature the
guest can implement at the same level of quality.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Radim Krčmář Sept. 22, 2015, 2:46 p.m. UTC | #11
[My mailbox filled between yours two emails, so I have pasted the latter
 one from archives and replied to a wrong one.]

2015-09-21 21:42-0300, Marcelo Tosatti:
> 2015-09-21 19:37-0300, Marcelo Tosatti:
>> Enable counts-from-zero on MSR_KVM_SYSTEM_TIME_NEW:
>> -> Figure out whats wrong with different kvmclock values on hotplug, 
>> and fix it.
> 
> Find data which allows you to differentiate between hotplug of pCPU-0
> and system initialization.

This kind of programminng is exactly what I don't like and won't be
doing.  We should at least be removing previous mistakes when adding new
ones.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paolo Bonzini Sept. 28, 2015, 11:05 a.m. UTC | #12
On 18/09/2015 17:54, Radim Kr?má? wrote:
> This patch series will be disabling PVCLOCK_COUNTS_FROM_ZERO flag and is
> RFC because I haven't explored many potential problems or tested it.
> 
> [1/2] uses a different algorithm in the guest to start counting from 0.
> [2/2] stops exposing PVCLOCK_COUNTS_FROM_ZERO in the hypervisor.
> 
> A viable alternative would be to implement opt-in features in kvm clock.
> 
> And because we probably only broke one old user (the infamous SLES 10), a
> workaround like this is also possible: (but I'd rather not do that)

Thanks,

applying 2/2 for 4.4 and 1/2 for 4.3.

Paolo

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a60bdbccff51..ae9049248aaf 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2007,7 +2007,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  
>  			ka->boot_vcpu_runs_old_kvmclock = tmp;
>  
> -			ka->kvmclock_offset = -get_kernel_ns();
> +			if (!ka->boot_vcpu_runs_old_kvmclock)
> +				ka->kvmclock_offset = -get_kernel_ns();
>  		}
>  
>  		vcpu->arch.time = data;
> 
> 
> Radim Kr?má? (2):
>   x86: kvmclock: abolish PVCLOCK_COUNTS_FROM_ZERO
>   Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock
>     system MSR"
> 
>  arch/x86/include/asm/pvclock-abi.h |  1 +
>  arch/x86/kernel/kvmclock.c         | 46 +++++++++++++++++++++++++++++---------
>  arch/x86/kvm/x86.c                 |  4 ----
>  3 files changed, 36 insertions(+), 15 deletions(-)
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a60bdbccff51..ae9049248aaf 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2007,7 +2007,8 @@  int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 
 			ka->boot_vcpu_runs_old_kvmclock = tmp;
 
-			ka->kvmclock_offset = -get_kernel_ns();
+			if (!ka->boot_vcpu_runs_old_kvmclock)
+				ka->kvmclock_offset = -get_kernel_ns();
 		}
 
 		vcpu->arch.time = data;