
x86/kvm: fix condition to update kvm master clocks

Message ID 20160527181139.GA18797@potion (mailing list archive)
State New, archived

Commit Message

Radim Krčmář May 27, 2016, 6:11 p.m. UTC
2016-05-27 20:28+0300, Roman Kagan:
> On Thu, May 26, 2016 at 10:19:36PM +0200, Radim Krčmář wrote:
>> >  	    atomic_read(&kvm_guest_has_master_clock) != 0)
>> 
>> And I don't see why we don't want to enable master clock if the host
>> switches back to TSC.
> 
> Agreed (even though I guess it's not very likely: AFAICS once switched
> to a different clocksource, the host can switch back to TSC only upon
> human manipulating /sys/devices/system/clocksource).

Yeah, it's a corner case.  A human would have to switch away from tsc as
well; the automatic switch happens only when tsc is no longer usable,
AFAIK.

>> >  		queue_work(system_long_wq, &pvclock_gtod_work);
>> 
>> Queueing unconditionally seems to be the correct thing to do.
> 
> The notifier is registered at kvm module init, so the work will be
> scheduled even when there are no VMs at all.

Good point, we don't want to call pvclock_gtod_notify in that case
either.  Registering (unregistering) with the first (last) VM should be
good enough ... what about adding something based on this?


Comments

Roman Kagan May 27, 2016, 6:46 p.m. UTC | #1
On Fri, May 27, 2016 at 08:11:40PM +0200, Radim Krčmář wrote:
> 2016-05-27 20:28+0300, Roman Kagan:
> >> Queueing unconditionally seems to be the correct thing to do.
> > 
> > The notifier is registered at kvm module init, so the work will be
> > scheduled even when there are no VMs at all.
> 
> Good point, we don't want to call pvclock_gtod_notify in that case
> either.  Registering (unregistering) with the first (last) VM should be
> good enough ... what about adding something based on this?
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 37af23052470..0779f0f01523 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -655,6 +655,8 @@ static struct kvm *kvm_create_vm(unsigned long type)
>  		goto out_err;
>  
>  	spin_lock(&kvm_lock);
> +	if (list_empty(&kvm->vm_list))
> +		kvm_arch_create_first_vm(kvm);
>  	list_add(&kvm->vm_list, &vm_list);
>  	spin_unlock(&kvm_lock);
>  
> @@ -709,6 +711,8 @@ static void kvm_destroy_vm(struct kvm *kvm)
>  	kvm_arch_sync_events(kvm);
>  	spin_lock(&kvm_lock);
>  	list_del(&kvm->vm_list);
> +	if (list_empty(&kvm->vm_list))
> +		kvm_arch_destroy_last_vm(kvm);
>  	spin_unlock(&kvm_lock);
>  	kvm_free_irq_routing(kvm);
>  	for (i = 0; i < KVM_NR_BUSES; i++)

Makes perfect sense IMO.

> >> Interaction between kvm_gen_update_masterclock(), pvclock_gtod_work(),
> >> and NTP could be a problem:  kvm_gen_update_masterclock() only has to
> >> run once per VM, but pvclock_gtod_work() calls it on every VCPU, so
> >> frequent NTP updates on bigger guests could kill performance.
> > 
> > Unfortunately, things are worse than that: this stuff is updated on
> > every *tick* on the timekeeping CPU, so, as long as you keep at least
> > one of your CPUs busy, the update rate can reach HZ.  The frequency of
> > NTP updates is unimportant; it happens without NTP updates at all.
> > 
> > So I tend to agree that we're perhaps better off not fixing this bug and
> > leaving the kvm_clocks to drift until we figure out how to do it with
> > acceptable overhead.
> 
> Yuck ... the hunk below could help a bit.
> I haven't checked if the timekeeping code updates gtod and therefore
> sets 'was_set' even when the resulting time hasn't changed, so we might
> need to do more to avoid useless situations.
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a8c7ca34ee5d..37ed0a342bf1 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5802,12 +5802,15 @@ static DECLARE_WORK(pvclock_gtod_work, pvclock_gtod_update_fn);
>  /*
>   * Notification about pvclock gtod data update.
>   */
> -static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused,
> +static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long was_set,
>  			       void *priv)
>  {
>  	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
>  	struct timekeeper *tk = priv;
>  
> +	if (!was_set)
> +		return 0;
> +
>  	update_pvclock_gtod(tk);
>  

Nope, this parameter is only set when there's a step-like change in the
time.  The timekeeper itself is always updated.  I guess we could
mitigate the costs somewhat if we skipped updating the gtod copy until
the accumulated error reaches a certain limit; not sure if that's gonna
help, though.

Roman.
Radim Krčmář May 27, 2016, 7:29 p.m. UTC | #2
2016-05-27 21:46+0300, Roman Kagan:
> On Fri, May 27, 2016 at 08:11:40PM +0200, Radim Krčmář wrote:
> > 2016-05-27 20:28+0300, Roman Kagan:
>> >> Interaction between kvm_gen_update_masterclock(), pvclock_gtod_work(),
>> >> and NTP could be a problem:  kvm_gen_update_masterclock() only has to
>> >> run once per VM, but pvclock_gtod_work() calls it on every VCPU, so
>> >> frequent NTP updates on bigger guests could kill performance.
>> > 
>> > Unfortunately, things are worse than that: this stuff is updated on
>> > every *tick* on the timekeeping CPU, so, as long as you keep at least
>> > one of your CPUs busy, the update rate can reach HZ.  The frequency of
>> > NTP updates is unimportant; it happens without NTP updates at all.
>> > 
>> > So I tend to agree that we're perhaps better off not fixing this bug and
>> > leaving the kvm_clocks to drift until we figure out how to do it with
>> > acceptable overhead.
>> 
>> Yuck ... the hunk below could help a bit.
>> I haven't checked if the timekeeping code updates gtod and therefore
>> sets 'was_set' even when the resulting time hasn't changed, so we might
>> need to do more to avoid useless situations.
>> 
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index a8c7ca34ee5d..37ed0a342bf1 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -5802,12 +5802,15 @@ static DECLARE_WORK(pvclock_gtod_work, pvclock_gtod_update_fn);
>>  /*
>>   * Notification about pvclock gtod data update.
>>   */
>> -static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused,
>> +static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long was_set,
>>  			       void *priv)
>>  {
>>  	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
>>  	struct timekeeper *tk = priv;
>>  
>> +	if (!was_set)
>> +		return 0;
>> +
>>  	update_pvclock_gtod(tk);
>>  
> 
> Nope, this parameter is only set when there's a step-like change in the
> time.  The timekeeper itself is always updated.  I guess we could
> mitigate the costs somewhat if we skipped updating the gtod copy until
> the accumulated error reaches a certain limit; not sure if that's gonna
> help, though.

I see, timekeeping_adjust() isn't covered, but it should not adjust on
every tick, so we could propagate information about adjustments to
pvclock_gtod_notify (renaming 'unused' to 'has_changed'), because pvclock
only cares about changes in time.
Adding another threshold is a reasonable improvement if adjustments
happen too often, but we need to fix pvclock_gtod_update_fn() in any
case.
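
To make this concrete, here are two untested sketches against
arch/x86/kvm/x86.c; take the exact field and helper names with a grain
of salt.  The first is a KVM-side approximation of "only act when the
time actually changed": skip the copy when there was no step and the
frequency is still the one we already published (this deliberately
ignores the slowly accumulating error you mentioned):

static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long was_set,
			       void *priv)
{
	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
	struct timekeeper *tk = priv;

	/*
	 * Nothing the guest-visible clock depends on has changed:
	 * no step (was_set == 0) and the same mult as the snapshot
	 * update_pvclock_gtod() stored on the previous call.
	 */
	if (!was_set && tk->tkr_mono.mult == gtod->clock.mult)
		return 0;

	update_pvclock_gtod(tk);
	...
}

The second is for pvclock_gtod_update_fn(): kvm_gen_update_masterclock()
already fans out KVM_REQ_CLOCK_UPDATE to every vCPU of a VM, so
requesting the masterclock update on a single vCPU per VM should be
enough:

static void pvclock_gtod_update_fn(struct work_struct *work)
{
	struct kvm *kvm;
	struct kvm_vcpu *vcpu;

	spin_lock(&kvm_lock);
	list_for_each_entry(kvm, &vm_list, vm_list) {
		/* One request per VM instead of one per vCPU. */
		vcpu = kvm_get_vcpu(kvm, 0);
		if (vcpu)
			kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
	}
	atomic_set(&kvm_guest_has_master_clock, 0);
	spin_unlock(&kvm_lock);
}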

Am I missing anything else?

Thanks.

Patch

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 37af23052470..0779f0f01523 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -655,6 +655,8 @@  static struct kvm *kvm_create_vm(unsigned long type)
 		goto out_err;
 
 	spin_lock(&kvm_lock);
+	if (list_empty(&kvm->vm_list))
+		kvm_arch_create_first_vm(kvm);
 	list_add(&kvm->vm_list, &vm_list);
 	spin_unlock(&kvm_lock);
 
@@ -709,6 +711,8 @@  static void kvm_destroy_vm(struct kvm *kvm)
 	kvm_arch_sync_events(kvm);
 	spin_lock(&kvm_lock);
 	list_del(&kvm->vm_list);
+	if (list_empty(&kvm->vm_list))
+		kvm_arch_destroy_last_vm(kvm);
 	spin_unlock(&kvm_lock);
 	kvm_free_irq_routing(kvm);
 	for (i = 0; i < KVM_NR_BUSES; i++)
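
For reference, an untested sketch of what the x86 side of the two new
hooks might look like (the hook names come from the patch above; the
registration itself just moves out of kvm_arch_init(), which currently
registers the notifier at module load regardless of whether any VM
exists):

#ifdef CONFIG_X86_64
void kvm_arch_create_first_vm(struct kvm *kvm)
{
	/* Start listening for timekeeper updates only once a VM exists. */
	pvclock_gtod_register_notifier(&pvclock_gtod_notifier);
}

void kvm_arch_destroy_last_vm(struct kvm *kvm)
{
	/* Last VM is gone: stop the per-tick notifications. */
	pvclock_gtod_unregister_notifier(&pvclock_gtod_notifier);
}
#endif

Whether the pending pvclock_gtod_work also needs cancelling here, and
whether that would be safe with kvm_lock held (the hooks are called
under it), would need a closer look; other architectures would get empty
stubs.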
