From patchwork Thu Apr 15 18:37:28 2010 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Glauber Costa X-Patchwork-Id: 92813 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by demeter.kernel.org (8.14.3/8.14.3) with ESMTP id o3FIeq4M032175 for ; Thu, 15 Apr 2010 18:40:53 GMT Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755961Ab0DOSkG (ORCPT ); Thu, 15 Apr 2010 14:40:06 -0400 Received: from mx1.redhat.com ([209.132.183.28]:21572 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755450Ab0DOSjL (ORCPT ); Thu, 15 Apr 2010 14:39:11 -0400 Received: from int-mx02.intmail.prod.int.phx2.redhat.com (int-mx02.intmail.prod.int.phx2.redhat.com [10.5.11.12]) by mx1.redhat.com (8.13.8/8.13.8) with ESMTP id o3FIdAoV026157 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Thu, 15 Apr 2010 14:39:11 -0400 Received: from localhost.localdomain (virtlab1.virt.bos.redhat.com [10.16.72.21]) by int-mx02.intmail.prod.int.phx2.redhat.com (8.13.8/8.13.8) with ESMTP id o3FId6cg019625; Thu, 15 Apr 2010 14:39:10 -0400 From: Glauber Costa To: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org, avi@redhat.com Subject: [PATCH 5/5] add documentation about kvmclock Date: Thu, 15 Apr 2010 14:37:28 -0400 Message-Id: <1271356648-5108-6-git-send-email-glommer@redhat.com> In-Reply-To: <1271356648-5108-5-git-send-email-glommer@redhat.com> References: <1271356648-5108-1-git-send-email-glommer@redhat.com> <1271356648-5108-2-git-send-email-glommer@redhat.com> <1271356648-5108-3-git-send-email-glommer@redhat.com> <1271356648-5108-4-git-send-email-glommer@redhat.com> <1271356648-5108-5-git-send-email-glommer@redhat.com> X-Scanned-By: MIMEDefang 2.67 on 10.5.11.12 Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org X-Greylist: IP, sender and recipient auto-whitelisted, not delayed by milter-greylist-4.2.3 (demeter.kernel.org [140.211.167.41]); Thu, 15 Apr 2010 18:40:53 +0000 (UTC) diff --git a/Documentation/kvm/kvmclock.txt b/Documentation/kvm/kvmclock.txt new file mode 100644 index 0000000..21008bb --- /dev/null +++ b/Documentation/kvm/kvmclock.txt @@ -0,0 +1,138 @@ +KVM Paravirtual Clocksource driver +Glauber Costa, Red Hat Inc. +================================== + +1. General Description +======================= + +Keeping time in virtual machine is acknowledged as a hard problem. The most +basic mode of operation, usually done by older guests, assumes a fixed length +between timer interrupts. It then counts the number of interrupts and +calculates elapsed time. This method fails easily in virtual machines, since +we can't guarantee that the virtual interrupt will be delivered in time. + +Another possibility is to emulate modern devices like HPET, or any other we +see fit. A modern guest which implements something like the clocksource +infrastructure, can then ask this virtual device about current time when it +needs to. The problem with this approach, is that it bumps the guest out +of guest mode operation, and in some cases, even to userspace very frequently. + +In this context, the best approach is to provide the guest with a +virtualization-aware (paravirtual) clock device. It the asks the hypervisor +about current time, guaranteeing both stable and accurate timekeeping. + +2. kvmclock basics +=========================== + +When supported by the hypervisor, guests can register a memory page +to contain kvmclock data. This page has to be present in guest's address space +throughout its whole life. The hypervisor continues to write to it until it is +explicitly disabled or the guest is turned off. + +2.1 kvmclock availability +------------------------- + +Guests that want to take advantage of kvmclock should first check its +availability through cpuid. + +kvm features are presented to the guest in leaf 0x40000001. Bit 3 indicates +the present of kvmclock. Bit 0 indicates that kvmclock is present, but the +old MSR set must be used. See section 2.3 for details. + +2.2 kvmclock functionality +-------------------------- + +Two MSRs are provided by the hypervisor, controlling kvmclock operation: + + * MSR_KVM_WALL_CLOCK, value 0x4b564d00 and + * MSR_KVM_SYSTEM_TIME, value 0x4b564d01. + +The first one is only used in rare situations, like boot-time and a +suspend-resume cycle. Data is disposable, and after used, the guest +may use it for something else. This is hardly a hot path for anything. +The Hypervisor fills in the address provided through this MSR with the +following structure: + +struct pvclock_wall_clock { + u32 version; + u32 sec; + u32 nsec; +} __attribute__((__packed__)); + +Guest should only trust data to be valid when version haven't changed before +and after reads of sec and nsec. Besides not changing, it has to be an even +number. Hypervisor may write an odd number to version field to indicate that +an update is in progress. + +MSR_KVM_SYSTEM_TIME, on the other hand, has persistent data, and is +constantly updated by the hypervisor with time information. The data +written in this MSR contains two pieces of information: the address in which +the guests expects time data to be present 4-byte aligned or'ed with an +enabled bit. If one wants to shutdown kvmclock, it just needs to write +anything that has 0 as its last bit. + +Time information presented by the hypervisor follows the structure: + +struct pvclock_vcpu_time_info { + u32 version; + u32 pad0; + u64 tsc_timestamp; + u64 system_time; + u32 tsc_to_system_mul; + s8 tsc_shift; + u8 pad[3]; +} __attribute__((__packed__)); + +The version field plays the same role as with the one in struct +pvclock_wall_clock. The other fields, are: + + a. tsc_timestamp: the guest-visible tsc (result of rdtsc + tsc_offset) of + this cpu at the moment we recorded system_time. Note that some time is + inevitably spent between system_time and tsc_timestamp measurements. + Guests can subtract this quantity from the current value of tsc to obtain + a delta to be added to system_time + + b. system_time: this is the most recent host-time we could be provided with. + host gets it through ktime_get_ts, using whichever clocksource is + registered at the moment + + c. tsc_to_system_mul: this is the number that tsc delta has to be multiplied + by in order to obtain time in nanoseconds. Hypervisor is free to change + this value in face of events like cpu frequency change, pcpu migration, + etc. + + d. tsc_shift: guests must shift + +With this information available, guest calculates current time as: + + T = kt + to_nsec(tsc - tsc_0) + +2.3 Compatibility MSRs +---------------------- + +Guests running on top of older hypervisors may have to use a different set of +MSRs. This is because originally, kvmclock MSRs were exported within a +reserved range by accident. Guests should check cpuid leaf 0x40000001 for the +presence of kvmclock. If bit 3 is disabled, but bit 0 is enabled, guests can +have access to kvmclock functionality through + + * MSR_KVM_WALL_CLOCK_OLD, value 0x11 and + * MSR_KVM_SYSTEM_TIME_OLD, value 0x12. + +Note, however, that this is deprecated. + +3. Migration +============ + +Two ioctls are provided to aid the task of migration: + + * KVM_GET_CLOCK and + * KVM_SET_CLOCK + +Their aim is to control an offset that can be summed to system_time, in order +to guarantee monotonicity on the time over guest migration. Source host +executes KVM_GET_CLOCK, obtaining the last valid timestamp in this host, while +destination sets it with KVM_SET_CLOCK. It's the destination responsibility to +never return time that is less than that. + +