diff mbox

[Bug,59521] New: KVM linux guest reads uninitialized pvclock values before executing rdmsr MSR_KVM_WALL_CLOCK

Message ID bug-59521-28872@https.bugzilla.kernel.org/ (mailing list archive)
State New, archived
Headers show

Commit Message

bugzilla-daemon@bugzilla.kernel.org June 10, 2013, 2:11 p.m. UTC
https://bugzilla.kernel.org/show_bug.cgi?id=59521

           Summary: KVM linux guest reads uninitialized pvclock values
                    before executing rdmsr MSR_KVM_WALL_CLOCK
           Product: Virtualization
           Version: unspecified
    Kernel Version: 3.8 and higher
          Platform: All
        OS/Version: Linux
              Tree: Mainline
            Status: NEW
          Severity: high
          Priority: P1
         Component: kvm
        AssignedTo: virtualization_kvm@kernel-bugs.osdl.org
        ReportedBy: eabatalov89@gmail.com
        Regression: Yes


It turns out that since linux-image-3.8.0-1-generic_3.8.0-1.5 built from
sources Ubuntu kernel reads uninitialized KVM paravirtualized clocksource data
while working under hypervisor which supports KVM_FEATURE_CLOCKSOURCE.

I've trapped all the guest r/w memory accesses to paravirtualized clock
structures to proof it.

The event sequence is the following:
1. Kernel bootstrap cpu executes  rdmsr MSR_KVM_WALL_CLOCK and reads proper
initialized values of kvm clocksource
2. Memory for percpu paravirtualized KVM clocksources is allocated during
initialization of KVM subsystem.
arch/x86/kernel/kvmclock.c:
221 void __init kvmclock_init(void)
...
240         mem = memblock_alloc(size, PAGE_SIZE);
3. SMP boot is performed. One by one cpus 0, 1, 2, .. n are initialized.
4. Each cpu reads its uninitialized preallocated paravirtualized wallclock
structures before executing rdmsr MSR_KVM_WALL_CLOCK which should initialize
them. So the contents of structs are more or less random.

Here is the struct which is read:
struct pvclock_vcpu_time_info {
 62         u32   version;
 63         u32   pad0;
 64         u64   tsc_timestamp;
 65         u64   system_time;
 66         u32   tsc_to_system_mul;
 67         s8    tsc_shift;
 68         u8    flags;
 69         u8    pad[2];
 70 } __attribute__((__packed__)); /* 32 bytes */
Reading uninitialized contents creates 2 problems for us:
1. If version field is odd then cpu spins forever (see
./arch/x86/kernel/pvclock.c:pvclock_clocksource_read)
2. Depending on values of other struct fields and on the features of
virtualized hardware kernel scheduler can behave in different ways. I've
observed kernel boot hang after start of SMP initialization.

SMP cpu#0 doesn't read unintialized data because it reads bootstrap cpu struct
which was properly initialized using preliminary rdmsr.
At the same time SMP cpu#0 performs the same thing as other SMP cpus - executes
rdmsr MSR_KVM_WALL_CLOCK after reading its paravirtualized clock struct. So no
reproduction on UMP machine.

It may be hard to reproduce the bug because of its random nature.
To easily reproduce it I propose the following patch:

This patch set version field to an odd value so if a cpu reads it before rdmsr
it spins forever because hypervisor won't set it to even value.

The first patch which makes the bug reproducible is kernel mainline commit:
"x86: kvmclock: allocate pvclock shared memory area".

Before the patch memory for KVM clocksources was allocated in percpu variables.
Every SMP cpu which starts execution got copy of bootstrab cpu pvclock
variable.
That's why pvclock structs were "initialized" before the patch and there was no
bug with uninitialized memory read.
After the patch memory is allocated using memblock_alloc. Nothing is written to
the allocated memory before actual call of MSR_KVM_WALL_CLOCK.

Comments

bugzilla-daemon@bugzilla.kernel.org June 11, 2013, 4:03 p.m. UTC | #1
https://bugzilla.kernel.org/show_bug.cgi?id=59521





--- Comment #1 from Eugene Batalov <eabatalov89@gmail.com>  2013-06-11 16:03:55 ---
I have reconstructed the uninitialized pvclock read backtrace.
References to file lines are for Ubuntu-raring kernel
git://kernel.ubuntu.com/ubuntu/ubuntu-raring.git
tag is Ubuntu-3.8.0-19.30

bp: 0xf3ccbe68 ip: 0xc103cfbd
arch/x86/include/asm/pvclock.h:78
arch/x86/kernel/pvclock.c:74
bp: 0xf3ccbe70 ip: 0xc103c057
arch/x86/kernel/kvmclock.c:91
bp: 0xf3ccbe78 ip: 0xc1017598
arch/x86/kernel/tsc.c:58
bp: 0xf3ccbea8 ip: 0xc107e98d
kernel/sched/clock.c:248
bp: 0xf3ccbeb8 ip: 0xc107ea35
kernel/sched/clock.c:342
bp: 0xf3ccbf08 ip: 0xc104ad85
kernel/printk.c:356
bp: 0xf3ccbf50 ip: 0xc104c4e1
kernel/printk.c:1607
bp: 0xf3ccbf70 ip: 0xc1609bb6
kernel/printk.c:1688
bp: 0xf3ccbf90 ip: 0xc1600a51
arch/x86/include/asm/bitops.h:321
arch/x86/kernel/cpu/common.c:1325
bp: 0xf3ccbfb4 ip: 0xc1604000
??
bp: 0x00000000


kernel/printk.c:356
calls local_clock()
calls sched_clock_cpu()
calls sched_clock()
calls paravirt_sched_clock()
calls indirectly kvm_clock_read()
unintialized pv_clock is read here

vcpu kvmclock initialization is performed in kvm_register_clock.
kvm_register_clock is called from
static void __init kvm_smp_prepare_boot_cpu(void)
called form
 ./init/main.c:524 as smp_prepare_boot_cpu

I'll think about proper fix soon. We probably should fix cpu initialization
stages order or disable usage of pvclock before it initialized.
bugzilla-daemon@bugzilla.kernel.org June 15, 2013, 5:17 p.m. UTC | #2
https://bugzilla.kernel.org/show_bug.cgi?id=59521





--- Comment #2 from Eugene Batalov <eabatalov89@gmail.com>  2013-06-15 17:17:18 ---
KVM pv_clock initialization for SMP CPU is called very early on SMP CPU boot
stage:
arch/x86/kernel/smpboot.c:
__cpuinit start_secondary(void *unused)
...
    cpu_init();
    x86_cpuinit.early_percpu_clock_init();

early_percpu_clock_init is set as virtual function in
./arch/x86/kernel/kvmclock.c:
#ifdef CONFIG_X86_LOCAL_APIC
        x86_cpuinit.early_percpu_clock_init =
                kvm_setup_secondary_clock;
#endif

The problem is in cpu_init() which is called earlier.
cpu_init() calls printk and possibly other stuff which can use timestamps.
printk calls local_clock() to obtain a timestamp of a log message. On KVM
guests call sequence usually ends up in kvm_clock_read but needed rdmsr is
executed only in x86_cpuinit.early_percpu_clock_init().

I consider two approaches to fix the problem:
1. Swap cpu_init(); and x86_cpuinit.early_percpu_clock_init();
+ Simple
- We will get excessive restrictions on operations which allowed to be
performed in early_percpu_clock_init() because percpu specific data is
initialized only in cpu_init().
2. Return 0ULL from kvm_clock_read untill it is initialized.
+ Simple too
- Additional if statement inside kvm_clock_read (not serious even for
performance paranoiacs)
- Returning 0ULL looks ok because it is the same thing which kernel bootstrap
CPU does on early boot stages. But I am not quite sure. Better to ask guys who
maintain the needed relevant subsystem.

I prefer the second way. It doesn't add complex restrictions to CPU bootup
code. I'll send a patch soon which fixes the problem in the second way.

I don't propagate such a logic to levels higher then KVM clocksource
(pv_time_ops level for example) because of the following code:
void __init kvmclock_init(void)
...
263         pv_time_ops.sched_clock = kvm_clock_read;
264         x86_platform.calibrate_tsc = kvm_get_tsc_khz;
265         x86_platform.get_wallclock = kvm_get_wallclock;
266         x86_platform.set_wallclock = kvm_set_wallclock;
267 #ifdef CONFIG_X86_LOCAL_APIC
268         x86_cpuinit.early_percpu_clock_init =
269                 kvm_setup_secondary_clock;
270 #endif
271         x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
272         x86_platform.restore_sched_clock_state =
kvm_restore_sched_clock_state;

To propagate the logic I need to make changes both in x86_platform and
pv_time_ops also I should make a similar fix for ia64 arch. It needs some
subsystems refactoring to make the changes clean. Dont' think that its worth to
fix the bug. Better to make a simple fix.
Paolo Bonzini June 17, 2013, 3:50 p.m. UTC | #3
Il 15/06/2013 19:17, bugzilla-daemon@bugzilla.kernel.org ha scritto:
> The problem is in cpu_init() which is called earlier.
> cpu_init() calls printk and possibly other stuff which can use timestamps.
> printk calls local_clock() to obtain a timestamp of a log message. On KVM
> guests call sequence usually ends up in kvm_clock_read but needed rdmsr is
> executed only in x86_cpuinit.early_percpu_clock_init().
> 
> I consider two approaches to fix the problem:
> 1. Swap cpu_init(); and x86_cpuinit.early_percpu_clock_init();
> + Simple
> - We will get excessive restrictions on operations which allowed to be
> performed in early_percpu_clock_init() because percpu specific data is
> initialized only in cpu_init().

Considering how simple kvm_register_clock is, I think this is
preferrable if it works.  Ironically, commit 7069ed6 (x86: kvmclock:
allocate pvclock shared memory area, 2012-11-27), which introduced the
regression, is what should make this simpler fix possible.

Paolo

> 2. Return 0ULL from kvm_clock_read untill it is initialized.
> + Simple too
> - Additional if statement inside kvm_clock_read (not serious even for
> performance paranoiacs)
> - Returning 0ULL looks ok because it is the same thing which kernel bootstrap
> CPU does on early boot stages. But I am not quite sure. Better to ask guys who
> maintain the needed relevant subsystem.
> 
> I prefer the second way. It doesn't add complex restrictions to CPU bootup
> code. I'll send a patch soon which fixes the problem in the second way.
> 
> I don't propagate such a logic to levels higher then KVM clocksource
> (pv_time_ops level for example) because of the following code:
> void __init kvmclock_init(void)
> ...
> 263         pv_time_ops.sched_clock = kvm_clock_read;
> 264         x86_platform.calibrate_tsc = kvm_get_tsc_khz;
> 265         x86_platform.get_wallclock = kvm_get_wallclock;
> 266         x86_platform.set_wallclock = kvm_set_wallclock;
> 267 #ifdef CONFIG_X86_LOCAL_APIC
> 268         x86_cpuinit.early_percpu_clock_init =
> 269                 kvm_setup_secondary_clock;
> 270 #endif
> 271         x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
> 272         x86_platform.restore_sched_clock_state =
> kvm_restore_sched_clock_state;
> 
> To propagate the logic I need to make changes both in x86_platform and
> pv_time_ops also I should make a similar fix for ia64 arch. It needs some
> subsystems refactoring to make the changes clean. Dont' think that its worth to
> fix the bug. Better to make a simple fix.
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
bugzilla-daemon@bugzilla.kernel.org June 17, 2013, 3:50 p.m. UTC | #4
https://bugzilla.kernel.org/show_bug.cgi?id=59521





--- Comment #3 from Anonymous Emailer <anonymous@kernel-bugs.osdl.org>  2013-06-17 15:50:27 ---
Reply-To: pbonzini@redhat.com

Il 15/06/2013 19:17, bugzilla-daemon@bugzilla.kernel.org ha scritto:
> The problem is in cpu_init() which is called earlier.
> cpu_init() calls printk and possibly other stuff which can use timestamps.
> printk calls local_clock() to obtain a timestamp of a log message. On KVM
> guests call sequence usually ends up in kvm_clock_read but needed rdmsr is
> executed only in x86_cpuinit.early_percpu_clock_init().
> 
> I consider two approaches to fix the problem:
> 1. Swap cpu_init(); and x86_cpuinit.early_percpu_clock_init();
> + Simple
> - We will get excessive restrictions on operations which allowed to be
> performed in early_percpu_clock_init() because percpu specific data is
> initialized only in cpu_init().

Considering how simple kvm_register_clock is, I think this is
preferrable if it works.  Ironically, commit 7069ed6 (x86: kvmclock:
allocate pvclock shared memory area, 2012-11-27), which introduced the
regression, is what should make this simpler fix possible.

Paolo

> 2. Return 0ULL from kvm_clock_read untill it is initialized.
> + Simple too
> - Additional if statement inside kvm_clock_read (not serious even for
> performance paranoiacs)
> - Returning 0ULL looks ok because it is the same thing which kernel bootstrap
> CPU does on early boot stages. But I am not quite sure. Better to ask guys who
> maintain the needed relevant subsystem.
> 
> I prefer the second way. It doesn't add complex restrictions to CPU bootup
> code. I'll send a patch soon which fixes the problem in the second way.
> 
> I don't propagate such a logic to levels higher then KVM clocksource
> (pv_time_ops level for example) because of the following code:
> void __init kvmclock_init(void)
> ...
> 263         pv_time_ops.sched_clock = kvm_clock_read;
> 264         x86_platform.calibrate_tsc = kvm_get_tsc_khz;
> 265         x86_platform.get_wallclock = kvm_get_wallclock;
> 266         x86_platform.set_wallclock = kvm_set_wallclock;
> 267 #ifdef CONFIG_X86_LOCAL_APIC
> 268         x86_cpuinit.early_percpu_clock_init =
> 269                 kvm_setup_secondary_clock;
> 270 #endif
> 271         x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
> 272         x86_platform.restore_sched_clock_state =
> kvm_restore_sched_clock_state;
> 
> To propagate the logic I need to make changes both in x86_platform and
> pv_time_ops also I should make a similar fix for ia64 arch. It needs some
> subsystems refactoring to make the changes clean. Dont' think that its worth to
> fix the bug. Better to make a simple fix.
>
bugzilla-daemon@bugzilla.kernel.org June 17, 2013, 9:29 p.m. UTC | #5
https://bugzilla.kernel.org/show_bug.cgi?id=59521





--- Comment #4 from Eugene Batalov <eabatalov89@gmail.com>  2013-06-17 21:29:08 ---
(In reply to comment #3)
> Reply-To: pbonzini@redhat.com
> 
> Il 15/06/2013 19:17, bugzilla-daemon@bugzilla.kernel.org ha scritto:
> > The problem is in cpu_init() which is called earlier.
> > cpu_init() calls printk and possibly other stuff which can use timestamps.
> > printk calls local_clock() to obtain a timestamp of a log message. On KVM
> > guests call sequence usually ends up in kvm_clock_read but needed rdmsr is
> > executed only in x86_cpuinit.early_percpu_clock_init().
> > 
> > I consider two approaches to fix the problem:
> > 1. Swap cpu_init(); and x86_cpuinit.early_percpu_clock_init();
> > + Simple
> > - We will get excessive restrictions on operations which allowed to be
> > performed in early_percpu_clock_init() because percpu specific data is
> > initialized only in cpu_init().
> 
> Considering how simple kvm_register_clock is, I think this is
> preferrable if it works.  Ironically, commit 7069ed6 (x86: kvmclock:
> allocate pvclock shared memory area, 2012-11-27), which introduced the
> regression, is what should make this simpler fix possible.
> 
> Paolo

Understood your point. I'll test this fix and report the results.
Igor Mammedov June 21, 2013, 9:01 a.m. UTC | #6
Fixes regression introduced by
    7069ed6 x86: kvmclock: allocate pvclock shared memory area
and reported by Eugene:
    https://bugzilla.kernel.org/show_bug.cgi?id=59521

v2:
 - "[1/2] x86: kvmclock: zero initialize pvclock shared memory area"
      added to commit messaged another reason why it's clearing
      is necessary.
 - added "[2/2] x86: kvmclock: register per-cpu kvmclock at earliest possible time"
      to register clock earlier.

Igor Mammedov (2):
  x86: kvmclock: zero initialize pvclock shared memory area
  x86: kvmclock: register per-cpu kvmclock at earliest possible time

 arch/x86/kernel/kvmclock.c |   12 ++++++++++--
 arch/x86/kernel/smpboot.c  |    2 +-
 arch/x86/mm/pageattr.c     |    4 ++--
 3 files changed, 13 insertions(+), 5 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 5bedbdd..44aaf25 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -219,6 +222,7 @@  void __init kvmclock_init(void)
 {
        unsigned long mem;
        int size;
+       int i;

        size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);

@@ -238,6 +242,9 @@  void __init kvmclock_init(void)
        if (!mem)
                return;
        hv_clock = __va(mem);
+       for (i = 0; i < NR_CPUS; ++i) {
+               hv_clock[i].pvti.version = 0xDEADBEAF;
+       }

        if (kvm_register_clock("boot clock")) {
                hv_clock = NULL;