Message ID | 20140211194553.GZ9987@twins.programming.kicks-ass.net (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 11.02.2014 20:45, Peter Zijlstra wrote: > On Tue, Feb 11, 2014 at 07:34:51PM +0100, Stefan Bader wrote: >> Hi Peter, >> >> I am currently looking at a weird issue that manifest itself when trying to run >> kvm enabled qemu on a i386 host (v3.13 kernel, oh and potentially important the >> cpu is 64bit capable, so qemu-system-x86_64 is called). > > AMD or Intel machine? Personally I am reproducing this on an AMD: cpu family : 15 model : 72 model name : AMD Turion(tm) 64 X2 Mobile Technology TL-64 but the reports I got are showing the same issue on an Intel i7 cpu family : 6 model : 37 model name : Intel(R) Core(TM) i7 CPU M 620 @ 2.67GH Another reporter also saw this on an AMD and said it could not be reproduced on the same hardware and the same software versions when using 64bit instead of 32. In my case on a 32bit installation I will see this on every invocation of qemu-system-x86_64 -cdrom <32bit live iso> -boot d -m 512 > >> Sooner or later this >> causes softlockup messages on the host. I tracked this down to __vcpu_run in >> arch/x86/kvm/x86.c which does a loop which in that case never seems to make >> progress or exit. >> >> What I found is that vcpu_enter_guest will exit quickly without causing the loop >> to exit when need_resched() is true. Looking at a crash dump I took, this was >> the case (thread_info->flags had TIF_NEED_RESCHED set). So after immediately >> returning __vcpu_run has the following code: >> >> if (need_resched()) { >> srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx); >> kvm_resched(vcpu); // now cond_resched(); >> vcpu->srcu_idx = srcu_read_lock(&kvm->srcu); >> } >> >> The kvm_resched basically would end up doing a cond_resched() which now checks >> preempt_count() to be 0. If that is zero it will do the reschedule, otherwise it >> just does nothing. Looking at the percpu variables in the dump, I saw that >> the preempt_count was 0x8000000 (actually it was 0x80110000 but that was me >> triggering the kexec crashdump with sysrq-c). >> >> I saw that there have been some changes in the upstream kernel and have picked >> the following: >> 1) x86, acpi, idle: Restructure the mwait idle routines >> 2) x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers >> 3) sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding >> 4) sched/preempt/x86: Fix voluntary preempt for x86 >> >> Patch 1) and 2) as dependencies of 3) (to get the mwait function correct and to >> the other file). Finally 4) is fixing up 3). [maybe worth suggesting to do for >> 3.13.y stable]. >> >> Still, with all those I got the softlockup. Since I knew from the dump info that >> something is wrong with the folding, I made the pragmatic approach and added the >> following: >> >> if (need_resched()) { >> srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx); >> + preempt_fold_need_resched(); >> kvm_resched(vcpu); // now cond_resched(); >> vcpu->srcu_idx = srcu_read_lock(&kvm->srcu); >> } >> >> And this lets the kvm guest run without the softlockups! However I am less than >> convinced that this is the right thing to do. Somehow something done when >> converting the preempt_count into percpu has caused at least the i386 side to >> get into this mess (as there has not been any whining about 64bit). Just fail to >> see what. > > I've been looking at the same thing too; I've got a trace from someone > who can reproduce and its just not making sense. > > Looks like the hardware is loosing an interrupt (or worse). > > With the below patch on top of current -git (might be whitespace damaged > due to copy-paste). > > --- > arch/x86/kvm/x86.c | 6 +++++- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 39c28f09dfd5..f8b2fc7ce491 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -6115,7 +6115,11 @@ static int __vcpu_run(struct kvm_vcpu *vcpu) > } > if (need_resched()) { > srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx); > - cond_resched(); > + if (need_resched() != should_resched()) { > + tracing_stop(); > + printk(KERN_ERR "Stopped tracing, due to inconsistent state.\n"); > + } > + schedule(); > vcpu->srcu_idx = srcu_read_lock(&kvm->srcu); > } > } > > --- > > I got the following from an Athlon X2 (fam 0xf): > > On Mon, Feb 10, 2014 at 05:23:15PM -0200, Marcelo Tosatti wrote: >> What is the problem exactly? > > From the trace: > > qemu-system-x86-2455 [001] d.s3 393.630586: smp_reschedule_interrupt <-reschedule_interrupt > qemu-system-x86-2455 [001] d.s3 393.630586: scheduler_ipi <-smp_reschedule_interrupt > qemu-system-x86-2455 [001] ..s3 393.630586: preempt_count_sub <-_raw_spin_unlock_irqrestore > qemu-system-x86-2455 [001] ..s2 393.630586: _raw_spin_lock_irq <-run_timer_softirq > qemu-system-x86-2455 [001] d.s2 393.630586: preempt_count_add <-_raw_spin_lock_irq > qemu-system-x86-2455 [001] d.s3 393.630586: _raw_spin_unlock_irq <-run_timer_softirq > qemu-system-x86-2455 [001] ..s3 393.630586: preempt_count_sub <-_raw_spin_unlock_irq > qemu-system-x86-2455 [001] ..s2 393.630586: rcu_bh_qs <-__do_softirq > qemu-system-x86-2455 [001] d.s2 393.630586: irqtime_account_irq <-__do_softirq > qemu-system-x86-2455 [001] d.s2 393.630586: __local_bh_enable <-__do_softirq > qemu-system-x86-2455 [001] d.s2 393.630586: preempt_count_sub <-__local_bh_enable > > > dmesg-2458 [000] d.s5 393.630614: resched_task <-check_preempt_wakeup > dmesg-2458 [000] d.s5 393.630614: native_smp_send_reschedule <-resched_task > dmesg-2458 [000] d.s5 393.630614: default_send_IPI_mask_logical <-native_smp_send_reschedule > > > qemu-system-x86-2455 [001] .n.. 393.630636: preempt_count_add <-kmap_atomic_prot > qemu-system-x86-2455 [001] .n.1 393.630639: __kunmap_atomic <-clear_huge_page > qemu-system-x86-2455 [001] .n.1 393.630639: preempt_count_sub <-__kunmap_atomic > qemu-system-x86-2455 [001] .n.. 393.630639: _cond_resched <-clear_huge_page > qemu-system-x86-2455 [001] .n.. 393.630640: kmap_atomic <-clear_huge_page > qemu-system-x86-2455 [001] .n.. 393.630640: kmap_atomic_prot <-kmap_atomic > qemu-system-x86-2455 [001] .n.. 393.630640: preempt_count_add <-kmap_atomic_prot > > > The resched_task() in the middle does: > > set_tsk_need_resched(); > smp_mb(); > smp_send_reschedule(); > > This means that the remote cpu must observe TIF_NEED_RESCHED (.n.. in > the traces) when the IPI (smp_reschedule_interrupt) hits. > > Now given the machine has unsynchronized TSC the order in the above is > not necessarily the true order, but there's only two possible scenarios: > > 1) its the right order; the smp_reschedule_interrupt() is from a > previous resched_task()/rcu kick or any of the other users that sends > it. > > In which case we should get another smp_reschedule_interrupt() _after_ > the resched_task() which should observe the 'n' and promote it to an > 'N'. > > This doesn't happen and we're stuck with the 'n' state.. > > 2) its the wrong order and the smp_reschedule_interrupt() actually > happened after the resched_task(), but in that case we should have > observed the 'n', we did not. > > Now given that 2) would mean a bug in the coherency fabric and 1) would > mean a lost interrupt, 1 seems like the more likely suspect. > > Joerg wanted the extra kvm tracepoints to see what the vm-exit reason > is. > Let me think about this after I had another cup of coffee and my brain decides to follow my body in the process of waking up... -Stefan
On Wed, Feb 12, 2014 at 09:20:24AM +0100, Stefan Bader wrote: > On 11.02.2014 20:45, Peter Zijlstra wrote: > > On Tue, Feb 11, 2014 at 07:34:51PM +0100, Stefan Bader wrote: > >> Hi Peter, > >> > >> I am currently looking at a weird issue that manifest itself when trying to run > >> kvm enabled qemu on a i386 host (v3.13 kernel, oh and potentially important the > >> cpu is 64bit capable, so qemu-system-x86_64 is called). > > > > AMD or Intel machine? > > Personally I am reproducing this on an AMD: > > cpu family : 15 > model : 72 > model name : AMD Turion(tm) 64 X2 Mobile Technology TL-64 > > but the reports I got are showing the same issue on an Intel i7 > > cpu family : 6 > model : 37 > model name : Intel(R) Core(TM) i7 CPU M 620 @ 2.67GH > > Another reporter also saw this on an AMD and said it could not be reproduced on > the same hardware and the same software versions when using 64bit instead of 32. > > In my case on a 32bit installation I will see this on every invocation of Curious.. so its both AMD and Intel. That makes funny hardware less likely... but how can the vmexit loose an interrupt like this. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Feb 12, 2014 at 11:37:13AM +0100, Peter Zijlstra wrote: > > Another reporter also saw this on an AMD and said it could not be reproduced on > > the same hardware and the same software versions when using 64bit instead of 32. > > > > In my case on a 32bit installation I will see this on every invocation of > > Curious.. so its both AMD and Intel. > > That makes funny hardware less likely... but how can the vmexit loose an > interrupt like this. Also what I'm wondering about and what's not clear from Stefan's reply is whether this is purely a 32-bit issue, i.e. a 32-bit host running a 64-bit qemu running a 32-bit iso or what is it? Or do we have reports for both 32-bit and 64-bit hosts? Thanks.
On 12.02.2014 11:40, Borislav Petkov wrote: > On Wed, Feb 12, 2014 at 11:37:13AM +0100, Peter Zijlstra wrote: >>> Another reporter also saw this on an AMD and said it could not be reproduced on >>> the same hardware and the same software versions when using 64bit instead of 32. >>> >>> In my case on a 32bit installation I will see this on every invocation of >> >> Curious.. so its both AMD and Intel. >> >> That makes funny hardware less likely... but how can the vmexit loose an >> interrupt like this. > > Also what I'm wondering about and what's not clear from Stefan's reply > is whether this is purely a 32-bit issue, i.e. a 32-bit host running a > 64-bit qemu running a 32-bit iso or what is it? > > Or do we have reports for both 32-bit and 64-bit hosts? Only for 32bit hosts. I have not tested a 64bit kernel plus 32bit userspace, though. But 64bit kernel and 64bit user-space seemed ok. And one detail I should add is that while the reproduction is possible in all attempts, the timing can vary. In some of the cases this was even before running a Linux kernel in the guest. Something else here I run a kernel with CONFIG_PREEMPT not set and NR_CPUS limited to 8 (for the 32bit kernel). So the default apic driver is used. Since default_send_IPI_mask_logical is only used from there, I assume the trace you got does the same. Maybe something there is wrong which would explain why we only see it on 32bit hosts. > > Thanks. >
On Wed, Feb 12, 2014 at 11:40:17AM +0100, Borislav Petkov wrote: > Also what I'm wondering about and what's not clear from Stefan's reply > is whether this is purely a 32-bit issue, i.e. a 32-bit host running a > 64-bit qemu running a 32-bit iso or what is it? > > Or do we have reports for both 32-bit and 64-bit hosts? As I understand it, the problem reproduces only on a 32bit host, which also limits the guests to 32bit, at least when ran with KVM. This is true even if a x86-64 qemu is used. This is also independent on whether the host-cpu supports 64bit or not. Joerg -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Feb 12, 2014 at 12:09:29PM +0100, Stefan Bader wrote: > Something else here I run a kernel with CONFIG_PREEMPT not set and NR_CPUS > limited to 8 (for the 32bit kernel). So the default apic driver is used. Since > default_send_IPI_mask_logical is only used from there, I assume the trace you > got does the same. Maybe something there is wrong which would explain why we > only see it on 32bit hosts. Can you try with a different APIC driver to test this? -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12.02.2014 12:54, Peter Zijlstra wrote: > On Wed, Feb 12, 2014 at 12:09:29PM +0100, Stefan Bader wrote: >> Something else here I run a kernel with CONFIG_PREEMPT not set and NR_CPUS >> limited to 8 (for the 32bit kernel). So the default apic driver is used. Since >> default_send_IPI_mask_logical is only used from there, I assume the trace you >> got does the same. Maybe something there is wrong which would explain why we >> only see it on 32bit hosts. > > Can you try with a different APIC driver to test this? > I don't think I can. And I think the statement about this only be used for 32bit could be wrong. I got mislead to think so because those are only defined in probe_32 but the 64bit counterpart isn't containing much aside that. Anyway, I played around with tracing a bit more. So with this change: if (need_resched()) { srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx); if (need_resched() != should_resched()) { + trace_printk("need(%i) != should(%i)\n", + need_resched(), should_resched()); + trace_printk("exit_reason=%u\n", + vcpu->run->exit_reason); + trace_printk("preempt_count=%lx\n", + __this_cpu_read_4(__preempt_count)); + tracing_stop(); + printk(KERN_ERR "Stopped tracing, due to inconsistent state.\n"); } + schedule(); - cond_reschedule(); vcpu->srcu_idx = srcu_read_lock(&kvm->srcu); } I get the following (weird) output: Xorg-1078 [001] d... 71.270251: native_smp_send_reschedule <-resched_task Xorg-1078 [001] d... 71.270251: default_send_IPI_mask_logical <-native_smp_send_reschedule bamfdaemon-2318 [001] d... 71.270465: resched_task <-check_preempt_wakeup bamfdaemon-2318 [001] d... 71.270539: resched_task <-check_preempt_wakeup compiz-2365 [001] d... 71.270689: resched_task <-check_preempt_wakeup compiz-2365 [001] d... 71.270827: resched_task <-check_preempt_wakeup compiz-2365 [001] d... 71.270940: resched_task <-check_preempt_wakeup qemu-system-x86-2679 [000] dn.. 71.270999: smp_reschedule_interrupt <-reschedule_interrupt qemu-system-x86-2679 [000] dn.. 71.270999: scheduler_ipi <-smp_reschedule_interrupt qemu-system-x86-2679 [000] .N.. 71.271001: kvm_arch_vcpu_ioctl_run: need(1) != should(0) qemu-system-x86-2679 [000] .N.. 71.271002: kvm_arch_vcpu_ioctl_run: exit_reason=2 qemu-system-x86-2679 [000] .N.. 71.271003: kvm_arch_vcpu_ioctl_run: preempt_count=0 So am I reading this right, that the interrupt did get delivered to cpu#0 while the thread info already had the resched flag set. So this really should have cleared the bit in preempt_count. But while the trace info shows 'N' for some reason should_reschedule returns false but at the same time reading the preempt count manually shows it 0?
On Thu, Feb 13, 2014 at 06:00:19PM +0100, Stefan Bader wrote: > On 12.02.2014 12:54, Peter Zijlstra wrote: > > On Wed, Feb 12, 2014 at 12:09:29PM +0100, Stefan Bader wrote: > >> Something else here I run a kernel with CONFIG_PREEMPT not set and NR_CPUS > >> limited to 8 (for the 32bit kernel). So the default apic driver is used. Since > >> default_send_IPI_mask_logical is only used from there, I assume the trace you > >> got does the same. Maybe something there is wrong which would explain why we > >> only see it on 32bit hosts. > > > > Can you try with a different APIC driver to test this? > > > I don't think I can. And I think the statement about this only be used for 32bit > could be wrong. I got mislead to think so because those are only defined in > probe_32 but the 64bit counterpart isn't containing much aside that. > > Anyway, I played around with tracing a bit more. So with this change: > > if (need_resched()) { > srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx); > if (need_resched() != should_resched()) { > + trace_printk("need(%i) != should(%i)\n", > + need_resched(), should_resched()); > + trace_printk("exit_reason=%u\n", > + vcpu->run->exit_reason); > + trace_printk("preempt_count=%lx\n", > + __this_cpu_read_4(__preempt_count)); > + tracing_stop(); > + printk(KERN_ERR "Stopped tracing, due to > inconsistent state.\n"); > } > + schedule(); > - cond_reschedule(); > vcpu->srcu_idx = srcu_read_lock(&kvm->srcu); > } > > I get the following (weird) output: > > Xorg-1078 [001] d... 71.270251: native_smp_send_reschedule > <-resched_task > Xorg-1078 [001] d... 71.270251: default_send_IPI_mask_logical > <-native_smp_send_reschedule > bamfdaemon-2318 [001] d... 71.270465: resched_task <-check_preempt_wakeup > bamfdaemon-2318 [001] d... 71.270539: resched_task <-check_preempt_wakeup > compiz-2365 [001] d... 71.270689: resched_task <-check_preempt_wakeup > compiz-2365 [001] d... 71.270827: resched_task <-check_preempt_wakeup > compiz-2365 [001] d... 71.270940: resched_task <-check_preempt_wakeup > qemu-system-x86-2679 [000] dn.. 71.270999: smp_reschedule_interrupt > <-reschedule_interrupt > qemu-system-x86-2679 [000] dn.. 71.270999: scheduler_ipi > <-smp_reschedule_interrupt > qemu-system-x86-2679 [000] .N.. 71.271001: kvm_arch_vcpu_ioctl_run: need(1) > != should(0) > qemu-system-x86-2679 [000] .N.. 71.271002: kvm_arch_vcpu_ioctl_run: > exit_reason=2 > qemu-system-x86-2679 [000] .N.. 71.271003: kvm_arch_vcpu_ioctl_run: > preempt_count=0 > > So am I reading this right, that the interrupt did get delivered to cpu#0 while > the thread info already had the resched flag set. So this really should have > cleared the bit in preempt_count. But while the trace info shows 'N' for some > reason should_reschedule returns false but at the same time reading the preempt > count manually shows it 0? *blink*... That's weird indeed... do you have the asm that goes along with that? -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Feb 13, 2014 at 06:00:19PM +0100, Stefan Bader wrote: > On 12.02.2014 12:54, Peter Zijlstra wrote: > > On Wed, Feb 12, 2014 at 12:09:29PM +0100, Stefan Bader wrote: > >> Something else here I run a kernel with CONFIG_PREEMPT not set and NR_CPUS > >> limited to 8 (for the 32bit kernel). So the default apic driver is used. Since > >> default_send_IPI_mask_logical is only used from there, I assume the trace you > >> got does the same. Maybe something there is wrong which would explain why we > >> only see it on 32bit hosts. > > > > Can you try with a different APIC driver to test this? > > > I don't think I can. And I think the statement about this only be used for 32bit > could be wrong. I got mislead to think so because those are only defined in > probe_32 but the 64bit counterpart isn't containing much aside that. > > Anyway, I played around with tracing a bit more. So with this change: > > if (need_resched()) { > srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx); > if (need_resched() != should_resched()) { > + trace_printk("need(%i) != should(%i)\n", > + need_resched(), should_resched()); > + trace_printk("exit_reason=%u\n", > + vcpu->run->exit_reason); > + trace_printk("preempt_count=%lx\n", > + __this_cpu_read_4(__preempt_count)); > + tracing_stop(); > + printk(KERN_ERR "Stopped tracing, due to > inconsistent state.\n"); > } > + schedule(); > - cond_reschedule(); > vcpu->srcu_idx = srcu_read_lock(&kvm->srcu); > } > > I get the following (weird) output: > > Xorg-1078 [001] d... 71.270251: native_smp_send_reschedule > <-resched_task > Xorg-1078 [001] d... 71.270251: default_send_IPI_mask_logical > <-native_smp_send_reschedule > bamfdaemon-2318 [001] d... 71.270465: resched_task <-check_preempt_wakeup > bamfdaemon-2318 [001] d... 71.270539: resched_task <-check_preempt_wakeup > compiz-2365 [001] d... 71.270689: resched_task <-check_preempt_wakeup > compiz-2365 [001] d... 71.270827: resched_task <-check_preempt_wakeup > compiz-2365 [001] d... 71.270940: resched_task <-check_preempt_wakeup > qemu-system-x86-2679 [000] dn.. 71.270999: smp_reschedule_interrupt > <-reschedule_interrupt > qemu-system-x86-2679 [000] dn.. 71.270999: scheduler_ipi > <-smp_reschedule_interrupt > qemu-system-x86-2679 [000] .N.. 71.271001: kvm_arch_vcpu_ioctl_run: need(1) > != should(0) > qemu-system-x86-2679 [000] .N.. 71.271002: kvm_arch_vcpu_ioctl_run: > exit_reason=2 > qemu-system-x86-2679 [000] .N.. 71.271003: kvm_arch_vcpu_ioctl_run: > preempt_count=0 > > So am I reading this right, that the interrupt did get delivered to cpu#0 while > the thread info already had the resched flag set. So this really should have > cleared the bit in preempt_count. But while the trace info shows 'N' for some > reason should_reschedule returns false but at the same time reading the preempt > count manually shows it 0? So the assembly merges the first and second should_resched(), so its possible that load got before the interrupt(). The 3rd preempt_count load gets re-issued and so that would show the 'true' value again. If you want to force a reload after the condition; put in a barrier(). In any case; this looks like a false-positive. Please try again until you get one where the interrupt doesn't happen and we stay in 'n' state. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Feb 13, 2014 at 07:03:56PM +0100, Stefan Bader wrote:
> Yeah... not sure the interleaved source helps or not ...
It did, thanks!
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 13.02.2014 19:25, Peter Zijlstra wrote: > On Thu, Feb 13, 2014 at 06:00:19PM +0100, Stefan Bader wrote: >> On 12.02.2014 12:54, Peter Zijlstra wrote: >>> On Wed, Feb 12, 2014 at 12:09:29PM +0100, Stefan Bader wrote: >>>> Something else here I run a kernel with CONFIG_PREEMPT not set and NR_CPUS >>>> limited to 8 (for the 32bit kernel). So the default apic driver is used. Since >>>> default_send_IPI_mask_logical is only used from there, I assume the trace you >>>> got does the same. Maybe something there is wrong which would explain why we >>>> only see it on 32bit hosts. >>> >>> Can you try with a different APIC driver to test this? >>> >> I don't think I can. And I think the statement about this only be used for 32bit >> could be wrong. I got mislead to think so because those are only defined in >> probe_32 but the 64bit counterpart isn't containing much aside that. >> >> Anyway, I played around with tracing a bit more. So with this change: >> >> if (need_resched()) { >> srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx); >> if (need_resched() != should_resched()) { >> + trace_printk("need(%i) != should(%i)\n", >> + need_resched(), should_resched()); >> + trace_printk("exit_reason=%u\n", >> + vcpu->run->exit_reason); >> + trace_printk("preempt_count=%lx\n", >> + __this_cpu_read_4(__preempt_count)); >> + tracing_stop(); >> + printk(KERN_ERR "Stopped tracing, due to >> inconsistent state.\n"); >> } >> + schedule(); >> - cond_reschedule(); >> vcpu->srcu_idx = srcu_read_lock(&kvm->srcu); >> } >> >> I get the following (weird) output: >> >> Xorg-1078 [001] d... 71.270251: native_smp_send_reschedule >> <-resched_task >> Xorg-1078 [001] d... 71.270251: default_send_IPI_mask_logical >> <-native_smp_send_reschedule >> bamfdaemon-2318 [001] d... 71.270465: resched_task <-check_preempt_wakeup >> bamfdaemon-2318 [001] d... 71.270539: resched_task <-check_preempt_wakeup >> compiz-2365 [001] d... 71.270689: resched_task <-check_preempt_wakeup >> compiz-2365 [001] d... 71.270827: resched_task <-check_preempt_wakeup >> compiz-2365 [001] d... 71.270940: resched_task <-check_preempt_wakeup >> qemu-system-x86-2679 [000] dn.. 71.270999: smp_reschedule_interrupt >> <-reschedule_interrupt >> qemu-system-x86-2679 [000] dn.. 71.270999: scheduler_ipi >> <-smp_reschedule_interrupt >> qemu-system-x86-2679 [000] .N.. 71.271001: kvm_arch_vcpu_ioctl_run: need(1) >> != should(0) >> qemu-system-x86-2679 [000] .N.. 71.271002: kvm_arch_vcpu_ioctl_run: >> exit_reason=2 >> qemu-system-x86-2679 [000] .N.. 71.271003: kvm_arch_vcpu_ioctl_run: >> preempt_count=0 >> >> So am I reading this right, that the interrupt did get delivered to cpu#0 while >> the thread info already had the resched flag set. So this really should have >> cleared the bit in preempt_count. But while the trace info shows 'N' for some >> reason should_reschedule returns false but at the same time reading the preempt >> count manually shows it 0? > > So the assembly merges the first and second should_resched(), so its > possible that load got before the interrupt(). > > The 3rd preempt_count load gets re-issued and so that would show the > 'true' value again. > > If you want to force a reload after the condition; put in a barrier(). > > In any case; this looks like a false-positive. Please try again until > you get one where the interrupt doesn't happen and we stay in 'n' state. > Ok, I think I now got a log of the actual issue. It seems cpu#1 missed out on handling a reschedule interrupt but did send one to cpu#0 and on cpu#0 while handling the interrupt the tif flag was not set (yet?) but then when it is, there is no more interrupt to do the folding... (sorry for the wrapping)... qemu-system-x86-2503 [000] d.h. 1145.887872: default_send_IPI_mask_logical <-native_smp_send_reschedule qemu-system-x86-2503 [000] d... 1145.887888: resched_task <-check_preempt_curr <idle>-0 [000] d.h. 1145.888001: resched_task <-check_preempt_curr <idle>-0 [001] dn.. 1145.888788: smp_reschedule_interrupt <-reschedule_interrupt <idle>-0 [001] dn.. 1145.888789: scheduler_ipi <-smp_reschedule_interrupt gnome-terminal-2440 [000] d... 1145.888957: resched_task <-check_preempt_curr gnome-terminal-2440 [000] d... 1145.888959: native_smp_send_reschedule <-resched_task gnome-terminal-2440 [000] d... 1145.888959: default_send_IPI_mask_logical <-native_smp_send_reschedule <idle>-0 [001] dn.. 1145.889296: smp_reschedule_interrupt <-reschedule_interrupt <idle>-0 [001] dn.. 1145.889297: scheduler_ipi <-smp_reschedule_interrupt gnome-terminal-2440 [000] d... 1145.889465: resched_task <-check_preempt_curr gnome-terminal-2440 [000] d... 1145.889467: native_smp_send_reschedule <-resched_task gnome-terminal-2440 [000] d... 1145.889467: default_send_IPI_mask_logical <-native_smp_send_reschedule Xorg-1054 [001] d.h. 1145.889469: resched_task <-check_preempt_wakeup Xorg-1054 [001] d.h. 1145.889469: native_smp_send_reschedule <-resched_task Xorg-1054 [001] d.h. 1145.889469: default_send_IPI_mask_logical <-native_smp_send_reschedule qemu-system-x86-2503 [000] d.s. 1145.889643: smp_reschedule_interrupt <-reschedule_interrupt qemu-system-x86-2503 [000] d.s. 1145.889643: scheduler_ipi <-smp_reschedule_interrupt qemu-system-x86-2503 [000] .n.. 1145.889647: kvm_arch_vcpu_ioctl_run: sched inconsistency qemu-system-x86-2503 [000] .n.. 1145.889662: kvm_arch_vcpu_ioctl_run: sched inconsistency qemu-system-x86-2503 [000] .n.. 1145.889664: kvm_arch_vcpu_ioctl_run: sched inconsistency qemu-system-x86-2503 [000] .n.. 1145.889668: kvm_arch_vcpu_ioctl_run: need(1) != should(0) qemu-system-x86-2503 [000] .n.. 1145.889669: kvm_arch_vcpu_ioctl_run: exit_reason=2 qemu-system-x86-2503 [000] .n.. 1145.889669: kvm_arch_vcpu_ioctl_run: preempt_count=80000000
On Fri, Feb 14, 2014 at 11:55:25AM +0100, Stefan Bader wrote: > Ok, I think I now got a log of the actual issue. It seems cpu#1 missed out on > handling a reschedule interrupt but did send one to cpu#0 and on cpu#0 while > handling the interrupt the tif flag was not set (yet?) but then when it is, > there is no more interrupt to do the folding... (sorry for the wrapping)... > > qemu-system-x86-2503 [000] d.h. 1145.887872: default_send_IPI_mask_logical <-native_smp_send_reschedule > qemu-system-x86-2503 [000] d... 1145.887888: resched_task <-check_preempt_curr > <idle>-0 [000] d.h. 1145.888001: resched_task <-check_preempt_curr > <idle>-0 [001] dn.. 1145.888788: smp_reschedule_interrupt <-reschedule_interrupt > <idle>-0 [001] dn.. 1145.888789: scheduler_ipi <-smp_reschedule_interrupt > gnome-terminal-2440 [000] d... 1145.888957: resched_task <-check_preempt_curr > gnome-terminal-2440 [000] d... 1145.888959: native_smp_send_reschedule <-resched_task > gnome-terminal-2440 [000] d... 1145.888959: default_send_IPI_mask_logical <-native_smp_send_reschedule > <idle>-0 [001] dn.. 1145.889296: smp_reschedule_interrupt <-reschedule_interrupt > <idle>-0 [001] dn.. 1145.889297: scheduler_ipi <-smp_reschedule_interrupt > gnome-terminal-2440 [000] d... 1145.889465: resched_task <-check_preempt_curr > gnome-terminal-2440 [000] d... 1145.889467: native_smp_send_reschedule <-resched_task > gnome-terminal-2440 [000] d... 1145.889467: default_send_IPI_mask_logical <-native_smp_send_reschedule > Xorg-1054 [001] d.h. 1145.889469: resched_task <-check_preempt_wakeup > Xorg-1054 [001] d.h. 1145.889469: native_smp_send_reschedule <-resched_task > Xorg-1054 [001] d.h. 1145.889469: default_send_IPI_mask_logical <-native_smp_send_reschedule > qemu-system-x86-2503 [000] d.s. 1145.889643: smp_reschedule_interrupt <-reschedule_interrupt > qemu-system-x86-2503 [000] d.s. 1145.889643: scheduler_ipi <-smp_reschedule_interrupt > qemu-system-x86-2503 [000] .n.. 1145.889647: kvm_arch_vcpu_ioctl_run: sched inconsistency > qemu-system-x86-2503 [000] .n.. 1145.889662: kvm_arch_vcpu_ioctl_run: sched inconsistency > qemu-system-x86-2503 [000] .n.. 1145.889664: kvm_arch_vcpu_ioctl_run: sched inconsistency > qemu-system-x86-2503 [000] .n.. 1145.889668: kvm_arch_vcpu_ioctl_run: need(1) != should(0) > qemu-system-x86-2503 [000] .n.. 1145.889669: kvm_arch_vcpu_ioctl_run: exit_reason=2 > qemu-system-x86-2503 [000] .n.. 1145.889669: kvm_arch_vcpu_ioctl_run: preempt_count=80000000 > So afaiu you're running this on an AMD fam 0xf which doesn't have synchronized TSC; so again the timing might be off.. however: The first block from cpu 0, sends an IPI to cpu 1, cpu 1 sees 'n'. Then again, cpu 0 sends to cpu 1, we observe 'n'. So far so good and timing seems to reflect causality. Then however, cpu 0 sends, but nothing in this trace answers on cpu 1.. odd. Then cpu 1 sends to cpu 0, cpu 0 does receive the IPI but does _not_ observe 'n', which is mighty odd. However after the IPI is handled cpu 0 magically gets the 'n'. So one thing I did wonder about; if cpu 0 sends an IPI while cpu 1 is already processing the same interrupt; cpu 1 should latch the interrupt and it would re-trigger with when we re-enable interrupts after we're done with the first. The only thing that would ruin that is late-ack, eg. ack (clear the latch) when exiting the interrupt; but I checked and we ack before we call scheduler_ipi(). Now; if this were to happen (but I don't exactly see how on a 2 CPU system, nor do I see this in the trace), there might be a hole... CPU 0 | CPU 1 --------------------------------+-------------------------------- | /* ! resched_task */ | smp_send_reschedule(1) |L |L smp_reschedule_interrupt() resched_task() |L set_tsk_need_resched() |L smp_mb() |L smp_send_reschedule(1) |L |L ack_APIC_irq() | scheduler_ipi() | need_resched(); Now strictly speaking that need_resched() load could creep up before the store and we'd fail to observe 'n'. However, x86 forbids reads from passing an earlier write to the same location (it allows passing earlier writes in general). At which point the only other option is GCC emitting that load far earlier, in fact before ack_APIC_irq(), but that would be a violation of the C std because scheduler_ipi() is a (real) function call and acts as a compiler barrier AFAIK (as opposed to inlining which doesn't act as a compiler barrier). So colour me confused.. You could try putting an smp_mb() at the very start of scheduler_ipi() just to make absolutely sure, but it _should_ not matter. Of course, I could be completely off base with how interrupts work; and there's a wide gaping hole in there, in which case, someone please educate me ;-) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Feb 13, 2014 at 07:26:05PM +0100, Peter Zijlstra wrote: > On Thu, Feb 13, 2014 at 07:03:56PM +0100, Stefan Bader wrote: > > Yeah... not sure the interleaved source helps or not ... > > It did, thanks! Stefan, can you also send sched/core.s? I'm particularly interested in what resched_task() has been turned into by your build env. Thanks.
On Fri, Feb 14, 2014 at 03:24:09PM +0100, Stefan Bader wrote: > Actually, this code just makes so much more sense if I let objdump do > relocation info... Ok, we're pretty sure you have an MFENCE there in resched_task but can you confirm it please. First, does /proc/cpuinfo have the "sse2" string? It should but nowadays I don't trust anything. Then, can you boot that kernel in qemu with the -gdb flag so that it starts the gdb stub, here's the manpage about it: -gdb dev Wait for gdb connection on device dev. Typical connections will likely be TCP-based, but also UDP, pseudo TTY, or even stdio are reasonable use case. The latter is allowing to start QEMU from within gdb and establish the connection via a pipe: (gdb) target remote | exec qemu-system-i386 -gdb stdio ... -s Shorthand for -gdb tcp::1234, i.e. open a gdbserver on TCP port 1234. then boot the guest and when it is up, do $ gdb ./vmlinux $ target remote localhost:1234 and type in the prompt $ (gdb) x/50i resched_task It gives here: (gdb) x/50i resched_task 0xffffffff810836f0 <resched_task>: data32 data32 data32 xchg %ax,%ax 0xffffffff810836f5 <resched_task+5>: push %rbp 0xffffffff810836f6 <resched_task+6>: mov 0x85e123(%rip),%r10d # 0xffffffff818e1820 <debug_locks> 0xffffffff810836fd <resched_task+13>: mov %rsp,%rbp 0xffffffff81083700 <resched_task+16>: push %r12 0xffffffff81083702 <resched_task+18>: test %r10d,%r10d 0xffffffff81083705 <resched_task+21>: push %rbx 0xffffffff81083706 <resched_task+22>: mov %rdi,%rbx 0xffffffff81083709 <resched_task+25>: jne 0xffffffff81083760 <resched_task+112> 0xffffffff8108370b <resched_task+27>: mov 0x8(%rbx),%rax 0xffffffff8108370f <resched_task+31>: mov 0x10(%rax),%rdx 0xffffffff81083713 <resched_task+35>: and $0x8,%edx 0xffffffff81083716 <resched_task+38>: jne 0xffffffff8108373c <resched_task+76> 0xffffffff81083718 <resched_task+40>: lock orb $0x8,0x10(%rax) 0xffffffff8108371d <resched_task+45>: mov 0x8(%rbx),%rax 0xffffffff81083721 <resched_task+49>: mov 0x18(%rax),%r12d 0xffffffff81083725 <resched_task+53>: callq 0xffffffff812d8fc0 <debug_smp_processor_id> 0xffffffff8108372a <resched_task+58>: cmp %r12d,%eax 0xffffffff8108372d <resched_task+61>: je 0xffffffff810837a0 <resched_task+176> 0xffffffff8108372f <resched_task+63>: mfence ^^^^^^ I want to make sure the smp_mb() is really replaced with an MFENCE there. Thanks!
Oh, and just in case this is relatively easy to reproduce and in case we don't have any other idea, bisection might be another option. I'm not saying you should do it right away - I'm just putting it on the table... :-) :-)
On 14.02.2014 16:21, Borislav Petkov wrote: > Oh, and just in case this is relatively easy to reproduce and in case we > don't have any other idea, bisection might be another option. I'm not > saying you should do it right away - I'm just putting it on the table... > > :-) > > :-) > Oh yeah, bisection is nearly as entertaining as doing my tax records. Hm, on the other hand those will have to be done at some point too... :-P
On Fri, Feb 14, 2014 at 04:28:16PM +0100, Stefan Bader wrote: > Oh yeah, bisection is nearly as entertaining as doing my tax records. > Hm, on the other hand those will have to be done at some point too... > :-P Ah, tax records, crap, I have those lying around in the corner since forever... I need to do those too :-(
On Fri, Feb 14, 2014 at 04:21:32PM +0100, Borislav Petkov wrote: > Oh, and just in case this is relatively easy to reproduce and in case we > don't have any other idea, bisection might be another option. I'm not > saying you should do it right away - I'm just putting it on the table... I'm fairly sure you're going to hit the patch that introduced the whole per-cpu preemption thing. Before that not receiving the IPI wasn't as bad as it is now. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 14.02.2014 15:47, Borislav Petkov wrote: > On Fri, Feb 14, 2014 at 03:24:09PM +0100, Stefan Bader wrote: >> Actually, this code just makes so much more sense if I let objdump do >> relocation info... > > Ok, we're pretty sure you have an MFENCE there in resched_task but can > you confirm it please. > > First, does /proc/cpuinfo have the "sse2" string? It should but nowadays > I don't trust anything. > > Then, can you boot that kernel in qemu with the -gdb flag so that it > starts the gdb stub, here's the manpage about it: > > -gdb dev > Wait for gdb connection on device dev. Typical connections will likely be > TCP-based, but also UDP, pseudo TTY, or even stdio are reasonable use case. > The latter is allowing to start QEMU from within gdb and establish the > connection via a pipe: > > (gdb) target remote | exec qemu-system-i386 -gdb stdio ... > > -s Shorthand for -gdb tcp::1234, i.e. open a gdbserver on TCP port 1234. > > then boot the guest and when it is up, do > > $ gdb ./vmlinux > $ target remote localhost:1234 > > and type in the prompt > > $ (gdb) x/50i resched_task > > It gives here: > > (gdb) x/50i resched_task > 0xffffffff810836f0 <resched_task>: data32 data32 data32 xchg %ax,%ax > 0xffffffff810836f5 <resched_task+5>: push %rbp > 0xffffffff810836f6 <resched_task+6>: mov 0x85e123(%rip),%r10d # 0xffffffff818e1820 <debug_locks> > 0xffffffff810836fd <resched_task+13>: mov %rsp,%rbp > 0xffffffff81083700 <resched_task+16>: push %r12 > 0xffffffff81083702 <resched_task+18>: test %r10d,%r10d > 0xffffffff81083705 <resched_task+21>: push %rbx > 0xffffffff81083706 <resched_task+22>: mov %rdi,%rbx > 0xffffffff81083709 <resched_task+25>: jne 0xffffffff81083760 <resched_task+112> > 0xffffffff8108370b <resched_task+27>: mov 0x8(%rbx),%rax > 0xffffffff8108370f <resched_task+31>: mov 0x10(%rax),%rdx > 0xffffffff81083713 <resched_task+35>: and $0x8,%edx > 0xffffffff81083716 <resched_task+38>: jne 0xffffffff8108373c <resched_task+76> > 0xffffffff81083718 <resched_task+40>: lock orb $0x8,0x10(%rax) > 0xffffffff8108371d <resched_task+45>: mov 0x8(%rbx),%rax > 0xffffffff81083721 <resched_task+49>: mov 0x18(%rax),%r12d > 0xffffffff81083725 <resched_task+53>: callq 0xffffffff812d8fc0 <debug_smp_processor_id> > 0xffffffff8108372a <resched_task+58>: cmp %r12d,%eax > 0xffffffff8108372d <resched_task+61>: je 0xffffffff810837a0 <resched_task+176> > 0xffffffff8108372f <resched_task+63>: mfence > ^^^^^^ > I want to make sure the smp_mb() is really replaced with an MFENCE there. > > Thanks! > Okaaay, I think I did what you asked. So yes, there is sse2 in the cpu info. And there is a mfence in the disassembly: 0xc107dcb0 <resched_task>: push %ebp 0xc107dcb1 <resched_task+1>: mov %esp,%ebp 0xc107dcb3 <resched_task+3>: lea %ds:0x0(%esi,%eiz,1),%esi 0xc107dcb8 <resched_task+8>: mov 0x4(%eax),%edx 0xc107dcbb <resched_task+11>: mov 0x8(%edx),%ecx 0xc107dcbe <resched_task+14>: and $0x8,%ecx 0xc107dcc1 <resched_task+17>: jne 0xc107dce7 <resched_task+55> 0xc107dcc3 <resched_task+19>: orb $0x8,%ds:0x8(%edx) 0xc107dcc8 <resched_task+24>: mov 0x4(%eax),%edx 0xc107dccb <resched_task+27>: mov %fs:0xc1a71010,%ecx 0xc107dcd2 <resched_task+34>: mov 0x10(%edx),%edx 0xc107dcd5 <resched_task+37>: cmp %ecx,%edx 0xc107dcd7 <resched_task+39>: je 0xc107dd00 <resched_task+80> 0xc107dcd9 <resched_task+41>: mfence 0xc107dcdc <resched_task+44>: mov %esi,%esi 0xc107dcde <resched_task+46>: mov 0x4(%eax),%eax 0xc107dce1 <resched_task+49>: testb $0x4,0xc(%eax) 0xc107dce5 <resched_task+53>: je 0xc107dcf0 <resched_task+64> 0xc107dce7 <resched_task+55>: pop %ebp 0xc107dce8 <resched_task+56>: ret 0xc107dce9 <resched_task+57>: lea 0x0(%esi,%eiz,1),%esi 0xc107dcf0 <resched_task+64>: mov %edx,%eax 0xc107dcf2 <resched_task+66>: call *0xc1917950 0xc107dcf8 <resched_task+72>: pop %ebp Thinking about it, I guess Peter is quite right saying that I likely will end on the patch that converted preempt_count to percpu. One thing I likely should do is to reinstall the exact same laptop with 64bit kernel and userspace... maybe only 64bit kernel first... and make sure on my side that this does not show up on 64bit, too. I took the word of reporters for that (and the impression that otherwise many more people would have complained).
On Fri, Feb 14, 2014 at 06:02:32PM +0100, Stefan Bader wrote: > One thing I likely should do is to reinstall the exact same laptop with 64bit > kernel and userspace... maybe only 64bit kernel first... and make sure on my > side that this does not show up on 64bit, too. I took the word of reporters for > that (and the impression that otherwise many more people would have complained). Yeha, I'm going to try and install some 32bit userspace on a usb harddisk I've got and see if I can boot my Core2 laptop from that to try and reproduce. But all that is probably going to be Monday :/ -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Feb 14, 2014 at 06:02:32PM +0100, Stefan Bader wrote: > Okaaay, I think I did what you asked. So yes, there is sse2 in the cpu info. And > there is a mfence in the disassembly: Btw, I just realized booting the kernel in the guest was a dumb idea, because, doh, the guest is not baremetal. The only reliable thing we can say is that sse2 is present and that MFENCE alternative replacement works :) But for simplicity's sake let's just assume the machine can do MFENCE just fine and it gets replaced by the alternatives code. Besides, if that weren't true, we'd have a whole lot of other problems on those boxes. > Thinking about it, I guess Peter is quite right saying that I likely > will end on the patch that converted preempt_count to percpu. Yeah, c2daa3bed53a81171cf8c1a36db798e82b91afe8 et al. > One thing I likely should do is to reinstall the exact same laptop > with 64bit kernel and userspace... maybe only 64bit kernel first... > and make sure on my side that this does not show up on 64bit, too. I > took the word of reporters for that (and the impression that otherwise > many more people would have complained). Yeah, that should be a prudent thing to do. Also, Paolo and I were wondering whether you can trigger this thing without kvm, i.e. virtualization involved... do you have any data on that? Thanks.
On 14.02.2014 18:33, Borislav Petkov wrote: > On Fri, Feb 14, 2014 at 06:02:32PM +0100, Stefan Bader wrote: >> Okaaay, I think I did what you asked. So yes, there is sse2 in the cpu info. And >> there is a mfence in the disassembly: > > Btw, I just realized booting the kernel in the guest was a dumb idea, > because, doh, the guest is not baremetal. The only reliable thing we > can say is that sse2 is present and that MFENCE alternative replacement > works :) > > But for simplicity's sake let's just assume the machine can do MFENCE > just fine and it gets replaced by the alternatives code. > > Besides, if that weren't true, we'd have a whole lot of other problems > on those boxes. > >> Thinking about it, I guess Peter is quite right saying that I likely >> will end on the patch that converted preempt_count to percpu. > > Yeah, c2daa3bed53a81171cf8c1a36db798e82b91afe8 et al. > >> One thing I likely should do is to reinstall the exact same laptop >> with 64bit kernel and userspace... maybe only 64bit kernel first... >> and make sure on my side that this does not show up on 64bit, too. I >> took the word of reporters for that (and the impression that otherwise >> many more people would have complained). > > Yeah, that should be a prudent thing to do. > > Also, Paolo and I were wondering whether you can trigger this thing > without kvm, i.e. virtualization involved... do you have any data on > that? Unfortunately no hard evidence. Kvm just happens to be such a good way to notice this as it is using the reschedule interrupt itself and has this exit before running the guest vcpu to hadnle it in the outer loop by calling cond_resched() and repeat. I find running kvm seems to make that laptop quite sluggish in responding to other tasks (in that install) and I got some oddness going on when lightdm quite often refuses to take keyboard input without opening some menu with the mouse first... But I could not be sure whether that is the kernel or some new user-space ... errr "feature". At least Marcello (iirc that other report came from him directly or indirectly) has seen it, too. And he likely has complete different user-space. So I will go and do that different (64bit) kernel and kernel + user-space test. But like fo Peter, it likely is a Monday thing...
On 14.02.2014 19:23, Stefan Bader wrote: > On 14.02.2014 18:33, Borislav Petkov wrote: >> On Fri, Feb 14, 2014 at 06:02:32PM +0100, Stefan Bader wrote: >>> Okaaay, I think I did what you asked. So yes, there is sse2 in the cpu info. And >>> there is a mfence in the disassembly: >> >> Btw, I just realized booting the kernel in the guest was a dumb idea, >> because, doh, the guest is not baremetal. The only reliable thing we >> can say is that sse2 is present and that MFENCE alternative replacement >> works :) >> >> But for simplicity's sake let's just assume the machine can do MFENCE >> just fine and it gets replaced by the alternatives code. >> >> Besides, if that weren't true, we'd have a whole lot of other problems >> on those boxes. >> >>> Thinking about it, I guess Peter is quite right saying that I likely >>> will end on the patch that converted preempt_count to percpu. >> >> Yeah, c2daa3bed53a81171cf8c1a36db798e82b91afe8 et al. >> >>> One thing I likely should do is to reinstall the exact same laptop >>> with 64bit kernel and userspace... maybe only 64bit kernel first... >>> and make sure on my side that this does not show up on 64bit, too. I >>> took the word of reporters for that (and the impression that otherwise >>> many more people would have complained). >> >> Yeah, that should be a prudent thing to do. >> >> Also, Paolo and I were wondering whether you can trigger this thing >> without kvm, i.e. virtualization involved... do you have any data on >> that? > > Unfortunately no hard evidence. Kvm just happens to be such a good way to notice > this as it is using the reschedule interrupt itself and has this exit before > running the guest vcpu to hadnle it in the outer loop by calling cond_resched() > and repeat. > I find running kvm seems to make that laptop quite sluggish in responding to > other tasks (in that install) and I got some oddness going on when lightdm quite > often refuses to take keyboard input without opening some menu with the mouse > first... But I could not be sure whether that is the kernel or some new > user-space ... errr "feature". > At least Marcello (iirc that other report came from him directly or indirectly) > has seen it, too. And he likely has complete different user-space. > > So I will go and do that different (64bit) kernel and kernel + user-space test. > But like fo Peter, it likely is a Monday thing... > Ok, it is still Friday... So a quick test (2 boots of a 32bit guest, same as before) on the 32bit user-space, with the same kernel source, but compiled as 64bit (obviously not 100% same config but close). While I see the false inconsistency messages (I modified the in kernel-test to trigger the trace stop only if the __vcpu_run loop encounters an inconsistent state three times in a row), I do not see the final stop message. Also (but that is rather feeling) the system seems to remain more responsive (switching to other windows, opening terminal windows,...) compared to 32bit kernel.
On 14.02.2014 18:21, Peter Zijlstra wrote: > On Fri, Feb 14, 2014 at 06:02:32PM +0100, Stefan Bader wrote: >> One thing I likely should do is to reinstall the exact same laptop with 64bit >> kernel and userspace... maybe only 64bit kernel first... and make sure on my >> side that this does not show up on 64bit, too. I took the word of reporters for >> that (and the impression that otherwise many more people would have complained). > > Yeha, I'm going to try and install some 32bit userspace on a usb > harddisk I've got and see if I can boot my Core2 laptop from that to try > and reproduce. > > But all that is probably going to be Monday :/ > *sigh* Already Thursday... Peter, did you get to reproduce this locally? Unfortunately I had some interruption and have not more Information than on last Friday (which is that the same hw but 64bit kernel does not show it). Meanwhile I wonder whether it would make sense to push the following (or more?) to stable for 3.13.y: 1) 16824255394f55adf31b9a96a9965d8c15bdac4c x86, acpi, idle: Restructure the mwait idle routines 2) 7e98b71920464b8d15fa95c74366416cd3c88861 x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers 3) 8cb75e0c4ec9786b81439761eac1d18d4a931af3 sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding 4) 215393bc1fab3d61a5a296838bdffce22f27ffda sched/preempt/x86: Fix voluntary preempt for x86 1+2 would be to avoid touching 3 too much and looked to be improvements on their own. 3+4 would be cherry-picks if not for some fuzz 2. I saw a few more things labelled preempt between 3.13 and current HEAD but am not sure whether or which of those are strictly required. Namely some fixing to preempt_enable_no_resched() mis-usage and maybe one fixing an issue of ftrace locking up. -Stefan 3
On Thu, Feb 20, 2014 at 04:38:13PM +0100, Stefan Bader wrote: > On 14.02.2014 18:21, Peter Zijlstra wrote: > > On Fri, Feb 14, 2014 at 06:02:32PM +0100, Stefan Bader wrote: > >> One thing I likely should do is to reinstall the exact same laptop with 64bit > >> kernel and userspace... maybe only 64bit kernel first... and make sure on my > >> side that this does not show up on 64bit, too. I took the word of reporters for > >> that (and the impression that otherwise many more people would have complained). > > > > Yeha, I'm going to try and install some 32bit userspace on a usb > > harddisk I've got and see if I can boot my Core2 laptop from that to try > > and reproduce. > > > > But all that is probably going to be Monday :/ > > > *sigh* Already Thursday... > > Peter, did you get to reproduce this locally? Unfortunately I had some > interruption and have not more Information than on last Friday (which is that > the same hw but 64bit kernel does not show it). I got side-tracked as well, someone reported crashes, which come above weird behaviour :/ > Meanwhile I wonder whether it would make sense to push the following (or more?) > to stable for 3.13.y: > > 1) 16824255394f55adf31b9a96a9965d8c15bdac4c > x86, acpi, idle: Restructure the mwait idle routines > 2) 7e98b71920464b8d15fa95c74366416cd3c88861 > x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers > 3) 8cb75e0c4ec9786b81439761eac1d18d4a931af3 > sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding > 4) 215393bc1fab3d61a5a296838bdffce22f27ffda > sched/preempt/x86: Fix voluntary preempt for x86 > > 1+2 would be to avoid touching 3 too much and looked to be improvements on their > own. 3+4 would be cherry-picks if not for some fuzz 2. I'll not object; but if someone wants a smaller set you could do with just the idle.c hunk from 3. > I saw a few more things labelled preempt between 3.13 and current HEAD but am > not sure whether or which of those are strictly required. Namely some fixing to > preempt_enable_no_resched() mis-usage and maybe one fixing an issue of ftrace > locking up. Yeah, that's all fresh breakage and doesn't go back to 3.13 I think. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Il 20/02/2014 16:50, Peter Zijlstra ha scritto: >>>> > >> One thing I likely should do is to reinstall the exact same laptop with 64bit >>>> > >> kernel and userspace... maybe only 64bit kernel first... and make sure on my >>>> > >> side that this does not show up on 64bit, too. I took the word of reporters for >>>> > >> that (and the impression that otherwise many more people would have complained). >>> > > >>> > > Yeha, I'm going to try and install some 32bit userspace on a usb >>> > > harddisk I've got and see if I can boot my Core2 laptop from that to try >>> > > and reproduce. >>> > > >>> > > But all that is probably going to be Monday :/ >>> > > >> > *sigh* Already Thursday... >> > >> > Peter, did you get to reproduce this locally? Unfortunately I had some >> > interruption and have not more Information than on last Friday (which is that >> > the same hw but 64bit kernel does not show it). > I got side-tracked as well, someone reported crashes, which come above > weird behaviour :/ > Stefan, Peter, any news here? Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 24.03.2014 18:39, Paolo Bonzini wrote: > Il 20/02/2014 16:50, Peter Zijlstra ha scritto: >>>>> > >> One thing I likely should do is to reinstall the exact same laptop >>>>> with 64bit >>>>> > >> kernel and userspace... maybe only 64bit kernel first... and make sure >>>>> on my >>>>> > >> side that this does not show up on 64bit, too. I took the word of >>>>> reporters for >>>>> > >> that (and the impression that otherwise many more people would have >>>>> complained). >>>> > > >>>> > > Yeha, I'm going to try and install some 32bit userspace on a usb >>>> > > harddisk I've got and see if I can boot my Core2 laptop from that to try >>>> > > and reproduce. >>>> > > >>>> > > But all that is probably going to be Monday :/ >>>> > > >>> > *sigh* Already Thursday... >>> > >>> > Peter, did you get to reproduce this locally? Unfortunately I had some >>> > interruption and have not more Information than on last Friday (which is that >>> > the same hw but 64bit kernel does not show it). >> I got side-tracked as well, someone reported crashes, which come above >> weird behaviour :/ >> > > Stefan, Peter, any news here? > > Paolo No, unfortunately not. :( I thought I had some idea as i386 seems to have soft and hard irqs using a seperate stack and thought they maybe loose the process flag there (x86_64 seems to always do that for soft irqs and if it does for hard irqs it was hidden well enough for me not to see). But trial and error showed showed no improvement when I made a copy of the original TIF_NEED_RSCHED before executing on stack. So not really any further. And then the usual distractions hit... -Stefan
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 39c28f09dfd5..f8b2fc7ce491 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -6115,7 +6115,11 @@ static int __vcpu_run(struct kvm_vcpu *vcpu) } if (need_resched()) { srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx); - cond_resched(); + if (need_resched() != should_resched()) { + tracing_stop(); + printk(KERN_ERR "Stopped tracing, due to inconsistent state.\n"); + } + schedule(); vcpu->srcu_idx = srcu_read_lock(&kvm->srcu); } }