
Another preempt folding issue?

Message ID 20140211194553.GZ9987@twins.programming.kicks-ass.net (mailing list archive)
State New, archived

Commit Message

Peter Zijlstra Feb. 11, 2014, 7:45 p.m. UTC
On Tue, Feb 11, 2014 at 07:34:51PM +0100, Stefan Bader wrote:
> Hi Peter,
> 
> I am currently looking at a weird issue that manifests itself when trying to run
> kvm-enabled qemu on an i386 host (v3.13 kernel; oh, and potentially important: the
> cpu is 64bit capable, so qemu-system-x86_64 is called).

AMD or Intel machine?

> Sooner or later this
> causes softlockup messages on the host. I tracked this down to __vcpu_run in
> arch/x86/kvm/x86.c, which contains a loop that in this case never seems to make
> progress or exit.
> 
> What I found is that vcpu_enter_guest will exit quickly without causing the loop
> to exit when need_resched() is true. Looking at a crash dump I took, this was
> the case (thread_info->flags had TIF_NEED_RESCHED set). So after immediately
> returning __vcpu_run has the following code:
> 
>         if (need_resched()) {
>                 srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
>                 kvm_resched(vcpu); // now cond_resched();
>                 vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
>         }
> 
> The kvm_resched basically ends up doing a cond_resched(), which now checks
> preempt_count() to be 0. If that is zero it will do the reschedule; otherwise it
> just does nothing. Looking at the percpu variables in the dump, I saw that
> the preempt_count was 0x80000000 (actually it was 0x80110000, but that was me
> triggering the kexec crashdump with sysrq-c).
> 
> I saw that there have been some changes in the upstream kernel and have picked
> the following:
> 1) x86, acpi, idle: Restructure the mwait idle routines
> 2) x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers
> 3) sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding
> 4) sched/preempt/x86: Fix voluntary preempt for x86
> 
> Patches 1) and 2) are dependencies of 3) (to get the mwait function correct and into
> the other file). Finally, 4) fixes up 3). [maybe worth suggesting these for
> 3.13.y stable].
> 
> Still, with all those I got the softlockup. Since I knew from the dump info that
> something is wrong with the folding, I made the pragmatic approach and added the
> following:
> 
>         if (need_resched()) {
>                 srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
> +               preempt_fold_need_resched();
>                 kvm_resched(vcpu); // now cond_resched();
>                 vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
>         }
> 
> And this lets the kvm guest run without the softlockups! However I am less than
> convinced that this is the right thing to do. Somehow something done when
> converting the preempt_count into a percpu variable has caused at least the i386 side to
> get into this mess (as there has not been any whining about 64bit). I just fail to
> see what.

I've been looking at the same thing too; I've got a trace from someone
who can reproduce it and it's just not making sense.

Looks like the hardware is losing an interrupt (or worse).

With the below patch on top of current -git (might be whitespace damaged
due to copy-paste).

---
 arch/x86/kvm/x86.c | 6 +++++-
 arch/x86/kvm/x86.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 39c28f09dfd5..f8b2fc7ce491 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6115,7 +6115,11 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
                }
                if (need_resched()) {
                        srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
-                       cond_resched();
+                       if (need_resched() != should_resched()) {
+                               tracing_stop();
+                               printk(KERN_ERR "Stopped tracing, due to inconsistent state.\n");
+                       }
+                       schedule();
                        vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
                }

---

I got the following from an Athlon X2 (fam 0xf):

On Mon, Feb 10, 2014 at 05:23:15PM -0200, Marcelo Tosatti wrote:
> What is the problem exactly?

From the trace:

 qemu-system-x86-2455  [001] d.s3   393.630586: smp_reschedule_interrupt <-reschedule_interrupt
 qemu-system-x86-2455  [001] d.s3   393.630586: scheduler_ipi <-smp_reschedule_interrupt
 qemu-system-x86-2455  [001] ..s3   393.630586: preempt_count_sub <-_raw_spin_unlock_irqrestore
 qemu-system-x86-2455  [001] ..s2   393.630586: _raw_spin_lock_irq <-run_timer_softirq
 qemu-system-x86-2455  [001] d.s2   393.630586: preempt_count_add <-_raw_spin_lock_irq
 qemu-system-x86-2455  [001] d.s3   393.630586: _raw_spin_unlock_irq <-run_timer_softirq
 qemu-system-x86-2455  [001] ..s3   393.630586: preempt_count_sub <-_raw_spin_unlock_irq
 qemu-system-x86-2455  [001] ..s2   393.630586: rcu_bh_qs <-__do_softirq
 qemu-system-x86-2455  [001] d.s2   393.630586: irqtime_account_irq <-__do_softirq
 qemu-system-x86-2455  [001] d.s2   393.630586: __local_bh_enable <-__do_softirq
 qemu-system-x86-2455  [001] d.s2   393.630586: preempt_count_sub <-__local_bh_enable


           dmesg-2458  [000] d.s5   393.630614: resched_task <-check_preempt_wakeup
           dmesg-2458  [000] d.s5   393.630614: native_smp_send_reschedule <-resched_task
           dmesg-2458  [000] d.s5   393.630614: default_send_IPI_mask_logical <-native_smp_send_reschedule


 qemu-system-x86-2455  [001] .n..   393.630636: preempt_count_add <-kmap_atomic_prot
 qemu-system-x86-2455  [001] .n.1   393.630639: __kunmap_atomic <-clear_huge_page
 qemu-system-x86-2455  [001] .n.1   393.630639: preempt_count_sub <-__kunmap_atomic
 qemu-system-x86-2455  [001] .n..   393.630639: _cond_resched <-clear_huge_page
 qemu-system-x86-2455  [001] .n..   393.630640: kmap_atomic <-clear_huge_page
 qemu-system-x86-2455  [001] .n..   393.630640: kmap_atomic_prot <-kmap_atomic
 qemu-system-x86-2455  [001] .n..   393.630640: preempt_count_add <-kmap_atomic_prot


The resched_task() in the middle does:

  set_tsk_need_resched();
  smp_mb();
  smp_send_reschedule();

This means that the remote cpu must observe TIF_NEED_RESCHED (.n.. in
the traces) when the IPI (smp_reschedule_interrupt) hits.

Now given the machine has unsynchronized TSC, the order in the above is
not necessarily the true order, but there are only two possible scenarios:

1) it's the right order; the smp_reschedule_interrupt() is from a
 previous resched_task()/rcu kick or any of the other users that send
 it.

 In which case we should get another smp_reschedule_interrupt() _after_
 the resched_task() which should observe the 'n' and promote it to an
 'N'.

 This doesn't happen and we're stuck with the 'n' state..

2) it's the wrong order and the smp_reschedule_interrupt() actually
 happened after the resched_task(), but in that case we should have
 observed the 'n'; we did not.

Now given that 2) would mean a bug in the coherency fabric and 1) would
mean a lost interrupt, 1) seems like the more likely suspect.

Joerg wanted the extra kvm tracepoints to see what the vm-exit reason
is.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Stefan Bader Feb. 12, 2014, 8:20 a.m. UTC | #1
On 11.02.2014 20:45, Peter Zijlstra wrote:
> On Tue, Feb 11, 2014 at 07:34:51PM +0100, Stefan Bader wrote:
>> Hi Peter,
>>
>> I am currently looking at a weird issue that manifests itself when trying to run
>> kvm-enabled qemu on an i386 host (v3.13 kernel; oh, and potentially important: the
>> cpu is 64bit capable, so qemu-system-x86_64 is called).
> 
> AMD or Intel machine?

Personally I am reproducing this on an AMD:

cpu family	: 15
model		: 72
model name	: AMD Turion(tm) 64 X2 Mobile Technology TL-64

but the reports I got are showing the same issue on an Intel i7

cpu family	: 6
model		: 37
model name	: Intel(R) Core(TM) i7 CPU       M 620  @ 2.67GH

Another reporter also saw this on an AMD and said it could not be reproduced on
the same hardware and the same software versions when using 64bit instead of 32.

In my case on a 32bit installation I will see this on every invocation of

qemu-system-x86_64 -cdrom <32bit live iso> -boot d -m 512

> 
>> Sooner or later this
>> causes softlockup messages on the host. I tracked this down to __vcpu_run in
>> arch/x86/kvm/x86.c which does a loop which in that case never seems to make
>> progress or exit.
>>
>> What I found is that vcpu_enter_guest will exit quickly without causing the loop
>> to exit when need_resched() is true. Looking at a crash dump I took, this was
>> the case (thread_info->flags had TIF_NEED_RESCHED set). So after immediately
>> returning __vcpu_run has the following code:
>>
>>         if (need_resched()) {
>>                 srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
>>                 kvm_resched(vcpu); // now cond_resched();
>>                 vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
>>         }
>>
>> The kvm_resched basically would end up doing a cond_resched() which now checks
>> preempt_count() to be 0. If that is zero it will do the reschedule, otherwise it
>> just does nothing. Looking at the percpu variables in the dump, I saw that
>> the preempt_count was 0x80000000 (actually it was 0x80110000 but that was me
>> triggering the kexec crashdump with sysrq-c).
>>
>> I saw that there have been some changes in the upstream kernel and have picked
>> the following:
>> 1) x86, acpi, idle: Restructure the mwait idle routines
>> 2) x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers
>> 3) sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding
>> 4) sched/preempt/x86: Fix voluntary preempt for x86
>>
>> Patch 1) and 2) as dependencies of 3) (to get the mwait function correct and to
>> the other file). Finally 4) is fixing up 3). [maybe worth suggesting to do for
>> 3.13.y stable].
>>
>> Still, with all those I got the softlockup. Since I knew from the dump info that
>> something is wrong with the folding, I made the pragmatic approach and added the
>> following:
>>
>>         if (need_resched()) {
>>                 srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
>> +               preempt_fold_need_resched();
>>                 kvm_resched(vcpu); // now cond_resched();
>>                 vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
>>         }
>>
>> And this lets the kvm guest run without the softlockups! However I am less than
>> convinced that this is the right thing to do. Somehow something done when
>> converting the preempt_count into a percpu variable has caused at least the i386 side to
>> get into this mess (as there has not been any whining about 64bit). I just fail to
>> see what.
> 
> I've been looking at the same thing too; I've got a trace from someone
> who can reproduce it and it's just not making sense.
> 
> Looks like the hardware is losing an interrupt (or worse).
> 
> With the below patch on top of current -git (might be whitespace damaged
> due to copy-paste).
> 
> ---
>  arch/x86/kvm/x86.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 39c28f09dfd5..f8b2fc7ce491 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6115,7 +6115,11 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
>                 }
>                 if (need_resched()) {
>                         srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
> -                       cond_resched();
> +                       if (need_resched() != should_resched()) {
> +                               tracing_stop();
> +                               printk(KERN_ERR "Stopped tracing, due to inconsistent state.\n");
> +                       }
> +                       schedule();
>                         vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
>                 }
>         }
> 
> ---
> 
> I got the following from an Athlon X2 (fam 0xf):
> 
> On Mon, Feb 10, 2014 at 05:23:15PM -0200, Marcelo Tosatti wrote:
>> What is the problem exactly?
> 
> From the trace:
> 
>  qemu-system-x86-2455  [001] d.s3   393.630586: smp_reschedule_interrupt <-reschedule_interrupt
>  qemu-system-x86-2455  [001] d.s3   393.630586: scheduler_ipi <-smp_reschedule_interrupt
>  qemu-system-x86-2455  [001] ..s3   393.630586: preempt_count_sub <-_raw_spin_unlock_irqrestore
>  qemu-system-x86-2455  [001] ..s2   393.630586: _raw_spin_lock_irq <-run_timer_softirq
>  qemu-system-x86-2455  [001] d.s2   393.630586: preempt_count_add <-_raw_spin_lock_irq
>  qemu-system-x86-2455  [001] d.s3   393.630586: _raw_spin_unlock_irq <-run_timer_softirq
>  qemu-system-x86-2455  [001] ..s3   393.630586: preempt_count_sub <-_raw_spin_unlock_irq
>  qemu-system-x86-2455  [001] ..s2   393.630586: rcu_bh_qs <-__do_softirq
>  qemu-system-x86-2455  [001] d.s2   393.630586: irqtime_account_irq <-__do_softirq
>  qemu-system-x86-2455  [001] d.s2   393.630586: __local_bh_enable <-__do_softirq
>  qemu-system-x86-2455  [001] d.s2   393.630586: preempt_count_sub <-__local_bh_enable
> 
> 
>            dmesg-2458  [000] d.s5   393.630614: resched_task <-check_preempt_wakeup
>            dmesg-2458  [000] d.s5   393.630614: native_smp_send_reschedule <-resched_task
>            dmesg-2458  [000] d.s5   393.630614: default_send_IPI_mask_logical <-native_smp_send_reschedule
> 
> 
>  qemu-system-x86-2455  [001] .n..   393.630636: preempt_count_add <-kmap_atomic_prot
>  qemu-system-x86-2455  [001] .n.1   393.630639: __kunmap_atomic <-clear_huge_page
>  qemu-system-x86-2455  [001] .n.1   393.630639: preempt_count_sub <-__kunmap_atomic
>  qemu-system-x86-2455  [001] .n..   393.630639: _cond_resched <-clear_huge_page
>  qemu-system-x86-2455  [001] .n..   393.630640: kmap_atomic <-clear_huge_page
>  qemu-system-x86-2455  [001] .n..   393.630640: kmap_atomic_prot <-kmap_atomic
>  qemu-system-x86-2455  [001] .n..   393.630640: preempt_count_add <-kmap_atomic_prot
> 
> 
> The resched_task() in the middle does:
> 
>   set_tsk_need_resched();
>   smp_mb();
>   smp_send_reschedule();
> 
> This means that the remote cpu must observe TIF_NEED_RESCHED (.n.. in
> the traces) when the IPI (smp_reschedule_interrupt) hits.
> 
> Now given the machine has unsynchronized TSC, the order in the above is
> not necessarily the true order, but there are only two possible scenarios:
> 
> 1) it's the right order; the smp_reschedule_interrupt() is from a
>  previous resched_task()/rcu kick or any of the other users that send
>  it.
> 
>  In which case we should get another smp_reschedule_interrupt() _after_
>  the resched_task() which should observe the 'n' and promote it to an
>  'N'.
> 
>  This doesn't happen and we're stuck with the 'n' state..
> 
> 2) it's the wrong order and the smp_reschedule_interrupt() actually
>  happened after the resched_task(), but in that case we should have
>  observed the 'n'; we did not.
> 
> Now given that 2) would mean a bug in the coherency fabric and 1) would
> mean a lost interrupt, 1) seems like the more likely suspect.
> 
> Joerg wanted the extra kvm tracepoints to see what the vm-exit reason
> is.
> 

Let me think about this after I've had another cup of coffee and my brain decides
to follow my body in the process of waking up...

-Stefan
Peter Zijlstra Feb. 12, 2014, 10:37 a.m. UTC | #2
On Wed, Feb 12, 2014 at 09:20:24AM +0100, Stefan Bader wrote:
> On 11.02.2014 20:45, Peter Zijlstra wrote:
> > On Tue, Feb 11, 2014 at 07:34:51PM +0100, Stefan Bader wrote:
> >> Hi Peter,
> >>
> >> I am currently looking at a weird issue that manifests itself when trying to run
> >> kvm-enabled qemu on an i386 host (v3.13 kernel; oh, and potentially important: the
> >> cpu is 64bit capable, so qemu-system-x86_64 is called).
> > 
> > AMD or Intel machine?
> 
> Personally I am reproducing this on an AMD:
> 
> cpu family	: 15
> model		: 72
> model name	: AMD Turion(tm) 64 X2 Mobile Technology TL-64
> 
> but the reports I got are showing the same issue on an Intel i7
> 
> cpu family	: 6
> model		: 37
> model name	: Intel(R) Core(TM) i7 CPU       M 620  @ 2.67GH
> 
> Another reporter also saw this on an AMD and said it could not be reproduced on
> the same hardware and the same software versions when using 64bit instead of 32.
> 
> In my case on a 32bit installation I will see this on every invocation of

Curious... so it's both AMD and Intel.

That makes funny hardware less likely... but how can the vmexit lose an
interrupt like this?


Borislav Petkov Feb. 12, 2014, 10:40 a.m. UTC | #3
On Wed, Feb 12, 2014 at 11:37:13AM +0100, Peter Zijlstra wrote:
> > Another reporter also saw this on an AMD and said it could not be reproduced on
> > the same hardware and the same software versions when using 64bit instead of 32.
> > 
> > In my case on a 32bit installation I will see this on every invocation of
> 
> Curious... so it's both AMD and Intel.
> 
> That makes funny hardware less likely... but how can the vmexit lose an
> interrupt like this?

Also what I'm wondering about and what's not clear from Stefan's reply
is whether this is purely a 32-bit issue, i.e. a 32-bit host running a
64-bit qemu running a 32-bit iso or what is it?

Or do we have reports for both 32-bit and 64-bit hosts?

Thanks.
Stefan Bader Feb. 12, 2014, 11:09 a.m. UTC | #4
On 12.02.2014 11:40, Borislav Petkov wrote:
> On Wed, Feb 12, 2014 at 11:37:13AM +0100, Peter Zijlstra wrote:
>>> Another reporter also saw this on an AMD and said it could not be reproduced on
>>> the same hardware and the same software versions when using 64bit instead of 32.
>>>
>>> In my case on a 32bit installation I will see this on every invocation of
>>
>> Curious... so it's both AMD and Intel.
>>
>> That makes funny hardware less likely... but how can the vmexit lose an
>> interrupt like this?
> 
> Also what I'm wondering about and what's not clear from Stefan's reply
> is whether this is purely a 32-bit issue, i.e. a 32-bit host running a
> 64-bit qemu running a 32-bit iso or what is it?
> 
> Or do we have reports for both 32-bit and 64-bit hosts?

Only for 32bit hosts. I have not tested a 64bit kernel plus 32bit userspace,
though. But 64bit kernel and 64bit user-space seemed ok. And one detail I should
add is that while reproduction succeeds in all attempts, the timing can vary. In
some of the cases this was even before running a Linux kernel in the guest.

Something else: here I run a kernel with CONFIG_PREEMPT not set and NR_CPUS
limited to 8 (for the 32bit kernel), so the default apic driver is used. Since
default_send_IPI_mask_logical is only used from there, I assume the trace you
got does the same. Maybe something there is wrong, which would explain why we
only see it on 32bit hosts.

> 
> Thanks.
>
Joerg Roedel Feb. 12, 2014, 11:12 a.m. UTC | #5
On Wed, Feb 12, 2014 at 11:40:17AM +0100, Borislav Petkov wrote:
> Also what I'm wondering about and what's not clear from Stefan's reply
> is whether this is purely a 32-bit issue, i.e. a 32-bit host running a
> 64-bit qemu running a 32-bit iso or what is it?
> 
> Or do we have reports for both 32-bit and 64-bit hosts?

As I understand it, the problem reproduces only on a 32bit host, which
also limits the guests to 32bit, at least when run with KVM. This is
true even if an x86-64 qemu is used. It is also independent of whether
the host cpu supports 64bit or not.


	Joerg


Peter Zijlstra Feb. 12, 2014, 11:54 a.m. UTC | #6
On Wed, Feb 12, 2014 at 12:09:29PM +0100, Stefan Bader wrote:
> Something else here I run a kernel with CONFIG_PREEMPT not set and NR_CPUS
> limited to 8 (for the 32bit kernel). So the default apic driver is used. Since
> default_send_IPI_mask_logical is only used from there, I assume the trace you
> got does the same. Maybe something there is wrong which would explain why we
> only see it on 32bit hosts.

Can you try with a different APIC driver to test this?
Stefan Bader Feb. 13, 2014, 5 p.m. UTC | #7
On 12.02.2014 12:54, Peter Zijlstra wrote:
> On Wed, Feb 12, 2014 at 12:09:29PM +0100, Stefan Bader wrote:
>> Something else here I run a kernel with CONFIG_PREEMPT not set and NR_CPUS
>> limited to 8 (for the 32bit kernel). So the default apic driver is used. Since
>> default_send_IPI_mask_logical is only used from there, I assume the trace you
>> got does the same. Maybe something there is wrong which would explain why we
>> only see it on 32bit hosts.
> 
> Can you try with a different APIC driver to test this?
> 
I don't think I can. And I think the statement about this only being used for 32bit
could be wrong. I was misled to think so because those are only defined in
probe_32, but the 64bit counterpart doesn't contain much aside from that.

Anyway, I played around with tracing a bit more. So with this change:

                if (need_resched()) {
                        srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
                        if (need_resched() != should_resched()) {
+                               trace_printk("need(%i) != should(%i)\n",
+                                       need_resched(), should_resched());
+                               trace_printk("exit_reason=%u\n",
+                                       vcpu->run->exit_reason);
+                               trace_printk("preempt_count=%lx\n",
+                                       __this_cpu_read_4(__preempt_count));
+                               tracing_stop();
+                               printk(KERN_ERR "Stopped tracing, due to inconsistent state.\n");
                        }
-                       cond_resched();
+                       schedule();
                        vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
                }

I get the following (weird) output:

            Xorg-1078  [001] d...    71.270251: native_smp_send_reschedule <-resched_task
            Xorg-1078  [001] d...    71.270251: default_send_IPI_mask_logical <-native_smp_send_reschedule
      bamfdaemon-2318  [001] d...    71.270465: resched_task <-check_preempt_wakeup
      bamfdaemon-2318  [001] d...    71.270539: resched_task <-check_preempt_wakeup
          compiz-2365  [001] d...    71.270689: resched_task <-check_preempt_wakeup
          compiz-2365  [001] d...    71.270827: resched_task <-check_preempt_wakeup
          compiz-2365  [001] d...    71.270940: resched_task <-check_preempt_wakeup
 qemu-system-x86-2679  [000] dn..    71.270999: smp_reschedule_interrupt <-reschedule_interrupt
 qemu-system-x86-2679  [000] dn..    71.270999: scheduler_ipi <-smp_reschedule_interrupt
 qemu-system-x86-2679  [000] .N..    71.271001: kvm_arch_vcpu_ioctl_run: need(1) != should(0)
 qemu-system-x86-2679  [000] .N..    71.271002: kvm_arch_vcpu_ioctl_run: exit_reason=2
 qemu-system-x86-2679  [000] .N..    71.271003: kvm_arch_vcpu_ioctl_run: preempt_count=0

So am I reading this right, that the interrupt did get delivered to cpu#0 while
the thread_info already had the resched flag set? So this really should have
cleared the bit in preempt_count. But while the trace info shows 'N', for some
reason should_resched() returns false, and at the same time reading the preempt
count manually shows it as 0?
Peter Zijlstra Feb. 13, 2014, 5:38 p.m. UTC | #8
On Thu, Feb 13, 2014 at 06:00:19PM +0100, Stefan Bader wrote:
> On 12.02.2014 12:54, Peter Zijlstra wrote:
> > On Wed, Feb 12, 2014 at 12:09:29PM +0100, Stefan Bader wrote:
> >> Something else here I run a kernel with CONFIG_PREEMPT not set and NR_CPUS
> >> limited to 8 (for the 32bit kernel). So the default apic driver is used. Since
> >> default_send_IPI_mask_logical is only used from there, I assume the trace you
> >> got does the same. Maybe something there is wrong which would explain why we
> >> only see it on 32bit hosts.
> > 
> > Can you try with a different APIC driver to test this?
> > 
> I don't think I can. And I think the statement about this only being used for 32bit
> could be wrong. I was misled to think so because those are only defined in
> probe_32, but the 64bit counterpart doesn't contain much aside from that.
> 
> Anyway, I played around with tracing a bit more. So with this change:
> 
>                 if (need_resched()) {
>                         srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
>                         if (need_resched() != should_resched()) {
> +                               trace_printk("need(%i) != should(%i)\n",
> +                                       need_resched(), should_resched());
> +                               trace_printk("exit_reason=%u\n",
> +                                       vcpu->run->exit_reason);
> +                               trace_printk("preempt_count=%lx\n",
> +                                       __this_cpu_read_4(__preempt_count));
> +                               tracing_stop();
> +                               printk(KERN_ERR "Stopped tracing, due to inconsistent state.\n");
>                         }
> -                       cond_resched();
> +                       schedule();
>                         vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
>                 }
> 
> I get the following (weird) output:
> 
>             Xorg-1078  [001] d...    71.270251: native_smp_send_reschedule <-resched_task
>             Xorg-1078  [001] d...    71.270251: default_send_IPI_mask_logical <-native_smp_send_reschedule
>       bamfdaemon-2318  [001] d...    71.270465: resched_task <-check_preempt_wakeup
>       bamfdaemon-2318  [001] d...    71.270539: resched_task <-check_preempt_wakeup
>           compiz-2365  [001] d...    71.270689: resched_task <-check_preempt_wakeup
>           compiz-2365  [001] d...    71.270827: resched_task <-check_preempt_wakeup
>           compiz-2365  [001] d...    71.270940: resched_task <-check_preempt_wakeup
>  qemu-system-x86-2679  [000] dn..    71.270999: smp_reschedule_interrupt <-reschedule_interrupt
>  qemu-system-x86-2679  [000] dn..    71.270999: scheduler_ipi <-smp_reschedule_interrupt
>  qemu-system-x86-2679  [000] .N..    71.271001: kvm_arch_vcpu_ioctl_run: need(1) != should(0)
>  qemu-system-x86-2679  [000] .N..    71.271002: kvm_arch_vcpu_ioctl_run: exit_reason=2
>  qemu-system-x86-2679  [000] .N..    71.271003: kvm_arch_vcpu_ioctl_run: preempt_count=0
> 
> So am I reading this right, that the interrupt did get delivered to cpu#0 while
> the thread_info already had the resched flag set? So this really should have
> cleared the bit in preempt_count. But while the trace info shows 'N', for some
> reason should_resched() returns false, and at the same time reading the preempt
> count manually shows it as 0?

*blink*... That's weird indeed... do you have the asm that goes along
with that?
Peter Zijlstra Feb. 13, 2014, 6:25 p.m. UTC | #9
On Thu, Feb 13, 2014 at 06:00:19PM +0100, Stefan Bader wrote:
> On 12.02.2014 12:54, Peter Zijlstra wrote:
> > On Wed, Feb 12, 2014 at 12:09:29PM +0100, Stefan Bader wrote:
> >> Something else here I run a kernel with CONFIG_PREEMPT not set and NR_CPUS
> >> limited to 8 (for the 32bit kernel). So the default apic driver is used. Since
> >> default_send_IPI_mask_logical is only used from there, I assume the trace you
> >> got does the same. Maybe something there is wrong which would explain why we
> >> only see it on 32bit hosts.
> > 
> > Can you try with a different APIC driver to test this?
> > 
> I don't think I can. And I think the statement about this only being used for 32bit
> could be wrong. I was misled to think so because those are only defined in
> probe_32, but the 64bit counterpart doesn't contain much aside from that.
> 
> Anyway, I played around with tracing a bit more. So with this change:
> 
>                 if (need_resched()) {
>                         srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
>                         if (need_resched() != should_resched()) {
> +                               trace_printk("need(%i) != should(%i)\n",
> +                                       need_resched(), should_resched());
> +                               trace_printk("exit_reason=%u\n",
> +                                       vcpu->run->exit_reason);
> +                               trace_printk("preempt_count=%lx\n",
> +                                       __this_cpu_read_4(__preempt_count));
> +                               tracing_stop();
> +                               printk(KERN_ERR "Stopped tracing, due to inconsistent state.\n");
>                         }
> -                       cond_resched();
> +                       schedule();
>                         vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
>                 }
> 
> I get the following (weird) output:
> 
>             Xorg-1078  [001] d...    71.270251: native_smp_send_reschedule <-resched_task
>             Xorg-1078  [001] d...    71.270251: default_send_IPI_mask_logical <-native_smp_send_reschedule
>       bamfdaemon-2318  [001] d...    71.270465: resched_task <-check_preempt_wakeup
>       bamfdaemon-2318  [001] d...    71.270539: resched_task <-check_preempt_wakeup
>           compiz-2365  [001] d...    71.270689: resched_task <-check_preempt_wakeup
>           compiz-2365  [001] d...    71.270827: resched_task <-check_preempt_wakeup
>           compiz-2365  [001] d...    71.270940: resched_task <-check_preempt_wakeup
>  qemu-system-x86-2679  [000] dn..    71.270999: smp_reschedule_interrupt <-reschedule_interrupt
>  qemu-system-x86-2679  [000] dn..    71.270999: scheduler_ipi <-smp_reschedule_interrupt
>  qemu-system-x86-2679  [000] .N..    71.271001: kvm_arch_vcpu_ioctl_run: need(1) != should(0)
>  qemu-system-x86-2679  [000] .N..    71.271002: kvm_arch_vcpu_ioctl_run: exit_reason=2
>  qemu-system-x86-2679  [000] .N..    71.271003: kvm_arch_vcpu_ioctl_run: preempt_count=0
> 
> So am I reading this right, that the interrupt did get delivered to cpu#0 while
> the thread_info already had the resched flag set? So this really should have
> cleared the bit in preempt_count. But while the trace info shows 'N', for some
> reason should_resched() returns false, and at the same time reading the preempt
> count manually shows it as 0?

So the assembly merges the first and second should_resched() loads, so it's
possible that load got reordered before the interrupt.

The 3rd preempt_count load gets re-issued and so that would show the
'true' value again.

If you want to force a reload after the condition, put in a barrier().

In any case; this looks like a false-positive. Please try again until
you get one where the interrupt doesn't happen and we stay in 'n' state.

Peter Zijlstra Feb. 13, 2014, 6:26 p.m. UTC | #10
On Thu, Feb 13, 2014 at 07:03:56PM +0100, Stefan Bader wrote:
> Yeah... not sure the interleaved source helps or not ...

It did, thanks!
Stefan Bader Feb. 14, 2014, 10:55 a.m. UTC | #11
On 13.02.2014 19:25, Peter Zijlstra wrote:
> On Thu, Feb 13, 2014 at 06:00:19PM +0100, Stefan Bader wrote:
>> On 12.02.2014 12:54, Peter Zijlstra wrote:
>>> On Wed, Feb 12, 2014 at 12:09:29PM +0100, Stefan Bader wrote:
>>>> Something else here I run a kernel with CONFIG_PREEMPT not set and NR_CPUS
>>>> limited to 8 (for the 32bit kernel). So the default apic driver is used. Since
>>>> default_send_IPI_mask_logical is only used from there, I assume the trace you
>>>> got does the same. Maybe something there is wrong which would explain why we
>>>> only see it on 32bit hosts.
>>>
>>> Can you try with a different APIC driver to test this?
>>>
>> I don't think I can. And I think the statement about this only being used for 32bit
>> could be wrong. I was misled to think so because those are only defined in
>> probe_32, but the 64bit counterpart doesn't contain much aside from that.
>>
>> Anyway, I played around with tracing a bit more. So with this change:
>>
>>                 if (need_resched()) {
>>                         srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
>>                         if (need_resched() != should_resched()) {
>> +                               trace_printk("need(%i) != should(%i)\n",
>> +                                       need_resched(), should_resched());
>> +                               trace_printk("exit_reason=%u\n",
>> +                                       vcpu->run->exit_reason);
>> +                                trace_printk("preempt_count=%lx\n",
>> +                                        __this_cpu_read_4(__preempt_count));
>> +                                tracing_stop();
>> +                                printk(KERN_ERR "Stopped tracing, due to inconsistent state.\n");
>>                         }
>>  +                      schedule();
>>  -                      cond_resched();
>>                         vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
>>                 }
>>
>> I get the following (weird) output:
>>
>>             Xorg-1078  [001] d...    71.270251: native_smp_send_reschedule <-resched_task
>>             Xorg-1078  [001] d...    71.270251: default_send_IPI_mask_logical <-native_smp_send_reschedule
>>       bamfdaemon-2318  [001] d...    71.270465: resched_task <-check_preempt_wakeup
>>       bamfdaemon-2318  [001] d...    71.270539: resched_task <-check_preempt_wakeup
>>           compiz-2365  [001] d...    71.270689: resched_task <-check_preempt_wakeup
>>           compiz-2365  [001] d...    71.270827: resched_task <-check_preempt_wakeup
>>           compiz-2365  [001] d...    71.270940: resched_task <-check_preempt_wakeup
>>  qemu-system-x86-2679  [000] dn..    71.270999: smp_reschedule_interrupt <-reschedule_interrupt
>>  qemu-system-x86-2679  [000] dn..    71.270999: scheduler_ipi <-smp_reschedule_interrupt
>>  qemu-system-x86-2679  [000] .N..    71.271001: kvm_arch_vcpu_ioctl_run: need(1) != should(0)
>>  qemu-system-x86-2679  [000] .N..    71.271002: kvm_arch_vcpu_ioctl_run: exit_reason=2
>>  qemu-system-x86-2679  [000] .N..    71.271003: kvm_arch_vcpu_ioctl_run: preempt_count=0
>>
>> So am I reading this right, that the interrupt did get delivered to cpu#0 while
>> the thread info already had the resched flag set. So this really should have
>> cleared the bit in preempt_count. But while the trace info shows 'N', for some
>> reason should_resched() returns false while at the same time reading the
>> preempt count manually shows it as 0?
> 
> So the assembly merges the first and second should_resched() loads, so it's
> possible that load was issued before the interrupt.
> 
> The 3rd preempt_count load gets re-issued and so that would show the
> 'true' value again.
> 
> If you want to force a reload after the condition; put in a barrier().
> 
> In any case, this looks like a false positive. Please try again until
> you get one where the interrupt doesn't happen and we stay in the 'n' state.
> 
Ok, I think I now got a log of the actual issue. It seems cpu#1 missed out on
handling a reschedule interrupt but did send one to cpu#0, and on cpu#0, while
handling the interrupt, the tif flag was not set (yet?), but then when it is,
there is no more interrupt to do the folding...

 qemu-system-x86-2503  [000] d.h.  1145.887872: default_send_IPI_mask_logical <-native_smp_send_reschedule
 qemu-system-x86-2503  [000] d...  1145.887888: resched_task <-check_preempt_curr
          <idle>-0     [000] d.h.  1145.888001: resched_task <-check_preempt_curr
          <idle>-0     [001] dn..  1145.888788: smp_reschedule_interrupt <-reschedule_interrupt
          <idle>-0     [001] dn..  1145.888789: scheduler_ipi <-smp_reschedule_interrupt
  gnome-terminal-2440  [000] d...  1145.888957: resched_task <-check_preempt_curr
  gnome-terminal-2440  [000] d...  1145.888959: native_smp_send_reschedule <-resched_task
  gnome-terminal-2440  [000] d...  1145.888959: default_send_IPI_mask_logical <-native_smp_send_reschedule
          <idle>-0     [001] dn..  1145.889296: smp_reschedule_interrupt <-reschedule_interrupt
          <idle>-0     [001] dn..  1145.889297: scheduler_ipi <-smp_reschedule_interrupt
  gnome-terminal-2440  [000] d...  1145.889465: resched_task <-check_preempt_curr
  gnome-terminal-2440  [000] d...  1145.889467: native_smp_send_reschedule <-resched_task
  gnome-terminal-2440  [000] d...  1145.889467: default_send_IPI_mask_logical <-native_smp_send_reschedule
            Xorg-1054  [001] d.h.  1145.889469: resched_task <-check_preempt_wakeup
            Xorg-1054  [001] d.h.  1145.889469: native_smp_send_reschedule <-resched_task
            Xorg-1054  [001] d.h.  1145.889469: default_send_IPI_mask_logical <-native_smp_send_reschedule
 qemu-system-x86-2503  [000] d.s.  1145.889643: smp_reschedule_interrupt <-reschedule_interrupt
 qemu-system-x86-2503  [000] d.s.  1145.889643: scheduler_ipi <-smp_reschedule_interrupt
 qemu-system-x86-2503  [000] .n..  1145.889647: kvm_arch_vcpu_ioctl_run: sched inconsistency
 qemu-system-x86-2503  [000] .n..  1145.889662: kvm_arch_vcpu_ioctl_run: sched inconsistency
 qemu-system-x86-2503  [000] .n..  1145.889664: kvm_arch_vcpu_ioctl_run: sched inconsistency
 qemu-system-x86-2503  [000] .n..  1145.889668: kvm_arch_vcpu_ioctl_run: need(1) != should(0)
 qemu-system-x86-2503  [000] .n..  1145.889669: kvm_arch_vcpu_ioctl_run: exit_reason=2
 qemu-system-x86-2503  [000] .n..  1145.889669: kvm_arch_vcpu_ioctl_run: preempt_count=80000000
Peter Zijlstra Feb. 14, 2014, 1:17 p.m. UTC | #12
On Fri, Feb 14, 2014 at 11:55:25AM +0100, Stefan Bader wrote:
> Ok, I think I now got a log of the actual issue. It seems cpu#1 missed out on
> handling a reschedule interrupt but did send one to cpu#0 and on cpu#0 while
> handling the interrupt the tif flag was not set (yet?) but then when it is,
> there is no more interrupt to do the folding... (sorry for the wrapping)...
> 
>  qemu-system-x86-2503  [000] d.h.  1145.887872: default_send_IPI_mask_logical <-native_smp_send_reschedule
>  qemu-system-x86-2503  [000] d...  1145.887888: resched_task <-check_preempt_curr
>           <idle>-0     [000] d.h.  1145.888001: resched_task <-check_preempt_curr

>           <idle>-0     [001] dn..  1145.888788: smp_reschedule_interrupt <-reschedule_interrupt
>           <idle>-0     [001] dn..  1145.888789: scheduler_ipi <-smp_reschedule_interrupt

>   gnome-terminal-2440  [000] d...  1145.888957: resched_task <-check_preempt_curr
>   gnome-terminal-2440  [000] d...  1145.888959: native_smp_send_reschedule <-resched_task
>   gnome-terminal-2440  [000] d...  1145.888959: default_send_IPI_mask_logical <-native_smp_send_reschedule

>           <idle>-0     [001] dn..  1145.889296: smp_reschedule_interrupt <-reschedule_interrupt
>           <idle>-0     [001] dn..  1145.889297: scheduler_ipi <-smp_reschedule_interrupt

>   gnome-terminal-2440  [000] d...  1145.889465: resched_task <-check_preempt_curr
>   gnome-terminal-2440  [000] d...  1145.889467: native_smp_send_reschedule <-resched_task
>   gnome-terminal-2440  [000] d...  1145.889467: default_send_IPI_mask_logical <-native_smp_send_reschedule

>             Xorg-1054  [001] d.h.  1145.889469: resched_task <-check_preempt_wakeup
>             Xorg-1054  [001] d.h.  1145.889469: native_smp_send_reschedule <-resched_task
>             Xorg-1054  [001] d.h.  1145.889469: default_send_IPI_mask_logical <-native_smp_send_reschedule

>  qemu-system-x86-2503  [000] d.s.  1145.889643: smp_reschedule_interrupt <-reschedule_interrupt
>  qemu-system-x86-2503  [000] d.s.  1145.889643: scheduler_ipi <-smp_reschedule_interrupt
>  qemu-system-x86-2503  [000] .n..  1145.889647: kvm_arch_vcpu_ioctl_run: sched inconsistency
>  qemu-system-x86-2503  [000] .n..  1145.889662: kvm_arch_vcpu_ioctl_run: sched inconsistency
>  qemu-system-x86-2503  [000] .n..  1145.889664: kvm_arch_vcpu_ioctl_run: sched inconsistency
>  qemu-system-x86-2503  [000] .n..  1145.889668: kvm_arch_vcpu_ioctl_run: need(1) != should(0)
>  qemu-system-x86-2503  [000] .n..  1145.889669: kvm_arch_vcpu_ioctl_run: exit_reason=2
>  qemu-system-x86-2503  [000] .n..  1145.889669: kvm_arch_vcpu_ioctl_run: preempt_count=80000000
> 

So afaiu you're running this on an AMD fam 0xf which doesn't have
synchronized TSC; so again the timing might be off.. however:

The first block from cpu 0, sends an IPI to cpu 1, cpu 1 sees 'n'.

Then again, cpu 0 sends to cpu 1, we observe 'n'.

So far so good and timing seems to reflect causality.

Then however, cpu 0 sends, but nothing in this trace answers on cpu 1..
odd.

Then cpu 1 sends to cpu 0, cpu 0 does receive the IPI but does _not_
observe 'n', which is mighty odd. However after the IPI is handled cpu 0
magically gets the 'n'.



So one thing I did wonder about: if cpu 0 sends an IPI while cpu 1 is
already processing the same interrupt, cpu 1 should latch the interrupt
and it would re-trigger when we re-enable interrupts after we're done
with the first.

The only thing that would ruin that is late-ack, eg. ack (clear the
latch) when exiting the interrupt; but I checked and we ack before we
call scheduler_ipi().

Now; if this were to happen (but I don't exactly see how on a 2 CPU
system, nor do I see this in the trace), there might be a hole...

CPU 0				|	CPU 1
--------------------------------+--------------------------------
                                |
/* ! resched_task */		|
smp_send_reschedule(1)		|L
				|L  smp_reschedule_interrupt()
resched_task()			|L
  set_tsk_need_resched()	|L
  smp_mb()			|L
  smp_send_reschedule(1)	|L
				|L    ack_APIC_irq()
				|     scheduler_ipi()
				|       need_resched();

Now strictly speaking that need_resched() load could creep up before
the store and we'd fail to observe 'n'.

However, x86 forbids reads from passing an earlier write to the same
location (it allows passing earlier writes in general). At which point
the only other option is GCC emitting that load far earlier, in fact
before ack_APIC_irq(), but that would be a violation of the C std
because scheduler_ipi() is a (real) function call and acts as a compiler
barrier AFAIK (as opposed to inlining which doesn't act as a compiler
barrier).

So colour me confused..


You could try putting an smp_mb() at the very start of scheduler_ipi()
just to make absolutely sure, but it _should_ not matter.

Of course, I could be completely off base with how interrupts work; and
there's a wide gaping hole in there, in which case, someone please
educate me ;-)
Borislav Petkov Feb. 14, 2014, 1:34 p.m. UTC | #13
On Thu, Feb 13, 2014 at 07:26:05PM +0100, Peter Zijlstra wrote:
> On Thu, Feb 13, 2014 at 07:03:56PM +0100, Stefan Bader wrote:
> > Yeah... not sure the interleaved source helps or not ...
> 
> It did, thanks!

Stefan, can you also send sched/core.s? I'm particularly interested in
what resched_task() has been turned into by your build env.

Thanks.
Borislav Petkov Feb. 14, 2014, 2:47 p.m. UTC | #14
On Fri, Feb 14, 2014 at 03:24:09PM +0100, Stefan Bader wrote:
> Actually, this code just makes so much more sense if I let objdump do
> relocation info...

Ok, we're pretty sure you have an MFENCE there in resched_task but can
you confirm it please.

First, does /proc/cpuinfo have the "sse2" string? It should but nowadays
I don't trust anything.

Then, can you boot that kernel in qemu with the -gdb flag so that it
starts the gdb stub, here's the manpage about it:

       -gdb dev
           Wait for gdb connection on device dev. Typical connections will likely be
           TCP-based, but also UDP, pseudo TTY, or even stdio are reasonable use case.
           The latter is allowing to start QEMU from within gdb and establish the
           connection via a pipe:

                   (gdb) target remote | exec qemu-system-i386 -gdb stdio ...

       -s  Shorthand for -gdb tcp::1234, i.e. open a gdbserver on TCP port 1234.

then boot the guest and when it is up, do

$ gdb ./vmlinux
$ target remote localhost:1234

and type in the prompt

$ (gdb) x/50i resched_task

It gives here:

(gdb) x/50i resched_task
   0xffffffff810836f0 <resched_task>:   data32 data32 data32 xchg %ax,%ax
   0xffffffff810836f5 <resched_task+5>: push   %rbp
   0xffffffff810836f6 <resched_task+6>: mov    0x85e123(%rip),%r10d        # 0xffffffff818e1820 <debug_locks>
   0xffffffff810836fd <resched_task+13>:        mov    %rsp,%rbp
   0xffffffff81083700 <resched_task+16>:        push   %r12
   0xffffffff81083702 <resched_task+18>:        test   %r10d,%r10d
   0xffffffff81083705 <resched_task+21>:        push   %rbx
   0xffffffff81083706 <resched_task+22>:        mov    %rdi,%rbx
   0xffffffff81083709 <resched_task+25>:        jne    0xffffffff81083760 <resched_task+112>
   0xffffffff8108370b <resched_task+27>:        mov    0x8(%rbx),%rax
   0xffffffff8108370f <resched_task+31>:        mov    0x10(%rax),%rdx
   0xffffffff81083713 <resched_task+35>:        and    $0x8,%edx
   0xffffffff81083716 <resched_task+38>:        jne    0xffffffff8108373c <resched_task+76>
   0xffffffff81083718 <resched_task+40>:        lock orb $0x8,0x10(%rax)
   0xffffffff8108371d <resched_task+45>:        mov    0x8(%rbx),%rax
   0xffffffff81083721 <resched_task+49>:        mov    0x18(%rax),%r12d
   0xffffffff81083725 <resched_task+53>:        callq  0xffffffff812d8fc0 <debug_smp_processor_id>
   0xffffffff8108372a <resched_task+58>:        cmp    %r12d,%eax
   0xffffffff8108372d <resched_task+61>:        je     0xffffffff810837a0 <resched_task+176>
   0xffffffff8108372f <resched_task+63>:        mfence
   						^^^^^^
I want to make sure the smp_mb() is really replaced with an MFENCE there.

Thanks!
Borislav Petkov Feb. 14, 2014, 3:21 p.m. UTC | #15
Oh, and just in case this is relatively easy to reproduce and in case we
don't have any other idea, bisection might be another option. I'm not
saying you should do it right away - I'm just putting it on the table...

:-)

:-)
Stefan Bader Feb. 14, 2014, 3:28 p.m. UTC | #16
On 14.02.2014 16:21, Borislav Petkov wrote:
> Oh, and just in case this is relatively easy to reproduce and in case we
> don't have any other idea, bisection might be another option. I'm not
> saying you should do it right away - I'm just putting it on the table...
> 
> :-)
> 
> :-)
> 

Oh yeah, bisection is nearly as entertaining as doing my tax records. Hm, on the
other hand those will have to be done at some point too... :-P
Borislav Petkov Feb. 14, 2014, 3:44 p.m. UTC | #17
On Fri, Feb 14, 2014 at 04:28:16PM +0100, Stefan Bader wrote:
> Oh yeah, bisection is nearly as entertaining as doing my tax records.
> Hm, on the other hand those will have to be done at some point too...
> :-P

Ah, tax records, crap, I have those lying around in the corner since
forever... I need to do those too :-(
Peter Zijlstra Feb. 14, 2014, 4:21 p.m. UTC | #18
On Fri, Feb 14, 2014 at 04:21:32PM +0100, Borislav Petkov wrote:
> Oh, and just in case this is relatively easy to reproduce and in case we
> don't have any other idea, bisection might be another option. I'm not
> saying you should do it right away - I'm just putting it on the table...

I'm fairly sure you're going to hit the patch that introduced the whole
per-cpu preemption thing. Before that not receiving the IPI wasn't as
bad as it is now.
Stefan Bader Feb. 14, 2014, 5:02 p.m. UTC | #19
On 14.02.2014 15:47, Borislav Petkov wrote:
> On Fri, Feb 14, 2014 at 03:24:09PM +0100, Stefan Bader wrote:
>> Actually, this code just makes so much more sense if I let objdump do
>> relocation info...
> 
> Ok, we're pretty sure you have an MFENCE there in resched_task but can
> you confirm it please.
> 
> First, does /proc/cpuinfo have the "sse2" string? It should but nowadays
> I don't trust anything.
> 
> Then, can you boot that kernel in qemu with the -gdb flag so that it
> starts the gdb stub, here's the manpage about it:
> 
>        -gdb dev
>            Wait for gdb connection on device dev. Typical connections will likely be
>            TCP-based, but also UDP, pseudo TTY, or even stdio are reasonable use case.
>            The latter is allowing to start QEMU from within gdb and establish the
>            connection via a pipe:
> 
>                    (gdb) target remote | exec qemu-system-i386 -gdb stdio ...
> 
>        -s  Shorthand for -gdb tcp::1234, i.e. open a gdbserver on TCP port 1234.
> 
> then boot the guest and when it is up, do
> 
> $ gdb ./vmlinux
> $ target remote localhost:1234
> 
> and type in the prompt
> 
> $ (gdb) x/50i resched_task
> 
> It gives here:
> 
> (gdb) x/50i resched_task
>    0xffffffff810836f0 <resched_task>:   data32 data32 data32 xchg %ax,%ax
>    0xffffffff810836f5 <resched_task+5>: push   %rbp
>    0xffffffff810836f6 <resched_task+6>: mov    0x85e123(%rip),%r10d        # 0xffffffff818e1820 <debug_locks>
>    0xffffffff810836fd <resched_task+13>:        mov    %rsp,%rbp
>    0xffffffff81083700 <resched_task+16>:        push   %r12
>    0xffffffff81083702 <resched_task+18>:        test   %r10d,%r10d
>    0xffffffff81083705 <resched_task+21>:        push   %rbx
>    0xffffffff81083706 <resched_task+22>:        mov    %rdi,%rbx
>    0xffffffff81083709 <resched_task+25>:        jne    0xffffffff81083760 <resched_task+112>
>    0xffffffff8108370b <resched_task+27>:        mov    0x8(%rbx),%rax
>    0xffffffff8108370f <resched_task+31>:        mov    0x10(%rax),%rdx
>    0xffffffff81083713 <resched_task+35>:        and    $0x8,%edx
>    0xffffffff81083716 <resched_task+38>:        jne    0xffffffff8108373c <resched_task+76>
>    0xffffffff81083718 <resched_task+40>:        lock orb $0x8,0x10(%rax)
>    0xffffffff8108371d <resched_task+45>:        mov    0x8(%rbx),%rax
>    0xffffffff81083721 <resched_task+49>:        mov    0x18(%rax),%r12d
>    0xffffffff81083725 <resched_task+53>:        callq  0xffffffff812d8fc0 <debug_smp_processor_id>
>    0xffffffff8108372a <resched_task+58>:        cmp    %r12d,%eax
>    0xffffffff8108372d <resched_task+61>:        je     0xffffffff810837a0 <resched_task+176>
>    0xffffffff8108372f <resched_task+63>:        mfence
>    						^^^^^^
> I want to make sure the smp_mb() is really replaced with an MFENCE there.
> 
> Thanks!
> 
Okaaay, I think I did what you asked. So yes, there is sse2 in the cpu info. And
there is a mfence in the disassembly:

   0xc107dcb0 <resched_task>:	push   %ebp
   0xc107dcb1 <resched_task+1>:	mov    %esp,%ebp
   0xc107dcb3 <resched_task+3>:	lea    %ds:0x0(%esi,%eiz,1),%esi
   0xc107dcb8 <resched_task+8>:	mov    0x4(%eax),%edx
   0xc107dcbb <resched_task+11>:	mov    0x8(%edx),%ecx
   0xc107dcbe <resched_task+14>:	and    $0x8,%ecx
   0xc107dcc1 <resched_task+17>:	jne    0xc107dce7 <resched_task+55>
   0xc107dcc3 <resched_task+19>:	orb    $0x8,%ds:0x8(%edx)
   0xc107dcc8 <resched_task+24>:	mov    0x4(%eax),%edx
   0xc107dccb <resched_task+27>:	mov    %fs:0xc1a71010,%ecx
   0xc107dcd2 <resched_task+34>:	mov    0x10(%edx),%edx
   0xc107dcd5 <resched_task+37>:	cmp    %ecx,%edx
   0xc107dcd7 <resched_task+39>:	je     0xc107dd00 <resched_task+80>
   0xc107dcd9 <resched_task+41>:	mfence
   0xc107dcdc <resched_task+44>:	mov    %esi,%esi
   0xc107dcde <resched_task+46>:	mov    0x4(%eax),%eax
   0xc107dce1 <resched_task+49>:	testb  $0x4,0xc(%eax)
   0xc107dce5 <resched_task+53>:	je     0xc107dcf0 <resched_task+64>
   0xc107dce7 <resched_task+55>:	pop    %ebp
   0xc107dce8 <resched_task+56>:	ret
   0xc107dce9 <resched_task+57>:	lea    0x0(%esi,%eiz,1),%esi
   0xc107dcf0 <resched_task+64>:	mov    %edx,%eax
   0xc107dcf2 <resched_task+66>:	call   *0xc1917950
   0xc107dcf8 <resched_task+72>:	pop    %ebp

Thinking about it, I guess Peter is quite right in saying that I will likely end
up on the patch that converted preempt_count to percpu.

One thing I likely should do is to reinstall the exact same laptop with 64bit
kernel and userspace... maybe only 64bit kernel first... and make sure on my
side that this does not show up on 64bit, too. I took the word of reporters for
that (and the impression that otherwise many more people would have complained).
Peter Zijlstra Feb. 14, 2014, 5:21 p.m. UTC | #20
On Fri, Feb 14, 2014 at 06:02:32PM +0100, Stefan Bader wrote:
> One thing I likely should do is to reinstall the exact same laptop with 64bit
> kernel and userspace... maybe only 64bit kernel first... and make sure on my
> side that this does not show up on 64bit, too. I took the word of reporters for
> that (and the impression that otherwise many more people would have complained).

Yeah, I'm going to try and install some 32bit userspace on a usb
harddisk I've got and see if I can boot my Core2 laptop from that to try
and reproduce.

But all that is probably going to be Monday :/
Borislav Petkov Feb. 14, 2014, 5:33 p.m. UTC | #21
On Fri, Feb 14, 2014 at 06:02:32PM +0100, Stefan Bader wrote:
> Okaaay, I think I did what you asked. So yes, there is sse2 in the cpu info. And
> there is a mfence in the disassembly:

Btw, I just realized booting the kernel in the guest was a dumb idea,
because, doh, the guest is not baremetal. The only reliable thing we
can say is that sse2 is present and that MFENCE alternative replacement
works :)

But for simplicity's sake let's just assume the machine can do MFENCE
just fine and it gets replaced by the alternatives code.

Besides, if that weren't true, we'd have a whole lot of other problems
on those boxes.

> Thinking about it, I guess Peter is quite right saying that I likely
> will end on the patch that converted preempt_count to percpu.

Yeah, c2daa3bed53a81171cf8c1a36db798e82b91afe8 et al.

> One thing I likely should do is to reinstall the exact same laptop
> with 64bit kernel and userspace... maybe only 64bit kernel first...
> and make sure on my side that this does not show up on 64bit, too. I
> took the word of reporters for that (and the impression that otherwise
> many more people would have complained).

Yeah, that should be a prudent thing to do.

Also, Paolo and I were wondering whether you can trigger this thing
without kvm, i.e. virtualization involved... do you have any data on
that?

Thanks.
Stefan Bader Feb. 14, 2014, 6:23 p.m. UTC | #22
On 14.02.2014 18:33, Borislav Petkov wrote:
> On Fri, Feb 14, 2014 at 06:02:32PM +0100, Stefan Bader wrote:
>> Okaaay, I think I did what you asked. So yes, there is sse2 in the cpu info. And
>> there is a mfence in the disassembly:
> 
> Btw, I just realized booting the kernel in the guest was a dumb idea,
> because, doh, the guest is not baremetal. The only reliable thing we
> can say is that sse2 is present and that MFENCE alternative replacement
> works :)
> 
> But for simplicity's sake let's just assume the machine can do MFENCE
> just fine and it gets replaced by the alternatives code.
> 
> Besides, if that weren't true, we'd have a whole lot of other problems
> on those boxes.
> 
>> Thinking about it, I guess Peter is quite right saying that I likely
>> will end on the patch that converted preempt_count to percpu.
> 
> Yeah, c2daa3bed53a81171cf8c1a36db798e82b91afe8 et al.
> 
>> One thing I likely should do is to reinstall the exact same laptop
>> with 64bit kernel and userspace... maybe only 64bit kernel first...
>> and make sure on my side that this does not show up on 64bit, too. I
>> took the word of reporters for that (and the impression that otherwise
>> many more people would have complained).
> 
> Yeah, that should be a prudent thing to do.
> 
> Also, Paolo and I were wondering whether you can trigger this thing
> without kvm, i.e. virtualization involved... do you have any data on
> that?

Unfortunately no hard evidence. Kvm just happens to be such a good way to notice
this, as it is using the reschedule interrupt itself and has this exit before
running the guest vcpu, to handle it in the outer loop by calling cond_resched()
and repeating.
I find running kvm seems to make that laptop quite sluggish in responding to
other tasks (in that install) and I got some oddness going on when lightdm quite
often refuses to take keyboard input without opening some menu with the mouse
first... But I could not be sure whether that is the kernel or some new
user-space ... errr "feature".
At least Marcello (iirc that other report came from him directly or indirectly)
has seen it, too. And he likely has a completely different user-space.

So I will go and do that different (64bit) kernel and kernel + user-space test.
But like for Peter, it likely is a Monday thing...
Stefan Bader Feb. 14, 2014, 7:03 p.m. UTC | #23
On 14.02.2014 19:23, Stefan Bader wrote:
> On 14.02.2014 18:33, Borislav Petkov wrote:
>> On Fri, Feb 14, 2014 at 06:02:32PM +0100, Stefan Bader wrote:
>>> Okaaay, I think I did what you asked. So yes, there is sse2 in the cpu info. And
>>> there is a mfence in the disassembly:
>>
>> Btw, I just realized booting the kernel in the guest was a dumb idea,
>> because, doh, the guest is not baremetal. The only reliable thing we
>> can say is that sse2 is present and that MFENCE alternative replacement
>> works :)
>>
>> But for simplicity's sake let's just assume the machine can do MFENCE
>> just fine and it gets replaced by the alternatives code.
>>
>> Besides, if that weren't true, we'd have a whole lot of other problems
>> on those boxes.
>>
>>> Thinking about it, I guess Peter is quite right saying that I likely
>>> will end on the patch that converted preempt_count to percpu.
>>
>> Yeah, c2daa3bed53a81171cf8c1a36db798e82b91afe8 et al.
>>
>>> One thing I likely should do is to reinstall the exact same laptop
>>> with 64bit kernel and userspace... maybe only 64bit kernel first...
>>> and make sure on my side that this does not show up on 64bit, too. I
>>> took the word of reporters for that (and the impression that otherwise
>>> many more people would have complained).
>>
>> Yeah, that should be a prudent thing to do.
>>
>> Also, Paolo and I were wondering whether you can trigger this thing
>> without kvm, i.e. virtualization involved... do you have any data on
>> that?
> 
> Unfortunately no hard evidence. Kvm just happens to be such a good way to notice
> this as it is using the reschedule interrupt itself and has this exit before
> running the guest vcpu to handle it in the outer loop by calling cond_resched()
> and repeat.
> I find running kvm seems to make that laptop quite sluggish in responding to
> other tasks (in that install) and I got some oddness going on when lightdm quite
> often refuses to take keyboard input without opening some menu with the mouse
> first... But I could not be sure whether that is the kernel or some new
> user-space ... errr "feature".
> At least Marcello (iirc that other report came from him directly or indirectly)
> has seen it, too. And he likely has a completely different user-space.
> 
> So I will go and do that different (64bit) kernel and kernel + user-space test.
> But like for Peter, it likely is a Monday thing...
> 

Ok, it is still Friday... So a quick test (2 boots of a 32bit guest, same as
before) on the 32bit user-space, with the same kernel source but compiled as
64bit (obviously not 100% the same config, but close). While I see the false
inconsistency messages (I modified the in-kernel test to trigger the trace stop
only if the __vcpu_run loop encounters an inconsistent state three times in a
row), I do not see the final stop message. Also (but that is more of a feeling)
the system seems to remain more responsive (switching to other windows, opening
terminal windows, ...) compared to the 32bit kernel.
Stefan Bader Feb. 20, 2014, 3:38 p.m. UTC | #24
On 14.02.2014 18:21, Peter Zijlstra wrote:
> On Fri, Feb 14, 2014 at 06:02:32PM +0100, Stefan Bader wrote:
>> One thing I likely should do is to reinstall the exact same laptop with 64bit
>> kernel and userspace... maybe only 64bit kernel first... and make sure on my
>> side that this does not show up on 64bit, too. I took the word of reporters for
>> that (and the impression that otherwise many more people would have complained).
> 
> > Yeah, I'm going to try and install some 32bit userspace on a usb
> harddisk I've got and see if I can boot my Core2 laptop from that to try
> and reproduce.
> 
> But all that is probably going to be Monday :/
> 
*sigh* Already Thursday...

Peter, did you get to reproduce this locally? Unfortunately I had some
interruptions and have no more information than on last Friday (which is that
the same hw with a 64bit kernel does not show it).

Meanwhile I wonder whether it would make sense to push the following (or more?)
to stable for 3.13.y:

1) 16824255394f55adf31b9a96a9965d8c15bdac4c
   x86, acpi, idle: Restructure the mwait idle routines
2) 7e98b71920464b8d15fa95c74366416cd3c88861
   x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers
3) 8cb75e0c4ec9786b81439761eac1d18d4a931af3
   sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding
4) 215393bc1fab3d61a5a296838bdffce22f27ffda
   sched/preempt/x86: Fix voluntary preempt for x86

1+2 would be to avoid touching 3 too much and looked to be improvements on their
own. 3+4 would be cherry-picks if not for some fuzz 2.
I saw a few more things labelled preempt between 3.13 and current HEAD but am
not sure whether, or which, of those are strictly required. Namely some fixes to
preempt_enable_no_resched() misuse and maybe one fixing an issue of ftrace
locking up.

-Stefan
Peter Zijlstra Feb. 20, 2014, 3:50 p.m. UTC | #25
On Thu, Feb 20, 2014 at 04:38:13PM +0100, Stefan Bader wrote:
> On 14.02.2014 18:21, Peter Zijlstra wrote:
> > On Fri, Feb 14, 2014 at 06:02:32PM +0100, Stefan Bader wrote:
> >> One thing I likely should do is to reinstall the exact same laptop with 64bit
> >> kernel and userspace... maybe only 64bit kernel first... and make sure on my
> >> side that this does not show up on 64bit, too. I took the word of reporters for
> >> that (and the impression that otherwise many more people would have complained).
> > 
> > Yeha, I'm going to try and install some 32bit userspace on a usb
> > harddisk I've got and see if I can boot my Core2 laptop from that to try
> > and reproduce.
> > 
> > But all that is probably going to be Monday :/
> > 
> *sigh* Already Thursday...
> 
> Peter, did you get to reproduce this locally? Unfortunately I had some
> interruptions and have no more information than on last Friday (which is that
> the same hw with a 64bit kernel does not show it).

I got side-tracked as well, someone reported crashes, which come above
weird behaviour :/

> Meanwhile I wonder whether it would make sense to push the following (or more?)
> to stable for 3.13.y:
> 
> 1) 16824255394f55adf31b9a96a9965d8c15bdac4c
>    x86, acpi, idle: Restructure the mwait idle routines
> 2) 7e98b71920464b8d15fa95c74366416cd3c88861
>    x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers
> 3) 8cb75e0c4ec9786b81439761eac1d18d4a931af3
>    sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding
> 4) 215393bc1fab3d61a5a296838bdffce22f27ffda
>    sched/preempt/x86: Fix voluntary preempt for x86
> 
> 1+2 would be to avoid touching 3 too much and looked to be improvements on their
> own. 3+4 would be cherry-picks if not for some fuzz 2.

I'll not object; but if someone wants a smaller set you could do with
just the idle.c hunk from 3.

> I saw a few more things labelled preempt between 3.13 and current HEAD but am
> not sure whether or which of those are strictly required. Namely some fixes
> for preempt_enable_no_resched() misuse and maybe one fixing an issue of
> ftrace locking up.

Yeah, that's all fresh breakage and doesn't go back to 3.13 I think.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paolo Bonzini March 24, 2014, 5:39 p.m. UTC | #26
Il 20/02/2014 16:50, Peter Zijlstra ha scritto:
>>>> > >> One thing I likely should do is to reinstall the exact same laptop with 64bit
>>>> > >> kernel and userspace... maybe only 64bit kernel first... and make sure on my
>>>> > >> side that this does not show up on 64bit, too. I took the word of reporters for
>>>> > >> that (and the impression that otherwise many more people would have complained).
>>> > >
>>> > > Yeha, I'm going to try and install some 32bit userspace on a usb
>>> > > harddisk I've got and see if I can boot my Core2 laptop from that to try
>>> > > and reproduce.
>>> > >
>>> > > But all that is probably going to be Monday :/
>>> > >
>> > *sigh* Already Thursday...
>> >
>> > Peter, did you get to reproduce this locally? Unfortunately I had some
>> > interruption and have not more Information than on last Friday (which is that
>> > the same hw but 64bit kernel does not show it).
> I got side-tracked as well; someone reported crashes, which take priority
> over weird behaviour :/
>

Stefan, Peter, any news here?

Paolo
Stefan Bader March 25, 2014, 8:23 a.m. UTC | #27
On 24.03.2014 18:39, Paolo Bonzini wrote:
> Il 20/02/2014 16:50, Peter Zijlstra ha scritto:
>>>>> > >> One thing I likely should do is to reinstall the exact same laptop with 64bit
>>>>> > >> kernel and userspace... maybe only 64bit kernel first... and make sure on my
>>>>> > >> side that this does not show up on 64bit, too. I took the word of reporters for
>>>>> > >> that (and the impression that otherwise many more people would have complained).
>>>> > >
>>>> > > Yeha, I'm going to try and install some 32bit userspace on a usb
>>>> > > harddisk I've got and see if I can boot my Core2 laptop from that to try
>>>> > > and reproduce.
>>>> > >
>>>> > > But all that is probably going to be Monday :/
>>>> > >
>>> > *sigh* Already Thursday...
>>> >
>>> > Peter, did you get to reproduce this locally? Unfortunately I had some
>>> > interruption and have not more Information than on last Friday (which is that
>>> > the same hw but 64bit kernel does not show it).
>> I got side-tracked as well; someone reported crashes, which take priority
>> over weird behaviour :/
>>
> 
> Stefan, Peter, any news here?
> 
> Paolo

No, unfortunately not. :( I thought I had some idea, as i386 seems to have soft
and hard irqs using a separate stack, and thought they maybe lose the process
flag there (x86_64 seems to always do that for soft irqs, and if it does for hard
irqs it was hidden well enough for me not to see). But trial and error showed
no improvement when I made a copy of the original TIF_NEED_RESCHED before
executing on the stack. So not really any further.
And then the usual distractions hit...

-Stefan

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 39c28f09dfd5..f8b2fc7ce491 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6115,7 +6115,11 @@  static int __vcpu_run(struct kvm_vcpu *vcpu)
                }
                if (need_resched()) {
                        srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
-                       cond_resched();
+                       if (need_resched() != should_resched()) {
+                               tracing_stop();
+                               printk(KERN_ERR "Stopped tracing, due to inconsistent state.\n");
+                       }
+                       schedule();
                        vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
                }
        }