[BUG] Assertion '(sp == 0) || (peoi[sp-1].vector < vector)' failed at irq.c:1163

Message ID 569BB05B.1000801@citrix.com (mailing list archive)
State New, archived

Commit Message

Andrew Cooper Jan. 17, 2016, 3:16 p.m. UTC
On 17/01/16 14:50, Håkon Alstadheim wrote:
> On 15 Jan. 2016 12:05, Andrew Cooper wrote:
>> On 15/01/16 10:58, Håkon Alstadheim wrote:
>>> CPUINFO:
>>> vendor_id    : GenuineIntel
>>> cpu family    : 6
>>> model        : 63
>>> model name    : Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
>>>
>>> # smbios-sys-info
>>> Libsmbios version:      2.2.28
>>> Product Name:           Z10PE-D8 WS
>>> Vendor:                 ASUSTeK COMPUTER INC.
>>> BIOS Version:           3101
>>>
>>>
>>> I have been experiencing issues with domains with passed through PCIe
>>> devices since I first installed xen. Then at version 4.5.x , I'm now
>>> at 4.6.0 with gentoo patches. Crashes SEEM mostly related to this pci
>>> pass through and interrupts (usb-cards, sound cards).
>>>
>>> Recently the system has been more stable, whether it is because I pass
>>> through as few things as possible, or because of improvements in Xen I
>>> do not know. I have also taken to building with debug, which leads to
>>> more abrupt but less mysterious failures. Earlier (w/o debug and under
>>> xen 4.5 ) stuff would just gradually stop working and end up in total
>>> hang of everything. So, hey, things are improving :-b
>> This isn't the first time we have seen this on Haswell processors. Do
>> you have microcode loading set up?
>>
>> ~Andrew
>>
> Still happening with kernel-genkernel-x86_64-4.1.15-gentoo and updated
> cpu microcode, using microcode from 20151106.

Ok - I previously investigated this issue, but my repro evaporated from
under my feet with a firmware update, and I never got to the bottom of it.

Please can you start with the following patch which will dump some more
information on crash.
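
For readers following along, the assertion in the subject line expects the pending-EOI stack to hold strictly increasing vectors, presumably because the local APIC only delivers an interrupt of higher priority than those still awaiting EOI. The sketch below is a simplified standalone model of that invariant, not Xen's code: `model_eoi_stack`, `model_push` and `MODEL_STACK_SIZE` are hypothetical names, and the dump mirrors what the debugging patch prints (using `%#x` for the vector).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define MODEL_STACK_SIZE 16 /* hypothetical; not Xen's real limit */

/* Simplified stand-in for one pending-EOI entry. */
struct model_pending_eoi {
    int irq;
    unsigned int vector;
    bool ready;
};

struct model_eoi_stack {
    struct model_pending_eoi slot[MODEL_STACK_SIZE];
    unsigned int sp; /* number of entries currently in use */
};

/*
 * Push an entry only if its vector is strictly higher than the one on
 * top of the stack. On a violation, dump the stack from the top down
 * and return false instead of asserting.
 */
bool model_push(struct model_eoi_stack *s, int irq, unsigned int vector)
{
    assert(s != NULL);

    if (s->sp != 0 && s->slot[s->sp - 1].vector >= vector) {
        for (unsigned int p = s->sp; p > 0; --p)
            printf("**peoi[%u] = {%d, %#x, %d}\n", p - 1,
                   s->slot[p - 1].irq, s->slot[p - 1].vector,
                   (int)s->slot[p - 1].ready);
        return false;
    }

    if (s->sp >= MODEL_STACK_SIZE)
        return false; /* model stack is full */

    s->slot[s->sp].irq = irq;
    s->slot[s->sp].vector = vector;
    s->slot[s->sp].ready = false;
    s->sp++;
    return true;
}
```

In this model, pushing vector 0x28 and then 0x30 succeeds, while a further push of 0x28 is rejected and the two existing entries are dumped.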

---8<---
         peoi[sp].irq = irq;

Comments

Håkon Alstadheim Jan. 17, 2016, 4:30 p.m. UTC | #1
On 17 Jan. 2016 16:16, Andrew Cooper wrote:
> On 17/01/16 14:50, Håkon Alstadheim wrote:
>> On 15 Jan. 2016 12:05, Andrew Cooper wrote:
>>> On 15/01/16 10:58, Håkon Alstadheim wrote:
>>>> CPUINFO:
>>>> vendor_id    : GenuineIntel
>>>> cpu family    : 6
>>>> model        : 63
>>>> model name    : Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
>>>>
>>>> # smbios-sys-info
>>>> Libsmbios version:      2.2.28
>>>> Product Name:           Z10PE-D8 WS
>>>> Vendor:                 ASUSTeK COMPUTER INC.
>>>> BIOS Version:           3101
>>>>
>>>>
>>>> I have been experiencing issues with domains with passed through PCIe
>>>> devices since I first installed xen. Then at version 4.5.x , I'm now
>>>> at 4.6.0 with gentoo patches. Crashes SEEM mostly related to this pci
>>>> pass through and interrupts (usb-cards, sound cards).
>>>>
>>>> Recently the system has been more stable, whether it is because I pass
>>>> through as few things as possible, or because of improvements in Xen I
>>>> do not know. I have also taken to building with debug, which leads to
>>>> more abrupt but less mysterious failures. Earlier (w/o debug and under
>>>> xen 4.5 ) stuff would just gradually stop working and end up in total
>>>> hang of everything. So, hey, things are improving :-b
>>> This isn't the first time we have seen this on Haswell processors. Do
>>> you have microcode loading set up?
>>>
>>> ~Andrew
>>>
>> Still happening with kernel-genkernel-x86_64-4.1.15-gentoo and updated
>> cpu microcode, using microcode from 20151106.
> Ok - I previously investigated this issue, but my repro evaporated from
> under my feet with a firmware update, and I never got to the bottom of it.
>
> Please can you start with the following patch which will dump some more
> information on crash.
>
> ---8<---
> diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
> index 1228568..588b562 100644
> --- a/xen/arch/x86/irq.c
> +++ b/xen/arch/x86/irq.c
> @@ -1165,6 +1165,13 @@ static void __do_IRQ_guest(int irq)
>      if ( action->ack_type == ACKTYPE_EOI )
>      {
>          sp = pending_eoi_sp(peoi);
> +        if ( unlikely(!((sp == 0) || (peoi[sp-1].vector < vector))) )
> +        {
> +            int p;
> +            for ( p = sp; p > 0; --p )
> +                printk("**peoi[%d] = {%d, 0x%u, %d}\n",
> +                       p-1, peoi[p-1].irq, peoi[p-1].vector,
> peoi[p-1].ready);
> +        }
>          ASSERT((sp == 0) || (peoi[sp-1].vector < vector));
>          ASSERT(sp < (NR_DYNAMIC_VECTORS-1));
>          peoi[sp].irq = irq;
>
>
Will do. Building now.
Seems there is a line accidentally folded "peoi[p-1].ready);" belongs at
the end of preceding line I presume?
Andrew Cooper Jan. 17, 2016, 11:12 p.m. UTC | #2
On 17/01/2016 23:07, Håkon Alstadheim wrote:
> On 17 Jan. 2016 17:30, Håkon Alstadheim wrote:
>> On 17 Jan. 2016 16:16, Andrew Cooper wrote:
>>> On 17/01/16 14:50, Håkon Alstadheim wrote:
>>>> On 15 Jan. 2016 12:05, Andrew Cooper wrote:
>>>>> On 15/01/16 10:58, Håkon Alstadheim wrote:
>>>>>> CPUINFO:
>>>>>> vendor_id    : GenuineIntel
>>>>>> cpu family    : 6
>>>>>> model        : 63
>>>>>> model name    : Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
>>>>>>
>>>>>> # smbios-sys-info
>>>>>> Libsmbios version:      2.2.28
>>>>>> Product Name:           Z10PE-D8 WS
>>>>>> Vendor:                 ASUSTeK COMPUTER INC.
>>>>>> BIOS Version:           3101
>>>>>>
>>>>>>
>>>>>> I have been experiencing issues with domains with passed through PCIe
>>>>>> devices since I first installed xen. Then at version 4.5.x , I'm now
>>>>>> at 4.6.0 with gentoo patches. Crashes SEEM mostly related to this pci
>>>>>> pass through and interrupts (usb-cards, sound cards).
>>>>>>
>>>>>> Recently the system has been more stable, whether it is because I pass
>>>>>> through as few things as possible, or because of improvements in Xen I
>>>>>> do not know. I have also taken to building with debug, which leads to
>>>>>> more abrupt but less mysterious failures. Earlier (w/o debug and under
>>>>>> xen 4.5 ) stuff would just gradually stop working and end up in total
>>>>>> hang of everything. So, hey, things are improving :-b
>>>>> This isn't the first time we have seen this on Haswell processors. Do
>>>>> you have microcode loading set up?
>>>>>
>>>>> ~Andrew
>>>>>
>>>> Still happening with kernel-genkernel-x86_64-4.1.15-gentoo and updated
>>>> cpu microcode, using microcode from 20151106.
>>> Ok - I previously investigated this issue, but my repro evaporated from
>>> under my feet with a firmware update, and I never got to the bottom of it.
>>>
>>> Please can you start with the following patch which will dump some more
>>> information on crash.
>>>
>>> ---8<---
>>> diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
>>> index 1228568..588b562 100644
>>> --- a/xen/arch/x86/irq.c
>>> +++ b/xen/arch/x86/irq.c
>>> @@ -1165,6 +1165,13 @@ static void __do_IRQ_guest(int irq)
>>>      if ( action->ack_type == ACKTYPE_EOI )
>>>      {
>>>          sp = pending_eoi_sp(peoi);
>>> +        if ( unlikely(!((sp == 0) || (peoi[sp-1].vector < vector))) )
>>> +        {
>>> +            int p;
>>> +            for ( p = sp; p > 0; --p )
>>> +                printk("**peoi[%d] = {%d, 0x%u, %d}\n",
>>> +                       p-1, peoi[p-1].irq, peoi[p-1].vector,
>>> peoi[p-1].ready);
>>> +        }
>>>          ASSERT((sp == 0) || (peoi[sp-1].vector < vector));
>>>          ASSERT(sp < (NR_DYNAMIC_VECTORS-1));
>>>          peoi[sp].irq = irq;
>>>
>>>
>> Will do. Building now.
>> Seems there is a line accidentally folded "peoi[p-1].ready);" belongs at
>> the end of preceding line I presume?
>>
> There we go :-/ . Log attached from boot to assertion-failure with
> loglvl=all guest_loglvl=all . Some of the log output might be a bit
> cryptic, they are notes to myself from local boot-scripts, basically
> firing up my router/name-server/dhcp-server and waiting until services
> are ready before continuing.

Would you mind running with the second patch I sent? It gathers more
information.

~Andrew
Jan Beulich Jan. 18, 2016, 10:31 a.m. UTC | #3
>>> On 18.01.16 at 00:07, <hakon@alstadheim.priv.no> wrote:
> There we go :-/ . Log attached from boot to assertion-failure with
> loglvl=all guest_loglvl=all . Some of the log output might be a bit
> cryptic, they are notes to myself from local boot-scripts, basically
> firing up my router/name-server/dhcp-server and waiting until services
> are ready before continuing.
> 
> ---
> (XEN) [2016-01-17 22:46:38] **peoi[0] = {107, 0x40, 0}

According to

(XEN) [2016-01-17 16:50:49] IOAPIC[0]: Set PCI routing entry (1-3 -> 0x40 -> IRQ 3 Mode:0 Active:0)

this might be the serial console, albeit IRQ 107 contradicts this
afaict. Does this also occur without serial console? Are we
perhaps wrongly re-using vector 0x40 (and if so might this be
fixed with -unstable commit fc0c3fa2ad, in turn requiring
e509b8e09c)?

Jan
Andrew Cooper Jan. 18, 2016, 10:35 a.m. UTC | #4
On 18/01/16 10:31, Jan Beulich wrote:
>>>> On 18.01.16 at 00:07, <hakon@alstadheim.priv.no> wrote:
>> There we go :-/ . Log attached from boot to assertion-failure with
>> loglvl=all guest_loglvl=all . Some of the log output might be a bit
>> cryptic, they are notes to myself from local boot-scripts, basically
>> firing up my router/name-server/dhcp-server and waiting until services
>> are ready before continuing.
>>
>> ---
>> (XEN) [2016-01-17 22:46:38] **peoi[0] = {107, 0x40, 0}
> According to
>
> (XEN) [2016-01-17 16:50:49] IOAPIC[0]: Set PCI routing entry (1-3 -> 0x40 -> IRQ 3 Mode:0 Active:0)
>
> this might be the serial console, albeit IRQ 107 contradicts this
> afaict. Does this also occur without serial console? Are we
> perhaps wrongly re-using vector 0x40 (and if so might this be
> fixed with -unstable commit fc0c3fa2ad, in turn requiring
> e509b8e09c)?

I also had a bug in the first patch which printed the vector as 0x%u,
fixed in the second to be %#x.  As such, the actual vector on the
pending EOI stack is 0x28.
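
To make the formatting mix-up concrete: `0x%u` prints a literal "0x" followed by the value in decimal, so vector 0x28 (decimal 40) comes out as "0x40", whereas `%#x` prints real hexadecimal. A minimal sketch using standard `snprintf` rather than Xen's `printk`; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* First patch: "0x" prefix followed by the value in DECIMAL. */
int format_vector_buggy(char *buf, size_t len, unsigned int vector)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "0x%u", vector);
}

/* Second patch: %#x emits the value in hexadecimal with a 0x prefix. */
int format_vector_fixed(char *buf, size_t len, unsigned int vector)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%#x", vector);
}

/* Returns 1 when both formats happen to produce the same text. */
int formats_agree(unsigned int vector)
{
    char a[16], b[16];

    format_vector_buggy(a, sizeof(a), vector);
    format_vector_fixed(b, sizeof(b), vector);
    return strcmp(a, b) == 0;
}
```

The two formats only agree for values 1 through 9, so any vector in the range discussed in this thread is misprinted by the first patch.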

~Andrew
Jan Beulich Jan. 18, 2016, 10:54 a.m. UTC | #5
>>> On 18.01.16 at 11:35, <andrew.cooper3@citrix.com> wrote:
> On 18/01/16 10:31, Jan Beulich wrote:
>>>>> On 18.01.16 at 00:07, <hakon@alstadheim.priv.no> wrote:
>>> There we go :-/ . Log attached from boot to assertion-failure with
>>> loglvl=all guest_loglvl=all . Some of the log output might be a bit
>>> cryptic, they are notes to myself from local boot-scripts, basically
>>> firing up my router/name-server/dhcp-server and waiting until services
>>> are ready before continuing.
>>>
>>> ---
>>> (XEN) [2016-01-17 22:46:38] **peoi[0] = {107, 0x40, 0}
>> According to
>>
>> (XEN) [2016-01-17 16:50:49] IOAPIC[0]: Set PCI routing entry (1-3 -> 0x40 -> IRQ 3 Mode:0 Active:0)
>>
>> this might be the serial console, albeit IRQ 107 contradicts this
>> afaict. Does this also occur without serial console? Are we
>> perhaps wrongly re-using vector 0x40 (and if so might this be
>> fixed with -unstable commit fc0c3fa2ad, in turn requiring
>> e509b8e09c)?
> 
> I also had a bug in the first patch which printed the vector as 0x%u,
> fixed in the second to be %#x.  As such, the actual vector on the
> pending EOI stack is 0x28.

That wouldn't make it any better, as then, considering the other
similar messages, we would have to conclude it's the vector of
some other Xen internally used device (the IOMMU?), which again
shouldn't be used by guest IRQ unless it got recycled (albeit I don't
think e.g. IOMMU vectors get recycled at all).

Håkon, considering

(XEN) Failed to enable Interrupt Remapping: Will not enable x2APIC.

plus

(XEN) Intel VT-d Interrupt Remapping enabled.

(a logging inconsistency addressed on -unstable already) could you
check your BIOS setup whether you can make firmware permit use
of x2APIC mode? And could you try whether the issue goes away
with "maxcpus=6" (or less) on the Xen command line?

Also, you appear to be doing GPU pass-through - is the problem
connected to that?

Jan
Håkon Alstadheim Jan. 18, 2016, 4:35 p.m. UTC | #6
On 18 Jan. 2016 11:31, Jan Beulich wrote:
>>>> On 18.01.16 at 00:07, <hakon@alstadheim.priv.no> wrote:
>> There we go :-/ . Log attached from boot to assertion-failure with
>> loglvl=all guest_loglvl=all . Some of the log output might be a bit
>> cryptic, they are notes to myself from local boot-scripts, basically
>> firing up my router/name-server/dhcp-server and waiting until services
>> are ready before continuing.
>>
>> ---
>> (XEN) [2016-01-17 22:46:38] **peoi[0] = {107, 0x40, 0}
> According to
>
> (XEN) [2016-01-17 16:50:49] IOAPIC[0]: Set PCI routing entry (1-3 -> 0x40 -> IRQ 3 Mode:0 Active:0)
>
> this might be the serial console, albeit IRQ 107 contradicts this
> afaict. Does this also occur without serial console? Are we
> perhaps wrongly re-using vector 0x40 (and if so might this be
> fixed with -unstable commit fc0c3fa2ad, in turn requiring
> e509b8e09c)?
>
I don't understand all this, but fyi I believe I have two "serial ports"
on the motherboard, one old fashioned serial-port and one created by BMC
for "SOL". They show up in dom0 as ttyS0 and ttyS1. Only ttyS0 is ever
used that I am aware of. I also have one usb-rs232 emulation thingy
which is actually my UPS. All of these are used directly by dom0.

Patch

diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index 1228568..588b562 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -1165,6 +1165,13 @@  static void __do_IRQ_guest(int irq)
     if ( action->ack_type == ACKTYPE_EOI )
     {
         sp = pending_eoi_sp(peoi);
+        if ( unlikely(!((sp == 0) || (peoi[sp-1].vector < vector))) )
+        {
+            int p;
+            for ( p = sp; p > 0; --p )
+                printk("**peoi[%d] = {%d, 0x%u, %d}\n",
+                       p-1, peoi[p-1].irq, peoi[p-1].vector, peoi[p-1].ready);
+        }
         ASSERT((sp == 0) || (peoi[sp-1].vector < vector));
         ASSERT(sp < (NR_DYNAMIC_VECTORS-1));