Message ID | 569BB05B.1000801@citrix.com (mailing list archive) |
---|---|
State | New, archived |
On 17 Jan 2016 16:16, Andrew Cooper wrote:
> On 17/01/16 14:50, Håkon Alstadheim wrote:
>> On 15 Jan 2016 12:05, Andrew Cooper wrote:
>>> On 15/01/16 10:58, Håkon Alstadheim wrote:
>>>> CPUINFO:
>>>> vendor_id : GenuineIntel
>>>> cpu family : 6
>>>> model : 63
>>>> model name : Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
>>>>
>>>> # smbios-sys-info
>>>> Libsmbios version: 2.2.28
>>>> Product Name: Z10PE-D8 WS
>>>> Vendor: ASUSTeK COMPUTER INC.
>>>> BIOS Version: 3101
>>>>
>>>>
>>>> I have been experiencing issues with domains with passed through PCIe
>>>> devices since I first installed xen. Then at version 4.5.x , I'm now
>>>> at 4.6.0 with gentoo patches. Crashes SEEM mostly related to this pci
>>>> pass through and interrupts (usb-cards, sound cards).
>>>>
>>>> Recently the system has been more stable, whether it is because I pass
>>>> through as few things as possible, or because of improvements in Xen I
>>>> do not know. I have also taken to building with debug, which leads to
>>>> more abrupt but less mysterious failures. Earlier (w/o debug and under
>>>> xen 4.5 ) stuff would just gradually stop working and end up in total
>>>> hang of everything. So, hey, things are improving :-b
>>> This isn't the first time we have seen this on Haswell processors. Do
>>> you have microcode loading set up?
>>>
>>> ~Andrew
>>>
>> Still happening with kernel-genkernel-x86_64-4.1.15-gentoo and updated
>> cpu microcode, using microcode from 20151106.
> Ok - I previously investigated this issue, but my repro evaporated from
> under my feet with a firmware update, and I never got to the bottom of it.
>
> Please can you start with the following patch which will dump some more
> information on crash.
>
> ---8<---
> diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
> index 1228568..588b562 100644
> --- a/xen/arch/x86/irq.c
> +++ b/xen/arch/x86/irq.c
> @@ -1165,6 +1165,13 @@ static void __do_IRQ_guest(int irq)
>      if ( action->ack_type == ACKTYPE_EOI )
>      {
>          sp = pending_eoi_sp(peoi);
> +        if ( unlikely(!((sp == 0) || (peoi[sp-1].vector < vector))) )
> +        {
> +            int p;
> +            for ( p = sp; p > 0; --p )
> +                printk("**peoi[%d] = {%d, 0x%u, %d}\n",
> +                       p-1, peoi[p-1].irq, peoi[p-1].vector,
> peoi[p-1].ready);
> +        }
>          ASSERT((sp == 0) || (peoi[sp-1].vector < vector));
>          ASSERT(sp < (NR_DYNAMIC_VECTORS-1));
>          peoi[sp].irq = irq;
>
>

Will do. Building now.
Seems there is a line accidentally folded "peoi[p-1].ready);" belongs at
the end of preceding line I presume?
On 17/01/2016 23:07, Håkon Alstadheim wrote:
> On 17 Jan 2016 17:30, Håkon Alstadheim wrote:
>> On 17 Jan 2016 16:16, Andrew Cooper wrote:
>>> On 17/01/16 14:50, Håkon Alstadheim wrote:
>>>> On 15 Jan 2016 12:05, Andrew Cooper wrote:
>>>>> On 15/01/16 10:58, Håkon Alstadheim wrote:
>>>>>> CPUINFO:
>>>>>> vendor_id : GenuineIntel
>>>>>> cpu family : 6
>>>>>> model : 63
>>>>>> model name : Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
>>>>>>
>>>>>> # smbios-sys-info
>>>>>> Libsmbios version: 2.2.28
>>>>>> Product Name: Z10PE-D8 WS
>>>>>> Vendor: ASUSTeK COMPUTER INC.
>>>>>> BIOS Version: 3101
>>>>>>
>>>>>>
>>>>>> I have been experiencing issues with domains with passed through PCIe
>>>>>> devices since I first installed xen. Then at version 4.5.x , I'm now
>>>>>> at 4.6.0 with gentoo patches. Crashes SEEM mostly related to this pci
>>>>>> pass through and interrupts (usb-cards, sound cards).
>>>>>>
>>>>>> Recently the system has been more stable, whether it is because I pass
>>>>>> through as few things as possible, or because of improvements in Xen I
>>>>>> do not know. I have also taken to building with debug, which leads to
>>>>>> more abrupt but less mysterious failures. Earlier (w/o debug and under
>>>>>> xen 4.5 ) stuff would just gradually stop working and end up in total
>>>>>> hang of everything. So, hey, things are improving :-b
>>>>> This isn't the first time we have seen this on Haswell processors. Do
>>>>> you have microcode loading set up?
>>>>>
>>>>> ~Andrew
>>>>>
>>>> Still happening with kernel-genkernel-x86_64-4.1.15-gentoo and updated
>>>> cpu microcode, using microcode from 20151106.
>>> Ok - I previously investigated this issue, but my repro evaporated from
>>> under my feet with a firmware update, and I never got to the bottom of it.
>>>
>>> Please can you start with the following patch which will dump some more
>>> information on crash.
>>>
>>> ---8<---
>>> diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
>>> index 1228568..588b562 100644
>>> --- a/xen/arch/x86/irq.c
>>> +++ b/xen/arch/x86/irq.c
>>> @@ -1165,6 +1165,13 @@ static void __do_IRQ_guest(int irq)
>>>      if ( action->ack_type == ACKTYPE_EOI )
>>>      {
>>>          sp = pending_eoi_sp(peoi);
>>> +        if ( unlikely(!((sp == 0) || (peoi[sp-1].vector < vector))) )
>>> +        {
>>> +            int p;
>>> +            for ( p = sp; p > 0; --p )
>>> +                printk("**peoi[%d] = {%d, 0x%u, %d}\n",
>>> +                       p-1, peoi[p-1].irq, peoi[p-1].vector,
>>> peoi[p-1].ready);
>>> +        }
>>>          ASSERT((sp == 0) || (peoi[sp-1].vector < vector));
>>>          ASSERT(sp < (NR_DYNAMIC_VECTORS-1));
>>>          peoi[sp].irq = irq;
>>>
>>>
>> Will do. Building now.
>> Seems there is a line accidentally folded "peoi[p-1].ready);" belongs at
>> the end of preceding line I presume?
>>
> There we go :-/ . Log attached from boot to assertion-failure with
> loglvl=all guest_loglvl=all . Some of the log output might be a bit
> cryptic, they are notes to myself from local boot-scripts, basically
> firing up my router/name-server/dhcp-server and waiting until services
> are ready before continuing.

Would you mind running with the second patch I sent? It gathers more
information.

~Andrew
>>> On 18.01.16 at 00:07, <hakon@alstadheim.priv.no> wrote:
> There we go :-/ . Log attached from boot to assertion-failure with
> loglvl=all guest_loglvl=all . Some of the log output might be a bit
> cryptic, they are notes to myself from local boot-scripts, basically
> firing up my router/name-server/dhcp-server and waiting until services
> are ready before continuing.
>
> ---
> (XEN) [2016-01-17 22:46:38] **peoi[0] = {107, 0x40, 0}

According to

(XEN) [2016-01-17 16:50:49] IOAPIC[0]: Set PCI routing entry (1-3 -> 0x40 -> IRQ 3 Mode:0 Active:0)

this might be the serial console, albeit IRQ 107 contradicts this
afaict. Does this also occur without serial console? Are we
perhaps wrongly re-using vector 0x40 (and if so might this be
fixed with -unstable commit fc0c3fa2ad, in turn requiring
e509b8e09c)?

Jan
On 18/01/16 10:31, Jan Beulich wrote:
>>>> On 18.01.16 at 00:07, <hakon@alstadheim.priv.no> wrote:
>> There we go :-/ . Log attached from boot to assertion-failure with
>> loglvl=all guest_loglvl=all . Some of the log output might be a bit
>> cryptic, they are notes to myself from local boot-scripts, basically
>> firing up my router/name-server/dhcp-server and waiting until services
>> are ready before continuing.
>>
>> ---
>> (XEN) [2016-01-17 22:46:38] **peoi[0] = {107, 0x40, 0}
> According to
>
> (XEN) [2016-01-17 16:50:49] IOAPIC[0]: Set PCI routing entry (1-3 -> 0x40 -> IRQ 3 Mode:0 Active:0)
>
> this might be the serial console, albeit IRQ 107 contradicts this
> afaict. Does this also occur without serial console? Are we
> perhaps wrongly re-using vector 0x40 (and if so might this be
> fixed with -unstable commit fc0c3fa2ad, in turn requiring
> e509b8e09c)?

I also had a bug in the first patch which printed the vector as 0x%u,
fixed in the second to be %#x. As such, the actual vector on the
pending EOI stack is 0x28.

~Andrew
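To make the format-string slip concrete, here is a minimal standalone sketch. It is not Xen code: the helper names `format_vector_buggy` and `format_vector_fixed` are hypothetical, and `snprintf` stands in for `printk`. It shows why a vector whose decimal value is 40 shows up as "0x40" under the first patch's `0x%u` but as "0x28" under `%#x`.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper mirroring the first patch's format: a literal "0x"
 * prefix followed by the value in DECIMAL (%u), which is misleading.
 */
static int format_vector_buggy(char *buf, size_t len, unsigned int vector)
{
    return snprintf(buf, len, "0x%u", vector);
}

/*
 * Hypothetical helper mirroring the format Andrew says the second patch
 * uses: %#x prints the value in hexadecimal with a "0x" prefix.
 */
static int format_vector_fixed(char *buf, size_t len, unsigned int vector)
{
    return snprintf(buf, len, "%#x", vector);
}
```

Calling both helpers with the value 40 yields "0x40" from the first and "0x28" from the second, which matches the correction above: the logged "0x40" was really decimal 40, i.e. vector 0x28.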
>>> On 18.01.16 at 11:35, <andrew.cooper3@citrix.com> wrote:
> On 18/01/16 10:31, Jan Beulich wrote:
>>>>> On 18.01.16 at 00:07, <hakon@alstadheim.priv.no> wrote:
>>> There we go :-/ . Log attached from boot to assertion-failure with
>>> loglvl=all guest_loglvl=all . Some of the log output might be a bit
>>> cryptic, they are notes to myself from local boot-scripts, basically
>>> firing up my router/name-server/dhcp-server and waiting until services
>>> are ready before continuing.
>>>
>>> ---
>>> (XEN) [2016-01-17 22:46:38] **peoi[0] = {107, 0x40, 0}
>> According to
>>
>> (XEN) [2016-01-17 16:50:49] IOAPIC[0]: Set PCI routing entry (1-3 -> 0x40 -> IRQ 3 Mode:0 Active:0)
>>
>> this might be the serial console, albeit IRQ 107 contradicts this
>> afaict. Does this also occur without serial console? Are we
>> perhaps wrongly re-using vector 0x40 (and if so might this be
>> fixed with -unstable commit fc0c3fa2ad, in turn requiring
>> e509b8e09c)?
>
> I also had a bug in the first patch which printed the vector as 0x%u,
> fixed in the second to be %#x. As such, the actual vector on the
> pending EOI stack is 0x28.

That wouldn't make it any better, as then, considering the other
similar messages, we would have to conclude it's the vector of
some other Xen internally used device (the IOMMU?), which again
shouldn't be used by guest IRQ unless it got recycled (albeit I
don't think e.g. IOMMU vectors get recycled at all).

Håkon, considering

(XEN) Failed to enable Interrupt Remapping: Will not enable x2APIC.

plus

(XEN) Intel VT-d Interrupt Remapping enabled.

(a logging inconsistency addressed on -unstable already) could
you check your BIOS setup whether you can make firmware
permit use of x2APIC mode? And could you try whether the issue
goes away with "maxcpus=6" (or less) on the Xen command line?

Also, you appear to be doing GPU pass-through - is the problem
connected to that?

Jan
On 18 Jan 2016 11:31, Jan Beulich wrote:
>>>> On 18.01.16 at 00:07, <hakon@alstadheim.priv.no> wrote:
>> There we go :-/ . Log attached from boot to assertion-failure with
>> loglvl=all guest_loglvl=all . Some of the log output might be a bit
>> cryptic, they are notes to myself from local boot-scripts, basically
>> firing up my router/name-server/dhcp-server and waiting until services
>> are ready before continuing.
>>
>> ---
>> (XEN) [2016-01-17 22:46:38] **peoi[0] = {107, 0x40, 0}
> According to
>
> (XEN) [2016-01-17 16:50:49] IOAPIC[0]: Set PCI routing entry (1-3 -> 0x40 -> IRQ 3 Mode:0 Active:0)
>
> this might be the serial console, albeit IRQ 107 contradicts this
> afaict. Does this also occur without serial console? Are we
> perhaps wrongly re-using vector 0x40 (and if so might this be
> fixed with -unstable commit fc0c3fa2ad, in turn requiring
> e509b8e09c)?
>

I don't understand all this, but FYI: I believe I have two "serial
ports" on the motherboard, one old-fashioned serial port and one
created by the BMC for "SOL". They show up in dom0 as ttyS0 and ttyS1.
As far as I am aware, only ttyS0 is ever used. I also have one
USB-to-RS232 emulation thingy, which is actually my UPS. All of these
are used directly by dom0.
diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index 1228568..588b562 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -1165,6 +1165,13 @@ static void __do_IRQ_guest(int irq)
     if ( action->ack_type == ACKTYPE_EOI )
     {
         sp = pending_eoi_sp(peoi);
+        if ( unlikely(!((sp == 0) || (peoi[sp-1].vector < vector))) )
+        {
+            int p;
+            for ( p = sp; p > 0; --p )
+                printk("**peoi[%d] = {%d, 0x%u, %d}\n",
+                       p-1, peoi[p-1].irq, peoi[p-1].vector, peoi[p-1].ready);
+        }
         ASSERT((sp == 0) || (peoi[sp-1].vector < vector));
         ASSERT(sp < (NR_DYNAMIC_VECTORS-1));
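
For readers who want to experiment with the condition this patch guards outside the hypervisor, the following is a rough userspace sketch, not the Xen implementation. The struct and both function names are hypothetical; the field names `irq`, `vector` and `ready` come from the patch, while their types are assumptions. It reproduces the ASSERT condition (a new vector must be strictly greater than the one on top of the pending EOI stack) and the top-down dump, using `%#x` as in the corrected second patch.

```c
#include <stdio.h>

/*
 * Hypothetical stand-in for one pending EOI stack entry. Field names are
 * taken from the patch; the types are assumptions for this sketch.
 */
struct pending_eoi_entry {
    int irq;
    unsigned int vector;
    int ready;
};

/*
 * Same condition as the ASSERT in the patch: pushing is in order if the
 * stack is empty or the top entry's vector is strictly lower than the
 * new vector. Returns 1 if the order is kept, 0 otherwise.
 */
static int push_keeps_order(const struct pending_eoi_entry *peoi, int sp,
                            unsigned int vector)
{
    return (sp == 0) || (peoi[sp - 1].vector < vector);
}

/*
 * Dump the stack from the top entry down to index 0, as the debugging
 * loop in the patch does. Returns the number of entries printed.
 */
static int dump_pending_eoi(const struct pending_eoi_entry *peoi, int sp)
{
    int p, printed = 0;

    for ( p = sp; p > 0; --p )
    {
        printf("**peoi[%d] = {%d, %#x, %d}\n",
               p - 1, peoi[p - 1].irq, peoi[p - 1].vector, peoi[p - 1].ready);
        ++printed;
    }
    return printed;
}
```

With a single entry `{107, 0x28, 0}` on the stack, `push_keeps_order` returns 0 for any new vector at or below 0x28, which is the situation in which the patch dumps the stack before the assertion fires.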