Message ID | 1362423859-18516-1-git-send-email-nhorman@tuxdriver.com (mailing list archive) |
---|---|
State | New, archived |
Delegated to: | Bjorn Helgaas |
Headers | show |
On Mon, Mar 04, 2013 at 02:04:19PM -0500, Neil Horman wrote: > A few years back intel published a spec update: > http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf > > For the 5520 and 5500 chipsets which contained an errata (specificially errata > 53), which noted that these chipsets can't properly do interrupt remapping, and > as a result the recommend that interrupt remapping be disabled in bios. While > many vendors have a bios update to do exactly that, not all do, and of course > not all users update their bios to a level that corrects the problem. As a > result, occasionally interrupts can arrive at a cpu even after affinity for that > interrupt has be moved, leading to lost or spurrious interrupts (usually > characterized by the message: > kernel: do_IRQ: 7.71 No irq handler for vector (irq -1) > > There have been several incidents recently of people seeing this error, and > investigation has shown that they have system for which their BIOS level is such > that this feature was not properly turned off. As such, it would be good to > give them a reminder that their systems are vulnurable to this problem. > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > CC: Prarit Bhargava <prarit@redhat.com> > CC: Don Zickus <dzickus@redhat.com> > CC: Don Dutile <ddutile@redhat.com> > CC: Bjorn Helgaas <bhelgaas@google.com> > CC: Asit Mallick <asit.k.mallick@intel.com> > CC: linux-pci@vger.kernel.org > Ping, anyone want to Ack/Nack this? Neil -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Mar 9, 2013 at 1:49 PM, Neil Horman <nhorman@tuxdriver.com> wrote: > On Mon, Mar 04, 2013 at 02:04:19PM -0500, Neil Horman wrote: >> A few years back intel published a spec update: >> http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf >> >> For the 5520 and 5500 chipsets which contained an errata (specificially errata >> 53), which noted that these chipsets can't properly do interrupt remapping, and >> as a result the recommend that interrupt remapping be disabled in bios. While >> many vendors have a bios update to do exactly that, not all do, and of course >> not all users update their bios to a level that corrects the problem. As a >> result, occasionally interrupts can arrive at a cpu even after affinity for that >> interrupt has be moved, leading to lost or spurrious interrupts (usually >> characterized by the message: >> kernel: do_IRQ: 7.71 No irq handler for vector (irq -1) >> >> There have been several incidents recently of people seeing this error, and >> investigation has shown that they have system for which their BIOS level is such >> that this feature was not properly turned off. As such, it would be good to >> give them a reminder that their systems are vulnurable to this problem. >> >> Signed-off-by: Neil Horman <nhorman@tuxdriver.com> >> CC: Prarit Bhargava <prarit@redhat.com> >> CC: Don Zickus <dzickus@redhat.com> >> CC: Don Dutile <ddutile@redhat.com> >> CC: Bjorn Helgaas <bhelgaas@google.com> >> CC: Asit Mallick <asit.k.mallick@intel.com> >> CC: linux-pci@vger.kernel.org >> > Ping, anyone want to Ack/Nack this? Don's comment earlier seems to imply that this is a short term fix and that a more long term fix may be coming soon. If that is the case wouldn't we want to wait for the long term fix and just pull that in? Myron > Neil > > -- > To unsubscribe from this list: send the line "unsubscribe linux-pci" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 03/09/2013 05:20 PM, Myron Stowe wrote: > On Sat, Mar 9, 2013 at 1:49 PM, Neil Horman<nhorman@tuxdriver.com> wrote: >> On Mon, Mar 04, 2013 at 02:04:19PM -0500, Neil Horman wrote: >>> A few years back intel published a spec update: >>> http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf >>> >>> For the 5520 and 5500 chipsets which contained an errata (specificially errata >>> 53), which noted that these chipsets can't properly do interrupt remapping, and >>> as a result the recommend that interrupt remapping be disabled in bios. While >>> many vendors have a bios update to do exactly that, not all do, and of course >>> not all users update their bios to a level that corrects the problem. As a >>> result, occasionally interrupts can arrive at a cpu even after affinity for that >>> interrupt has be moved, leading to lost or spurrious interrupts (usually >>> characterized by the message: >>> kernel: do_IRQ: 7.71 No irq handler for vector (irq -1) >>> >>> There have been several incidents recently of people seeing this error, and >>> investigation has shown that they have system for which their BIOS level is such >>> that this feature was not properly turned off. As such, it would be good to >>> give them a reminder that their systems are vulnurable to this problem. >>> >>> Signed-off-by: Neil Horman<nhorman@tuxdriver.com> >>> CC: Prarit Bhargava<prarit@redhat.com> >>> CC: Don Zickus<dzickus@redhat.com> >>> CC: Don Dutile<ddutile@redhat.com> >>> CC: Bjorn Helgaas<bhelgaas@google.com> >>> CC: Asit Mallick<asit.k.mallick@intel.com> >>> CC: linux-pci@vger.kernel.org >>> >> Ping, anyone want to Ack/Nack this? > > Don's comment earlier seems to imply that this is a short term fix and > that a more long term fix may be coming soon. If that is the case > wouldn't we want to wait for the long term fix and just pull that in? > > Myron > At the time of Neil's postings, multiple changes were being considered, and we didn't know how long it would take to verify any one change. Thus, Neil's patch was proposed to identify a known problem that was being seen on multiple systems, and it was proposed so further system issues wouldn't go mis-diagnosed. We are still testing a minor change, and test results are positive so far. After we're sure of its logic as it's results, it can be posted and this patch can be removed if it is taken before we are certain. As Prarit stated, we should be sure of changes in this area before throwing this patch away for an option that could yield other failures. - Don >> Neil >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-pci" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Mar 09, 2013 at 03:20:57PM -0700, Myron Stowe wrote: > On Sat, Mar 9, 2013 at 1:49 PM, Neil Horman <nhorman@tuxdriver.com> wrote: > > On Mon, Mar 04, 2013 at 02:04:19PM -0500, Neil Horman wrote: > >> A few years back intel published a spec update: > >> http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf > >> > >> For the 5520 and 5500 chipsets which contained an errata (specificially errata > >> 53), which noted that these chipsets can't properly do interrupt remapping, and > >> as a result the recommend that interrupt remapping be disabled in bios. While > >> many vendors have a bios update to do exactly that, not all do, and of course > >> not all users update their bios to a level that corrects the problem. As a > >> result, occasionally interrupts can arrive at a cpu even after affinity for that > >> interrupt has be moved, leading to lost or spurrious interrupts (usually > >> characterized by the message: > >> kernel: do_IRQ: 7.71 No irq handler for vector (irq -1) > >> > >> There have been several incidents recently of people seeing this error, and > >> investigation has shown that they have system for which their BIOS level is such > >> that this feature was not properly turned off. As such, it would be good to > >> give them a reminder that their systems are vulnurable to this problem. > >> > >> Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > >> CC: Prarit Bhargava <prarit@redhat.com> > >> CC: Don Zickus <dzickus@redhat.com> > >> CC: Don Dutile <ddutile@redhat.com> > >> CC: Bjorn Helgaas <bhelgaas@google.com> > >> CC: Asit Mallick <asit.k.mallick@intel.com> > >> CC: linux-pci@vger.kernel.org > >> > > Ping, anyone want to Ack/Nack this? > > Don's comment earlier seems to imply that this is a short term fix and > that a more long term fix may be coming soon. If that is the case > wouldn't we want to wait for the long term fix and just pull that in? > > Myron > As Don and Prarit have mentioned, an alternate change is being worked on and tested that may work around this issue, but we're not yet sure that it will, and we're not sure of the time frame for this fix. Normally I would agree, that it would be easier just to wait for the long term fix, but as Prarit noted, since this hardware is in fact broken, I would rather do a both approach. Its fine if this gets reverted tomorrow with a longer term fix as far as I'm concerned, its just caused enough problems already that I'd like to see it in place until the better solution arrives. Neil -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 03/11/2013 07:25 AM, Neil Horman wrote: > On Sat, Mar 09, 2013 at 03:20:57PM -0700, Myron Stowe wrote: >> On Sat, Mar 9, 2013 at 1:49 PM, Neil Horman <nhorman@tuxdriver.com> wrote: >>> On Mon, Mar 04, 2013 at 02:04:19PM -0500, Neil Horman wrote: >>>> A few years back intel published a spec update: >>>> http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf >>>> >>>> For the 5520 and 5500 chipsets which contained an errata (specificially errata >>>> 53), which noted that these chipsets can't properly do interrupt remapping, and >>>> as a result the recommend that interrupt remapping be disabled in bios. While >>>> many vendors have a bios update to do exactly that, not all do, and of course >>>> not all users update their bios to a level that corrects the problem. As a >>>> result, occasionally interrupts can arrive at a cpu even after affinity for that >>>> interrupt has be moved, leading to lost or spurrious interrupts (usually >>>> characterized by the message: >>>> kernel: do_IRQ: 7.71 No irq handler for vector (irq -1) >>>> >>>> There have been several incidents recently of people seeing this error, and >>>> investigation has shown that they have system for which their BIOS level is such >>>> that this feature was not properly turned off. As such, it would be good to >>>> give them a reminder that their systems are vulnurable to this problem. >>>> >>>> Signed-off-by: Neil Horman <nhorman@tuxdriver.com> >>>> CC: Prarit Bhargava <prarit@redhat.com> >>>> CC: Don Zickus <dzickus@redhat.com> >>>> CC: Don Dutile <ddutile@redhat.com> >>>> CC: Bjorn Helgaas <bhelgaas@google.com> >>>> CC: Asit Mallick <asit.k.mallick@intel.com> >>>> CC: linux-pci@vger.kernel.org >>>> >>> Ping, anyone want to Ack/Nack this? >> >> Don's comment earlier seems to imply that this is a short term fix and >> that a more long term fix may be coming soon. If that is the case >> wouldn't we want to wait for the long term fix and just pull that in? >> >> Myron >> > As Don and Prarit have mentioned, an alternate change is being worked on and > tested that may work around this issue, but we're not yet sure that it will, and > we're not sure of the time frame for this fix. Normally I would agree, that it > would be easier just to wait for the long term fix, but as Prarit noted, since > this hardware is in fact broken, I would rather do a both approach. Its fine if > this gets reverted tomorrow with a longer term fix as far as I'm concerned, its > just caused enough problems already that I'd like to see it in place until the > better solution arrives. I agree with Neil on this. While vendors are supposed to fix their BIOSes, experience has shown that not all vendors will fix their BIOSes for a problem like this. Ack this quirk. P. > Neil > > -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[+cc David and iommu list, Yinghai, Jiang] On Mon, Mar 4, 2013 at 12:04 PM, Neil Horman <nhorman@tuxdriver.com> wrote: > A few years back intel published a spec update: > http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf > > For the 5520 and 5500 chipsets which contained an errata (specificially errata > 53), which noted that these chipsets can't properly do interrupt remapping, and > as a result the recommend that interrupt remapping be disabled in bios. While > many vendors have a bios update to do exactly that, not all do, and of course > not all users update their bios to a level that corrects the problem. As a > result, occasionally interrupts can arrive at a cpu even after affinity for that > interrupt has be moved, leading to lost or spurrious interrupts (usually > characterized by the message: > kernel: do_IRQ: 7.71 No irq handler for vector (irq -1) > > There have been several incidents recently of people seeing this error, and > investigation has shown that they have system for which their BIOS level is such > that this feature was not properly turned off. As such, it would be good to > give them a reminder that their systems are vulnurable to this problem. > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > CC: Prarit Bhargava <prarit@redhat.com> > CC: Don Zickus <dzickus@redhat.com> > CC: Don Dutile <ddutile@redhat.com> > CC: Bjorn Helgaas <bhelgaas@google.com> > CC: Asit Mallick <asit.k.mallick@intel.com> > CC: linux-pci@vger.kernel.org > > --- > > Change notes: > > v2) > > * Moved the quirk to the x86 arch, since consensus seems to be that the 55XX > chipset series is x86 only. I decided however to keep the quirk as a regular > quirk, not an early_quirk. Early quirks have no way currently to determine if > BIOS has properly disabled the feature in the iommu, at least not without > significant hacking, and since its quite possible this will be a short lived > quirk, should Don Z's workaround code prove successful (and it looks like it may > well), I don't think that necessecary. > > * Removed the WARNING banner from the quirk, and added the HW_ERR token to the > string, I opted to leave the newlines in place however, as I really couldnt > find a way to keep the text on a single line is still legible from a code > perspective. I think theres enough language in there that using cscope on just > about any substring however will turn it up, and again, this may be a short > lived quirk. > --- > arch/x86/kernel/quirks.c | 18 ++++++++++++++++++ > include/linux/pci_ids.h | 2 ++ > 2 files changed, 20 insertions(+) > > diff --git a/arch/x86/kernel/quirks.c b/arch/x86/kernel/quirks.c > index 26ee48a..a718ea2 100644 > --- a/arch/x86/kernel/quirks.c > +++ b/arch/x86/kernel/quirks.c > @@ -5,6 +5,7 @@ > #include <linux/irq.h> > > #include <asm/hpet.h> > +#include "../../../drivers/iommu/irq_remapping.h" > > #if defined(CONFIG_X86_IO_APIC) && defined(CONFIG_SMP) && defined(CONFIG_PCI) > > @@ -567,3 +568,20 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_15H_NB_F5, > quirk_amd_nb_node); > > #endif > + > +static void intel_remapping_check(struct pci_dev *dev) > +{ > + u8 revision; > + > + pci_read_config_byte(dev, PCI_REVISION_ID, &revision); > + > + if ((revision == 0x13) && irq_remapping_enabled) { > + pr_warn(HW_ERR "This system BIOS has enabled interrupt remapping\n" > + "on a chipset that contains an errata making that\n" > + "feature unstable. Please reboot with nointremap\n" > + "added to the kernel command line and contact\n" > + "your BIOS vendor for an update"); > + } > +} > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_5520_IOHUB, intel_remapping_check); > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_5500_IOHUB, intel_remapping_check); This started as an IOMMU change, and I'm not an expert in that area, so I added David and the IOMMU list. I'd rather have him deal with this than me. Is this something we can just *fix* in the kernel, e.g., by turning off interrupt remapping ourselves, or does it have to be done before the OS boots? > diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h > index 31717bd..54027a6 100644 > --- a/include/linux/pci_ids.h > +++ b/include/linux/pci_ids.h > @@ -2732,6 +2732,8 @@ > #define PCI_DEVICE_ID_INTEL_LYNNFIELD_MC_CH2_RANK_REV2 0x2db2 > #define PCI_DEVICE_ID_INTEL_LYNNFIELD_MC_CH2_TC_REV2 0x2db3 > #define PCI_DEVICE_ID_INTEL_82855PM_HB 0x3340 > +#define PCI_DEVICE_ID_INTEL_5500_IOHUB 0x3403 > +#define PCI_DEVICE_ID_INTEL_5520_IOHUB 0x3406 For constants only used in one place, we just use the bare constant (0x3403) in the quirk rather than editing pci_ids.h (see comment at the top of that file). > #define PCI_DEVICE_ID_INTEL_IOAT_TBG4 0x3429 > #define PCI_DEVICE_ID_INTEL_IOAT_TBG5 0x342a > #define PCI_DEVICE_ID_INTEL_IOAT_TBG6 0x342b > -- > 1.7.11.7 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-pci" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Apr 03, 2013 at 05:53:27PM -0600, Bjorn Helgaas wrote: > [+cc David and iommu list, Yinghai, Jiang] > > On Mon, Mar 4, 2013 at 12:04 PM, Neil Horman <nhorman@tuxdriver.com> wrote: > > A few years back intel published a spec update: > > http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf > > > > For the 5520 and 5500 chipsets which contained an errata (specificially errata > > 53), which noted that these chipsets can't properly do interrupt remapping, and > > as a result the recommend that interrupt remapping be disabled in bios. While > > many vendors have a bios update to do exactly that, not all do, and of course > > not all users update their bios to a level that corrects the problem. As a > > result, occasionally interrupts can arrive at a cpu even after affinity for that > > interrupt has be moved, leading to lost or spurrious interrupts (usually > > characterized by the message: > > kernel: do_IRQ: 7.71 No irq handler for vector (irq -1) > > > > There have been several incidents recently of people seeing this error, and > > investigation has shown that they have system for which their BIOS level is such > > that this feature was not properly turned off. As such, it would be good to > > give them a reminder that their systems are vulnurable to this problem. > > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > > CC: Prarit Bhargava <prarit@redhat.com> > > CC: Don Zickus <dzickus@redhat.com> > > CC: Don Dutile <ddutile@redhat.com> > > CC: Bjorn Helgaas <bhelgaas@google.com> > > CC: Asit Mallick <asit.k.mallick@intel.com> > > CC: linux-pci@vger.kernel.org > > > > --- > > > > Change notes: > > > > v2) > > > > * Moved the quirk to the x86 arch, since consensus seems to be that the 55XX > > chipset series is x86 only. I decided however to keep the quirk as a regular > > quirk, not an early_quirk. Early quirks have no way currently to determine if > > BIOS has properly disabled the feature in the iommu, at least not without > > significant hacking, and since its quite possible this will be a short lived > > quirk, should Don Z's workaround code prove successful (and it looks like it may > > well), I don't think that necessecary. > > > > * Removed the WARNING banner from the quirk, and added the HW_ERR token to the > > string, I opted to leave the newlines in place however, as I really couldnt > > find a way to keep the text on a single line is still legible from a code > > perspective. I think theres enough language in there that using cscope on just > > about any substring however will turn it up, and again, this may be a short > > lived quirk. > > --- > > arch/x86/kernel/quirks.c | 18 ++++++++++++++++++ > > include/linux/pci_ids.h | 2 ++ > > 2 files changed, 20 insertions(+) > > > > diff --git a/arch/x86/kernel/quirks.c b/arch/x86/kernel/quirks.c > > index 26ee48a..a718ea2 100644 > > --- a/arch/x86/kernel/quirks.c > > +++ b/arch/x86/kernel/quirks.c > > @@ -5,6 +5,7 @@ > > #include <linux/irq.h> > > > > #include <asm/hpet.h> > > +#include "../../../drivers/iommu/irq_remapping.h" > > > > #if defined(CONFIG_X86_IO_APIC) && defined(CONFIG_SMP) && defined(CONFIG_PCI) > > > > @@ -567,3 +568,20 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_15H_NB_F5, > > quirk_amd_nb_node); > > > > #endif > > + > > +static void intel_remapping_check(struct pci_dev *dev) > > +{ > > + u8 revision; > > + > > + pci_read_config_byte(dev, PCI_REVISION_ID, &revision); > > + > > + if ((revision == 0x13) && irq_remapping_enabled) { > > + pr_warn(HW_ERR "This system BIOS has enabled interrupt remapping\n" > > + "on a chipset that contains an errata making that\n" > > + "feature unstable. Please reboot with nointremap\n" > > + "added to the kernel command line and contact\n" > > + "your BIOS vendor for an update"); > > + } > > +} > > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_5520_IOHUB, intel_remapping_check); > > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_5500_IOHUB, intel_remapping_check); > > This started as an IOMMU change, and I'm not an expert in that area, > so I added David and the IOMMU list. I'd rather have him deal with > this than me. > This was never an iommu change, other than the fact that interrupt mapping and the iommu are tightly coupled. > Is this something we can just *fix* in the kernel, e.g., by turning > off interrupt remapping ourselves, or does it have to be done before > the OS boots? > Possibly, but it didn't seem safe to me. As it currently stands, pci quirks are executed after the apic is configured, which means its possible the problem might trigger before the quirk has a chance to execute (even the early quirk). There was an exchange about this earlier in the thread IIRC. > > diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h > > index 31717bd..54027a6 100644 > > --- a/include/linux/pci_ids.h > > +++ b/include/linux/pci_ids.h > > @@ -2732,6 +2732,8 @@ > > #define PCI_DEVICE_ID_INTEL_LYNNFIELD_MC_CH2_RANK_REV2 0x2db2 > > #define PCI_DEVICE_ID_INTEL_LYNNFIELD_MC_CH2_TC_REV2 0x2db3 > > #define PCI_DEVICE_ID_INTEL_82855PM_HB 0x3340 > > +#define PCI_DEVICE_ID_INTEL_5500_IOHUB 0x3403 > > +#define PCI_DEVICE_ID_INTEL_5520_IOHUB 0x3406 > > For constants only used in one place, we just use the bare constant > (0x3403) in the quirk rather than editing pci_ids.h (see comment at > the top of that file). > Ok, I'll repost with this removed. Thanks & Regards Neil -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2013-04-03 at 17:53 -0600, Bjorn Helgaas wrote: > ); > > + > > + if ((revision == 0x13) && irq_remapping_enabled) { > > + pr_warn(HW_ERR "This system BIOS has enabled interrupt remapping\n" > > + "on a chipset that contains an errata making that\n" > > + "feature unstable. Please reboot with nointremap\n" > > + "added to the kernel command line and contact\n" > > + "your BIOS vendor for an update"); This should be WARN_TAINT(TAINT_FIRMWARE_WORKAROUND). And 'an erratum'.
On Thu, Apr 04, 2013 at 03:27:29PM +0100, David Woodhouse wrote: > On Wed, 2013-04-03 at 17:53 -0600, Bjorn Helgaas wrote: > > ); > > > + > > > + if ((revision == 0x13) && irq_remapping_enabled) { > > > + pr_warn(HW_ERR "This system BIOS has enabled interrupt remapping\n" > > > + "on a chipset that contains an errata making that\n" > > > + "feature unstable. Please reboot with nointremap\n" > > > + "added to the kernel command line and contact\n" > > > + "your BIOS vendor for an update"); > > This should be WARN_TAINT(TAINT_FIRMWARE_WORKAROUND). And 'an erratum'. > Ok, copy that. I'll repost shortly > -- > dwmw2 -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Apr 4, 2013 at 8:50 AM, Neil Horman <nhorman@tuxdriver.com> wrote: > On Thu, Apr 04, 2013 at 03:27:29PM +0100, David Woodhouse wrote: >> On Wed, 2013-04-03 at 17:53 -0600, Bjorn Helgaas wrote: >> > ); >> > > + >> > > + if ((revision == 0x13) && irq_remapping_enabled) { >> > > + pr_warn(HW_ERR "This system BIOS has enabled interrupt remapping\n" >> > > + "on a chipset that contains an errata making that\n" >> > > + "feature unstable. Please reboot with nointremap\n" >> > > + "added to the kernel command line and contact\n" >> > > + "your BIOS vendor for an update"); >> >> This should be WARN_TAINT(TAINT_FIRMWARE_WORKAROUND). And 'an erratum'. >> > Ok, copy that. I'll repost shortly When you do, please include URLs for any problem reports or bugzillas you have. I assume Windows "just works" in this situation? -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Apr 04, 2013 at 08:57:06AM -0600, Bjorn Helgaas wrote: > On Thu, Apr 4, 2013 at 8:50 AM, Neil Horman <nhorman@tuxdriver.com> wrote: > > On Thu, Apr 04, 2013 at 03:27:29PM +0100, David Woodhouse wrote: > >> On Wed, 2013-04-03 at 17:53 -0600, Bjorn Helgaas wrote: > >> > ); > >> > > + > >> > > + if ((revision == 0x13) && irq_remapping_enabled) { > >> > > + pr_warn(HW_ERR "This system BIOS has enabled interrupt remapping\n" > >> > > + "on a chipset that contains an errata making that\n" > >> > > + "feature unstable. Please reboot with nointremap\n" > >> > > + "added to the kernel command line and contact\n" > >> > > + "your BIOS vendor for an update"); > >> > >> This should be WARN_TAINT(TAINT_FIRMWARE_WORKAROUND). And 'an erratum'. > >> > > Ok, copy that. I'll repost shortly > > When you do, please include URLs for any problem reports or bugzillas you have. > Well, those are going to be vendor specific, so I'm not sure we can really do that, at least not in any meaningful way. > I assume Windows "just works" in this situation? No more or less than linux does in this case. The Intel provided errata indicates that the only acceptable workaround is to disable remapping in the BIOS, so I would presume that if a windows system has a BIOS that doesn't implement this fix, its just as exposed as we are. Neil -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Apr 4, 2013 at 9:39 AM, Neil Horman <nhorman@tuxdriver.com> wrote: > On Thu, Apr 04, 2013 at 08:57:06AM -0600, Bjorn Helgaas wrote: >> On Thu, Apr 4, 2013 at 8:50 AM, Neil Horman <nhorman@tuxdriver.com> wrote: >> > On Thu, Apr 04, 2013 at 03:27:29PM +0100, David Woodhouse wrote: >> >> On Wed, 2013-04-03 at 17:53 -0600, Bjorn Helgaas wrote: >> >> > ); >> >> > > + >> >> > > + if ((revision == 0x13) && irq_remapping_enabled) { >> >> > > + pr_warn(HW_ERR "This system BIOS has enabled interrupt remapping\n" >> >> > > + "on a chipset that contains an errata making that\n" >> >> > > + "feature unstable. Please reboot with nointremap\n" >> >> > > + "added to the kernel command line and contact\n" >> >> > > + "your BIOS vendor for an update"); >> >> >> >> This should be WARN_TAINT(TAINT_FIRMWARE_WORKAROUND). And 'an erratum'. >> >> >> > Ok, copy that. I'll repost shortly >> >> When you do, please include URLs for any problem reports or bugzillas you have. >> > Well, those are going to be vendor specific, so I'm not sure we can really do > that, at least not in any meaningful way. Sorry, I don't understand your point. It's useful to know who reported it (e.g., for future testers) and what happened and what bugzillas it solved. Of course it applies only to machines with this chipset and certain BIOS revisions. >> I assume Windows "just works" in this situation? > No more or less than linux does in this case. The Intel provided errata > indicates that the only acceptable workaround is to disable remapping in the > BIOS, so I would presume that if a windows system has a BIOS that doesn't > implement this fix, its just as exposed as we are. It sounds like the effect of this bug is that on Linux, devices may not work at all because of lost interrupts. Either Windows must never enable remapping (so it never sees the bug), or it must be designed in a way that tolerates the problem. I can't believe these machines shipped with Windows certification if devices didn't work correctly. Either way, I don't understand why we can't make the quirk just fix this. Booting with "nointremap" only sets disable_irq_remap, which is only used by irq_remapping_supported(). Early quirks are run before irq_remapping_supported () is ever called, so an early quirk ought to be just as effective as the command line option. Here's the relevant call tree I see: start_kernel setup_arch parse_early_param early_quirks rest_init ... <first use of disable_irq_remap> The x86 setup_arch() does call generic_apic_probe(), but as far as I can tell, none of the APIC .probe() methods reference disable_irq_remap, so that doesn't look like a problem. Bjorn -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Apr 04, 2013 at 11:14:30AM -0600, Bjorn Helgaas wrote: > On Thu, Apr 4, 2013 at 9:39 AM, Neil Horman <nhorman@tuxdriver.com> wrote: > > On Thu, Apr 04, 2013 at 08:57:06AM -0600, Bjorn Helgaas wrote: > >> On Thu, Apr 4, 2013 at 8:50 AM, Neil Horman <nhorman@tuxdriver.com> wrote: > >> > On Thu, Apr 04, 2013 at 03:27:29PM +0100, David Woodhouse wrote: > >> >> On Wed, 2013-04-03 at 17:53 -0600, Bjorn Helgaas wrote: > >> >> > ); > >> >> > > + > >> >> > > + if ((revision == 0x13) && irq_remapping_enabled) { > >> >> > > + pr_warn(HW_ERR "This system BIOS has enabled interrupt remapping\n" > >> >> > > + "on a chipset that contains an errata making that\n" > >> >> > > + "feature unstable. Please reboot with nointremap\n" > >> >> > > + "added to the kernel command line and contact\n" > >> >> > > + "your BIOS vendor for an update"); > >> >> > >> >> This should be WARN_TAINT(TAINT_FIRMWARE_WORKAROUND). And 'an erratum'. > >> >> > >> > Ok, copy that. I'll repost shortly > >> > >> When you do, please include URLs for any problem reports or bugzillas you have. > >> > > Well, those are going to be vendor specific, so I'm not sure we can really do > > that, at least not in any meaningful way. > > Sorry, I don't understand your point. It's useful to know who > reported it (e.g., for future testers) and what happened and what > bugzillas it solved. Of course it applies only to machines with this > chipset and certain BIOS revisions. > Oh, you want the bug report that I'm fixing this against? Sure, I can do that. I thought you wanted me to include a url in the WARN_TAINT, with which user could report occurances of this bug. Yeah, the bug that this is reported in is: https://bugzilla.redhat.com/show_bug.cgi?id=887006 Its standing in for about a dozen or so variants of this issue we've seen > >> I assume Windows "just works" in this situation? > > No more or less than linux does in this case. The Intel provided errata > > indicates that the only acceptable workaround is to disable remapping in the > > BIOS, so I would presume that if a windows system has a BIOS that doesn't > > implement this fix, its just as exposed as we are. > > It sounds like the effect of this bug is that on Linux, devices may > not work at all because of lost interrupts. Either Windows must never > enable remapping (so it never sees the bug), or it must be designed in > a way that tolerates the problem. I can't believe these machines > shipped with Windows certification if devices didn't work correctly. > I don't know what to tell you here. We explicitly asked intel about this, and there was an effort to recode the interrupt migration code to properly manage this situation, and intels response was "No, just disable it in BIOS", so we're telling people to disable it in BIOS. You'd have to ask somebody at Microsoft what Windows did or did not do about this problem. > Either way, I don't understand why we can't make the quirk just fix > this. Booting with "nointremap" only sets disable_irq_remap, which is > only used by irq_remapping_supported(). Early quirks are run before > irq_remapping_supported () is ever called, so an early quirk ought to > be just as effective as the command line option. Here's the relevant > call tree I see: > > start_kernel > setup_arch > parse_early_param > early_quirks > rest_init > ... > <first use of disable_irq_remap> > > The x86 setup_arch() does call generic_apic_probe(), but as far as I > can tell, none of the APIC .probe() methods reference > disable_irq_remap, so that doesn't look like a problem. > Well, I can give it a try, but I'm sure something went wrong last time I did. Regardless, theres also the security issue to consider here - namely that disabling irq remapping opens up users of virt to a possible security bug (potential irq injection). Some users may wish to live with the remapping error, given that error typically leads to devices that need to be restarted/reset to start working again, rather than live with the security hole. I rather like the warning, that gives users a choice, but I'll spin up a version that just disables it if you would rather. Neil -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Apr 4, 2013 at 11:51 AM, Neil Horman <nhorman@tuxdriver.com> wrote: > Oh, you want the bug report that I'm fixing this against? Sure, I can do that. > I thought you wanted me to include a url in the WARN_TAINT, with which user > could report occurances of this bug. Yeah, the bug that this is reported in is: > https://bugzilla.redhat.com/show_bug.cgi?id=887006 > > Its standing in for about a dozen or so variants of this issue we've seen Exactly -- I'm just hoping for something in the changelog. BTW, this particular bugzilla is not public. > Regardless, theres also the security issue to consider here - namely that > disabling irq remapping opens up users of virt to a possible security bug > (potential irq injection). Some users may wish to live with the remapping > error, given that error typically leads to devices that need to be > restarted/reset to start working again, rather than live with the security hole. > I rather like the warning, that gives users a choice, but I'll spin up a version > that just disables it if you would rather. I don't believe users will want to make a choice like that or even be sophisticated enough to do it, at least not based on something in dmesg. I'm pretty sure I'm not :) The only supportable thing I can imagine doing would be: - Disable interrupt remapping if this chipset defect is present, so devices work reliably (they don't need whatever restart/reset you referred to above). - Disable virt functionality when interrupt remapping is disabled to avoid the security problem (I don't know the details of this.) - Add a command-line option to enable interrupt remapping (I think "intremap=on" is currently parsed too early, but maybe this could be reworked so the option could override the quirk disable). - Add release notes saying "boot with 'intremap=on' if you want the virt functionality and can accept unreliable devices." That way the default behavior is safe and reliable (though perhaps lacking some functionality), and you have told the user a way to get safe and unreliable operation if he's willing to accept that. At least, that's what I think I would want if I were in RH's shoes. Bjorn -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Apr 04, 2013 at 12:41:27PM -0600, Bjorn Helgaas wrote: > On Thu, Apr 4, 2013 at 11:51 AM, Neil Horman <nhorman@tuxdriver.com> wrote: > > > Oh, you want the bug report that I'm fixing this against? Sure, I can do that. > > I thought you wanted me to include a url in the WARN_TAINT, with which user > > could report occurances of this bug. Yeah, the bug that this is reported in is: > > https://bugzilla.redhat.com/show_bug.cgi?id=887006 > > > > Its standing in for about a dozen or so variants of this issue we've seen > > Exactly -- I'm just hoping for something in the changelog. BTW, this > particular bugzilla is not public. > Ok, that I can do, I'll get the bz fixed up to be public in a bit. > > Regardless, theres also the security issue to consider here - namely that > > disabling irq remapping opens up users of virt to a possible security bug > > (potential irq injection). Some users may wish to live with the remapping > > error, given that error typically leads to devices that need to be > > restarted/reset to start working again, rather than live with the security hole. > > I rather like the warning, that gives users a choice, but I'll spin up a version > > that just disables it if you would rather. > > I don't believe users will want to make a choice like that or even be > sophisticated enough to do it, at least not based on something in > dmesg. I'm pretty sure I'm not :) > > The only supportable thing I can imagine doing would be: > > - Disable interrupt remapping if this chipset defect is present, so > devices work reliably (they don't need whatever restart/reset you > referred to above). > - Disable virt functionality when interrupt remapping is disabled to > avoid the security problem (I don't know the details of this.) > - Add a command-line option to enable interrupt remapping (I think > "intremap=on" is currently parsed too early, but maybe this could be > reworked so the option could override the quirk disable). > - Add release notes saying "boot with 'intremap=on' if you want the > virt functionality and can accept unreliable devices." > > That way the default behavior is safe and reliable (though perhaps > lacking some functionality), and you have told the user a way to get > safe and unreliable operation if he's willing to accept that. At > least, that's what I think I would want if I were in RH's shoes. > Theres a long argument behind this, and I'm with you. At the least I don't see a problem with disabling it upstream, at least not a policy-oriented reason. That said, having looked back at my notes, I do now recall a technical impediment behind disabling interrupt remapping. If we were to do this in an early quirk, we would need to determine the status of the interrupt remapping capability flag in the iommu in question. Looking at the interrupt remapping code, it appears that the capability flag isn't directly contained in the pci config space, but rather in a memory mapped address range who's base address is parsed out of an acpi table. Since we check the irq_remapping_enabled flag in the current quirk (which is set after the early quirks run), to do this in an early quirk, we would need to somehow parse that base address register out of the available acpi tables ourselves, and I'm not at all sure how to do that. Do you know if theres some available parsing mechanism that early in boot? If so, I can probably make this happen. Neil > Bjorn > -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/arch/x86/kernel/quirks.c b/arch/x86/kernel/quirks.c index 26ee48a..a718ea2 100644 --- a/arch/x86/kernel/quirks.c +++ b/arch/x86/kernel/quirks.c @@ -5,6 +5,7 @@ #include <linux/irq.h> #include <asm/hpet.h> +#include "../../../drivers/iommu/irq_remapping.h" #if defined(CONFIG_X86_IO_APIC) && defined(CONFIG_SMP) && defined(CONFIG_PCI) @@ -567,3 +568,20 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_15H_NB_F5, quirk_amd_nb_node); #endif + +static void intel_remapping_check(struct pci_dev *dev) +{ + u8 revision; + + pci_read_config_byte(dev, PCI_REVISION_ID, &revision); + + if ((revision == 0x13) && irq_remapping_enabled) { + pr_warn(HW_ERR "This system BIOS has enabled interrupt remapping\n" + "on a chipset that contains an errata making that\n" + "feature unstable. Please reboot with nointremap\n" + "added to the kernel command line and contact\n" + "your BIOS vendor for an update"); + } +} +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_5520_IOHUB, intel_remapping_check); +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_5500_IOHUB, intel_remapping_check); diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h index 31717bd..54027a6 100644 --- a/include/linux/pci_ids.h +++ b/include/linux/pci_ids.h @@ -2732,6 +2732,8 @@ #define PCI_DEVICE_ID_INTEL_LYNNFIELD_MC_CH2_RANK_REV2 0x2db2 #define PCI_DEVICE_ID_INTEL_LYNNFIELD_MC_CH2_TC_REV2 0x2db3 #define PCI_DEVICE_ID_INTEL_82855PM_HB 0x3340 +#define PCI_DEVICE_ID_INTEL_5500_IOHUB 0x3403 +#define PCI_DEVICE_ID_INTEL_5520_IOHUB 0x3406 #define PCI_DEVICE_ID_INTEL_IOAT_TBG4 0x3429 #define PCI_DEVICE_ID_INTEL_IOAT_TBG5 0x342a #define PCI_DEVICE_ID_INTEL_IOAT_TBG6 0x342b
A few years back intel published a spec update: http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf For the 5520 and 5500 chipsets which contained an errata (specificially errata 53), which noted that these chipsets can't properly do interrupt remapping, and as a result the recommend that interrupt remapping be disabled in bios. While many vendors have a bios update to do exactly that, not all do, and of course not all users update their bios to a level that corrects the problem. As a result, occasionally interrupts can arrive at a cpu even after affinity for that interrupt has be moved, leading to lost or spurrious interrupts (usually characterized by the message: kernel: do_IRQ: 7.71 No irq handler for vector (irq -1) There have been several incidents recently of people seeing this error, and investigation has shown that they have system for which their BIOS level is such that this feature was not properly turned off. As such, it would be good to give them a reminder that their systems are vulnurable to this problem. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: Prarit Bhargava <prarit@redhat.com> CC: Don Zickus <dzickus@redhat.com> CC: Don Dutile <ddutile@redhat.com> CC: Bjorn Helgaas <bhelgaas@google.com> CC: Asit Mallick <asit.k.mallick@intel.com> CC: linux-pci@vger.kernel.org --- Change notes: v2) * Moved the quirk to the x86 arch, since consensus seems to be that the 55XX chipset series is x86 only. I decided however to keep the quirk as a regular quirk, not an early_quirk. Early quirks have no way currently to determine if BIOS has properly disabled the feature in the iommu, at least not without significant hacking, and since its quite possible this will be a short lived quirk, should Don Z's workaround code prove successful (and it looks like it may well), I don't think that necessecary. * Removed the WARNING banner from the quirk, and added the HW_ERR token to the string, I opted to leave the newlines in place however, as I really couldnt find a way to keep the text on a single line is still legible from a code perspective. I think theres enough language in there that using cscope on just about any substring however will turn it up, and again, this may be a short lived quirk. --- arch/x86/kernel/quirks.c | 18 ++++++++++++++++++ include/linux/pci_ids.h | 2 ++ 2 files changed, 20 insertions(+)