Message ID | 20230725080427.23883-1-qiuxu.zhuo@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [1/1] EDAC/igen6: Fix the issue of no error events | expand |
> Fix this issue by moving the pending error handler after the registration > of the error handler, ensuring that no pending errors are left unhandled. Do you think drivers/edac/e7xxx_edac.c has the same issue? 491 492 /* clear any pending errors, or initial state bits */ 493 e7xxx_get_error_info(mci, &discard); 494 495 /* Here we assume that we will never see multiple instances of this 496 * type of memory controller. The ID is therefore hardcoded to 0. 497 */ 498 if (edac_mc_add_mc(mci)) { 499 edac_dbg(3, "failed edac_mc_add_mc()\n"); 500 goto fail1; 501 } 502 503 /* allocating generic PCI control info */ 504 e7xxx_pci = edac_pci_create_generic_ctl(&pdev->dev, EDAC_MOD_STR); Though it might be hard to find such an old system to test. -Tony
> From: Luck, Tony <tony.luck@intel.com> > ... > Subject: RE: [PATCH 1/1] EDAC/igen6: Fix the issue of no error events > > > Fix this issue by moving the pending error handler after the > > registration of the error handler, ensuring that no pending errors are left > unhandled. > > Do you think drivers/edac/e7xxx_edac.c has the same issue? Hi Tony, Based on the code [1], the e7xxx_edac works in polling mode and the pending errors can be handled within one period after the error handler registration. So, I don't think the e7xxx_edac has the same issue. [1] e7xxx_probe1()-> mci->edac_check = e7xxx_check; -> edac_mc_add_mc() -> if (mci->edac_check) { mci->op_state = OP_RUNNING_POLL; ... } > 491 > 492 /* clear any pending errors, or initial state bits */ > 493 e7xxx_get_error_info(mci, &discard); This function is also invoked periodically in polling mode: mci->edac_check() -> e7xxx_check() -> e7xxx_get_error_info() > 494 > 495 /* Here we assume that we will never see multiple instances of this > 496 * type of memory controller. The ID is therefore hardcoded to 0. > 497 */ > 498 if (edac_mc_add_mc(mci)) { > 499 edac_dbg(3, "failed edac_mc_add_mc()\n"); > 500 goto fail1; > 501 } > 502 > 503 /* allocating generic PCI control info */ > 504 e7xxx_pci = edac_pci_create_generic_ctl(&pdev->dev, > EDAC_MOD_STR); > > Though it might be hard to find such an old system to test. > > -Tony
> Current igen6_edac checks for pending errors before the registration > of the error handler. However, there is a possibility that the error > occurs during the registration process, leading to unhandled pending > errors and no future error events. This issue can be reproduced by > repeatedly injecting errors during the loading of the igen6_edac. > > Fix this issue by moving the pending error handler after the registration > of the error handler, ensuring that no pending errors are left unhandled. > > Fixes: 10590a9d4f23 ("EDAC/igen6: Add EDAC driver for Intel client SoCs using IBECC") > Reported-by: Ee Wey Lim <ee.wey.lim@intel.com> > Tested-by: Ee Wey Lim <ee.wey.lim@intel.com> > Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Applied to RAS tree for next merge window. Thanks -Tony
diff --git a/drivers/edac/igen6_edac.c b/drivers/edac/igen6_edac.c index 544dd19072ea..1a18693294db 100644 --- a/drivers/edac/igen6_edac.c +++ b/drivers/edac/igen6_edac.c @@ -27,7 +27,7 @@ #include "edac_mc.h" #include "edac_module.h" -#define IGEN6_REVISION "v2.5" +#define IGEN6_REVISION "v2.5.1" #define EDAC_MOD_STR "igen6_edac" #define IGEN6_NMI_NAME "igen6_ibecc" @@ -1216,9 +1216,6 @@ static int igen6_probe(struct pci_dev *pdev, const struct pci_device_id *ent) INIT_WORK(&ecclog_work, ecclog_work_cb); init_irq_work(&ecclog_irq_work, ecclog_irq_work_cb); - /* Check if any pending errors before registering the NMI handler */ - ecclog_handler(); - rc = register_err_handler(); if (rc) goto fail3; @@ -1230,6 +1227,9 @@ static int igen6_probe(struct pci_dev *pdev, const struct pci_device_id *ent) goto fail4; } + /* Check if any pending errors before/during the registration of the error handler */ + ecclog_handler(); + igen6_debug_setup(); return 0; fail4: