Message ID | 20240509084833.2147767-1-zhenzhong.duan@intel.com (mailing list archive) |
---|---|
Series | PCI/AER: Handle Advisory Non-Fatal error |
Hi,

Kindly ping. I would appreciate comments and suggestions so I can go ahead.

Thanks
Zhenzhong

>-----Original Message-----
>From: Duan, Zhenzhong <zhenzhong.duan@intel.com>
>Subject: [PATCH v4 0/3] PCI/AER: Handle Advisory Non-Fatal error
>
>Hi,
>
>This continues Qingshun's v2 [1], but is changed to focus on ANFE
>processing, as the subject suggests, and drops the trace event for now.
>I think it's a bit heavy to do extra I/O to read PCIe registers only for
>tracing, and I have not seen a community request for it so far.
>
>According to PCIe Base Specification Revision 6.1, Sections 6.2.3.2.4 and
>6.2.4.3, certain uncorrectable errors will signal ERR_COR instead of
>ERR_NONFATAL, are logged as Advisory Non-Fatal Error (ANFE), and set bits
>in both the Correctable Error (CE) Status register and the Uncorrectable
>Error (UE) Status register. Currently, when handling AER events the kernel
>will only look at CE status or UE status, but never both. In the ANFE
>case, bits set in the UE status register will not be reported and cleared
>until the next FE/NFE arrives.
>
>For instance, previously, when the kernel receives an ANFE with Poisoned
>TLP in OS native AER mode, only the status of CE will be reported and
>cleared:
>
>  AER: Correctable error message received from 0000:b7:02.0
>  PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>    device [8086:0db0] error status/mask=00002000/00000000
>     [13] NonFatalErr
>
>If the kernel receives a Malformed TLP after that, two UEs will be
>reported, which is unexpected. The Malformed TLP Header is lost since
>the previous ANFE gated the TLP header logs:
>
>  PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>    device [8086:0db0] error status/mask=00041000/00180020
>     [12] TLP                    (First)
>     [18] MalfTLP
>
>To handle this case properly, calculate potential ANFE related status bits
>and save them in aer_err_info. Use this information to determine the
>status bits that need to be cleared.
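
To make the calculation described in the last quoted paragraph concrete, here is a minimal userspace sketch in C. It is not the code from the patches: the helper name `anfe_candidate_bits()`, the macro names, and the choice of which UE bits count as ANFE candidates (one reading of the spec sections cited above) are assumptions for illustration. Only the bit values follow the AER capability register layout.

```c
#include <stdint.h>

/* Bit values follow the AER capability layout (they match the
 * PCI_ERR_COR_* / PCI_ERR_UNC_* values in Linux's pci_regs.h).
 * The macro names here are shortened, hypothetical ones. */
#define COR_ADV_NFAT    0x00002000u /* CE status bit 13: Advisory Non-Fatal */
#define UNC_POISON_TLP  0x00001000u /* UE status bit 12: Poisoned TLP */
#define UNC_COMP_TIME   0x00004000u /* UE status bit 14: Completion Timeout */
#define UNC_COMP_ABORT  0x00008000u /* UE status bit 15: Completer Abort */
#define UNC_UNX_COMP    0x00010000u /* UE status bit 16: Unexpected Completion */
#define UNC_UNSUP       0x00100000u /* UE status bit 20: Unsupported Request */

/* Assumption: the UE types that may be signaled as advisory. */
#define ANFE_CANDIDATES (UNC_POISON_TLP | UNC_COMP_TIME | UNC_COMP_ABORT | \
                         UNC_UNX_COMP | UNC_UNSUP)

/*
 * Hypothetical helper: given snapshots of the CE status register and the
 * UE status, mask and severity registers, return the UE status bits that
 * may belong to an ANFE. A bit qualifies only if the CE status reports
 * Advisory Non-Fatal and the UE bit is set, unmasked, configured as
 * non-fatal, and one of the candidate error types.
 */
static uint32_t anfe_candidate_bits(uint32_t cor_status, uint32_t uncor_status,
                                    uint32_t uncor_mask, uint32_t uncor_severity)
{
    if (!(cor_status & COR_ADV_NFAT))
        return 0;

    return uncor_status & ~uncor_mask & ~uncor_severity & ANFE_CANDIDATES;
}
```

For example, with the CE status 0x00002000 from the first log, an assumed UE status of 0x00001000 (Poisoned TLP; the log does not show this register), the UE mask 0x00180020 from the second log, and an assumed severity of 0x00040000, the helper returns 0x00001000. Those are the bits that would be saved alongside the CE status and cleared together with it.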
>
>Now, for the previous scenario, both CE status and related UE status will
>be reported and cleared after ANFE:
>
>  AER: Correctable error message received from 0000:b7:02.0
>  PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>    device [8086:0db0] error status/mask=00002000/00000000
>     [13] NonFatalErr
>    Uncorrectable errors that may cause Advisory Non-Fatal:
>     [18] TLP
>
>Note:
>checkpatch.pl will produce following warnings on PATCH2/3:
>
>WARNING: 'UE' may be misspelled - perhaps 'USE'?
>#22:
>uncorrectable error(UE) status should be cleared. However, there is no
>
>...similar warnings omitted...
>
>This is a false-positive, so not fixed.
>
>WARNING: Prefer a maximum 75 chars per line (possible unwrapped commit
>description?)
>#10:
>  PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>
>...similar warnings omitted...
>
>For readability reasons, these warnings are not fixed.
>
>[1] https://lore.kernel.org/linux-pci/20240125062802.50819-1-qingshun.wang@linux.intel.com
>
>Thanks
>Qingshun, Zhenzhong
>
>Changelog:
>v4:
> - Fix a race in anfe_get_uc_status() (Jonathan)
> - Add a comment to explain side effect of processing ANFE as NFE (Jonathan)
> - Drop the check for PCI_EXP_DEVSTA_NFED
>
>v3:
> - Split ANFE print and processing to two patches (Bjorn)
> - Simplify ANFE handling, drop trace event
> - Polish comments and patch description
> - Add Tested-by
>
>v2:
> - Reference to the latest PCIe Specification in both commit messages
>   and comments, as suggested by Bjorn Helgaas.
> - Describe the reason for storing additional information in
>   aer_err_info in the commit message of PATCH 1, as suggested by Bjorn
>   Helgaas.
> - Add more details of behavior changes in the commit message of PATCH
>   2, as suggested by Bjorn Helgaas.
>
>v3: https://lore.kernel.org/lkml/20240417061407.1491361-1-zhenzhong.duan@intel.com
>v2: https://lore.kernel.org/linux-pci/20240125062802.50819-1-qingshun.wang@linux.intel.com
>v1: https://lore.kernel.org/linux-pci/20240111073227.31488-1-qingshun.wang@linux.intel.com
>
>Zhenzhong Duan (3):
>  PCI/AER: Store UNCOR_STATUS bits that might be ANFE in aer_err_info
>  PCI/AER: Print UNCOR_STATUS bits that might be ANFE
>  PCI/AER: Clear UNCOR_STATUS bits that might be ANFE
>
> drivers/pci/pci.h      |  1 +
> drivers/pci/pcie/aer.c | 75 +++++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 75 insertions(+), 1 deletion(-)
>
>--
>2.34.1