Message ID | 20230803230129.127590-2-Smita.KoralahalliChannabasappa@amd.com |
---|---|
State | Superseded |
Headers | show |
Series | PCI/AER, CXL: Fix appropriate _OSC check for CXL RAS Cap | expand |
On Thu, 3 Aug 2023 23:01:27 +0000 Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> wrote: > According to Section 9.17.2, Table 9-26 of CXL Specification [1], owner > of AER should also own CXL Protocol Error Management as there is no > explicit control of CXL Protocol error. And the CXL RAS Cap registers > reported on Protocol errors should check for AER _OSC rather than CXL > Memory Error Reporting Control _OSC. > > The CXL Memory Error Reporting Control _OSC specifically highlights > handling Memory Error Logging and Signaling Enhancements. These kinds of > errors are reported through a device's mailbox and can be managed > independently from CXL Protocol Errors. > > This change fixes handling and reporting CXL Protocol Errors and RAS > registers natively with native AER and FW-First CXL Memory Error Reporting > Control. > > [1] Compute Express Link (CXL) Specification, Revision 3.1, Aug 1 2022. > > Fixes: 248529edc86f ("cxl: add RAS status unmasking for CXL") > Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> > Reviewed-by: Robert Richter <rrichter@amd.com> I'd be tempted to add a comment on why this returns 0 rather than an error. I think that makes sense but it isn't immediately obvious from the local context. Otherwise LGTM Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> > --- > v2: > Added fixes tag. > Included what the patch fixes in commit message. > v3: > Added "Reviewed-by" tag. > --- > drivers/cxl/pci.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c > index 1cb1494c28fe..2323169b6e5f 100644 > --- a/drivers/cxl/pci.c > +++ b/drivers/cxl/pci.c > @@ -541,9 +541,9 @@ static int cxl_pci_ras_unmask(struct pci_dev *pdev) > return 0; > } > > - /* BIOS has CXL error control */ > - if (!host_bridge->native_cxl_error) > - return -ENXIO; > + /* BIOS has PCIe AER error control */ > + if (!host_bridge->native_aer) > + return 0; > > rc = pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &cap); > if (rc)
On 8/4/23 05:09, Jonathan Cameron wrote: > On Thu, 3 Aug 2023 23:01:27 +0000 > Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> wrote: > >> According to Section 9.17.2, Table 9-26 of CXL Specification [1], owner >> of AER should also own CXL Protocol Error Management as there is no >> explicit control of CXL Protocol error. And the CXL RAS Cap registers >> reported on Protocol errors should check for AER _OSC rather than CXL >> Memory Error Reporting Control _OSC. >> >> The CXL Memory Error Reporting Control _OSC specifically highlights >> handling Memory Error Logging and Signaling Enhancements. These kinds of >> errors are reported through a device's mailbox and can be managed >> independently from CXL Protocol Errors. >> >> This change fixes handling and reporting CXL Protocol Errors and RAS >> registers natively with native AER and FW-First CXL Memory Error Reporting >> Control. >> >> [1] Compute Express Link (CXL) Specification, Revision 3.1, Aug 1 2022. >> >> Fixes: 248529edc86f ("cxl: add RAS status unmasking for CXL") >> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> >> Reviewed-by: Robert Richter <rrichter@amd.com> > > I'd be tempted to add a comment on why this returns 0 rather than an > error. I think that makes sense but it isn't immediately obvious from > the local context. > > Otherwise LGTM > > Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Echo Jonathan's comment. Reviewed-by: Dave Jiang <dave.jiang@intel.com> > > >> --- >> v2: >> Added fixes tag. >> Included what the patch fixes in commit message. >> v3: >> Added "Reviewed-by" tag. >> --- >> drivers/cxl/pci.c | 6 +++--- >> 1 file changed, 3 insertions(+), 3 deletions(-) >> >> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c >> index 1cb1494c28fe..2323169b6e5f 100644 >> --- a/drivers/cxl/pci.c >> +++ b/drivers/cxl/pci.c >> @@ -541,9 +541,9 @@ static int cxl_pci_ras_unmask(struct pci_dev *pdev) >> return 0; >> } >> >> - /* BIOS has CXL error control */ >> - if (!host_bridge->native_cxl_error) >> - return -ENXIO; >> + /* BIOS has PCIe AER error control */ >> + if (!host_bridge->native_aer) >> + return 0; >> >> rc = pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &cap); >> if (rc) >
On 8/16/2023 11:06 AM, Dave Jiang wrote: > > > On 8/4/23 05:09, Jonathan Cameron wrote: >> On Thu, 3 Aug 2023 23:01:27 +0000 >> Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> wrote: >> >>> According to Section 9.17.2, Table 9-26 of CXL Specification [1], owner >>> of AER should also own CXL Protocol Error Management as there is no >>> explicit control of CXL Protocol error. And the CXL RAS Cap registers >>> reported on Protocol errors should check for AER _OSC rather than CXL >>> Memory Error Reporting Control _OSC. >>> >>> The CXL Memory Error Reporting Control _OSC specifically highlights >>> handling Memory Error Logging and Signaling Enhancements. These kinds of >>> errors are reported through a device's mailbox and can be managed >>> independently from CXL Protocol Errors. >>> >>> This change fixes handling and reporting CXL Protocol Errors and RAS >>> registers natively with native AER and FW-First CXL Memory Error >>> Reporting >>> Control. >>> >>> [1] Compute Express Link (CXL) Specification, Revision 3.1, Aug 1 2022. >>> >>> Fixes: 248529edc86f ("cxl: add RAS status unmasking for CXL") >>> Signed-off-by: Smita Koralahalli >>> <Smita.KoralahalliChannabasappa@amd.com> >>> Reviewed-by: Robert Richter <rrichter@amd.com> >> >> I'd be tempted to add a comment on why this returns 0 rather than an >> error. I think that makes sense but it isn't immediately obvious from >> the local context. >> >> Otherwise LGTM >> >> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > Echo Jonathan's comment. > > Reviewed-by: Dave Jiang <dave.jiang@intel.com> Yes, and Dan is probably against returning error code. https://lore.kernel.org/all/64d1b3e78629f_5ea6e2944@dwillia2-xfh.jf.intel.com.notmuch/ But I think returning zero is required as we don't want to interfere with cxl device access when operating in native cxl memory error reporting. Returning error code will basically fail cxl_pci_probe() and thus fail to create a cxl device node. I was thinking a single line comment as: "Return zero to not block the communication with the cxl device when in native memory error reporting mode". Agree? Or anything more that needs to be added? Thanks, Smita > >> >> >>> --- >>> v2: >>> Added fixes tag. >>> Included what the patch fixes in commit message. >>> v3: >>> Added "Reviewed-by" tag. >>> --- >>> drivers/cxl/pci.c | 6 +++--- >>> 1 file changed, 3 insertions(+), 3 deletions(-) >>> >>> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c >>> index 1cb1494c28fe..2323169b6e5f 100644 >>> --- a/drivers/cxl/pci.c >>> +++ b/drivers/cxl/pci.c >>> @@ -541,9 +541,9 @@ static int cxl_pci_ras_unmask(struct pci_dev *pdev) >>> return 0; >>> } >>> - /* BIOS has CXL error control */ >>> - if (!host_bridge->native_cxl_error) >>> - return -ENXIO; >>> + /* BIOS has PCIe AER error control */ >>> + if (!host_bridge->native_aer) >>> + return 0; >>> rc = pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &cap); >>> if (rc) >>
On 8/16/2023 2:33 PM, Smita Koralahalli wrote: > On 8/16/2023 11:06 AM, Dave Jiang wrote: >> >> >> On 8/4/23 05:09, Jonathan Cameron wrote: >>> On Thu, 3 Aug 2023 23:01:27 +0000 >>> Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> wrote: >>> >>>> According to Section 9.17.2, Table 9-26 of CXL Specification [1], owner >>>> of AER should also own CXL Protocol Error Management as there is no >>>> explicit control of CXL Protocol error. And the CXL RAS Cap registers >>>> reported on Protocol errors should check for AER _OSC rather than CXL >>>> Memory Error Reporting Control _OSC. >>>> >>>> The CXL Memory Error Reporting Control _OSC specifically highlights >>>> handling Memory Error Logging and Signaling Enhancements. These >>>> kinds of >>>> errors are reported through a device's mailbox and can be managed >>>> independently from CXL Protocol Errors. >>>> >>>> This change fixes handling and reporting CXL Protocol Errors and RAS >>>> registers natively with native AER and FW-First CXL Memory Error >>>> Reporting >>>> Control. >>>> >>>> [1] Compute Express Link (CXL) Specification, Revision 3.1, Aug 1 2022. >>>> >>>> Fixes: 248529edc86f ("cxl: add RAS status unmasking for CXL") >>>> Signed-off-by: Smita Koralahalli >>>> <Smita.KoralahalliChannabasappa@amd.com> >>>> Reviewed-by: Robert Richter <rrichter@amd.com> >>> >>> I'd be tempted to add a comment on why this returns 0 rather than an >>> error. I think that makes sense but it isn't immediately obvious from >>> the local context. >>> >>> Otherwise LGTM >>> >>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> >> >> Echo Jonathan's comment. >> >> Reviewed-by: Dave Jiang <dave.jiang@intel.com> > > Yes, and Dan is probably against returning error code. Against returning zero. My bad sorry! > https://lore.kernel.org/all/64d1b3e78629f_5ea6e2944@dwillia2-xfh.jf.intel.com.notmuch/ > > > But I think returning zero is required as we don't want to interfere > with cxl device access when operating in native cxl memory error > reporting. Returning error code will basically fail cxl_pci_probe() and > thus fail to create a cxl device node. > > I was thinking a single line comment as: "Return zero to not block the > communication with the cxl device when in native memory error reporting > mode". > > Agree? Or anything more that needs to be added? > > Thanks, > Smita >> >>> >>> >>>> --- >>>> v2: >>>> Added fixes tag. >>>> Included what the patch fixes in commit message. >>>> v3: >>>> Added "Reviewed-by" tag. >>>> --- >>>> drivers/cxl/pci.c | 6 +++--- >>>> 1 file changed, 3 insertions(+), 3 deletions(-) >>>> >>>> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c >>>> index 1cb1494c28fe..2323169b6e5f 100644 >>>> --- a/drivers/cxl/pci.c >>>> +++ b/drivers/cxl/pci.c >>>> @@ -541,9 +541,9 @@ static int cxl_pci_ras_unmask(struct pci_dev *pdev) >>>> return 0; >>>> } >>>> - /* BIOS has CXL error control */ >>>> - if (!host_bridge->native_cxl_error) >>>> - return -ENXIO; >>>> + /* BIOS has PCIe AER error control */ >>>> + if (!host_bridge->native_aer) >>>> + return 0; >>>> rc = pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &cap); >>>> if (rc) >>> >
On 8/16/23 14:33, Smita Koralahalli wrote: > On 8/16/2023 11:06 AM, Dave Jiang wrote: >> >> >> On 8/4/23 05:09, Jonathan Cameron wrote: >>> On Thu, 3 Aug 2023 23:01:27 +0000 >>> Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> wrote: >>> >>>> According to Section 9.17.2, Table 9-26 of CXL Specification [1], owner >>>> of AER should also own CXL Protocol Error Management as there is no >>>> explicit control of CXL Protocol error. And the CXL RAS Cap registers >>>> reported on Protocol errors should check for AER _OSC rather than CXL >>>> Memory Error Reporting Control _OSC. >>>> >>>> The CXL Memory Error Reporting Control _OSC specifically highlights >>>> handling Memory Error Logging and Signaling Enhancements. These >>>> kinds of >>>> errors are reported through a device's mailbox and can be managed >>>> independently from CXL Protocol Errors. >>>> >>>> This change fixes handling and reporting CXL Protocol Errors and RAS >>>> registers natively with native AER and FW-First CXL Memory Error >>>> Reporting >>>> Control. >>>> >>>> [1] Compute Express Link (CXL) Specification, Revision 3.1, Aug 1 2022. >>>> >>>> Fixes: 248529edc86f ("cxl: add RAS status unmasking for CXL") >>>> Signed-off-by: Smita Koralahalli >>>> <Smita.KoralahalliChannabasappa@amd.com> >>>> Reviewed-by: Robert Richter <rrichter@amd.com> >>> >>> I'd be tempted to add a comment on why this returns 0 rather than an >>> error. I think that makes sense but it isn't immediately obvious from >>> the local context. >>> >>> Otherwise LGTM >>> >>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> >> >> Echo Jonathan's comment. >> >> Reviewed-by: Dave Jiang <dave.jiang@intel.com> > > Yes, and Dan is probably against returning error code. > https://lore.kernel.org/all/64d1b3e78629f_5ea6e2944@dwillia2-xfh.jf.intel.com.notmuch/ > > But I think returning zero is required as we don't want to interfere > with cxl device access when operating in native cxl memory error > reporting. Returning error code will basically fail cxl_pci_probe() and > thus fail to create a cxl device node. > > I was thinking a single line comment as: "Return zero to not block the > communication with the cxl device when in native memory error reporting > mode". Looks reasonable to me. Unless Dan feels it needs to fail the probe. > > Agree? Or anything more that needs to be added? > > Thanks, > Smita >> >>> >>> >>>> --- >>>> v2: >>>> Added fixes tag. >>>> Included what the patch fixes in commit message. >>>> v3: >>>> Added "Reviewed-by" tag. >>>> --- >>>> drivers/cxl/pci.c | 6 +++--- >>>> 1 file changed, 3 insertions(+), 3 deletions(-) >>>> >>>> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c >>>> index 1cb1494c28fe..2323169b6e5f 100644 >>>> --- a/drivers/cxl/pci.c >>>> +++ b/drivers/cxl/pci.c >>>> @@ -541,9 +541,9 @@ static int cxl_pci_ras_unmask(struct pci_dev *pdev) >>>> return 0; >>>> } >>>> - /* BIOS has CXL error control */ >>>> - if (!host_bridge->native_cxl_error) >>>> - return -ENXIO; >>>> + /* BIOS has PCIe AER error control */ >>>> + if (!host_bridge->native_aer) >>>> + return 0; >>>> rc = pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &cap); >>>> if (rc) >>> >
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c index 1cb1494c28fe..2323169b6e5f 100644 --- a/drivers/cxl/pci.c +++ b/drivers/cxl/pci.c @@ -541,9 +541,9 @@ static int cxl_pci_ras_unmask(struct pci_dev *pdev) return 0; } - /* BIOS has CXL error control */ - if (!host_bridge->native_cxl_error) - return -ENXIO; + /* BIOS has PCIe AER error control */ + if (!host_bridge->native_aer) + return 0; rc = pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &cap); if (rc)