diff mbox series

[1/1] cxl/pci: Skip to handle RAS errors if CXL.mem device is detached

Message ID 20240125081414.2189572-1-ming4.li@intel.com
State Superseded
Headers show
Series [1/1] cxl/pci: Skip to handle RAS errors if CXL.mem device is detached | expand

Commit Message

Li, Ming4 Jan. 25, 2024, 8:14 a.m. UTC
CXL.mem protocol errors are logged in CXL RAS capability, if CXL.mem
device is unbound from CXL.mem driver, will not expect any CXL.mem
protocol errors happen on the endpoint or the dport connected to the
endpoint. Giving up these unexpected errors to avoid error handler to
access unmapped RCH dport's RAS capability. The error handler of CXL PCI
device helps to handle RAS errors happened on RCH dport. The host of the
RCH dport's RAS capability mapping is CXL.mem device, so the error
handler will access unmapped RCH dport's RAS capability after CXL.mem
device is unbound from the CXL.mem driver.

Fixes: 6ac07883dbb5 ("cxl/pci: Add RCH downstream port error logging")
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Li Ming <ming4.li@intel.com>
---
 drivers/cxl/core/pci.c | 43 ++++++++++++++++++++++++++++++------------
 1 file changed, 31 insertions(+), 12 deletions(-)

Comments

Dan Williams Jan. 26, 2024, 6:37 a.m. UTC | #1
Li Ming wrote:
> CXL.mem protocol errors are logged in CXL RAS capability, if CXL.mem
> device is unbound from CXL.mem driver, will not expect any CXL.mem
> protocol errors happen on the endpoint or the dport connected to the
> endpoint. Giving up these unexpected errors to avoid error handler to
> access unmapped RCH dport's RAS capability. The error handler of CXL PCI
> device helps to handle RAS errors happened on RCH dport. The host of the
> RCH dport's RAS capability mapping is CXL.mem device, so the error
> handler will access unmapped RCH dport's RAS capability after CXL.mem
> device is unbound from the CXL.mem driver.

Thanks for this Li Ming!

I am going to reword this to add more context:

---
The PCI AER model is an awkward fit for CXL error handling. While the
expectation is that a PCI device can escalate to link reset to recover
from an AER event, the same reset on CXL amounts to a suprise memory
hotplug of massive amounts of memory.

At present, the CXL error handler attempts some optimisitic error
handling to unbind the device from the cxl_mem driver after reaping some
RAS register values. This results in a "hopeful" attempt to unplug the
memory, but there is no guarantee that will succeed.

A subsequent AER notification after the memdev unbind event can no
longer assume the registers are mapped. Check for memdev bind before
reaping status register values to avoid crashes of the form:

 RIP: 0010:__cxl_handle_ras+0x30/0x110 [cxl_core]
 Call Trace:
  <TASK>
  cxl_handle_rp_ras+0xbc/0xd0 [cxl_core]
  cxl_error_detected+0x6c/0xf0 [cxl_core]
  report_error_detected+0xc7/0x1c0
  ? __pfx_report_frozen_detected+0x10/0x10
  pci_walk_bus+0x73/0x90
  pcie_do_recovery+0x23f/0x330

Longer term, the unbind and PCI_ERS_RESULT_DISCONNECT behavior might
need to be replaced with a new PCI_ERS_RESULT_PANIC.
---

> Fixes: 6ac07883dbb5 ("cxl/pci: Add RCH downstream port error logging")
> Suggested-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Li Ming <ming4.li@intel.com>
Bowman, Terry Jan. 26, 2024, 2:04 p.m. UTC | #2
Hi Li and Dan,

I added comment below.

On 1/26/2024 12:37 AM, Dan Williams wrote:
> Li Ming wrote:
>> CXL.mem protocol errors are logged in CXL RAS capability, if CXL.mem
>> device is unbound from CXL.mem driver, will not expect any CXL.mem
>> protocol errors happen on the endpoint or the dport connected to the
>> endpoint. Giving up these unexpected errors to avoid error handler to
>> access unmapped RCH dport's RAS capability. The error handler of CXL PCI
>> device helps to handle RAS errors happened on RCH dport. The host of the
>> RCH dport's RAS capability mapping is CXL.mem device, so the error
>> handler will access unmapped RCH dport's RAS capability after CXL.mem
>> device is unbound from the CXL.mem driver.
> Thanks for this Li Ming!
>
> I am going to reword this to add more context:
>
> ---
> The PCI AER model is an awkward fit for CXL error handling. While the
> expectation is that a PCI device can escalate to link reset to recover
> from an AER event, the same reset on CXL amounts to a suprise memory
> hotplug of massive amounts of memory.
>
> At present, the CXL error handler attempts some optimisitic error
> handling to unbind the device from the cxl_mem driver after reaping some
> RAS register values. This results in a "hopeful" attempt to unplug the
> memory, but there is no guarantee that will succeed.
>
> A subsequent AER notification after the memdev unbind event can no
> longer assume the registers are mapped. Check for memdev bind before
> reaping status register values to avoid crashes of the form:
>
>   RIP: 0010:__cxl_handle_ras+0x30/0x110 [cxl_core]
>   Call Trace:
>    <TASK>
>    cxl_handle_rp_ras+0xbc/0xd0 [cxl_core]
>    cxl_error_detected+0x6c/0xf0 [cxl_core]
>    report_error_detected+0xc7/0x1c0
>    ? __pfx_report_frozen_detected+0x10/0x10
>    pci_walk_bus+0x73/0x90
>    pcie_do_recovery+0x23f/0x330

report_error_detected() includes the same "if (dev->driver)" check before calling the device's err_handler(). The same check again in the CXL device error handler increases the chances of catching the surprise unbind case but not by much.

Regards, Terry


> Longer term, the unbind and PCI_ERS_RESULT_DISCONNECT behavior might
> need to be replaced with a new PCI_ERS_RESULT_PANIC.
> ---
>
>> Fixes: 6ac07883dbb5 ("cxl/pci: Add RCH downstream port error logging")
>> Suggested-by: Dan Williams <dan.j.williams@intel.com>
>> Signed-off-by: Li Ming <ming4.li@intel.com>
Dan Williams Jan. 27, 2024, 3:05 a.m. UTC | #3
Bowman, Terry wrote:
> Hi Li and Dan,
> 
> I added comment below.
> 
> On 1/26/2024 12:37 AM, Dan Williams wrote:
> > Li Ming wrote:
> >> CXL.mem protocol errors are logged in CXL RAS capability, if CXL.mem
> >> device is unbound from CXL.mem driver, will not expect any CXL.mem
> >> protocol errors happen on the endpoint or the dport connected to the
> >> endpoint. Giving up these unexpected errors to avoid error handler to
> >> access unmapped RCH dport's RAS capability. The error handler of CXL PCI
> >> device helps to handle RAS errors happened on RCH dport. The host of the
> >> RCH dport's RAS capability mapping is CXL.mem device, so the error
> >> handler will access unmapped RCH dport's RAS capability after CXL.mem
> >> device is unbound from the CXL.mem driver.
> > Thanks for this Li Ming!
> >
> > I am going to reword this to add more context:
> >
> > ---
> > The PCI AER model is an awkward fit for CXL error handling. While the
> > expectation is that a PCI device can escalate to link reset to recover
> > from an AER event, the same reset on CXL amounts to a suprise memory
> > hotplug of massive amounts of memory.
> >
> > At present, the CXL error handler attempts some optimisitic error
> > handling to unbind the device from the cxl_mem driver after reaping some
> > RAS register values. This results in a "hopeful" attempt to unplug the
> > memory, but there is no guarantee that will succeed.
> >
> > A subsequent AER notification after the memdev unbind event can no
> > longer assume the registers are mapped. Check for memdev bind before
> > reaping status register values to avoid crashes of the form:
> >
> >   RIP: 0010:__cxl_handle_ras+0x30/0x110 [cxl_core]
> >   Call Trace:
> >    <TASK>
> >    cxl_handle_rp_ras+0xbc/0xd0 [cxl_core]
> >    cxl_error_detected+0x6c/0xf0 [cxl_core]
> >    report_error_detected+0xc7/0x1c0
> >    ? __pfx_report_frozen_detected+0x10/0x10
> >    pci_walk_bus+0x73/0x90
> >    pcie_do_recovery+0x23f/0x330
> 
> report_error_detected() includes the same "if (dev->driver)" check
> before calling the device's err_handler(). The same check again in the
> CXL device error handler increases the chances of catching the
> surprise unbind case but not by much.

So report_error_detected() is checking if pdev->dev.driver is NULL, in
this case we are checking whether *cxlmd->dev.driver is NULL*, where
cxlmd->dev.parent == pdev.

In other words when cxl_pci sees an error it tries to keep the CXL.io up
and running while shutting down the CXL.mem side, but it's not clear if
that is just making a bad situation worse. So might need a follow-up to
just panic() rather than hope that unbinding the cxl_memdev does
anything useful.
diff mbox series

Patch

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 6c9c8d92f8f7..480489f5644e 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -932,11 +932,21 @@  static void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds) { }
 void cxl_cor_error_detected(struct pci_dev *pdev)
 {
 	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
+	struct device *dev = &cxlds->cxlmd->dev;
+
+	scoped_guard(device, dev) {
+		if (!dev->driver) {
+			dev_warn(&pdev->dev,
+				 "%s: memdev disabled, abort error handling\n",
+				 dev_name(dev));
+			return;
+		}
 
-	if (cxlds->rcd)
-		cxl_handle_rdport_errors(cxlds);
+		if (cxlds->rcd)
+			cxl_handle_rdport_errors(cxlds);
 
-	cxl_handle_endpoint_cor_ras(cxlds);
+		cxl_handle_endpoint_cor_ras(cxlds);
+	}
 }
 EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, CXL);
 
@@ -948,16 +958,25 @@  pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
 	struct device *dev = &cxlmd->dev;
 	bool ue;
 
-	if (cxlds->rcd)
-		cxl_handle_rdport_errors(cxlds);
+	scoped_guard(device, dev) {
+		if (!dev->driver) {
+			dev_warn(&pdev->dev,
+				 "%s: memdev disabled, abort error handling\n",
+				 dev_name(dev));
+			return PCI_ERS_RESULT_DISCONNECT;
+		}
+
+		if (cxlds->rcd)
+			cxl_handle_rdport_errors(cxlds);
+		/*
+		 * A frozen channel indicates an impending reset which is fatal to
+		 * CXL.mem operation, and will likely crash the system. On the off
+		 * chance the situation is recoverable dump the status of the RAS
+		 * capability registers and bounce the active state of the memdev.
+		 */
+		ue = cxl_handle_endpoint_ras(cxlds);
+	}
 
-	/*
-	 * A frozen channel indicates an impending reset which is fatal to
-	 * CXL.mem operation, and will likely crash the system. On the off
-	 * chance the situation is recoverable dump the status of the RAS
-	 * capability registers and bounce the active state of the memdev.
-	 */
-	ue = cxl_handle_endpoint_ras(cxlds);
 
 	switch (state) {
 	case pci_channel_io_normal: