diff mbox series

cxl/core/port: defer probe when memdev fails to find correct port

Message ID 20240913183234.17302-1-gourry@gourry.net
State New
Headers show
Series cxl/core/port: defer probe when memdev fails to find correct port | expand

Commit Message

Gregory Price Sept. 13, 2024, 6:32 p.m. UTC
Depending on device/hierarchy readiness, it can be possible for the
async probe process to attempt to register an endpoint before the
entire port hierarchy is ready.  This currently fails with -ENXIO.

Return -EPROBE_DEFER to try again later automatically (which is
what the local comments already say we should do anyway).

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 drivers/cxl/core/port.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Dan Williams Sept. 14, 2024, 4:32 a.m. UTC | #1
Gregory Price wrote:
> Depending on device/hierarchy readiness, it can be possible for the
> async probe process to attempt to register an endpoint before the
> entire port hierarchy is ready.  This currently fails with -ENXIO.
> 
> Return -EPROBE_DEFER to try again later automatically (which is
> what the local comments already say we should do anyway).

I want to make sure this is not papering over some other issue. Can you
post the final topology when this works (cxl list -BPET)? My working
theory is that you have 2 devices that share an intermediate port.
Otherwise, I am having a hard time understanding why the
cxl_bus_rescan() in cxl_acpi_probe() does not obviate the explicit
EPROBE_DEFER.

So, devA is dependendent on devB to create a common port, but devA loses
that race after cxl_bus_rescan() has already run. Then EBPROBE_DEFER is
the right answer to trigger devA to try again.
Gregory Price Sept. 18, 2024, 11:59 a.m. UTC | #2
On Fri, Sep 13, 2024 at 09:32:48PM -0700, Dan Williams wrote:
> Gregory Price wrote:
> > Depending on device/hierarchy readiness, it can be possible for the
> > async probe process to attempt to register an endpoint before the
> > entire port hierarchy is ready.  This currently fails with -ENXIO.
> > 
> > Return -EPROBE_DEFER to try again later automatically (which is
> > what the local comments already say we should do anyway).
> 
> I want to make sure this is not papering over some other issue. Can you
> post the final topology when this works (cxl list -BPET)? My working
> theory is that you have 2 devices that share an intermediate port.
> Otherwise, I am having a hard time understanding why the
> cxl_bus_rescan() in cxl_acpi_probe() does not obviate the explicit
> EPROBE_DEFER.
>

Sorry for the delay

[
  {
    "bus":"root0",
    "provider":"ACPI.CXL",
    "nr_dports":4,
    "dports":[
      {
        "dport":"pci0000:e0",
        "alias":"ACPI0016:00",
        "id":7
      },
      {
        "dport":"pci0000:00",
        "alias":"ACPI0016:01",
        "id":0
      },
      {
        "dport":"pci0000:c0",
        "alias":"ACPI0016:02",
        "id":6
      },
      {
        "dport":"pci0000:20",
        "alias":"ACPI0016:03",
        "id":1
      }
    ],
    "ports:root0":[
      {
        "port":"port1",
        "host":"pci0000:e0",
        "depth":1,
        "decoders_committed":2,
        "nr_dports":4,
        "dports":[
          {
            "dport":"0000:e0:07.2",
            "alias":"device:16",
            "id":114
          },
          {
            "dport":"0000:e0:01.1",
            "alias":"device:02",
            "id":0
          },
          {
            "dport":"0000:e0:01.3",
            "alias":"device:05",
            "id":2
          },
          {
            "dport":"0000:e0:07.1",
            "alias":"device:0d",
            "id":113
          }
        ],
        "endpoints:port1":[
          {
            "endpoint":"endpoint5",
            "host":"mem0",
            "parent_dport":"0000:e0:01.1",
            "depth":2,
            "decoders_committed":1
          }
        ]
      },
      {
        "port":"port3",
        "host":"pci0000:c0",
        "depth":1,
        "decoders_committed":2,
        "nr_dports":1,
        "dports":[
          {
            "dport":"0000:c0:01.1",
            "alias":"device:c3",
            "id":0
          }
        ],
        "endpoints:port3":[
          {
            "endpoint":"endpoint6",
            "host":"mem1",
            "parent_dport":"0000:c0:01.1",
            "depth":2,
            "decoders_committed":1
          }
        ]
      },
      {
        "port":"port2",
        "host":"pci0000:00",
        "depth":1,
        "decoders_committed":0,
        "nr_dports":2,
        "dports":[
          {
            "dport":"0000:00:01.3",
            "alias":"device:55",
            "id":2
          },
          {
            "dport":"0000:00:07.1",
            "alias":"device:5d",
            "id":113
          }
        ]
      },
      {
        "port":"port4",
        "host":"pci0000:20",
        "depth":1,
        "decoders_committed":0,
        "nr_dports":1,
        "dports":[
          {
            "dport":"0000:20:01.1",
            "alias":"device:d0",
            "id":0
          }
        ]
      }
    ]
  }
]
 
> So, devA is dependendent on devB to create a common port, but devA loses
> that race after cxl_bus_rescan() has already run. Then EBPROBE_DEFER is
> the right answer to trigger devA to try again.
Gregory Price Oct. 1, 2024, 3:20 p.m. UTC | #3
On Wed, Sep 18, 2024 at 01:59:07PM +0200, Gregory Price wrote:
> On Fri, Sep 13, 2024 at 09:32:48PM -0700, Dan Williams wrote:
> > Gregory Price wrote:
> > > Depending on device/hierarchy readiness, it can be possible for the
> > > async probe process to attempt to register an endpoint before the
> > > entire port hierarchy is ready.  This currently fails with -ENXIO.
> > > 
> > > Return -EPROBE_DEFER to try again later automatically (which is
> > > what the local comments already say we should do anyway).
> > 
> > I want to make sure this is not papering over some other issue. Can you
> > post the final topology when this works (cxl list -BPET)? My working
> > theory is that you have 2 devices that share an intermediate port.
> > Otherwise, I am having a hard time understanding why the
> > cxl_bus_rescan() in cxl_acpi_probe() does not obviate the explicit
> > EPROBE_DEFER.
> >

just reporting back with a fully functional layout - they do not appear to
share an intermediate port, unless you consider the root a shared port.

I see your concern about this papering over another issue, but it's not
clear what to look for or how to look for it at this point.

For what it's worth - another group observed the same issue with different
hardware and produced the same patch.

    "ports:root0":[
      {
        "port":"port1",
        "host":"pci0000:e0",
        "depth":1,
        "decoders_committed":1,
        "nr_dports":4,
        "dports":[
          {
            "dport":"0000:e0:07.2",
            "alias":"device:16",
            "id":114
          },
          {
            "dport":"0000:e0:01.1",
            "alias":"device:02",
            "id":0
          },
          {
            "dport":"0000:e0:01.3",
            "alias":"device:05",
            "id":2
          },
          {
            "dport":"0000:e0:07.1",
            "alias":"device:0d",
            "id":113
          }
        ],
        "endpoints:port1":[
          {
            "endpoint":"endpoint5",
            "host":"mem0",
            "parent_dport":"0000:e0:01.1",
            "depth":2,
            "decoders_committed":1
          }
        ]
      },
      {
        "port":"port3",
        "host":"pci0000:c0",
        "depth":1,
        "decoders_committed":2,
        "nr_dports":1,
        "dports":[
          {
            "dport":"0000:c0:01.1",
            "alias":"device:c3",
            "id":0
          }
        ],
        "endpoints:port3":[
          {
            "endpoint":"endpoint6",
            "host":"mem1",
            "parent_dport":"0000:c0:01.1",
            "depth":2,
            "decoders_committed":2
          }
        ]
      },
    ]

> > So, devA is dependendent on devB to create a common port, but devA loses
> > that race after cxl_bus_rescan() has already run. Then EBPROBE_DEFER is
> > the right answer to trigger devA to try again.
diff mbox series

Patch

diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 1d5007e3795a..d6bebf70d142 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -1553,7 +1553,7 @@  static int add_port_attach_ep(struct cxl_memdev *cxlmd,
 		 */
 		dev_dbg(&cxlmd->dev, "%s is a root dport\n",
 			dev_name(dport_dev));
-		return -ENXIO;
+		return -EPROBE_DEFER;
 	}
 
 	parent_port = find_cxl_port(dparent, &parent_dport);