diff mbox series

[BUG] Root port fails to match with port driver on non-RCH topology

Message ID ZIs5Ra+9SGGbUZoZ@memverge.com
State New, archived
Headers show
Series [BUG] Root port fails to match with port driver on non-RCH topology | expand

Commit Message

Gregory Price June 15, 2023, 4:16 p.m. UTC
On boot in an AMD Genoa production system with EFI_MEMORY_SP set, the
root port fails to enable port topology.  Notably, this is not an RCH
system, so i'm not sure this code path has been tested previously.


Nebulous message in dmesg:

[   15.589072] cxl_mem mem0: CXL port topology root0 not enabled

A bit non-obvious what is occuring here, this occurs during endpoint
allocation, at this point is cxl_mem_probe:

drivers/cxl/mem.c
static int cxl_mem_probe(struct device *dev)
{
...
    if (dport->rch)
        endpoint_parent = parent_port->uport;
    else
        endpoint_parent = &parent_port->dev;

    device_lock(endpoint_parent);
    if (!endpoint_parent->driver) {
        dev_err(dev, "CXL port topology %s not enabled\n",
            dev_name(endpoint_parent));
        rc = -ENXIO;
        goto unlock;
    }
...
}


endpoint_parent->driver is NULL.  In this case, endpoint driver traces
back to root0 not having it's driver set.  This occurs only when the
type matching fails:

drivers/cxl/core/port.c
static int cxl_bus_match(struct device *dev, struct device_driver *drv)
{
    return cxl_device_id(dev) == to_cxl_drv(drv)->id;
}


As it turns out, this can never return true for the root port.

drivers/cxl/core/port.c
static int cxl_device_id(const struct device *dev)
{
...
    if (is_cxl_port(dev)) {
        if (is_cxl_root(to_cxl_port(dev)))
            return CXL_DEVICE_ROOT;
        return CXL_DEVICE_PORT;
    }
...
}

drivers/cxl/port.c
static struct cxl_driver cxl_port_driver = {
    .name = "cxl_port",
    .probe = cxl_port_probe,
    .id = CXL_DEVICE_PORT,
    .drv = {
        .dev_groups = cxl_port_attribute_groups,
    },
};


the root will always be identified as CXL_DEVICE_ROOT and can never
match the port driver.  As a result, the >driver field can never be
set by the device drive base:


static int __device_attach_driver(struct device_driver *drv, void *_data)
{
...
        ret = driver_match_device(drv, dev);
...
        driver_probe_device(drv, dev)
}

static int driver_probe_device(struct device_driver *drv, struct device *dev)
{
...
        ret = __driver_probe_device(drv, dev);
...
}

static int __driver_probe_device(struct device_driver *drv, struct device *dev)
{
...
                ret = really_probe(dev, drv);
...
}

static int really_probe(struct device *dev, struct device_driver *drv)
{
...
re_probe:
        dev->driver = drv;
...
}




Unfortunately, I added this as a hack, but it did not resolve the issue.

I'm a bit lost in the device-driver core trying to track down the exact
path that is being taken, it's possible another subsequent error is
occuring that subsequently fails as well.


~Gregory

Comments

Gregory Price June 15, 2023, 6:46 p.m. UTC | #1
On Thu, Jun 15, 2023 at 12:16:05PM -0400, Gregory Price wrote:
> Unfortunately, I added this as a hack, but it did not resolve the issue.
> 
> I'm a bit lost in the device-driver core trying to track down the exact
> path that is being taken, it's possible another subsequent error is
> occuring that subsequently fails as well.
> 
> 
> ~Gregory
> 
> 
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 4d1f9c5b5029..7f99e4f790d8 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -1853,7 +1853,9 @@ static int cxl_bus_uevent(const struct device *dev, struct kobj_uevent_env *env)
> 
>  static int cxl_bus_match(struct device *dev, struct device_driver *drv)
>  {
> -       return cxl_device_id(dev) == to_cxl_drv(drv)->id;
> +       int devid = cxl_device_id(dev);
> +       int drvid = to_cxl_drv(drv)->id
> +       return (devid == drvid) || (devid == CXL_DEVICE_ROOT && drvid == CXL_DEVICE_PORT);
>  }

follow up with more context, there is a subsequent error on
cxl_port_probe which causes dev->driver to remain NULL.


[   15.561313] cxl_mem mem0: CXL port topology root0 not enabled
[   15.561315] cxl_mem mem0: call_driver_probe failed


drivers/cxl/port.c
static int cxl_port_probe(struct device *dev)
{
    struct cxl_port *port = to_cxl_port(dev);

    if (is_cxl_endpoint(port))
        return cxl_endpoint_port_probe(port);
    return cxl_switch_port_probe(port);
}


since root is not an endpoint, we probe root as a switch port.

static int cxl_switch_port_probe(struct cxl_port *port)
{
    struct cxl_hdm *cxlhdm;
    int rc;

    rc = devm_cxl_port_enumerate_dports(port);
    if (rc < 0)
        return rc;

    if (rc == 1)
        return devm_cxl_add_passthrough_decoder(port);

    cxlhdm = devm_cxl_setup_hdm(port, NULL);
    if (IS_ERR(cxlhdm))
        return PTR_ERR(cxlhdm);

    return devm_cxl_enumerate_decoders(cxlhdm, NULL);
}


It is likely that dev_cxl_port_enumate_dports is failing, though more
investigation is needed at this point.

What I don't know is whether cxl_port_probe is intended to probe the
root port, or if we should simply be attaching the driver to the root
port device explicitly rather than going through probe.

Any guidance would be appreciated.

~Gregory
Gregory Price June 15, 2023, 6:51 p.m. UTC | #2
On Thu, Jun 15, 2023 at 02:46:13PM -0400, Gregory Price wrote:
> On Thu, Jun 15, 2023 at 12:16:05PM -0400, Gregory Price wrote:
> follow up with more context, there is a subsequent error on
> cxl_port_probe which causes dev->driver to remain NULL.
> 
> 
> [   15.561313] cxl_mem mem0: CXL port topology root0 not enabled
> [   15.561315] cxl_mem mem0: call_driver_probe failed

wrong prints here, this should have read:

[   15.549316] cxl root0: driver set to cxl_port
[   15.549732] cxl_port root0: call_driver_probe failed

I confirmed that call_driver_probe sinks down to a call to
cxl_port_probe and that is failing on cxl_port_probe(root0).

Rest of the commentary below is accurate.

> 
> 
> drivers/cxl/port.c
> static int cxl_port_probe(struct device *dev)
> {
>     struct cxl_port *port = to_cxl_port(dev);
> 
>     if (is_cxl_endpoint(port))
>         return cxl_endpoint_port_probe(port);
>     return cxl_switch_port_probe(port);
> }
> 
> 
> since root is not an endpoint, we probe root as a switch port.
> 
> static int cxl_switch_port_probe(struct cxl_port *port)
> {
>     struct cxl_hdm *cxlhdm;
>     int rc;
> 
>     rc = devm_cxl_port_enumerate_dports(port);
>     if (rc < 0)
>         return rc;
> 
>     if (rc == 1)
>         return devm_cxl_add_passthrough_decoder(port);
> 
>     cxlhdm = devm_cxl_setup_hdm(port, NULL);
>     if (IS_ERR(cxlhdm))
>         return PTR_ERR(cxlhdm);
> 
>     return devm_cxl_enumerate_decoders(cxlhdm, NULL);
> }
> 
> 
> It is likely that dev_cxl_port_enumate_dports is failing, though more
> investigation is needed at this point.
> 
> What I don't know is whether cxl_port_probe is intended to probe the
> root port, or if we should simply be attaching the driver to the root
> port device explicitly rather than going through probe.
> 
> Any guidance would be appreciated.
> 
> ~Gregory
Gregory Price June 15, 2023, 10:43 p.m. UTC | #3
On Thu, Jun 15, 2023 at 02:51:13PM -0400, Gregory Price wrote:
> On Thu, Jun 15, 2023 at 02:46:13PM -0400, Gregory Price wrote:
> > On Thu, Jun 15, 2023 at 12:16:05PM -0400, Gregory Price wrote:
> > follow up with more context, there is a subsequent error on
> > cxl_port_probe which causes dev->driver to remain NULL.
> > 
> > 
> > [   15.561313] cxl_mem mem0: CXL port topology root0 not enabled
> > [   15.561315] cxl_mem mem0: call_driver_probe failed
> 
> wrong prints here, this should have read:
> 
> [   15.549316] cxl root0: driver set to cxl_port
> [   15.549732] cxl_port root0: call_driver_probe failed
> 
> I confirmed that call_driver_probe sinks down to a call to
> cxl_port_probe and that is failing on cxl_port_probe(root0).
> 
> Rest of the commentary below is accurate.
> 


so cxl_switch_port_probe fails on this stack:

- cxl_switch_port_probe
  - devm_cxl_port_enumerate_dports
    - cxl_port_to_pci_bus

int devm_cxl_port_enumerate_dports(struct cxl_port *port)
{
        struct pci_bus *bus = cxl_port_to_pci_bus(port);
        struct cxl_walk_context ctx;
        int type;

        if (!bus)
                return -ENXIO;
...
}

struct pci_bus *cxl_port_to_pci_bus(struct cxl_port *port)
{
    /* There is no pci_bus associated with a CXL platform-root port */
    if (is_cxl_root(port))
        return NULL;
...
}


I presume then that for the root port, we should just return 0 to note
success?  The rest of the devices should register themselves correctly
with the root as they're iterated over - i think?

What I can't figure out is why this doesn't happen on QEMU, which also
presents a similar topology.

Still not sure what is correct vs incorrect here, will need to dig
around in the spec and system settings unless someone has an idea.

~Gregory
Jonathan Cameron June 22, 2023, 9:48 a.m. UTC | #4
On Thu, 15 Jun 2023 18:43:59 -0400
Gregory Price <gregory.price@memverge.com> wrote:

> On Thu, Jun 15, 2023 at 02:51:13PM -0400, Gregory Price wrote:
> > On Thu, Jun 15, 2023 at 02:46:13PM -0400, Gregory Price wrote:  
> > > On Thu, Jun 15, 2023 at 12:16:05PM -0400, Gregory Price wrote:
> > > follow up with more context, there is a subsequent error on
> > > cxl_port_probe which causes dev->driver to remain NULL.
> > > 
> > > 
> > > [   15.561313] cxl_mem mem0: CXL port topology root0 not enabled
> > > [   15.561315] cxl_mem mem0: call_driver_probe failed  
> > 
> > wrong prints here, this should have read:
> > 
> > [   15.549316] cxl root0: driver set to cxl_port
> > [   15.549732] cxl_port root0: call_driver_probe failed
> > 
> > I confirmed that call_driver_probe sinks down to a call to
> > cxl_port_probe and that is failing on cxl_port_probe(root0).
> > 
> > Rest of the commentary below is accurate.
> >   
> 
> 
> so cxl_switch_port_probe fails on this stack:
> 
> - cxl_switch_port_probe
>   - devm_cxl_port_enumerate_dports
>     - cxl_port_to_pci_bus
> 
> int devm_cxl_port_enumerate_dports(struct cxl_port *port)
> {
>         struct pci_bus *bus = cxl_port_to_pci_bus(port);
>         struct cxl_walk_context ctx;
>         int type;
> 
>         if (!bus)
>                 return -ENXIO;
> ...
> }
> 
> struct pci_bus *cxl_port_to_pci_bus(struct cxl_port *port)
> {
>     /* There is no pci_bus associated with a CXL platform-root port */
>     if (is_cxl_root(port))
>         return NULL;
> ...
> }
> 
> 
> I presume then that for the root port, we should just return 0 to note
> success?  The rest of the devices should register themselves correctly
> with the root as they're iterated over - i think?
> 
> What I can't figure out is why this doesn't happen on QEMU, which also
> presents a similar topology.
> 
> Still not sure what is correct vs incorrect here, will need to dig
> around in the spec and system settings unless someone has an idea.

Interestingly on QEMU we never get near the path that's failing on root0
as no driver is associated with that particular device.  There are plenty of
child devices with drivers though - just not the top level one.
I wonder if a rescan or something similar is causing the probe on your
system but not qemu?

Dan, should we be seeing a driver binding to root0 or not?

Jonathan



 

> 
> ~Gregory
Gregory Price June 22, 2023, 2:47 p.m. UTC | #5
On Thu, Jun 22, 2023 at 10:48:20AM +0100, Jonathan Cameron wrote:
> On Thu, 15 Jun 2023 18:43:59 -0400
> Gregory Price <gregory.price@memverge.com> wrote:
> 
> > On Thu, Jun 15, 2023 at 02:51:13PM -0400, Gregory Price wrote:
> > > On Thu, Jun 15, 2023 at 02:46:13PM -0400, Gregory Price wrote:  
> > > > On Thu, Jun 15, 2023 at 12:16:05PM -0400, Gregory Price wrote:
> > > > follow up with more context, there is a subsequent error on
> > > > cxl_port_probe which causes dev->driver to remain NULL.
> > > > 
> > > > 
> > > > [   15.561313] cxl_mem mem0: CXL port topology root0 not enabled
> > > > [   15.561315] cxl_mem mem0: call_driver_probe failed  
> > > 
> > > wrong prints here, this should have read:
> > > 
> > > [   15.549316] cxl root0: driver set to cxl_port
> > > [   15.549732] cxl_port root0: call_driver_probe failed
> > > 
> > > I confirmed that call_driver_probe sinks down to a call to
> > > cxl_port_probe and that is failing on cxl_port_probe(root0).
> > > 
> > > Rest of the commentary below is accurate.
> > >   
> > 
> > 
> > so cxl_switch_port_probe fails on this stack:
> > 
> > - cxl_switch_port_probe
> >   - devm_cxl_port_enumerate_dports
> >     - cxl_port_to_pci_bus
> > 
> > int devm_cxl_port_enumerate_dports(struct cxl_port *port)
> > {
> >         struct pci_bus *bus = cxl_port_to_pci_bus(port);
> >         struct cxl_walk_context ctx;
> >         int type;
> > 
> >         if (!bus)
> >                 return -ENXIO;
> > ...
> > }
> > 
> > struct pci_bus *cxl_port_to_pci_bus(struct cxl_port *port)
> > {
> >     /* There is no pci_bus associated with a CXL platform-root port */
> >     if (is_cxl_root(port))
> >         return NULL;
> > ...
> > }
> > 
> > 
> > I presume then that for the root port, we should just return 0 to note
> > success?  The rest of the devices should register themselves correctly
> > with the root as they're iterated over - i think?
> > 
> > What I can't figure out is why this doesn't happen on QEMU, which also
> > presents a similar topology.
> > 
> > Still not sure what is correct vs incorrect here, will need to dig
> > around in the spec and system settings unless someone has an idea.
> 
> Interestingly on QEMU we never get near the path that's failing on root0
> as no driver is associated with that particular device.  There are plenty of
> child devices with drivers though - just not the top level one.
> I wonder if a rescan or something similar is causing the probe on your
> system but not qemu?
> 
> Dan, should we be seeing a driver binding to root0 or not?

I added a local patch to associate the root port with the port driver
after observing that the association was failing:

diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c index 4d1f9c5b5029..7f99e4f790d8 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -1853,7 +1853,9 @@ static int cxl_bus_uevent(const struct device *dev, struct kobj_uevent_env *env)

 static int cxl_bus_match(struct device *dev, struct device_driver *drv)  {
-       return cxl_device_id(dev) == to_cxl_drv(drv)->id;
+       int devid = cxl_device_id(dev);
+       int drvid = to_cxl_drv(drv)->id
+       return (devid == drvid) || (devid == CXL_DEVICE_ROOT && drvid ==
+ CXL_DEVICE_PORT);
 }

This is what is causing the probe, that's not what i'm mostly concerned
with.  The core issue here appears to be that the driver is seeing the
root port as the endpoint parent:

drivers/cxl/mem.c
static int cxl_mem_probe(struct device *dev) { ...
    if (dport->rch)
        endpoint_parent = parent_port->uport;
    else
        endpoint_parent = &parent_port->dev;

    device_lock(endpoint_parent);
    if (!endpoint_parent->driver) {   <---- endpoint here is root0
        dev_err(dev, "CXL port topology %s not enabled\n",
            dev_name(endpoint_parent));
        rc = -ENXIO;
        goto unlock;
    }
    ...
}

I suppose the root question may be whether an endpoint_parent should be a
root port, or if there is topological error occurring somewhere (missing
port in the topology).


> 
> Jonathan
> 
> > 
> > ~Gregory
>
diff mbox series

Patch

diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 4d1f9c5b5029..7f99e4f790d8 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -1853,7 +1853,9 @@  static int cxl_bus_uevent(const struct device *dev, struct kobj_uevent_env *env)

 static int cxl_bus_match(struct device *dev, struct device_driver *drv)
 {
-       return cxl_device_id(dev) == to_cxl_drv(drv)->id;
+       int devid = cxl_device_id(dev);
+       int drvid = to_cxl_drv(drv)->id
+       return (devid == drvid) || (devid == CXL_DEVICE_ROOT && drvid == CXL_DEVICE_PORT);
 }