mbox series

[0/5] cxl: Initialization and shutdown fixes

Message ID 172862483180.2150669.5564474284074502692.stgit@dwillia2-xfh.jf.intel.com
Headers show
Series cxl: Initialization and shutdown fixes | expand

Message

Dan Williams Oct. 11, 2024, 5:33 a.m. UTC
Gregory's modest proposal to fix CXL cxl_mem_probe() failures due to
delayed arrival of the CXL "root" infrastructure [1] prompted questions
of how the existing mechanism for retrying cxl_mem_probe() could be
failing.

The critical missing piece in the debug was that Gregory's setup had
almost all CXL modules built-in to the kernel.

On the way to that discovery several other bugs and init-order corner
cases were discovered.

The main fix is to make sure the drivers/cxl/Makefile object order
supports root CXL ports being fully initialized upon cxl_acpi_probe()
exit. The modular case has some similar potential holes that are fixed
with MODULE_SOFTDEP() and other fix ups. Finally, an attempt to update
cxl_test to reproduce the original report resulted in the discovery of a
separate long standing use after free bug in cxl_region_detach().

[1]: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net

---

Dan Williams (5):
      cxl/port: Fix CXL port initialization order when the subsystem is built-in
      cxl/port: Fix cxl_bus_rescan() vs bus_rescan_devices()
      cxl/acpi: Ensure ports ready at cxl_acpi_probe() return
      cxl/port: Fix use-after-free, permit out-of-order decoder shutdown
      cxl/test: Improve init-order fidelity relative to real-world systems


 drivers/base/core.c          |   35 +++++++
 drivers/cxl/Kconfig          |    1
 drivers/cxl/Makefile         |   12 +--
 drivers/cxl/acpi.c           |    7 +
 drivers/cxl/core/hdm.c       |   50 +++++++++--
 drivers/cxl/core/port.c      |   13 ++-
 drivers/cxl/core/region.c    |   48 +++-------
 drivers/cxl/cxl.h            |    3 -
 include/linux/device.h       |    3 +
 tools/testing/cxl/test/cxl.c |  200 +++++++++++++++++++++++-------------------
 tools/testing/cxl/test/mem.c |    1
 11 files changed, 228 insertions(+), 145 deletions(-)

base-commit: 8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b

Comments

Alejandro Lucero Palau Oct. 11, 2024, 11:21 a.m. UTC | #1
Hi Dan,


I think this is the same issue one of the patches in type2 support tries 
to deal with:


https://lore.kernel.org/linux-cxl/20240907081836.5801-1-alejandro.lucero-palau@amd.com/T/#m9357a559c1a3cc7869ecce44a1801d51518d106e


If this fixes that situation, I guess I can drop that one from v4 which 
is ready to be sent.


The other problem I try to fix in that patch, the endpoint not being 
there when that code tries to use it, it is likely not needed either, 
although I have a trivial fix for it now instead of that ugly loop with 
delays. The solution is to add PROBE_FORCE_SYNCHRONOUS as probe_type for 
the cxl_mem_driver which implies the device_add will only return when 
the device is really created. Maybe that is worth it for other potential 
situations suffering the delayed creation.


On 10/11/24 06:33, Dan Williams wrote:
> Gregory's modest proposal to fix CXL cxl_mem_probe() failures due to
> delayed arrival of the CXL "root" infrastructure [1] prompted questions
> of how the existing mechanism for retrying cxl_mem_probe() could be
> failing.
>
> The critical missing piece in the debug was that Gregory's setup had
> almost all CXL modules built-in to the kernel.
>
> On the way to that discovery several other bugs and init-order corner
> cases were discovered.
>
> The main fix is to make sure the drivers/cxl/Makefile object order
> supports root CXL ports being fully initialized upon cxl_acpi_probe()
> exit. The modular case has some similar potential holes that are fixed
> with MODULE_SOFTDEP() and other fix ups. Finally, an attempt to update
> cxl_test to reproduce the original report resulted in the discovery of a
> separate long standing use after free bug in cxl_region_detach().
>
> [1]: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net
>
> ---
>
> Dan Williams (5):
>        cxl/port: Fix CXL port initialization order when the subsystem is built-in
>        cxl/port: Fix cxl_bus_rescan() vs bus_rescan_devices()
>        cxl/acpi: Ensure ports ready at cxl_acpi_probe() return
>        cxl/port: Fix use-after-free, permit out-of-order decoder shutdown
>        cxl/test: Improve init-order fidelity relative to real-world systems
>
>
>   drivers/base/core.c          |   35 +++++++
>   drivers/cxl/Kconfig          |    1
>   drivers/cxl/Makefile         |   12 +--
>   drivers/cxl/acpi.c           |    7 +
>   drivers/cxl/core/hdm.c       |   50 +++++++++--
>   drivers/cxl/core/port.c      |   13 ++-
>   drivers/cxl/core/region.c    |   48 +++-------
>   drivers/cxl/cxl.h            |    3 -
>   include/linux/device.h       |    3 +
>   tools/testing/cxl/test/cxl.c |  200 +++++++++++++++++++++++-------------------
>   tools/testing/cxl/test/mem.c |    1
>   11 files changed, 228 insertions(+), 145 deletions(-)
>
> base-commit: 8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b
>
Dan Williams Oct. 11, 2024, 5:38 p.m. UTC | #2
Alejandro Lucero Palau wrote:
> Hi Dan,
> 
> 
> I think this is the same issue one of the patches in type2 support tries 
> to deal with:
> 
> 
> https://lore.kernel.org/linux-cxl/20240907081836.5801-1-alejandro.lucero-palau@amd.com/T/#m9357a559c1a3cc7869ecce44a1801d51518d106e
> 
> 
> If this fixes that situation, I guess I can drop that one from v4 which 
> is ready to be sent.
> 
> 
> The other problem I try to fix in that patch, the endpoint not being 
> there when that code tries to use it, it is likely not needed either, 
> although I have a trivial fix for it now instead of that ugly loop with 
> delays. The solution is to add PROBE_FORCE_SYNCHRONOUS as probe_type for 
> the cxl_mem_driver which implies the device_add will only return when 
> the device is really created. Maybe that is worth it for other potential 
> situations suffering the delayed creation.

I am skeptical that PROBE_FORCE_SYNCRONOUS is a fix for any
device-readiness bug. Some other assumption is violated if that is
required.

For the type-2 case I did have an EPROBE_DEFER in my initial RFC on the
assumption that an accelerator driver might want to wait until CXL is
initialized before the base accelerator proceeds. However, if
accelerator drivers behave the same as the cxl_pci driver and are ok
with asynchronus arrival of CXL functionality then no deferral is
needed.

Otherwise, the only motivation for synchronous probing I can think of
would be to have more predictable naming of kernel objects. So yes, I
would be curious to understand what scenarios probe deferral is still
needed.
Alejandro Lucero Palau Oct. 12, 2024, 6:30 a.m. UTC | #3
On 10/11/24 18:38, Dan Williams wrote:
> Alejandro Lucero Palau wrote:
>> Hi Dan,
>>
>>
>> I think this is the same issue one of the patches in type2 support tries
>> to deal with:
>>
>>
>> https://lore.kernel.org/linux-cxl/20240907081836.5801-1-alejandro.lucero-palau@amd.com/T/#m9357a559c1a3cc7869ecce44a1801d51518d106e
>>
>>
>> If this fixes that situation, I guess I can drop that one from v4 which
>> is ready to be sent.
>>
>>
>> The other problem I try to fix in that patch, the endpoint not being
>> there when that code tries to use it, it is likely not needed either,
>> although I have a trivial fix for it now instead of that ugly loop with
>> delays. The solution is to add PROBE_FORCE_SYNCHRONOUS as probe_type for
>> the cxl_mem_driver which implies the device_add will only return when
>> the device is really created. Maybe that is worth it for other potential
>> situations suffering the delayed creation.
> I am skeptical that PROBE_FORCE_SYNCRONOUS is a fix for any
> device-readiness bug. Some other assumption is violated if that is
> required.


But that problem is not about device readiness but just how the device 
model works. In this case the memdev creation is adding devices, no real 
ones but those abstractions we use from the device model, and that 
device creation is done asynchronously. When the code creating the 
memdev, a Type2 driver in my case, is going to work with such a device 
abstraction just after the memdev creation, it is not there yet. It is 
true that clx_mem_probe will interact with the real device, but the fact 
is such a function is not invoked by the device model synchronously, so 
the code using such a device abstraction needs to wait until it is 
there. With this flag the waiting is implicit to device creation. 
Without that flag other "nasty dancing" with delays and checks needs to 
be done as the code in v3 did.


> For the type-2 case I did have an EPROBE_DEFER in my initial RFC on the
> assumption that an accelerator driver might want to wait until CXL is
> initialized before the base accelerator proceeds. However, if
> accelerator drivers behave the same as the cxl_pci driver and are ok
> with asynchronus arrival of CXL functionality then no deferral is
> needed.


I think deferring the accel driver makes sense. In the sfc driver case, 
it could work without CXL and then change to use it once the CXL kernel 
support is fully initialised, but I guess other accel drivers will rely 
on CXL with no other option, and even with the sfc driver, supporting 
such a change will make the code far more complex.

>
> Otherwise, the only motivation for synchronous probing I can think of
> would be to have more predictable naming of kernel objects. So yes, I
> would be curious to understand what scenarios probe deferral is still
> needed.


OK. I will keep that patch with the last change in the v4. Let's discuss 
this further with that patch as a reference.


Thanks.
Dan Williams Oct. 12, 2024, 9:57 p.m. UTC | #4
Alejandro Lucero Palau wrote:
[..]
> > I am skeptical that PROBE_FORCE_SYNCRONOUS is a fix for any
> > device-readiness bug. Some other assumption is violated if that is
> > required.
> 
> 
> But that problem is not about device readiness but just how the device 
> model works. In this case the memdev creation is adding devices, no real 
> ones but those abstractions we use from the device model, and that 
> device creation is done asynchronously.

Device creation is not done asynchronously, the PCI driver is attaching
asynchrounously. When the PCI driver attaches it creates memdevs and
those are attached to cxl_mem synchronously.

> memdev, a Type2 driver in my case, is going to work with such a device 
> abstraction just after the memdev creation, it is not there yet.

Oh, is the concern that you always want to have the memdev attached to
cxl_mem immediately after it is registered?

I think that is another case where "MODULE_SOFTDEP("pre: cxl_mem")" is
needed. However, to fix this situation once and for all I think I would
rather just drop all this modularity and move both cxl_port and cxl_mem
to be drivers internal to cxl_core.ko similar to the cxl_region driver.


> It is true that clx_mem_probe will interact with the real device, but
> the fact is such a function is not invoked by the device model
> synchronously, so the code using such a device abstraction needs to
> wait until it is there. With this flag the waiting is implicit to
> device creation.  Without that flag other "nasty dancing" with delays
> and checks needs to be done as the code in v3 did.

It is invoked synchronously *if* the driver is loaded *and* the user has
not forced asynchronous attach on the command line in which case they
get to keep the pieces.

> > For the type-2 case I did have an EPROBE_DEFER in my initial RFC on the
> > assumption that an accelerator driver might want to wait until CXL is
> > initialized before the base accelerator proceeds. However, if
> > accelerator drivers behave the same as the cxl_pci driver and are ok
> > with asynchronus arrival of CXL functionality then no deferral is
> > needed.
> 
> 
> I think deferring the accel driver makes sense. In the sfc driver case, 
> it could work without CXL and then change to use it once the CXL kernel 
> support is fully initialised, but I guess other accel drivers will rely 
> on CXL with no other option, and even with the sfc driver, supporting 
> such a change will make the code far more complex.

Makes sense.

> > Otherwise, the only motivation for synchronous probing I can think of
> > would be to have more predictable naming of kernel objects. So yes, I
> > would be curious to understand what scenarios probe deferral is still
> > needed.
> 
> OK. I will keep that patch with the last change in the v4. Let's discuss 
> this further with that patch as a reference.

EPROBE_DEFER when CXL is not ready yet is fine in the sfc driver, just
comment that the driver does not have a PCIe-only operation mode and
requires that CXL initializes.
Alejandro Lucero Palau Oct. 14, 2024, 3:13 p.m. UTC | #5
On 10/12/24 22:57, Dan Williams wrote:
> Alejandro Lucero Palau wrote:
> [..]
>>> I am skeptical that PROBE_FORCE_SYNCRONOUS is a fix for any
>>> device-readiness bug. Some other assumption is violated if that is
>>> required.
>>
>> But that problem is not about device readiness but just how the device
>> model works. In this case the memdev creation is adding devices, no real
>> ones but those abstractions we use from the device model, and that
>> device creation is done asynchronously.
> Device creation is not done asynchronously, the PCI driver is attaching
> asynchrounously. When the PCI driver attaches it creates memdevs and
> those are attached to cxl_mem synchronously.
>
>> memdev, a Type2 driver in my case, is going to work with such a device
>> abstraction just after the memdev creation, it is not there yet.
> Oh, is the concern that you always want to have the memdev attached to
> cxl_mem immediately after it is registered?
>
> I think that is another case where "MODULE_SOFTDEP("pre: cxl_mem")" is
> needed. However, to fix this situation once and for all I think I would
> rather just drop all this modularity and move both cxl_port and cxl_mem
> to be drivers internal to cxl_core.ko similar to the cxl_region driver.


Oh, so the problem is the code is not ready because the functionality is 
in a module not loaded yet.

Then it makes sense that change. I'll do it if not already taken. I'll 
send v4 without the PROBE_FORCE_SYNCHRONOUS flag and without the 
previous loop with delays implemented in v3.

Thanks


>
>> It is true that clx_mem_probe will interact with the real device, but
>> the fact is such a function is not invoked by the device model
>> synchronously, so the code using such a device abstraction needs to
>> wait until it is there. With this flag the waiting is implicit to
>> device creation.  Without that flag other "nasty dancing" with delays
>> and checks needs to be done as the code in v3 did.
> It is invoked synchronously *if* the driver is loaded *and* the user has
> not forced asynchronous attach on the command line in which case they
> get to keep the pieces.
>
>>> For the type-2 case I did have an EPROBE_DEFER in my initial RFC on the
>>> assumption that an accelerator driver might want to wait until CXL is
>>> initialized before the base accelerator proceeds. However, if
>>> accelerator drivers behave the same as the cxl_pci driver and are ok
>>> with asynchronus arrival of CXL functionality then no deferral is
>>> needed.
>>
>> I think deferring the accel driver makes sense. In the sfc driver case,
>> it could work without CXL and then change to use it once the CXL kernel
>> support is fully initialised, but I guess other accel drivers will rely
>> on CXL with no other option, and even with the sfc driver, supporting
>> such a change will make the code far more complex.
> Makes sense.
>
>>> Otherwise, the only motivation for synchronous probing I can think of
>>> would be to have more predictable naming of kernel objects. So yes, I
>>> would be curious to understand what scenarios probe deferral is still
>>> needed.
>> OK. I will keep that patch with the last change in the v4. Let's discuss
>> this further with that patch as a reference.
> EPROBE_DEFER when CXL is not ready yet is fine in the sfc driver, just
> comment that the driver does not have a PCIe-only operation mode and
> requires that CXL initializes.
Dan Williams Oct. 14, 2024, 10:24 p.m. UTC | #6
Alejandro Lucero Palau wrote:
> 
> On 10/12/24 22:57, Dan Williams wrote:
> > Alejandro Lucero Palau wrote:
> > [..]
> >>> I am skeptical that PROBE_FORCE_SYNCRONOUS is a fix for any
> >>> device-readiness bug. Some other assumption is violated if that is
> >>> required.
> >>
> >> But that problem is not about device readiness but just how the device
> >> model works. In this case the memdev creation is adding devices, no real
> >> ones but those abstractions we use from the device model, and that
> >> device creation is done asynchronously.
> > Device creation is not done asynchronously, the PCI driver is attaching
> > asynchrounously. When the PCI driver attaches it creates memdevs and
> > those are attached to cxl_mem synchronously.
> >
> >> memdev, a Type2 driver in my case, is going to work with such a device
> >> abstraction just after the memdev creation, it is not there yet.
> > Oh, is the concern that you always want to have the memdev attached to
> > cxl_mem immediately after it is registered?
> >
> > I think that is another case where "MODULE_SOFTDEP("pre: cxl_mem")" is
> > needed. However, to fix this situation once and for all I think I would
> > rather just drop all this modularity and move both cxl_port and cxl_mem
> > to be drivers internal to cxl_core.ko similar to the cxl_region driver.
> 
> 
> Oh, so the problem is the code is not ready because the functionality is 
> in a module not loaded yet.

Right.

> Then it makes sense that change. I'll do it if not already taken. I'll 
> send v4 without the PROBE_FORCE_SYNCHRONOUS flag and without the 
> previous loop with delays implemented in v3.

So I think EPROBE_DEFER can stay out of the CXL core because it is up to
the accelerator driver to decide whether CXL availability is fatal to
init or not.

Additionally, I am less and less convinced that Type-2 drivers should be
forced to depend on the cxl_mem driver to attach vs just arranging for
those Type-2 drivers to call devm_cxl_enumerate_ports() and
devm_cxl_add_endpoint() directly. In other words I am starting to worry
that the generic cxl_mem driver design pattern is a midlayer mistake.

So, if it makes it easier for sfc, I would be open to exploring a direct
scheme for endpoint attachment, and not requiring the generic cxl_mem
driver as an intermediary.
Alejandro Lucero Palau Oct. 15, 2024, 8:45 a.m. UTC | #7
On 10/14/24 23:24, Dan Williams wrote:
> Alejandro Lucero Palau wrote:
>> On 10/12/24 22:57, Dan Williams wrote:
>>> Alejandro Lucero Palau wrote:
>>> [..]
>>>>> I am skeptical that PROBE_FORCE_SYNCRONOUS is a fix for any
>>>>> device-readiness bug. Some other assumption is violated if that is
>>>>> required.
>>>> But that problem is not about device readiness but just how the device
>>>> model works. In this case the memdev creation is adding devices, no real
>>>> ones but those abstractions we use from the device model, and that
>>>> device creation is done asynchronously.
>>> Device creation is not done asynchronously, the PCI driver is attaching
>>> asynchrounously. When the PCI driver attaches it creates memdevs and
>>> those are attached to cxl_mem synchronously.
>>>
>>>> memdev, a Type2 driver in my case, is going to work with such a device
>>>> abstraction just after the memdev creation, it is not there yet.
>>> Oh, is the concern that you always want to have the memdev attached to
>>> cxl_mem immediately after it is registered?
>>>
>>> I think that is another case where "MODULE_SOFTDEP("pre: cxl_mem")" is
>>> needed. However, to fix this situation once and for all I think I would
>>> rather just drop all this modularity and move both cxl_port and cxl_mem
>>> to be drivers internal to cxl_core.ko similar to the cxl_region driver.
>>
>> Oh, so the problem is the code is not ready because the functionality is
>> in a module not loaded yet.
> Right.
>
>> Then it makes sense that change. I'll do it if not already taken. I'll
>> send v4 without the PROBE_FORCE_SYNCHRONOUS flag and without the
>> previous loop with delays implemented in v3.
> So I think EPROBE_DEFER can stay out of the CXL core because it is up to
> the accelerator driver to decide whether CXL availability is fatal to
> init or not.


It needs support from the cxl core though. If the cxl root is not there 
yet, the driver needs to know, and that is what you did in your original 
patch and I'm keeping as well.


> Additionally, I am less and less convinced that Type-2 drivers should be
> forced to depend on the cxl_mem driver to attach vs just arranging for
> those Type-2 drivers to call devm_cxl_enumerate_ports() and
> devm_cxl_add_endpoint() directly. In other words I am starting to worry
> that the generic cxl_mem driver design pattern is a midlayer mistake.


You know better than me but in my view, a Type2 should follow what a 
Type3 does with some small changes for dealing with the differences, 
mainly the way it is going to be used and those optional capabilities 
for Type2. This makes more sense to me for Type1.

> So, if it makes it easier for sfc, I would be open to exploring a direct
> scheme for endpoint attachment, and not requiring the generic cxl_mem
> driver as an intermediary.


V4 is ready (just having problems when testing with 6.12-rcX) so I would 
like to keep it and explore this once we have something working and 
accepted. Type2 and Type1 with CXL cache will bring new challenges and I 
bet we will need refactoring in the code and potentially new design for 
generic code (Type3 and Type2, Type2 and Type1).
Dan Williams Oct. 15, 2024, 4:37 p.m. UTC | #8
Alejandro Lucero Palau wrote:
[..]
> >> Then it makes sense that change. I'll do it if not already taken. I'll
> >> send v4 without the PROBE_FORCE_SYNCHRONOUS flag and without the
> >> previous loop with delays implemented in v3.
> > So I think EPROBE_DEFER can stay out of the CXL core because it is up to
> > the accelerator driver to decide whether CXL availability is fatal to
> > init or not.
> 
> 
> It needs support from the cxl core though. If the cxl root is not there 
> yet, the driver needs to know, and that is what you did in your original 
> patch and I'm keeping as well.

So there are two ways, check if a registered @memdev has
@memdev->dev.driver set, assuming you know that the cxl_mem driver is
available, or call devm_cxl_enumerate_ports() yourself and skip the
cxl_mem driver indirection.

Setting @memdev->endpoint to ERR_PTR(-EPROBE_DEFER), as I originally
had, is an even more indirect way to convey a similar result and is
starting to feel a bit "mid-layer-y".

> > Additionally, I am less and less convinced that Type-2 drivers should be
> > forced to depend on the cxl_mem driver to attach vs just arranging for
> > those Type-2 drivers to call devm_cxl_enumerate_ports() and
> > devm_cxl_add_endpoint() directly. In other words I am starting to worry
> > that the generic cxl_mem driver design pattern is a midlayer mistake.
> 
> You know better than me but in my view, a Type2 should follow what a 
> Type3 does with some small changes for dealing with the differences, 
> mainly the way it is going to be used and those optional capabilities 
> for Type2. This makes more sense to me for Type1.

If an accelerator driver *wants* to depend on cxl_mem, it can, but all
the cxl_core functions that cxl_mem uses are also exported for
accelerator drivers to use directly to avoid needing to guess about when
/ if cxl_mem is going to attach.

> > So, if it makes it easier for sfc, I would be open to exploring a direct
> > scheme for endpoint attachment, and not requiring the generic cxl_mem
> > driver as an intermediary.
> 
> V4 is ready (just having problems when testing with 6.12-rcX) so I would 
> like to keep it and explore this once we have something working and 
> accepted. Type2 and Type1 with CXL cache will bring new challenges and I 
> bet we will need refactoring in the code and potentially new design for 
> generic code (Type3 and Type2, Type2 and Type1).

Yeah, no need to switch horses mid-race if you already have a cxl_mem
dependent approach nearly ready, but a potential simplification to
explore going forward.
Alejandro Lucero Palau Oct. 16, 2024, 2:41 p.m. UTC | #9
On 10/15/24 17:37, Dan Williams wrote:
> Alejandro Lucero Palau wrote:
> [..]
>>>> Then it makes sense that change. I'll do it if not already taken. I'll
>>>> send v4 without the PROBE_FORCE_SYNCHRONOUS flag and without the
>>>> previous loop with delays implemented in v3.
>>> So I think EPROBE_DEFER can stay out of the CXL core because it is up to
>>> the accelerator driver to decide whether CXL availability is fatal to
>>> init or not.
>>
>> It needs support from the cxl core though. If the cxl root is not there
>> yet, the driver needs to know, and that is what you did in your original
>> patch and I'm keeping as well.
> So there are two ways, check if a registered @memdev has
> @memdev->dev.driver set, assuming you know that the cxl_mem driver is
> available, or call devm_cxl_enumerate_ports() yourself and skip the
> cxl_mem driver indirection.
>
> Setting @memdev->endpoint to ERR_PTR(-EPROBE_DEFER), as I originally
> had, is an even more indirect way to convey a similar result and is
> starting to feel a bit "mid-layer-y".
>

I was a bit confused with this answer until I read again the patch 
commit from your original work.


The confusion came from my assumption about if the root device is not 
there, it is due to the hardware root initialization requiring more 
time. But I realize now you specifically said "the root driver has not 
attached yet" what turns it into this problem of kernel modules not 
loaded yet.


If so, I think I can solve this within the type2 driver code and 
kconfig. Kconfig will force the driver being compiled as a module if the 
cxl core is a module, and with MODULE_SOFTDEP("pre: cxl_core cxl_port 
cxl_acpi cxl-mem) the type2 driver modprobe will trigger loading of 
dependencies beforehand. This makes unnecessary EPROBE_DEFER.


What do you think?


>>> Additionally, I am less and less convinced that Type-2 drivers should be
>>> forced to depend on the cxl_mem driver to attach vs just arranging for
>>> those Type-2 drivers to call devm_cxl_enumerate_ports() and
>>> devm_cxl_add_endpoint() directly. In other words I am starting to worry
>>> that the generic cxl_mem driver design pattern is a midlayer mistake.
>> You know better than me but in my view, a Type2 should follow what a
>> Type3 does with some small changes for dealing with the differences,
>> mainly the way it is going to be used and those optional capabilities
>> for Type2. This makes more sense to me for Type1.
> If an accelerator driver *wants* to depend on cxl_mem, it can, but all
> the cxl_core functions that cxl_mem uses are also exported for
> accelerator drivers to use directly to avoid needing to guess about when
> / if cxl_mem is going to attach.
>
>>> So, if it makes it easier for sfc, I would be open to exploring a direct
>>> scheme for endpoint attachment, and not requiring the generic cxl_mem
>>> driver as an intermediary.
>> V4 is ready (just having problems when testing with 6.12-rcX) so I would
>> like to keep it and explore this once we have something working and
>> accepted. Type2 and Type1 with CXL cache will bring new challenges and I
>> bet we will need refactoring in the code and potentially new design for
>> generic code (Type3 and Type2, Type2 and Type1).
> Yeah, no need to switch horses mid-race if you already have a cxl_mem
> dependent approach nearly ready, but a potential simplification to
> explore going forward.
Dan Williams Oct. 23, 2024, 12:46 a.m. UTC | #10
Alejandro Lucero Palau wrote:
[..]
> > Setting @memdev->endpoint to ERR_PTR(-EPROBE_DEFER), as I originally
> > had, is an even more indirect way to convey a similar result and is
> > starting to feel a bit "mid-layer-y".
> >
> 
> I was a bit confused with this answer until I read again the patch 
> commit from your original work.
> 
> The confusion came from my assumption about if the root device is not 
> there, it is due to the hardware root initialization requiring more 
> time. But I realize now you specifically said "the root driver has not 
> attached yet" what turns it into this problem of kernel modules not 
> loaded yet.
> 
> If so, I think I can solve this within the type2 driver code and 
> kconfig. Kconfig will force the driver being compiled as a module...

There should be no requirement that accelerator drivers must be built as
modules. An accelerator driver simply cannot enforce, via module load
order, that CXL root infrastructure is up and ready before the
accelerator 'probe' routine runs. This is because enumeration order still
dominiates and enumeration order is effectively random*.

The accelerator driver only has 2 options, return EPROBE_DEFER until all
resource dependencies are ready, or do what cxl_pci + cxl_mem do. What
cxl_pci + cxl_mem do is, cxl_pci_probe() registers a memdev and then at
some point later cxl_mem notices that the root infrastructure has
arrived via the cxl_bus_rescan() event.

Note that these patches are about fixing the assumptions of
cxl_bus_rescan(), not about ensuring init order.

* ...at least nothing should break if CXL root and CXL endpoint
  enumeration happens out of order.