Message ID | 172862483180.2150669.5564474284074502692.stgit@dwillia2-xfh.jf.intel.com |
---|---|
Headers | show |
Series | cxl: Initialization and shutdown fixes | expand |
Hi Dan, I think this is the same issue one of the patches in type2 support tries to deal with: https://lore.kernel.org/linux-cxl/20240907081836.5801-1-alejandro.lucero-palau@amd.com/T/#m9357a559c1a3cc7869ecce44a1801d51518d106e If this fixes that situation, I guess I can drop that one from v4 which is ready to be sent. The other problem I try to fix in that patch, the endpoint not being there when that code tries to use it, it is likely not needed either, although I have a trivial fix for it now instead of that ugly loop with delays. The solution is to add PROBE_FORCE_SYNCHRONOUS as probe_type for the cxl_mem_driver which implies the device_add will only return when the device is really created. Maybe that is worth it for other potential situations suffering the delayed creation. On 10/11/24 06:33, Dan Williams wrote: > Gregory's modest proposal to fix CXL cxl_mem_probe() failures due to > delayed arrival of the CXL "root" infrastructure [1] prompted questions > of how the existing mechanism for retrying cxl_mem_probe() could be > failing. > > The critical missing piece in the debug was that Gregory's setup had > almost all CXL modules built-in to the kernel. > > On the way to that discovery several other bugs and init-order corner > cases were discovered. > > The main fix is to make sure the drivers/cxl/Makefile object order > supports root CXL ports being fully initialized upon cxl_acpi_probe() > exit. The modular case has some similar potential holes that are fixed > with MODULE_SOFTDEP() and other fix ups. Finally, an attempt to update > cxl_test to reproduce the original report resulted in the discovery of a > separate long standing use after free bug in cxl_region_detach(). > > [1]: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net > > --- > > Dan Williams (5): > cxl/port: Fix CXL port initialization order when the subsystem is built-in > cxl/port: Fix cxl_bus_rescan() vs bus_rescan_devices() > cxl/acpi: Ensure ports ready at cxl_acpi_probe() return > cxl/port: Fix use-after-free, permit out-of-order decoder shutdown > cxl/test: Improve init-order fidelity relative to real-world systems > > > drivers/base/core.c | 35 +++++++ > drivers/cxl/Kconfig | 1 > drivers/cxl/Makefile | 12 +-- > drivers/cxl/acpi.c | 7 + > drivers/cxl/core/hdm.c | 50 +++++++++-- > drivers/cxl/core/port.c | 13 ++- > drivers/cxl/core/region.c | 48 +++------- > drivers/cxl/cxl.h | 3 - > include/linux/device.h | 3 + > tools/testing/cxl/test/cxl.c | 200 +++++++++++++++++++++++------------------- > tools/testing/cxl/test/mem.c | 1 > 11 files changed, 228 insertions(+), 145 deletions(-) > > base-commit: 8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b >
Alejandro Lucero Palau wrote: > Hi Dan, > > > I think this is the same issue one of the patches in type2 support tries > to deal with: > > > https://lore.kernel.org/linux-cxl/20240907081836.5801-1-alejandro.lucero-palau@amd.com/T/#m9357a559c1a3cc7869ecce44a1801d51518d106e > > > If this fixes that situation, I guess I can drop that one from v4 which > is ready to be sent. > > > The other problem I try to fix in that patch, the endpoint not being > there when that code tries to use it, it is likely not needed either, > although I have a trivial fix for it now instead of that ugly loop with > delays. The solution is to add PROBE_FORCE_SYNCHRONOUS as probe_type for > the cxl_mem_driver which implies the device_add will only return when > the device is really created. Maybe that is worth it for other potential > situations suffering the delayed creation. I am skeptical that PROBE_FORCE_SYNCRONOUS is a fix for any device-readiness bug. Some other assumption is violated if that is required. For the type-2 case I did have an EPROBE_DEFER in my initial RFC on the assumption that an accelerator driver might want to wait until CXL is initialized before the base accelerator proceeds. However, if accelerator drivers behave the same as the cxl_pci driver and are ok with asynchronus arrival of CXL functionality then no deferral is needed. Otherwise, the only motivation for synchronous probing I can think of would be to have more predictable naming of kernel objects. So yes, I would be curious to understand what scenarios probe deferral is still needed.
On 10/11/24 18:38, Dan Williams wrote: > Alejandro Lucero Palau wrote: >> Hi Dan, >> >> >> I think this is the same issue one of the patches in type2 support tries >> to deal with: >> >> >> https://lore.kernel.org/linux-cxl/20240907081836.5801-1-alejandro.lucero-palau@amd.com/T/#m9357a559c1a3cc7869ecce44a1801d51518d106e >> >> >> If this fixes that situation, I guess I can drop that one from v4 which >> is ready to be sent. >> >> >> The other problem I try to fix in that patch, the endpoint not being >> there when that code tries to use it, it is likely not needed either, >> although I have a trivial fix for it now instead of that ugly loop with >> delays. The solution is to add PROBE_FORCE_SYNCHRONOUS as probe_type for >> the cxl_mem_driver which implies the device_add will only return when >> the device is really created. Maybe that is worth it for other potential >> situations suffering the delayed creation. > I am skeptical that PROBE_FORCE_SYNCRONOUS is a fix for any > device-readiness bug. Some other assumption is violated if that is > required. But that problem is not about device readiness but just how the device model works. In this case the memdev creation is adding devices, no real ones but those abstractions we use from the device model, and that device creation is done asynchronously. When the code creating the memdev, a Type2 driver in my case, is going to work with such a device abstraction just after the memdev creation, it is not there yet. It is true that clx_mem_probe will interact with the real device, but the fact is such a function is not invoked by the device model synchronously, so the code using such a device abstraction needs to wait until it is there. With this flag the waiting is implicit to device creation. Without that flag other "nasty dancing" with delays and checks needs to be done as the code in v3 did. > For the type-2 case I did have an EPROBE_DEFER in my initial RFC on the > assumption that an accelerator driver might want to wait until CXL is > initialized before the base accelerator proceeds. However, if > accelerator drivers behave the same as the cxl_pci driver and are ok > with asynchronus arrival of CXL functionality then no deferral is > needed. I think deferring the accel driver makes sense. In the sfc driver case, it could work without CXL and then change to use it once the CXL kernel support is fully initialised, but I guess other accel drivers will rely on CXL with no other option, and even with the sfc driver, supporting such a change will make the code far more complex. > > Otherwise, the only motivation for synchronous probing I can think of > would be to have more predictable naming of kernel objects. So yes, I > would be curious to understand what scenarios probe deferral is still > needed. OK. I will keep that patch with the last change in the v4. Let's discuss this further with that patch as a reference. Thanks.
Alejandro Lucero Palau wrote: [..] > > I am skeptical that PROBE_FORCE_SYNCRONOUS is a fix for any > > device-readiness bug. Some other assumption is violated if that is > > required. > > > But that problem is not about device readiness but just how the device > model works. In this case the memdev creation is adding devices, no real > ones but those abstractions we use from the device model, and that > device creation is done asynchronously. Device creation is not done asynchronously, the PCI driver is attaching asynchrounously. When the PCI driver attaches it creates memdevs and those are attached to cxl_mem synchronously. > memdev, a Type2 driver in my case, is going to work with such a device > abstraction just after the memdev creation, it is not there yet. Oh, is the concern that you always want to have the memdev attached to cxl_mem immediately after it is registered? I think that is another case where "MODULE_SOFTDEP("pre: cxl_mem")" is needed. However, to fix this situation once and for all I think I would rather just drop all this modularity and move both cxl_port and cxl_mem to be drivers internal to cxl_core.ko similar to the cxl_region driver. > It is true that clx_mem_probe will interact with the real device, but > the fact is such a function is not invoked by the device model > synchronously, so the code using such a device abstraction needs to > wait until it is there. With this flag the waiting is implicit to > device creation. Without that flag other "nasty dancing" with delays > and checks needs to be done as the code in v3 did. It is invoked synchronously *if* the driver is loaded *and* the user has not forced asynchronous attach on the command line in which case they get to keep the pieces. > > For the type-2 case I did have an EPROBE_DEFER in my initial RFC on the > > assumption that an accelerator driver might want to wait until CXL is > > initialized before the base accelerator proceeds. However, if > > accelerator drivers behave the same as the cxl_pci driver and are ok > > with asynchronus arrival of CXL functionality then no deferral is > > needed. > > > I think deferring the accel driver makes sense. In the sfc driver case, > it could work without CXL and then change to use it once the CXL kernel > support is fully initialised, but I guess other accel drivers will rely > on CXL with no other option, and even with the sfc driver, supporting > such a change will make the code far more complex. Makes sense. > > Otherwise, the only motivation for synchronous probing I can think of > > would be to have more predictable naming of kernel objects. So yes, I > > would be curious to understand what scenarios probe deferral is still > > needed. > > OK. I will keep that patch with the last change in the v4. Let's discuss > this further with that patch as a reference. EPROBE_DEFER when CXL is not ready yet is fine in the sfc driver, just comment that the driver does not have a PCIe-only operation mode and requires that CXL initializes.
On 10/12/24 22:57, Dan Williams wrote: > Alejandro Lucero Palau wrote: > [..] >>> I am skeptical that PROBE_FORCE_SYNCRONOUS is a fix for any >>> device-readiness bug. Some other assumption is violated if that is >>> required. >> >> But that problem is not about device readiness but just how the device >> model works. In this case the memdev creation is adding devices, no real >> ones but those abstractions we use from the device model, and that >> device creation is done asynchronously. > Device creation is not done asynchronously, the PCI driver is attaching > asynchrounously. When the PCI driver attaches it creates memdevs and > those are attached to cxl_mem synchronously. > >> memdev, a Type2 driver in my case, is going to work with such a device >> abstraction just after the memdev creation, it is not there yet. > Oh, is the concern that you always want to have the memdev attached to > cxl_mem immediately after it is registered? > > I think that is another case where "MODULE_SOFTDEP("pre: cxl_mem")" is > needed. However, to fix this situation once and for all I think I would > rather just drop all this modularity and move both cxl_port and cxl_mem > to be drivers internal to cxl_core.ko similar to the cxl_region driver. Oh, so the problem is the code is not ready because the functionality is in a module not loaded yet. Then it makes sense that change. I'll do it if not already taken. I'll send v4 without the PROBE_FORCE_SYNCHRONOUS flag and without the previous loop with delays implemented in v3. Thanks > >> It is true that clx_mem_probe will interact with the real device, but >> the fact is such a function is not invoked by the device model >> synchronously, so the code using such a device abstraction needs to >> wait until it is there. With this flag the waiting is implicit to >> device creation. Without that flag other "nasty dancing" with delays >> and checks needs to be done as the code in v3 did. > It is invoked synchronously *if* the driver is loaded *and* the user has > not forced asynchronous attach on the command line in which case they > get to keep the pieces. > >>> For the type-2 case I did have an EPROBE_DEFER in my initial RFC on the >>> assumption that an accelerator driver might want to wait until CXL is >>> initialized before the base accelerator proceeds. However, if >>> accelerator drivers behave the same as the cxl_pci driver and are ok >>> with asynchronus arrival of CXL functionality then no deferral is >>> needed. >> >> I think deferring the accel driver makes sense. In the sfc driver case, >> it could work without CXL and then change to use it once the CXL kernel >> support is fully initialised, but I guess other accel drivers will rely >> on CXL with no other option, and even with the sfc driver, supporting >> such a change will make the code far more complex. > Makes sense. > >>> Otherwise, the only motivation for synchronous probing I can think of >>> would be to have more predictable naming of kernel objects. So yes, I >>> would be curious to understand what scenarios probe deferral is still >>> needed. >> OK. I will keep that patch with the last change in the v4. Let's discuss >> this further with that patch as a reference. > EPROBE_DEFER when CXL is not ready yet is fine in the sfc driver, just > comment that the driver does not have a PCIe-only operation mode and > requires that CXL initializes.
Alejandro Lucero Palau wrote: > > On 10/12/24 22:57, Dan Williams wrote: > > Alejandro Lucero Palau wrote: > > [..] > >>> I am skeptical that PROBE_FORCE_SYNCRONOUS is a fix for any > >>> device-readiness bug. Some other assumption is violated if that is > >>> required. > >> > >> But that problem is not about device readiness but just how the device > >> model works. In this case the memdev creation is adding devices, no real > >> ones but those abstractions we use from the device model, and that > >> device creation is done asynchronously. > > Device creation is not done asynchronously, the PCI driver is attaching > > asynchrounously. When the PCI driver attaches it creates memdevs and > > those are attached to cxl_mem synchronously. > > > >> memdev, a Type2 driver in my case, is going to work with such a device > >> abstraction just after the memdev creation, it is not there yet. > > Oh, is the concern that you always want to have the memdev attached to > > cxl_mem immediately after it is registered? > > > > I think that is another case where "MODULE_SOFTDEP("pre: cxl_mem")" is > > needed. However, to fix this situation once and for all I think I would > > rather just drop all this modularity and move both cxl_port and cxl_mem > > to be drivers internal to cxl_core.ko similar to the cxl_region driver. > > > Oh, so the problem is the code is not ready because the functionality is > in a module not loaded yet. Right. > Then it makes sense that change. I'll do it if not already taken. I'll > send v4 without the PROBE_FORCE_SYNCHRONOUS flag and without the > previous loop with delays implemented in v3. So I think EPROBE_DEFER can stay out of the CXL core because it is up to the accelerator driver to decide whether CXL availability is fatal to init or not. Additionally, I am less and less convinced that Type-2 drivers should be forced to depend on the cxl_mem driver to attach vs just arranging for those Type-2 drivers to call devm_cxl_enumerate_ports() and devm_cxl_add_endpoint() directly. In other words I am starting to worry that the generic cxl_mem driver design pattern is a midlayer mistake. So, if it makes it easier for sfc, I would be open to exploring a direct scheme for endpoint attachment, and not requiring the generic cxl_mem driver as an intermediary.
On 10/14/24 23:24, Dan Williams wrote: > Alejandro Lucero Palau wrote: >> On 10/12/24 22:57, Dan Williams wrote: >>> Alejandro Lucero Palau wrote: >>> [..] >>>>> I am skeptical that PROBE_FORCE_SYNCRONOUS is a fix for any >>>>> device-readiness bug. Some other assumption is violated if that is >>>>> required. >>>> But that problem is not about device readiness but just how the device >>>> model works. In this case the memdev creation is adding devices, no real >>>> ones but those abstractions we use from the device model, and that >>>> device creation is done asynchronously. >>> Device creation is not done asynchronously, the PCI driver is attaching >>> asynchrounously. When the PCI driver attaches it creates memdevs and >>> those are attached to cxl_mem synchronously. >>> >>>> memdev, a Type2 driver in my case, is going to work with such a device >>>> abstraction just after the memdev creation, it is not there yet. >>> Oh, is the concern that you always want to have the memdev attached to >>> cxl_mem immediately after it is registered? >>> >>> I think that is another case where "MODULE_SOFTDEP("pre: cxl_mem")" is >>> needed. However, to fix this situation once and for all I think I would >>> rather just drop all this modularity and move both cxl_port and cxl_mem >>> to be drivers internal to cxl_core.ko similar to the cxl_region driver. >> >> Oh, so the problem is the code is not ready because the functionality is >> in a module not loaded yet. > Right. > >> Then it makes sense that change. I'll do it if not already taken. I'll >> send v4 without the PROBE_FORCE_SYNCHRONOUS flag and without the >> previous loop with delays implemented in v3. > So I think EPROBE_DEFER can stay out of the CXL core because it is up to > the accelerator driver to decide whether CXL availability is fatal to > init or not. It needs support from the cxl core though. If the cxl root is not there yet, the driver needs to know, and that is what you did in your original patch and I'm keeping as well. > Additionally, I am less and less convinced that Type-2 drivers should be > forced to depend on the cxl_mem driver to attach vs just arranging for > those Type-2 drivers to call devm_cxl_enumerate_ports() and > devm_cxl_add_endpoint() directly. In other words I am starting to worry > that the generic cxl_mem driver design pattern is a midlayer mistake. You know better than me but in my view, a Type2 should follow what a Type3 does with some small changes for dealing with the differences, mainly the way it is going to be used and those optional capabilities for Type2. This makes more sense to me for Type1. > So, if it makes it easier for sfc, I would be open to exploring a direct > scheme for endpoint attachment, and not requiring the generic cxl_mem > driver as an intermediary. V4 is ready (just having problems when testing with 6.12-rcX) so I would like to keep it and explore this once we have something working and accepted. Type2 and Type1 with CXL cache will bring new challenges and I bet we will need refactoring in the code and potentially new design for generic code (Type3 and Type2, Type2 and Type1).
Alejandro Lucero Palau wrote: [..] > >> Then it makes sense that change. I'll do it if not already taken. I'll > >> send v4 without the PROBE_FORCE_SYNCHRONOUS flag and without the > >> previous loop with delays implemented in v3. > > So I think EPROBE_DEFER can stay out of the CXL core because it is up to > > the accelerator driver to decide whether CXL availability is fatal to > > init or not. > > > It needs support from the cxl core though. If the cxl root is not there > yet, the driver needs to know, and that is what you did in your original > patch and I'm keeping as well. So there are two ways, check if a registered @memdev has @memdev->dev.driver set, assuming you know that the cxl_mem driver is available, or call devm_cxl_enumerate_ports() yourself and skip the cxl_mem driver indirection. Setting @memdev->endpoint to ERR_PTR(-EPROBE_DEFER), as I originally had, is an even more indirect way to convey a similar result and is starting to feel a bit "mid-layer-y". > > Additionally, I am less and less convinced that Type-2 drivers should be > > forced to depend on the cxl_mem driver to attach vs just arranging for > > those Type-2 drivers to call devm_cxl_enumerate_ports() and > > devm_cxl_add_endpoint() directly. In other words I am starting to worry > > that the generic cxl_mem driver design pattern is a midlayer mistake. > > You know better than me but in my view, a Type2 should follow what a > Type3 does with some small changes for dealing with the differences, > mainly the way it is going to be used and those optional capabilities > for Type2. This makes more sense to me for Type1. If an accelerator driver *wants* to depend on cxl_mem, it can, but all the cxl_core functions that cxl_mem uses are also exported for accelerator drivers to use directly to avoid needing to guess about when / if cxl_mem is going to attach. > > So, if it makes it easier for sfc, I would be open to exploring a direct > > scheme for endpoint attachment, and not requiring the generic cxl_mem > > driver as an intermediary. > > V4 is ready (just having problems when testing with 6.12-rcX) so I would > like to keep it and explore this once we have something working and > accepted. Type2 and Type1 with CXL cache will bring new challenges and I > bet we will need refactoring in the code and potentially new design for > generic code (Type3 and Type2, Type2 and Type1). Yeah, no need to switch horses mid-race if you already have a cxl_mem dependent approach nearly ready, but a potential simplification to explore going forward.
On 10/15/24 17:37, Dan Williams wrote: > Alejandro Lucero Palau wrote: > [..] >>>> Then it makes sense that change. I'll do it if not already taken. I'll >>>> send v4 without the PROBE_FORCE_SYNCHRONOUS flag and without the >>>> previous loop with delays implemented in v3. >>> So I think EPROBE_DEFER can stay out of the CXL core because it is up to >>> the accelerator driver to decide whether CXL availability is fatal to >>> init or not. >> >> It needs support from the cxl core though. If the cxl root is not there >> yet, the driver needs to know, and that is what you did in your original >> patch and I'm keeping as well. > So there are two ways, check if a registered @memdev has > @memdev->dev.driver set, assuming you know that the cxl_mem driver is > available, or call devm_cxl_enumerate_ports() yourself and skip the > cxl_mem driver indirection. > > Setting @memdev->endpoint to ERR_PTR(-EPROBE_DEFER), as I originally > had, is an even more indirect way to convey a similar result and is > starting to feel a bit "mid-layer-y". > I was a bit confused with this answer until I read again the patch commit from your original work. The confusion came from my assumption about if the root device is not there, it is due to the hardware root initialization requiring more time. But I realize now you specifically said "the root driver has not attached yet" what turns it into this problem of kernel modules not loaded yet. If so, I think I can solve this within the type2 driver code and kconfig. Kconfig will force the driver being compiled as a module if the cxl core is a module, and with MODULE_SOFTDEP("pre: cxl_core cxl_port cxl_acpi cxl-mem) the type2 driver modprobe will trigger loading of dependencies beforehand. This makes unnecessary EPROBE_DEFER. What do you think? >>> Additionally, I am less and less convinced that Type-2 drivers should be >>> forced to depend on the cxl_mem driver to attach vs just arranging for >>> those Type-2 drivers to call devm_cxl_enumerate_ports() and >>> devm_cxl_add_endpoint() directly. In other words I am starting to worry >>> that the generic cxl_mem driver design pattern is a midlayer mistake. >> You know better than me but in my view, a Type2 should follow what a >> Type3 does with some small changes for dealing with the differences, >> mainly the way it is going to be used and those optional capabilities >> for Type2. This makes more sense to me for Type1. > If an accelerator driver *wants* to depend on cxl_mem, it can, but all > the cxl_core functions that cxl_mem uses are also exported for > accelerator drivers to use directly to avoid needing to guess about when > / if cxl_mem is going to attach. > >>> So, if it makes it easier for sfc, I would be open to exploring a direct >>> scheme for endpoint attachment, and not requiring the generic cxl_mem >>> driver as an intermediary. >> V4 is ready (just having problems when testing with 6.12-rcX) so I would >> like to keep it and explore this once we have something working and >> accepted. Type2 and Type1 with CXL cache will bring new challenges and I >> bet we will need refactoring in the code and potentially new design for >> generic code (Type3 and Type2, Type2 and Type1). > Yeah, no need to switch horses mid-race if you already have a cxl_mem > dependent approach nearly ready, but a potential simplification to > explore going forward.
Alejandro Lucero Palau wrote: [..] > > Setting @memdev->endpoint to ERR_PTR(-EPROBE_DEFER), as I originally > > had, is an even more indirect way to convey a similar result and is > > starting to feel a bit "mid-layer-y". > > > > I was a bit confused with this answer until I read again the patch > commit from your original work. > > The confusion came from my assumption about if the root device is not > there, it is due to the hardware root initialization requiring more > time. But I realize now you specifically said "the root driver has not > attached yet" what turns it into this problem of kernel modules not > loaded yet. > > If so, I think I can solve this within the type2 driver code and > kconfig. Kconfig will force the driver being compiled as a module... There should be no requirement that accelerator drivers must be built as modules. An accelerator driver simply cannot enforce, via module load order, that CXL root infrastructure is up and ready before the accelerator 'probe' routine runs. This is because enumeration order still dominiates and enumeration order is effectively random*. The accelerator driver only has 2 options, return EPROBE_DEFER until all resource dependencies are ready, or do what cxl_pci + cxl_mem do. What cxl_pci + cxl_mem do is, cxl_pci_probe() registers a memdev and then at some point later cxl_mem notices that the root infrastructure has arrived via the cxl_bus_rescan() event. Note that these patches are about fixing the assumptions of cxl_bus_rescan(), not about ensuring init order. * ...at least nothing should break if CXL root and CXL endpoint enumeration happens out of order.