Message ID | 20240529164103.31671-1-Jonathan.Cameron@huawei.com |
---|---|
Series | pci: portdrv: Add auxiliary bus and register CXL PMUs (and aer) |
On Wed, May 29, 2024 at 05:40:54PM +0100, Jonathan Cameron wrote: > Focus of this RFC is CXL Performance Monitoring Units in CXL Root Ports + > Switch USP and DSPs. > > The final patch moving AER to the aux bus is in the category of 'why > not try this' rather than something I feel is a particularly good idea. > I would like to get opinions on the various ways to avoid the double bus > situation introduced here. Some comments on other options for those existing > 'pci_express' bus devices at the end of this cover letter. > > The primary problem this RFC tries to solve is providing a means to > register the CXL PMUs found in devices which will be bound to the > PCIe portdrv. The proposed solution is to avoid the limitations of > the existing pcie service driver approach by bolting on support > for devices registered on the auxiliary_bus. > > Background > ========== > > When the CXL PMU driver was added, only the instances found in CXL type 3 > memory devices (end points) were supported. These PMUs also occur in CXL > root ports, and CXL switch upstream and downstream ports. Adding > support for these was deliberately left for future work because theses > ports are all bound to the pcie portdrv via the appropriate PCI class > code. Whilst some CXL support of functionality on RP and Switch Ports > is handled by the CXL port driver, that is not bound to the PCIe device, > is only instantiated as part of enumerating endpoints, and cannot use > interrupts because those are in control of the PCIe portdrv. A PMU with > out interrupts would be unfortunate so a different solution is needed. > > Requirements > ============ > > - Register multiple CXL PMUs (CPMUs) per portdrv instance. What resources do CPMUs use? BARs? Config space registers in PCI Capabilities? Interrupts? Are any of these resources device-specific, or are they all defined by the CXL specs? The "device" is a nice way to package those up and manage ownership since there are often interdependencies. I wish portdrv was directly implemented as part of the PCI core, without binding to the device as a "driver". We handle some other pieces of functionality that way, e.g., the PCIe Capability itself, PM, VPD, MSI/MSI-X, > - portdrv binds to the PCIe class code for PCI_CLASS_BRIDGE_PCI_NORMAL which > covers any PCI-Express port. > - PCI MSI / MSIX message based interrupts must be registered by one driver - > in this case it's the portdrv. The fact that AER/DPC/hotplug/etc (the portdrv services) use MSI/MSI-X is a really annoying part of PCIe because bridges may have a BAR or two and are allowed to have device-specific endpoint-like behavior, so we may have to figure out how to share those interrupts between the PCIe-architected stuff and the device-specific stuff. I guess that's messy no matter whether portdrv is its own separate driver or it's integrated into the core. > Side question. Does anyone care if /sys/bus/pci_express goes away? > - in theory it's ABI, but in practice it provides very little > actual ABI. I would *love* to get rid of /sys/bus/pci_express, but I don't know what, if anything, would break if we removed it. Bjorn
On Wed, 5 Jun 2024 13:04:09 -0500 Bjorn Helgaas <helgaas@kernel.org> wrote: > On Wed, May 29, 2024 at 05:40:54PM +0100, Jonathan Cameron wrote: > > Focus of this RFC is CXL Performance Monitoring Units in CXL Root Ports + > > Switch USP and DSPs. > > > > The final patch moving AER to the aux bus is in the category of 'why > > not try this' rather than something I feel is a particularly good idea. > > I would like to get opinions on the various ways to avoid the double bus > > situation introduced here. Some comments on other options for those existing > > 'pci_express' bus devices at the end of this cover letter. > > > > The primary problem this RFC tries to solve is providing a means to > > register the CXL PMUs found in devices which will be bound to the > > PCIe portdrv. The proposed solution is to avoid the limitations of > > the existing pcie service driver approach by bolting on support > > for devices registered on the auxiliary_bus. > > > > Background > > ========== > > > > When the CXL PMU driver was added, only the instances found in CXL type 3 > > memory devices (end points) were supported. These PMUs also occur in CXL > > root ports, and CXL switch upstream and downstream ports. Adding > > support for these was deliberately left for future work because theses > > ports are all bound to the pcie portdrv via the appropriate PCI class > > code. Whilst some CXL support of functionality on RP and Switch Ports > > is handled by the CXL port driver, that is not bound to the PCIe device, > > is only instantiated as part of enumerating endpoints, and cannot use > > interrupts because those are in control of the PCIe portdrv. A PMU with > > out interrupts would be unfortunate so a different solution is needed. > > > > Requirements > > ============ > > > > - Register multiple CXL PMUs (CPMUs) per portdrv instance. > > What resources do CPMUs use? BARs? Config space registers in PCI > Capabilities? Interrupts? Are any of these resources > device-specific, or are they all defined by the CXL specs? Uses the whole lot (there isn't a toy out there that the CXL spec doesn't like to play with :). Discoverable via a CXL defined DVSEC in config space. Registers are in a BAR (discovered from the DVSEC). MSI/MSIX, number in a register in the BAR. All the basic infrastructure and some performance event definitions are in the CXL spec. It's all discoverable. The events are even tagged with VID so a vendor can implement someone else's definitions or those coming from another specification. A bunch of CXL VID tagged ones exist for protocol events etc. It's a really nice general spec. If I were more of a dreamer and had the time I'd try to get PCI-SIG to adopt it and we'd finally have something generic for PCI performance counting. > > The "device" is a nice way to package those up and manage ownership > since there are often interdependencies. > > I wish portdrv was directly implemented as part of the PCI core, > without binding to the device as a "driver". We handle some other > pieces of functionality that way, e.g., the PCIe Capability itself, > PM, VPD, MSI/MSI-X, Understood, though I would guess that even then there needs to be a pci_driver binding to the class code to actually query all those facilities and get interrupts registered etc. > > > - portdrv binds to the PCIe class code for PCI_CLASS_BRIDGE_PCI_NORMAL which > > covers any PCI-Express port. > > - PCI MSI / MSIX message based interrupts must be registered by one driver - > > in this case it's the portdrv. 
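As a rough sketch of the discovery flow Jonathan describes above (a CXL-defined DVSEC in config space, CPMU registers in a BAR located via that DVSEC, MSI/MSI-X message number in a register inside that block), something along these lines should work; the EXAMPLE_* constants are illustrative stand-ins for values defined by the CXL spec and already used by drivers/cxl, not new definitions from this thread:

#include <linux/bitfield.h>
#include <linux/bits.h>
#include <linux/pci.h>

#define EXAMPLE_DVSEC_VENDOR_ID_CXL	0x1e98	/* CXL DVSEC vendor ID */
#define EXAMPLE_DVSEC_REG_LOCATOR	8	/* Register Locator DVSEC ID */
#define EXAMPLE_REG_LOCATOR_BLOCK1	0xC	/* first register block entry */
#define EXAMPLE_RBI_CPMU		4	/* CPMU register block identifier */

/* Find which BAR (and offset within it) holds a CPMU register block */
static int example_find_cpmu_regblock(struct pci_dev *pdev, int *bar, u64 *offset)
{
	u32 lo, hi;
	u16 dvsec;

	/* Locate the CXL Register Locator DVSEC in config space */
	dvsec = pci_find_dvsec_capability(pdev, EXAMPLE_DVSEC_VENDOR_ID_CXL,
					  EXAMPLE_DVSEC_REG_LOCATOR);
	if (!dvsec)
		return -ENODEV;

	/* Read the first register block entry (real code loops over all of them) */
	pci_read_config_dword(pdev, dvsec + EXAMPLE_REG_LOCATOR_BLOCK1, &lo);
	pci_read_config_dword(pdev, dvsec + EXAMPLE_REG_LOCATOR_BLOCK1 + 4, &hi);

	if (FIELD_GET(GENMASK(15, 8), lo) != EXAMPLE_RBI_CPMU)
		return -ENODEV;

	*bar = FIELD_GET(GENMASK(2, 0), lo);			/* BAR indicator */
	*offset = ((u64)hi << 32) | (lo & GENMASK(31, 16));	/* 64K-aligned offset */
	return 0;
}

The CPMU's interrupt message number is then read from a register inside that BAR region, which is what makes the question of who owns MSI/MSI-X on the port relevant here.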
> > The fact that AER/DPC/hotplug/etc (the portdrv services) use MSI/MSI-X > is a really annoying part of PCIe because bridges may have a BAR or > two and are allowed to have device-specific endpoint-like behavior, so > we may have to figure out how to share those interrupts between the > PCIe-architected stuff and the device-specific stuff. I guess that's > messy no matter whether portdrv is its own separate driver or it's > integrated into the core. Yes, the interrupt stuff is the tricky bit. I think whatever happens we end up with a pci device driver that binds to the class code. Maybe we have a way to switch in alternative drivers if that turns out to be necessary. Trying to actually manage the interrupts in the core (without a driver binding) is really tricky. Maybe we'll get there one day, but I don't think it is the first target. We should however make sure not to do anything that would make that harder such as adding ABI in the wrong places. > > > Side question. Does anyone care if /sys/bus/pci_express goes away? > > - in theory it's ABI, but in practice it provides very little > > actual ABI. > > I would *love* to get rid of /sys/bus/pci_express, but I don't know > what, if anything, would break if we removed it. Could try it and see who screams ;) or fake it via symlinks or similar (under a suitable 'legacy' config variable that is on by default.) How about I try the following steps: 0) An experiment to make sure I can fake /sys/bus/pci_express so it's at least hard to tell there aren't real 'devices' registered on that bus. Even if we decide not to do that long term, we need to know we can if a regression report comes in. Worst comes to the worst, we can register fake devices that fake drivers bind to that don't do anything beyond fill in sysfs. 1) Replace the bus/pci_express with direct function calls in portdrv. I'm thinking we have a few calls for each: bool aer_present(struct pci_dev *pci); aer_query_max_message_num() etc - used so we can do the msi/msix bit. aer_initialize() aer_finalize() aer_runtime_suspend() (where relevant for particular services) at the cost of a few more explicit calls, but with a few fewer loops, which should simplify how things fit together. 2) Fix up anything in all that functionality that really cares that they are part of portdrv. Not sure yet, but we may need a few additional things in struct pci_dev, or to pass in some state storage in the aer_initialize() call or similar. 3) Consider moving that functionality to a more appropriate place. At this point we'll have services that can be used by other drivers - though I'm not yet sure how useful that will be. 4) End up with a small pci_driver that binds to pci devices with the appropriate class code. Should even be able to make it a module. 5) (maybe do this first) carry my auxiliary_bus + cpmu stuff and allow adopting other similar cases alongside the other parts. Of course, no statements on 'when' I'll make progress if this seems like a step forwards, but probably sometime this summer. Maybe sooner. Jonathan > > Bjorn
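A rough sketch of what step 1) above might look like in the portdrv probe path, using the helper names proposed in the plan (aer_present(), aer_query_max_message_num(), aer_initialize(), aer_finalize(); the pme_* equivalents are assumed here along the same lines) - these are proposals, not existing kernel API:

#include <linux/pci.h>

static int pcie_portdrv_probe_sketch(struct pci_dev *pdev,
				     const struct pci_device_id *id)
{
	int max_msg = 0, nvec, ret;

	/* Ask each architected service which MSI/MSI-X message number it needs */
	if (aer_present(pdev))
		max_msg = max(max_msg, aer_query_max_message_num(pdev));
	if (pme_present(pdev))
		max_msg = max(max_msg, pme_query_max_message_num(pdev));
	/* ... likewise for DPC, hotplug, bandwidth notifications ... */

	/* Message numbers are 0-based, so max + 1 vectors are needed */
	nvec = pci_alloc_irq_vectors(pdev, 1, max_msg + 1,
				     PCI_IRQ_MSIX | PCI_IRQ_MSI);
	if (nvec < 0)
		return nvec;

	/* Each service requests its irq(s) and enables itself device-side */
	ret = aer_initialize(pdev);
	if (ret)
		goto err_free;
	ret = pme_initialize(pdev);
	if (ret)
		goto err_aer;

	return 0;

err_aer:
	aer_finalize(pdev);
err_free:
	pci_free_irq_vectors(pdev);
	return ret;
}

The legacy INTX fallback and the re-read of message numbers after changing the vector count are omitted for brevity; that dance comes up again later in the thread.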
On Wed, 5 Jun 2024 20:44:28 +0100 Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote: > On Wed, 5 Jun 2024 13:04:09 -0500 > Bjorn Helgaas <helgaas@kernel.org> wrote: > > > On Wed, May 29, 2024 at 05:40:54PM +0100, Jonathan Cameron wrote: > > > Focus of this RFC is CXL Performance Monitoring Units in CXL Root Ports + > > > Switch USP and DSPs. > > > > > > The final patch moving AER to the aux bus is in the category of 'why > > > not try this' rather than something I feel is a particularly good idea. > > > I would like to get opinions on the various ways to avoid the double bus > > > situation introduced here. Some comments on other options for those existing > > > 'pci_express' bus devices at the end of this cover letter. > > > > > > The primary problem this RFC tries to solve is providing a means to > > > register the CXL PMUs found in devices which will be bound to the > > > PCIe portdrv. The proposed solution is to avoid the limitations of > > > the existing pcie service driver approach by bolting on support > > > for devices registered on the auxiliary_bus. > > > > > > Background > > > ========== > > > > > > When the CXL PMU driver was added, only the instances found in CXL type 3 > > > memory devices (end points) were supported. These PMUs also occur in CXL > > > root ports, and CXL switch upstream and downstream ports. Adding > > > support for these was deliberately left for future work because theses > > > ports are all bound to the pcie portdrv via the appropriate PCI class > > > code. Whilst some CXL support of functionality on RP and Switch Ports > > > is handled by the CXL port driver, that is not bound to the PCIe device, > > > is only instantiated as part of enumerating endpoints, and cannot use > > > interrupts because those are in control of the PCIe portdrv. A PMU with > > > out interrupts would be unfortunate so a different solution is needed. > > > > > > Requirements > > > ============ > > > > > > - Register multiple CXL PMUs (CPMUs) per portdrv instance. > > > > What resources do CPMUs use? BARs? Config space registers in PCI > > Capabilities? Interrupts? Are any of these resources > > device-specific, or are they all defined by the CXL specs? > > Uses the whole lot (there isn't a toy out there that the CXL > spec doesn't like to play with :). Discoverable via a CXL defined DVSEC > in config space. Registers are in a BAR (discovered from the DVSEC). > MSI/MSIX, number in a register in the BAR. > > All the basic infrastructure and some performance event definitions > are in the CXL spec. It's all discoverable. > > The events are even tagged with VID so a vendor can implement > someone else's definitions or those coming from another specification. > A bunch of CXL VID tagged ones exist for protocol events etc. > > It's a really nice general spec. If I were more of a dreamer and had > the time I'd try to get PCI-SIG to adopt it and we'd finally have > something generic for PCI performance counting. > > > > > The "device" is a nice way to package those up and manage ownership > > since there are often interdependencies. > > > > I wish portdrv was directly implemented as part of the PCI core, > > without binding to the device as a "driver". We handle some other > > pieces of functionality that way, e.g., the PCIe Capability itself, > > PM, VPD, MSI/MSI-X, > > Understood, though I would guess that even then there needs to > be a pci_driver binding to the class code to actually query all those > facilities and get interrupts registered etc. 
Or are you thinking we can make something like the following work (even when we can't do dynamic msix)? Core bring up facilities and interrupt etc. That includes bus master so msi/msix can be issued (which makes me nervous - putting aside other concerns as IIRC there are drivers where you have to be careful to tweak registers to ensure you don't get a storm of traffic when you hit bus master enable. Specific driver may bind later - everything keeps running until the specific driver calls pci_alloc_irq_vectors(), then we have to disable all interrupts for core managed services, change the number of vectors and then move them to wherever they end up before starting them again. We have to do the similar in the unwind path. I can see we might make something like that work, but I'd be very worried we'd hit devices that didn't behave correctly under this flow even if there aren't any specification caused problems. Even if this is a possible long term plan, maybe we don't want to try and get there in one hop? Jonathan > > > > > > - portdrv binds to the PCIe class code for PCI_CLASS_BRIDGE_PCI_NORMAL which > > > covers any PCI-Express port. > > > - PCI MSI / MSIX message based interrupts must be registered by one driver - > > > in this case it's the portdrv. > > > > The fact that AER/DPC/hotplug/etc (the portdrv services) use MSI/MSI-X > > is a really annoying part of PCIe because bridges may have a BAR or > > two and are allowed to have device-specific endpoint-like behavior, so > > we may have to figure out how to share those interrupts between the > > PCIe-architected stuff and the device-specific stuff. I guess that's > > messy no matter whether portdrv is its own separate driver or it's > > integrated into the core. > > Yes, the interrupt stuff is the tricky bit. I think whatever happens > we end up with a pci device driver that binds to the class code. > Maybe we have a way to switch in alternative drivers if that turns > out to be necessary. > > Trying to actually manage the interrupts in the core (without a driver binding) > is really tricky. Maybe we'll get there one day, but I don't think > it is the first target. We should however make sure not to do anything > that would make that harder such as adding ABI in the wrong places. > > > > > > Side question. Does anyone care if /sys/bus/pci_express goes away? > > > - in theory it's ABI, but in practice it provides very little > > > actual ABI. > > > > I would *love* to get rid of /sys/bus/pci_express, but I don't know > > what, if anything, would break if we removed it. > > Could try it and see who screams ;) or fake it via symlinks or similar > (under a suitable 'legacy' config variable that is on by default.) > > How about I try the following steps: > 0) An experiment to make sure I can fake /sys/bus/pci_express so it's > at least hard to tell there aren't real 'devices' registered on that > bus. Even if we decide not to do that long term, we need to know > we can if a regressions report comes in. > Worst comes to the worse we can register fake devices that fake > drivers bind to that don't do anything beyond fill in sysfs. > 1) Replace the bus/pci_express with direct functional calls in > portdrv. > I'm thinking we have a few calls for each: > bool aer_present(struct pci_dev *pci); > aer_query_max_message_num() etc - used so we can do the msi/msix bit. 
> aer_initialize() > aer_finalize() > aer_runtime_suspend() (where relevant for particular services) > > at the cost of a few less loops and a few more explicit calls > that should simplify how things fit together. > > 2) Fix up anything in all that functionality that really cares > that they are part of portdrv. Not sure yet, but we may need > a few additional things in struct pci_dev, or a to pass in some > state storage in the ae_initialize() call or similar. > 3) Consider moving that functionality to a more appropriate place. > At this point we'll have services that can be used by other > drivers - though I'm not yet sure how useful that will be. > 4) End up with a small pci_driver that binds to pci devices with > the appropriate class code. Should even be able to make it > a module. > 5) (maybe do this first) carry my auxiliary_bus + cpmu stuff > and allow adopting other similar cases alongside the other > parts. > > Of course, no statements on 'when' I'll make progress if this seems > like a step forwards, but probably sometime this summer. Maybe sooner. > > Jonathan > > > > > Bjorn > >
On Thu, Jun 06, 2024 at 01:57:56PM +0100, Jonathan Cameron wrote: > Or are you thinking we can make something like the following work > (even when we can't do dynamic msix)? > > Core bring up facilities and interrupt etc. That includes bus master > so msi/msix can be issued (which makes me nervous - putting aside other > concerns as IIRC there are drivers where you have to be careful to > tweak registers to ensure you don't get a storm of traffic when you > hit bus master enable. > > Specific driver may bind later - everything keeps running until > the specific driver calls pci_alloc_irq_vectors(), then we have to disable all > interrupts for core managed services, change the number of vectors and > then move them to wherever they end up before starting them again. > We have to do the similar in the unwind path. My recollection is that Thomas Gleixner has brought up dynamic MSI-X allocation. So if both the PCI core services as well as the driver use MSI-X, there's no problem. For MSI, one approach might be to reserve a certain number of vectors in the core for later use by the driver. Thanks, Lukas
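For reference, the dynamic MSI-X allocation referred to here is the pci_msix_can_alloc_dyn()/pci_msix_alloc_irq_at() API (merged around v6.2). A minimal sketch of the split Lukas describes, with the surrounding structure purely illustrative:

#include <linux/msi.h>
#include <linux/pci.h>

/* Core/portdrv side: baseline MSI-X allocation for AER/PME/DPC/hotplug */
static int example_core_setup(struct pci_dev *pdev, int nr_service_vectors)
{
	return pci_alloc_irq_vectors(pdev, nr_service_vectors,
				     nr_service_vectors, PCI_IRQ_MSIX);
}

/*
 * Later-binding driver side: grow the allocation without disturbing the
 * vectors the core services are already using.
 */
static int example_driver_add_vector(struct pci_dev *pdev)
{
	struct msi_map map;

	if (!pci_msix_can_alloc_dyn(pdev))
		return -EOPNOTSUPP;

	map = pci_msix_alloc_irq_at(pdev, MSI_ANY_INDEX, NULL);
	if (map.index < 0)
		return map.index;

	return map.virq;	/* Linux irq number, ready for request_irq() */
}

Nothing equivalent exists for MSI: there the vector count can only be changed by disabling and re-enabling MSI, which is the disruptive dance described above.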
On Thu, 6 Jun 2024, Jonathan Cameron wrote: > On Wed, 5 Jun 2024 20:44:28 +0100 > Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote: > > > On Wed, 5 Jun 2024 13:04:09 -0500 > > Bjorn Helgaas <helgaas@kernel.org> wrote: > > > > > On Wed, May 29, 2024 at 05:40:54PM +0100, Jonathan Cameron wrote: > > > > Focus of this RFC is CXL Performance Monitoring Units in CXL Root Ports + > > > > Switch USP and DSPs. > > > > > > > > The final patch moving AER to the aux bus is in the category of 'why > > > > not try this' rather than something I feel is a particularly good idea. > > > > I would like to get opinions on the various ways to avoid the double bus > > > > situation introduced here. Some comments on other options for those existing > > > > 'pci_express' bus devices at the end of this cover letter. > > > > > > > > The primary problem this RFC tries to solve is providing a means to > > > > register the CXL PMUs found in devices which will be bound to the > > > > PCIe portdrv. The proposed solution is to avoid the limitations of > > > > the existing pcie service driver approach by bolting on support > > > > for devices registered on the auxiliary_bus. > > > > > > > > Background > > > > ========== > > > > > > > > When the CXL PMU driver was added, only the instances found in CXL type 3 > > > > memory devices (end points) were supported. These PMUs also occur in CXL > > > > root ports, and CXL switch upstream and downstream ports. Adding > > > > support for these was deliberately left for future work because theses > > > > ports are all bound to the pcie portdrv via the appropriate PCI class > > > > code. Whilst some CXL support of functionality on RP and Switch Ports > > > > is handled by the CXL port driver, that is not bound to the PCIe device, > > > > is only instantiated as part of enumerating endpoints, and cannot use > > > > interrupts because those are in control of the PCIe portdrv. A PMU with > > > > out interrupts would be unfortunate so a different solution is needed. > > > > > > > > Requirements > > > > ============ > > > > > > > > - Register multiple CXL PMUs (CPMUs) per portdrv instance. > > > > > > What resources do CPMUs use? BARs? Config space registers in PCI > > > Capabilities? Interrupts? Are any of these resources > > > device-specific, or are they all defined by the CXL specs? > > > > Uses the whole lot (there isn't a toy out there that the CXL > > spec doesn't like to play with :). Discoverable via a CXL defined DVSEC > > in config space. Registers are in a BAR (discovered from the DVSEC). > > MSI/MSIX, number in a register in the BAR. > > > > All the basic infrastructure and some performance event definitions > > are in the CXL spec. It's all discoverable. > > > > The events are even tagged with VID so a vendor can implement > > someone else's definitions or those coming from another specification. > > A bunch of CXL VID tagged ones exist for protocol events etc. > > > > It's a really nice general spec. If I were more of a dreamer and had > > the time I'd try to get PCI-SIG to adopt it and we'd finally have > > something generic for PCI performance counting. > > > > > > > > The "device" is a nice way to package those up and manage ownership > > > since there are often interdependencies. > > > > > > I wish portdrv was directly implemented as part of the PCI core, > > > without binding to the device as a "driver". 
We handle some other > > > pieces of functionality that way, e.g., the PCIe Capability itself, > > > PM, VPD, MSI/MSI-X, > > > > Understood, though I would guess that even then there needs to > > be a pci_driver binding to the class code to actually query all those > > facilities and get interrupts registered etc. > > Or are you thinking we can make something like the following work > (even when we can't do dynamic msix)? > > Core bring up facilities and interrupt etc. That includes bus master > so msi/msix can be issued (which makes me nervous - putting aside other > concerns as IIRC there are drivers where you have to be careful to > tweak registers to ensure you don't get a storm of traffic when you > hit bus master enable. > > Specific driver may bind later - everything keeps running until > the specific driver calls pci_alloc_irq_vectors(), then we have to disable all > interrupts for core managed services, change the number of vectors and > then move them to wherever they end up before starting them again. > We have to do the similar in the unwind path. > > I can see we might make something like that work, but I'd be very worried > we'd hit devices that didn't behave correctly under this flow even if > there aren't any specification caused problems. > > Even if this is a possible long term plan, maybe we don't want to try and > get there in one hop? I've just a small bit of detail to add here (but I'm far from expert when it comes to the driver model, rpm, etc. areas so please keep that in mind if something I say doesn't make sense). One problematic thing with current portdrv model and interrupts is that portdrv resumes too late from hotplug's point of view. When resuming, the PCI core tries to ensure downstream port's bus and devices there are in usable state by waiting for the devices before portdrv is resumed. For hotplug case, this is problematic because if devices were unplugged during suspend or are unplugged during that wait, hotplug cannot notice the unplug before the wait times out because hotplug is not yet resumed.
On Thu, 6 Jun 2024 13:57:56 +0100 Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote: > On Wed, 5 Jun 2024 20:44:28 +0100 > Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote: > > > On Wed, 5 Jun 2024 13:04:09 -0500 > > Bjorn Helgaas <helgaas@kernel.org> wrote: > > > > > On Wed, May 29, 2024 at 05:40:54PM +0100, Jonathan Cameron wrote: > > > > Focus of this RFC is CXL Performance Monitoring Units in CXL Root Ports + > > > > Switch USP and DSPs. > > > > > > > > The final patch moving AER to the aux bus is in the category of 'why > > > > not try this' rather than something I feel is a particularly good idea. > > > > I would like to get opinions on the various ways to avoid the double bus > > > > situation introduced here. Some comments on other options for those existing > > > > 'pci_express' bus devices at the end of this cover letter. > > > > > > > > The primary problem this RFC tries to solve is providing a means to > > > > register the CXL PMUs found in devices which will be bound to the > > > > PCIe portdrv. The proposed solution is to avoid the limitations of > > > > the existing pcie service driver approach by bolting on support > > > > for devices registered on the auxiliary_bus. > > > > > > > > Background > > > > ========== > > > > > > > > When the CXL PMU driver was added, only the instances found in CXL type 3 > > > > memory devices (end points) were supported. These PMUs also occur in CXL > > > > root ports, and CXL switch upstream and downstream ports. Adding > > > > support for these was deliberately left for future work because theses > > > > ports are all bound to the pcie portdrv via the appropriate PCI class > > > > code. Whilst some CXL support of functionality on RP and Switch Ports > > > > is handled by the CXL port driver, that is not bound to the PCIe device, > > > > is only instantiated as part of enumerating endpoints, and cannot use > > > > interrupts because those are in control of the PCIe portdrv. A PMU with > > > > out interrupts would be unfortunate so a different solution is needed. > > > > > > > > Requirements > > > > ============ > > > > > > > > - Register multiple CXL PMUs (CPMUs) per portdrv instance. > > > > > > What resources do CPMUs use? BARs? Config space registers in PCI > > > Capabilities? Interrupts? Are any of these resources > > > device-specific, or are they all defined by the CXL specs? > > > > Uses the whole lot (there isn't a toy out there that the CXL > > spec doesn't like to play with :). Discoverable via a CXL defined DVSEC > > in config space. Registers are in a BAR (discovered from the DVSEC). > > MSI/MSIX, number in a register in the BAR. > > > > All the basic infrastructure and some performance event definitions > > are in the CXL spec. It's all discoverable. > > > > The events are even tagged with VID so a vendor can implement > > someone else's definitions or those coming from another specification. > > A bunch of CXL VID tagged ones exist for protocol events etc. > > > > It's a really nice general spec. If I were more of a dreamer and had > > the time I'd try to get PCI-SIG to adopt it and we'd finally have > > something generic for PCI performance counting. > > > > > > > > The "device" is a nice way to package those up and manage ownership > > > since there are often interdependencies. > > > > > > I wish portdrv was directly implemented as part of the PCI core, > > > without binding to the device as a "driver". 
We handle some other > > > pieces of functionality that way, e.g., the PCIe Capability itself, > > > PM, VPD, MSI/MSI-X, > > > > Understood, though I would guess that even then there needs to > > be a pci_driver binding to the class code to actually query all those > > facilities and get interrupts registered etc. > > Or are you thinking we can make something like the following work > (even when we can't do dynamic msix)? > > Core bring up facilities and interrupt etc. That includes bus master > so msi/msix can be issued (which makes me nervous - putting aside other > concerns as IIRC there are drivers where you have to be careful to > tweak registers to ensure you don't get a storm of traffic when you > hit bus master enable. > > Specific driver may bind later - everything keeps running until > the specific driver calls pci_alloc_irq_vectors(), then we have to disable all > interrupts for core managed services, change the number of vectors and > then move them to wherever they end up before starting them again. > We have to do the similar in the unwind path. > > I can see we might make something like that work, but I'd be very worried > we'd hit devices that didn't behave correctly under this flow even if > there aren't any specification caused problems. > > Even if this is a possible long term plan, maybe we don't want to try and > get there in one hop? In the spirit of 'conference driven' development. I'm planning to put in a proposal for a discussion of portdrv issues for the LPC VFIO/IOMMU/PCI uconf having prototyped some options. A key bit will be agreeing what the actual requirements are and how we might meet them. If it's not selected I'd still be keen to discuss whatever state things are in come September - I'll just have a less busy summer ;) Jonathan > > Jonathan > > > > > > > > > > - portdrv binds to the PCIe class code for PCI_CLASS_BRIDGE_PCI_NORMAL which > > > > covers any PCI-Express port. > > > > - PCI MSI / MSIX message based interrupts must be registered by one driver - > > > > in this case it's the portdrv. > > > > > > The fact that AER/DPC/hotplug/etc (the portdrv services) use MSI/MSI-X > > > is a really annoying part of PCIe because bridges may have a BAR or > > > two and are allowed to have device-specific endpoint-like behavior, so > > > we may have to figure out how to share those interrupts between the > > > PCIe-architected stuff and the device-specific stuff. I guess that's > > > messy no matter whether portdrv is its own separate driver or it's > > > integrated into the core. > > > > Yes, the interrupt stuff is the tricky bit. I think whatever happens > > we end up with a pci device driver that binds to the class code. > > Maybe we have a way to switch in alternative drivers if that turns > > out to be necessary. > > > > Trying to actually manage the interrupts in the core (without a driver binding) > > is really tricky. Maybe we'll get there one day, but I don't think > > it is the first target. We should however make sure not to do anything > > that would make that harder such as adding ABI in the wrong places. > > > > > > > > > Side question. Does anyone care if /sys/bus/pci_express goes away? > > > > - in theory it's ABI, but in practice it provides very little > > > > actual ABI. > > > > > > I would *love* to get rid of /sys/bus/pci_express, but I don't know > > > what, if anything, would break if we removed it. 
> > > > Could try it and see who screams ;) or fake it via symlinks or similar > > (under a suitable 'legacy' config variable that is on by default.) > > > > How about I try the following steps: > > 0) An experiment to make sure I can fake /sys/bus/pci_express so it's > > at least hard to tell there aren't real 'devices' registered on that > > bus. Even if we decide not to do that long term, we need to know > > we can if a regressions report comes in. > > Worst comes to the worse we can register fake devices that fake > > drivers bind to that don't do anything beyond fill in sysfs. > > 1) Replace the bus/pci_express with direct functional calls in > > portdrv. > > I'm thinking we have a few calls for each: > > bool aer_present(struct pci_dev *pci); > > aer_query_max_message_num() etc - used so we can do the msi/msix bit. > > aer_initialize() > > aer_finalize() > > aer_runtime_suspend() (where relevant for particular services) > > > > at the cost of a few less loops and a few more explicit calls > > that should simplify how things fit together. > > > > 2) Fix up anything in all that functionality that really cares > > that they are part of portdrv. Not sure yet, but we may need > > a few additional things in struct pci_dev, or a to pass in some > > state storage in the ae_initialize() call or similar. > > 3) Consider moving that functionality to a more appropriate place. > > At this point we'll have services that can be used by other > > drivers - though I'm not yet sure how useful that will be. > > 4) End up with a small pci_driver that binds to pci devices with > > the appropriate class code. Should even be able to make it > > a module. > > 5) (maybe do this first) carry my auxiliary_bus + cpmu stuff > > and allow adopting other similar cases alongside the other > > parts. > > > > Of course, no statements on 'when' I'll make progress if this seems > > like a step forwards, but probably sometime this summer. Maybe sooner. > > > > Jonathan > > > > > > > > Bjorn > > > > >
On Thu, 6 Jun 2024 15:21:02 +0200 Lukas Wunner <lukas@wunner.de> wrote: > On Thu, Jun 06, 2024 at 01:57:56PM +0100, Jonathan Cameron wrote: > > Or are you thinking we can make something like the following work > > (even when we can't do dynamic msix)? > > > > Core bring up facilities and interrupt etc. That includes bus master > > so msi/msix can be issued (which makes me nervous - putting aside other > > concerns as IIRC there are drivers where you have to be careful to > > tweak registers to ensure you don't get a storm of traffic when you > > hit bus master enable. > > > > Specific driver may bind later - everything keeps running until > > the specific driver calls pci_alloc_irq_vectors(), then we have to disable all > > interrupts for core managed services, change the number of vectors and > > then move them to wherever they end up before starting them again. > > We have to do the similar in the unwind path. > > My recollection is that Thomas Gleixner has brought up dynamic MSI-X > allocation. So if both the PCI core services as well as the driver > use MSI-X, there's no problem. > > For MSI, one approach might be to reserve a certain number of vectors > in the core for later use by the driver. So, my first attempt at doing things in the core ran into what I think is probably a blocker. It's not entirely technical... +CC Thomas who can probably very quickly confirm if my reasoning is correct. If we move these services into the PCI core itself (as opposed to keeping a PCI portdrv layer), then we need to bring up MSI for a device before we bind a driver to that struct pci_dev / struct device. If we follow through pci_alloc_irq_vectors_affinity() -> __pci_enable_msix_range() / __pci_enable_msi_range() -> pci_setup_msi_context() -> msi_setup_device_data() -> devres_add() //there is actually devres usage before this in msi_sysfs_create_group() but this is a shorter path to illustrate the point. We will have registered a dev_res callback on the struct pci_dev that we later want to be able to bind a driver to. That driver won't bind - because lifetimes of devres will be garbage (see really_probe() for the specific check and resulting "Resources present before probing") So we in theory 'could' provide a non devres path through the MSI code, but I'm going to guess that Thomas will not be particularly keen on that to support this single use case. Thomas, can you confirm if this is something you'd consider or outright reject? Or is there an existing route to have MSI/X working pre driver bind and still be able to bind the driver later? Hence, subject to Thomas' reply to the above, I think we are back to having a pci portdrv. The options then become a case of tidying everything up and supporting one of (or a combination of) the following. 1) Make all functionality currently handled by portdrv available in easily consumed form. Handle 'extended' drivers by unbinding the class driver and binding another one (if the more specific driver happened not to bind) - we could make this consistent by checking for class matches first so we always land the generic driver if multiple are ready to go. This may be as simple as a 'devm_init_pcie_portdrv_standard(int mymaxmsgnum);' in the probe of an extended driver. 2) Add a lightweight static path for adding additional 'services' to portdrv. Similar to the current detection code to work out what message numbers are going to be needed. 3) Make portdrv support dynamic child drivers including increasing number of irq vectors if needed. 
4) Do nothing at all and revisit next year (that 'worked' for last few years :) Option 3 is most flexible but is it worth it? - If we resize irq vectors, we will need to free all interrupts anyway and then reallocate them because they may move (maybe we can elide that if we are lucky for some devices). - Will need to handle missed interrupts whilst they were disabled. - Each 'child driver' will have to provide a 'detection' routine that will be called for all registered instances so that we can match against particular capabilities, stuff from BARs etc. - Maybe we treat the 'normal' facilities from the PCI spec specially to avoid the interrupt enable/disable/enable dances except when there is something 'extra' present. That is more likely to hide odd bugs however. I'm thinking option 3 is a case of over engineering something better solved by 1 (or maybe 2). There will be an LPC uconf session on this, but I'm going to have some more time before LPC for exploration, so any inputs now might help focus that effort. If not, see you in Vienna. Jonathan
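For reference, the "Resources present before probing" blocker described before the option list above is this check in really_probe() (drivers/base/dd.c), roughly as it stands in current kernels - any devres resource attached before driver bind, such as the MSI state set up by pci_alloc_irq_vectors(), trips it:

	if (!list_empty(&dev->devres_head)) {
		dev_crit(dev, "Resources present before probing\n");
		ret = -EBUSY;
		goto done;
	}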
Jonathan! On Fri, Aug 23 2024 at 12:05, Jonathan Cameron wrote: > On Thu, 6 Jun 2024 15:21:02 +0200 Lukas Wunner <lukas@wunner.de> wrote: >> On Thu, Jun 06, 2024 at 01:57:56PM +0100, Jonathan Cameron wrote: >> > Or are you thinking we can make something like the following work >> > (even when we can't do dynamic msix)? >> > >> > Core bring up facilities and interrupt etc. That includes bus master >> > so msi/msix can be issued (which makes me nervous - putting aside other >> > concerns as IIRC there are drivers where you have to be careful to >> > tweak registers to ensure you don't get a storm of traffic when you >> > hit bus master enable. Correct. You have to be really careful about this. >> > Specific driver may bind later - everything keeps running until >> > the specific driver calls pci_alloc_irq_vectors(), then we have to disable all >> > interrupts for core managed services, change the number of vectors and >> > then move them to wherever they end up before starting them again. >> > We have to do the similar in the unwind path. >> >> My recollection is that Thomas Gleixner has brought up dynamic MSI-X >> allocation. So if both the PCI core services as well as the driver >> use MSI-X, there's no problem. Correct. If the core sets up N interrupts for core usage, then devices can later on allocate MSI-X vectors on top if the underlying interrupt domain hierarchy supports it. But see below. >> For MSI, one approach might be to reserve a certain number of vectors >> in the core for later use by the driver. MSI is a problem in several aspects. 1) You really have to know how many vectors you need at the end as you need to disable MSI in order to change the number of vectors in the config space. So if you do that after interrupts are already in use this can cause lost interrupts and stale devices. When MSI is disabled the interrupt might be routed to the legacy intX and aside of an eventual spurious interrupt message there is no trace of it. 2) Allocating N vectors upfront and only using a small portion for the core and have the rest handy for drivers is problematic when the MSI implementation does not provide masking of the individual MSI interrupts. The driver probing might result in an interrupt storm for obvious reasons. Aside of that if the core allocates N interrupts then all the resources required (interrupt descriptors....CPU vectors) are allocated right away. There is no reasonable way to do that post factum like we can do with MSI-X. Any attempt to hack that into submission is futile and I have zero interrest to even look at it. The shortcomings of MSI are known for two decades and it's sufficiently documented that it's a horrorshow for software to deal with it. Though the 'save 5 gates and deal with it in software' camp still has not got the memo. TBH, I don't care at all. We can talk about MSI-X, but MSI is not going to gain support for any of this. That's the only leverage we have to educate people who refuse to accept reality. > So, my first attempt at doing things in the core ran into what I think > is probably a blocker. It's not entirely technical... > > +CC Thomas who can probably very quickly confirm if my reasoning is > correct. > > If we move these services into the PCI core itself (as opposed keeping > a PCI portdrv layer), then we need to bring up MSI for a device before > we bind a driver to that struct pci_dev / struct device. 
> > If we follow through > pci_alloc_irq_vectors_affinity() > -> __pci_enable_msix_range() / __pci_enable_msi_range() > -> pci_setup_msi_context() > -> msi_setup_device_data() > -> devres_add() > //there is actually devres usage before this in msi_sysfs_create_group() > but this is a shorter path to illustrate the point. > > We will have registered a dev_res callback on the struct pci_dev > that we later want to be able to bind a driver to. That driver > won't bind - because lifetimes of devres will be garbage > (see really_probe() for the specific check and resulting > "Resources present before probing") That's because you are violating all layering rules. See below. > So we in theory 'could' provide a non devres path through > the MSI code, but I'm going to guess that Thomas will not > be particularly keen on that to support this single use case. Correct. > Thomas, can you confirm if this is something you'd consider > or outright reject? Or is there an existing route to have > MSI/X working pre driver bind and still be able to bind > the driver later. The initial enable has to set up at least one vector. That vector does not have to be requested by anything, but that's it. After that you can allocate with pci_msix_alloc_irq_at(). So looking at the ASCII art of the cover letter: _____________ __ ________ __________ | Port | | creates | | | | PCI Dev | |--------->| CPMU A | | |_____________| | |________| | |portdrv binds | | perf/cxlpmu binds | |________________| |___________________| \ \ \ ________ __________ \ | | | ------>| CPMU B | | |________| | | perf/cxlpmu binds | |___________________| AND _____________ __ | Type 3 | | creates ________ _________ | PCI Dev | |--------------------------------------->| | | |_____________| | | CPMU C | | | cxl_pci binds | |________| | |________________| | perf/cxpmu binds | |__________________| If I understand it correctly then both the portdrv and the cxl_pci drivers create a "bus". The CPMU devices are attached to these buses. So we are talking about distinctly different devices with the twist that these devices somehow need to utilize the MSI/X (forget MSI) facility of the device which creates the bus. From the devres perspective we look at separate devices and that's what the interrupt code expects too. This reminds me of the lengthy discussion we had about IMS a couple of years ago. https://lore.kernel.org/all/87bljg7u4f.fsf@nanos.tec.linutronix.de/ My view on that issue was wrong because the Intel people described the problem wrong. But the above pretty much begs for a proper separation and hierarchical design because you provide an actual bus and distinct devices. Reusing the ASCII art from that old thread for the second case, but it's probably the same for the first one: ]-------------------------------------------| | PCI device | ]-------------------| | | Physical function | | ]-------------------| | ]-------------------|----------| | | Control block for subdevices | | ]------------------------------| | | | <- "Subdevice BUS" | | | | | |-- Subddevice 0 | | |-- Subddevice 1 | | |-- ... | | |-- Subddevice N | ]-------------------------------------------| 1) cxl_pci driver binds to the PCI device. 2) cxl_pci driver creates AUXbus 3) Bus enumerates devices on AUXbus 4) Drivers bind to the AUXbus devices So you have a clear provider consumer relationship. Whether the subdevice utilizes resources of the PCI device or not is a hardware implementation detail. 
The important aspect is that you want to associate the subdevice resources to the subdevice instances and not to the PCI device which provides them. Let me focus on interrupts, but it's probably the same for everything else which is shared. Look at how the PCI device manages interrupts with the per device MSI mechanism. Forget doing this with either one of the legacy mechanisms. 1) It creates a hierarchical interrupt domain and gets the required resources from the provided parent domain. The PCI side does not care whether this is x86 or arm64 and it neither cares whether the parent domain does remapping or not. The only way it cares is about the features supported by the different parent domains (think multi-MSI). 2) Driver side allocations go through the per device domain That AUXbus is not any different. When the CPMU driver binds it wants to allocate interrupts. So instead of having a convoluted construct reaching into the parent PCI device, you simply can do: 1) Let the cxl_pci driver create a MSI parent domain and set that in the subdevice::msi::irqdomain pointer. 2) Provide cxl_aux_bus_create_irqdomain() which allows the CPMU device driver to create a per device interrupt domain. 3) Let the CPMU driver create its per device interrupt domain with the provided parent domain 4) Let the CPMU driver allocate its MSI-X interrupts through the per device domain Now on the cxl_pci side the AUXbus parent interrupt domain allocation function does: if (!pci_msix_enabled(pdev)) return pci_msix_enable_and_alloc_irq_at(pdev, ....); else return pci_msix_alloc_irq_at(pdev, ....); The condition can be avoided if the clx_pci driver enables MSI-X anyway for it's own purposes. Obviously you need the corresponding counterpart for free(), but that's left as an exercise for the reader. That still associates the subdevice specific MSI-X interrups to the underlying PCI device, but then you need to look at teardown. The cxl_pci driver cannot go away before the CPMU drivers are shutdown. The CPMU driver shutdown releases the interrupts through the interrupt domain hierarchy, which removes them from the parent PCI device. Once all CPMU drivers are gone, the shutdown of the cxl_pci driver gets rid of the cxl_pci driver specific interrupts and everything is cleaned up. This works out of the box on x86. The rest needs to gain support for MSI_FLAG_PCI_MSIX_ALLOC_DYN. ARM64 and parts of ARM32 are already converted over to the MSI parent domain concept, they just lack support for dynamic allocation. The rest of the arch/ world needs some more work. That's the problem of the architecture folks and that CXL auxbus thing can simply tell them if (!pci_msix_can_alloc_dyn(pdev)) return -E_MAKE_YOUR_HOME_WORK; and be done with it. Proliferating support for "two" legacy PCI/MSI mechanisms by violating layering is not going to happen. Neither I'm interrested to have creative workarounds for MSI as they are going to create more problems than they solve, unless you come up with a clean, simple and maintainable solution which works under all circumstances. I'm not holding my breath... I might not have all the crucial information as it happened in the previous IMS discussion, but looking at your ASCII art makes me halfways confident that I'm on the right track. Thanks, tglx
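Fleshing out the allocation helper sketched above with today's API: pci_msix_enable_and_alloc_irq_at() does not exist yet (it is the proposed "enable on first use" variant), but per the note that the initial enable only has to set up one vector, the parent driver can get the same effect as below. cxl_parent_msix_alloc() is a hypothetical name, and in the layered design described above it would sit behind the per-device MSI domain rather than being called directly by the CPMU driver:

#include <linux/msi.h>
#include <linux/pci.h>

static struct msi_map cxl_parent_msix_alloc(struct pci_dev *pdev)
{
	int ret;

	/* Parent already enabled MSI-X for its own vectors: just grow it */
	if (pdev->msix_enabled)
		return pci_msix_alloc_irq_at(pdev, MSI_ANY_INDEX, NULL);

	/*
	 * Otherwise enable MSI-X with a single (unrequested) vector first,
	 * then allocate dynamically on top - the spot where the proposed
	 * pci_msix_enable_and_alloc_irq_at() would slot in.
	 */
	ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_MSIX);
	if (ret < 0)
		return (struct msi_map){ .index = ret };

	return pci_msix_alloc_irq_at(pdev, MSI_ANY_INDEX, NULL);
}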
On Wed, 28 Aug 2024 23:11:46 +0200 Thomas Gleixner <tglx@linutronix.de> wrote: > Jonathan! Hi Thomas, Thanks for taking a look. This has slightly suffered from being a thread that diverged a long way from the original patch set. So the question here wasn't really related to the cover letter :( I think much of the discussion still applies however and some of it answers questions I didn't even know I had. Some new ascii art follows with the various options we are considering (having ruled out the one this patch set covers). The meat of the query was whether MSI or MSIX can be used prior to binding a driver to struct device (pci_dev). In theory possible, but the infrastructure currently doesn't enable this. The rest is details of the pain in actually using that if it is possible. You answered that I think so all good. > > On Fri, Aug 23 2024 at 12:05, Jonathan Cameron wrote: > > On Thu, 6 Jun 2024 15:21:02 +0200 Lukas Wunner <lukas@wunner.de> wrote: > >> On Thu, Jun 06, 2024 at 01:57:56PM +0100, Jonathan Cameron wrote: > >> > Or are you thinking we can make something like the following work > >> > (even when we can't do dynamic msix)? > >> > > >> > Core bring up facilities and interrupt etc. That includes bus master > >> > so msi/msix can be issued (which makes me nervous - putting aside other > >> > concerns as IIRC there are drivers where you have to be careful to > >> > tweak registers to ensure you don't get a storm of traffic when you > >> > hit bus master enable. > > Correct. You have to be really careful about this. > > >> > Specific driver may bind later - everything keeps running until > >> > the specific driver calls pci_alloc_irq_vectors(), then we have to disable all > >> > interrupts for core managed services, change the number of vectors and > >> > then move them to wherever they end up before starting them again. > >> > We have to do the similar in the unwind path. > >> > >> My recollection is that Thomas Gleixner has brought up dynamic MSI-X > >> allocation. So if both the PCI core services as well as the driver > >> use MSI-X, there's no problem. > > Correct. If the core sets up N interrupts for core usage, then devices > can later on allocate MSI-X vectors on top if the underlying interrupt > domain hierarchy supports it. But see below. > > >> For MSI, one approach might be to reserve a certain number of vectors > >> in the core for later use by the driver. > > MSI is a problem in several aspects. > > 1) You really have to know how many vectors you need at the end as you > need to disable MSI in order to change the number of vectors in the > config space. So if you do that after interrupts are already in use > this can cause lost interrupts and stale devices. When MSI is > disabled the interrupt might be routed to the legacy intX and aside > of an eventual spurious interrupt message there is no trace of it. Ah. I'd not thought of the intX fallback. Ouch, so we'd definitely need to do device side masking of all the sources before disabling MSI and poll all the status registers to see if we missed anything. That's not implausible to implement if fiddly. > > 2) Allocating N vectors upfront and only using a small portion for the > core and have the rest handy for drivers is problematic when the > MSI implementation does not provide masking of the individual MSI > interrupts. The driver probing might result in an interrupt > storm for obvious reasons. Ouch. We 'should' be ok on that because the device should be well behaved and as long as a given feature isn't enabled. 
If there are devices out there that come up with an interrupt enabled by default (to which portdrv attaches), then this problem would already have surfaced. There might be some corners where a feature isn't enabled unless enough vectors are requested but even those should have surfaced due to the portdrv 'find what is used' dance where it first requests all the MSI vectors, then checks what numbers are used (which aren't known until MSI is enabled) and then releases all the vectors. > > Aside of that if the core allocates N interrupts then all the resources > required (interrupt descriptors....CPU vectors) are allocated right > away. There is no reasonable way to do that post factum like we can do > with MSI-X. Any attempt to hack that into submission is futile and I > have zero interrest to even look at it. > > The shortcomings of MSI are known for two decades and it's sufficiently > documented that it's a horrorshow for software to deal with it. Though > the 'save 5 gates and deal with it in software' camp still has not got > the memo. TBH, I don't care at all. Snag here is the aim is refactor a driver that has to handle ancient implementation without breaking. We can enable new shiny stuff for MSI-X only but I don't want parallel infrastructure. > > We can talk about MSI-X, but MSI is not going to gain support for any of > this. That's the only leverage we have to educate people who refuse to > accept reality. Fully on board with the aim, but legacy is the pain point here. > > > So, my first attempt at doing things in the core ran into what I think > > is probably a blocker. It's not entirely technical... > > > > +CC Thomas who can probably very quickly confirm if my reasoning is > > correct. > > > > If we move these services into the PCI core itself (as opposed keeping > > a PCI portdrv layer), then we need to bring up MSI for a device before > > we bind a driver to that struct pci_dev / struct device. > > > > If we follow through > > pci_alloc_irq_vectors_affinity() > > -> __pci_enable_msix_range() / __pci_enable_msi_range() > > -> pci_setup_msi_context() > > -> msi_setup_device_data() > > -> devres_add() > > //there is actually devres usage before this in msi_sysfs_create_group() > > but this is a shorter path to illustrate the point. > > > > We will have registered a dev_res callback on the struct pci_dev > > that we later want to be able to bind a driver to. That driver > > won't bind - because lifetimes of devres will be garbage > > (see really_probe() for the specific check and resulting > > "Resources present before probing") > > That's because you are violating all layering rules. See below. Agreed. I never liked this approach. This was an exercise in ruling it out. > > > So we in theory 'could' provide a non devres path through > > the MSI code, but I'm going to guess that Thomas will not > > be particularly keen on that to support this single use case. > > Correct. Excellent. > > > Thomas, can you confirm if this is something you'd consider > > or outright reject? Or is there an existing route to have > > MSI/X working pre driver bind and still be able to bind > > the driver later. > > The initial enable has to set up at least one vector. That vector does > not have to be requested by anything, but that's it. After that you can > allocate with pci_msix_alloc_irq_at(). > > So looking at the ASCII art of the cover letter: This is focused on the CXL CPMU thing which is where the patch set was focused, but not so much the discussion on 'fixing' portdrv. 
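For reference, the "find what is used" dance mentioned above is roughly what pcie_port_enable_irq_vec() in drivers/pci/pcie/portdrv.c does today; a simplified sketch covering just PME and AER (DPC and hotplug follow the same pattern):

#include <linux/bitfield.h>
#include <linux/pci.h>

static int example_port_enable_irqs(struct pci_dev *dev)
{
	u32 nvec, aer, reg32;
	int nr_entries;
	u16 reg16;

	/* 1) Allocate the maximum so the hardware reports real message numbers */
	nr_entries = pci_alloc_irq_vectors(dev, 1, 32, /* PCIE_PORT_MAX_MSI_ENTRIES */
					   PCI_IRQ_MSIX | PCI_IRQ_MSI);
	if (nr_entries < 0)
		return nr_entries;

	/* 2) PME (and hotplug) message number lives in the PCIe Capabilities register */
	pcie_capability_read_word(dev, PCI_EXP_FLAGS, &reg16);
	nvec = FIELD_GET(PCI_EXP_FLAGS_IRQ, reg16) + 1;

	/* AER message number lives in the Root Error Status register */
	if (dev->aer_cap) {
		pci_read_config_dword(dev, dev->aer_cap + PCI_ERR_ROOT_STATUS,
				      &reg32);
		aer = FIELD_GET(PCI_ERR_ROOT_AER_IRQ, reg32);
		nvec = max(nvec, aer + 1);
	}

	/* 3) Release everything and reallocate exactly what is needed */
	if (nvec != nr_entries) {
		pci_free_irq_vectors(dev);
		nr_entries = pci_alloc_irq_vectors(dev, nvec, nvec,
						   PCI_IRQ_MSIX | PCI_IRQ_MSI);
		if (nr_entries < 0)
			return nr_entries;
	}

	return 0;
}

The reallocation in step 3 is exactly the point where message numbers can move again, hence the concern above about quiescing sources and checking status registers for anything missed while interrupts were torn down.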
> > _____________ __ ________ __________ > | Port | | creates | | | > | PCI Dev | |--------->| CPMU A | | > |_____________| | |________| | > |portdrv binds | | perf/cxlpmu binds | > |________________| |___________________| > \ > \ > \ ________ __________ > \ | | | > ------>| CPMU B | | > |________| | > | perf/cxlpmu binds | > |___________________| > > AND > > _____________ __ > | Type 3 | | creates ________ _________ > | PCI Dev | |--------------------------------------->| | | > |_____________| | | CPMU C | | > | cxl_pci binds | |________| | > |________________| | perf/cxpmu binds | > |__________________| > > If I understand it correctly then both the portdrv and the cxl_pci > drivers create a "bus". The CPMU devices are attached to these buses. > > So we are talking about distinctly different devices with the twist that > these devices somehow need to utilize the MSI/X (forget MSI) facility of > the device which creates the bus. yes. > > From the devres perspective we look at separate devices and that's what > the interrupt code expects too. This reminds me of the lengthy > discussion we had about IMS a couple of years ago. > > https://lore.kernel.org/all/87bljg7u4f.fsf@nanos.tec.linutronix.de/ > > My view on that issue was wrong because the Intel people described the > problem wrong. But the above pretty much begs for a proper separation > and hierarchical design because you provide an actual bus and distinct > devices. Reusing the ASCII art from that old thread for the second case, > but it's probably the same for the first one: > > ]-------------------------------------------| > | PCI device | > ]-------------------| | > | Physical function | | > ]-------------------| | > ]-------------------|----------| | > | Control block for subdevices | | > ]------------------------------| | > | | <- "Subdevice BUS" | > | | | > | |-- Subddevice 0 | > | |-- Subddevice 1 | > | |-- ... | > | |-- Subddevice N | > ]-------------------------------------------| > > 1) cxl_pci driver binds to the PCI device. > > 2) cxl_pci driver creates AUXbus It's using a different CXL specific bus, but I the discussion still holds and it's very auxbus like. > > 3) Bus enumerates devices on AUXbus > > 4) Drivers bind to the AUXbus devices > > So you have a clear provider consumer relationship. Whether the > subdevice utilizes resources of the PCI device or not is a hardware > implementation detail. > > The important aspect is that you want to associate the subdevice > resources to the subdevice instances and not to the PCI device which > provides them. > > Let me focus on interrupts, but it's probably the same for everything > else which is shared. > > Look at how the PCI device manages interrupts with the per device MSI > mechanism. Forget doing this with either one of the legacy mechanisms. Unfortunately legacy is a problem because the support for MSI is already upstream. So anything we do to cxl_pci and children is an optimization only. This is probably doable for the portdrv refactoring of extensions + hanging the CXL PMU off that though as we don't currently have upstream support for that combination (so no chance of regressions) Anyway I'll read this as talking about portdrv.c and the CPMU driver binding to an auxbus device rather than cxl_pci. > > 1) It creates a hierarchical interrupt domain and gets the required > resources from the provided parent domain. The PCI side does not > care whether this is x86 or arm64 and it neither cares whether the > parent domain does remapping or not. 
The only way it cares is about > the features supported by the different parent domains (think > multi-MSI). > > 2) Driver side allocations go through the per device domain > > That AUXbus is not any different. When the CPMU driver binds it wants to > allocate interrupts. So instead of having a convoluted construct > reaching into the parent PCI device, you simply can do: > > 1) Let the cxl_pci driver create a MSI parent domain and set that in > the subdevice::msi::irqdomain pointer. > > 2) Provide cxl_aux_bus_create_irqdomain() which allows the CPMU device > driver to create a per device interrupt domain. > > 3) Let the CPMU driver create its per device interrupt domain with > the provided parent domain > > 4) Let the CPMU driver allocate its MSI-X interrupts through the per > device domain > > Now on the cxl_pci side the AUXbus parent interrupt domain allocation > function does: > > if (!pci_msix_enabled(pdev)) > return pci_msix_enable_and_alloc_irq_at(pdev, ....); > else > return pci_msix_alloc_irq_at(pdev, ....); > > The condition can be avoided if the clx_pci driver enables MSI-X anyway > for it's own purposes. Obviously you need the corresponding counterpart > for free(), but that's left as an exercise for the reader. > > That still associates the subdevice specific MSI-X interrups to the > underlying PCI device, but then you need to look at teardown. The > cxl_pci driver cannot go away before the CPMU drivers are shutdown. > > The CPMU driver shutdown releases the interrupts through the interrupt > domain hierarchy, which removes them from the parent PCI device. Once > all CPMU drivers are gone, the shutdown of the cxl_pci driver gets rid > of the cxl_pci driver specific interrupts and everything is cleaned up. > > This works out of the box on x86. The rest needs to gain support for > MSI_FLAG_PCI_MSIX_ALLOC_DYN. ARM64 and parts of ARM32 are already > converted over to the MSI parent domain concept, they just lack support > for dynamic allocation. The rest of the arch/ world needs some more > work. That's the problem of the architecture folks and that CXL auxbus > thing can simply tell them > > if (!pci_msix_can_alloc_dyn(pdev)) > return -E_MAKE_YOUR_HOME_WORK; Given my focus I need the arm64 case, but fine with this for others. > > and be done with it. Proliferating support for "two" legacy PCI/MSI > mechanisms by violating layering is not going to happen. > > Neither I'm interrested to have creative workarounds for MSI as they are > going to create more problems than they solve, unless you come up with a > clean, simple and maintainable solution which works under all > circumstances. I'm not holding my breath... Nor am I. > > I might not have all the crucial information as it happened in the > previous IMS discussion, but looking at your ASCII art makes me halfways > confident that I'm on the right track. So that background I mentioned on the options for portdrv. Note that the CPMU bit is an extension to this. Much as I'd love to break non per device msix /dynamic msix it is a non starter given need to improve existing driver. However enabling CPMU and other extensions only for dynamic msix seems fine to me. Option 1. Tidy up today's solution only --------------------------------------- Layering violations etc as you state above, but IRQ + MSI has to work. Port Drv PME AER -------- --- --- | 0. Bind to pci_dev 1. Set max MSI/MSIX 2. Read per subdriver msgnum 3. Set to max msgnum seen 4. Bus master enable etc. | 5. 
Register pci_express bus devices | (note this is a weird bus we'd all | like to get rid of) |---------------------------------->| | 6. Bind PME driver | 7. Look up msgnum (again) | 8. request_irq() | 9. enable PME int (device side) | | | | |---------------------------------------------------> 6a. Bind AER driver | | 7a. Look up msgnum(again) | | 8a. request_irq() | | 9a. handle AER int (device side) | | | 10. unbind parent. | | 11. unregister children, | | |-----------------------------12. PME remove() does... | | 13. disable int (device side) | | 14. free_irq() and cancel work | | | |--------------------------------------------------->12a. AER remove() does... | 13a. Disable AER int (dev side) | 14a. free irq cancel work etc. | 15. Finish remove pordrv remove() - New functions added via same framework, so register on the pci_express bus. or via an auxbus. The auxbus was focus of this series. If we are switching to a different bus, can do the approach you suggest and dynamic MSI-X only, - would be odd to do this for the pci_express bus given need the layering breakage anyway to keep MSI working. (I think option 3 is better as it gets rid of existing layering issues that you called out). Option 2: What provoked this bit of the thread (but not the original thread which was option 1). (This doesn't work - I think) The PCI core would have to work with out dynamic MSIX as legacy but we could do what you suggested for the 'extension' part and only succeed in Extended PCI Port driver probing if that feature is available (and safe on device) Aim here was that 'core' PCI defined services 'should' need a driver bound. This is true for some already. PCI core Extended PCI Port driver. | This only handles non PCI spec stuff. | 1. Enumerate device 2. Do interrupt discover as previously done in portdrv 3. Enable irq / MSI / MSIX as appropriate for AER, PME, DPC etc but not extended features. (note this would require binding to the device, which doesn't have a driver bound yet). 4. Provide all the services original in portdrv | | 5. Driver bind happens (fails because of note above). |------------------------------------------------------------| | 6. request MSI/MSIX vectors. |<-----------------------------------------------------------| 6. Quiesce AER PME Etc. . 7. Unbind all irqs (they may move anyway) . 8. Release msi/msix vecotrs etc. . 9. Request expanded set of MSI/MISX . 10. request all AER/PME etc irqs again. . 11. Check for missed interrupts (status register reads) . and handle if we did. . |------------------------------------------------------------>| | Carry on as normal. With your notes on dynamic MSIx and domains, this could be simplified but falls down on the fact we can't handle stuff pre driver bind. We could do a layered driver with domains and maybe hide that in the PCI core, but why bother. (Option dead because you quite rightly are not interested in more layering violations to solve that devres issue). Option 3: Move to flatter design for existing portdrv (+ children) The children (standard features) become library function calls Extensibility solved in 2a and 2b below. Port Drv | 1. Set max MSI/MSIX 2. Read per subdriver msgnum 3. Set to max msgnum seen 4. Bus master enable etc. 5. PME setup Call pme_init(pci_dev); Request irq etc 6. AER setup Call aer_rp_init(pci_dev)' Request irq etc. | | | 7. unbind parent. 8. remove calls pme_exit(pci_dev) 9. 
remove calls aer_exit(pci_dev) etc So how to make this extensible ------------------------------ Option 3a: - Augment main driver - possibly only if MSIX and dynamic allocation supported. Option 3b: - Leave main driver alone, extension requires binding a specific driver. Advantage is this is simple and that we don't end up with an ever increasing amount of 'feature' discovery code in a single driver. Note that this choice has nothing to do with the interrupt bits but is more about feature discovery. Either way, structure looks like above for PCI spec defined 'normal' services + extra. Port Drv CXL PMU driver | | 1. Set max MSI/MSIX 2. Read per subdriver msgnum 3. Set to max msgnum seen If handling interrupts other than dynamic msix then we need to find child driver interrupts here too. Note that the Portdrv has to have specific code to enumerate the 'features' for which child devices are created anyway, so this isn't much worse. 4. Bus master enable etc. 5. PME setup Call pme_init(pci_dev); Request irq etc 6. AER setup Call aer_rp_init(pci_dev)' Request irq etc. | 7. Create create child devices |------------------------------------------------->Non dynamic msix 8. Request IRQ. 9. Normal driver operation. Dynamic msix 8. Setup per device MSI Domain 9. request msix vectors as you describe above. So currently I'm thinking option 3 and with your inputs above only support CXL PMU and other extended features with dynamic MSI-X. Hopefully anyone who has built an MSI only CXL switch with a CPMU hasn't read this far so won't object :) Congratulations to anyone who gets this far. My current plan is to do: 1. Squash the various existing pci_express child drivers into the portdrv as simple library calls. If using MSIX. 2. Add interrupt domain / msix stuff as you suggest + CPMU child as example. Whether we end up with option 3a or 3b can be figure out later. Thanks again for very useful feedback. Jonathan > > Thanks, > > tglx
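To make the option 3 shape concrete, the probe/remove pair of the flattened portdrv would look roughly like the sketch below. pme_init()/pme_exit(), aer_rp_init()/aer_exit() are just the hypothetical library calls named in the flow above, flat_portdrv_setup_irqs() is a made-up stand-in for steps 1-3, none of this exists in mainline, and feature discovery plus most error handling is omitted.

static int flat_portdrv_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
        int ret;

        /* Steps 1-3: size the vector table across all services, then enable. */
        ret = flat_portdrv_setup_irqs(pdev);    /* hypothetical helper */
        if (ret)
                return ret;

        pci_set_master(pdev);                   /* step 4 */

        ret = pme_init(pdev);                   /* step 5, hypothetical */
        if (ret)
                goto err_free_irqs;

        ret = aer_rp_init(pdev);                /* step 6, hypothetical */
        if (ret)
                goto err_pme;

        return 0;

err_pme:
        pme_exit(pdev);
err_free_irqs:
        pci_free_irq_vectors(pdev);
        return ret;
}

static void flat_portdrv_remove(struct pci_dev *pdev)
{
        /* Steps 8-9: tear the services down as plain calls, in reverse order. */
        aer_exit(pdev);                         /* hypothetical */
        pme_exit(pdev);                         /* hypothetical */
        pci_free_irq_vectors(pdev);
}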
Hi Thomas, One quick follow up question below. > > So looking at the ASCII art of the cover letter: > > _____________ __ ________ __________ > | Port | | creates | | | > | PCI Dev | |--------->| CPMU A | | > |_____________| | |________| | > |portdrv binds | | perf/cxlpmu binds | > |________________| |___________________| > \ > \ > \ ________ __________ > \ | | | > ------>| CPMU B | | > |________| | > | perf/cxlpmu binds | > |___________________| > > AND > > _____________ __ > | Type 3 | | creates ________ _________ > | PCI Dev | |--------------------------------------->| | | > |_____________| | | CPMU C | | > | cxl_pci binds | |________| | > |________________| | perf/cxpmu binds | > |__________________| > > If I understand it correctly then both the portdrv and the cxl_pci > drivers create a "bus". The CPMU devices are attached to these buses. > > So we are talking about distinctly different devices with the twist that > these devices somehow need to utilize the MSI/X (forget MSI) facility of > the device which creates the bus. > > From the devres perspective we look at separate devices and that's what > the interrupt code expects too. This reminds me of the lengthy > discussion we had about IMS a couple of years ago. > > https://lore.kernel.org/all/87bljg7u4f.fsf@nanos.tec.linutronix.de/ > > My view on that issue was wrong because the Intel people described the > problem wrong. But the above pretty much begs for a proper separation > and hierarchical design because you provide an actual bus and distinct > devices. Reusing the ASCII art from that old thread for the second case, > but it's probably the same for the first one: > > ]-------------------------------------------| > | PCI device | > ]-------------------| | > | Physical function | | > ]-------------------| | > ]-------------------|----------| | > | Control block for subdevices | | > ]------------------------------| | > | | <- "Subdevice BUS" | > | | | > | |-- Subddevice 0 | > | |-- Subddevice 1 | > | |-- ... | > | |-- Subddevice N | > ]-------------------------------------------| > > 1) cxl_pci driver binds to the PCI device. > > 2) cxl_pci driver creates AUXbus > > 3) Bus enumerates devices on AUXbus > > 4) Drivers bind to the AUXbus devices > > So you have a clear provider consumer relationship. Whether the > subdevice utilizes resources of the PCI device or not is a hardware > implementation detail. > > The important aspect is that you want to associate the subdevice > resources to the subdevice instances and not to the PCI device which > provides them. > > Let me focus on interrupts, but it's probably the same for everything > else which is shared. > > Look at how the PCI device manages interrupts with the per device MSI > mechanism. Forget doing this with either one of the legacy mechanisms. > > 1) It creates a hierarchical interrupt domain and gets the required > resources from the provided parent domain. The PCI side does not > care whether this is x86 or arm64 and it neither cares whether the > parent domain does remapping or not. The only way it cares is about > the features supported by the different parent domains (think > multi-MSI). > > 2) Driver side allocations go through the per device domain > > That AUXbus is not any different. When the CPMU driver binds it wants to > allocate interrupts. So instead of having a convoluted construct > reaching into the parent PCI device, you simply can do: > > 1) Let the cxl_pci driver create a MSI parent domain and set that in > the subdevice::msi::irqdomain pointer. 
> > 2) Provide cxl_aux_bus_create_irqdomain() which allows the CPMU device > driver to create a per device interrupt domain. > > 3) Let the CPMU driver create its per device interrupt domain with > the provided parent domain > > 4) Let the CPMU driver allocate its MSI-X interrupts through the per > device domain > > Now on the cxl_pci side the AUXbus parent interrupt domain allocation > function does: > > if (!pci_msix_enabled(pdev)) > return pci_msix_enable_and_alloc_irq_at(pdev, ....); > else > return pci_msix_alloc_irq_at(pdev, ....); Hi Thomas, I'm struggling to follow this suggestion. Would you expect the cxl_pci MSI parent domain to set its parent as msi_domain = irq_domain_create_hierarchy(dev_get_msi_domain(&pdev->dev), IRQ_DOMAIN_FLAG_MSI_PARENT, ... which seems to cause a problem with deadlocks in __irq_domain_alloc_irqs(), or create a separate domain structure and provide callbacks that reach into the parent domain as necessary? Or do I have this entirely wrong? I'm struggling to relate what existing code like PCI does to what I need to do here. Jonathan
On Thu, Sep 05 2024 at 12:23, Jonathan Cameron wrote: >> Look at how the PCI device manages interrupts with the per device MSI >> mechanism. Forget doing this with either one of the legacy mechanisms. >> >> 1) It creates a hierarchical interrupt domain and gets the required >> resources from the provided parent domain. The PCI side does not >> care whether this is x86 or arm64 and it neither cares whether the >> parent domain does remapping or not. The only way it cares is about >> the features supported by the different parent domains (think >> multi-MSI). >> >> 2) Driver side allocations go through the per device domain >> >> That AUXbus is not any different. When the CPMU driver binds it wants to >> allocate interrupts. So instead of having a convoluted construct >> reaching into the parent PCI device, you simply can do: >> >> 1) Let the cxl_pci driver create a MSI parent domain and set that in >> the subdevice::msi::irqdomain pointer. >> >> 2) Provide cxl_aux_bus_create_irqdomain() which allows the CPMU device >> driver to create a per device interrupt domain. >> >> 3) Let the CPMU driver create its per device interrupt domain with >> the provided parent domain >> >> 4) Let the CPMU driver allocate its MSI-X interrupts through the per >> device domain >> >> Now on the cxl_pci side the AUXbus parent interrupt domain allocation >> function does: >> >> if (!pci_msix_enabled(pdev)) >> return pci_msix_enable_and_alloc_irq_at(pdev, ....); >> else >> return pci_msix_alloc_irq_at(pdev, ....); Ignore the above. Brainfart. > I'm struggling to follow this suggestion > Would you expect the cxl_pci MSI parent domain to set it's parent as > msi_domain = irq_domain_create_hierarchy(dev_get_msi_domain(&pdev->dev), > IRQ_DOMAIN_FLAG_MSI_PARENT, > ... > which seems to cause a problem with deadlocks in __irq_domain_alloc_irqs() > or create a separate domain structure and provide callbacks that reach into > the parent domain as necessary? > > Or do I have this entirely wrong? I'm struggling to relate what existing > code like PCI does to what I need to do here. dev_get_msi_domain(&pdev->dev) is a nightmare due to the 4 different models we have: - Legacy has no domain - Non-hierarchical domain - Hierarchical domain v1 That's the global PCI/MSI domain - Hierarchical domain v2 That's the underlying MSI_PARENT domain, which is on x86 either the interrupt remap unit or the vector domain. On arm64 it's the ITS domain. See e.g. pci_msi_domain_supports() which handles the mess. Now this proposal will only work on hierarchical domain v2 because that can do the post enable allocations on MSI-X. Let's put a potential solution for MSI aside for now to avoid complexity explosion. So there are two ways to solve this: 1) Implement the CXL MSI parent domain as disjunct domain 2) Teach the V2 per device MSI-X domain to be a parent domain #1 Looks pretty straight forward, but does not seemlessly fit the hierarchical domain model and creates horrible indirections. #2 Is the proper abstraction, but requires to teach the MSI core code about stacked MSI parent domains, which should not be horribly hard or maybe just works out of the box. The PCI device driver invokes the not yet existing function pci_enable_msi_parent_domain(pci_dev). 
This function does: A) validate that this PCI device has a V2 parent domain B) validate that the device has enabled MSI-X C) validate that the PCI/MSI-X domain supports dynamic MSI-X allocation D) if #A to #C are true, it enables the PCI device parent domain I made #B a prerequisite for now, as that's an orthogonal problem, which does not need to be solved upfront. Maybe it does not need to be solved at all because the underlying PCI driver always allocates a management interrupt before dealing with the underlying "bus", which is IMHO a reasonable expectation. At least it's a reasonable restriction for getting started. That function does: msix_dom = pci_msi_get_msix_domain(pdev); msix_dom->flags |= IRQ_DOMAIN_FLAG_MSI_PARENT; msix_dom->msi_parent_ops = &pci_msix_parent_ops; When creating the CXL devices the CXL code invokes pci_msi_init_subdevice(pdev, &cxldev->device), which just does: dev_set_msi_domain(dev, pci_msi_get_msix_subdevice_parent(pdev)); That allows the CXL devices to create their own per device MSI domain via a new function pci_msi_create_subdevice_irq_domain(). That function can just use a variant of pci_create_device_domain() with a different domain template and a different irqchip, where the irqchip just redirects to the underlying parent domain chip, aka PCI-MSI-X. I don't think the CXL device will require a special chip as all they should need to know is the linux interrupt number. If there are special requirements then we can bring the IMS code back or do something similar to the platform MSI code. Then you need pci_subdev_msix_alloc_at() and pci_subdev_msix_free() which are the only two functions which the CXL (or similar) need. The existing pci_msi*() API function might need a safety check so they can't be abused by subdevices, but that's a problem for a final solution. That's pretty much all what it takes. Hope that helps. Thanks, tglx
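(For orientation, the provider side of this proposal would reduce to something like the sketch below. pci_enable_msi_parent_domain() and pci_msi_init_subdevice() are the not-yet-existing functions named above, cpmu_dev stands in for whatever subdevice the port/type 3 driver creates, and the rest is illustrative only.)

/* Sketch only: the PCI driver that owns the port or type 3 device. */
static int parent_pci_driver_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
        int ret;

        /* Management vector first, so MSI-X is already enabled (prerequisite B). */
        ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_MSIX);
        if (ret < 0)
                return ret;

        /* Proposed: checks A)-C), then mark the per device MSI-X domain as a parent. */
        ret = pci_enable_msi_parent_domain(pdev);
        if (ret)
                goto err_free;

        /* cpmu_dev: illustrative subdevice, allocation and init elided. */
        /* Hand each subdevice the parent domain before registering it... */
        pci_msi_init_subdevice(pdev, &cpmu_dev->dev);
        /* ...so the subdevice driver can stack its own domain when it binds. */
        return device_add(&cpmu_dev->dev);

err_free:
        pci_free_irq_vectors(pdev);
        return ret;
}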
On Fri, 06 Sep 2024 12:11:36 +0200 Thomas Gleixner <tglx@linutronix.de> wrote: > On Thu, Sep 05 2024 at 12:23, Jonathan Cameron wrote: > >> Look at how the PCI device manages interrupts with the per device MSI > >> mechanism. Forget doing this with either one of the legacy mechanisms. > >> > >> 1) It creates a hierarchical interrupt domain and gets the required > >> resources from the provided parent domain. The PCI side does not > >> care whether this is x86 or arm64 and it neither cares whether the > >> parent domain does remapping or not. The only way it cares is about > >> the features supported by the different parent domains (think > >> multi-MSI). > >> > >> 2) Driver side allocations go through the per device domain > >> > >> That AUXbus is not any different. When the CPMU driver binds it wants to > >> allocate interrupts. So instead of having a convoluted construct > >> reaching into the parent PCI device, you simply can do: > >> > >> 1) Let the cxl_pci driver create a MSI parent domain and set that in > >> the subdevice::msi::irqdomain pointer. > >> > >> 2) Provide cxl_aux_bus_create_irqdomain() which allows the CPMU device > >> driver to create a per device interrupt domain. > >> > >> 3) Let the CPMU driver create its per device interrupt domain with > >> the provided parent domain > >> > >> 4) Let the CPMU driver allocate its MSI-X interrupts through the per > >> device domain > >> > >> Now on the cxl_pci side the AUXbus parent interrupt domain allocation > >> function does: > >> > >> if (!pci_msix_enabled(pdev)) > >> return pci_msix_enable_and_alloc_irq_at(pdev, ....); > >> else > >> return pci_msix_alloc_irq_at(pdev, ....); > > Ignore the above. Brainfart. > > > I'm struggling to follow this suggestion > > Would you expect the cxl_pci MSI parent domain to set it's parent as > > msi_domain = irq_domain_create_hierarchy(dev_get_msi_domain(&pdev->dev), > > IRQ_DOMAIN_FLAG_MSI_PARENT, > > ... > > which seems to cause a problem with deadlocks in __irq_domain_alloc_irqs() > > or create a separate domain structure and provide callbacks that reach into > > the parent domain as necessary? > > > > Or do I have this entirely wrong? I'm struggling to relate what existing > > code like PCI does to what I need to do here. > > dev_get_msi_domain(&pdev->dev) is a nightmare due to the 4 different > models we have: > > - Legacy has no domain > > - Non-hierarchical domain > > - Hierarchical domain v1 > > That's the global PCI/MSI domain > > - Hierarchical domain v2 > > That's the underlying MSI_PARENT domain, which is on x86 > either the interrupt remap unit or the vector domain. On arm64 > it's the ITS domain. > > See e.g. pci_msi_domain_supports() which handles the mess. > > Now this proposal will only work on hierarchical domain v2 because that > can do the post enable allocations on MSI-X. Let's put a potential > solution for MSI aside for now to avoid complexity explosion. > > So there are two ways to solve this: > > 1) Implement the CXL MSI parent domain as disjunct domain > > 2) Teach the V2 per device MSI-X domain to be a parent domain > > #1 Looks pretty straight forward, but does not seemlessly fit the > hierarchical domain model and creates horrible indirections. > > #2 Is the proper abstraction, but requires to teach the MSI core code > about stacked MSI parent domains, which should not be horribly hard > or maybe just works out of the box. > > The PCI device driver invokes the not yet existing function > pci_enable_msi_parent_domain(pci_dev). 
This function does: > > A) validate that this PCI device has a V2 parent domain > > B) validate that the device has enabled MSI-X > > C) validate that the PCI/MSI-X domain supports dynamic MSI-X > allocation > > D) if #A to #C are true, it enables the PCI device parent domain > > I made #B a prerequisite for now, as that's an orthogonal problem, which > does not need to be solved upfront. Maybe it does not need to be solved > at all because the underlying PCI driver always allocates a management > interrupt before dealing with the underlying "bus", which is IMHO a > reasonable expectation. At least it's a reasonable restriction for > getting started. > > That function does: > > msix_dom = pci_msi_get_msix_domain(pdev); > msix_dom->flags |= IRQ_DOMAIN_FLAG_MSI_PARENT; > msix_dom->msi_parent_ops = &pci_msix_parent_ops; > > When creating the CXL devices the CXL code invokes > pci_msi_init_subdevice(pdev, &cxldev->device), which just does: > > dev_set_msi_domain(dev, pci_msi_get_msix_subdevice_parent(pdev)); > > That allows the CXL devices to create their own per device MSI > domain via a new function pci_msi_create_subdevice_irq_domain(). > > That function can just use a variant of pci_create_device_domain() with > a different domain template and a different irqchip, where the irqchip > just redirects to the underlying parent domain chip, aka PCI-MSI-X. > > I don't think the CXL device will require a special chip as all they > should need to know is the linux interrupt number. If there are special > requirements then we can bring the IMS code back or do something similar > to the platform MSI code. > > Then you need pci_subdev_msix_alloc_at() and pci_subdev_msix_free() > which are the only two functions which the CXL (or similar) need. > > The existing pci_msi*() API function might need a safety check so they > can't be abused by subdevices, but that's a problem for a final > solution. > > That's pretty much all what it takes. Hope that helps. Absolutely! Thanks for providing the detailed suggestion this definitely smells a lot less nasty than previous approach. I have things sort of working now but it's ugly code with a few cross layer hacks that need tidying up (around programming the msi registers from wrong 'device'), so may be a little while before I get it in a state to post. Jonathan > > Thanks, > > tglx > > >
On Fri, 6 Sep 2024 18:18:32 +0100 Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote: > On Fri, 06 Sep 2024 12:11:36 +0200 > Thomas Gleixner <tglx@linutronix.de> wrote: > > > On Thu, Sep 05 2024 at 12:23, Jonathan Cameron wrote: > > >> Look at how the PCI device manages interrupts with the per device MSI > > >> mechanism. Forget doing this with either one of the legacy mechanisms. > > >> > > >> 1) It creates a hierarchical interrupt domain and gets the required > > >> resources from the provided parent domain. The PCI side does not > > >> care whether this is x86 or arm64 and it neither cares whether the > > >> parent domain does remapping or not. The only way it cares is about > > >> the features supported by the different parent domains (think > > >> multi-MSI). > > >> > > >> 2) Driver side allocations go through the per device domain > > >> > > >> That AUXbus is not any different. When the CPMU driver binds it wants to > > >> allocate interrupts. So instead of having a convoluted construct > > >> reaching into the parent PCI device, you simply can do: > > >> > > >> 1) Let the cxl_pci driver create a MSI parent domain and set that in > > >> the subdevice::msi::irqdomain pointer. > > >> > > >> 2) Provide cxl_aux_bus_create_irqdomain() which allows the CPMU device > > >> driver to create a per device interrupt domain. > > >> > > >> 3) Let the CPMU driver create its per device interrupt domain with > > >> the provided parent domain > > >> > > >> 4) Let the CPMU driver allocate its MSI-X interrupts through the per > > >> device domain > > >> > > >> Now on the cxl_pci side the AUXbus parent interrupt domain allocation > > >> function does: > > >> > > >> if (!pci_msix_enabled(pdev)) > > >> return pci_msix_enable_and_alloc_irq_at(pdev, ....); > > >> else > > >> return pci_msix_alloc_irq_at(pdev, ....); > > > > Ignore the above. Brainfart. > > > > > I'm struggling to follow this suggestion > > > Would you expect the cxl_pci MSI parent domain to set it's parent as > > > msi_domain = irq_domain_create_hierarchy(dev_get_msi_domain(&pdev->dev), > > > IRQ_DOMAIN_FLAG_MSI_PARENT, > > > ... > > > which seems to cause a problem with deadlocks in __irq_domain_alloc_irqs() > > > or create a separate domain structure and provide callbacks that reach into > > > the parent domain as necessary? > > > > > > Or do I have this entirely wrong? I'm struggling to relate what existing > > > code like PCI does to what I need to do here. > > > > dev_get_msi_domain(&pdev->dev) is a nightmare due to the 4 different > > models we have: > > > > - Legacy has no domain > > > > - Non-hierarchical domain > > > > - Hierarchical domain v1 > > > > That's the global PCI/MSI domain > > > > - Hierarchical domain v2 > > > > That's the underlying MSI_PARENT domain, which is on x86 > > either the interrupt remap unit or the vector domain. On arm64 > > it's the ITS domain. > > > > See e.g. pci_msi_domain_supports() which handles the mess. > > > > Now this proposal will only work on hierarchical domain v2 because that > > can do the post enable allocations on MSI-X. Let's put a potential > > solution for MSI aside for now to avoid complexity explosion. > > > > So there are two ways to solve this: > > > > 1) Implement the CXL MSI parent domain as disjunct domain > > > > 2) Teach the V2 per device MSI-X domain to be a parent domain > > > > #1 Looks pretty straight forward, but does not seemlessly fit the > > hierarchical domain model and creates horrible indirections. 
> > > > #2 Is the proper abstraction, but requires to teach the MSI core code > > about stacked MSI parent domains, which should not be horribly hard > > or maybe just works out of the box. > > > > The PCI device driver invokes the not yet existing function > > pci_enable_msi_parent_domain(pci_dev). This function does: > > > > A) validate that this PCI device has a V2 parent domain > > > > B) validate that the device has enabled MSI-X > > > > C) validate that the PCI/MSI-X domain supports dynamic MSI-X > > allocation > > > > D) if #A to #C are true, it enables the PCI device parent domain > > > > I made #B a prerequisite for now, as that's an orthogonal problem, which > > does not need to be solved upfront. Maybe it does not need to be solved > > at all because the underlying PCI driver always allocates a management > > interrupt before dealing with the underlying "bus", which is IMHO a > > reasonable expectation. At least it's a reasonable restriction for > > getting started. > > > > That function does: > > > > msix_dom = pci_msi_get_msix_domain(pdev); > > msix_dom->flags |= IRQ_DOMAIN_FLAG_MSI_PARENT; > > msix_dom->msi_parent_ops = &pci_msix_parent_ops; > > > > When creating the CXL devices the CXL code invokes > > pci_msi_init_subdevice(pdev, &cxldev->device), which just does: > > > > dev_set_msi_domain(dev, pci_msi_get_msix_subdevice_parent(pdev)); > > > > That allows the CXL devices to create their own per device MSI > > domain via a new function pci_msi_create_subdevice_irq_domain(). > > > > That function can just use a variant of pci_create_device_domain() with > > a different domain template and a different irqchip, where the irqchip > > just redirects to the underlying parent domain chip, aka PCI-MSI-X. > > > > I don't think the CXL device will require a special chip as all they > > should need to know is the linux interrupt number. If there are special > > requirements then we can bring the IMS code back or do something similar > > to the platform MSI code. > > > > Then you need pci_subdev_msix_alloc_at() and pci_subdev_msix_free() > > which are the only two functions which the CXL (or similar) need. > > > > The existing pci_msi*() API function might need a safety check so they > > can't be abused by subdevices, but that's a problem for a final > > solution. > > > > That's pretty much all what it takes. Hope that helps. > > Absolutely! Thanks for providing the detailed suggestion > this definitely smells a lot less nasty than previous approach. > > I have things sort of working now but it's ugly code with a few > cross layer hacks that need tidying up (around programming the > msi registers from wrong 'device'), so may be a little > while before I get it in a state to post. Hi Thomas, My first solution had some callbacks where we created a local descriptor so that I could swap in the pci_dev->dev and just use the various standard pci/msi/irqdomain.c functions. The 'minimal' solution seems to be a bit ugly. ` There are quite a few places that make the assumption that the preirq_data->parent->chip is the right chip to for example call irq_set_affinity on. So the simple way to make it all work is to just have a msi_domain_template->msi_domain_ops->prepare_desc that sets the desc->dev = to the parent device (here the pci_dev->dev) At that point everything more or less just works and all the rest of the callbacks can use generic forms. 
Alternative might be to make calls like the one in arch/x86/kernel/apic/msi.c msi_set_affinity search for the first ancestor device that has an irq_set_affinity(). That unfortunately may affect quite a few places. Anyhow, I'll probably send out the prepare_desc hack set with driver usage etc after I've written up a proper description of problems encountered etc so you can see what it all looks like and will be more palatable in general but in the meantime this is the guts of it of the variant where the subdev related desc has the dev set to the parent device. Note for the avoidance of doubt, I don't much like this solution but maybe it will grow on me ;) Jonathan From eb5761b1cb7b6278c86c836ae552982621c3504e Mon Sep 17 00:00:00 2001 From: Jonathan Cameron <Jonathan.Cameron@huawei.com> Date: Tue, 10 Sep 2024 17:10:09 +0100 Subject: [PATCH 1/1] pci: msi: Add subdevice irq domains for dynamic MSI-X PoC code only. All sorts of missing checks etc. Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> --- arch/x86/kernel/apic/msi.c | 3 ++ drivers/pci/msi/api.c | 14 ++++++ drivers/pci/msi/irqdomain.c | 87 +++++++++++++++++++++++++++++++++++++ include/linux/msi.h | 5 +++ include/linux/pci.h | 3 ++ kernel/irq/msi.c | 5 +++ 6 files changed, 117 insertions(+) diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c index 340769242dea..8ab4391d4559 100644 --- a/arch/x86/kernel/apic/msi.c +++ b/arch/x86/kernel/apic/msi.c @@ -218,6 +218,9 @@ static bool x86_init_dev_msi_info(struct device *dev, struct irq_domain *domain, case DOMAIN_BUS_DMAR: case DOMAIN_BUS_AMDVI: break; + case DOMAIN_BUS_PCI_DEVICE_MSIX: + /* Currently needed just for the PCI MSI-X subdevice handling */ + break; default: WARN_ON_ONCE(1); return false; diff --git a/drivers/pci/msi/api.c b/drivers/pci/msi/api.c index b956ce591f96..2b4c15102671 100644 --- a/drivers/pci/msi/api.c +++ b/drivers/pci/msi/api.c @@ -179,6 +179,20 @@ void pci_msix_free_irq(struct pci_dev *dev, struct msi_map map) } EXPORT_SYMBOL_GPL(pci_msix_free_irq); +struct msi_map pci_subdev_msix_alloc_irq_at(struct device *dev, unsigned int index, + const struct irq_affinity_desc *affdesc) +{ + //missing sanity checks + return msi_domain_alloc_irq_at(dev, MSI_DEFAULT_DOMAIN, index, affdesc, NULL); +} +EXPORT_SYMBOL_GPL(pci_subdev_msix_alloc_irq_at); + +void pci_subdev_msix_free_irq(struct device *dev, struct msi_map map) +{ + msi_domain_free_irqs_range(dev, MSI_DEFAULT_DOMAIN, map.index, map.index); +} +EXPORT_SYMBOL_GPL(pci_subdev_msix_free_irq); + /** * pci_disable_msix() - Disable MSI-X interrupt mode on device * @dev: the PCI device to operate on diff --git a/drivers/pci/msi/irqdomain.c b/drivers/pci/msi/irqdomain.c index 569125726b3e..48357a8054ff 100644 --- a/drivers/pci/msi/irqdomain.c +++ b/drivers/pci/msi/irqdomain.c @@ -444,3 +444,90 @@ struct irq_domain *pci_msi_get_device_domain(struct pci_dev *pdev) DOMAIN_BUS_PCI_MSI); return dom; } + +static const struct msi_parent_ops pci_msix_parent_ops = { + .supported_flags = MSI_FLAG_PCI_MSIX | MSI_FLAG_PCI_MSIX_ALLOC_DYN, + .prefix = "PCI-MSIX-PROXY-", + .init_dev_msi_info = msi_parent_init_dev_msi_info, +}; + +int pci_msi_enable_parent_domain(struct pci_dev *pdev) +{ + struct irq_domain *msix_dom; + /* TGLX: Validate has v2 parent domain */ + /* TGLX: validate msix enabled */ + /* TGLX: Validate msix domain supports dynamics msi-x */ + + /* Enable PCI device parent domain */ + msix_dom = dev_msi_get_msix_device_domain(&pdev->dev); + msix_dom->flags |= IRQ_DOMAIN_FLAG_MSI_PARENT; + 
msix_dom->msi_parent_ops = &pci_msix_parent_ops; + return 0; +} + +void pci_msi_init_subdevice(struct pci_dev *pdev, struct device *dev) +{ + dev_set_msi_domain(dev, dev_msi_get_msix_device_domain(&pdev->dev)); +} + +static bool pci_subdev_create_device_domain(struct device *dev, const struct msi_domain_template *tmpl, + unsigned int hwsize) +{ + struct irq_domain *domain = dev_get_msi_domain(dev); + + if (!domain || !irq_domain_is_msi_parent(domain)) + return true; + + return msi_create_device_irq_domain(dev, MSI_DEFAULT_DOMAIN, tmpl, + hwsize, NULL, NULL); +} + +static void pci_subdev_msix_prepare_desc(struct irq_domain *domain, msi_alloc_info_t *arg, + struct msi_desc *desc) +{ + struct device *parent = desc->dev->parent; + + if (!desc->pci.mask_base) { + /* Not elegant - but needed for irq affinity to work? */ + desc->dev = parent; + msix_prepare_msi_desc(to_pci_dev(parent), desc); + } +} + +static const struct msi_domain_template subdev_template = { + .chip = { + .name = "SUBDEV", + .irq_mask = irq_chip_unmask_parent, + .irq_unmask = irq_chip_unmask_parent, + .irq_write_msi_msg = pci_msi_domain_write_msg, + .irq_set_affinity = irq_chip_set_affinity_parent, + .flags = IRQCHIP_ONESHOT_SAFE, + }, + .ops = { + /* + * RFC: Sets the desc->dev to the parent PCI device + * Needed for + * irq_setup_affinity() -> + * msi_set_affinity() -> + * parent = irq_d->parent_data; + * parent->chip->irq_set_affinity() to work. + * That could be made more flexible perhaps as + * currently it makes assumption that parent of + * the MSI device is the one to set the affinity on. + */ + .prepare_desc = pci_subdev_msix_prepare_desc, + /* Works because the desc->dev is the parent PCI device */ + .set_desc = pci_msi_domain_set_desc, + }, + .info = { + .flags = MSI_FLAG_PCI_MSIX | MSI_FLAG_PCI_MSIX_ALLOC_DYN | + MSI_FLAG_USE_DEF_CHIP_OPS | MSI_FLAG_USE_DEF_DOM_OPS, + .bus_token = DOMAIN_BUS_PCI_DEVICE_MSIX, + }, +}; + +bool pci_subdev_setup_device_domain(struct device *dev, unsigned int hwsize) +{ + return pci_subdev_create_device_domain(dev, &subdev_template, hwsize); +} +EXPORT_SYMBOL_GPL(pci_subdev_setup_device_domain); diff --git a/include/linux/msi.h b/include/linux/msi.h index 944979763825..ff81b4dcc1d9 100644 --- a/include/linux/msi.h +++ b/include/linux/msi.h @@ -656,8 +656,13 @@ void pci_msi_unmask_irq(struct irq_data *data); struct irq_domain *pci_msi_create_irq_domain(struct fwnode_handle *fwnode, struct msi_domain_info *info, struct irq_domain *parent); +int pci_msi_enable_parent_domain(struct pci_dev *pdev); +struct irq_domain *pci_msi_get_msix_device_domain(struct device *dev); u32 pci_msi_domain_get_msi_rid(struct irq_domain *domain, struct pci_dev *pdev); struct irq_domain *pci_msi_get_device_domain(struct pci_dev *pdev); +struct irq_domain *dev_msi_get_msix_device_domain(struct device *dev); +void pci_msi_init_subdevice(struct pci_dev *pdev, struct device *dev); +bool pci_subdev_setup_device_domain(struct device *dev, unsigned int hwsize); #else /* CONFIG_PCI_MSI */ static inline struct irq_domain *pci_msi_get_device_domain(struct pci_dev *pdev) { diff --git a/include/linux/pci.h b/include/linux/pci.h index 225de9be3006..460551f1bd6e 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -1679,6 +1679,9 @@ struct msi_map pci_msix_alloc_irq_at(struct pci_dev *dev, unsigned int index, const struct irq_affinity_desc *affdesc); void pci_msix_free_irq(struct pci_dev *pdev, struct msi_map map); +struct msi_map pci_subdev_msix_alloc_irq_at(struct device *dev, unsigned int index, + const struct 
irq_affinity_desc *affdesc); +void pci_subdev_msix_free_irq(struct device *dev, struct msi_map map); void pci_free_irq_vectors(struct pci_dev *dev); int pci_irq_vector(struct pci_dev *dev, unsigned int nr); const struct cpumask *pci_irq_get_affinity(struct pci_dev *pdev, int vec); diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c index 5fa0547ece0c..d55a91c7a496 100644 --- a/kernel/irq/msi.c +++ b/kernel/irq/msi.c @@ -1720,3 +1720,8 @@ bool msi_device_has_isolated_msi(struct device *dev) return arch_is_isolated_msi(); } EXPORT_SYMBOL_GPL(msi_device_has_isolated_msi); +struct irq_domain *dev_msi_get_msix_device_domain(struct device *dev) +{ + return dev->msi.data->__domains[0].domain; +} +
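For completeness, the consumer side this PoC is aimed at (a CPMU-like driver bound to the subdevice) would use the new entry points roughly as below. CPMU_NR_VECTORS and cpmu_irq_handler are illustrative, the PoC's missing checks apply here too, and teardown would free_irq() and then pci_subdev_msix_free_irq() in the reverse order.

/* Sketch only: subdevice (CPMU-like) driver using the PoC entry points. */
static int cpmu_probe(struct device *dev)
{
        struct msi_map map;
        int ret;

        /* Per device MSI domain stacked on the parent PCI device's MSI-X domain. */
        if (!pci_subdev_setup_device_domain(dev, CPMU_NR_VECTORS))
                return -ENODEV;

        /* Dynamically allocate one MSI-X vector through that domain. */
        map = pci_subdev_msix_alloc_irq_at(dev, MSI_ANY_INDEX, NULL);
        if (map.index < 0)
                return map.index;

        ret = request_irq(map.virq, cpmu_irq_handler, 0, "cxl-cpmu", dev);
        if (ret)
                pci_subdev_msix_free_irq(dev, map);

        return ret;
}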
On Tue, 10 Sep 2024 17:47:43 +0100 Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote: > On Fri, 6 Sep 2024 18:18:32 +0100 > Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote: > > > On Fri, 06 Sep 2024 12:11:36 +0200 > > Thomas Gleixner <tglx@linutronix.de> wrote: > > > > > On Thu, Sep 05 2024 at 12:23, Jonathan Cameron wrote: > > > >> Look at how the PCI device manages interrupts with the per device MSI > > > >> mechanism. Forget doing this with either one of the legacy mechanisms. > > > >> > > > >> 1) It creates a hierarchical interrupt domain and gets the required > > > >> resources from the provided parent domain. The PCI side does not > > > >> care whether this is x86 or arm64 and it neither cares whether the > > > >> parent domain does remapping or not. The only way it cares is about > > > >> the features supported by the different parent domains (think > > > >> multi-MSI). > > > >> > > > >> 2) Driver side allocations go through the per device domain > > > >> > > > >> That AUXbus is not any different. When the CPMU driver binds it wants to > > > >> allocate interrupts. So instead of having a convoluted construct > > > >> reaching into the parent PCI device, you simply can do: > > > >> > > > >> 1) Let the cxl_pci driver create a MSI parent domain and set that in > > > >> the subdevice::msi::irqdomain pointer. > > > >> > > > >> 2) Provide cxl_aux_bus_create_irqdomain() which allows the CPMU device > > > >> driver to create a per device interrupt domain. > > > >> > > > >> 3) Let the CPMU driver create its per device interrupt domain with > > > >> the provided parent domain > > > >> > > > >> 4) Let the CPMU driver allocate its MSI-X interrupts through the per > > > >> device domain > > > >> > > > >> Now on the cxl_pci side the AUXbus parent interrupt domain allocation > > > >> function does: > > > >> > > > >> if (!pci_msix_enabled(pdev)) > > > >> return pci_msix_enable_and_alloc_irq_at(pdev, ....); > > > >> else > > > >> return pci_msix_alloc_irq_at(pdev, ....); > > > > > > Ignore the above. Brainfart. > > > > > > > I'm struggling to follow this suggestion > > > > Would you expect the cxl_pci MSI parent domain to set it's parent as > > > > msi_domain = irq_domain_create_hierarchy(dev_get_msi_domain(&pdev->dev), > > > > IRQ_DOMAIN_FLAG_MSI_PARENT, > > > > ... > > > > which seems to cause a problem with deadlocks in __irq_domain_alloc_irqs() > > > > or create a separate domain structure and provide callbacks that reach into > > > > the parent domain as necessary? > > > > > > > > Or do I have this entirely wrong? I'm struggling to relate what existing > > > > code like PCI does to what I need to do here. > > > > > > dev_get_msi_domain(&pdev->dev) is a nightmare due to the 4 different > > > models we have: > > > > > > - Legacy has no domain > > > > > > - Non-hierarchical domain > > > > > > - Hierarchical domain v1 > > > > > > That's the global PCI/MSI domain > > > > > > - Hierarchical domain v2 > > > > > > That's the underlying MSI_PARENT domain, which is on x86 > > > either the interrupt remap unit or the vector domain. On arm64 > > > it's the ITS domain. > > > > > > See e.g. pci_msi_domain_supports() which handles the mess. > > > > > > Now this proposal will only work on hierarchical domain v2 because that > > > can do the post enable allocations on MSI-X. Let's put a potential > > > solution for MSI aside for now to avoid complexity explosion. 
> > > > > > So there are two ways to solve this: > > > > > > 1) Implement the CXL MSI parent domain as disjunct domain > > > > > > 2) Teach the V2 per device MSI-X domain to be a parent domain > > > > > > #1 Looks pretty straight forward, but does not seemlessly fit the > > > hierarchical domain model and creates horrible indirections. > > > > > > #2 Is the proper abstraction, but requires to teach the MSI core code > > > about stacked MSI parent domains, which should not be horribly hard > > > or maybe just works out of the box. > > > > > > The PCI device driver invokes the not yet existing function > > > pci_enable_msi_parent_domain(pci_dev). This function does: > > > > > > A) validate that this PCI device has a V2 parent domain > > > > > > B) validate that the device has enabled MSI-X > > > > > > C) validate that the PCI/MSI-X domain supports dynamic MSI-X > > > allocation > > > > > > D) if #A to #C are true, it enables the PCI device parent domain > > > > > > I made #B a prerequisite for now, as that's an orthogonal problem, which > > > does not need to be solved upfront. Maybe it does not need to be solved > > > at all because the underlying PCI driver always allocates a management > > > interrupt before dealing with the underlying "bus", which is IMHO a > > > reasonable expectation. At least it's a reasonable restriction for > > > getting started. > > > > > > That function does: > > > > > > msix_dom = pci_msi_get_msix_domain(pdev); > > > msix_dom->flags |= IRQ_DOMAIN_FLAG_MSI_PARENT; > > > msix_dom->msi_parent_ops = &pci_msix_parent_ops; > > > > > > When creating the CXL devices the CXL code invokes > > > pci_msi_init_subdevice(pdev, &cxldev->device), which just does: > > > > > > dev_set_msi_domain(dev, pci_msi_get_msix_subdevice_parent(pdev)); > > > > > > That allows the CXL devices to create their own per device MSI > > > domain via a new function pci_msi_create_subdevice_irq_domain(). > > > > > > That function can just use a variant of pci_create_device_domain() with > > > a different domain template and a different irqchip, where the irqchip > > > just redirects to the underlying parent domain chip, aka PCI-MSI-X. > > > > > > I don't think the CXL device will require a special chip as all they > > > should need to know is the linux interrupt number. If there are special > > > requirements then we can bring the IMS code back or do something similar > > > to the platform MSI code. > > > > > > Then you need pci_subdev_msix_alloc_at() and pci_subdev_msix_free() > > > which are the only two functions which the CXL (or similar) need. > > > > > > The existing pci_msi*() API function might need a safety check so they > > > can't be abused by subdevices, but that's a problem for a final > > > solution. > > > > > > That's pretty much all what it takes. Hope that helps. > > > > Absolutely! Thanks for providing the detailed suggestion > > this definitely smells a lot less nasty than previous approach. > > > > I have things sort of working now but it's ugly code with a few > > cross layer hacks that need tidying up (around programming the > > msi registers from wrong 'device'), so may be a little > > while before I get it in a state to post. > > Hi Thomas, I forgot to say, for anyone reading this, don't spend to much time on it yet! The whole topic area of refactoring portdrv is up for discussion in the PCI related uconf at plumbers next week. 
Whilst I quite like this approach we might end up going in a totally different direction due to legacy and other concerns :( Jonathan > > My first solution had some callbacks where we created a local > descriptor so that I could swap in the pci_dev->dev and > just use the various standard pci/msi/irqdomain.c functions. > > The 'minimal' solution seems to be a bit ugly. > ` > There are quite a few places that make the assumption that the > preirq_data->parent->chip is the right chip to for example call > irq_set_affinity on. > > So the simple way to make it all work is to just have > a msi_domain_template->msi_domain_ops->prepare_desc > that sets the desc->dev = to the parent device (here the > pci_dev->dev) > > At that point everything more or less just works and all the > rest of the callbacks can use generic forms. > > Alternative might be to make calls like the one in > arch/x86/kernel/apic/msi.c msi_set_affinity search > for the first ancestor device that has an irq_set_affinity(). > That unfortunately may affect quite a few places. > > Anyhow, I'll probably send out the prepare_desc hack set with > driver usage etc after I've written up a proper description of problems encountered > etc so you can see what it all looks like and will be more palatable in > general but in the meantime this is the guts of it of the variant where the > subdev related desc has the dev set to the parent device. > > Note for the avoidance of doubt, I don't much like this > solution but maybe it will grow on me ;) > > Jonathan > > > > From eb5761b1cb7b6278c86c836ae552982621c3504e Mon Sep 17 00:00:00 2001 > From: Jonathan Cameron <Jonathan.Cameron@huawei.com> > Date: Tue, 10 Sep 2024 17:10:09 +0100 > Subject: [PATCH 1/1] pci: msi: Add subdevice irq domains for dynamic MSI-X > > PoC code only. All sorts of missing checks etc. 
> > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> > --- > arch/x86/kernel/apic/msi.c | 3 ++ > drivers/pci/msi/api.c | 14 ++++++ > drivers/pci/msi/irqdomain.c | 87 +++++++++++++++++++++++++++++++++++++ > include/linux/msi.h | 5 +++ > include/linux/pci.h | 3 ++ > kernel/irq/msi.c | 5 +++ > 6 files changed, 117 insertions(+) > > diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c > index 340769242dea..8ab4391d4559 100644 > --- a/arch/x86/kernel/apic/msi.c > +++ b/arch/x86/kernel/apic/msi.c > @@ -218,6 +218,9 @@ static bool x86_init_dev_msi_info(struct device *dev, struct irq_domain *domain, > case DOMAIN_BUS_DMAR: > case DOMAIN_BUS_AMDVI: > break; > + case DOMAIN_BUS_PCI_DEVICE_MSIX: > + /* Currently needed just for the PCI MSI-X subdevice handling */ > + break; > default: > WARN_ON_ONCE(1); > return false; > diff --git a/drivers/pci/msi/api.c b/drivers/pci/msi/api.c > index b956ce591f96..2b4c15102671 100644 > --- a/drivers/pci/msi/api.c > +++ b/drivers/pci/msi/api.c > @@ -179,6 +179,20 @@ void pci_msix_free_irq(struct pci_dev *dev, struct msi_map map) > } > EXPORT_SYMBOL_GPL(pci_msix_free_irq); > > +struct msi_map pci_subdev_msix_alloc_irq_at(struct device *dev, unsigned int index, > + const struct irq_affinity_desc *affdesc) > +{ > + //missing sanity checks > + return msi_domain_alloc_irq_at(dev, MSI_DEFAULT_DOMAIN, index, affdesc, NULL); > +} > +EXPORT_SYMBOL_GPL(pci_subdev_msix_alloc_irq_at); > + > +void pci_subdev_msix_free_irq(struct device *dev, struct msi_map map) > +{ > + msi_domain_free_irqs_range(dev, MSI_DEFAULT_DOMAIN, map.index, map.index); > +} > +EXPORT_SYMBOL_GPL(pci_subdev_msix_free_irq); > + > /** > * pci_disable_msix() - Disable MSI-X interrupt mode on device > * @dev: the PCI device to operate on > diff --git a/drivers/pci/msi/irqdomain.c b/drivers/pci/msi/irqdomain.c > index 569125726b3e..48357a8054ff 100644 > --- a/drivers/pci/msi/irqdomain.c > +++ b/drivers/pci/msi/irqdomain.c > @@ -444,3 +444,90 @@ struct irq_domain *pci_msi_get_device_domain(struct pci_dev *pdev) > DOMAIN_BUS_PCI_MSI); > return dom; > } > + > +static const struct msi_parent_ops pci_msix_parent_ops = { > + .supported_flags = MSI_FLAG_PCI_MSIX | MSI_FLAG_PCI_MSIX_ALLOC_DYN, > + .prefix = "PCI-MSIX-PROXY-", > + .init_dev_msi_info = msi_parent_init_dev_msi_info, > +}; > + > +int pci_msi_enable_parent_domain(struct pci_dev *pdev) > +{ > + struct irq_domain *msix_dom; > + /* TGLX: Validate has v2 parent domain */ > + /* TGLX: validate msix enabled */ > + /* TGLX: Validate msix domain supports dynamics msi-x */ > + > + /* Enable PCI device parent domain */ > + msix_dom = dev_msi_get_msix_device_domain(&pdev->dev); > + msix_dom->flags |= IRQ_DOMAIN_FLAG_MSI_PARENT; > + msix_dom->msi_parent_ops = &pci_msix_parent_ops; > + return 0; > +} > + > +void pci_msi_init_subdevice(struct pci_dev *pdev, struct device *dev) > +{ > + dev_set_msi_domain(dev, dev_msi_get_msix_device_domain(&pdev->dev)); > +} > + > +static bool pci_subdev_create_device_domain(struct device *dev, const struct msi_domain_template *tmpl, > + unsigned int hwsize) > +{ > + struct irq_domain *domain = dev_get_msi_domain(dev); > + > + if (!domain || !irq_domain_is_msi_parent(domain)) > + return true; > + > + return msi_create_device_irq_domain(dev, MSI_DEFAULT_DOMAIN, tmpl, > + hwsize, NULL, NULL); > +} > + > +static void pci_subdev_msix_prepare_desc(struct irq_domain *domain, msi_alloc_info_t *arg, > + struct msi_desc *desc) > +{ > + struct device *parent = desc->dev->parent; > + > + if (!desc->pci.mask_base) { > + 
/* Not elegant - but needed for irq affinity to work? */ > + desc->dev = parent; > + msix_prepare_msi_desc(to_pci_dev(parent), desc); > + } > +} > + > +static const struct msi_domain_template subdev_template = { > + .chip = { > + .name = "SUBDEV", > + .irq_mask = irq_chip_unmask_parent, > + .irq_unmask = irq_chip_unmask_parent, > + .irq_write_msi_msg = pci_msi_domain_write_msg, > + .irq_set_affinity = irq_chip_set_affinity_parent, > + .flags = IRQCHIP_ONESHOT_SAFE, > + }, > + .ops = { > + /* > + * RFC: Sets the desc->dev to the parent PCI device > + * Needed for > + * irq_setup_affinity() -> > + * msi_set_affinity() -> > + * parent = irq_d->parent_data; > + * parent->chip->irq_set_affinity() to work. > + * That could be made more flexible perhaps as > + * currently it makes assumption that parent of > + * the MSI device is the one to set the affinity on. > + */ > + .prepare_desc = pci_subdev_msix_prepare_desc, > + /* Works because the desc->dev is the parent PCI device */ > + .set_desc = pci_msi_domain_set_desc, > + }, > + .info = { > + .flags = MSI_FLAG_PCI_MSIX | MSI_FLAG_PCI_MSIX_ALLOC_DYN | > + MSI_FLAG_USE_DEF_CHIP_OPS | MSI_FLAG_USE_DEF_DOM_OPS, > + .bus_token = DOMAIN_BUS_PCI_DEVICE_MSIX, > + }, > +}; > + > +bool pci_subdev_setup_device_domain(struct device *dev, unsigned int hwsize) > +{ > + return pci_subdev_create_device_domain(dev, &subdev_template, hwsize); > +} > +EXPORT_SYMBOL_GPL(pci_subdev_setup_device_domain); > diff --git a/include/linux/msi.h b/include/linux/msi.h > index 944979763825..ff81b4dcc1d9 100644 > --- a/include/linux/msi.h > +++ b/include/linux/msi.h > @@ -656,8 +656,13 @@ void pci_msi_unmask_irq(struct irq_data *data); > struct irq_domain *pci_msi_create_irq_domain(struct fwnode_handle *fwnode, > struct msi_domain_info *info, > struct irq_domain *parent); > +int pci_msi_enable_parent_domain(struct pci_dev *pdev); > +struct irq_domain *pci_msi_get_msix_device_domain(struct device *dev); > u32 pci_msi_domain_get_msi_rid(struct irq_domain *domain, struct pci_dev *pdev); > struct irq_domain *pci_msi_get_device_domain(struct pci_dev *pdev); > +struct irq_domain *dev_msi_get_msix_device_domain(struct device *dev); > +void pci_msi_init_subdevice(struct pci_dev *pdev, struct device *dev); > +bool pci_subdev_setup_device_domain(struct device *dev, unsigned int hwsize); > #else /* CONFIG_PCI_MSI */ > static inline struct irq_domain *pci_msi_get_device_domain(struct pci_dev *pdev) > { > diff --git a/include/linux/pci.h b/include/linux/pci.h > index 225de9be3006..460551f1bd6e 100644 > --- a/include/linux/pci.h > +++ b/include/linux/pci.h > @@ -1679,6 +1679,9 @@ struct msi_map pci_msix_alloc_irq_at(struct pci_dev *dev, unsigned int index, > const struct irq_affinity_desc *affdesc); > void pci_msix_free_irq(struct pci_dev *pdev, struct msi_map map); > > +struct msi_map pci_subdev_msix_alloc_irq_at(struct device *dev, unsigned int index, > + const struct irq_affinity_desc *affdesc); > +void pci_subdev_msix_free_irq(struct device *dev, struct msi_map map); > void pci_free_irq_vectors(struct pci_dev *dev); > int pci_irq_vector(struct pci_dev *dev, unsigned int nr); > const struct cpumask *pci_irq_get_affinity(struct pci_dev *pdev, int vec); > diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c > index 5fa0547ece0c..d55a91c7a496 100644 > --- a/kernel/irq/msi.c > +++ b/kernel/irq/msi.c > @@ -1720,3 +1720,8 @@ bool msi_device_has_isolated_msi(struct device *dev) > return arch_is_isolated_msi(); > } > EXPORT_SYMBOL_GPL(msi_device_has_isolated_msi); > +struct irq_domain 
*dev_msi_get_msix_device_domain(struct device *dev) > +{ > + return dev->msi.data->__domains[0].domain; > +} > +
On Tue, Sep 10 2024 at 17:47, Jonathan Cameron wrote: > There are quite a few places that make the assumption that the > preirq_data->parent->chip is the right chip to for example call > irq_set_affinity on. > > So the simple way to make it all work is to just have > a msi_domain_template->msi_domain_ops->prepare_desc > that sets the desc->dev = to the parent device (here the > pci_dev->dev) Creative :) > At that point everything more or less just works and all the > rest of the callbacks can use generic forms. > > Alternative might be to make calls like the one in > arch/x86/kernel/apic/msi.c msi_set_affinity search > for the first ancestor device that has an irq_set_affinity(). > That unfortunately may affect quite a few places. What's the problem? None of the CXL drivers should care about that at all. Delegating other callbacks to the parent domain is what hierarchical domains are about. If there is some helper missing, so we add it. > Anyhow, I'll probably send out the prepare_desc hack set with driver > usage etc after I've written up a proper description of problems > encountered etc so you can see what it all looks like and will be more > palatable in general but in the meantime this is the guts of it of the > variant where the subdev related desc has the dev set to the parent > device. Let me look at this pile then :) > Note for the avoidance of doubt, I don't much like this > solution but maybe it will grow on me ;) Maybe, but I doubt it will survive my abstraction taste check. > +static const struct msi_parent_ops pci_msix_parent_ops = { > + .supported_flags = MSI_FLAG_PCI_MSIX | MSI_FLAG_PCI_MSIX_ALLOC_DYN, > + .prefix = "PCI-MSIX-PROXY-", > + .init_dev_msi_info = msi_parent_init_dev_msi_info, The obligatory link to: https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#struct-declarations-and-initializers > +}; > + > +int pci_msi_enable_parent_domain(struct pci_dev *pdev) > +{ > + struct irq_domain *msix_dom; > + /* TGLX: Validate has v2 parent domain */ > + /* TGLX: validate msix enabled */ > + /* TGLX: Validate msix domain supports dynamics msi-x */ > + > + /* Enable PCI device parent domain */ > + msix_dom = dev_msi_get_msix_device_domain(&pdev->dev); > + msix_dom->flags |= IRQ_DOMAIN_FLAG_MSI_PARENT; > + msix_dom->msi_parent_ops = &pci_msix_parent_ops; That want's to be a call into the MSI core code, Something like: msi_enable_sub_parent_domain(&pdev->dev, DOMAIN_BUS_PCI_DEVICE_MSIX) or something like that. I really don't want to expose the internals of MSI. I spent a lot of effort to encapsulate that... > +void pci_msi_init_subdevice(struct pci_dev *pdev, struct device *dev) > +{ > + dev_set_msi_domain(dev, dev_msi_get_msix_device_domain(&pdev->dev)); msi_set_parent_domain(.....); > +static void pci_subdev_msix_prepare_desc(struct irq_domain *domain, msi_alloc_info_t *arg, > + struct msi_desc *desc) > +{ > + struct device *parent = desc->dev->parent; > + > + if (!desc->pci.mask_base) { > + /* Not elegant - but needed for irq affinity to work? */ Which makes me ask the question why desc->pci.mask_base is NULL in the first place. That needs some thought at the place which initializes the descriptor. 
> --- a/kernel/irq/msi.c > +++ b/kernel/irq/msi.c > @@ -1720,3 +1720,8 @@ bool msi_device_has_isolated_msi(struct device *dev) > return arch_is_isolated_msi(); > } > EXPORT_SYMBOL_GPL(msi_device_has_isolated_msi); > +struct irq_domain *dev_msi_get_msix_device_domain(struct device *dev) You clearly run out of new lines here :) > +{ > + return dev->msi.data->__domains[0].domain; > +} But aside of that. No. See above. Thanks, tglx
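To make the "call into the MSI core code" suggestion a little more concrete, one possible shape for such an entry point is sketched below. The function name, signature and locking are assumptions derived from the reply above rather than existing kernel API; the body just relocates the open-coded flag and msi_parent_ops assignment out of the PCI code.

#include <linux/irqdomain.h>
#include <linux/msi.h>

/* Hypothetical MSI core helper; illustrative only. */
int msi_enable_sub_parent_domain(struct device *dev,
				 enum irq_domain_bus_token bus_token,
				 const struct msi_parent_ops *parent_ops)
{
	struct irq_domain *dom;
	int ret = -ENODEV;

	msi_lock_descs(dev);
	/* Same lookup the RFC open-codes as dev->msi.data->__domains[0].domain */
	dom = dev->msi.data ? dev->msi.data->__domains[MSI_DEFAULT_DOMAIN].domain : NULL;
	/* The "validate v2 parent domain / MSI-X enabled" checks would sit here */
	if (dom && dom->bus_token == bus_token) {
		dom->flags |= IRQ_DOMAIN_FLAG_MSI_PARENT;
		dom->msi_parent_ops = parent_ops;
		ret = 0;
	}
	msi_unlock_descs(dev);

	return ret;
}

The PCI side would then reduce to something like msi_enable_sub_parent_domain(&pdev->dev, DOMAIN_BUS_PCI_DEVICE_MSIX, &pci_msix_parent_ops), keeping the MSI internals out of drivers/pci.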
On Tue, 10 Sep 2024 22:04:18 +0200 Thomas Gleixner <tglx@linutronix.de> wrote: > On Tue, Sep 10 2024 at 17:47, Jonathan Cameron wrote: > > There are quite a few places that make the assumption that the > > preirq_data->parent->chip is the right chip to for example call > > irq_set_affinity on. > > > > So the simple way to make it all work is to just have > > a msi_domain_template->msi_domain_ops->prepare_desc > > that sets the desc->dev = to the parent device (here the > > pci_dev->dev) > > Creative :) One bit that is challenging me is that the PCI access functions use the pci_dev that is embedded via irq_data->common->msi_desc->dev (and hence doesn't give us a obvious path to any hierarchical structures) E.g. __pci_write_msi_msg() and which checks the pci_dev is powered up. I'd like to be able to do a call to the parent similar to irq_chip_unmask_parent to handle write_msi_msg() for the new device domain. So how should this be avoided or should msi_desc->dev be the PCI device? I can see some options but maybe I'm missing the big picture. 1) Stop them using that path to get to the device because it might not be the right one. Instead use device via irq_data->chip_data (which is currently unused for these cases). Some slightly fiddly work will be needed to handle changing __pci_write_msi_msg() to talk irq_data though. 2) Maybe setting the msi descriptor dev to the one we actually write on is the right solution? (smells bad to me but I'm still getting my head around this stuff). Any immediate thoughts? Or am I still thinking about this the wrong way? (which is likely) > > > At that point everything more or less just works and all the > > rest of the callbacks can use generic forms. > > > > Alternative might be to make calls like the one in > > arch/x86/kernel/apic/msi.c msi_set_affinity search > > for the first ancestor device that has an irq_set_affinity(). > > That unfortunately may affect quite a few places. > > What's the problem? None of the CXL drivers should care about that at > all. Delegating other callbacks to the parent domain is what hierarchical > domains are about. If there is some helper missing, so we add it. > > > Anyhow, I'll probably send out the prepare_desc hack set with driver > > usage etc after I've written up a proper description of problems > > encountered etc so you can see what it all looks like and will be more > > palatable in general but in the meantime this is the guts of it of the > > variant where the subdev related desc has the dev set to the parent > > device. > > Let me look at this pile then :) > > > Note for the avoidance of doubt, I don't much like this > > solution but maybe it will grow on me ;) > > Maybe, but I doubt it will survive my abstraction taste check. > > > +static const struct msi_parent_ops pci_msix_parent_ops = { > > + .supported_flags = MSI_FLAG_PCI_MSIX | MSI_FLAG_PCI_MSIX_ALLOC_DYN, > > + .prefix = "PCI-MSIX-PROXY-", > > + .init_dev_msi_info = msi_parent_init_dev_msi_info, > > The obligatory link to: > > https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#struct-declarations-and-initializers > Thanks. Will fix up before I get anywhere near a real posting! 
> > +}; > > + > > +int pci_msi_enable_parent_domain(struct pci_dev *pdev) > > +{ > > + struct irq_domain *msix_dom; > > + /* TGLX: Validate has v2 parent domain */ > > + /* TGLX: validate msix enabled */ > > + /* TGLX: Validate msix domain supports dynamics msi-x */ > > + > > + /* Enable PCI device parent domain */ > > + msix_dom = dev_msi_get_msix_device_domain(&pdev->dev); > > + msix_dom->flags |= IRQ_DOMAIN_FLAG_MSI_PARENT; > > + msix_dom->msi_parent_ops = &pci_msix_parent_ops; > > That want's to be a call into the MSI core code, Something like: > > msi_enable_sub_parent_domain(&pdev->dev, DOMAIN_BUS_PCI_DEVICE_MSIX) > > or something like that. I really don't want to expose the internals of > MSI. I spent a lot of effort to encapsulate that... > > > +void pci_msi_init_subdevice(struct pci_dev *pdev, struct device *dev) > > +{ > > + dev_set_msi_domain(dev, dev_msi_get_msix_device_domain(&pdev->dev)); > > msi_set_parent_domain(.....); > > > +static void pci_subdev_msix_prepare_desc(struct irq_domain *domain, msi_alloc_info_t *arg, > > + struct msi_desc *desc) > > +{ > > + struct device *parent = desc->dev->parent; > > + > > + if (!desc->pci.mask_base) { > > + /* Not elegant - but needed for irq affinity to work? */ > > Which makes me ask the question why desc->pci.mask_base is NULL in the > first place. That needs some thought at the place which initializes the > descriptor. Maybe this is the other way around? The descriptor is 'never' initialized for this case. The equivalent code in msix_prepare_msi_desc() has comments " This is separate from msix_setup_msi_descs() below to handle dynamic allocations for MSI-X after initial enablement." So I think this !desc->pci.mask_base should always succeed (and so I should get rid of it and run the descriptor initialize unconditionally) It might be appropriate to call the prepare_desc from the parent msi_domain though via irq_domain->parent_domain->host_data However this has the same need to not be based on the msi_desc->dev subject to the question above. Jonathan > > > --- a/kernel/irq/msi.c > > +++ b/kernel/irq/msi.c > > @@ -1720,3 +1720,8 @@ bool msi_device_has_isolated_msi(struct device *dev) > > return arch_is_isolated_msi(); > > } > > EXPORT_SYMBOL_GPL(msi_device_has_isolated_msi); > > +struct irq_domain *dev_msi_get_msix_device_domain(struct device *dev) > > You clearly run out of new lines here :) > > > +{ > > + return dev->msi.data->__domains[0].domain; > > +} > > But aside of that. No. See above. > > Thanks, > > tglx >
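For the write_msi_msg() delegation wished for above, a parent-forwarding callback in the style of irq_chip_unmask_parent() is easy to sketch (made up here for illustration; it is not existing kernel API). Note it only moves the problem: the parent chip's pci_msi_domain_write_msg() still reaches __pci_write_msi_msg() through the same msi_desc, which is why the question of what msi_desc->dev should point at keeps coming back.

#include <linux/irq.h>
#include <linux/msi.h>

/* Hypothetical parent-forwarding write_msi_msg(); illustrative only. */
static void irq_chip_write_msi_msg_parent(struct irq_data *data,
					  struct msi_msg *msg)
{
	/* Hand the message to the parent domain's chip rather than writing here */
	data = data->parent_data;
	if (data->chip->irq_write_msi_msg)
		data->chip->irq_write_msi_msg(data, msg);
}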
On Thu, 12 Sep 2024 17:37:20 +0100 Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote: > On Tue, 10 Sep 2024 22:04:18 +0200 > Thomas Gleixner <tglx@linutronix.de> wrote: > > > On Tue, Sep 10 2024 at 17:47, Jonathan Cameron wrote: > > > There are quite a few places that make the assumption that the > > > preirq_data->parent->chip is the right chip to for example call > > > irq_set_affinity on. > > > > > > So the simple way to make it all work is to just have > > > a msi_domain_template->msi_domain_ops->prepare_desc > > > that sets the desc->dev = to the parent device (here the > > > pci_dev->dev) > > > > Creative :) > > One bit that is challenging me is that the PCI access functions > use the pci_dev that is embedded via irq_data->common->msi_desc->dev > (and hence doesn't give us a obvious path to any hierarchical structures) > E.g. __pci_write_msi_msg() and which checks the pci_dev is > powered up. > > I'd like to be able to do a call to the parent similar to irq_chip_unmask_parent > to handle write_msi_msg() for the new device domain. > > So how should this be avoided or should msi_desc->dev be the > PCI device? I thought about this whilst cycling home. Perhaps my fundamental question is more basic: which device should msi_desc->dev be? * The ultimate requester of the msi - so here the one calling our new pci_subdev_msix_alloc_irq_at(), * The one on which the msi_write_msg() should occur - here that would be the struct pci_dev given the checks needed on whether the device is powered up. If this is the case, can we instead use the irq_data->dev in __pci_write_msi_msg()? Also, is it valid to use the irq_domain->dev for callbacks such as msi_prepare_desc on the basis that (I think) it is the device that 'hosts' the irq domain, so can we use it to replace the use of msi_desc->dev in pci_msix_prepare_desc()? If we can, that enables our subdev to use a prepare_desc callback that does something like struct msi_domain_info *info; domain = domain->parent; info = domain->host_data; info->ops->prepare_desc(domain, arg, desc); Thanks for any pointers! Jonathan > > I can see some options but maybe I'm missing the big picture. > 1) Stop them using that path to get to the device because it > might not be the right one. Instead use device via irq_data->chip_data > (which is currently unused for these cases). > Some slightly fiddly work will be needed to handle changing __pci_write_msi_msg() > to talk irq_data though. > 2) Maybe setting the msi descriptor dev to the one we actually write > on is the right solution? (smells bad to me but I'm still getting > my head around this stuff). > > Any immediate thoughts? Or am I still thinking about this the wrong > way? (which is likely) > > > > > > At that point everything more or less just works and all the > > > rest of the callbacks can use generic forms. > > > > > > Alternative might be to make calls like the one in > > > arch/x86/kernel/apic/msi.c msi_set_affinity search > > > for the first ancestor device that has an irq_set_affinity(). > > > That unfortunately may affect quite a few places. > > > > What's the problem? None of the CXL drivers should care about that at > > all. Delegating other callbacks to the parent domain is what hierarchical > > domains are about. If there is some helper missing, so we add it. 
> > > > > Anyhow, I'll probably send out the prepare_desc hack set with driver > > > usage etc after I've written up a proper description of problems > > > encountered etc so you can see what it all looks like and will be more > > > palatable in general but in the meantime this is the guts of it of the > > > variant where the subdev related desc has the dev set to the parent > > > device. > > > > Let me look at this pile then :) > > > > > Note for the avoidance of doubt, I don't much like this > > > solution but maybe it will grow on me ;) > > > > Maybe, but I doubt it will survive my abstraction taste check. > > > > > +static const struct msi_parent_ops pci_msix_parent_ops = { > > > + .supported_flags = MSI_FLAG_PCI_MSIX | MSI_FLAG_PCI_MSIX_ALLOC_DYN, > > > + .prefix = "PCI-MSIX-PROXY-", > > > + .init_dev_msi_info = msi_parent_init_dev_msi_info, > > > > The obligatory link to: > > > > https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#struct-declarations-and-initializers > > > > Thanks. Will fix up before I get anywhere near a real posting! > > > > +}; > > > + > > > +int pci_msi_enable_parent_domain(struct pci_dev *pdev) > > > +{ > > > + struct irq_domain *msix_dom; > > > + /* TGLX: Validate has v2 parent domain */ > > > + /* TGLX: validate msix enabled */ > > > + /* TGLX: Validate msix domain supports dynamics msi-x */ > > > + > > > + /* Enable PCI device parent domain */ > > > + msix_dom = dev_msi_get_msix_device_domain(&pdev->dev); > > > + msix_dom->flags |= IRQ_DOMAIN_FLAG_MSI_PARENT; > > > + msix_dom->msi_parent_ops = &pci_msix_parent_ops; > > > > That want's to be a call into the MSI core code, Something like: > > > > msi_enable_sub_parent_domain(&pdev->dev, DOMAIN_BUS_PCI_DEVICE_MSIX) > > > > or something like that. I really don't want to expose the internals of > > MSI. I spent a lot of effort to encapsulate that... > > > > > +void pci_msi_init_subdevice(struct pci_dev *pdev, struct device *dev) > > > +{ > > > + dev_set_msi_domain(dev, dev_msi_get_msix_device_domain(&pdev->dev)); > > > > msi_set_parent_domain(.....); > > > > > +static void pci_subdev_msix_prepare_desc(struct irq_domain *domain, msi_alloc_info_t *arg, > > > + struct msi_desc *desc) > > > +{ > > > + struct device *parent = desc->dev->parent; > > > + > > > + if (!desc->pci.mask_base) { > > > + /* Not elegant - but needed for irq affinity to work? */ > > > > Which makes me ask the question why desc->pci.mask_base is NULL in the > > first place. That needs some thought at the place which initializes the > > descriptor. > > Maybe this is the other way around? The descriptor is 'never' initialized > for this case. The equivalent code in msix_prepare_msi_desc() has > comments > " This is separate from msix_setup_msi_descs() below to handle dynamic > allocations for MSI-X after initial enablement." > So I think this !desc->pci.mask_base should always succeed > (and so I should get rid of it and run the descriptor initialize unconditionally) > > It might be appropriate to call the prepare_desc from the parent msi_domain though > via irq_domain->parent_domain->host_data > However this has the same need to not be based on the msi_desc->dev subject > to the question above. 
> > Jonathan > > > > > > > > --- a/kernel/irq/msi.c > > > +++ b/kernel/irq/msi.c > > > @@ -1720,3 +1720,8 @@ bool msi_device_has_isolated_msi(struct device *dev) > > > return arch_is_isolated_msi(); > > > } > > > EXPORT_SYMBOL_GPL(msi_device_has_isolated_msi); > > > +struct irq_domain *dev_msi_get_msix_device_domain(struct device *dev) > > > > You clearly run out of new lines here :) > > > > > +{ > > > + return dev->msi.data->__domains[0].domain; > > > +} > > > > But aside of that. No. See above. > > > > Thanks, > > > > tglx > > >
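A slightly fuller version of the forwarding prepare_desc() sketched in the mail above might look as follows - purely illustrative. It delegates to whatever ops the parent (per-device MSI-X) domain exposes through its msi_domain_info instead of naming the PCI helper directly, and it deliberately leaves the msi_desc->dev question open, which is the part Thomas picks up on below.

#include <linux/irqdomain.h>
#include <linux/msi.h>

/* Generic forwarding variant of the subdevice prepare_desc(); illustrative only. */
static void pci_subdev_msix_prepare_desc(struct irq_domain *domain,
					 msi_alloc_info_t *arg,
					 struct msi_desc *desc)
{
	struct irq_domain *parent = domain->parent;
	struct msi_domain_info *info = parent->host_data;

	if (info->ops && info->ops->prepare_desc)
		info->ops->prepare_desc(parent, arg, desc);
}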
On Thu, Sep 12 2024 at 18:34, Jonathan Cameron wrote: > On Thu, 12 Sep 2024 17:37:20 +0100 > Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote: >> One bit that is challenging me is that the PCI access functions >> use the pci_dev that is embedded via irq_data->common->msi_desc->dev >> (and hence doesn't give us a obvious path to any hierarchical structures) >> E.g. __pci_write_msi_msg() and which checks the pci_dev is >> powered up. >> >> I'd like to be able to do a call to the parent similar to >> irq_chip_unmask_parent to handle write_msi_msg() for the new device >> domain. That's what hierarchy is about. I see your problem vs. the msi_desc::dev though. >> So how should this be avoided or should msi_desc->dev be the >> PCI device? See below. > I thought about this whilst cycling home. Perhaps my fundamental > question is more basic: which device should > msi_desc->dev be? > * The ultimate requester of the msi - so here the one calling > our new pci_subdev_msix_alloc_irq_at(), > * The one on which the msi_write_msg() should occur - here > that would be the struct pci_dev given the checks needed on > whether the device is powered up. If this is the case, can > we instead use the irq_data->dev in __pci_write_msi_msg()? Only for per device MSI domains and you need to look irq data up, so leave that alone. > Also, is it valid to use the irq_domain->dev for callbacks such > as msi_prepare_desc on the basis that (I think) it is the device > that 'hosts' the irq domain, so can we use it to replace the > use of msi_desc->dev in pci_msix_prepare_desc()? > If we can, that enables our subdev to use a prepare_desc callback > that does something like > > struct msi_domain_info *info; > > domain = domain->parent; > info = domain->host_data; > > info->ops->prepare_desc(domain, arg, desc); That should work, but you can simply do: subdev_prepare_desc(domain, arg, desc) { desc->dev = domain->parent->dev; pci_msix_prepare_desc(domain, arg, desc); } as this is a strict relationship in the PCI code. This needs a validation that nothing relies on desc->dev being the same as the device which allocated the descriptor: msi_desc_to_pci_dev(), msi_desc_to_dev() and a pile of open coded accesses. The open coded accesses should be fixed and prevented by marking the member __private so people are forced to use the accessor functions. If there is no reliance, we can just do that. I checked the MSI core code already. Nothing to see there. Thanks, tglx
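Filling out that shortcut a little, under the assumptions that the subdevice domain's parent is always the PCI device's per-device MSI-X domain and that pci_msix_prepare_desc() (currently a static helper in drivers/pci/msi/irqdomain.c) is made reachable in some form:

#include <linux/irqdomain.h>
#include <linux/msi.h>

/* Sketch of the suggested callback; illustrative, not a posted patch. */
static void subdev_prepare_desc(struct irq_domain *domain,
				msi_alloc_info_t *arg,
				struct msi_desc *desc)
{
	/* Strict relationship in the PCI code: the parent domain's dev is the pci_dev's dev */
	desc->dev = domain->parent->dev;
	pci_msix_prepare_desc(domain, arg, desc);
}

Compared with the generic forwarding variant earlier, this trades flexibility for simplicity and hard-codes the PCI parent relationship, which is acceptable here precisely because that relationship is strict.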