Message ID | 20240711090250.20827-1-johan+linaro@kernel.org (mailing list archive) |
---|---|
State | Changes Requested |
Series | arm64: dts: qcom: x1e80100: enable GICv3 ITS for PCIe |
On 11.07.2024 11:02 AM, Johan Hovold wrote:
> The DWC PCIe controller can be used with its internal MSI controller or
> with an external one such as the GICv3 Interrupt Translation Service
> (ITS).
>
> Add the msi-map properties needed to use the GIC ITS. This will also
> make Linux switch to the ITS implementation, which allows for assigning
> affinity to individual MSIs.
>
> Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
> ---

X1E CRD throws tons of correctable errors with this on PCIe6a:

[ 9.358915] pcieport 0007:00:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
[ 9.358916] pcieport 0007:00:00.0: device [17cb:0111] error status/mask=00000001/0000e000
[ 9.358917] pcieport 0007:00:00.0: [ 0] RxErr
[ 9.358921] pcieport 0007:00:00.0: AER: Multiple Correctable error message received from 0007:00:00.0
[ 9.358952] pcieport 0007:00:00.0: AER: found no error details for 0007:00:00.0
[ 9.358953] pcieport 0007:00:00.0: AER: Multiple Correctable error message received from 0007:00:00.0
[ 9.359003] pcieport 0007:00:00.0: AER: found no error details for 0007:00:00.0
[ 9.359004] pcieport 0007:00:00.0: AER: Multiple Correctable error message received from 0007:01:00.0
[ 9.359008] pcieport 0007:00:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Transmitter ID)
[ 9.359009] pcieport 0007:00:00.0: device [17cb:0111] error status/mask=00001001/0000e000
[ 9.359010] pcieport 0007:00:00.0: [ 0] RxErr
[ 9.359011] pcieport 0007:00:00.0: [12] Timeout

Konrad
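For readers following the log in the message above: the `error status/mask` pair is a bitmask over the AER Correctable Error Status register, and the bracketed numbers on the lines that follow (`[ 0] RxErr`, `[12] Timeout`) are the set bit positions. A minimal Python sketch of that decoding, assuming the short bit names that Linux prints in its AER messages; the helper names `parse_status_line` and `decode_correctable` are made up for illustration and are not kernel interfaces:

```python
# Bit positions in the PCIe AER Correctable Error Status register,
# labelled with the short names Linux uses in its AER log lines.
CORRECTABLE_BITS = {
    0: "RxErr",         # Receiver Error
    6: "BadTLP",        # Bad TLP
    7: "BadDLLP",       # Bad DLLP
    8: "Rollover",      # REPLAY_NUM Rollover
    12: "Timeout",      # Replay Timer Timeout
    13: "NonFatalErr",  # Advisory Non-Fatal Error
    14: "CorrIntErr",   # Corrected Internal Error
    15: "HeaderOF",     # Header Log Overflow
}


def parse_status_line(line):
    """Extract (status, mask) from an 'error status/mask=XXXXXXXX/YYYYYYYY' line."""
    field = line.split("error status/mask=")[1].split()[0]
    status, mask = field.split("/")
    return int(status, 16), int(mask, 16)


def decode_correctable(status, mask=0):
    """Return the names of the error bits set in status and not hidden by mask."""
    reported = status & ~mask
    return [
        name
        for bit, name in sorted(CORRECTABLE_BITS.items())
        if reported & (1 << bit)
    ]


if __name__ == "__main__":
    line = ("pcieport 0007:00:00.0: device [17cb:0111] "
            "error status/mask=00001001/0000e000")
    status, mask = parse_status_line(line)
    print(decode_correctable(status, mask))
```

Under these assumptions, the mask value `0000e000` in the log covers bits 13 to 15, so only the receiver error and replay timer timeout bits are being reported on this port.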
On Thu, Jul 11, 2024 at 11:54:15AM +0200, Konrad Dybcio wrote:
> On 11.07.2024 11:02 AM, Johan Hovold wrote:
> > The DWC PCIe controller can be used with its internal MSI controller or
> > with an external one such as the GICv3 Interrupt Translation Service
> > (ITS).
> >
> > Add the msi-map properties needed to use the GIC ITS. This will also
> > make Linux switch to the ITS implementation, which allows for assigning
> > affinity to individual MSIs.
> >
> > Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
> > ---
>
> X1E CRD throws tons of correctable errors with this on PCIe6a:
>
> [ 9.358915] pcieport 0007:00:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
> [ 9.358916] pcieport 0007:00:00.0: device [17cb:0111] error status/mask=00000001/0000e000
> [ 9.358917] pcieport 0007:00:00.0: [ 0] RxErr
> [ 9.358921] pcieport 0007:00:00.0: AER: Multiple Correctable error message received from 0007:00:00.0
> [ 9.358952] pcieport 0007:00:00.0: AER: found no error details for 0007:00:00.0
> [ 9.358953] pcieport 0007:00:00.0: AER: Multiple Correctable error message received from 0007:00:00.0
> [ 9.359003] pcieport 0007:00:00.0: AER: found no error details for 0007:00:00.0
> [ 9.359004] pcieport 0007:00:00.0: AER: Multiple Correctable error message received from 0007:01:00.0
> [ 9.359008] pcieport 0007:00:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Transmitter ID)
> [ 9.359009] pcieport 0007:00:00.0: device [17cb:0111] error status/mask=00001001/0000e000
> [ 9.359010] pcieport 0007:00:00.0: [ 0] RxErr
> [ 9.359011] pcieport 0007:00:00.0: [12] Timeout

What branch are you using? Abel reported seeing this with his branch
which has a few work-in-progress patches that try to enable 4-lane PCIe.

There are no errors with my wip branch based on rc7, and I have the same
drive as Abel.

Also note that the errors happen also without this patch applied, they
are just being reported now.

Johan
On 11.07.2024 11:58 AM, Johan Hovold wrote:
> On Thu, Jul 11, 2024 at 11:54:15AM +0200, Konrad Dybcio wrote:
>> On 11.07.2024 11:02 AM, Johan Hovold wrote:
>>> The DWC PCIe controller can be used with its internal MSI controller or
>>> with an external one such as the GICv3 Interrupt Translation Service
>>> (ITS).
>>>
>>> Add the msi-map properties needed to use the GIC ITS. This will also
>>> make Linux switch to the ITS implementation, which allows for assigning
>>> affinity to individual MSIs.
>>>
>>> Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
>>> ---
>>
>> X1E CRD throws tons of correctable errors with this on PCIe6a:
>>
>> [ 9.358915] pcieport 0007:00:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
>> [ 9.358916] pcieport 0007:00:00.0: device [17cb:0111] error status/mask=00000001/0000e000
>> [ 9.358917] pcieport 0007:00:00.0: [ 0] RxErr
>> [ 9.358921] pcieport 0007:00:00.0: AER: Multiple Correctable error message received from 0007:00:00.0
>> [ 9.358952] pcieport 0007:00:00.0: AER: found no error details for 0007:00:00.0
>> [ 9.358953] pcieport 0007:00:00.0: AER: Multiple Correctable error message received from 0007:00:00.0
>> [ 9.359003] pcieport 0007:00:00.0: AER: found no error details for 0007:00:00.0
>> [ 9.359004] pcieport 0007:00:00.0: AER: Multiple Correctable error message received from 0007:01:00.0
>> [ 9.359008] pcieport 0007:00:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Transmitter ID)
>> [ 9.359009] pcieport 0007:00:00.0: device [17cb:0111] error status/mask=00001001/0000e000
>> [ 9.359010] pcieport 0007:00:00.0: [ 0] RxErr
>> [ 9.359011] pcieport 0007:00:00.0: [12] Timeout
>
> What branch are you using? Abel reported seeing this with his branch
> which has a few work-in-progress patches that try to enable 4-lane PCIe.
>
> There are no errors with my wip branch based on rc7, and I have the same
> drive as Abel.

linux-next/master

> Also note that the errors happen also without this patch applied, they
> are just being reported now.

Ouch.. wonder how much that drives the perf down

Konrad
On Thu, Jul 11, 2024 at 12:00:50PM +0200, Konrad Dybcio wrote:
> On 11.07.2024 11:58 AM, Johan Hovold wrote:

> > What branch are you using? Abel reported seeing this with his branch
> > which has a few work-in-progress patches that try to enable 4-lane PCIe.
> >
> > There are no errors with my wip branch based on rc7, and I have the same
> > drive as Abel.
>
> linux-next/master

Hmm. Ok. We may need to disable L0s as I did for sc8280xp as well, but
that was not the cause for Abel's errors.

> > Also note that the errors happen also without this patch applied, they
> > are just being reported now.
>
> Ouch.. wonder how much that drives the perf down

Could you post the output of lspci -vv for the NVMe controller?

Johan
[ +CC: Mani ]

On Thu, Jul 11, 2024 at 11:58:08AM +0200, Johan Hovold wrote:
> On Thu, Jul 11, 2024 at 11:54:15AM +0200, Konrad Dybcio wrote:
> > On 11.07.2024 11:02 AM, Johan Hovold wrote:
> > > The DWC PCIe controller can be used with its internal MSI controller or
> > > with an external one such as the GICv3 Interrupt Translation Service
> > > (ITS).
> > >
> > > Add the msi-map properties needed to use the GIC ITS. This will also
> > > make Linux switch to the ITS implementation, which allows for assigning
> > > affinity to individual MSIs.

> > X1E CRD throws tons of correctable errors with this on PCIe6a:

> What branch are you using? Abel reported seeing this with his branch
> which has a few work-in-progress patches that try to enable 4-lane PCIe.
>
> There are no errors with my wip branch based on rc7, and I have the same
> drive as Abel.

For some reason I don't get these errors on my machine, but this has now
been confirmed by two other people running my rc branch (including Abel)
so something is broken here, for example, with the PHY settings.

I saw five correctable errors once, when running linux-next, but it took
several minutes and they were still minutes apart.

> Also note that the errors happen also without this patch applied, they
> are just being reported now.

I guess we need to track down what is causing these errors before
enabling ITS (and thereby the error reporting).

At least L0s is not involved here, as it was with sc8280xp, as the
NVMe controllers in question do not support it.

Perhaps something is off because we're running the link at half width?

Johan
On Thu, Jul 11, 2024 at 05:01:15PM +0200, Johan Hovold wrote:
> [ +CC: Mani ]
>
> On Thu, Jul 11, 2024 at 11:58:08AM +0200, Johan Hovold wrote:
> > On Thu, Jul 11, 2024 at 11:54:15AM +0200, Konrad Dybcio wrote:
> > > On 11.07.2024 11:02 AM, Johan Hovold wrote:
> > > > The DWC PCIe controller can be used with its internal MSI controller or
> > > > with an external one such as the GICv3 Interrupt Translation Service
> > > > (ITS).
> > > >
> > > > Add the msi-map properties needed to use the GIC ITS. This will also
> > > > make Linux switch to the ITS implementation, which allows for assigning
> > > > affinity to individual MSIs.
>
> > > X1E CRD throws tons of correctable errors with this on PCIe6a:
>
> > What branch are you using? Abel reported seeing this with his branch
> > which has a few work-in-progress patches that try to enable 4-lane PCIe.
> >
> > There are no errors with my wip branch based on rc7, and I have the same
> > drive as Abel.
>
> For some reason I don't get these errors on my machine, but this has now
> been confirmed by two other people running my rc branch (including Abel)
> so something is broken here, for example, with the PHY settings.
>

I saw AER errors on Abel's machine during probe with 4-lane PHY settings. And
that might be the indication why the link width got downgraded to x2. This is
still not yet resolved.

> I saw five correctable errors once, when running linux-next, but it took
> several minutes and they were still minutes apart.
>
> > Also note that the errors happen also without this patch applied, they
> > are just being reported now.
>
> I guess we need to track down what is causing these errors before
> enabling ITS (and thereby the error reporting).
>
> At least L0s is not involved here, as it was with sc8280xp, as the
> NVMe controllers in question do not support it.
>
> Perhaps something is off because we're running the link at half width?
>

My hunch is the PHY settings. But Abel cross checked the PHY settings with
internal documentation and they seem to match. Also, Qcom submitted a series
that is supposed to fix stability issues with Gen4 [1]. With this series, Gen 4
x4 setup is working on SA8775P-RIDE board as reported by Qcom. But Abel
confirmed that it didn't help him with the link downgrade issue.

Perhaps you can give it a try and see if it makes any difference for this issue?

Meantime, I'm checking with Qcom contacts on this.

- Mani

[1] https://lore.kernel.org/linux-pci/20240320071527.13443-1-quic_schintav@quicinc.com/
On Thu, Jul 11, 2024 at 09:49:52PM +0530, Manivannan Sadhasivam wrote:
> On Thu, Jul 11, 2024 at 05:01:15PM +0200, Johan Hovold wrote:
> > [ +CC: Mani ]
> >
> > On Thu, Jul 11, 2024 at 11:58:08AM +0200, Johan Hovold wrote:
> > > On Thu, Jul 11, 2024 at 11:54:15AM +0200, Konrad Dybcio wrote:
> > > > On 11.07.2024 11:02 AM, Johan Hovold wrote:
> > > > > The DWC PCIe controller can be used with its internal MSI controller or
> > > > > with an external one such as the GICv3 Interrupt Translation Service
> > > > > (ITS).
> > > > >
> > > > > Add the msi-map properties needed to use the GIC ITS. This will also
> > > > > make Linux switch to the ITS implementation, which allows for assigning
> > > > > affinity to individual MSIs.
> >
> > > > X1E CRD throws tons of correctable errors with this on PCIe6a:
> >
> > > What branch are you using? Abel reported seeing this with his branch
> > > which has a few work-in-progress patches that try to enable 4-lane PCIe.
> > >
> > > There are no errors with my wip branch based on rc7, and I have the same
> > > drive as Abel.
> >
> > For some reason I don't get these errors on my machine, but this has now
> > been confirmed by two other people running my rc branch (including Abel)
> > so something is broken here, for example, with the PHY settings.
> >
>
> I saw AER errors on Abel's machine during probe with 4-lane PHY settings. And
> that might be the indication why the link width got downgraded to x2. This is
> still not yet resolved.
>
> > I saw five correctable errors once, when running linux-next, but it took
> > several minutes and they were still minutes apart.
> >
> > > Also note that the errors happen also without this patch applied, they
> > > are just being reported now.
> >
> > I guess we need to track down what is causing these errors before
> > enabling ITS (and thereby the error reporting).
> >
> > At least L0s is not involved here, as it was with sc8280xp, as the
> > NVMe controllers in question do not support it.
> >
> > Perhaps something is off because we're running the link at half width?
> >
>
> My hunch is the PHY settings. But Abel cross checked the PHY settings with
> internal documentation and they seem to match. Also, Qcom submitted a series
> that is supposed to fix stability issues with Gen4 [1]. With this series, Gen 4
> x4 setup is working on SA8775P-RIDE board as reported by Qcom. But Abel
> confirmed that it didn't help him with the link downgrade issue.
>
> Perhaps you can give it a try and see if it makes any difference for this issue?
>
> Meantime, I'm checking with Qcom contacts on this.
>

One thing I confirmed is, we definitely need different PHY sequence for using
2L. The current PHY settings are for 4L, so limiting the lane count from the
controller is going to be problematic. And AER errors might be due to that as
well.

We need to investigate on enabling 4L.

- Mani
On Thu, Jul 11, 2024 at 10:11:53PM +0530, Manivannan Sadhasivam wrote:
> On Thu, Jul 11, 2024 at 09:49:52PM +0530, Manivannan Sadhasivam wrote:
> > On Thu, Jul 11, 2024 at 05:01:15PM +0200, Johan Hovold wrote:

> > > > Also note that the errors happen also without this patch applied, they
> > > > are just being reported now.

> > > Perhaps something is off because we're running the link at half width?
> >
> > My hunch is the PHY settings. But Abel cross checked the PHY settings with
> > internal documentation and they seem to match. Also, Qcom submitted a series
> > that is supposed to fix stability issues with Gen4 [1]. With this series, Gen 4
> > x4 setup is working on SA8775P-RIDE board as reported by Qcom. But Abel
> > confirmed that it didn't help him with the link downgrade issue.
> >
> > Perhaps you can give it a try and see if it makes any difference for
> > this issue?

If there are known issues with running at Gen4 speed without that
series, then it seems quite likely that doing so anyway could also cause
correctable errors.

Unfortunately, I got a hypervisor reset when I tried booting with that
series so there appears to be some implicit dependency on something
else (e.g. the 4l stuff).

> One thing I confirmed is, we definitely need different PHY sequence for using
> 2L. The current PHY settings are for 4L, so limiting the lane count from the
> controller is going to be problematic. And AER errors might be due to that as
> well.

Another good point. But we currently use the
"qcom,x1e80100-qmp-gen4x2-pcie-phy" settings. Shouldn't those be for x2,
and then Abel has another series that adds the x4 settings? Or are you
saying that the currently merged "gen4x2" settings are really for 4l?

Johan
On Thu, Jul 11, 2024 at 06:59:22PM +0200, Johan Hovold wrote:
> On Thu, Jul 11, 2024 at 10:11:53PM +0530, Manivannan Sadhasivam wrote:
> > On Thu, Jul 11, 2024 at 09:49:52PM +0530, Manivannan Sadhasivam wrote:

> > > My hunch is the PHY settings. But Abel cross checked the PHY settings with
> > > internal documentation and they seem to match. Also, Qcom submitted a series
> > > that is supposed to fix stability issues with Gen4 [1]. With this series, Gen 4
> > > x4 setup is working on SA8775P-RIDE board as reported by Qcom. But Abel
> > > confirmed that it didn't help him with the link downgrade issue.
> > >
> > > Perhaps you can give it a try and see if it makes any difference for
> > > this issue?
>
> If there are known issues with running at Gen4 speed without that
> series, then it seems quite likely that doing so anyway could also cause
> correctable errors.
>
> Unfortunately, I got a hypervisor reset when I tried booting with that
> series so there appears to be some implicit dependency on something
> else (e.g. the 4l stuff).

The first patch in that series breaks icc handling, which crashes
machines like the X13s and the x1e80100 CRD on boot. I've just reported
this here:

	https://lore.kernel.org/lkml/ZpDlf5xD035x2DqL@hovoldconsulting.com/

With that fixed, and with the hacky dependency on having max-link-speed
specified in the DT for the series to have any effect at all, the gen4
stability series indeed seems to make the AER error go away (Abel just
confirmed using a branch I'd prepared).

Let's try to get that series in shape and merged in some form as
everyone will be hitting these Correctable Errors currently with the
NVMe on x1e80100.

Johan
On Fri, Jul 12, 2024 at 10:20:24AM +0200, Johan Hovold wrote:
> On Thu, Jul 11, 2024 at 06:59:22PM +0200, Johan Hovold wrote:
> > On Thu, Jul 11, 2024 at 10:11:53PM +0530, Manivannan Sadhasivam wrote:
> > > On Thu, Jul 11, 2024 at 09:49:52PM +0530, Manivannan Sadhasivam wrote:
>
> > > > My hunch is the PHY settings. But Abel cross checked the PHY settings with
> > > > internal documentation and they seem to match. Also, Qcom submitted a series
> > > > that is supposed to fix stability issues with Gen4 [1]. With this series, Gen 4
> > > > x4 setup is working on SA8775P-RIDE board as reported by Qcom. But Abel
> > > > confirmed that it didn't help him with the link downgrade issue.
> > > >
> > > > Perhaps you can give it a try and see if it makes any difference for
> > > > this issue?
> >
> > If there are known issues with running at Gen4 speed without that
> > series, then it seems quite likely that doing so anyway could also cause
> > correctable errors.
> >
> > Unfortunately, I got a hypervisor reset when I tried booting with that
> > series so there appears to be some implicit dependency on something
> > else (e.g. the 4l stuff).
>
> The first patch in that series breaks icc handling, which crashes
> machines like the X13s and the x1e80100 CRD on boot. I've just reported
> this here:
>
> 	https://lore.kernel.org/lkml/ZpDlf5xD035x2DqL@hovoldconsulting.com/
>

Ah, what a blinder... Thanks for reporting. But I'm wondering why Abel was not
seeing this crash when he tested this series for 4L.

> With that fixed, and with the hacky dependency on having max-link-speed
> specified in the DT for the series to have any effect at all, the gen4
> stability series indeed seems to make the AER error go away (Abel just
> confirmed using a branch I'd prepared).
>

Cool, good to know.

> Let's try to get that series in shape and merged in some form as
> everyone will be hitting these Correctable Errors currently with the
> NVMe on x1e80100.
>

Sure. This series anyway needs respin due to the dependency with the OPP series
that just got merged. But merging it for 6.11 is quite unlikely.

- Mani
diff --git a/arch/arm64/boot/dts/qcom/x1e80100.dtsi b/arch/arm64/boot/dts/qcom/x1e80100.dtsi
index 32a73ff672be..5822ed97ad87 100644
--- a/arch/arm64/boot/dts/qcom/x1e80100.dtsi
+++ b/arch/arm64/boot/dts/qcom/x1e80100.dtsi
@@ -3114,6 +3114,8 @@ pcie6a: pci@1bf8000 {
 			linux,pci-domain = <7>;
 			num-lanes = <2>;
 
+			msi-map = <0x0 &gic_its 0xe0000 0x10000>;
+
 			interrupts = <GIC_SPI 773 IRQ_TYPE_LEVEL_HIGH>,
 				     <GIC_SPI 774 IRQ_TYPE_LEVEL_HIGH>,
 				     <GIC_SPI 837 IRQ_TYPE_LEVEL_HIGH>,
@@ -3235,6 +3237,8 @@ pcie4: pci@1c08000 {
 			linux,pci-domain = <5>;
 			num-lanes = <2>;
 
+			msi-map = <0x0 &gic_its 0xc0000 0x10000>;
+
 			interrupts = <GIC_SPI 141 IRQ_TYPE_LEVEL_HIGH>,
 				     <GIC_SPI 142 IRQ_TYPE_LEVEL_HIGH>,
 				     <GIC_SPI 143 IRQ_TYPE_LEVEL_HIGH>,
@@ -5394,8 +5398,6 @@ gic_its: msi-controller@17040000 {
 
 				msi-controller;
 				#msi-cells = <1>;
-
-				status = "disabled";
 			};
 		};
 
The DWC PCIe controller can be used with its internal MSI controller or
with an external one such as the GICv3 Interrupt Translation Service
(ITS).

Add the msi-map properties needed to use the GIC ITS. This will also
make Linux switch to the ITS implementation, which allows for assigning
affinity to individual MSIs.

Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
---
 arch/arm64/boot/dts/qcom/x1e80100.dtsi | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
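
As a rough illustration of what the msi-map entries in this patch express: in the generic PCI MSI devicetree binding, each entry has the form `<rid-base &msi-controller msi-base length>`, and a PCI requester ID that falls inside `[rid-base, rid-base + length)` is presented to the MSI controller as `msi-base + (rid - rid-base)`. The Python sketch below models only that arithmetic, assuming no `msi-map-mask` is in play (the patch does not set one); `pci_rid` and `msi_map_lookup` are hypothetical helper names, not kernel functions.

```python
def pci_rid(bus, dev, fn):
    """Build a 16-bit PCI requester ID from bus, device and function numbers."""
    return (bus << 8) | (dev << 3) | fn


def msi_map_lookup(rid, entries):
    """Translate a requester ID using msi-map style (rid_base, msi_base, length) entries.

    Returns the ID seen by the MSI controller, or None if no entry covers rid.
    The phandle cell is left out because only the arithmetic is modelled here.
    """
    for rid_base, msi_base, length in entries:
        if rid_base <= rid < rid_base + length:
            return msi_base + (rid - rid_base)
    return None


# Values copied from the two msi-map properties added by the patch.
PCIE6A_MSI_MAP = [(0x0, 0xE0000, 0x10000)]
PCIE4_MSI_MAP = [(0x0, 0xC0000, 0x10000)]

if __name__ == "__main__":
    # Endpoint at bus 1, device 0, function 0 behind the pcie6a root port.
    rid = pci_rid(1, 0, 0)
    print(hex(rid), "->", hex(msi_map_lookup(rid, PCIE6A_MSI_MAP)))
```

Because both entries start at requester ID 0 and span 0x10000 IDs, every bus/device/function combination behind each controller is covered, and the two controllers land in disjoint ID ranges at the ITS.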