arm64: dts: qcom: x1e80100: enable GICv3 ITS for PCIe

Message ID 20240711090250.20827-1-johan+linaro@kernel.org (mailing list archive)
State New

Commit Message

Johan Hovold July 11, 2024, 9:02 a.m. UTC
The DWC PCIe controller can be used with its internal MSI controller or
with an external one such as the GICv3 Interrupt Translation Service
(ITS).

Add the msi-map properties needed to use the GIC ITS. This will also
make Linux switch to the ITS implementation, which allows for assigning
affinity to individual MSIs.

Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
---
 arch/arm64/boot/dts/qcom/x1e80100.dtsi | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Comments

Konrad Dybcio July 11, 2024, 9:54 a.m. UTC | #1
On 11.07.2024 11:02 AM, Johan Hovold wrote:
> The DWC PCIe controller can be used with its internal MSI controller or
> with an external one such as the GICv3 Interrupt Translation Service
> (ITS).
> 
> Add the msi-map properties needed to use the GIC ITS. This will also
> make Linux switch to the ITS implementation, which allows for assigning
> affinity to individual MSIs.
> 
> Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
> ---

X1E CRD throws tons of correctable errors with this on PCIe6a:

[    9.358915] pcieport 0007:00:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
[    9.358916] pcieport 0007:00:00.0:   device [17cb:0111] error status/mask=00000001/0000e000
[    9.358917] pcieport 0007:00:00.0:    [ 0] RxErr                 
[    9.358921] pcieport 0007:00:00.0: AER: Multiple Correctable error message received from 0007:00:00.0
[    9.358952] pcieport 0007:00:00.0: AER: found no error details for 0007:00:00.0
[    9.358953] pcieport 0007:00:00.0: AER: Multiple Correctable error message received from 0007:00:00.0
[    9.359003] pcieport 0007:00:00.0: AER: found no error details for 0007:00:00.0
[    9.359004] pcieport 0007:00:00.0: AER: Multiple Correctable error message received from 0007:01:00.0
[    9.359008] pcieport 0007:00:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Transmitter ID)
[    9.359009] pcieport 0007:00:00.0:   device [17cb:0111] error status/mask=00001001/0000e000
[    9.359010] pcieport 0007:00:00.0:    [ 0] RxErr                 
[    9.359011] pcieport 0007:00:00.0:    [12] Timeout  

Konrad
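
The status/mask words in the log above can be decoded bit by bit. The following is a minimal Python sketch, not the kernel's implementation; the bit table mirrors the short labels the log itself prints (RxErr, Timeout) together with the other standard AER correctable-error bits, and the helper name decode_correctable is made up for illustration.

```python
# Bit positions of the PCIe AER Correctable Error Status register,
# labelled with the short names Linux prints in its AER log lines.
CORRECTABLE_BITS = {
    0: "RxErr",         # Receiver Error
    6: "BadTLP",        # Bad TLP
    7: "BadDLLP",       # Bad DLLP
    8: "Rollover",      # REPLAY_NUM Rollover
    12: "Timeout",      # Replay Timer Timeout
    13: "NonFatalErr",  # Advisory Non-Fatal Error
    14: "CorrIntErr",   # Corrected Internal Error
    15: "HeaderOF",     # Header Log Overflow
}


def decode_correctable(status_mask):
    """Decode a 'status/mask' pair such as '00001001/0000e000'.

    Returns two lists of bit names: the errors that are set in the
    status word and the errors that are masked from reporting.
    """
    status_hex, mask_hex = status_mask.split("/")
    status, mask = int(status_hex, 16), int(mask_hex, 16)

    def names(value):
        # Unknown bits are reported as 'bitN' rather than dropped.
        return [
            CORRECTABLE_BITS.get(bit, f"bit{bit}")
            for bit in range(32)
            if value & (1 << bit)
        ]

    return names(status), names(mask)


reported, masked = decode_correctable("00001001/0000e000")
print(reported)  # ['RxErr', 'Timeout']
print(masked)    # ['NonFatalErr', 'CorrIntErr', 'HeaderOF']
```

For the second report in the log, this gives a receiver error plus a replay timer timeout, with only the advisory, internal and header-overflow bits masked.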
Johan Hovold July 11, 2024, 9:58 a.m. UTC | #2
On Thu, Jul 11, 2024 at 11:54:15AM +0200, Konrad Dybcio wrote:
> On 11.07.2024 11:02 AM, Johan Hovold wrote:
> > The DWC PCIe controller can be used with its internal MSI controller or
> > with an external one such as the GICv3 Interrupt Translation Service
> > (ITS).
> > 
> > Add the msi-map properties needed to use the GIC ITS. This will also
> > make Linux switch to the ITS implementation, which allows for assigning
> > affinity to individual MSIs.
> > 
> > Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
> > ---
> 
> X1E CRD throws tons of correctable errors with this on PCIe6a:
> 
> [    9.358915] pcieport 0007:00:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
> [    9.358916] pcieport 0007:00:00.0:   device [17cb:0111] error status/mask=00000001/0000e000
> [    9.358917] pcieport 0007:00:00.0:    [ 0] RxErr                 
> [    9.358921] pcieport 0007:00:00.0: AER: Multiple Correctable error message received from 0007:00:00.0
> [    9.358952] pcieport 0007:00:00.0: AER: found no error details for 0007:00:00.0
> [    9.358953] pcieport 0007:00:00.0: AER: Multiple Correctable error message received from 0007:00:00.0
> [    9.359003] pcieport 0007:00:00.0: AER: found no error details for 0007:00:00.0
> [    9.359004] pcieport 0007:00:00.0: AER: Multiple Correctable error message received from 0007:01:00.0
> [    9.359008] pcieport 0007:00:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Transmitter ID)
> [    9.359009] pcieport 0007:00:00.0:   device [17cb:0111] error status/mask=00001001/0000e000
> [    9.359010] pcieport 0007:00:00.0:    [ 0] RxErr                 
> [    9.359011] pcieport 0007:00:00.0:    [12] Timeout  

What branch are you using? Abel reported seeing this with his branch
which has a few work-in-progress patches that try to enable 4-lane PCIe.

There are no errors with my wip branch based on rc7, and I have the same
drive as Abel.

Also note that the errors happen also without this patch applied, they
are just being reported now.

Johan
Konrad Dybcio July 11, 2024, 10 a.m. UTC | #3
On 11.07.2024 11:58 AM, Johan Hovold wrote:
> On Thu, Jul 11, 2024 at 11:54:15AM +0200, Konrad Dybcio wrote:
>> On 11.07.2024 11:02 AM, Johan Hovold wrote:
>>> The DWC PCIe controller can be used with its internal MSI controller or
>>> with an external one such as the GICv3 Interrupt Translation Service
>>> (ITS).
>>>
>>> Add the msi-map properties needed to use the GIC ITS. This will also
>>> make Linux switch to the ITS implementation, which allows for assigning
>>> affinity to individual MSIs.
>>>
>>> Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
>>> ---
>>
>> X1E CRD throws tons of correctable errors with this on PCIe6a:
>>
>> [    9.358915] pcieport 0007:00:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
>> [    9.358916] pcieport 0007:00:00.0:   device [17cb:0111] error status/mask=00000001/0000e000
>> [    9.358917] pcieport 0007:00:00.0:    [ 0] RxErr                 
>> [    9.358921] pcieport 0007:00:00.0: AER: Multiple Correctable error message received from 0007:00:00.0
>> [    9.358952] pcieport 0007:00:00.0: AER: found no error details for 0007:00:00.0
>> [    9.358953] pcieport 0007:00:00.0: AER: Multiple Correctable error message received from 0007:00:00.0
>> [    9.359003] pcieport 0007:00:00.0: AER: found no error details for 0007:00:00.0
>> [    9.359004] pcieport 0007:00:00.0: AER: Multiple Correctable error message received from 0007:01:00.0
>> [    9.359008] pcieport 0007:00:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Transmitter ID)
>> [    9.359009] pcieport 0007:00:00.0:   device [17cb:0111] error status/mask=00001001/0000e000
>> [    9.359010] pcieport 0007:00:00.0:    [ 0] RxErr                 
>> [    9.359011] pcieport 0007:00:00.0:    [12] Timeout  
> 
> What branch are you using? Abel reported seeing this with his branch
> which has a few work-in-progress patches that try to enable 4-lane PCIe.
> 
> There are no errors with my wip branch based on rc7, and I have the same
> drive as Abel.

linux-next/master

> 
> Also note that the errors happen also without this patch applied, they
> are just being reported now.

Ouch.. wonder how much that drives the perf down

Konrad
Johan Hovold July 11, 2024, 10:04 a.m. UTC | #4
On Thu, Jul 11, 2024 at 12:00:50PM +0200, Konrad Dybcio wrote:
> On 11.07.2024 11:58 AM, Johan Hovold wrote:

> > What branch are you using? Abel reported seeing this with his branch
> > which has a few work-in-progress patches that try to enable 4-lane PCIe.
> > 
> > There are no errors with my wip branch based on rc7, and I have the same
> > drive as Abel.
> 
> linux-next/master

Hmm. Ok. We may need to disable L0s as I did for sc8280xp as well, but
that was not the cause for Abel's errors.

> > Also note that the errors happen also without this patch applied, they
> > are just being reported now.
> 
> Ouch.. wonder how much that drives the perf down

Could you post the output of lspci -vv for the NVMe controller?

Johan
Johan Hovold July 11, 2024, 3:01 p.m. UTC | #5
[ +CC: Mani ]

On Thu, Jul 11, 2024 at 11:58:08AM +0200, Johan Hovold wrote:
> On Thu, Jul 11, 2024 at 11:54:15AM +0200, Konrad Dybcio wrote:
> > On 11.07.2024 11:02 AM, Johan Hovold wrote:
> > > The DWC PCIe controller can be used with its internal MSI controller or
> > > with an external one such as the GICv3 Interrupt Translation Service
> > > (ITS).
> > > 
> > > Add the msi-map properties needed to use the GIC ITS. This will also
> > > make Linux switch to the ITS implementation, which allows for assigning
> > > affinity to individual MSIs.

> > X1E CRD throws tons of correctable errors with this on PCIe6a:

> What branch are you using? Abel reported seeing this with his branch
> which has a few work-in-progress patches that try to enable 4-lane PCIe.
> 
> There are no errors with my wip branch based on rc7, and I have the same
> drive as Abel.

For some reason I don't get these errors on my machine, but this has now
been confirmed by two other people running my rc branch (including Abel)
so something is broken here, for example, with the PHY settings.

I saw five correctable errors once, when running linux-next, but it took
several minutes and they were still minutes apart.

> Also note that the errors happen also without this patch applied, they
> are just being reported now.

I guess we need to track down what is causing these errors before
enabling ITS (and thereby the error reporting). 

At least L0s is not involved here, as it was with sc8280xp, as the
NVMe controllers in question do not support it.

Perhaps something is off because we're running the link at half width?

Johan
Manivannan Sadhasivam July 11, 2024, 4:19 p.m. UTC | #6
On Thu, Jul 11, 2024 at 05:01:15PM +0200, Johan Hovold wrote:
> [ +CC: Mani ]
> 
> On Thu, Jul 11, 2024 at 11:58:08AM +0200, Johan Hovold wrote:
> > On Thu, Jul 11, 2024 at 11:54:15AM +0200, Konrad Dybcio wrote:
> > > On 11.07.2024 11:02 AM, Johan Hovold wrote:
> > > > The DWC PCIe controller can be used with its internal MSI controller or
> > > > with an external one such as the GICv3 Interrupt Translation Service
> > > > (ITS).
> > > > 
> > > > Add the msi-map properties needed to use the GIC ITS. This will also
> > > > make Linux switch to the ITS implementation, which allows for assigning
> > > > affinity to individual MSIs.
> 
> > > X1E CRD throws tons of correctable errors with this on PCIe6a:
> 
> > What branch are you using? Abel reported seeing this with his branch
> > which has a few work-in-progress patches that try to enable 4-lane PCIe.
> > 
> > There are no errors with my wip branch based on rc7, and I have the same
> > drive as Abel.
> 
> For some reason I don't get these errors on my machine, but this has now
> been confirmed by two other people running my rc branch (including Abel)
> so something is broken here, for example, with the PHY settings.
> 

I saw AER errors on Abel's machine during probe with 4-lane PHY settings. And
that might be the indication why the link width got downgraded to x2. This is
still not yet resolved.

> I saw five correctable errors once, when running linux-next, but it took
> several minutes and they were still minutes apart.
> 
> > Also note that the errors happen also without this patch applied, they
> > are just being reported now.
> 
> I guess we need to track down what is causing these errors before
> enabling ITS (and thereby the error reporting). 
> 
> At least L0s is not involved here, as it was with sc8280xp, as the
> NVMe controllers in question do not support it.
> 
> Perhaps something is off because we're running the link at half width?
> 

My hunch is the PHY settings. But Abel cross checked the PHY settings with
internal documentation and they seem to match. Also, Qcom submitted a series
that is supposed to fix stability issues with Gen4 [1]. With this series, Gen 4
x4 setup is working on SA8775P-RIDE board as reported by Qcom. But Abel
confirmed that it didn't help him with the link downgrade issue.

Perhaps you can give it a try and see if it makes any difference for this issue?

Meantime, I'm checking with Qcom contacts on this.

- Mani

[1] https://lore.kernel.org/linux-pci/20240320071527.13443-1-quic_schintav@quicinc.com/
Manivannan Sadhasivam July 11, 2024, 4:41 p.m. UTC | #7
On Thu, Jul 11, 2024 at 09:49:52PM +0530, Manivannan Sadhasivam wrote:
> On Thu, Jul 11, 2024 at 05:01:15PM +0200, Johan Hovold wrote:
> > [ +CC: Mani ]
> > 
> > On Thu, Jul 11, 2024 at 11:58:08AM +0200, Johan Hovold wrote:
> > > On Thu, Jul 11, 2024 at 11:54:15AM +0200, Konrad Dybcio wrote:
> > > > On 11.07.2024 11:02 AM, Johan Hovold wrote:
> > > > > The DWC PCIe controller can be used with its internal MSI controller or
> > > > > with an external one such as the GICv3 Interrupt Translation Service
> > > > > (ITS).
> > > > > 
> > > > > Add the msi-map properties needed to use the GIC ITS. This will also
> > > > > make Linux switch to the ITS implementation, which allows for assigning
> > > > > affinity to individual MSIs.
> > 
> > > > X1E CRD throws tons of correctable errors with this on PCIe6a:
> > 
> > > What branch are you using? Abel reported seeing this with his branch
> > > which has a few work-in-progress patches that try to enable 4-lane PCIe.
> > > 
> > > There are no errors with my wip branch based on rc7, and I have the same
> > > drive as Abel.
> > 
> > For some reason I don't get these errors on my machine, but this has now
> > been confirmed by two other people running my rc branch (including Abel)
> > so something is broken here, for example, with the PHY settings.
> > 
> 
> I saw AER errors on Abel's machine during probe with 4-lane PHY settings. And
> that might be the indication why the link width got downgraded to x2. This is
> still not yet resolved.
> 
> > I saw five correctable errors once, when running linux-next, but it took
> > several minutes and they were still minutes apart.
> > 
> > > Also note that the errors happen also without this patch applied, they
> > > are just being reported now.
> > 
> > I guess we need to track down what is causing these errors before
> > enabling ITS (and thereby the error reporting). 
> > 
> > At least L0s is not involved here, as it was with sc8280xp, as the
> > NVMe controllers in question do not support it.
> > 
> > Perhaps something is off because we're running the link at half width?
> > 
> 
> My hunch is the PHY settings. But Abel cross checked the PHY settings with
> internal documentation and they seem to match. Also, Qcom submitted a series
> that is supposed to fix stability issues with Gen4 [1]. With this series, Gen 4
> x4 setup is working on SA8775P-RIDE board as reported by Qcom. But Abel
> confirmed that it didn't help him with the link downgrade issue.
> 
> Perhaps you can give it a try and see if it makes any difference for this issue?
> 
> Meantime, I'm checking with Qcom contacts on this.
> 

One thing I confirmed is that we definitely need a different PHY sequence for
using 2L. The current PHY settings are for 4L, so limiting the lane count from
the controller is going to be problematic. And AER errors might be due to that
as well.

We need to investigate enabling 4L.

- Mani
Johan Hovold July 11, 2024, 4:59 p.m. UTC | #8
On Thu, Jul 11, 2024 at 10:11:53PM +0530, Manivannan Sadhasivam wrote:
> On Thu, Jul 11, 2024 at 09:49:52PM +0530, Manivannan Sadhasivam wrote:
> > On Thu, Jul 11, 2024 at 05:01:15PM +0200, Johan Hovold wrote:

> > > > Also note that the errors happen also without this patch applied, they
> > > > are just being reported now.

> > > Perhaps something is off because we're running the link at half width?
> > 
> > My hunch is the PHY settings. But Abel cross checked the PHY settings with
> > internal documentation and they seem to match. Also, Qcom submitted a series
> > that is supposed to fix stability issues with Gen4 [1]. With this series, Gen 4
> > x4 setup is working on SA8775P-RIDE board as reported by Qcom. But Abel
> > confirmed that it didn't help him with the link downgrade issue.
> > 
> > Perhaps you can give it a try and see if it makes any difference for
> > this issue?

If there are known issues with running at Gen4 speed without that
series, then it seems quite likely that doing so anyway could also cause
correctable errors.

Unfortunately, I got a hypervisor reset when I tried booting with that
series so there appears to be some implicit dependency on something
else (e.g. the 4l stuff).

> One thing I confirmed is that we definitely need a different PHY sequence for
> using 2L. The current PHY settings are for 4L, so limiting the lane count from
> the controller is going to be problematic. And AER errors might be due to that
> as well.

Another good point. But we currently use the
"qcom,x1e80100-qmp-gen4x2-pcie-phy" settings. Shouldn't those be for x2,
and then Abel has another series that adds the x4 settings? Or are you
saying that the currently merged "gen4x2" settings are really for 4l?

Johan
Johan Hovold July 12, 2024, 8:20 a.m. UTC | #9
On Thu, Jul 11, 2024 at 06:59:22PM +0200, Johan Hovold wrote:
> On Thu, Jul 11, 2024 at 10:11:53PM +0530, Manivannan Sadhasivam wrote:
> > On Thu, Jul 11, 2024 at 09:49:52PM +0530, Manivannan Sadhasivam wrote:

> > > My hunch is the PHY settings. But Abel cross checked the PHY settings with
> > > internal documentation and they seem to match. Also, Qcom submitted a series
> > > that is supposed to fix stability issues with Gen4 [1]. With this series, Gen 4
> > > x4 setup is working on SA8775P-RIDE board as reported by Qcom. But Abel
> > > confirmed that it didn't help him with the link downgrade issue.
> > > 
> > > Perhaps you can give it a try and see if it makes any difference for
> > > this issue?
> 
> If there are known issues with running at Gen4 speed without that
> series, then it seems quite likely that doing so anyway could also cause
> correctable errors.
> 
> Unfortunately, I got a hypervisor reset when I tried booting with that
> series so there appears to be some implicit dependency on something
> else (e.g. the 4l stuff).

The first patch in that series breaks icc handling, which crashes
machines like the X13s and the x1e80100 CRD on boot. I've just reported
this here:

	https://lore.kernel.org/lkml/ZpDlf5xD035x2DqL@hovoldconsulting.com/

With that fixed, and with the hacky dependency on having max-link-speed
specified in the DT for the series to have any effect at all, the gen4
stability series indeed seems to make the AER error go away (Abel just
confirmed using a branch I'd prepared).

Let's try to get that series in shape and merged in some form as
everyone will be hitting these Correctable Errors currently with the
NVMe on x1e80100.

Johan
Manivannan Sadhasivam July 12, 2024, 1:31 p.m. UTC | #10
On Fri, Jul 12, 2024 at 10:20:24AM +0200, Johan Hovold wrote:
> On Thu, Jul 11, 2024 at 06:59:22PM +0200, Johan Hovold wrote:
> > On Thu, Jul 11, 2024 at 10:11:53PM +0530, Manivannan Sadhasivam wrote:
> > > On Thu, Jul 11, 2024 at 09:49:52PM +0530, Manivannan Sadhasivam wrote:
> 
> > > > My hunch is the PHY settings. But Abel cross checked the PHY settings with
> > > > internal documentation and they seem to match. Also, Qcom submitted a series
> > > > that is supposed to fix stability issues with Gen4 [1]. With this series, Gen 4
> > > > x4 setup is working on SA8775P-RIDE board as reported by Qcom. But Abel
> > > > confirmed that it didn't help him with the link downgrade issue.
> > > > 
> > > > Perhaps you can give it a try and see if it makes any difference for
> > > > this issue?
> > 
> > If there are known issues with running at Gen4 speed without that
> > series, then it seems quite likely that doing so anyway could also cause
> > correctable errors.
> > 
> > Unfortunately, I got a hypervisor reset when I tried booting with that
> > series so there appears to be some implicit dependency on something
> > else (e.g. the 4l stuff).
> 
> The first patch in that series breaks icc handling, which crashes
> machines like the X13s and the x1e80100 CRD on boot. I've just reported
> this here:
> 
> 	https://lore.kernel.org/lkml/ZpDlf5xD035x2DqL@hovoldconsulting.com/
> 

Ah, what a blinder... Thanks for reporting.

But I'm wondering why Abel was not seeing this crash when he tested this series
for 4L.

> With that fixed, and with the hacky dependency on having max-link-speed
> specified in the DT for the series to have any effect at all, the gen4
> stability series indeed seems to make the AER error go away (Abel just
> confirmed using a branch I'd prepared).
> 

Cool, good to know.

> Let's try to get that series in shape and merged in some form as
> everyone will be hitting these Correctable Errors currently with the
> NVMe on x1e80100.
> 

Sure. This series anyway needs respin due to the dependency with the OPP series
that just got merged. But merging it for 6.11 is quite unlikely.

- Mani

Patch

diff --git a/arch/arm64/boot/dts/qcom/x1e80100.dtsi b/arch/arm64/boot/dts/qcom/x1e80100.dtsi
index 32a73ff672be..5822ed97ad87 100644
--- a/arch/arm64/boot/dts/qcom/x1e80100.dtsi
+++ b/arch/arm64/boot/dts/qcom/x1e80100.dtsi
@@ -3114,6 +3114,8 @@  pcie6a: pci@1bf8000 {
 			linux,pci-domain = <7>;
 			num-lanes = <2>;
 
+			msi-map = <0x0 &gic_its 0xe0000 0x10000>;
+
 			interrupts = <GIC_SPI 773 IRQ_TYPE_LEVEL_HIGH>,
 				     <GIC_SPI 774 IRQ_TYPE_LEVEL_HIGH>,
 				     <GIC_SPI 837 IRQ_TYPE_LEVEL_HIGH>,
@@ -3235,6 +3237,8 @@  pcie4: pci@1c08000 {
 			linux,pci-domain = <5>;
 			num-lanes = <2>;
 
+			msi-map = <0x0 &gic_its 0xc0000 0x10000>;
+
 			interrupts = <GIC_SPI 141 IRQ_TYPE_LEVEL_HIGH>,
 				     <GIC_SPI 142 IRQ_TYPE_LEVEL_HIGH>,
 				     <GIC_SPI 143 IRQ_TYPE_LEVEL_HIGH>,
@@ -5394,8 +5398,6 @@  gic_its: msi-controller@17040000 {
 
 				msi-controller;
 				#msi-cells = <1>;
-
-				status = "disabled";
 			};
 		};
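
For readers unfamiliar with msi-map, each entry in the patch above has the form <rid-base &msi-controller msi-base length>: a PCI requester ID inside [rid-base, rid-base + length) is translated to the device ID (rid - rid-base + msi-base) that the ITS sees. The following is a minimal Python sketch of that arithmetic under the generic devicetree binding, not the kernel's code; the helper names requester_id and msi_map_device_id are made up for illustration.

```python
def requester_id(bus, dev, fn):
    """Build the 16-bit PCI requester ID from bus/device/function."""
    return (bus << 8) | (dev << 3) | fn


def msi_map_device_id(rid, rid_base, msi_base, length):
    """Translate a requester ID through one msi-map entry.

    Returns the device ID presented to the MSI controller, or None
    when the requester ID falls outside the range the entry covers.
    """
    if not rid_base <= rid < rid_base + length:
        return None
    return rid - rid_base + msi_base


# pcie6a entry from the patch: msi-map = <0x0 &gic_its 0xe0000 0x10000>
rid = requester_id(1, 0, 0)  # the endpoint at 01:00.0 on that bus
print(hex(msi_map_device_id(rid, 0x0, 0xe0000, 0x10000)))  # 0xe0100

# pcie4 entry from the patch: msi-map = <0x0 &gic_its 0xc0000 0x10000>
print(hex(msi_map_device_id(rid, 0x0, 0xc0000, 0x10000)))  # 0xc0100
```

Giving each controller its own 0x10000-wide window (0xe0000 for pcie6a, 0xc0000 for pcie4) keeps the device IDs of the two PCI domains from colliding at the ITS.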