Message ID | 20181207105231.25593-3-ccaione@baylibre.com (mailing list archive) |
---|---|
State | New, archived |
Series | meson: Fix IRQ trigger type |
Hi Carlo,

tests[0] conducted on an Odroid-C1+ board equipped with a Meson8b SoC
have shown a high packet loss (90% and more) during a simple ping
test from a laptop to the board.
Testing the two patches separately clearly showed that this depends
on the removal of the "eee-broken-1000t" flag from the board PHY
description in the corresponding device tree.

About the first patch (MAC IRQ type), no test has shown evidence
that it is needed. I suggest you run some tests on real hardware,
as I did, to confirm or disprove my results.

Thanks for your work,

Emiliano

[0] http://lists.infradead.org/pipermail/linux-amlogic/2018-December/009397.html

On Fri, Dec 07, 2018 at 10:52:31AM +0000, Carlo Caione wrote:
> A long running stress test on a custom board shipping an AXG SoCs and a
> Realtek RTL8211F PHY revealed that after a few hours the connection
> speed would drop drastically, from ~1000Mbps to ~3Mbps. At the same time
> the 'macirq' (eth0) IRQ would stop being triggered at all and as
> consequence the GMAC IRQs never ACKed.
>
> After a painful investigation the problem seemed to be due to a wrong
> defined IRQ type for the GMAC IRQ that should be LEVEL_HIGH instead of
> EDGE_RISING.
>
> The change in the macirq IRQ type also solved another long standing
> issue affecting this SoC/PHY where EEE was causing the network
> connection to die after stressing it with iperf3 (even though much
> sooner). It's now possible to remove the 'eee-broken-1000t' quirk as
> well.
>
> Fixes: 9c15795a4f96 ("ARM: dts: meson8b-odroidc1: ethernet support")
> Signed-off-by: Carlo Caione <ccaione@baylibre.com>
> ---
>  arch/arm/boot/dts/meson.dtsi           | 2 +-
>  arch/arm/boot/dts/meson8b-odroidc1.dts | 1 -
>  2 files changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/arch/arm/boot/dts/meson.dtsi b/arch/arm/boot/dts/meson.dtsi
> index 0d9faf1a51ea..a86b89086334 100644
> --- a/arch/arm/boot/dts/meson.dtsi
> +++ b/arch/arm/boot/dts/meson.dtsi
> @@ -263,7 +263,7 @@
>  			compatible = "amlogic,meson6-dwmac", "snps,dwmac";
>  			reg = <0xc9410000 0x10000
>  			       0xc1108108 0x4>;
> -			interrupts = <GIC_SPI 8 IRQ_TYPE_EDGE_RISING>;
> +			interrupts = <GIC_SPI 8 IRQ_TYPE_LEVEL_HIGH>;
>  			interrupt-names = "macirq";
>  			status = "disabled";
>  		};
> diff --git a/arch/arm/boot/dts/meson8b-odroidc1.dts b/arch/arm/boot/dts/meson8b-odroidc1.dts
> index 58669abda259..a951a6632d0c 100644
> --- a/arch/arm/boot/dts/meson8b-odroidc1.dts
> +++ b/arch/arm/boot/dts/meson8b-odroidc1.dts
> @@ -221,7 +221,6 @@
>  		/* Realtek RTL8211F (0x001cc916) */
>  		eth_phy: ethernet-phy@0 {
>  			reg = <0>;
> -			eee-broken-1000t;
>  			interrupt-parent = <&gpio_intc>;
>  			/* GPIOH_3 */
>  			interrupts = <17 IRQ_TYPE_LEVEL_LOW>;
> --
> 2.19.1
>
Hi Carlo,

On Fri, Dec 7, 2018 at 11:52 AM Carlo Caione <ccaione@baylibre.com> wrote:
>
> A long running stress test on a custom board shipping an AXG SoCs and a
> Realtek RTL8211F PHY revealed that after a few hours the connection
> speed would drop drastically, from ~1000Mbps to ~3Mbps. At the same time
> the 'macirq' (eth0) IRQ would stop being triggered at all and as
> consequence the GMAC IRQs never ACKed.
>
> After a painful investigation the problem seemed to be due to a wrong
> defined IRQ type for the GMAC IRQ that should be LEVEL_HIGH instead of
> EDGE_RISING.
>
> The change in the macirq IRQ type also solved another long standing
> issue affecting this SoC/PHY where EEE was causing the network
> connection to die after stressing it with iperf3 (even though much
> sooner). It's now possible to remove the 'eee-broken-1000t' quirk as
> well.
I tested this on my Odroid-C1. However, I must admit that I never had
issues *without* eee-broken-1000t on any of my boards.

without your changes:
[root@alarm ~]# iperf3 -c 192.168.1.100
Connecting to host 192.168.1.100, port 5201
[  5] local 192.168.1.194 port 38870 connected to 192.168.1.100 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  80.6 MBytes   675 Mbits/sec    0   2.78 MBytes
[  5]   1.00-2.00   sec   108 MBytes   904 Mbits/sec    0   3.04 MBytes
[  5]   2.00-3.00   sec   106 MBytes   891 Mbits/sec    0   3.04 MBytes
[  5]   3.00-4.00   sec   105 MBytes   880 Mbits/sec    0   3.04 MBytes
[  5]   4.00-5.00   sec  65.0 MBytes   545 Mbits/sec    0   3.04 MBytes
[  5]   5.00-6.00   sec  92.5 MBytes   777 Mbits/sec    0   3.04 MBytes
[  5]   6.00-7.00   sec  72.5 MBytes   608 Mbits/sec    0   3.04 MBytes
[  5]   7.00-8.19   sec  76.2 MBytes   537 Mbits/sec    0   3.04 MBytes
[  5]   8.19-9.00   sec  48.8 MBytes   504 Mbits/sec    0   3.04 MBytes
[  5]   9.00-10.00  sec  87.5 MBytes   736 Mbits/sec    0   3.04 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   842 MBytes   706 Mbits/sec    0             sender
[  5]   0.00-10.05  sec   839 MBytes   701 Mbits/sec                  receiver

iperf Done.
[root@alarm ~]# iperf3 -c 192.168.1.100 -R
Connecting to host 192.168.1.100, port 5201
Reverse mode, remote host 192.168.1.100 is sending
[  5] local 192.168.1.194 port 38874 connected to 192.168.1.100 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  21.0 MBytes   175 Mbits/sec
[  5]   1.00-2.00   sec  20.7 MBytes   174 Mbits/sec
[  5]   2.00-3.00   sec  22.4 MBytes   187 Mbits/sec
[  5]   3.00-4.69   sec  25.2 MBytes   125 Mbits/sec
[  5]   4.69-5.00   sec  7.56 MBytes   206 Mbits/sec
[  5]   5.00-6.00   sec  23.4 MBytes   196 Mbits/sec
[  5]   6.00-7.00   sec  14.6 MBytes   123 Mbits/sec
[  5]   7.00-8.00   sec  23.3 MBytes   196 Mbits/sec
[  5]   8.00-9.00   sec  27.8 MBytes   233 Mbits/sec
[  5]   9.00-10.03  sec  24.9 MBytes   203 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-9.36   sec   212 MBytes   190 Mbits/sec  1588             sender
[  5]   0.00-10.03  sec   211 MBytes   176 Mbits/sec                  receiver

iperf Done.
[root@alarm ~]#

with your changes:
[root@alarm ~]# iperf3 -c 192.168.1.100
Connecting to host 192.168.1.100, port 5201
[  5] local 192.168.1.197 port 45020 connected to 192.168.1.100 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  74.4 MBytes   624 Mbits/sec    0   2.75 MBytes
[  5]   1.00-2.00   sec   105 MBytes   881 Mbits/sec    0   3.03 MBytes
[  5]   2.00-3.00   sec   106 MBytes   891 Mbits/sec    0   3.03 MBytes
[  5]   3.00-4.00   sec  78.8 MBytes   661 Mbits/sec    0   3.03 MBytes
[  5]   4.00-5.00   sec  73.8 MBytes   617 Mbits/sec    0   3.03 MBytes
[  5]   5.00-6.00   sec  87.5 MBytes   735 Mbits/sec    0   3.03 MBytes
[  5]   6.00-7.15   sec  81.2 MBytes   594 Mbits/sec    0   3.03 MBytes
[  5]   7.15-8.00   sec  61.2 MBytes   603 Mbits/sec    0   3.03 MBytes
[  5]   8.00-9.02   sec  76.2 MBytes   625 Mbits/sec    0   3.03 MBytes
[  5]   9.02-10.00  sec   102 MBytes   880 Mbits/sec    0   3.03 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   847 MBytes   710 Mbits/sec    0             sender
[  5]   0.00-10.05  sec   846 MBytes   706 Mbits/sec                  receiver

iperf Done.
[root@alarm ~]# iperf3 -c 192.168.1.100 -R
Connecting to host 192.168.1.100, port 5201
Reverse mode, remote host 192.168.1.100 is sending
[  5] local 192.168.1.197 port 45024 connected to 192.168.1.100 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  22.6 MBytes   190 Mbits/sec
[  5]   1.00-2.00   sec  19.3 MBytes   162 Mbits/sec
[  5]   2.00-3.00   sec  22.1 MBytes   185 Mbits/sec
[  5]   3.00-4.00   sec  29.6 MBytes   248 Mbits/sec
[  5]   4.00-5.00   sec  30.1 MBytes   253 Mbits/sec
[  5]   5.00-6.00   sec  16.7 MBytes   140 Mbits/sec
[  5]   6.00-7.00   sec  21.5 MBytes   180 Mbits/sec
[  5]   7.00-8.00   sec  14.0 MBytes   118 Mbits/sec
[  5]   8.00-9.04   sec  20.4 MBytes   165 Mbits/sec
[  5]   9.04-10.00  sec  19.6 MBytes   171 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.04  sec   217 MBytes   181 Mbits/sec  1795             sender
[  5]   0.00-10.00  sec   216 MBytes   181 Mbits/sec                  receiver

iperf Done.
[root@alarm ~]#

RX and TX speeds are within 10Mbit/s with and without your changes, so
I would call the result "identical" (within a bit of measurement
tolerance).

I'll wait a few days and see what Emiliano finds out on his board,
then I'll send my Tested-by and Acked-by.

Regards
Martin
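The "within 10Mbit/s" comparison can be checked by hand from the four pairs of summary lines in the iperf3 output above. A minimal sketch in Python follows. The dictionary keys and variable names are made up for illustration, and "forward" and "reverse" only mean the run without and with `-R`; nothing is assumed about which side is the board.

```python
# Summary bitrates in Mbit/s, copied from the iperf3 summary lines above.
# "forward" is the plain `iperf3 -c` run, "reverse" is the `-R` run.
WITHOUT_PATCHES = {
    "forward_sender": 706,
    "forward_receiver": 701,
    "reverse_sender": 190,
    "reverse_receiver": 176,
}
WITH_PATCHES = {
    "forward_sender": 710,
    "forward_receiver": 706,
    "reverse_sender": 181,
    "reverse_receiver": 181,
}


def deltas(before, after):
    """Per-measurement difference (after - before) in Mbit/s."""
    return {key: after[key] - before[key] for key in before}


DELTAS = deltas(WITHOUT_PATCHES, WITH_PATCHES)
# Largest absolute change across the four summary figures.
WORST = max(abs(delta) for delta in DELTAS.values())

if __name__ == "__main__":
    for key, delta in DELTAS.items():
        print(f"{key:17s} {delta:+d} Mbit/s")
    print(f"largest difference: {WORST} Mbit/s")
```

The largest difference is the reverse-mode sender figure (190 versus 181 Mbit/s), which stays below the 10Mbit/s bound quoted in the mail. These are single 10-second runs, so the check says nothing about the multi-hour behavior discussed elsewhere in the thread.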
On Fri, 2018-12-07 at 19:51 +0100, Emiliano Ingrassia wrote:
> Hi Carlo,

Hi Emiliano,

> tests[0] conducted on an Odroid-C1+ board equipped with a Meson8b SoC
> have shown a high packet loss (90% and more) during a simple ping
> test from a laptop to the board.
> Testing the two patches separately clearly showed that this depends
> on the removal of the "eee-broken-1000t" flag from the board PHY
> description in the corresponding device tree.
>
> About the first patch (MAC IRQ type), no test has shown evidence
> that it is needed. I suggest you run some tests on real hardware,
> as I did, to confirm or disprove my results.

Let's try to step back a bit and see what we can do to clarify this
situation.

First of all, for arm64 we are pretty sure that both patches are needed
because we ran extensive and lengthy tests, especially regarding the
change in the IRQ trigger type. For arm things are not so clear, so for
now we decided to merge the arm64 patch and just wait on the arm one.

Let's first focus on the patch changing the IRQ type.

The problem with the IRQ type is triggered on the arm64 boards we
tested using the script in [0]. If we run this stress test on the arm64
boards without the trigger-changing patch, after a few hours (anywhere
from 2h to 6h, sometimes more) we can see the connection dropping from
~1Gbps to <30Mbps. Jerome gave a nice explanation of why, and after
changing the IRQ trigger type we couldn't see the issue anymore. This
was confirmed not just by BayLibre but also by other independent
sources, so we are pretty confident in this solution.

So my first two questions for you are:

1) Can you reproduce this problem on your board without the patches
when running this script?

2) If yes, does the first patch alone solve the problem?

This brings us to the second issue, the one regarding the
'eee-broken-1000t' quirk. Since the two issues are closely related we
are confident that the change in the IRQ type solves this problem as
well (and this was confirmed by Jerome as well on the arm64 boards).

For this case I cannot provide a real reproducer, so we can only
stress test the network with iperf3 and try to reproduce the issue.
This is also because we think that your approach of using UDP and your
packet generator is probably not the best way to test the patch, given
that (1) using UDP is not reliable according to our tests, (2) there is
an asymmetry in TX/RX, (3) the packet loss could be due to saturation
of the bandwidth, etc...

So AFAIK the best way to test this problem is using iperf3, the same
way it is done in the script in [0]. I was not involved with this issue
a year and a half ago but AFAIK this is the way it was reproduced.

This brings me to more questions for you:

3) Running iperf3 tests in TX / RX / TX+RX without the
'eee-broken-1000t' quirk applied, are you able to reproduce the EEE
problem?

4) Any change when the 'eee-broken-1000t' quirk is applied?

When testing (3) and (4) please also check the status of EEE using
ethtool.

Hopefully this will bring a bit of clarity to the whole situation :)

Cheers,

[0] https://paste.fedoraproject.org/paste/GBFxjAQ0JULsYQlyYO2KOw

--
Carlo Caione
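During a long iperf3 session of the kind described above, the collapse from ~1Gbps to <30Mbps can be spotted by scanning the per-interval lines of the text output. The following is a sketch in Python, assuming the plain-text iperf3 format that appears in Martin's logs in this thread. The function name `slow_intervals`, the regular expression and the 30 Mbit/s default threshold are illustrative choices and are not taken from the script in [0].

```python
import re

# Matches a per-interval or summary line of iperf3's plain-text output, e.g.
# "[  5]   0.00-1.00   sec  80.6 MBytes   675 Mbits/sec    0   2.78 MBytes"
LINE_RE = re.compile(
    r"\[\s*\d+\]\s+(\d+\.\d+)-(\d+\.\d+)\s+sec\s+"
    r"[\d.]+\s+[KMG]?Bytes\s+([\d.]+)\s+([KMG]?)bits/sec"
)

# Factors converting the reported unit prefix to Mbit/s.
TO_MBITS = {"": 1e-6, "K": 1e-3, "M": 1.0, "G": 1e3}


def slow_intervals(output, threshold_mbits=30.0):
    """Return (start, end, Mbit/s) for every interval below the threshold."""
    slow = []
    for line in output.splitlines():
        line = line.strip()
        # Skip the end-of-run summary lines, which end with the role name.
        if line.endswith(("sender", "receiver")):
            continue
        match = LINE_RE.search(line)
        if match is None:
            continue
        start, end, rate, unit = match.groups()
        mbits = float(rate) * TO_MBITS[unit]
        if mbits < threshold_mbits:
            slow.append((float(start), float(end), mbits))
    return slow


# The first line is taken from the thread; the second and third are
# made-up lines showing a collapsed interval and a summary line.
SAMPLE = """\
[  5]   0.00-1.00   sec  80.6 MBytes   675 Mbits/sec    0   2.78 MBytes
[  5]   1.00-2.00   sec   366 KBytes  3.00 Mbits/sec    0   2.78 MBytes
[  5]   0.00-2.00   sec  81.0 MBytes   339 Mbits/sec    0             sender
"""

if __name__ == "__main__":
    for start, end, mbits in slow_intervals(SAMPLE):
        print(f"slow interval {start:.2f}-{end:.2f} s: {mbits:.2f} Mbit/s")
```

A single slow interval is not proof of the bug; in the description above the drop is persistent, so a real check would probably require several consecutive slow intervals before raising an alarm.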
Hi Carlo,

On Sat, Dec 08, 2018 at 10:46:17AM +0000, Carlo Caione wrote:
> On Fri, 2018-12-07 at 19:51 +0100, Emiliano Ingrassia wrote:
> > Hi Carlo,
>
> Hi Emiliano,
>
> > tests[0] conducted on an Odroid-C1+ board equipped with a Meson8b SoC
> > have shown a high packet loss (90% and more) during a simple ping
> > test from a laptop to the board.
> > Testing the two patches separately clearly showed that this depends
> > on the removal of the "eee-broken-1000t" flag from the board PHY
> > description in the corresponding device tree.
> >
> > About the first patch (MAC IRQ type), no test has shown evidence
> > that it is needed. I suggest you run some tests on real hardware,
> > as I did, to confirm or disprove my results.
>
> Let's try to step back a bit and see what we can do to clarify this
> situation.
>

Ok, I'll be glad to help you :)

> First of all, for arm64 we are pretty sure that both patches are needed
> because we ran extensive and lengthy tests, especially regarding the
> change in the IRQ trigger type. For arm things are not so clear, so for
> now we decided to merge the arm64 patch and just wait on the arm one.
>
> Let's first focus on the patch changing the IRQ type.
>
> The problem with the IRQ type is triggered on the arm64 boards we
> tested using the script in [0]. If we run this stress test on the arm64
> boards without the trigger-changing patch, after a few hours (anywhere
> from 2h to 6h, sometimes more) we can see the connection dropping from
> ~1Gbps to <30Mbps. Jerome gave a nice explanation of why, and after
> changing the IRQ trigger type we couldn't see the issue anymore. This
> was confirmed not just by BayLibre but also by other independent
> sources, so we are pretty confident in this solution.
>
> So my first two questions for you are:
>
> 1) Can you reproduce this problem on your board without the patches
> when running this script?
>
> 2) If yes, does the first patch alone solve the problem?
>

I ran two tests executing the script you provided on an Odroid-C1+
board (REV 0.4 20150930) for 6 hours, using my laptop as server.
The kernel I used was compiled from the "v4.21/dt64-testing" branch
provided by Kevin Hilman (thank you Kevin!).

The results are available in [0].
The first test (no-patch-iperf-20181211000039.log)
was run with none of your patches applied.
The second test (irq-patch-iperf-20181211130953.log)
was run with only the patch about IRQ type applied.

As you can see, I did not experience exactly the problem you had,
but I see a more stable behavior with the IRQ type patch applied.

> This brings us to the second issue, the one regarding the
> 'eee-broken-1000t' quirk. Since the two issues are closely related we
> are confident that the change in the IRQ type solves this problem as
> well (and this was confirmed by Jerome as well on the arm64 boards).
>

The problem here is that, without the "eee-broken-1000t" flag,
simple ping tests from a host to the board showed a high packet loss
(about 90%), even with the IRQ type patch applied.

> For this case I cannot provide a real reproducer, so we can only
> stress test the network with iperf3 and try to reproduce the issue.
> This is also because we think that your approach of using UDP and your
> packet generator is probably not the best way to test the patch, given
> that (1) using UDP is not reliable according to our tests, (2) there is
> an asymmetry in TX/RX, (3) the packet loss could be due to saturation
> of the bandwidth, etc...
>

The tests I ran with the kernel packet generator gave me some
interesting information. The board dropped all incoming traffic
when transmitting at full rate (~940 Mbps).
Although there is an asymmetry in the FIFO sizes
(the Rx FIFO is twice the size of the Tx FIFO), I would expect a result
more similar to the one I had in step 2 of TEST 0 [1], after a while.
However, this behavior could be due to the driver,
and so not very relevant to this discussion ;)

> So AFAIK the best way to test this problem is using iperf3, the same
> way it is done in the script in [0]. I was not involved with this issue
> a year and a half ago but AFAIK this is the way it was reproduced.
>
> This brings me to more questions for you:
>
> 3) Running iperf3 tests in TX / RX / TX+RX without the
> 'eee-broken-1000t' quirk applied, are you able to reproduce the EEE
> problem?
>
> 4) Any change when the 'eee-broken-1000t' quirk is applied?
>
> When testing (3) and (4) please also check the status of EEE using
> ethtool.
>
> Hopefully this will bring a bit of clarity to the whole situation :)
>
> Cheers,
>
> [0] https://paste.fedoraproject.org/paste/GBFxjAQ0JULsYQlyYO2KOw
>
> --
> Carlo Caione
>

Best regards,

Emiliano

[0] https://drive.google.com/drive/folders/1BMe8vkm16KdgijlhFfZH_xph5eDNdkqO?usp=sharing
[1] http://lists.infradead.org/pipermail/linux-amlogic/2018-December/009397.html
Hi Carlo,

On Fri, Dec 7, 2018 at 11:52 AM Carlo Caione <ccaione@baylibre.com> wrote:
>
> A long running stress test on a custom board shipping an AXG SoCs and a
> Realtek RTL8211F PHY revealed that after a few hours the connection
> speed would drop drastically, from ~1000Mbps to ~3Mbps. At the same time
> the 'macirq' (eth0) IRQ would stop being triggered at all and as
> consequence the GMAC IRQs never ACKed.
>
> After a painful investigation the problem seemed to be due to a wrong
> defined IRQ type for the GMAC IRQ that should be LEVEL_HIGH instead of
> EDGE_RISING.
>
> The change in the macirq IRQ type also solved another long standing
> issue affecting this SoC/PHY where EEE was causing the network
> connection to die after stressing it with iperf3 (even though much
> sooner). It's now possible to remove the 'eee-broken-1000t' quirk as
> well.
(disclaimer: I was not able to reproduce this bug without your
patches, but I didn't run iperf3 for more than a couple of minutes)
I did test your patch with and without my "Meson8b RGMII Ethernet pin
cleanup" from [0] which shows that there's another performance related
problem:
1) before and after your patch receive speeds were fine (above
700Mbit/s and no transmit errors / retries in iperf3) but the transmit
speed was bad (<200Mbit/s and >1500 retries in iperf3)
2) transmit errors (when Odroid-C1 is sending) are not occurring
anymore after my patch from [0]

thus I believe your patch is fine, especially since we already have
IRQ_TYPE_LEVEL_HIGH for the dwc2 controllers

> Fixes: 9c15795a4f96 ("ARM: dts: meson8b-odroidc1: ethernet support")
> Signed-off-by: Carlo Caione <ccaione@baylibre.com>
Reviewed-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Tested-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>

I wonder if Kevin can send this as a fix for v4.20

Regards
Martin

[0] http://lists.infradead.org/pipermail/linux-amlogic/2018-December/009665.html
Martin Blumenstingl <martin.blumenstingl@googlemail.com> writes:

> Hi Carlo,
>
> On Fri, Dec 7, 2018 at 11:52 AM Carlo Caione <ccaione@baylibre.com> wrote:
>>
>> A long running stress test on a custom board shipping an AXG SoCs and a
>> Realtek RTL8211F PHY revealed that after a few hours the connection
>> speed would drop drastically, from ~1000Mbps to ~3Mbps. At the same time
>> the 'macirq' (eth0) IRQ would stop being triggered at all and as
>> consequence the GMAC IRQs never ACKed.
>>
>> After a painful investigation the problem seemed to be due to a wrong
>> defined IRQ type for the GMAC IRQ that should be LEVEL_HIGH instead of
>> EDGE_RISING.
>>
>> The change in the macirq IRQ type also solved another long standing
>> issue affecting this SoC/PHY where EEE was causing the network
>> connection to die after stressing it with iperf3 (even though much
>> sooner). It's now possible to remove the 'eee-broken-1000t' quirk as
>> well.
> (disclaimer: I was not able to reproduce this bug without your
> patches, but I didn't run iperf3 for more than a couple of minutes)
> I did test your patch with and without my "Meson8b RGMII Ethernet pin
> cleanup" from [0] which shows that there's another performance related
> problem:
> 1) before and after your patch receive speeds were fine (above
> 700Mbit/s and no transmit errors / retries in iperf3) but the transmit
> speed was bad (<200Mbit/s and >1500 retries in iperf3)
> 2) transmit errors (when Odroid-C1 is sending) are not occurring
> anymore after my patch from [0]
>
> thus I believe your patch is fine, especially since we already have
> IRQ_TYPE_LEVEL_HIGH for the dwc2 controllers
>
>> Fixes: 9c15795a4f96 ("ARM: dts: meson8b-odroidc1: ethernet support")
>> Signed-off-by: Carlo Caione <ccaione@baylibre.com>
> Reviewed-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
> Tested-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
>
> I wonder if Kevin can send this as a fix for v4.20

Queued as a fix for v5.0-rc

Kevin
diff --git a/arch/arm/boot/dts/meson.dtsi b/arch/arm/boot/dts/meson.dtsi
index 0d9faf1a51ea..a86b89086334 100644
--- a/arch/arm/boot/dts/meson.dtsi
+++ b/arch/arm/boot/dts/meson.dtsi
@@ -263,7 +263,7 @@
 			compatible = "amlogic,meson6-dwmac", "snps,dwmac";
 			reg = <0xc9410000 0x10000
 			       0xc1108108 0x4>;
-			interrupts = <GIC_SPI 8 IRQ_TYPE_EDGE_RISING>;
+			interrupts = <GIC_SPI 8 IRQ_TYPE_LEVEL_HIGH>;
 			interrupt-names = "macirq";
 			status = "disabled";
 		};
diff --git a/arch/arm/boot/dts/meson8b-odroidc1.dts b/arch/arm/boot/dts/meson8b-odroidc1.dts
index 58669abda259..a951a6632d0c 100644
--- a/arch/arm/boot/dts/meson8b-odroidc1.dts
+++ b/arch/arm/boot/dts/meson8b-odroidc1.dts
@@ -221,7 +221,6 @@
 		/* Realtek RTL8211F (0x001cc916) */
 		eth_phy: ethernet-phy@0 {
 			reg = <0>;
-			eee-broken-1000t;
 			interrupt-parent = <&gpio_intc>;
 			/* GPIOH_3 */
 			interrupts = <17 IRQ_TYPE_LEVEL_LOW>;
A long running stress test on a custom board shipping an AXG SoCs and a
Realtek RTL8211F PHY revealed that after a few hours the connection
speed would drop drastically, from ~1000Mbps to ~3Mbps. At the same time
the 'macirq' (eth0) IRQ would stop being triggered at all and as
consequence the GMAC IRQs never ACKed.

After a painful investigation the problem seemed to be due to a wrong
defined IRQ type for the GMAC IRQ that should be LEVEL_HIGH instead of
EDGE_RISING.

The change in the macirq IRQ type also solved another long standing
issue affecting this SoC/PHY where EEE was causing the network
connection to die after stressing it with iperf3 (even though much
sooner). It's now possible to remove the 'eee-broken-1000t' quirk as
well.

Fixes: 9c15795a4f96 ("ARM: dts: meson8b-odroidc1: ethernet support")
Signed-off-by: Carlo Caione <ccaione@baylibre.com>
---
 arch/arm/boot/dts/meson.dtsi           | 2 +-
 arch/arm/boot/dts/meson8b-odroidc1.dts | 1 -
 2 files changed, 1 insertion(+), 2 deletions(-)
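
The symptom named in the commit message, the 'macirq' (eth0) interrupt no longer firing, can be watched for by comparing two snapshots of /proc/interrupts taken while traffic is flowing. The Python sketch below rests on several assumptions: that the interrupt is listed in /proc/interrupts under the interface name (eth0 here), that the helper names `irq_count` and `irq_stalled` are hypothetical, and that the sample snapshots are invented for illustration rather than captured on a real board.

```python
def irq_count(interrupts_text, action_name):
    """Sum the per-CPU counters of every /proc/interrupts row whose last
    column equals action_name."""
    lines = interrupts_text.strip("\n").splitlines()
    if not lines:
        return 0
    # The header row lists one column per online CPU (CPU0, CPU1, ...).
    n_cpus = len(lines[0].split())
    total = 0
    for line in lines[1:]:
        fields = line.split()
        # Expect "<irq>:", n_cpus counters, then at least one name column.
        if len(fields) < n_cpus + 2 or fields[-1] != action_name:
            continue
        total += sum(int(count) for count in fields[1:1 + n_cpus])
    return total


def irq_stalled(earlier, later, action_name="eth0"):
    """True when the counter did not advance between the two snapshots."""
    return irq_count(later, action_name) == irq_count(earlier, action_name)


# Invented snapshots for illustration only (not from a real board).
SNAPSHOT_A = """\
           CPU0       CPU1
 30:     120000          0     GIC-0  40 Edge      eth0
 31:        512         12     GIC-0  58 Level     mmc0
"""
SNAPSHOT_B = """\
           CPU0       CPU1
 30:     125000          0     GIC-0  40 Edge      eth0
 31:        530         12     GIC-0  58 Level     mmc0
"""

if __name__ == "__main__":
    print("stalled:", irq_stalled(SNAPSHOT_A, SNAPSHOT_B))
```

On a board, the two snapshots would come from reading /proc/interrupts a few seconds apart during an iperf3 run. A counter that stays flat with no traffic is expected, so the check is only meaningful under load.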