Message ID | 1533724802-30944-3-git-send-email-wgong@codeaurora.org (mailing list archive)
---|---
State | New, archived
Series | Change sk_pacing_shift in ieee80211_hw for best tx throughput
Wen Gong <wgong@codeaurora.org> writes:

> Upstream kernel has an interface to help adjust sk_pacing_shift to help
> improve TCP UL throughput.
> The sk_pacing_shift is 8 in mac80211; this is based on tests of 11N
> WiFi chips with ath9k. For QCA6174/QCA9377 PCI 11AC chips, 11AC VHT80
> TCP UL throughput testing shows 6 is the optimal value.
> Overwrite the sk_pacing_shift to 6 in the ath10k driver for QCA6174/9377 PCI.
>
> Tested with QCA6174 PCI with firmware
> WLAN.RM.4.4.1-00109-QCARMSWPZ-1, but this will also affect QCA9377 PCI.
> It's not a regression with new firmware releases.
>
> Here are two test results with different settings:
>
> ARM CPU based device with QCA6174A PCI with different sk_pacing_shift:
>
>   sk_pacing_shift  throughput(Mbps)  CPU utilization
>   6                500 (-P5)         ~75% idle, Focus on CPU1: ~14% idle
>   7                454 (-P5)         ~80% idle, Focus on CPU1: ~4% idle
>   8                288               ~90% idle, Focus on CPU1: ~35% idle
>   9                ~200              ~92% idle, Focus on CPU1: ~50% idle
>
> 5G TCP UL VHT80 on X86 platform with QCA6174A PCI with sk_pacing_shift
> set to 6:
>
>   tcp_limit_output_bytes         throughput(Mbps)
>   default(262144) + 1 stream     336
>   default(262144) + 2 streams    558
>   default(262144) + 3 streams    584
>   default(262144) + 4 streams    602
>   default(262144) + 5 streams    598
>   changed(2621440) + 1 stream    598
>   changed(2621440) + 2 streams   601

You still haven't provided any latency numbers for these tests, which
makes it impossible to verify that setting sk_pacing_shift to 6 is the
right tradeoff.

As I said before, from your numbers I suspect the right setting is
actually 7, which would be 10-20 ms less latency under load; way more
important than ~50 Mbps...

-Toke
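[Editor's note] For readers new to the knob: sk_pacing_shift bounds how much unsent data TCP keeps queued below the socket to roughly sk_pacing_rate >> sk_pacing_shift bytes, i.e. about 2^-shift seconds' worth of traffic at the current pacing rate (the same per-socket limit is also capped by the tcp_limit_output_bytes sysctl varied in the second table above). A minimal, purely illustrative sketch of that arithmetic; the 600 Mbit/s rate is an assumed figure chosen to match the throughput numbers in this thread:

```c
/*
 * Illustrative only: mirrors the sk_pacing_rate >> sk_pacing_shift bound
 * that TCP small queues apply per socket. The 600 Mbit/s pacing rate is an
 * assumption for the sake of the example, not a measured value.
 */
#include <stdio.h>

int main(void)
{
	const double rate_bytes_per_sec = 600e6 / 8;	/* assumed 600 Mbit/s */

	for (int shift = 6; shift <= 10; shift++) {
		double secs = 1.0 / (1 << shift);	/* 2^-shift seconds of data */
		double bytes = rate_bytes_per_sec * secs;

		printf("shift %2d: %4.1f ms of buffering, ~%4.0f KB queued at 600 Mbit/s\n",
		       shift, secs * 1000.0, bytes / 1024.0);
	}
	return 0;
}
```

So moving from mac80211's default of 8 down to 6 quadruples the amount of not-yet-sent data a single TCP flow may hold below the socket, which is the latency/throughput trade-off debated in the rest of the thread.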
> -----Original Message----- > From: ath10k <ath10k-bounces@lists.infradead.org> On Behalf Of Toke > Høiland-Jørgensen > Sent: Wednesday, August 8, 2018 6:44 PM > To: Wen Gong <wgong@codeaurora.org>; ath10k@lists.infradead.org; > johannes@sipsolutions.net > Cc: linux-wireless@vger.kernel.org > Subject: Re: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi > chips > > Wen Gong <wgong@codeaurora.org> writes: > > > Upstream kernel has an interface to help adjust sk_pacing_shift to > > help improve TCP UL throughput. > > The sk_pacing_shift is 8 in mac80211, this is based on test with 11N > > WiFi chips with ath9k. For QCA6174/QCA9377 PCI 11AC chips, the 11AC > > VHT80 TCP UL throughput testing result shows 6 is the optimal. > > Overwrite the sk_pacing_shift to 6 in ath10k driver for QCA6174/9377 PCI. > > > > Tested with QCA6174 PCI with firmware > > WLAN.RM.4.4.1-00109-QCARMSWPZ-1, but this will also affect QCA9377 > PCI. > > It's not a regression with new firmware releases. > > > > There have 2 test result of different settings: > > > > ARM CPU based device with QCA6174A PCI with different > > sk_pacing_shift: > > > > sk_pacing_shift throughput(Mbps) CPU utilization > > 6 500(-P5) ~75% idle, Focus on CPU1: ~14%idle > > 7 454(-P5) ~80% idle, Focus on CPU1: ~4%idle > > 8 288 ~90% idle, Focus on CPU1: ~35%idle > > 9 ~200 ~92% idle, Focus on CPU1: ~50%idle > > > > 5G TCP UL VTH80 on X86 platform with QCA6174A PCI with > > sk_packing_shift set to 6: > > > > tcp_limit_output_bytes throughput(Mbps) > > default(262144)+1 Stream 336 > > default(262144)+2 Streams 558 > > default(262144)+3 Streams 584 > > default(262144)+4 Streams 602 > > default(262144)+5 Streams 598 > > changed(2621440)+1 Stream 598 > > changed(2621440)+2 Streams 601 > > You still haven't provided any latency numbers for these tests, which makes > it impossible to verify that setting sk_pacing_shift to 6 is the right tradeoff. > > As I said before, from your numbers I suspect the right setting is actually 7, > which would be 10-20ms less latency under load; way more important than > ~50 Mbps... > Hi Toke, Could you give the command line for the latency test? https://flent.org/intro.html#quick-start I used the command but test failed: flent tcp_download -p 1 -l 60 -H 192.168.1.5 -t text-to-be-included-in-plot -o file1.png error loading plotter: unable to find plot configuration "1" > -Toke > > _______________________________________________ > ath10k mailing list > ath10k@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/ath10k
Wen Gong <wgong@qti.qualcomm.com> writes: >> -----Original Message----- >> From: ath10k <ath10k-bounces@lists.infradead.org> On Behalf Of Toke >> Høiland-Jørgensen >> Sent: Wednesday, August 8, 2018 6:44 PM >> To: Wen Gong <wgong@codeaurora.org>; ath10k@lists.infradead.org; >> johannes@sipsolutions.net >> Cc: linux-wireless@vger.kernel.org >> Subject: Re: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi >> chips >> >> Wen Gong <wgong@codeaurora.org> writes: >> >> > Upstream kernel has an interface to help adjust sk_pacing_shift to >> > help improve TCP UL throughput. >> > The sk_pacing_shift is 8 in mac80211, this is based on test with 11N >> > WiFi chips with ath9k. For QCA6174/QCA9377 PCI 11AC chips, the 11AC >> > VHT80 TCP UL throughput testing result shows 6 is the optimal. >> > Overwrite the sk_pacing_shift to 6 in ath10k driver for QCA6174/9377 PCI. >> > >> > Tested with QCA6174 PCI with firmware >> > WLAN.RM.4.4.1-00109-QCARMSWPZ-1, but this will also affect QCA9377 >> PCI. >> > It's not a regression with new firmware releases. >> > >> > There have 2 test result of different settings: >> > >> > ARM CPU based device with QCA6174A PCI with different >> > sk_pacing_shift: >> > >> > sk_pacing_shift throughput(Mbps) CPU utilization >> > 6 500(-P5) ~75% idle, Focus on CPU1: ~14%idle >> > 7 454(-P5) ~80% idle, Focus on CPU1: ~4%idle >> > 8 288 ~90% idle, Focus on CPU1: ~35%idle >> > 9 ~200 ~92% idle, Focus on CPU1: ~50%idle >> > >> > 5G TCP UL VTH80 on X86 platform with QCA6174A PCI with >> > sk_packing_shift set to 6: >> > >> > tcp_limit_output_bytes throughput(Mbps) >> > default(262144)+1 Stream 336 >> > default(262144)+2 Streams 558 >> > default(262144)+3 Streams 584 >> > default(262144)+4 Streams 602 >> > default(262144)+5 Streams 598 >> > changed(2621440)+1 Stream 598 >> > changed(2621440)+2 Streams 601 >> >> You still haven't provided any latency numbers for these tests, which makes >> it impossible to verify that setting sk_pacing_shift to 6 is the right tradeoff. >> >> As I said before, from your numbers I suspect the right setting is actually 7, >> which would be 10-20ms less latency under load; way more important than >> ~50 Mbps... >> > Hi Toke, > Could you give the command line for the latency test? > https://flent.org/intro.html#quick-start > I used the command but test failed: > flent tcp_download -p 1 -l 60 -H 192.168.1.5 -t text-to-be-included-in-plot -o file1.png > error loading plotter: unable to find plot configuration "1" Try something like: flent -H 192.168.1.5 -t "sk_pacing_shift 7" tcp_nup --test-parameter upload_streams=1 This will note the value of sk_pacing_shift you are testing in the data file (so change that as appropriate), and you can vary the number of TCP streams by changing the upload_streams parameter. Note that in this case I'm assuming you are running Flent on the device with the kernel you are trying to test, so you want a TCP transfer going *from* the device. If not, change "tcp_nup" to "tcp_ndown" and "upload_streams" to "download_streams". Upload is netperf TCP_STREAM test, and download is TCP_MAERTS. When running the above command you'll get a summary output on the terminal that you can paste on the list; and also a data file to plot things form. For instance, you can do something like 'flent -p ping_cdf *.flent.gz' to get a CDF plot of all your test results afterwards. You are also very welcome to send me the .flent.gz data files and I'll take a look to make sure everything looks reasonable :) -Toke
> -----Original Message-----
> From: Toke Høiland-Jørgensen <toke@toke.dk>
> Sent: Friday, August 10, 2018 9:18 PM
> To: Wen Gong <wgong@qti.qualcomm.com>; Wen Gong <wgong@codeaurora.org>;
> ath10k@lists.infradead.org; johannes@sipsolutions.net
> Cc: linux-wireless@vger.kernel.org
> Subject: RE: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips
>
> Wen Gong <wgong@qti.qualcomm.com> writes:
>
> > Hi Toke,
> > Could you give the command line for the latency test?
> > https://flent.org/intro.html#quick-start
> > I used the command but test failed:
> > flent tcp_download -p 1 -l 60 -H 192.168.1.5 -t text-to-be-included-in-plot -o file1.png
> > error loading plotter: unable to find plot configuration "1"
>
> Try something like:
>
> flent -H 192.168.1.5 -t "sk_pacing_shift 7" tcp_nup --test-parameter upload_streams=1
>
> This will note the value of sk_pacing_shift you are testing in the data file
> (so change that as appropriate), and you can vary the number of TCP streams
> by changing the upload_streams parameter.
>
> Note that in this case I'm assuming you are running Flent on the device with
> the kernel you are trying to test, so you want a TCP transfer going *from*
> the device. If not, change "tcp_nup" to "tcp_ndown" and "upload_streams" to
> "download_streams". Upload is netperf TCP_STREAM test, and download is
> TCP_MAERTS.
>
> When running the above command you'll get a summary output on the terminal
> that you can paste on the list; and also a data file to plot things from.
> For instance, you can do something like 'flent -p ping_cdf *.flent.gz' to get
> a CDF plot of all your test results afterwards. You are also very welcome to
> send me the .flent.gz data files and I'll take a look to make sure everything
> looks reasonable :)
>
> -Toke

Hi Toke,

I have tested with your method for shift 6/8/10/7, each twice, in an open-air
environment.

Shift 6:

wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t "sk_pacing_shift6" tcp_nup --test-parameter upload_streams=1
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-13T110414.699512.sk_pacing_shift6.flent.gz.

Summary of tcp_nup test run 'sk_pacing_shift6' (at 2018-08-13 03:04:14.699512):

                      avg     median  # data pts
 Ping (ms) ICMP :    9.91       4.99  ms       350
 TCP upload avg :  242.48     262.43  Mbits/s  301
 TCP upload sum :  242.48     262.43  Mbits/s  301
 TCP upload::1  :  242.48     263.34  Mbits/s  271

wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t "sk_pacing_shift6" tcp_nup --test-parameter upload_streams=1
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-13T113317.074077.sk_pacing_shift6.flent.gz.

Summary of tcp_nup test run 'sk_pacing_shift6' (at 2018-08-13 03:33:17.074077):

                      avg     median  # data pts
 Ping (ms) ICMP :    7.75       5.30  ms       350
 TCP upload avg :  239.17     250.84  Mbits/s  301
 TCP upload sum :  239.17     250.84  Mbits/s  301
 TCP upload::1  :  239.17     255.03  Mbits/s  266

Shift 8:

wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t "sk_pacing_shift8" tcp_nup --test-parameter upload_streams=1
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-13T121433.187781.sk_pacing_shift8.flent.gz.

Summary of tcp_nup test run 'sk_pacing_shift8' (at 2018-08-13 04:14:33.187781):

                      avg     median  # data pts
 Ping (ms) ICMP :   17.12       7.07  ms       350
 TCP upload avg :  180.05     185.82  Mbits/s  301
 TCP upload sum :  180.05     185.82  Mbits/s  301
 TCP upload::1  :  180.05     189.41  Mbits/s  253

wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t "sk_pacing_shift8" tcp_nup --test-parameter upload_streams=1
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-13T121602.382575.sk_pacing_shift8.flent.gz.

Summary of tcp_nup test run 'sk_pacing_shift8' (at 2018-08-13 04:16:02.382575):

                      avg     median  # data pts
 Ping (ms) ICMP :   13.90       5.89  ms       350
 TCP upload avg :  207.56     228.16  Mbits/s  301
 TCP upload sum :  207.56     228.16  Mbits/s  301
 TCP upload::1  :  207.56     228.11  Mbits/s  259

Shift 10:

wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t "sk_pacing_shift10" tcp_nup --test-parameter upload_streams=1
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-13T121844.493498.sk_pacing_shift10.flent.gz.

Summary of tcp_nup test run 'sk_pacing_shift10' (at 2018-08-13 04:18:44.493498):

                      avg     median  # data pts
 Ping (ms) ICMP :   15.11       7.41  ms       350
 TCP upload avg :  162.38     164.10  Mbits/s  301
 TCP upload sum :  162.38     164.10  Mbits/s  301
 TCP upload::1  :  162.38     165.47  Mbits/s  252

wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t "sk_pacing_shift10" tcp_nup --test-parameter upload_streams=1
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-13T122108.347163.sk_pacing_shift10.flent.gz.

Summary of tcp_nup test run 'sk_pacing_shift10' (at 2018-08-13 04:21:08.347163):

                      avg     median  # data pts
 Ping (ms) ICMP :   13.69       7.48  ms       350
 TCP upload avg :  171.11     170.52  Mbits/s  301
 TCP upload sum :  171.11     170.52  Mbits/s  301
 TCP upload::1  :  171.11     171.36  Mbits/s  258

Shift 7:

wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t "sk_pacing_shift7" tcp_nup --test-parameter upload_streams=1
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-13T122948.020974.sk_pacing_shift7.flent.gz.

Summary of tcp_nup test run 'sk_pacing_shift7' (at 2018-08-13 04:29:48.020974):

                      avg     median  # data pts
 Ping (ms) ICMP :   14.12       6.61  ms       350
 TCP upload avg :  188.19     188.04  Mbits/s  301
 TCP upload sum :  188.19     188.04  Mbits/s  301
 TCP upload::1  :  188.19     190.88  Mbits/s  258

wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t "sk_pacing_shift7" tcp_nup --test-parameter upload_streams=1
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-13T123129.526514.sk_pacing_shift7.flent.gz.

Summary of tcp_nup test run 'sk_pacing_shift7' (at 2018-08-13 04:31:29.526514):

                      avg     median  # data pts
 Ping (ms) ICMP :   10.31       6.32  ms       350
 TCP upload avg :  212.70     233.69  Mbits/s  301
 TCP upload sum :  212.70     233.69  Mbits/s  301
 TCP upload::1  :  212.70     237.65  Mbits/s  262
Wen Gong <wgong@qti.qualcomm.com> writes:

> Hi Toke,
>
> I have tested with your method for shift 6/8/10/7, each twice, in an
> open-air environment.

Great, thanks!

> Shift 6:
> wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t
> "sk_pacing_shift6" tcp_nup --test-parameter upload_streams=1

Just to be sure: You are running this on your laptop? And that laptop has
the ath10k and the modified kernel you are testing? Is 192.168.1.7 the AP
or another device?

I'd like to see what happens when the link is fully saturated by multiple
streams as well. Could you repeat the tests with upload_streams=5, please?
And if you could share the .flent.gz files (upload them somewhere, or email
me off-list), that would be useful as well :)

Thanks,

-Toke
> -----Original Message-----
> From: Toke Høiland-Jørgensen <toke@toke.dk>
> Sent: Monday, August 13, 2018 7:18 PM
> To: Wen Gong <wgong@qti.qualcomm.com>; Wen Gong <wgong@codeaurora.org>;
> ath10k@lists.infradead.org; johannes@sipsolutions.net
> Cc: linux-wireless@vger.kernel.org
> Subject: RE: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips
>
> Wen Gong <wgong@qti.qualcomm.com> writes:
>
> > Hi Toke,
> >
> > I have tested with your method for shift 6/8/10/7, each twice, in an
> > open-air environment.
>
> Great, thanks!
>
> > Shift 6:
> > wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t
> > "sk_pacing_shift6" tcp_nup --test-parameter upload_streams=1
>
> Just to be sure: You are running this on your laptop? And that laptop has
> the ath10k and the modified kernel you are testing? Is 192.168.1.7 the AP
> or another device?

Hi Toke,

Yes, I run this on my laptop (Dell LATITUDE E5440 + QCA6174 PCI wireless
card) in an open-air environment.
The kernel is 4.18.0-rc2-wt-ath+ #2 SMP, which I built myself.
mac80211.ko/ath10k are built with the 2 patches.
192.168.1.7 is another device (a PC running 4.4.0-116-generic #140~14.04.1-Ubuntu).

> I'd like to see what happens when the link is fully saturated by multiple
> streams as well. Could you repeat the tests with upload_streams=5, please?
> And if you could share the .flent.gz files (upload them somewhere, or email
> me off-list), that would be useful as well :)

Test results with upload_streams=5:

sk_pacing_shift6:

wgong@wgong-Latitude-E5440-1:~/flent/5stream$ flent -H 192.168.1.7 -t "sk_pacing_shift6" tcp_nup --test-parameter upload_streams=5
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-14T105332.356811.sk_pacing_shift6.flent.gz.

Summary of tcp_nup test run 'sk_pacing_shift6' (at 2018-08-14 02:53:32.356811):

                      avg     median  # data pts
 Ping (ms) ICMP :   20.46      13.85  ms       350
 TCP upload avg :   66.30      68.71  Mbits/s  301
 TCP upload sum :  331.49     343.55  Mbits/s  301
 TCP upload::1  :   60.80      64.65  Mbits/s  202
 TCP upload::2  :   77.72      82.89  Mbits/s  211
 TCP upload::3  :   60.52      56.09  Mbits/s  202
 TCP upload::4  :   67.39      73.56  Mbits/s  204
 TCP upload::5  :   65.06      71.97  Mbits/s  201

wgong@wgong-Latitude-E5440-1:~/flent/5stream$ flent -H 192.168.1.7 -t "sk_pacing_shift6" tcp_nup --test-parameter upload_streams=5
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-14T105554.583603.sk_pacing_shift6.flent.gz.

Summary of tcp_nup test run 'sk_pacing_shift6' (at 2018-08-14 02:55:54.583603):

                      avg     median  # data pts
 Ping (ms) ICMP :   20.86      13.80  ms       350
 TCP upload avg :   75.88      83.17  Mbits/s  301
 TCP upload sum :  379.42     415.84  Mbits/s  301
 TCP upload::1  :   82.07      90.73  Mbits/s  225
 TCP upload::2  :   74.08      78.29  Mbits/s  204
 TCP upload::3  :   70.44      75.65  Mbits/s  217
 TCP upload::4  :   82.70      92.86  Mbits/s  223
 TCP upload::5  :   70.13      76.87  Mbits/s  210

sk_pacing_shift7:

wgong@wgong-Latitude-E5440-1:~/flent/5stream$ flent -H 192.168.1.7 -t "sk_pacing_shift7" tcp_nup --test-parameter upload_streams=5
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-14T105759.169367.sk_pacing_shift7.flent.gz.

Summary of tcp_nup test run 'sk_pacing_shift7' (at 2018-08-14 02:57:59.169367):

                      avg     median  # data pts
 Ping (ms) ICMP :   24.66      15.55  ms       350
 TCP upload avg :   65.33      72.83  Mbits/s  301
 TCP upload sum :  326.67     363.10  Mbits/s  301
 TCP upload::1  :   77.60      92.93  Mbits/s  214
 TCP upload::2  :   67.22      75.95  Mbits/s  213
 TCP upload::3  :   65.81      74.54  Mbits/s  205
 TCP upload::4  :   63.06      70.37  Mbits/s  207
 TCP upload::5  :   52.98      55.78  Mbits/s  198

wgong@wgong-Latitude-E5440-1:~/flent/5stream$ flent -H 192.168.1.7 -t "sk_pacing_shift7" tcp_nup --test-parameter upload_streams=5
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-14T105923.620572.sk_pacing_shift7.flent.gz.

Summary of tcp_nup test run 'sk_pacing_shift7' (at 2018-08-14 02:59:23.620572):

                      avg     median  # data pts
 Ping (ms) ICMP :   25.03      14.25  ms       350
 TCP upload avg :   74.35      85.64  Mbits/s  297
 TCP upload sum :  371.77     428.19  Mbits/s  297
 TCP upload::1  :   74.12      82.55  Mbits/s  216
 TCP upload::2  :   67.78      71.87  Mbits/s  208
 TCP upload::3  :   82.40      94.72  Mbits/s  210
 TCP upload::4  :   70.77      82.43  Mbits/s  206
 TCP upload::5  :   76.70      88.62  Mbits/s  210

sk_pacing_shift8:

wgong@wgong-Latitude-E5440-1:~/flent/5stream$ flent -H 192.168.1.7 -t "sk_pacing_shift8" tcp_nup --test-parameter upload_streams=5
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-14T110334.670845.sk_pacing_shift8.flent.gz.

Summary of tcp_nup test run 'sk_pacing_shift8' (at 2018-08-14 03:03:34.670845):

                      avg     median  # data pts
 Ping (ms) ICMP :   25.03      19.50  ms       350
 TCP upload avg :   58.11      59.70  Mbits/s  301
 TCP upload sum :  290.53     298.51  Mbits/s  301
 TCP upload::1  :   51.37      51.74  Mbits/s  197
 TCP upload::2  :   42.02      43.66  Mbits/s  192
 TCP upload::3  :   61.17      62.33  Mbits/s  200
 TCP upload::4  :   66.11      70.31  Mbits/s  198
 TCP upload::5  :   69.86      76.31  Mbits/s  202

wgong@wgong-Latitude-E5440-1:~/flent/5stream$ flent -H 192.168.1.7 -t "sk_pacing_shift8" tcp_nup --test-parameter upload_streams=5
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-14T110557.587769.sk_pacing_shift8.flent.gz.

Summary of tcp_nup test run 'sk_pacing_shift8' (at 2018-08-14 03:05:57.587769):

                      avg     median  # data pts
 Ping (ms) ICMP :   21.50      13.05  ms       350
 TCP upload avg :   61.59      62.00  Mbits/s  301
 TCP upload sum :  307.93     310.01  Mbits/s  301
 TCP upload::1  :   69.70      76.22  Mbits/s  210
 TCP upload::2  :   68.51      74.88  Mbits/s  207
 TCP upload::3  :   71.06      77.57  Mbits/s  200
 TCP upload::4  :   47.08      51.42  Mbits/s  196
 TCP upload::5  :   51.58      51.98  Mbits/s  203

sk_pacing_shift10:

wgong@wgong-Latitude-E5440-1:~/flent/5stream$ flent -H 192.168.1.7 -t "sk_pacing_shift10" tcp_nup --test-parameter upload_streams=5
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-14T110814.434543.sk_pacing_shift10.flent.gz.

Summary of tcp_nup test run 'sk_pacing_shift10' (at 2018-08-14 03:08:14.434543):

                      avg     median  # data pts
 Ping (ms) ICMP :   31.57      19.35  ms       350
 TCP upload avg :   56.61      62.61  Mbits/s  301
 TCP upload sum :  283.07     313.07  Mbits/s  301
 TCP upload::1  :   39.39      42.96  Mbits/s  187
 TCP upload::2  :   62.20      72.39  Mbits/s  193
 TCP upload::3  :   61.72      74.31  Mbits/s  191
 TCP upload::4  :   61.86      73.74  Mbits/s  190
 TCP upload::5  :   57.90      65.11  Mbits/s  193

wgong@wgong-Latitude-E5440-1:~/flent/5stream$ flent -H 192.168.1.7 -t "sk_pacing_shift10" tcp_nup --test-parameter upload_streams=5
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-14T110931.986159.sk_pacing_shift10.flent.gz.

Summary of tcp_nup test run 'sk_pacing_shift10' (at 2018-08-14 03:09:31.986159):

                      avg     median  # data pts
 Ping (ms) ICMP :   19.23      13.20  ms       350
 TCP upload avg :   76.36      81.37  Mbits/s  301
 TCP upload sum :  381.80     406.85  Mbits/s  301
 TCP upload::1  :   64.95      67.91  Mbits/s  212
 TCP upload::2  :   82.16      92.35  Mbits/s  215
 TCP upload::3  :   67.51      70.18  Mbits/s  213
 TCP upload::4  :   77.42      82.11  Mbits/s  232
 TCP upload::5  :   89.76      99.96  Mbits/s  226

>
> Thanks,
>
> -Toke
Wen Gong <wgong@qti.qualcomm.com> writes: >> -----Original Message----- >> From: Toke Høiland-Jørgensen <toke@toke.dk> >> Sent: Monday, August 13, 2018 7:18 PM >> To: Wen Gong <wgong@qti.qualcomm.com>; Wen Gong >> <wgong@codeaurora.org>; ath10k@lists.infradead.org; >> johannes@sipsolutions.net >> Cc: linux-wireless@vger.kernel.org >> Subject: RE: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi >> chips >> >> Wen Gong <wgong@qti.qualcomm.com> writes: >> >> > Hi Toke, >> > >> > I have tested with your method for shift 6/8/10/7 all twice time in >> > open air environment. >> >> Great, thanks! >> >> > Shift 6: >> > wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t >> > "sk_pacing_shift6" tcp_nup --test-parameter upload_streams=1 >> >> Just to be sure: You are running this on your laptop? And that laptop has the >> ath10k and the modified kernel you are testing? Is 192.168.1.7 the AP or >> another device? > Hi Toke, > Yes, I run on my laptop(Dell LATITUDE E5440 + QCA6174 PCI wireless card) in open air environment, > Kernel is 4.18.0-rc2-wt-ath+ #2 SMP which is built myself. > The mac80211.ko/ath10k is built with the 2 patches. > 192.168.1.7 is another device(PC installed with 4.4.0-116-generic #140~14.04.1-Ubuntu) >> >> I'd like to see what happens when the link is fully saturated my multiple >> streams as well. Could you repeat the tests with upload_streams=5, please? >> And if you could share the .flent.gz files (upload them somewhere, or email >> me off-list), that would be useful as well :) > > Test result with upload_streams=5: Sorry for the delay in looking at this, I've been terribly busy this week; will get back to it in more detail next week. One question for now: In the results you quote below the total throughput is way lower than the up to 600 mbps you quoted in the patch description. How did you run your original tests to achieve that high a throughput? -Toke
Hi Toke,

On 08/17/2018 04:32 AM, Toke Høiland-Jørgensen wrote:
>
>>>> Shift 6:
>>>> wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t
>>>> "sk_pacing_shift6" tcp_nup --test-parameter upload_streams=1
>>>

Does flent have options to set the TOS or QoS queue for TCP or UDP tests?
If I add "--test-parameter burst_tos=0xe0", will it use the corresponding
mac80211 QoS queue?

flent -H xxxx -t "xxxx" tcp_nup --test-parameter upload_streams=1 --test-parameter burst_tos=0xe0

Thanks,
Peter
On Sun, Aug 12, 2018 at 10:37 PM Wen Gong <wgong@qti.qualcomm.com> wrote:
...
> I have tested with your method for shift 6/8/10/7 all twice time in open air environment.
I've attached the graphed data which Wen provided me and I made one
"cleanup": the median and average "ICMP" are computed only using
values where flent has an "antagonist" load running. The first 28 and
last 26 seconds of the test are "idle" - it makes no sense to include
the latency of those. This drops about 1/7th of the data points.
My observations:
o the delta between average and median is evidence this is not
particularly reliable data (but expected result for "open air" wifi
testing in a corp environment).
o sk_pacing_shift=8 and/or =7 appears to have suffered from
significant open air interference - I've attached a graph of the raw
data.
o I'm comfortable asserting throughput is higher and latency is lower
for sk_pacing_shift=6 (vs 7) at the cost of additional CPU
utilization.
My questions:
1) Is the current conversation just quibbling about 6 vs 7 for the
ath10k default?
2) If yes, can both patches be accepted and then later add code to
select a default based on CPU available?
cheers,
grant
Peter Oh <peter.oh@bowerswilkins.com> writes: > Hi Toke, > > > On 08/17/2018 04:32 AM, Toke Høiland-Jørgensen wrote: >> >>>>> Shift 6: >>>>> wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t >>>>> "sk_pacing_shift6" tcp_nup --test-parameter upload_streams=1 >>>> > Does flent have options to set TOS or QoS queue for TCP or UDP test? > If I add "--test-paramete burst_tos=0xe0", will it use corresponding > mac80211 QoS queue? Yes, for some tests: - The RRUL test will run four TCP flows in each direction, one marked for each 802.11 QoS queue. - For the rtt_fair* tests you can set them with a test parameter: --test-parameter markings=BE,CS1 to set flows 1 and 2 to BE and CS1 respectively. You can use the rtt_fair_var* tests to get a variable number of flows (the same host can be specified multiple times by having different hostnames resolve to the same IP). -Toke
On Thu, 2018-08-30 at 16:32 -0700, Grant Grundler wrote: > On Sun, Aug 12, 2018 at 10:37 PM Wen Gong <wgong@qti.qualcomm.com> wrote: > ... > > I have tested with your method for shift 6/8/10/7 all twice time in open air environment. > > I've attached the graphed data which Wen provided me and I made one > "cleanup": the median and average "ICMP" are computed only using > values where flent has an "antagonist" load running. The first 28 and > last 26 seconds of the test are "idle" - it makes no sense to include > the latency of those. This drops about 1/7th of the data points. > > My observations: > o the delta between average and median is evidence this is not > particularly reliable data (but expected result for "open air" wifi > testing in a corp environment). > o sk_pacing_shift=8 and/or =7 appears to have suffered from > significant open air interference - I've attached a graph of the raw > data. > o I'm comfortable asserting throughput is higher and latency is lower > for sk_pacing_shift=6 (vs 7) at the cost of additional CPU > utilization. > > My questions: > 1) Is the current conversation just quibbling about 6 vs 7 for the > ath10k default? 6 vs. 8, I think? But I didn't follow the full discussion. I also think that we shouldn't necessarily restrict this to "for the ath10k". Is there any reason to think that this would be different for different chips? I could see a reason for a driver to set the default *higher* than the mac80211 default (needing more buffers for whatever internal reason), but I can't really see a reason it should be *lower*. Also, the networking-stack wide default is 0, after all, so it seems that each new layer might have some knowledge about why to *increase* it (like mac80211 knows that A-MPDUs/A-MSDUs need to be built), but it seems not so reasonable that it should be reduced again. Unless you have some specific knowledge that ath10k would have a smaller A-MSDU/A-MPDU size limit or something? > 2) If yes, can both patches be accepted and then later add code to > select a default based on CPU available? I've applied the first patch now (with some comment cleanups, you may want to pick those up for your internal usage) because I can see an argument for increasing it. However, I think that the "quibbling" over the default should most likely result in a changed default value for mac80211, rather than ath10k using the driver setting as a way out of that discussion. johannes
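[Editor's note] For concreteness, the deferred ath10k patch amounts to a one-line per-device override of the knob that the applied mac80211 patch exposes in struct ieee80211_hw. The sketch below is illustrative only: it assumes the tx_sk_pacing_shift field added by the mac80211 patch in this series, and the "needs_low_pacing_shift" hw_params flag is a hypothetical name standing in for however the real patch selects QCA6174/QCA9377 PCI; it is not the actual ath10k code.

```c
/*
 * Sketch, not the literal patch: assumes the tx_sk_pacing_shift field added
 * to struct ieee80211_hw by the mac80211 patch in this series. The
 * needs_low_pacing_shift hw_params flag is hypothetical.
 */
static void ath10k_mac_tune_pacing_shift(struct ath10k *ar)
{
	/* mac80211 already sets a default for every driver; only the chips
	 * benchmarked in this thread would override it. */
	if (ar->hw_params.needs_low_pacing_shift)
		ar->hw->tx_sk_pacing_shift = 6;	/* ~16 ms of TCP data per socket */
}
```

The override itself is trivial to add per chip; the open question in the thread is whether the number belongs in the driver at all or in mac80211's single default.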
Johannes Berg <johannes@sipsolutions.net> writes: > On Thu, 2018-08-30 at 16:32 -0700, Grant Grundler wrote: >> On Sun, Aug 12, 2018 at 10:37 PM Wen Gong <wgong@qti.qualcomm.com> wrote: >> ... >> > I have tested with your method for shift 6/8/10/7 all twice time in open air environment. >> >> I've attached the graphed data which Wen provided me and I made one >> "cleanup": the median and average "ICMP" are computed only using >> values where flent has an "antagonist" load running. The first 28 and >> last 26 seconds of the test are "idle" - it makes no sense to include >> the latency of those. This drops about 1/7th of the data points. >> >> My observations: >> o the delta between average and median is evidence this is not >> particularly reliable data (but expected result for "open air" wifi >> testing in a corp environment). >> o sk_pacing_shift=8 and/or =7 appears to have suffered from >> significant open air interference - I've attached a graph of the raw >> data. >> o I'm comfortable asserting throughput is higher and latency is lower >> for sk_pacing_shift=6 (vs 7) at the cost of additional CPU >> utilization. >> >> My questions: >> 1) Is the current conversation just quibbling about 6 vs 7 for the >> ath10k default? > > 6 vs. 8, I think? But I didn't follow the full discussion. > > I also think that we shouldn't necessarily restrict this to "for the > ath10k". Is there any reason to think that this would be different for > different chips? No, I also think it should be possible to select a value that will work for all drivers. There's a strong diminishing returns effect here after a certain point, and I believe we should strongly push back against just adding more buffering to chase (single-flow) throughput numbers; and I'm not convinced that having different values for different drivers is worth the complexity. As far as I'm concerned, just showing an increase in maximum throughput under ideal conditions (as this initial patch submission did) is not sufficient argument for adding more buffering. At a minimum, the evaluation should also include: - Impact on latency and throughput for competing flows in the same test. - Impact on latency for the TCP flow itself. - A full evaluation at low rates and in scenarios with more than one client. As far as the actual value, I *think* it may be that the default shift should be 7 (8 ms) rather than 8 (4 ms) as it currently is. Going back and looking at my data from when I submitted the original patch, it looks like the point of diminishing returns is somewhere between those two with ath9k (where I did most of my testing), and it seems reasonable that it could be slightly higher (i.e., more buffering) for ath10k. -Toke
On Mon, 2018-09-03 at 13:11 +0200, Toke Høiland-Jørgensen wrote: > > 6 vs. 8, I think? But I didn't follow the full discussion. Err, I just realized that I was completely wrong - the default, of course, is 10. So smaller values mean more buffering. Most of my argumentation was thus utter garbage :-) > > I also think that we shouldn't necessarily restrict this to "for the > > ath10k". Is there any reason to think that this would be different for > > different chips? > > No, I also think it should be possible to select a value that will work > for all drivers. There's a strong diminishing returns effect here after > a certain point, and I believe we should strongly push back against just > adding more buffering to chase (single-flow) throughput numbers; and I'm > not convinced that having different values for different drivers is > worth the complexity. I think I can see some point in it if the driver requires some buffering for some specific reason? But you'd have to be able to state that reason, I guess, I could imagine having a firmware limitation to need to build two aggregates, or something around MU-MIMO, etc. > As far as I'm concerned, just showing an increase in maximum throughput > under ideal conditions (as this initial patch submission did) is not > sufficient argument for adding more buffering. At a minimum, the > evaluation should also include: > > - Impact on latency and throughput for competing flows in the same test. > - Impact on latency for the TCP flow itself. > - A full evaluation at low rates and in scenarios with more than one > client. That seems reasonable. > As far as the actual value, I *think* it may be that the default shift > should be 7 (8 ms) rather than 8 (4 ms) as it currently is. Going back > and looking at my data from when I submitted the original patch, it > looks like the point of diminishing returns is somewhere between those > two with ath9k (where I did most of my testing), and it seems reasonable > that it could be slightly higher (i.e., more buffering) for ath10k. Grant's data shows a significant difference between 6 and 7 for both latency and throughput: * median tpt - ~241 vs ~201 (both 1 and 5 streams) * median latency - 7.5 vs 6 (1 stream) - 17.3 vs. 16.6 (5 streams) A 20% throughput improvement at <= 1.5ms latency cost seems like a pretty reasonable trade-off? johannes
Johannes Berg <johannes@sipsolutions.net> writes: > On Mon, 2018-09-03 at 13:11 +0200, Toke Høiland-Jørgensen wrote: > >> > 6 vs. 8, I think? But I didn't follow the full discussion. > > Err, I just realized that I was completely wrong - the default, of > course, is 10. So smaller values mean more buffering. > > Most of my argumentation was thus utter garbage :-) Well, I got the gist, even if the sign bit was wrong ;) >> > I also think that we shouldn't necessarily restrict this to "for the >> > ath10k". Is there any reason to think that this would be different for >> > different chips? >> >> No, I also think it should be possible to select a value that will work >> for all drivers. There's a strong diminishing returns effect here after >> a certain point, and I believe we should strongly push back against just >> adding more buffering to chase (single-flow) throughput numbers; and I'm >> not convinced that having different values for different drivers is >> worth the complexity. > > I think I can see some point in it if the driver requires some > buffering for some specific reason? But you'd have to be able to state > that reason, I guess, I could imagine having a firmware limitation to > need to build two aggregates, or something around MU-MIMO, etc. Right, I'm not ruling out that there can be legitimate reasons to add extra buffering; but a lot of times it's just used to paper over other issues, so a good explanation is definitely needed... >> As far as the actual value, I *think* it may be that the default shift >> should be 7 (8 ms) rather than 8 (4 ms) as it currently is. Going back >> and looking at my data from when I submitted the original patch, it >> looks like the point of diminishing returns is somewhere between those >> two with ath9k (where I did most of my testing), and it seems reasonable >> that it could be slightly higher (i.e., more buffering) for ath10k. > > Grant's data shows a significant difference between 6 and 7 for both > latency and throughput: > > * median tpt > - ~241 vs ~201 (both 1 and 5 streams) > * median latency > - 7.5 vs 6 (1 stream) > - 17.3 vs. 16.6 (5 streams) > > A 20% throughput improvement at <= 1.5ms latency cost seems like a > pretty reasonable trade-off? Yeah, on it's face. What I'm bothered about is that it is the exact opposite results that I got from my ath10k tests (there, throughput *dropped* and latency doubled when going to from 4 to 16 ms of buffering). And, well, Grant's data is from a single test in a noisy environment where the timeseries graph shows that throughput is all over the place for the duration of the test; so it's hard to draw solid conclusions from (for instance, for the 5-stream test, the average throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7 it's 326 and 371 Mbps) . Unfortunately I don't have the same hardware used in this test, so I can't go verify it myself; so the only thing I can do is grumble about it here... :) -Toke
I have not been on this thread (I have had to shut down my wifi lab and am not planning on doing any more wifi work in the future). But for what it's worth: * tcp_pacing_shift affects the performance of the tcp stack only. I think 16ms is an outrageous amount, and I'm thinking we actually have a problem on the other side of the link in getting acks out as I write. That said, I don't care all that much compared to the bad old days ( http://blog.cerowrt.org/tags/wifi ) and 16ms is far better than the seconds it used to be. That said, I'd rather appreciate a test on HT20 with the same chipset at a lower rate. The hope was that we'd achieve flat latency at every rate (see http://blog.cerowrt.org/post/real_results/ for some lovely pics) Flent can sample minstrel stats but I think we're SOL on the ath10k. * The irtt test can measure one way delays pretty accurately so you can see if the other side is actually the source of this issue. * Not clear to me if the AP is also running the fq_codel for wifi code? That, um, er, helps an AP enormously.... * Having an aircap and regular capture of what's going on during these tests would be useful. Flent is a great test driver, but a tcpdump taken during a flent test tells the truth via tcptrace and xplot.org. We can inspect cwnd, rwnd, and look at drops. (actually I usually turn ECN on so I can see when the codel algorithm is kicking in). Being able to zoom in on the early ramp would be helpful. Etc. I'll take a look at them if you'll supply them. * At some point the ath10k driver also gained NAPI support. (?) I don't think NAPI is a good fit for wifi. Delays in TCP RX processing lead to less throughput. * The firmware is perpetually trying to do more stuff with less interrupts. Me, I dreamed of a register you could poll for "when will the air be clear", and a "tx has one ms left to go, give me some more data" interrupt. The fact we still have a minimum of ~12ms of uncontrolled buffering bugs me. If we could just hold off submittal until just before it was needed, we could cut it in half (and further, by 10s of ms, when the media is contended). * My general recomendation for contended APs was that they advertise a reduced TXOP as the number of active stations go up. (this also applies to the VI/VO queues). Never got around to algorizing it... I gave up on my links and, disable VO/VI/BK entirely and just tell hostapd to advertise 2ms. Yep, kills conventional measures of throughput. Cuts inter-station service time by 2/3s though, and *that* you can notice and measure with PLT benchmarks and overall feel. This implies increasing the pacing shift to suit the advertised size of the txop, not decreasing it - and I have no idea how to get that from one place to another. Same goes for stuff destined for the VI/VO queues. * (also it would be good to see in the capture what the beacon says) * Powersave has always been a problem... * Cubic is a terrible tcp for wifi. BBR, however, is possibly worse... or better. It would be educational to try running BBR on this link at either shift setting. (seriously!) TCP is not TCP is not TCP.... Lastly... tcp_pacing_shift does not affect the number of other packets that can get into a given txop given the other buffering - it's one in the hardware, one ready to go, one being formed, try setting pacing_shift to to 4 to see what happens....
While I'm being overly vague and general... * Doing some form of ack compression (whether in the tcp stack or in mac80211) seems useful. It's really hard to fill 4Mbytes one way and the needed number of acks the other. I'd also pointed out (on some thread on the make-wifi-fast list that I can't find right now) that drivers had a tendency to eat an entire txop for that first single ack and ping pong between empty and full on a one way tcp benchmark. * Similarly grabbing air during the contention window. If the sender is grabbing 3x as much air as the reciever (A1 A2 A3 B1 A4 B2) * I know of one org that upped the codel target value for 802.11ac with decent results but I can't talk about it... (sort of why I asked for a cap). My problem with decreasing the pacing shift is that it makes for less interleaving (I think! have to look), where increasing the target compensates for more jittery networks. ('course my overall theme of my prior post was to reduce jittery networks with less buffering and smaller txops overall) * Is this a two device test? or a test through an AP? Looking over this thread, I see a suggestion is to expose the pacing rate via a sysctl. I don't think there's any one answer, either, but I'm not sure this is the right variable to expose. Perhaps a tunable for latency that might (one day) look at multiple variables and decide? Getting this more right is really a job for an algorithm of minstrel-like complexity.
On Mon, Sep 3, 2018 at 6:35 AM Toke Høiland-Jørgensen <toke@toke.dk> wrote: > > Johannes Berg <johannes@sipsolutions.net> writes: .... > > Grant's data shows a significant difference between 6 and 7 for both > > latency and throughput: Minor nit: this is wgong's data more thoughtfully processed. > > > > * median tpt > > - ~241 vs ~201 (both 1 and 5 streams) > > * median latency > > - 7.5 vs 6 (1 stream) > > - 17.3 vs. 16.6 (5 streams) > > > > A 20% throughput improvement at <= 1.5ms latency cost seems like a > > pretty reasonable trade-off? > > Yeah, on it's face. What I'm bothered about is that it is the exact > opposite results that I got from my ath10k tests (there, throughput > *dropped* and latency doubled when going to from 4 to 16 ms of > buffering). Hrm, yeah...that would bother me too. I think even if we don't understand why/how that happened, at some level we need to allow subsystems or drivers to adjust sk_pacing_shift value. Changing sk_pacing_shift clearly has an effect that can be optimized. If smaller values of sk_pacing_shift is increasing the interval (and allows more buffering), I'm wondering why CPU utilization gets higher. More buffering is usually more efficient. :/ wgong: can you confirm (a) I've entered the data correctly in the spreadsheet and (b) you've labeled the data sets correctly when you generated the data? If either of us made a mistake, it would be good to know. :) I'm going to speculate that "other processing" (e.g. device level interrupt mitigation or possibly CPU scheduling behaviors which handles TX/RX completion) could cause a "bathtub" effect similar to the performance graphs that originally got NAPI accepted into the kernel ~15 years ago. So the "optimal" value could be different for different architectures and different IO devices (which have different max link rates and different host bus interconnects). But honestly, I don't understand all the details of sk_pacing_shift value nearly as well as just about anyone else reading this thread. > And, well, Grant's data is from a single test in a noisy > environment where the time series graph shows that throughput is all over > the place for the duration of the test; so it's hard to draw solid > conclusions from (for instance, for the 5-stream test, the average > throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7 > it's 326 and 371 Mbps) . Unfortunately I don't have the same hardware > used in this test, so I can't go verify it myself; so the only thing I > can do is grumble about it here... :) It's a fair complaint and I agree with it. My counter argument is the opposite is true too: most ideal benchmarks don't measure what most users see. While the data wgong provided are way more noisy than I like, my overall "confidence" in the "conclusion" I offered is still positive. cheers, grant
On 2018-09-05 07:43, Grant Grundler wrote: > On Mon, Sep 3, 2018 at 6:35 AM Toke Høiland-Jørgensen <toke@toke.dk> > wrote: >> >> Johannes Berg <johannes@sipsolutions.net> writes: > .... >> > Grant's data shows a significant difference between 6 and 7 for both >> > latency and throughput: > > Minor nit: this is wgong's data more thoughtfully processed. > > wgong: can you confirm (a) I've entered the data correctly in the > spreadsheet and (b) you've labeled the data sets correctly when you > generated the data? > If either of us made a mistake, it would be good to know. :) > yes, the data is the test result from flent tool.
Grant Grundler <grundler@google.com> writes: >> And, well, Grant's data is from a single test in a noisy >> environment where the time series graph shows that throughput is all over >> the place for the duration of the test; so it's hard to draw solid >> conclusions from (for instance, for the 5-stream test, the average >> throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7 >> it's 326 and 371 Mbps) . Unfortunately I don't have the same hardware >> used in this test, so I can't go verify it myself; so the only thing I >> can do is grumble about it here... :) > > It's a fair complaint and I agree with it. My counter argument is the > opposite is true too: most ideal benchmarks don't measure what most > users see. While the data wgong provided are way more noisy than I > like, my overall "confidence" in the "conclusion" I offered is still > positive. Right. I guess I would just prefer a slightly more comprehensive evaluation to base a 4x increase in buffer size on... -Toke
On Thu, Sep 6, 2018 at 3:18 AM Toke Høiland-Jørgensen <toke@toke.dk> wrote: > > Grant Grundler <grundler@google.com> writes: > > >> And, well, Grant's data is from a single test in a noisy > >> environment where the time series graph shows that throughput is all over > >> the place for the duration of the test; so it's hard to draw solid > >> conclusions from (for instance, for the 5-stream test, the average > >> throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7 > >> it's 326 and 371 Mbps) . Unfortunately I don't have the same hardware > >> used in this test, so I can't go verify it myself; so the only thing I > >> can do is grumble about it here... :) > > > > It's a fair complaint and I agree with it. My counter argument is the > > opposite is true too: most ideal benchmarks don't measure what most > > users see. While the data wgong provided are way more noisy than I > > like, my overall "confidence" in the "conclusion" I offered is still > > positive. > > Right. I guess I would just prefer a slightly more comprehensive > evaluation to base a 4x increase in buffer size on... Kalle, is this why you didn't accept this patch? Other reasons? Toke, what else would you like to see evaluated? I generally want to see three things measured when "benchmarking" technologies: throughput, latency, cpu utilization We've covered those three I think "reasonably". What does a "4x increase in memory" mean here? Wen, how much more memory does this cause ath10k to use? If a "4x increase in memory" means I'm using 1MB instead of 256KB, I'm not going worry about that on a system with 2GB-16GB of RAM if it doubles the throughput of the WIFI for a given workload. I expect routers with 128-256MB RAM would make that tradeoff as well assuming they don't have other RAM-demanding workload. cheers, grant
Grant Grundler <grundler@google.com> writes: > On Thu, Sep 6, 2018 at 3:18 AM Toke Høiland-Jørgensen <toke@toke.dk> wrote: >> >> Grant Grundler <grundler@google.com> writes: >> >> >> And, well, Grant's data is from a single test in a noisy >> >> environment where the time series graph shows that throughput is all over >> >> the place for the duration of the test; so it's hard to draw solid >> >> conclusions from (for instance, for the 5-stream test, the average >> >> throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7 >> >> it's 326 and 371 Mbps) . Unfortunately I don't have the same hardware >> >> used in this test, so I can't go verify it myself; so the only thing I >> >> can do is grumble about it here... :) >> > >> > It's a fair complaint and I agree with it. My counter argument is the >> > opposite is true too: most ideal benchmarks don't measure what most >> > users see. While the data wgong provided are way more noisy than I >> > like, my overall "confidence" in the "conclusion" I offered is still >> > positive. >> >> Right. I guess I would just prefer a slightly more comprehensive >> evaluation to base a 4x increase in buffer size on... > > Kalle, is this why you didn't accept this patch? Other reasons? Just for reference, here are patchwork links: https://patchwork.kernel.org/cover/10559729/ https://patchwork.kernel.org/patch/10559731/ https://patchwork.kernel.org/patch/10559733/ The mac80211 patch is accepted and the ath10k patch is deferred. IIRC I put it deferred as I didn't see a real consensus about the patch and was supposed to look at it later, but haven't done yet. I don't have any issues with the patch, except maybe removing the duplicate define for 9377 (which I can easily fix in the pending branch).
Grant Grundler <grundler@google.com> writes: > On Thu, Sep 6, 2018 at 3:18 AM Toke Høiland-Jørgensen <toke@toke.dk> wrote: >> >> Grant Grundler <grundler@google.com> writes: >> >> >> And, well, Grant's data is from a single test in a noisy >> >> environment where the time series graph shows that throughput is all over >> >> the place for the duration of the test; so it's hard to draw solid >> >> conclusions from (for instance, for the 5-stream test, the average >> >> throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7 >> >> it's 326 and 371 Mbps) . Unfortunately I don't have the same hardware >> >> used in this test, so I can't go verify it myself; so the only thing I >> >> can do is grumble about it here... :) >> > >> > It's a fair complaint and I agree with it. My counter argument is the >> > opposite is true too: most ideal benchmarks don't measure what most >> > users see. While the data wgong provided are way more noisy than I >> > like, my overall "confidence" in the "conclusion" I offered is still >> > positive. >> >> Right. I guess I would just prefer a slightly more comprehensive >> evaluation to base a 4x increase in buffer size on... > > Kalle, is this why you didn't accept this patch? Other reasons? > > Toke, what else would you like to see evaluated? > > I generally want to see three things measured when "benchmarking" > technologies: throughput, latency, cpu utilization > We've covered those three I think "reasonably". Hmm, going back and looking at this (I'd completely forgotten about this patch), I think I had two main concerns: 1. What happens in a degraded signal situation, where the throughput is limited by the signal conditions, or by contention with other devices. Both of these happen regularly, and I worry that latency will be badly affected under those conditions. 2. What happens with old hardware that has worse buffer management in the driver->firmware path (especially drivers without push/pull mode support)? For these, the lower-level queueing structure is less effective at controlling queueing latency. Getting the queue size limit patches from ChromeOS ported would alleviate point 2. I do believe Kan said he'd look into that once the airtime patches were merged. So Kan, any progress on that front? :) > What does a "4x increase in memory" mean here? Wen, how much more > memory does this cause ath10k to use? I didn't say "memory", I said "buffer size"... :) I.e., it's the latency impact of the increased buffering I'm worried about (see above), not the system memory usage. -Toke
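[Editor's note] To put rough numbers on the "buffer size, not memory" distinction: the sk_pacing_shift limit is proportional to the socket's pacing rate, so the byte (memory) cost shrinks as the link degrades, while the per-flow queueing delay stays at roughly 2^-shift seconds. A small illustrative calculation; the rates below are arbitrary examples, not measurements from this thread:

```c
/*
 * Illustrative arithmetic only: with a fixed sk_pacing_shift, the bytes a
 * socket may queue scale with the achieved rate, but the time those bytes
 * represent stays ~2^-shift seconds. Rates are assumed examples.
 */
#include <stdio.h>

int main(void)
{
	const int shift = 6;	/* the value proposed for ath10k QCA6174/QCA9377 */
	const double rates_mbps[] = { 600.0, 200.0, 50.0 };
	const int n = sizeof(rates_mbps) / sizeof(rates_mbps[0]);

	for (int i = 0; i < n; i++) {
		double bytes = rates_mbps[i] * 1e6 / 8.0 / (1 << shift);

		printf("%3.0f Mbit/s: ~%4.0f KB queued below the socket (~%.1f ms of data)\n",
		       rates_mbps[i], bytes / 1024.0, 1000.0 / (1 << shift));
	}
	return 0;
}
```

The ~1 MB figure mentioned earlier corresponds to the high-rate end of that range (shift 8 would cut it to roughly a quarter); the latency concern is about the fixed ~16 ms per flow, which is what the degraded-link and multi-client questions above are probing.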
Toke Høiland-Jørgensen <toke@toke.dk> writes: > Grant Grundler <grundler@google.com> writes: > >> On Thu, Sep 6, 2018 at 3:18 AM Toke Høiland-Jørgensen <toke@toke.dk> wrote: >>> >>> Grant Grundler <grundler@google.com> writes: >>> >>> >> And, well, Grant's data is from a single test in a noisy >>> >> environment where the time series graph shows that throughput is all over >>> >> the place for the duration of the test; so it's hard to draw solid >>> >> conclusions from (for instance, for the 5-stream test, the average >>> >> throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7 >>> >> it's 326 and 371 Mbps) . Unfortunately I don't have the same hardware >>> >> used in this test, so I can't go verify it myself; so the only thing I >>> >> can do is grumble about it here... :) >>> > >>> > It's a fair complaint and I agree with it. My counter argument is the >>> > opposite is true too: most ideal benchmarks don't measure what most >>> > users see. While the data wgong provided are way more noisy than I >>> > like, my overall "confidence" in the "conclusion" I offered is still >>> > positive. >>> >>> Right. I guess I would just prefer a slightly more comprehensive >>> evaluation to base a 4x increase in buffer size on... >> >> Kalle, is this why you didn't accept this patch? Other reasons? >> >> Toke, what else would you like to see evaluated? >> >> I generally want to see three things measured when "benchmarking" >> technologies: throughput, latency, cpu utilization >> We've covered those three I think "reasonably". > > Hmm, going back and looking at this (I'd completely forgotten about this > patch), I think I had two main concerns: > > 1. What happens in a degraded signal situation, where the throughput is > limited by the signal conditions, or by contention with other devices. > Both of these happen regularly, and I worry that latency will be > badly affected under those conditions. > > 2. What happens with old hardware that has worse buffer management in > the driver->firmware path (especially drivers without push/pull mode > support)? For these, the lower-level queueing structure is less > effective at controlling queueing latency. Do note that this patch changes behaviour _only_ for QCA6174 and QCA9377 PCI devices, which IIRC do not even support push/pull mode. All the rest, including QCA988X and QCA9984 are unaffected.
On 2/21/19 8:10 AM, Kalle Valo wrote: > Toke Høiland-Jørgensen <toke@toke.dk> writes: > >> Grant Grundler <grundler@google.com> writes: >> >>> On Thu, Sep 6, 2018 at 3:18 AM Toke Høiland-Jørgensen <toke@toke.dk> wrote: >>>> >>>> Grant Grundler <grundler@google.com> writes: >>>> >>>>>> And, well, Grant's data is from a single test in a noisy >>>>>> environment where the time series graph shows that throughput is all over >>>>>> the place for the duration of the test; so it's hard to draw solid >>>>>> conclusions from (for instance, for the 5-stream test, the average >>>>>> throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7 >>>>>> it's 326 and 371 Mbps) . Unfortunately I don't have the same hardware >>>>>> used in this test, so I can't go verify it myself; so the only thing I >>>>>> can do is grumble about it here... :) >>>>> >>>>> It's a fair complaint and I agree with it. My counter argument is the >>>>> opposite is true too: most ideal benchmarks don't measure what most >>>>> users see. While the data wgong provided are way more noisy than I >>>>> like, my overall "confidence" in the "conclusion" I offered is still >>>>> positive. >>>> >>>> Right. I guess I would just prefer a slightly more comprehensive >>>> evaluation to base a 4x increase in buffer size on... >>> >>> Kalle, is this why you didn't accept this patch? Other reasons? >>> >>> Toke, what else would you like to see evaluated? >>> >>> I generally want to see three things measured when "benchmarking" >>> technologies: throughput, latency, cpu utilization >>> We've covered those three I think "reasonably". >> >> Hmm, going back and looking at this (I'd completely forgotten about this >> patch), I think I had two main concerns: >> >> 1. What happens in a degraded signal situation, where the throughput is >> limited by the signal conditions, or by contention with other devices. >> Both of these happen regularly, and I worry that latency will be >> badly affected under those conditions. >> >> 2. What happens with old hardware that has worse buffer management in >> the driver->firmware path (especially drivers without push/pull mode >> support)? For these, the lower-level queueing structure is less >> effective at controlling queueing latency. > > Do note that this patch changes behaviour _only_ for QCA6174 and QCA9377 > PCI devices, which IIRC do not even support push/pull mode. All the > rest, including QCA988X and QCA9984 are unaffected. Just as a note, at least kernels such as 4.14.whatever perform poorly when running ath10k on 9984 when acting as TCP endpoints. This makes them not really usable for stuff like serving video to lots of clients. Tweaking TCP (I do it a bit differently, but either way) can significantly improve performance. Recently I helped a user that could get barely 70 stations streaming at 1Mbps on stock kernel (using one wave1 on 2.4, one wave-2 on 5Ghz), and we got 110 working with a tweaked TCP stack. These were /n stations too. I think it is lame that it _still_ requires out of tree patches to make TCP work well on ath10k...even if you want to default to current behaviour, you should allow users to tweak it to work with their use case. Thanks, Ben
Kalle Valo <kvalo@codeaurora.org> writes:

> Do note that this patch changes behaviour _only_ for QCA6174 and QCA9377
> PCI devices, which IIRC do not even support push/pull mode. All the
> rest, including QCA988X and QCA9984, are unaffected.

Ah, right; I did not go all the way back and look at the actual patch,
so missed that :)

But in that case, why are the latency results that low? Were these tests
done with the ChromeOS queue limit patches?

-Toke
Ben Greear <greearb@candelatech.com> writes:

> Just as a note, at least kernels such as 4.14.whatever perform poorly when
> running ath10k on 9984 when acting as TCP endpoints. This makes them not
> really usable for stuff like serving video to lots of clients.
>
> Tweaking TCP (I do it a bit differently, but either way) can significantly
> improve performance.

Differently how? Did you have to do more than fiddle with the pacing_shift?

> Recently I helped a user that could get barely 70 stations streaming
> at 1Mbps on stock kernel (using one wave1 on 2.4, one wave-2 on 5Ghz),
> and we got 110 working with a tweaked TCP stack. These were /n
> stations too.
>
> I think it is lame that it _still_ requires out of tree patches to
> make TCP work well on ath10k...even if you want to default to current
> behaviour, you should allow users to tweak it to work with their use
> case.

Well if TCP is broken to the point of being unusable I do think we
should fix it; but I think "just provide a configuration knob" should be
the last resort...

-Toke
On 2/21/19 8:37 AM, Toke Høiland-Jørgensen wrote:
>> Tweaking TCP (I do it a bit differently, but either way) can significantly
>> improve performance.
>
> Differently how? Did you have to do more than fiddle with the pacing_shift?

This one, or a slightly tweaked version that applies to different kernels:

https://github.com/greearb/linux-ct-4.16/commit/3e14e8491a5b31ce994fb2752347145e6ab7eaf5

>> I think it is lame that it _still_ requires out of tree patches to
>> make TCP work well on ath10k...even if you want to default to current
>> behaviour, you should allow users to tweak it to work with their use
>> case.
>
> Well if TCP is broken to the point of being unusable I do think we
> should fix it; but I think "just provide a configuration knob" should be
> the last resort...

So, it has been broken for years, and waiting for a perfect solution
has not gotten the problem fixed.

Thanks,
Ben
Ben Greear <greearb@candelatech.com> writes:

>> Differently how? Did you have to do more than fiddle with the pacing_shift?
>
> This one, or a slightly tweaked version that applies to different kernels:
>
> https://github.com/greearb/linux-ct-4.16/commit/3e14e8491a5b31ce994fb2752347145e6ab7eaf5

Right; but the current mac80211 default (pacing shift 8) corresponds to
setting your sysctl to 4...

>> Well if TCP is broken to the point of being unusable I do think we
>> should fix it; but I think "just provide a configuration knob" should be
>> the last resort...
>
> So, it has been broken for years, and waiting for a perfect solution
> has not gotten the problem fixed.

Well, the current default should at least be closer to something that
works well.

I do think I may have erred on the wrong side of the optimum when I
submitted the original patch to set the default to 8; that should
probably have been 7 (i.e., 8 ms; the optimum in the evaluation we did
was around 6 ms, which is sadly not a power of two). Maybe changing that
default is actually better than having to redo the testing for all the
different devices as we're discussing in the context of this patch.
Maybe I should just send a patch to do that...

-Toke
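For reference, the arithmetic behind these shift values: TCP's small-queues
limit is roughly sk_pacing_rate >> sk_pacing_shift, i.e. about 1/2^shift
seconds' worth of data at the current pacing rate, so shift 8 allows ~4 ms
of buffering, 7 allows ~8 ms and 6 allows ~16 ms. A minimal standalone
sketch of that mapping (the 600 Mbps link rate is an assumption chosen
purely for illustration, not a number from this thread):

    /* sketch: how sk_pacing_shift maps to in-flight data and queueing time */
    #include <stdio.h>

    int main(void)
    {
            const double rate_mbps = 600.0;                  /* assumed link rate */
            const double rate_bytes = rate_mbps * 1e6 / 8.0; /* bytes per second */
            int shift;

            for (shift = 6; shift <= 10; shift++) {
                    double bytes = rate_bytes / (1 << shift); /* ~pacing_rate >> shift */
                    double ms = 1000.0 / (1 << shift);        /* time to drain that data */

                    printf("shift %2d: ~%7.0f bytes queued, ~%4.1f ms of buffering\n",
                           shift, bytes, ms);
            }
            return 0;
    }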
On 2/21/19 9:15 AM, Toke Høiland-Jørgensen wrote:
> Well, the current default should at least be closer to something that
> works well.
>
> I do think I may have erred on the wrong side of the optimum when I
> submitted the original patch to set the default to 8; that should
> probably have been 7 (i.e., 8 ms; the optimum in the evaluation we did
> was around 6 ms, which is sadly not a power of two). Maybe changing that
> default is actually better than having to redo the testing for all the
> different devices as we're discussing in the context of this patch.
> Maybe I should just send a patch to do that...

It is hubris to think one setting works well for everyone. Sure, set a
good default, but also let people tune the value.

And send the patches to stable so that users on older kernels can have
good performance.

Thanks,
Ben
Ben Greear <greearb@candelatech.com> writes:

> It is hubris to think one setting works well for everyone. Sure, set a
> good default, but also let people tune the value.

Well I certainly won't object to a knob; I just don't think that most
people are going to use it, so we better make the default reasonable.

> And send the patches to stable so that users on older kernels can have
> good performance.

Sure, I can submit stable backports if needed :)

-Toke
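To illustrate the kind of knob being discussed (purely a hypothetical
sketch, not Ben's out-of-tree patch and not anything merged upstream), a
driver could expose the shift as a module parameter and feed it into the
ieee80211_hw field that the diff below sets, for example:

    /* Hypothetical ath10k module parameter; 0 = keep the hw_params default.
     * It would only be read once, when the MAC is registered, hence 0444.
     */
    static unsigned int ath10k_tx_sk_pacing_shift;
    module_param_named(tx_sk_pacing_shift, ath10k_tx_sk_pacing_shift, uint, 0444);
    MODULE_PARM_DESC(tx_sk_pacing_shift,
                     "Override sk_pacing_shift for TCP flows (0 = driver default)");

    /* ...and in ath10k_mac_register(), next to the mac.c hunk in the diff below: */
    if (ath10k_tx_sk_pacing_shift)
            ar->hw->tx_sk_pacing_shift = ath10k_tx_sk_pacing_shift;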
Wen Gong <wgong@codeaurora.org> wrote:

> Upstream kernel has an interface to help adjust sk_pacing_shift to help
> improve TCP UL throughput.
>
> The sk_pacing_shift is 8 in mac80211; this is based on tests with 11N
> WiFi chips with ath9k. For QCA6174/QCA9377 PCI 11AC chips, the 11AC
> VHT80 TCP UL throughput testing result shows 6 is the optimal.
> Override sk_pacing_shift to 6 in the ath10k driver for QCA6174/9377 PCI.
>
> Tested with QCA6174 PCI with firmware WLAN.RM.4.4.1-00109-QCARMSWPZ-1,
> but this will also affect QCA9377 PCI. It's not a regression with new
> firmware releases.
>
> There are 2 test results with different settings:
>
> ARM CPU based device with QCA6174A PCI with different sk_pacing_shift:
>
>   sk_pacing_shift  throughput(Mbps)  CPU utilization
>   6                500(-P5)          ~75% idle, Focus on CPU1: ~14% idle
>   7                454(-P5)          ~80% idle, Focus on CPU1: ~4% idle
>   8                288               ~90% idle, Focus on CPU1: ~35% idle
>   9                ~200              ~92% idle, Focus on CPU1: ~50% idle
>
> 5G TCP UL VHT80 on X86 platform with QCA6174A PCI with sk_pacing_shift
> set to 6:
>
>   tcp_limit_output_bytes        throughput(Mbps)
>   default(262144)  + 1 Stream   336
>   default(262144)  + 2 Streams  558
>   default(262144)  + 3 Streams  584
>   default(262144)  + 4 Streams  602
>   default(262144)  + 5 Streams  598
>   changed(2621440) + 1 Stream   598
>   changed(2621440) + 2 Streams  601
>
> Signed-off-by: Wen Gong <wgong@codeaurora.org>

The final result of this patch is unclear so I'm dropping this. Please
resend if the issue still exists.

Patch set to Changes Requested.
diff --git a/drivers/net/wireless/ath/ath10k/core.c b/drivers/net/wireless/ath/ath10k/core.c
index 85c58eb..fbd13ec 100644
--- a/drivers/net/wireless/ath/ath10k/core.c
+++ b/drivers/net/wireless/ath/ath10k/core.c
@@ -189,6 +189,7 @@
 		.per_ce_irq = false,
 		.shadow_reg_support = false,
 		.rri_on_ddr = false,
+		.tx_sk_pacing_shift = SK_PACING_SHIFT_6174,
 	},
 	{
 		.id = QCA6174_HW_2_1_VERSION,
@@ -221,6 +222,7 @@
 		.per_ce_irq = false,
 		.shadow_reg_support = false,
 		.rri_on_ddr = false,
+		.tx_sk_pacing_shift = SK_PACING_SHIFT_6174,
 	},
 	{
 		.id = QCA6174_HW_3_0_VERSION,
@@ -253,6 +255,7 @@
 		.per_ce_irq = false,
 		.shadow_reg_support = false,
 		.rri_on_ddr = false,
+		.tx_sk_pacing_shift = SK_PACING_SHIFT_6174,
 	},
 	{
 		.id = QCA6174_HW_3_2_VERSION,
@@ -288,6 +291,7 @@
 		.per_ce_irq = false,
 		.shadow_reg_support = false,
 		.rri_on_ddr = false,
+		.tx_sk_pacing_shift = SK_PACING_SHIFT_6174,
 	},
 	{
 		.id = QCA99X0_HW_2_0_DEV_VERSION,
@@ -443,6 +447,7 @@
 		.per_ce_irq = false,
 		.shadow_reg_support = false,
 		.rri_on_ddr = false,
+		.tx_sk_pacing_shift = SK_PACING_SHIFT_9377,
 	},
 	{
 		.id = QCA9377_HW_1_1_DEV_VERSION,
@@ -477,6 +482,7 @@
 		.per_ce_irq = false,
 		.shadow_reg_support = false,
 		.rri_on_ddr = false,
+		.tx_sk_pacing_shift = SK_PACING_SHIFT_9377,
 	},
 	{
 		.id = QCA4019_HW_1_0_DEV_VERSION,
diff --git a/drivers/net/wireless/ath/ath10k/hw.h b/drivers/net/wireless/ath/ath10k/hw.h
index a274bd8..1f956d6 100644
--- a/drivers/net/wireless/ath/ath10k/hw.h
+++ b/drivers/net/wireless/ath/ath10k/hw.h
@@ -161,6 +161,9 @@ enum qca9377_chip_id_rev {
 
 #define REG_DUMP_COUNT_QCA988X 60
 
+#define SK_PACING_SHIFT_6174 6
+#define SK_PACING_SHIFT_9377 6
+
 struct ath10k_fw_ie {
 	__le32 id;
 	__le32 len;
@@ -589,6 +592,9 @@ struct ath10k_hw_params {
 
 	/* Number of bytes to be the offset for each FFT sample */
 	int spectral_bin_offset;
+
+	/* Number of shift to override the default value of ieee80211_hw*/
+	u8 tx_sk_pacing_shift;
 };
 
 struct htt_rx_desc;
diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c
index 95243b4..4f2c07f 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -8361,6 +8361,9 @@ int ath10k_mac_register(struct ath10k *ar)
 	ar->hw->wiphy->max_scan_ssids = WLAN_SCAN_PARAMS_MAX_SSID;
 	ar->hw->wiphy->max_scan_ie_len = WLAN_SCAN_PARAMS_MAX_IE_LEN;
 
+	if (ar->hw_params.tx_sk_pacing_shift != 0)
+		ar->hw->tx_sk_pacing_shift = ar->hw_params.tx_sk_pacing_shift;
+
 	ar->hw->vif_data_size = sizeof(struct ath10k_vif);
 	ar->hw->sta_data_size = sizeof(struct ath10k_sta);
 	ar->hw->txq_data_size = sizeof(struct ath10k_txq);
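For context on what the mac.c hunk above accomplishes: patch 1/2 of this
series presumably replaces the value mac80211 previously hard-coded (8)
with the ieee80211_hw field set here, so that the driver's choice reaches
the TCP sockets transmitting through it. Conceptually (a sketch of the
idea, not verbatim mac80211 code) that ends up as something like:

    /* on the TX path: cap how much data this socket may keep queued below
     * the stack to pacing_rate >> shift, using the driver-provided shift
     */
    if (skb->sk && sk_fullsock(skb->sk))
            sk_pacing_shift_update(skb->sk, local->hw.tx_sk_pacing_shift);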
Upstream kernel has an interface to help adjust sk_pacing_shift to help
improve TCP UL throughput.

The sk_pacing_shift is 8 in mac80211; this is based on tests with 11N
WiFi chips with ath9k. For QCA6174/QCA9377 PCI 11AC chips, the 11AC
VHT80 TCP UL throughput testing result shows 6 is the optimal.
Override sk_pacing_shift to 6 in the ath10k driver for QCA6174/9377 PCI.

Tested with QCA6174 PCI with firmware WLAN.RM.4.4.1-00109-QCARMSWPZ-1,
but this will also affect QCA9377 PCI. It's not a regression with new
firmware releases.

There are 2 test results with different settings:

ARM CPU based device with QCA6174A PCI with different sk_pacing_shift:

  sk_pacing_shift  throughput(Mbps)  CPU utilization
  6                500(-P5)          ~75% idle, Focus on CPU1: ~14% idle
  7                454(-P5)          ~80% idle, Focus on CPU1: ~4% idle
  8                288               ~90% idle, Focus on CPU1: ~35% idle
  9                ~200              ~92% idle, Focus on CPU1: ~50% idle

5G TCP UL VHT80 on X86 platform with QCA6174A PCI with sk_pacing_shift
set to 6:

  tcp_limit_output_bytes        throughput(Mbps)
  default(262144)  + 1 Stream   336
  default(262144)  + 2 Streams  558
  default(262144)  + 3 Streams  584
  default(262144)  + 4 Streams  602
  default(262144)  + 5 Streams  598
  changed(2621440) + 1 Stream   598
  changed(2621440) + 2 Streams  601

Signed-off-by: Wen Gong <wgong@codeaurora.org>
---
V2: make the optimal value configurable for each hardware type

 drivers/net/wireless/ath/ath10k/core.c | 6 ++++++
 drivers/net/wireless/ath/ath10k/hw.h   | 6 ++++++
 drivers/net/wireless/ath/ath10k/mac.c  | 3 +++
 3 files changed, 15 insertions(+)
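A rough way to read the tcp_limit_output_bytes table above (a
back-of-the-envelope with an assumed drain latency, not a measurement from
this thread): a single stream can keep at most tcp_limit_output_bytes
queued below the socket, so per-stream throughput is bounded by roughly
limit * 8 / (time for that queue to drain through the qdisc, driver and
firmware aggregation). With the default 262144-byte limit and a few
milliseconds of drain time this lands close to the observed 336 Mbps
single-stream result, which is why extra streams or a 10x larger limit
recover the ~600 Mbps the link can carry:

    /* sketch: per-stream throughput ceiling implied by tcp_limit_output_bytes */
    #include <stdio.h>

    int main(void)
    {
            const double limit_bytes = 262144.0;                 /* default from the table */
            const double assumed_drain_ms[] = { 4.0, 6.0, 8.0 }; /* assumed latencies */
            int i;

            for (i = 0; i < 3; i++) {
                    double mbps = limit_bytes * 8.0 / (assumed_drain_ms[i] / 1000.0) / 1e6;

                    printf("drain %.0f ms -> at most ~%.0f Mbps per stream\n",
                           assumed_drain_ms[i], mbps);
            }
            return 0;
    }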