diff mbox series

[v2,2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips

Message ID 1533724802-30944-3-git-send-email-wgong@codeaurora.org (mailing list archive)
State Changes Requested
Delegated to: Kalle Valo
Headers show
Series Change sk_pacing_shift in ieee80211_hw for best tx throughput | expand

Commit Message

Wen Gong Aug. 8, 2018, 10:40 a.m. UTC
Upstream kernel has an interface to help adjust sk_pacing_shift to help
improve TCP UL throughput.
The sk_pacing_shift is 8 in mac80211, this is based on test with 11N
WiFi chips with ath9k. For QCA6174/QCA9377 PCI 11AC chips, the 11AC
VHT80 TCP UL throughput testing result shows 6 is the optimal.
Overwrite the sk_pacing_shift to 6 in ath10k driver for QCA6174/9377 PCI.

Tested with QCA6174 PCI with firmware
WLAN.RM.4.4.1-00109-QCARMSWPZ-1, but this will also affect QCA9377 PCI.
It's not a regression with new firmware releases.

There have 2 test result of different settings:

ARM CPU based device with QCA6174A PCI with different
sk_pacing_shift:

 sk_pacing_shift  throughput(Mbps)             CPU utilization
         6            500(-P5)      ~75% idle, Focus on CPU1: ~14%idle
         7            454(-P5)      ~80% idle, Focus on CPU1: ~4%idle
         8               288        ~90% idle, Focus on CPU1: ~35%idle
         9              ~200        ~92% idle, Focus on CPU1: ~50%idle

5G TCP UL VTH80 on X86 platform with QCA6174A PCI with sk_packing_shift
set to 6:

  tcp_limit_output_bytes            throughput(Mbps)
 default(262144)+1 Stream                 336
 default(262144)+2 Streams                558
 default(262144)+3 Streams                584
 default(262144)+4 Streams                602
 default(262144)+5 Streams                598
 changed(2621440)+1 Stream                598
 changed(2621440)+2 Streams               601

Signed-off-by: Wen Gong <wgong@codeaurora.org>
---
V2:
-add the optimal for configurable for each hardware type
 drivers/net/wireless/ath/ath10k/core.c | 6 ++++++
 drivers/net/wireless/ath/ath10k/hw.h   | 6 ++++++
 drivers/net/wireless/ath/ath10k/mac.c  | 3 +++
 3 files changed, 15 insertions(+)

Comments

Toke Høiland-Jørgensen Aug. 8, 2018, 10:43 a.m. UTC | #1
Wen Gong <wgong@codeaurora.org> writes:

> Upstream kernel has an interface to help adjust sk_pacing_shift to help
> improve TCP UL throughput.
> The sk_pacing_shift is 8 in mac80211, this is based on test with 11N
> WiFi chips with ath9k. For QCA6174/QCA9377 PCI 11AC chips, the 11AC
> VHT80 TCP UL throughput testing result shows 6 is the optimal.
> Overwrite the sk_pacing_shift to 6 in ath10k driver for QCA6174/9377 PCI.
>
> Tested with QCA6174 PCI with firmware
> WLAN.RM.4.4.1-00109-QCARMSWPZ-1, but this will also affect QCA9377 PCI.
> It's not a regression with new firmware releases.
>
> There have 2 test result of different settings:
>
> ARM CPU based device with QCA6174A PCI with different
> sk_pacing_shift:
>
>  sk_pacing_shift  throughput(Mbps)             CPU utilization
>          6            500(-P5)      ~75% idle, Focus on CPU1: ~14%idle
>          7            454(-P5)      ~80% idle, Focus on CPU1: ~4%idle
>          8               288        ~90% idle, Focus on CPU1: ~35%idle
>          9              ~200        ~92% idle, Focus on CPU1: ~50%idle
>
> 5G TCP UL VTH80 on X86 platform with QCA6174A PCI with sk_packing_shift
> set to 6:
>
>   tcp_limit_output_bytes            throughput(Mbps)
>  default(262144)+1 Stream                 336
>  default(262144)+2 Streams                558
>  default(262144)+3 Streams                584
>  default(262144)+4 Streams                602
>  default(262144)+5 Streams                598
>  changed(2621440)+1 Stream                598
>  changed(2621440)+2 Streams               601

You still haven't provided any latency numbers for these tests, which
makes it impossible to verify that setting sk_pacing_shift to 6 is the
right tradeoff.

As I said before, from your numbers I suspect the right setting is
actually 7, which would be 10-20ms less latency under load; way more
important than ~50 Mbps...

-Toke
Wen Gong Aug. 10, 2018, 8:05 a.m. UTC | #2
> -----Original Message-----
> From: ath10k <ath10k-bounces@lists.infradead.org> On Behalf Of Toke
> Høiland-Jørgensen
> Sent: Wednesday, August 8, 2018 6:44 PM
> To: Wen Gong <wgong@codeaurora.org>; ath10k@lists.infradead.org;
> johannes@sipsolutions.net
> Cc: linux-wireless@vger.kernel.org
> Subject: Re: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi
> chips
> 
> Wen Gong <wgong@codeaurora.org> writes:
> 
> > Upstream kernel has an interface to help adjust sk_pacing_shift to
> > help improve TCP UL throughput.
> > The sk_pacing_shift is 8 in mac80211, this is based on test with 11N
> > WiFi chips with ath9k. For QCA6174/QCA9377 PCI 11AC chips, the 11AC
> > VHT80 TCP UL throughput testing result shows 6 is the optimal.
> > Overwrite the sk_pacing_shift to 6 in ath10k driver for QCA6174/9377 PCI.
> >
> > Tested with QCA6174 PCI with firmware
> > WLAN.RM.4.4.1-00109-QCARMSWPZ-1, but this will also affect QCA9377
> PCI.
> > It's not a regression with new firmware releases.
> >
> > There have 2 test result of different settings:
> >
> > ARM CPU based device with QCA6174A PCI with different
> > sk_pacing_shift:
> >
> >  sk_pacing_shift  throughput(Mbps)             CPU utilization
> >          6            500(-P5)      ~75% idle, Focus on CPU1: ~14%idle
> >          7            454(-P5)      ~80% idle, Focus on CPU1: ~4%idle
> >          8               288        ~90% idle, Focus on CPU1: ~35%idle
> >          9              ~200        ~92% idle, Focus on CPU1: ~50%idle
> >
> > 5G TCP UL VTH80 on X86 platform with QCA6174A PCI with
> > sk_packing_shift set to 6:
> >
> >   tcp_limit_output_bytes            throughput(Mbps)
> >  default(262144)+1 Stream                 336
> >  default(262144)+2 Streams                558
> >  default(262144)+3 Streams                584
> >  default(262144)+4 Streams                602
> >  default(262144)+5 Streams                598
> >  changed(2621440)+1 Stream                598
> >  changed(2621440)+2 Streams               601
> 
> You still haven't provided any latency numbers for these tests, which makes
> it impossible to verify that setting sk_pacing_shift to 6 is the right tradeoff.
> 
> As I said before, from your numbers I suspect the right setting is actually 7,
> which would be 10-20ms less latency under load; way more important than
> ~50 Mbps...
> 
Hi Toke,
Could you give the command line for the latency test?
https://flent.org/intro.html#quick-start
I used the command but test failed:
flent tcp_download -p 1 -l 60 -H 192.168.1.5 -t text-to-be-included-in-plot -o file1.png
error loading plotter: unable to find plot configuration "1"

> -Toke
> 
> _______________________________________________
> ath10k mailing list
> ath10k@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/ath10k
Toke Høiland-Jørgensen Aug. 10, 2018, 1:17 p.m. UTC | #3
Wen Gong <wgong@qti.qualcomm.com> writes:

>> -----Original Message-----
>> From: ath10k <ath10k-bounces@lists.infradead.org> On Behalf Of Toke
>> Høiland-Jørgensen
>> Sent: Wednesday, August 8, 2018 6:44 PM
>> To: Wen Gong <wgong@codeaurora.org>; ath10k@lists.infradead.org;
>> johannes@sipsolutions.net
>> Cc: linux-wireless@vger.kernel.org
>> Subject: Re: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi
>> chips
>> 
>> Wen Gong <wgong@codeaurora.org> writes:
>> 
>> > Upstream kernel has an interface to help adjust sk_pacing_shift to
>> > help improve TCP UL throughput.
>> > The sk_pacing_shift is 8 in mac80211, this is based on test with 11N
>> > WiFi chips with ath9k. For QCA6174/QCA9377 PCI 11AC chips, the 11AC
>> > VHT80 TCP UL throughput testing result shows 6 is the optimal.
>> > Overwrite the sk_pacing_shift to 6 in ath10k driver for QCA6174/9377 PCI.
>> >
>> > Tested with QCA6174 PCI with firmware
>> > WLAN.RM.4.4.1-00109-QCARMSWPZ-1, but this will also affect QCA9377
>> PCI.
>> > It's not a regression with new firmware releases.
>> >
>> > There have 2 test result of different settings:
>> >
>> > ARM CPU based device with QCA6174A PCI with different
>> > sk_pacing_shift:
>> >
>> >  sk_pacing_shift  throughput(Mbps)             CPU utilization
>> >          6            500(-P5)      ~75% idle, Focus on CPU1: ~14%idle
>> >          7            454(-P5)      ~80% idle, Focus on CPU1: ~4%idle
>> >          8               288        ~90% idle, Focus on CPU1: ~35%idle
>> >          9              ~200        ~92% idle, Focus on CPU1: ~50%idle
>> >
>> > 5G TCP UL VTH80 on X86 platform with QCA6174A PCI with
>> > sk_packing_shift set to 6:
>> >
>> >   tcp_limit_output_bytes            throughput(Mbps)
>> >  default(262144)+1 Stream                 336
>> >  default(262144)+2 Streams                558
>> >  default(262144)+3 Streams                584
>> >  default(262144)+4 Streams                602
>> >  default(262144)+5 Streams                598
>> >  changed(2621440)+1 Stream                598
>> >  changed(2621440)+2 Streams               601
>> 
>> You still haven't provided any latency numbers for these tests, which makes
>> it impossible to verify that setting sk_pacing_shift to 6 is the right tradeoff.
>> 
>> As I said before, from your numbers I suspect the right setting is actually 7,
>> which would be 10-20ms less latency under load; way more important than
>> ~50 Mbps...
>> 
> Hi Toke,
> Could you give the command line for the latency test?
> https://flent.org/intro.html#quick-start
> I used the command but test failed:
> flent tcp_download -p 1 -l 60 -H 192.168.1.5 -t text-to-be-included-in-plot -o file1.png
> error loading plotter: unable to find plot configuration "1"

Try something like:


flent -H 192.168.1.5 -t "sk_pacing_shift 7" tcp_nup --test-parameter upload_streams=1


This will note the value of sk_pacing_shift you are testing in the data
file (so change that as appropriate), and you can vary the number of TCP
streams by changing the upload_streams parameter.

Note that in this case I'm assuming you are running Flent on the device
with the kernel you are trying to test, so you want a TCP transfer going
*from* the device. If not, change "tcp_nup" to "tcp_ndown" and
"upload_streams" to "download_streams". Upload is netperf TCP_STREAM
test, and download is TCP_MAERTS.

When running the above command you'll get a summary output on the
terminal that you can paste on the list; and also a data file to plot
things form. For instance, you can do something like 'flent -p ping_cdf
*.flent.gz' to get a CDF plot of all your test results afterwards. You
are also very welcome to send me the .flent.gz data files and I'll take
a look to make sure everything looks reasonable :)

-Toke
Wen Gong Aug. 13, 2018, 5:37 a.m. UTC | #4
> -----Original Message-----
> From: Toke Høiland-Jørgensen <toke@toke.dk>
> Sent: Friday, August 10, 2018 9:18 PM
> To: Wen Gong <wgong@qti.qualcomm.com>; Wen Gong
> <wgong@codeaurora.org>; ath10k@lists.infradead.org;
> johannes@sipsolutions.net
> Cc: linux-wireless@vger.kernel.org
> Subject: RE: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi
> chips
> 
> Wen Gong <wgong@qti.qualcomm.com> writes:
> 
> >>
> > Hi Toke,
> > Could you give the command line for the latency test?
> > https://flent.org/intro.html#quick-start
> > I used the command but test failed:
> > flent tcp_download -p 1 -l 60 -H 192.168.1.5 -t
> > text-to-be-included-in-plot -o file1.png error loading plotter: unable to find
> plot configuration "1"
> 
> Try something like:
> 
> 
> flent -H 192.168.1.5 -t "sk_pacing_shift 7" tcp_nup --test-parameter
> upload_streams=1
> 
> 
> This will note the value of sk_pacing_shift you are testing in the data file (so
> change that as appropriate), and you can vary the number of TCP streams by
> changing the upload_streams parameter.
> 
> Note that in this case I'm assuming you are running Flent on the device with
> the kernel you are trying to test, so you want a TCP transfer going
> *from* the device. If not, change "tcp_nup" to "tcp_ndown" and
> "upload_streams" to "download_streams". Upload is netperf TCP_STREAM
> test, and download is TCP_MAERTS.
> 
> When running the above command you'll get a summary output on the
> terminal that you can paste on the list; and also a data file to plot things form.
> For instance, you can do something like 'flent -p ping_cdf *.flent.gz' to get a
> CDF plot of all your test results afterwards. You are also very welcome to
> send me the .flent.gz data files and I'll take a look to make sure everything
> looks reasonable :)
> 
> -Toke

Hi Toke,

I have tested with your method for shift 6/8/10/7 all twice time in open air environment.
Shift 6:
wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t "sk_pacing_shift6" tcp_nup --test-parameter upload_streams=1
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-13T110414.699512.sk_pacing_shift6.flent.gz.
Summary of tcp_nup test run 'sk_pacing_shift6' (at 2018-08-13 03:04:14.699512):

                           avg       median          # data pts
 Ping (ms) ICMP :         9.91         4.99 ms              350
 TCP upload avg :       242.48       262.43 Mbits/s         301
 TCP upload sum :       242.48       262.43 Mbits/s         301
 TCP upload::1  :       242.48       263.34 Mbits/s         271
wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t "sk_pacing_shift6" tcp_nup --test-parameter upload_streams=1
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-13T113317.074077.sk_pacing_shift6.flent.gz.
Summary of tcp_nup test run 'sk_pacing_shift6' (at 2018-08-13 03:33:17.074077):

                           avg       median          # data pts
 Ping (ms) ICMP :         7.75         5.30 ms              350
 TCP upload avg :       239.17       250.84 Mbits/s         301
 TCP upload sum :       239.17       250.84 Mbits/s         301
 TCP upload::1  :       239.17       255.03 Mbits/s         266

Shift 8:
wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t "sk_pacing_shift8" tcp_nup --test-parameter upload_streams=1
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-13T121433.187781.sk_pacing_shift8.flent.gz.
Summary of tcp_nup test run 'sk_pacing_shift8' (at 2018-08-13 04:14:33.187781):

                           avg       median          # data pts
 Ping (ms) ICMP :        17.12         7.07 ms              350
 TCP upload avg :       180.05       185.82 Mbits/s         301
 TCP upload sum :       180.05       185.82 Mbits/s         301
 TCP upload::1  :       180.05       189.41 Mbits/s         253
wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t "sk_pacing_shift8" tcp_nup --test-parameter upload_streams=1
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-13T121602.382575.sk_pacing_shift8.flent.gz.
Summary of tcp_nup test run 'sk_pacing_shift8' (at 2018-08-13 04:16:02.382575):

                           avg       median          # data pts
 Ping (ms) ICMP :        13.90         5.89 ms              350
 TCP upload avg :       207.56       228.16 Mbits/s         301
 TCP upload sum :       207.56       228.16 Mbits/s         301
 TCP upload::1  :       207.56       228.11 Mbits/s         259

Shift 10:
wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t "sk_pacing_shift10" tcp_nup --test-parameter upload_streams=1
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-13T121844.493498.sk_pacing_shift10.flent.gz.
Summary of tcp_nup test run 'sk_pacing_shift10' (at 2018-08-13 04:18:44.493498):

                           avg       median          # data pts
 Ping (ms) ICMP :        15.11         7.41 ms              350
 TCP upload avg :       162.38       164.10 Mbits/s         301
 TCP upload sum :       162.38       164.10 Mbits/s         301
 TCP upload::1  :       162.38       165.47 Mbits/s         252
wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t "sk_pacing_shift10" tcp_nup --test-parameter upload_streams=1
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-13T122108.347163.sk_pacing_shift10.flent.gz.
Summary of tcp_nup test run 'sk_pacing_shift10' (at 2018-08-13 04:21:08.347163):

                           avg       median          # data pts
 Ping (ms) ICMP :        13.69         7.48 ms              350
 TCP upload avg :       171.11       170.52 Mbits/s         301
 TCP upload sum :       171.11       170.52 Mbits/s         301
 TCP upload::1  :       171.11       171.36 Mbits/s         258

Shift 7:
wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t "sk_pacing_shift7" tcp_nup --test-parameter upload_streams=1
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-13T122948.020974.sk_pacing_shift7.flent.gz.
Summary of tcp_nup test run 'sk_pacing_shift7' (at 2018-08-13 04:29:48.020974):

                           avg       median          # data pts
 Ping (ms) ICMP :        14.12         6.61 ms              350
 TCP upload avg :       188.19       188.04 Mbits/s         301
 TCP upload sum :       188.19       188.04 Mbits/s         301
 TCP upload::1  :       188.19       190.88 Mbits/s         258
wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t "sk_pacing_shift7" tcp_nup --test-parameter upload_streams=1
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-13T123129.526514.sk_pacing_shift7.flent.gz.
Summary of tcp_nup test run 'sk_pacing_shift7' (at 2018-08-13 04:31:29.526514):

                           avg       median          # data pts
 Ping (ms) ICMP :        10.31         6.32 ms              350
 TCP upload avg :       212.70       233.69 Mbits/s         301
 TCP upload sum :       212.70       233.69 Mbits/s         301
 TCP upload::1  :       212.70       237.65 Mbits/s         262
Toke Høiland-Jørgensen Aug. 13, 2018, 11:18 a.m. UTC | #5
Wen Gong <wgong@qti.qualcomm.com> writes:

> Hi Toke,
>
> I have tested with your method for shift 6/8/10/7 all twice time in
> open air environment.

Great, thanks!

> Shift 6:
> wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t
> "sk_pacing_shift6" tcp_nup --test-parameter upload_streams=1

Just to be sure: You are running this on your laptop? And that laptop
has the ath10k and the modified kernel you are testing? Is 192.168.1.7
the AP or another device?

I'd like to see what happens when the link is fully saturated my
multiple streams as well. Could you repeat the tests with
upload_streams=5, please? And if you could share the .flent.gz files
(upload them somewhere, or email me off-list), that would be useful as
well :)

Thanks,

-Toke
Wen Gong Aug. 14, 2018, 5:55 a.m. UTC | #6
> -----Original Message-----
> From: Toke Høiland-Jørgensen <toke@toke.dk>
> Sent: Monday, August 13, 2018 7:18 PM
> To: Wen Gong <wgong@qti.qualcomm.com>; Wen Gong
> <wgong@codeaurora.org>; ath10k@lists.infradead.org;
> johannes@sipsolutions.net
> Cc: linux-wireless@vger.kernel.org
> Subject: RE: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi
> chips
> 
> Wen Gong <wgong@qti.qualcomm.com> writes:
> 
> > Hi Toke,
> >
> > I have tested with your method for shift 6/8/10/7 all twice time in
> > open air environment.
> 
> Great, thanks!
> 
> > Shift 6:
> > wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t
> > "sk_pacing_shift6" tcp_nup --test-parameter upload_streams=1
> 
> Just to be sure: You are running this on your laptop? And that laptop has the
> ath10k and the modified kernel you are testing? Is 192.168.1.7 the AP or
> another device?
Hi Toke,
Yes, I run on my laptop(Dell LATITUDE E5440 + QCA6174 PCI wireless card) in open air environment,
Kernel is 4.18.0-rc2-wt-ath+ #2 SMP which is built myself.
The mac80211.ko/ath10k is built with the 2 patches.
192.168.1.7 is another device(PC installed with 4.4.0-116-generic #140~14.04.1-Ubuntu)
> 
> I'd like to see what happens when the link is fully saturated my multiple
> streams as well. Could you repeat the tests with upload_streams=5, please?
> And if you could share the .flent.gz files (upload them somewhere, or email
> me off-list), that would be useful as well :)

Test result with upload_streams=5:

sk_pacing_shift6:
wgong@wgong-Latitude-E5440-1:~/flent/5stream$ flent -H 192.168.1.7 -t "sk_pacing_shift6" tcp_nup --test-parameter upload_streams=5
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-14T105332.356811.sk_pacing_shift6.flent.gz.
Summary of tcp_nup test run 'sk_pacing_shift6' (at 2018-08-14 02:53:32.356811):

                           avg       median          # data pts
 Ping (ms) ICMP :        20.46        13.85 ms              350
 TCP upload avg :        66.30        68.71 Mbits/s         301
 TCP upload sum :       331.49       343.55 Mbits/s         301
 TCP upload::1  :        60.80        64.65 Mbits/s         202
 TCP upload::2  :        77.72        82.89 Mbits/s         211
 TCP upload::3  :        60.52        56.09 Mbits/s         202
 TCP upload::4  :        67.39        73.56 Mbits/s         204
 TCP upload::5  :        65.06        71.97 Mbits/s         201
wgong@wgong-Latitude-E5440-1:~/flent/5stream$ flent -H 192.168.1.7 -t "sk_pacing_shift6" tcp_nup --test-parameter upload_streams=5
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-14T105554.583603.sk_pacing_shift6.flent.gz.
Summary of tcp_nup test run 'sk_pacing_shift6' (at 2018-08-14 02:55:54.583603):

                           avg       median          # data pts
 Ping (ms) ICMP :        20.86        13.80 ms              350
 TCP upload avg :        75.88        83.17 Mbits/s         301
 TCP upload sum :       379.42       415.84 Mbits/s         301
 TCP upload::1  :        82.07        90.73 Mbits/s         225
 TCP upload::2  :        74.08        78.29 Mbits/s         204
 TCP upload::3  :        70.44        75.65 Mbits/s         217
 TCP upload::4  :        82.70        92.86 Mbits/s         223
 TCP upload::5  :        70.13        76.87 Mbits/s         210

sk_pacing_shift7:
wgong@wgong-Latitude-E5440-1:~/flent/5stream$ flent -H 192.168.1.7 -t "sk_pacing_shift7" tcp_nup --test-parameter upload_streams=5
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-14T105759.169367.sk_pacing_shift7.flent.gz.
Summary of tcp_nup test run 'sk_pacing_shift7' (at 2018-08-14 02:57:59.169367):

                           avg       median          # data pts
 Ping (ms) ICMP :        24.66        15.55 ms              350
 TCP upload avg :        65.33        72.83 Mbits/s         301
 TCP upload sum :       326.67       363.10 Mbits/s         301
 TCP upload::1  :        77.60        92.93 Mbits/s         214
 TCP upload::2  :        67.22        75.95 Mbits/s         213
 TCP upload::3  :        65.81        74.54 Mbits/s         205
 TCP upload::4  :        63.06        70.37 Mbits/s         207
 TCP upload::5  :        52.98        55.78 Mbits/s         198
wgong@wgong-Latitude-E5440-1:~/flent/5stream$ flent -H 192.168.1.7 -t "sk_pacing_shift7" tcp_nup --test-parameter upload_streams=5
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-14T105923.620572.sk_pacing_shift7.flent.gz.
Summary of tcp_nup test run 'sk_pacing_shift7' (at 2018-08-14 02:59:23.620572):

                           avg       median          # data pts
 Ping (ms) ICMP :        25.03        14.25 ms              350
 TCP upload avg :        74.35        85.64 Mbits/s         297
 TCP upload sum :       371.77       428.19 Mbits/s         297
 TCP upload::1  :        74.12        82.55 Mbits/s         216
 TCP upload::2  :        67.78        71.87 Mbits/s         208
 TCP upload::3  :        82.40        94.72 Mbits/s         210
 TCP upload::4  :        70.77        82.43 Mbits/s         206
 TCP upload::5  :        76.70        88.62 Mbits/s         210

sk_pacing_shift8:
wgong@wgong-Latitude-E5440-1:~/flent/5stream$ flent -H 192.168.1.7 -t "sk_pacing_shift8" tcp_nup --test-parameter upload_streams=5
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-14T110334.670845.sk_pacing_shift8.flent.gz.
Summary of tcp_nup test run 'sk_pacing_shift8' (at 2018-08-14 03:03:34.670845):

                           avg       median          # data pts
 Ping (ms) ICMP :        25.03        19.50 ms              350
 TCP upload avg :        58.11        59.70 Mbits/s         301
 TCP upload sum :       290.53       298.51 Mbits/s         301
 TCP upload::1  :        51.37        51.74 Mbits/s         197
 TCP upload::2  :        42.02        43.66 Mbits/s         192
 TCP upload::3  :        61.17        62.33 Mbits/s         200
 TCP upload::4  :        66.11        70.31 Mbits/s         198
 TCP upload::5  :        69.86        76.31 Mbits/s         202
wgong@wgong-Latitude-E5440-1:~/flent/5stream$ flent -H 192.168.1.7 -t "sk_pacing_shift8" tcp_nup --test-parameter upload_streams=5
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-14T110557.587769.sk_pacing_shift8.flent.gz.
Summary of tcp_nup test run 'sk_pacing_shift8' (at 2018-08-14 03:05:57.587769):

                           avg       median          # data pts
 Ping (ms) ICMP :        21.50        13.05 ms              350
 TCP upload avg :        61.59        62.00 Mbits/s         301
 TCP upload sum :       307.93       310.01 Mbits/s         301
 TCP upload::1  :        69.70        76.22 Mbits/s         210
 TCP upload::2  :        68.51        74.88 Mbits/s         207
 TCP upload::3  :        71.06        77.57 Mbits/s         200
 TCP upload::4  :        47.08        51.42 Mbits/s         196
 TCP upload::5  :        51.58        51.98 Mbits/s         203

sk_pacing_shift10:
wgong@wgong-Latitude-E5440-1:~/flent/5stream$ flent -H 192.168.1.7 -t "sk_pacing_shift10" tcp_nup --test-parameter upload_streams=5
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-14T110814.434543.sk_pacing_shift10.flent.gz.
Summary of tcp_nup test run 'sk_pacing_shift10' (at 2018-08-14 03:08:14.434543):

                           avg       median          # data pts
 Ping (ms) ICMP :        31.57        19.35 ms              350
 TCP upload avg :        56.61        62.61 Mbits/s         301
 TCP upload sum :       283.07       313.07 Mbits/s         301
 TCP upload::1  :        39.39        42.96 Mbits/s         187
 TCP upload::2  :        62.20        72.39 Mbits/s         193
 TCP upload::3  :        61.72        74.31 Mbits/s         191
 TCP upload::4  :        61.86        73.74 Mbits/s         190
 TCP upload::5  :        57.90        65.11 Mbits/s         193
wgong@wgong-Latitude-E5440-1:~/flent/5stream$ flent -H 192.168.1.7 -t "sk_pacing_shift10" tcp_nup --test-parameter upload_streams=5
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-14T110931.986159.sk_pacing_shift10.flent.gz.
Summary of tcp_nup test run 'sk_pacing_shift10' (at 2018-08-14 03:09:31.986159):

                           avg       median          # data pts
 Ping (ms) ICMP :        19.23        13.20 ms              350
 TCP upload avg :        76.36        81.37 Mbits/s         301
 TCP upload sum :       381.80       406.85 Mbits/s         301
 TCP upload::1  :        64.95        67.91 Mbits/s         212
 TCP upload::2  :        82.16        92.35 Mbits/s         215
 TCP upload::3  :        67.51        70.18 Mbits/s         213
 TCP upload::4  :        77.42        82.11 Mbits/s         232
 TCP upload::5  :        89.76        99.96 Mbits/s         226

> 
> Thanks,
> 
> -Toke
Toke Høiland-Jørgensen Aug. 17, 2018, 11:32 a.m. UTC | #7
Wen Gong <wgong@qti.qualcomm.com> writes:

>> -----Original Message-----
>> From: Toke Høiland-Jørgensen <toke@toke.dk>
>> Sent: Monday, August 13, 2018 7:18 PM
>> To: Wen Gong <wgong@qti.qualcomm.com>; Wen Gong
>> <wgong@codeaurora.org>; ath10k@lists.infradead.org;
>> johannes@sipsolutions.net
>> Cc: linux-wireless@vger.kernel.org
>> Subject: RE: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi
>> chips
>> 
>> Wen Gong <wgong@qti.qualcomm.com> writes:
>> 
>> > Hi Toke,
>> >
>> > I have tested with your method for shift 6/8/10/7 all twice time in
>> > open air environment.
>> 
>> Great, thanks!
>> 
>> > Shift 6:
>> > wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t
>> > "sk_pacing_shift6" tcp_nup --test-parameter upload_streams=1
>> 
>> Just to be sure: You are running this on your laptop? And that laptop has the
>> ath10k and the modified kernel you are testing? Is 192.168.1.7 the AP or
>> another device?
> Hi Toke,
> Yes, I run on my laptop(Dell LATITUDE E5440 + QCA6174 PCI wireless card) in open air environment,
> Kernel is 4.18.0-rc2-wt-ath+ #2 SMP which is built myself.
> The mac80211.ko/ath10k is built with the 2 patches.
> 192.168.1.7 is another device(PC installed with 4.4.0-116-generic #140~14.04.1-Ubuntu)
>> 
>> I'd like to see what happens when the link is fully saturated my multiple
>> streams as well. Could you repeat the tests with upload_streams=5, please?
>> And if you could share the .flent.gz files (upload them somewhere, or email
>> me off-list), that would be useful as well :)
>
> Test result with upload_streams=5:

Sorry for the delay in looking at this, I've been terribly busy this
week; will get back to it in more detail next week.

One question for now: In the results you quote below the total
throughput is way lower than the up to 600 mbps you quoted in the patch
description. How did you run your original tests to achieve that high a
throughput?

-Toke
Peter Oh Aug. 30, 2018, 11:25 p.m. UTC | #8
Hi Toke,


On 08/17/2018 04:32 AM, Toke Høiland-Jørgensen wrote:
>
>>>> Shift 6:
>>>> wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t
>>>> "sk_pacing_shift6" tcp_nup --test-parameter upload_streams=1
>>>
Does flent have options to set TOS or QoS queue for TCP or UDP test?
If I add "--test-paramete burst_tos=0xe0", will it use corresponding 
mac80211 QoS queue?
flent -H xxxx -t "xxxx" tcp_nup --test-parameter upload_streams=1 
--test-paramete burst_tos=0xe0

Thanks,
Peter
Toke Høiland-Jørgensen Aug. 31, 2018, 3:36 p.m. UTC | #9
Peter Oh <peter.oh@bowerswilkins.com> writes:

> Hi Toke,
>
>
> On 08/17/2018 04:32 AM, Toke Høiland-Jørgensen wrote:
>>
>>>>> Shift 6:
>>>>> wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t
>>>>> "sk_pacing_shift6" tcp_nup --test-parameter upload_streams=1
>>>>
> Does flent have options to set TOS or QoS queue for TCP or UDP test?
> If I add "--test-paramete burst_tos=0xe0", will it use corresponding 
> mac80211 QoS queue?

Yes, for some tests:

- The RRUL test will run four TCP flows in each direction, one marked for
  each 802.11 QoS queue.

- For the rtt_fair* tests you can set them with a test parameter:
--test-parameter markings=BE,CS1 to set flows 1 and 2 to BE and CS1
respectively. You can use the rtt_fair_var* tests to get a variable
number of flows (the same host can be specified multiple times by having
different hostnames resolve to the same IP).

-Toke
Johannes Berg Sept. 3, 2018, 9:38 a.m. UTC | #10
On Thu, 2018-08-30 at 16:32 -0700, Grant Grundler wrote:
> On Sun, Aug 12, 2018 at 10:37 PM Wen Gong <wgong@qti.qualcomm.com> wrote:
> ...
> > I have tested with your method for shift 6/8/10/7 all twice time in open air environment.
> 
> I've attached the graphed data which Wen provided me and I made one
> "cleanup": the median and average "ICMP" are computed only using
> values where flent has an "antagonist" load running.  The first 28 and
> last 26 seconds of the test are "idle" - it makes no sense to include
> the latency of those. This drops about 1/7th of the data points.
> 
> My observations:
> o the delta between average and median is evidence this is not
> particularly reliable data (but expected result for "open air" wifi
> testing in a corp environment).
> o sk_pacing_shift=8 and/or =7 appears to have suffered from
> significant open air interference - I've attached a graph of the raw
> data.
> o I'm comfortable asserting throughput is higher and latency is lower
> for sk_pacing_shift=6 (vs 7) at the cost of additional CPU
> utilization.
> 
> My questions:
> 1) Is the current conversation just quibbling about 6 vs 7 for the
> ath10k default?

6 vs. 8, I think? But I didn't follow the full discussion.

I also think that we shouldn't necessarily restrict this to "for the
ath10k". Is there any reason to think that this would be different for
different chips?

I could see a reason for a driver to set the default *higher* than the
mac80211 default (needing more buffers for whatever internal reason),
but I can't really see a reason it should be *lower*. Also, the
networking-stack wide default is 0, after all, so it seems that each new
layer might have some knowledge about why to *increase* it (like
mac80211 knows that A-MPDUs/A-MSDUs need to be built), but it seems not
so reasonable that it should be reduced again. Unless you have some
specific knowledge that ath10k would have a smaller A-MSDU/A-MPDU size
limit or something?

> 2) If yes, can both patches be accepted and then later add code to
> select a default based on CPU available?

I've applied the first patch now (with some comment cleanups, you may
want to pick those up for your internal usage) because I can see an
argument for increasing it.

However, I think that the "quibbling" over the default should most
likely result in a changed default value for mac80211, rather than
ath10k using the driver setting as a way out of that discussion.

johannes
Toke Høiland-Jørgensen Sept. 3, 2018, 11:11 a.m. UTC | #11
Johannes Berg <johannes@sipsolutions.net> writes:

> On Thu, 2018-08-30 at 16:32 -0700, Grant Grundler wrote:
>> On Sun, Aug 12, 2018 at 10:37 PM Wen Gong <wgong@qti.qualcomm.com> wrote:
>> ...
>> > I have tested with your method for shift 6/8/10/7 all twice time in open air environment.
>> 
>> I've attached the graphed data which Wen provided me and I made one
>> "cleanup": the median and average "ICMP" are computed only using
>> values where flent has an "antagonist" load running.  The first 28 and
>> last 26 seconds of the test are "idle" - it makes no sense to include
>> the latency of those. This drops about 1/7th of the data points.
>> 
>> My observations:
>> o the delta between average and median is evidence this is not
>> particularly reliable data (but expected result for "open air" wifi
>> testing in a corp environment).
>> o sk_pacing_shift=8 and/or =7 appears to have suffered from
>> significant open air interference - I've attached a graph of the raw
>> data.
>> o I'm comfortable asserting throughput is higher and latency is lower
>> for sk_pacing_shift=6 (vs 7) at the cost of additional CPU
>> utilization.
>> 
>> My questions:
>> 1) Is the current conversation just quibbling about 6 vs 7 for the
>> ath10k default?
>
> 6 vs. 8, I think? But I didn't follow the full discussion.
>
> I also think that we shouldn't necessarily restrict this to "for the
> ath10k". Is there any reason to think that this would be different for
> different chips?

No, I also think it should be possible to select a value that will work
for all drivers. There's a strong diminishing returns effect here after
a certain point, and I believe we should strongly push back against just
adding more buffering to chase (single-flow) throughput numbers; and I'm
not convinced that having different values for different drivers is
worth the complexity.

As far as I'm concerned, just showing an increase in maximum throughput
under ideal conditions (as this initial patch submission did) is not
sufficient argument for adding more buffering. At a minimum, the
evaluation should also include:

- Impact on latency and throughput for competing flows in the same test.
- Impact on latency for the TCP flow itself.
- A full evaluation at low rates and in scenarios with more than one
  client.

As far as the actual value, I *think* it may be that the default shift
should be 7 (8 ms) rather than 8 (4 ms) as it currently is. Going back
and looking at my data from when I submitted the original patch, it
looks like the point of diminishing returns is somewhere between those
two with ath9k (where I did most of my testing), and it seems reasonable
that it could be slightly higher (i.e., more buffering) for ath10k.

-Toke
Johannes Berg Sept. 3, 2018, 11:47 a.m. UTC | #12
On Mon, 2018-09-03 at 13:11 +0200, Toke Høiland-Jørgensen wrote:

> > 6 vs. 8, I think? But I didn't follow the full discussion.

Err, I just realized that I was completely wrong - the default, of
course, is 10. So smaller values mean more buffering.

Most of my argumentation was thus utter garbage :-)

> > I also think that we shouldn't necessarily restrict this to "for the
> > ath10k". Is there any reason to think that this would be different for
> > different chips?
> 
> No, I also think it should be possible to select a value that will work
> for all drivers. There's a strong diminishing returns effect here after
> a certain point, and I believe we should strongly push back against just
> adding more buffering to chase (single-flow) throughput numbers; and I'm
> not convinced that having different values for different drivers is
> worth the complexity.

I think I can see some point in it if the driver requires some buffering
for some specific reason? But you'd have to be able to state that
reason, I guess, I could imagine having a firmware limitation to need to
build two aggregates, or something around MU-MIMO, etc.

> As far as I'm concerned, just showing an increase in maximum throughput
> under ideal conditions (as this initial patch submission did) is not
> sufficient argument for adding more buffering. At a minimum, the
> evaluation should also include:
> 
> - Impact on latency and throughput for competing flows in the same test.
> - Impact on latency for the TCP flow itself.
> - A full evaluation at low rates and in scenarios with more than one
>   client.

That seems reasonable.

> As far as the actual value, I *think* it may be that the default shift
> should be 7 (8 ms) rather than 8 (4 ms) as it currently is. Going back
> and looking at my data from when I submitted the original patch, it
> looks like the point of diminishing returns is somewhere between those
> two with ath9k (where I did most of my testing), and it seems reasonable
> that it could be slightly higher (i.e., more buffering) for ath10k.

Grant's data shows a significant difference between 6 and 7 for both
latency and throughput:

 * median tpt
   - ~241 vs ~201 (both 1 and 5 streams)
 * median latency
   - 7.5 vs 6 (1 stream)
   - 17.3 vs. 16.6 (5 streams)

A 20% throughput improvement at <= 1.5ms latency cost seems like a
pretty reasonable trade-off?

johannes
Toke Høiland-Jørgensen Sept. 3, 2018, 1:35 p.m. UTC | #13
Johannes Berg <johannes@sipsolutions.net> writes:

> On Mon, 2018-09-03 at 13:11 +0200, Toke Høiland-Jørgensen wrote:
>
>> > 6 vs. 8, I think? But I didn't follow the full discussion.
>
> Err, I just realized that I was completely wrong - the default, of
> course, is 10. So smaller values mean more buffering.
>
> Most of my argumentation was thus utter garbage :-)

Well, I got the gist, even if the sign bit was wrong ;)

>> > I also think that we shouldn't necessarily restrict this to "for the
>> > ath10k". Is there any reason to think that this would be different for
>> > different chips?
>> 
>> No, I also think it should be possible to select a value that will work
>> for all drivers. There's a strong diminishing returns effect here after
>> a certain point, and I believe we should strongly push back against just
>> adding more buffering to chase (single-flow) throughput numbers; and I'm
>> not convinced that having different values for different drivers is
>> worth the complexity.
>
> I think I can see some point in it if the driver requires some
> buffering for some specific reason? But you'd have to be able to state
> that reason, I guess, I could imagine having a firmware limitation to
> need to build two aggregates, or something around MU-MIMO, etc.

Right, I'm not ruling out that there can be legitimate reasons to add
extra buffering; but a lot of times it's just used to paper over other
issues, so a good explanation is definitely needed...

>> As far as the actual value, I *think* it may be that the default shift
>> should be 7 (8 ms) rather than 8 (4 ms) as it currently is. Going back
>> and looking at my data from when I submitted the original patch, it
>> looks like the point of diminishing returns is somewhere between those
>> two with ath9k (where I did most of my testing), and it seems reasonable
>> that it could be slightly higher (i.e., more buffering) for ath10k.
>
> Grant's data shows a significant difference between 6 and 7 for both
> latency and throughput:
>
>  * median tpt
>    - ~241 vs ~201 (both 1 and 5 streams)
>  * median latency
>    - 7.5 vs 6 (1 stream)
>    - 17.3 vs. 16.6 (5 streams)
>
> A 20% throughput improvement at <= 1.5ms latency cost seems like a
> pretty reasonable trade-off?

Yeah, on it's face. What I'm bothered about is that it is the exact
opposite results that I got from my ath10k tests (there, throughput
*dropped* and latency doubled when going to from 4 to 16 ms of
buffering). And, well, Grant's data is from a single test in a noisy
environment where the timeseries graph shows that throughput is all over
the place for the duration of the test; so it's hard to draw solid
conclusions from (for instance, for the 5-stream test, the average
throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7
it's 326 and 371 Mbps) . Unfortunately I don't have the same hardware
used in this test, so I can't go verify it myself; so the only thing I
can do is grumble about it here... :)

-Toke
Dave Taht Sept. 3, 2018, 2:57 p.m. UTC | #14
I have not been on this thread (I have had to shut down my wifi lab
and am not planning on doing any more wifi work in the future). But
for what it's worth:

* tcp_pacing_shift affects the performance of the tcp stack only. I
think 16ms is an outrageous amount, and I'm thinking we actually have
a problem on the other side of the link in getting acks out as I
write. That said,
I don't care all that much compared to the bad old days (
http://blog.cerowrt.org/tags/wifi ) and 16ms
is far better than the seconds it used to be.

That said, I'd rather appreciate a test on HT20 with the same chipset
at a lower rate. The hope was that
we'd achieve flat latency at every rate (see
http://blog.cerowrt.org/post/real_results/ for some lovely pics)

Flent can sample minstrel stats but I think we're SOL on the ath10k.

* The irtt test can measure one way delays pretty accurately so you
can see if the other side is actually the source of this issue.

* Not clear to me if the AP is also running the fq_codel for wifi
code? That, um, er, helps an AP enormously....

* Having an aircap and regular capture of what's going on during these
tests would be useful. Flent is a great test driver, but a tcpdump
taken during a flent test tells the truth via tcptrace and xplot.org.
We can inspect cwnd, rwnd, and look at drops. (actually I usually turn
ECN on so I can see when the codel algorithm is kicking in). Being
able to zoom in on the early ramp would be helpful. Etc.

I'll take a look at them if you'll supply them.

* At some point the ath10k driver also gained NAPI support. (?) I
don't think NAPI is a good fit for wifi. Delays in TCP RX processing
lead to less throughput.

* The firmware is perpetually trying to do more stuff with less
interrupts. Me, I dreamed of a register you could poll for "when will
the air be clear", and a "tx has one ms left to go, give me some more
data" interrupt. The fact we still have a minimum of ~12ms of
uncontrolled buffering bugs me. If we could just hold off submittal
until just before it was needed, we could cut it in half (and further,
by 10s of ms, when the media is contended).

* My general recomendation for contended APs was that they advertise a
reduced TXOP as the number of active stations go up. (this also
applies to the VI/VO queues). Never got around to algorizing it... I
gave up on my links and, disable VO/VI/BK entirely and just tell
hostapd to advertise 2ms. Yep, kills conventional measures of
throughput. Cuts inter-station service time by 2/3s though, and *that*
you can notice and measure with PLT benchmarks and overall feel.

This implies increasing the pacing shift to suit the advertised size
of the txop, not decreasing it - and I have no idea how to get that
from one place to another. Same goes for stuff destined for the VI/VO
queues.

* (also it would be good to see in the capture what the beacon says)

* Powersave has always been a problem...

* Cubic is a terrible tcp for wifi. BBR, however, is possibly worse...
or better. It would be educational to try running BBR on this link at
either shift setting. (seriously!) TCP is not TCP is not TCP....

Lastly... tcp_pacing_shift does not affect the number of other packets
that can get into a given txop given the other buffering - it's one in
the hardware, one ready to go, one being formed, try setting
pacing_shift to to 4 to see what happens....
Dave Taht Sept. 3, 2018, 3:35 p.m. UTC | #15
While I'm being overly vague and general...

* Doing some form of ack compression (whether in the tcp stack or in
mac80211) seems useful. It's really
hard to fill 4Mbytes one way and the needed number of acks the other.
I'd also pointed out (on some thread on the make-wifi-fast list that I
can't find right now) that drivers had a tendency to eat an entire
txop for that first single ack and ping pong between empty and full on
a one way tcp benchmark.

* Similarly grabbing air during the contention window. If the sender
is grabbing 3x as much air as the reciever

(A1 A2 A3 B1 A4 B2)

* I know of one org that upped the codel target value for 802.11ac
with decent results but I can't talk about it...
(sort of why I asked for a cap). My problem with decreasing the pacing
shift is that it makes for less interleaving (I think! have to look),
where increasing the target compensates for more jittery networks.
('course my overall theme of my prior post was to reduce jittery
networks with less buffering and smaller txops overall)

* Is this a two device test? or a test through an AP?

Looking over this thread, I see a suggestion is to expose the pacing
rate via a sysctl. I don't think there's
any one answer, either, but I'm not sure this is the right variable to
expose. Perhaps a tunable for
latency that might (one day) look at multiple variables and decide?

Getting this more right is really a job for an algorithm of
minstrel-like complexity.
Grant Grundler Sept. 4, 2018, 11:43 p.m. UTC | #16
On Mon, Sep 3, 2018 at 6:35 AM Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>
> Johannes Berg <johannes@sipsolutions.net> writes:
....
> > Grant's data shows a significant difference between 6 and 7 for both
> > latency and throughput:

Minor nit: this is wgong's data more thoughtfully processed.

> >
> >  * median tpt
> >    - ~241 vs ~201 (both 1 and 5 streams)
> >  * median latency
> >    - 7.5 vs 6 (1 stream)
> >    - 17.3 vs. 16.6 (5 streams)
> >
> > A 20% throughput improvement at <= 1.5ms latency cost seems like a
> > pretty reasonable trade-off?
>
> Yeah, on it's face. What I'm bothered about is that it is the exact
> opposite results that I got from my ath10k tests (there, throughput
> *dropped* and latency doubled when going to from 4 to 16 ms of
> buffering).

Hrm, yeah...that would bother me too. I think even if we don't
understand why/how that happened, at some level we need to allow
subsystems or drivers to adjust sk_pacing_shift value. Changing
sk_pacing_shift clearly has an effect that can be optimized.

If smaller values of sk_pacing_shift is increasing the interval (and
allows more buffering), I'm wondering why CPU utilization gets higher.
More buffering is usually more efficient. :/

wgong: can you confirm (a) I've entered the data correctly in the
spreadsheet and (b) you've labeled the data sets correctly when you
generated the data?
If either of us made a mistake, it would be good to know. :)

I'm going to speculate that "other processing" (e.g. device level
interrupt mitigation or possibly CPU scheduling behaviors which
handles TX/RX completion) could cause a "bathtub" effect similar to
the performance graphs that originally got NAPI accepted into the
kernel ~15 years ago.  So the "optimal" value could be different for
different architectures and different IO devices (which have different
max link rates and different host bus interconnects). But honestly, I
don't understand all the details of sk_pacing_shift value nearly as
well as just about anyone else reading this thread.

> And, well, Grant's data is from a single test in a noisy
> environment where the time series graph shows that throughput is all over
> the place for the duration of the test; so it's hard to draw solid
> conclusions from (for instance, for the 5-stream test, the average
> throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7
> it's 326 and 371 Mbps) . Unfortunately I don't have the same hardware
> used in this test, so I can't go verify it myself; so the only thing I
> can do is grumble about it here... :)

It's a fair complaint and I agree with it.  My counter argument is the
opposite is true too: most ideal benchmarks don't measure what most
users see. While the data wgong provided are way more noisy than I
like, my overall "confidence" in the "conclusion" I offered is still
positive.

cheers,
grant
Wen Gong Sept. 5, 2018, 7:23 a.m. UTC | #17
On 2018-09-05 07:43, Grant Grundler wrote:
> On Mon, Sep 3, 2018 at 6:35 AM Toke Høiland-Jørgensen <toke@toke.dk> 
> wrote:
>> 
>> Johannes Berg <johannes@sipsolutions.net> writes:
> ....
>> > Grant's data shows a significant difference between 6 and 7 for both
>> > latency and throughput:
> 
> Minor nit: this is wgong's data more thoughtfully processed.

> 
> wgong: can you confirm (a) I've entered the data correctly in the
> spreadsheet and (b) you've labeled the data sets correctly when you
> generated the data?
> If either of us made a mistake, it would be good to know. :)
> 
yes, the data is the test result from flent tool.
Toke Høiland-Jørgensen Sept. 6, 2018, 10:18 a.m. UTC | #18
Grant Grundler <grundler@google.com> writes:

>> And, well, Grant's data is from a single test in a noisy
>> environment where the time series graph shows that throughput is all over
>> the place for the duration of the test; so it's hard to draw solid
>> conclusions from (for instance, for the 5-stream test, the average
>> throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7
>> it's 326 and 371 Mbps) . Unfortunately I don't have the same hardware
>> used in this test, so I can't go verify it myself; so the only thing I
>> can do is grumble about it here... :)
>
> It's a fair complaint and I agree with it. My counter argument is the
> opposite is true too: most ideal benchmarks don't measure what most
> users see. While the data wgong provided are way more noisy than I
> like, my overall "confidence" in the "conclusion" I offered is still
> positive.

Right. I guess I would just prefer a slightly more comprehensive
evaluation to base a 4x increase in buffer size on...

-Toke
Grant Grundler Feb. 20, 2019, 7:15 p.m. UTC | #19
On Thu, Sep 6, 2018 at 3:18 AM Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>
> Grant Grundler <grundler@google.com> writes:
>
> >> And, well, Grant's data is from a single test in a noisy
> >> environment where the time series graph shows that throughput is all over
> >> the place for the duration of the test; so it's hard to draw solid
> >> conclusions from (for instance, for the 5-stream test, the average
> >> throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7
> >> it's 326 and 371 Mbps) . Unfortunately I don't have the same hardware
> >> used in this test, so I can't go verify it myself; so the only thing I
> >> can do is grumble about it here... :)
> >
> > It's a fair complaint and I agree with it. My counter argument is the
> > opposite is true too: most ideal benchmarks don't measure what most
> > users see. While the data wgong provided are way more noisy than I
> > like, my overall "confidence" in the "conclusion" I offered is still
> > positive.
>
> Right. I guess I would just prefer a slightly more comprehensive
> evaluation to base a 4x increase in buffer size on...

Kalle, is this why you didn't accept this patch? Other reasons?

Toke, what else would you like to see evaluated?

I generally want to see three things measured when "benchmarking"
technologies: throughput, latency, cpu utilization
We've covered those three I think "reasonably".

What does a "4x increase in memory" mean here?  Wen, how much more
memory does this cause ath10k to use?

If a "4x increase in memory" means I'm using 1MB instead of 256KB, I'm
not going worry about that on a system with 2GB-16GB of RAM if it
doubles the throughput of the WIFI for a given workload.  I expect
routers with 128-256MB RAM would make that tradeoff as well assuming
they don't have other RAM-demanding workload.

cheers,
grant
Kalle Valo Feb. 21, 2019, 4:39 a.m. UTC | #20
Grant Grundler <grundler@google.com> writes:

> On Thu, Sep 6, 2018 at 3:18 AM Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>>
>> Grant Grundler <grundler@google.com> writes:
>>
>> >> And, well, Grant's data is from a single test in a noisy
>> >> environment where the time series graph shows that throughput is all over
>> >> the place for the duration of the test; so it's hard to draw solid
>> >> conclusions from (for instance, for the 5-stream test, the average
>> >> throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7
>> >> it's 326 and 371 Mbps) . Unfortunately I don't have the same hardware
>> >> used in this test, so I can't go verify it myself; so the only thing I
>> >> can do is grumble about it here... :)
>> >
>> > It's a fair complaint and I agree with it. My counter argument is the
>> > opposite is true too: most ideal benchmarks don't measure what most
>> > users see. While the data wgong provided are way more noisy than I
>> > like, my overall "confidence" in the "conclusion" I offered is still
>> > positive.
>>
>> Right. I guess I would just prefer a slightly more comprehensive
>> evaluation to base a 4x increase in buffer size on...
>
> Kalle, is this why you didn't accept this patch? Other reasons?

Just for reference, here are patchwork links:

https://patchwork.kernel.org/cover/10559729/
https://patchwork.kernel.org/patch/10559731/
https://patchwork.kernel.org/patch/10559733/

The mac80211 patch is accepted and the ath10k patch is deferred. IIRC I
put it deferred as I didn't see a real consensus about the patch and was
supposed to look at it later, but haven't done yet. I don't have any
issues with the patch, except maybe removing the duplicate define for
9377 (which I can easily fix in the pending branch).
Toke Høiland-Jørgensen Feb. 21, 2019, 3:42 p.m. UTC | #21
Grant Grundler <grundler@google.com> writes:

> On Thu, Sep 6, 2018 at 3:18 AM Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>>
>> Grant Grundler <grundler@google.com> writes:
>>
>> >> And, well, Grant's data is from a single test in a noisy
>> >> environment where the time series graph shows that throughput is all over
>> >> the place for the duration of the test; so it's hard to draw solid
>> >> conclusions from (for instance, for the 5-stream test, the average
>> >> throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7
>> >> it's 326 and 371 Mbps) . Unfortunately I don't have the same hardware
>> >> used in this test, so I can't go verify it myself; so the only thing I
>> >> can do is grumble about it here... :)
>> >
>> > It's a fair complaint and I agree with it. My counter argument is the
>> > opposite is true too: most ideal benchmarks don't measure what most
>> > users see. While the data wgong provided are way more noisy than I
>> > like, my overall "confidence" in the "conclusion" I offered is still
>> > positive.
>>
>> Right. I guess I would just prefer a slightly more comprehensive
>> evaluation to base a 4x increase in buffer size on...
>
> Kalle, is this why you didn't accept this patch? Other reasons?
>
> Toke, what else would you like to see evaluated?
>
> I generally want to see three things measured when "benchmarking"
> technologies: throughput, latency, cpu utilization
> We've covered those three I think "reasonably".

Hmm, going back and looking at this (I'd completely forgotten about this
patch), I think I had two main concerns:

1. What happens in a degraded signal situation, where the throughput is
   limited by the signal conditions, or by contention with other devices.
   Both of these happen regularly, and I worry that latency will be
   badly affected under those conditions.

2. What happens with old hardware that has worse buffer management in
   the driver->firmware path (especially drivers without push/pull mode
   support)? For these, the lower-level queueing structure is less
   effective at controlling queueing latency.

Getting the queue size limit patches from ChromeOS ported would
alleviate point 2. I do believe Kan said he'd look into that once the
airtime patches were merged. So Kan, any progress on that front? :)

> What does a "4x increase in memory" mean here? Wen, how much more
> memory does this cause ath10k to use?

I didn't say "memory", I said "buffer size"... :)
I.e., it's the latency impact of the increased buffering I'm worried
about (see above), not the system memory usage.

-Toke
Kalle Valo Feb. 21, 2019, 4:10 p.m. UTC | #22
Toke Høiland-Jørgensen <toke@toke.dk> writes:

> Grant Grundler <grundler@google.com> writes:
>
>> On Thu, Sep 6, 2018 at 3:18 AM Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>>>
>>> Grant Grundler <grundler@google.com> writes:
>>>
>>> >> And, well, Grant's data is from a single test in a noisy
>>> >> environment where the time series graph shows that throughput is all over
>>> >> the place for the duration of the test; so it's hard to draw solid
>>> >> conclusions from (for instance, for the 5-stream test, the average
>>> >> throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7
>>> >> it's 326 and 371 Mbps) . Unfortunately I don't have the same hardware
>>> >> used in this test, so I can't go verify it myself; so the only thing I
>>> >> can do is grumble about it here... :)
>>> >
>>> > It's a fair complaint and I agree with it. My counter argument is the
>>> > opposite is true too: most ideal benchmarks don't measure what most
>>> > users see. While the data wgong provided are way more noisy than I
>>> > like, my overall "confidence" in the "conclusion" I offered is still
>>> > positive.
>>>
>>> Right. I guess I would just prefer a slightly more comprehensive
>>> evaluation to base a 4x increase in buffer size on...
>>
>> Kalle, is this why you didn't accept this patch? Other reasons?
>>
>> Toke, what else would you like to see evaluated?
>>
>> I generally want to see three things measured when "benchmarking"
>> technologies: throughput, latency, cpu utilization
>> We've covered those three I think "reasonably".
>
> Hmm, going back and looking at this (I'd completely forgotten about this
> patch), I think I had two main concerns:
>
> 1. What happens in a degraded signal situation, where the throughput is
>    limited by the signal conditions, or by contention with other devices.
>    Both of these happen regularly, and I worry that latency will be
>    badly affected under those conditions.
>
> 2. What happens with old hardware that has worse buffer management in
>    the driver->firmware path (especially drivers without push/pull mode
>    support)? For these, the lower-level queueing structure is less
>    effective at controlling queueing latency.

Do note that this patch changes behaviour _only_ for QCA6174 and QCA9377
PCI devices, which IIRC do not even support push/pull mode. All the
rest, including QCA988X and QCA9984 are unaffected.
Ben Greear Feb. 21, 2019, 4:22 p.m. UTC | #23
On 2/21/19 8:10 AM, Kalle Valo wrote:
> Toke Høiland-Jørgensen <toke@toke.dk> writes:
> 
>> Grant Grundler <grundler@google.com> writes:
>>
>>> On Thu, Sep 6, 2018 at 3:18 AM Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>>>>
>>>> Grant Grundler <grundler@google.com> writes:
>>>>
>>>>>> And, well, Grant's data is from a single test in a noisy
>>>>>> environment where the time series graph shows that throughput is all over
>>>>>> the place for the duration of the test; so it's hard to draw solid
>>>>>> conclusions from (for instance, for the 5-stream test, the average
>>>>>> throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7
>>>>>> it's 326 and 371 Mbps) . Unfortunately I don't have the same hardware
>>>>>> used in this test, so I can't go verify it myself; so the only thing I
>>>>>> can do is grumble about it here... :)
>>>>>
>>>>> It's a fair complaint and I agree with it. My counter argument is the
>>>>> opposite is true too: most ideal benchmarks don't measure what most
>>>>> users see. While the data wgong provided are way more noisy than I
>>>>> like, my overall "confidence" in the "conclusion" I offered is still
>>>>> positive.
>>>>
>>>> Right. I guess I would just prefer a slightly more comprehensive
>>>> evaluation to base a 4x increase in buffer size on...
>>>
>>> Kalle, is this why you didn't accept this patch? Other reasons?
>>>
>>> Toke, what else would you like to see evaluated?
>>>
>>> I generally want to see three things measured when "benchmarking"
>>> technologies: throughput, latency, cpu utilization
>>> We've covered those three I think "reasonably".
>>
>> Hmm, going back and looking at this (I'd completely forgotten about this
>> patch), I think I had two main concerns:
>>
>> 1. What happens in a degraded signal situation, where the throughput is
>>     limited by the signal conditions, or by contention with other devices.
>>     Both of these happen regularly, and I worry that latency will be
>>     badly affected under those conditions.
>>
>> 2. What happens with old hardware that has worse buffer management in
>>     the driver->firmware path (especially drivers without push/pull mode
>>     support)? For these, the lower-level queueing structure is less
>>     effective at controlling queueing latency.
> 
> Do note that this patch changes behaviour _only_ for QCA6174 and QCA9377
> PCI devices, which IIRC do not even support push/pull mode. All the
> rest, including QCA988X and QCA9984 are unaffected.

Just as a note, at least kernels such as 4.14.whatever perform poorly when
running ath10k on 9984 when acting as TCP endpoints.  This makes them not
really usable for stuff like serving video to lots of clients.

Tweaking TCP (I do it a bit differently, but either way) can significantly
improve performance.

Recently I helped a user that could get barely 70 stations streaming at 1Mbps
on stock kernel (using one wave1 on 2.4, one wave-2 on 5Ghz),
and we got 110 working with a tweaked TCP stack.  These were /n stations too.

I think it is lame that it _still_ requires out of tree patches to make TCP work
well on ath10k...even if you want to default to current behaviour, you should
allow users to tweak it to work with their use case.

Thanks,
Ben
Toke Høiland-Jørgensen Feb. 21, 2019, 4:28 p.m. UTC | #24
Kalle Valo <kvalo@codeaurora.org> writes:

> Toke Høiland-Jørgensen <toke@toke.dk> writes:
>
>> Grant Grundler <grundler@google.com> writes:
>>
>>> On Thu, Sep 6, 2018 at 3:18 AM Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>>>>
>>>> Grant Grundler <grundler@google.com> writes:
>>>>
>>>> >> And, well, Grant's data is from a single test in a noisy
>>>> >> environment where the time series graph shows that throughput is all over
>>>> >> the place for the duration of the test; so it's hard to draw solid
>>>> >> conclusions from (for instance, for the 5-stream test, the average
>>>> >> throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7
>>>> >> it's 326 and 371 Mbps) . Unfortunately I don't have the same hardware
>>>> >> used in this test, so I can't go verify it myself; so the only thing I
>>>> >> can do is grumble about it here... :)
>>>> >
>>>> > It's a fair complaint and I agree with it. My counter argument is the
>>>> > opposite is true too: most ideal benchmarks don't measure what most
>>>> > users see. While the data wgong provided are way more noisy than I
>>>> > like, my overall "confidence" in the "conclusion" I offered is still
>>>> > positive.
>>>>
>>>> Right. I guess I would just prefer a slightly more comprehensive
>>>> evaluation to base a 4x increase in buffer size on...
>>>
>>> Kalle, is this why you didn't accept this patch? Other reasons?
>>>
>>> Toke, what else would you like to see evaluated?
>>>
>>> I generally want to see three things measured when "benchmarking"
>>> technologies: throughput, latency, cpu utilization
>>> We've covered those three I think "reasonably".
>>
>> Hmm, going back and looking at this (I'd completely forgotten about this
>> patch), I think I had two main concerns:
>>
>> 1. What happens in a degraded signal situation, where the throughput is
>>    limited by the signal conditions, or by contention with other devices.
>>    Both of these happen regularly, and I worry that latency will be
>>    badly affected under those conditions.
>>
>> 2. What happens with old hardware that has worse buffer management in
>>    the driver->firmware path (especially drivers without push/pull mode
>>    support)? For these, the lower-level queueing structure is less
>>    effective at controlling queueing latency.
>
> Do note that this patch changes behaviour _only_ for QCA6174 and QCA9377
> PCI devices, which IIRC do not even support push/pull mode. All the
> rest, including QCA988X and QCA9984 are unaffected.

Ah, right; I did not go all the way back and look at the actual patch,
so missed that :)

But in that case, why are the latency results that low? Were these tests
done with the ChromeOS queue limit patches?

-Toke
Toke Høiland-Jørgensen Feb. 21, 2019, 4:37 p.m. UTC | #25
Ben Greear <greearb@candelatech.com> writes:

> On 2/21/19 8:10 AM, Kalle Valo wrote:
>> Toke Høiland-Jørgensen <toke@toke.dk> writes:
>> 
>>> Grant Grundler <grundler@google.com> writes:
>>>
>>>> On Thu, Sep 6, 2018 at 3:18 AM Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>>>>>
>>>>> Grant Grundler <grundler@google.com> writes:
>>>>>
>>>>>>> And, well, Grant's data is from a single test in a noisy
>>>>>>> environment where the time series graph shows that throughput is all over
>>>>>>> the place for the duration of the test; so it's hard to draw solid
>>>>>>> conclusions from (for instance, for the 5-stream test, the average
>>>>>>> throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7
>>>>>>> it's 326 and 371 Mbps) . Unfortunately I don't have the same hardware
>>>>>>> used in this test, so I can't go verify it myself; so the only thing I
>>>>>>> can do is grumble about it here... :)
>>>>>>
>>>>>> It's a fair complaint and I agree with it. My counter argument is the
>>>>>> opposite is true too: most ideal benchmarks don't measure what most
>>>>>> users see. While the data wgong provided are way more noisy than I
>>>>>> like, my overall "confidence" in the "conclusion" I offered is still
>>>>>> positive.
>>>>>
>>>>> Right. I guess I would just prefer a slightly more comprehensive
>>>>> evaluation to base a 4x increase in buffer size on...
>>>>
>>>> Kalle, is this why you didn't accept this patch? Other reasons?
>>>>
>>>> Toke, what else would you like to see evaluated?
>>>>
>>>> I generally want to see three things measured when "benchmarking"
>>>> technologies: throughput, latency, cpu utilization
>>>> We've covered those three I think "reasonably".
>>>
>>> Hmm, going back and looking at this (I'd completely forgotten about this
>>> patch), I think I had two main concerns:
>>>
>>> 1. What happens in a degraded signal situation, where the throughput is
>>>     limited by the signal conditions, or by contention with other devices.
>>>     Both of these happen regularly, and I worry that latency will be
>>>     badly affected under those conditions.
>>>
>>> 2. What happens with old hardware that has worse buffer management in
>>>     the driver->firmware path (especially drivers without push/pull mode
>>>     support)? For these, the lower-level queueing structure is less
>>>     effective at controlling queueing latency.
>> 
>> Do note that this patch changes behaviour _only_ for QCA6174 and QCA9377
>> PCI devices, which IIRC do not even support push/pull mode. All the
>> rest, including QCA988X and QCA9984 are unaffected.
>
> Just as a note, at least kernels such as 4.14.whatever perform poorly when
> running ath10k on 9984 when acting as TCP endpoints.  This makes them not
> really usable for stuff like serving video to lots of clients.
>
> Tweaking TCP (I do it a bit differently, but either way) can significantly
> improve performance.

Differently how? Did you have to do more than fiddle with the pacing_shift?

> Recently I helped a user that could get barely 70 stations streaming
> at 1Mbps on stock kernel (using one wave1 on 2.4, one wave-2 on 5Ghz),
> and we got 110 working with a tweaked TCP stack. These were /n
> stations too.
>
> I think it is lame that it _still_ requires out of tree patches to
> make TCP work well on ath10k...even if you want to default to current
> behaviour, you should allow users to tweak it to work with their use
> case.

Well if TCP is broken to the point of being unusable I do think we
should fix it; but I think "just provide a configuration knob" should be
the last resort...

-Toke
Ben Greear Feb. 21, 2019, 4:57 p.m. UTC | #26
On 2/21/19 8:37 AM, Toke Høiland-Jørgensen wrote:
> Ben Greear <greearb@candelatech.com> writes:
> 
>> On 2/21/19 8:10 AM, Kalle Valo wrote:
>>> Toke Høiland-Jørgensen <toke@toke.dk> writes:
>>>
>>>> Grant Grundler <grundler@google.com> writes:
>>>>
>>>>> On Thu, Sep 6, 2018 at 3:18 AM Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>>>>>>
>>>>>> Grant Grundler <grundler@google.com> writes:
>>>>>>
>>>>>>>> And, well, Grant's data is from a single test in a noisy
>>>>>>>> environment where the time series graph shows that throughput is all over
>>>>>>>> the place for the duration of the test; so it's hard to draw solid
>>>>>>>> conclusions from (for instance, for the 5-stream test, the average
>>>>>>>> throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7
>>>>>>>> it's 326 and 371 Mbps) . Unfortunately I don't have the same hardware
>>>>>>>> used in this test, so I can't go verify it myself; so the only thing I
>>>>>>>> can do is grumble about it here... :)
>>>>>>>
>>>>>>> It's a fair complaint and I agree with it. My counter argument is the
>>>>>>> opposite is true too: most ideal benchmarks don't measure what most
>>>>>>> users see. While the data wgong provided are way more noisy than I
>>>>>>> like, my overall "confidence" in the "conclusion" I offered is still
>>>>>>> positive.
>>>>>>
>>>>>> Right. I guess I would just prefer a slightly more comprehensive
>>>>>> evaluation to base a 4x increase in buffer size on...
>>>>>
>>>>> Kalle, is this why you didn't accept this patch? Other reasons?
>>>>>
>>>>> Toke, what else would you like to see evaluated?
>>>>>
>>>>> I generally want to see three things measured when "benchmarking"
>>>>> technologies: throughput, latency, cpu utilization
>>>>> We've covered those three I think "reasonably".
>>>>
>>>> Hmm, going back and looking at this (I'd completely forgotten about this
>>>> patch), I think I had two main concerns:
>>>>
>>>> 1. What happens in a degraded signal situation, where the throughput is
>>>>      limited by the signal conditions, or by contention with other devices.
>>>>      Both of these happen regularly, and I worry that latency will be
>>>>      badly affected under those conditions.
>>>>
>>>> 2. What happens with old hardware that has worse buffer management in
>>>>      the driver->firmware path (especially drivers without push/pull mode
>>>>      support)? For these, the lower-level queueing structure is less
>>>>      effective at controlling queueing latency.
>>>
>>> Do note that this patch changes behaviour _only_ for QCA6174 and QCA9377
>>> PCI devices, which IIRC do not even support push/pull mode. All the
>>> rest, including QCA988X and QCA9984 are unaffected.
>>
>> Just as a note, at least kernels such as 4.14.whatever perform poorly when
>> running ath10k on 9984 when acting as TCP endpoints.  This makes them not
>> really usable for stuff like serving video to lots of clients.
>>
>> Tweaking TCP (I do it a bit differently, but either way) can significantly
>> improve performance.
> 
> Differently how? Did you have to do more than fiddle with the pacing_shift?

This one, or a slightly tweaked version that applies to different kernels:

https://github.com/greearb/linux-ct-4.16/commit/3e14e8491a5b31ce994fb2752347145e6ab7eaf5

>> Recently I helped a user that could get barely 70 stations streaming
>> at 1Mbps on stock kernel (using one wave1 on 2.4, one wave-2 on 5Ghz),
>> and we got 110 working with a tweaked TCP stack. These were /n
>> stations too.
>>
>> I think it is lame that it _still_ requires out of tree patches to
>> make TCP work well on ath10k...even if you want to default to current
>> behaviour, you should allow users to tweak it to work with their use
>> case.
> 
> Well if TCP is broken to the point of being unusable I do think we
> should fix it; but I think "just provide a configuration knob" should be
> the last resort...

So, it has been broken for years, and waiting for a perfect solution has not
gotten the problem fixed.

Thanks,
Ben
Toke Høiland-Jørgensen Feb. 21, 2019, 5:15 p.m. UTC | #27
Ben Greear <greearb@candelatech.com> writes:

> On 2/21/19 8:37 AM, Toke Høiland-Jørgensen wrote:
>> Ben Greear <greearb@candelatech.com> writes:
>> 
>>> On 2/21/19 8:10 AM, Kalle Valo wrote:
>>>> Toke Høiland-Jørgensen <toke@toke.dk> writes:
>>>>
>>>>> Grant Grundler <grundler@google.com> writes:
>>>>>
>>>>>> On Thu, Sep 6, 2018 at 3:18 AM Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>>>>>>>
>>>>>>> Grant Grundler <grundler@google.com> writes:
>>>>>>>
>>>>>>>>> And, well, Grant's data is from a single test in a noisy
>>>>>>>>> environment where the time series graph shows that throughput is all over
>>>>>>>>> the place for the duration of the test; so it's hard to draw solid
>>>>>>>>> conclusions from (for instance, for the 5-stream test, the average
>>>>>>>>> throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7
>>>>>>>>> it's 326 and 371 Mbps) . Unfortunately I don't have the same hardware
>>>>>>>>> used in this test, so I can't go verify it myself; so the only thing I
>>>>>>>>> can do is grumble about it here... :)
>>>>>>>>
>>>>>>>> It's a fair complaint and I agree with it. My counter argument is the
>>>>>>>> opposite is true too: most ideal benchmarks don't measure what most
>>>>>>>> users see. While the data wgong provided are way more noisy than I
>>>>>>>> like, my overall "confidence" in the "conclusion" I offered is still
>>>>>>>> positive.
>>>>>>>
>>>>>>> Right. I guess I would just prefer a slightly more comprehensive
>>>>>>> evaluation to base a 4x increase in buffer size on...
>>>>>>
>>>>>> Kalle, is this why you didn't accept this patch? Other reasons?
>>>>>>
>>>>>> Toke, what else would you like to see evaluated?
>>>>>>
>>>>>> I generally want to see three things measured when "benchmarking"
>>>>>> technologies: throughput, latency, cpu utilization
>>>>>> We've covered those three I think "reasonably".
>>>>>
>>>>> Hmm, going back and looking at this (I'd completely forgotten about this
>>>>> patch), I think I had two main concerns:
>>>>>
>>>>> 1. What happens in a degraded signal situation, where the throughput is
>>>>>      limited by the signal conditions, or by contention with other devices.
>>>>>      Both of these happen regularly, and I worry that latency will be
>>>>>      badly affected under those conditions.
>>>>>
>>>>> 2. What happens with old hardware that has worse buffer management in
>>>>>      the driver->firmware path (especially drivers without push/pull mode
>>>>>      support)? For these, the lower-level queueing structure is less
>>>>>      effective at controlling queueing latency.
>>>>
>>>> Do note that this patch changes behaviour _only_ for QCA6174 and QCA9377
>>>> PCI devices, which IIRC do not even support push/pull mode. All the
>>>> rest, including QCA988X and QCA9984 are unaffected.
>>>
>>> Just as a note, at least kernels such as 4.14.whatever perform poorly when
>>> running ath10k on 9984 when acting as TCP endpoints.  This makes them not
>>> really usable for stuff like serving video to lots of clients.
>>>
>>> Tweaking TCP (I do it a bit differently, but either way) can significantly
>>> improve performance.
>> 
>> Differently how? Did you have to do more than fiddle with the pacing_shift?
>
> This one, or a slightly tweaked version that applies to different kernels:
>
> https://github.com/greearb/linux-ct-4.16/commit/3e14e8491a5b31ce994fb2752347145e6ab7eaf5

Right; but the current mac80211 default (pacing shift 8) corresponds to
setting your sysctl to 4...

>>> Recently I helped a user that could get barely 70 stations streaming
>>> at 1Mbps on stock kernel (using one wave1 on 2.4, one wave-2 on 5Ghz),
>>> and we got 110 working with a tweaked TCP stack. These were /n
>>> stations too.
>>>
>>> I think it is lame that it _still_ requires out of tree patches to
>>> make TCP work well on ath10k...even if you want to default to current
>>> behaviour, you should allow users to tweak it to work with their use
>>> case.
>> 
>> Well if TCP is broken to the point of being unusable I do think we
>> should fix it; but I think "just provide a configuration knob" should be
>> the last resort...
>
> So, it has been broken for years, and waiting for a perfect solution
> has not gotten the problem fixed.

Well, the current default should at least be closer to something that
works well.

I do think I may have erred on the wrong side of the optimum when I
submitted the original patch to set the default to 8; that should
probably have been 7 (i.e., 8 ms; the optimum in the evaluation we did
was around 6 ms, which is sadly not a power of two). Maybe changing that
default is actually better than having to redo the testing for all the
different devices as we're discussing in the context of this patch.
Maybe I should just send a patch to do that...

-Toke
Ben Greear Feb. 21, 2019, 5:29 p.m. UTC | #28
On 2/21/19 9:15 AM, Toke Høiland-Jørgensen wrote:
> Ben Greear <greearb@candelatech.com> writes:
> 
>> On 2/21/19 8:37 AM, Toke Høiland-Jørgensen wrote:
>>> Ben Greear <greearb@candelatech.com> writes:
>>>
>>>> On 2/21/19 8:10 AM, Kalle Valo wrote:
>>>>> Toke Høiland-Jørgensen <toke@toke.dk> writes:
>>>>>
>>>>>> Grant Grundler <grundler@google.com> writes:
>>>>>>
>>>>>>> On Thu, Sep 6, 2018 at 3:18 AM Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>>>>>>>>
>>>>>>>> Grant Grundler <grundler@google.com> writes:
>>>>>>>>
>>>>>>>>>> And, well, Grant's data is from a single test in a noisy
>>>>>>>>>> environment where the time series graph shows that throughput is all over
>>>>>>>>>> the place for the duration of the test; so it's hard to draw solid
>>>>>>>>>> conclusions from (for instance, for the 5-stream test, the average
>>>>>>>>>> throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7
>>>>>>>>>> it's 326 and 371 Mbps) . Unfortunately I don't have the same hardware
>>>>>>>>>> used in this test, so I can't go verify it myself; so the only thing I
>>>>>>>>>> can do is grumble about it here... :)
>>>>>>>>>
>>>>>>>>> It's a fair complaint and I agree with it. My counter argument is the
>>>>>>>>> opposite is true too: most ideal benchmarks don't measure what most
>>>>>>>>> users see. While the data wgong provided are way more noisy than I
>>>>>>>>> like, my overall "confidence" in the "conclusion" I offered is still
>>>>>>>>> positive.
>>>>>>>>
>>>>>>>> Right. I guess I would just prefer a slightly more comprehensive
>>>>>>>> evaluation to base a 4x increase in buffer size on...
>>>>>>>
>>>>>>> Kalle, is this why you didn't accept this patch? Other reasons?
>>>>>>>
>>>>>>> Toke, what else would you like to see evaluated?
>>>>>>>
>>>>>>> I generally want to see three things measured when "benchmarking"
>>>>>>> technologies: throughput, latency, cpu utilization
>>>>>>> We've covered those three I think "reasonably".
>>>>>>
>>>>>> Hmm, going back and looking at this (I'd completely forgotten about this
>>>>>> patch), I think I had two main concerns:
>>>>>>
>>>>>> 1. What happens in a degraded signal situation, where the throughput is
>>>>>>       limited by the signal conditions, or by contention with other devices.
>>>>>>       Both of these happen regularly, and I worry that latency will be
>>>>>>       badly affected under those conditions.
>>>>>>
>>>>>> 2. What happens with old hardware that has worse buffer management in
>>>>>>       the driver->firmware path (especially drivers without push/pull mode
>>>>>>       support)? For these, the lower-level queueing structure is less
>>>>>>       effective at controlling queueing latency.
>>>>>
>>>>> Do note that this patch changes behaviour _only_ for QCA6174 and QCA9377
>>>>> PCI devices, which IIRC do not even support push/pull mode. All the
>>>>> rest, including QCA988X and QCA9984 are unaffected.
>>>>
>>>> Just as a note, at least kernels such as 4.14.whatever perform poorly when
>>>> running ath10k on 9984 when acting as TCP endpoints.  This makes them not
>>>> really usable for stuff like serving video to lots of clients.
>>>>
>>>> Tweaking TCP (I do it a bit differently, but either way) can significantly
>>>> improve performance.
>>>
>>> Differently how? Did you have to do more than fiddle with the pacing_shift?
>>
>> This one, or a slightly tweaked version that applies to different kernels:
>>
>> https://github.com/greearb/linux-ct-4.16/commit/3e14e8491a5b31ce994fb2752347145e6ab7eaf5
> 
> Right; but the current mac80211 default (pacing shift 8) corresponds to
> setting your sysctl to 4...
> 
>>>> Recently I helped a user that could get barely 70 stations streaming
>>>> at 1Mbps on stock kernel (using one wave1 on 2.4, one wave-2 on 5Ghz),
>>>> and we got 110 working with a tweaked TCP stack. These were /n
>>>> stations too.
>>>>
>>>> I think it is lame that it _still_ requires out of tree patches to
>>>> make TCP work well on ath10k...even if you want to default to current
>>>> behaviour, you should allow users to tweak it to work with their use
>>>> case.
>>>
>>> Well if TCP is broken to the point of being unusable I do think we
>>> should fix it; but I think "just provide a configuration knob" should be
>>> the last resort...
>>
>> So, it has been broken for years, and waiting for a perfect solution
>> has not gotten the problem fixed.
> 
> Well, the current default should at least be closer to something that
> works well.
> 
> I do think I may have erred on the wrong side of the optimum when I
> submitted the original patch to set the default to 8; that should
> probably have been 7 (i.e., 8 ms; the optimum in the evaluation we did
> was around 6 ms, which is sadly not a power of two). Maybe changing that
> default is actually better than having to redo the testing for all the
> different devices as we're discussing in the context of this patch.
> Maybe I should just send a patch to do that...

It is hubris to think one setting works well for everyone.  Sure, set a good
default, but also let people tune the value.

And send the patches to stable so that users on older kernels can have good
performance.

Thanks,
Ben
Toke Høiland-Jørgensen Feb. 21, 2019, 10:50 p.m. UTC | #29
Ben Greear <greearb@candelatech.com> writes:

> On 2/21/19 9:15 AM, Toke Høiland-Jørgensen wrote:
>> Ben Greear <greearb@candelatech.com> writes:
>> 
>>> On 2/21/19 8:37 AM, Toke Høiland-Jørgensen wrote:
>>>> Ben Greear <greearb@candelatech.com> writes:
>>>>
>>>>> On 2/21/19 8:10 AM, Kalle Valo wrote:
>>>>>> Toke Høiland-Jørgensen <toke@toke.dk> writes:
>>>>>>
>>>>>>> Grant Grundler <grundler@google.com> writes:
>>>>>>>
>>>>>>>> On Thu, Sep 6, 2018 at 3:18 AM Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>>>>>>>>>
>>>>>>>>> Grant Grundler <grundler@google.com> writes:
>>>>>>>>>
>>>>>>>>>>> And, well, Grant's data is from a single test in a noisy
>>>>>>>>>>> environment where the time series graph shows that throughput is all over
>>>>>>>>>>> the place for the duration of the test; so it's hard to draw solid
>>>>>>>>>>> conclusions from (for instance, for the 5-stream test, the average
>>>>>>>>>>> throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7
>>>>>>>>>>> it's 326 and 371 Mbps) . Unfortunately I don't have the same hardware
>>>>>>>>>>> used in this test, so I can't go verify it myself; so the only thing I
>>>>>>>>>>> can do is grumble about it here... :)
>>>>>>>>>>
>>>>>>>>>> It's a fair complaint and I agree with it. My counter argument is the
>>>>>>>>>> opposite is true too: most ideal benchmarks don't measure what most
>>>>>>>>>> users see. While the data wgong provided are way more noisy than I
>>>>>>>>>> like, my overall "confidence" in the "conclusion" I offered is still
>>>>>>>>>> positive.
>>>>>>>>>
>>>>>>>>> Right. I guess I would just prefer a slightly more comprehensive
>>>>>>>>> evaluation to base a 4x increase in buffer size on...
>>>>>>>>
>>>>>>>> Kalle, is this why you didn't accept this patch? Other reasons?
>>>>>>>>
>>>>>>>> Toke, what else would you like to see evaluated?
>>>>>>>>
>>>>>>>> I generally want to see three things measured when "benchmarking"
>>>>>>>> technologies: throughput, latency, cpu utilization
>>>>>>>> We've covered those three I think "reasonably".
>>>>>>>
>>>>>>> Hmm, going back and looking at this (I'd completely forgotten about this
>>>>>>> patch), I think I had two main concerns:
>>>>>>>
>>>>>>> 1. What happens in a degraded signal situation, where the throughput is
>>>>>>>       limited by the signal conditions, or by contention with other devices.
>>>>>>>       Both of these happen regularly, and I worry that latency will be
>>>>>>>       badly affected under those conditions.
>>>>>>>
>>>>>>> 2. What happens with old hardware that has worse buffer management in
>>>>>>>       the driver->firmware path (especially drivers without push/pull mode
>>>>>>>       support)? For these, the lower-level queueing structure is less
>>>>>>>       effective at controlling queueing latency.
>>>>>>
>>>>>> Do note that this patch changes behaviour _only_ for QCA6174 and QCA9377
>>>>>> PCI devices, which IIRC do not even support push/pull mode. All the
>>>>>> rest, including QCA988X and QCA9984 are unaffected.
>>>>>
>>>>> Just as a note, at least kernels such as 4.14.whatever perform poorly when
>>>>> running ath10k on 9984 when acting as TCP endpoints.  This makes them not
>>>>> really usable for stuff like serving video to lots of clients.
>>>>>
>>>>> Tweaking TCP (I do it a bit differently, but either way) can significantly
>>>>> improve performance.
>>>>
>>>> Differently how? Did you have to do more than fiddle with the pacing_shift?
>>>
>>> This one, or a slightly tweaked version that applies to different kernels:
>>>
>>> https://github.com/greearb/linux-ct-4.16/commit/3e14e8491a5b31ce994fb2752347145e6ab7eaf5
>> 
>> Right; but the current mac80211 default (pacing shift 8) corresponds to
>> setting your sysctl to 4...
>> 
>>>>> Recently I helped a user that could get barely 70 stations streaming
>>>>> at 1Mbps on stock kernel (using one wave1 on 2.4, one wave-2 on 5Ghz),
>>>>> and we got 110 working with a tweaked TCP stack. These were /n
>>>>> stations too.
>>>>>
>>>>> I think it is lame that it _still_ requires out of tree patches to
>>>>> make TCP work well on ath10k...even if you want to default to current
>>>>> behaviour, you should allow users to tweak it to work with their use
>>>>> case.
>>>>
>>>> Well if TCP is broken to the point of being unusable I do think we
>>>> should fix it; but I think "just provide a configuration knob" should be
>>>> the last resort...
>>>
>>> So, it has been broken for years, and waiting for a perfect solution
>>> has not gotten the problem fixed.
>> 
>> Well, the current default should at least be closer to something that
>> works well.
>> 
>> I do think I may have erred on the wrong side of the optimum when I
>> submitted the original patch to set the default to 8; that should
>> probably have been 7 (i.e., 8 ms; the optimum in the evaluation we did
>> was around 6 ms, which is sadly not a power of two). Maybe changing that
>> default is actually better than having to redo the testing for all the
>> different devices as we're discussing in the context of this patch.
>> Maybe I should just send a patch to do that...
>
> It is hubris to think one setting works well for everyone. Sure, set a
> good default, but also let people tune the value.

Well I certainly won't object to a knob; I just don't think that most
people are going to use it, so we better make the default reasonable.

> And send the patches to stable so that users on older kernels can have
> good performance.

Sure, I can submit stable backports if needed :)

-Toke
Kalle Valo April 23, 2020, 6:31 a.m. UTC | #30
Wen Gong <wgong@codeaurora.org> wrote:

> Upstream kernel has an interface to help adjust sk_pacing_shift to help
> improve TCP UL throughput.
> The sk_pacing_shift is 8 in mac80211, this is based on test with 11N
> WiFi chips with ath9k. For QCA6174/QCA9377 PCI 11AC chips, the 11AC
> VHT80 TCP UL throughput testing result shows 6 is the optimal.
> Overwrite the sk_pacing_shift to 6 in ath10k driver for QCA6174/9377 PCI.
> 
> Tested with QCA6174 PCI with firmware
> WLAN.RM.4.4.1-00109-QCARMSWPZ-1, but this will also affect QCA9377 PCI.
> It's not a regression with new firmware releases.
> 
> There have 2 test result of different settings:
> 
> ARM CPU based device with QCA6174A PCI with different
> sk_pacing_shift:
> 
>  sk_pacing_shift  throughput(Mbps)             CPU utilization
>          6            500(-P5)      ~75% idle, Focus on CPU1: ~14%idle
>          7            454(-P5)      ~80% idle, Focus on CPU1: ~4%idle
>          8               288        ~90% idle, Focus on CPU1: ~35%idle
>          9              ~200        ~92% idle, Focus on CPU1: ~50%idle
> 
> 5G TCP UL VTH80 on X86 platform with QCA6174A PCI with sk_packing_shift
> set to 6:
> 
>   tcp_limit_output_bytes            throughput(Mbps)
>  default(262144)+1 Stream                 336
>  default(262144)+2 Streams                558
>  default(262144)+3 Streams                584
>  default(262144)+4 Streams                602
>  default(262144)+5 Streams                598
>  changed(2621440)+1 Stream                598
>  changed(2621440)+2 Streams               601
> 
> Signed-off-by: Wen Gong <wgong@codeaurora.org>

The final result of this patch is unclear so I'm dropping this. Please
resend if the issue still exists.

Patch set to Changes Requested.
diff mbox series

Patch

diff --git a/drivers/net/wireless/ath/ath10k/core.c b/drivers/net/wireless/ath/ath10k/core.c
index 85c58eb..fbd13ec 100644
--- a/drivers/net/wireless/ath/ath10k/core.c
+++ b/drivers/net/wireless/ath/ath10k/core.c
@@ -189,6 +189,7 @@ 
 		.per_ce_irq = false,
 		.shadow_reg_support = false,
 		.rri_on_ddr = false,
+		.tx_sk_pacing_shift = SK_PACING_SHIFT_6174,
 	},
 	{
 		.id = QCA6174_HW_2_1_VERSION,
@@ -221,6 +222,7 @@ 
 		.per_ce_irq = false,
 		.shadow_reg_support = false,
 		.rri_on_ddr = false,
+		.tx_sk_pacing_shift = SK_PACING_SHIFT_6174,
 	},
 	{
 		.id = QCA6174_HW_3_0_VERSION,
@@ -253,6 +255,7 @@ 
 		.per_ce_irq = false,
 		.shadow_reg_support = false,
 		.rri_on_ddr = false,
+		.tx_sk_pacing_shift = SK_PACING_SHIFT_6174,
 	},
 	{
 		.id = QCA6174_HW_3_2_VERSION,
@@ -288,6 +291,7 @@ 
 		.per_ce_irq = false,
 		.shadow_reg_support = false,
 		.rri_on_ddr = false,
+		.tx_sk_pacing_shift = SK_PACING_SHIFT_6174,
 	},
 	{
 		.id = QCA99X0_HW_2_0_DEV_VERSION,
@@ -443,6 +447,7 @@ 
 		.per_ce_irq = false,
 		.shadow_reg_support = false,
 		.rri_on_ddr = false,
+		.tx_sk_pacing_shift = SK_PACING_SHIFT_9377,
 	},
 	{
 		.id = QCA9377_HW_1_1_DEV_VERSION,
@@ -477,6 +482,7 @@ 
 		.per_ce_irq = false,
 		.shadow_reg_support = false,
 		.rri_on_ddr = false,
+		.tx_sk_pacing_shift = SK_PACING_SHIFT_9377,
 	},
 	{
 		.id = QCA4019_HW_1_0_DEV_VERSION,
diff --git a/drivers/net/wireless/ath/ath10k/hw.h b/drivers/net/wireless/ath/ath10k/hw.h
index a274bd8..1f956d6 100644
--- a/drivers/net/wireless/ath/ath10k/hw.h
+++ b/drivers/net/wireless/ath/ath10k/hw.h
@@ -161,6 +161,9 @@  enum qca9377_chip_id_rev {
 
 #define REG_DUMP_COUNT_QCA988X 60
 
+#define SK_PACING_SHIFT_6174 6
+#define SK_PACING_SHIFT_9377 6
+
 struct ath10k_fw_ie {
 	__le32 id;
 	__le32 len;
@@ -589,6 +592,9 @@  struct ath10k_hw_params {
 
 	/* Number of bytes to be the offset for each FFT sample */
 	int spectral_bin_offset;
+
+	/* Number of shift to override the default value of ieee80211_hw*/
+	u8 tx_sk_pacing_shift;
 };
 
 struct htt_rx_desc;
diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c
index 95243b4..4f2c07f 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -8361,6 +8361,9 @@  int ath10k_mac_register(struct ath10k *ar)
 	ar->hw->wiphy->max_scan_ssids = WLAN_SCAN_PARAMS_MAX_SSID;
 	ar->hw->wiphy->max_scan_ie_len = WLAN_SCAN_PARAMS_MAX_IE_LEN;
 
+	if (ar->hw_params.tx_sk_pacing_shift != 0)
+		ar->hw->tx_sk_pacing_shift = ar->hw_params.tx_sk_pacing_shift;
+
 	ar->hw->vif_data_size = sizeof(struct ath10k_vif);
 	ar->hw->sta_data_size = sizeof(struct ath10k_sta);
 	ar->hw->txq_data_size = sizeof(struct ath10k_txq);