Throughput regression with `tcp: refine TSO autosizing`

Message ID	CA+BoTQmcShK0U_cXvEOLY_8y7LH8x3taTgjcyMzv0MLVn4UtCA@mail.gmail.com (mailing list archive)
State	Not Applicable
Headers	show Return-Path: <linux-wireless-owner@kernel.org> MIME-Version: 1.0 In-Reply-To: <1423142342.31870.49.camel@edumazet-glaptop2.roam.corp.google.com> References: <CA+BoTQkVu23P3EOmY_Q3E1GJnWsyF==Pawz4iPOS_Bq5dvfO5Q@mail.gmail.com> <1422537297.21689.15.camel@edumazet-glaptop2.roam.corp.google.com> <CA+BoTQk2xT-8DqPuiiKG+kHAjLPrj8F9dLTb-rcGhvMq0u_2Qw@mail.gmail.com> <1422628835.21689.95.camel@edumazet-glaptop2.roam.corp.google.com> <CA+BoTQkV+mOZfe_Niz5101sMQeaV6muKCsShptjGQ1AgOHqqoQ@mail.gmail.com> <1422903136.21689.114.camel@edumazet-glaptop2.roam.corp.google.com> <1422926330.21689.138.camel@edumazet-glaptop2.roam.corp.google.com> <CA+BoTQkMikA8wxm1ce2DkKhPB0HiKeAqT7f+sQ=91W40z=X0Rg@mail.gmail.com> <1422973660.907.10.camel@edumazet-glaptop2.roam.corp.google.com> <CA+BoTQmvUuFdfYF=wVMYxrf_nQZB5GCV=LvDZVvfs-3hAE4WKw@mail.gmail.com> <1423051045.907.108.camel@edumazet-glaptop2.roam.corp.google.com> <CA+BoTQ=BDcQ779uKCuX+f40=4npXVF4MTQnpjKimNYAxPsxBoQ@mail.gmail.com> <1423053531.907.115.camel@edumazet-glaptop2.roam.corp.google.com> <CA+BoTQ=qmCZz4CmSOvCOzMLowrDEG12XBffkTcYxjGqVD9604g@mail.gmail.com> <1423055810.907.125.camel@edumazet-glaptop2.roam.corp.google.com> <1423056591.907.130.camel@edumazet-glaptop2.roam.corp.google.com> <1423084303.31870.15.camel@edumazet-glaptop2.roam.corp.google.com> <CA+BoTQnnwrv0nrKyGyQNvosz_E4e5fBa9iN8fpeqcd-iRfi17g@mail.gmail.com> <1423141038.31870.38.camel@edumazet-glaptop2.roam.corp.google.com> <1423142342.31870.49.camel@edumazet-glaptop2.roam.corp.google.com> Date: Thu, 5 Feb 2015 14:44:01 +0100 Message-ID: <CA+BoTQmcShK0U_cXvEOLY_8y7LH8x3taTgjcyMzv0MLVn4UtCA@mail.gmail.com> Subject: Re: Throughput regression with `tcp: refine TSO autosizing` From: Michal Kazior <michal.kazior@tieto.com> To: Eric Dumazet <eric.dumazet@gmail.com> Cc: Neal Cardwell <ncardwell@google.com>, linux-wireless <linux-wireless@vger.kernel.org>, Network Development <netdev@vger.kernel.org>, eyalpe@dev.mellanox.co.il Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8BIT Sender: linux-wireless-owner@vger.kernel.org Precedence: bulk

Michal Kazior Feb. 5, 2015, 1:44 p.m. UTC

On 5 February 2015 at 14:19, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2015-02-05 at 04:57 -0800, Eric Dumazet wrote:
>
>> The intention is to control the queues to the following :
>>
>> 1 ms of buffering, but limited to a configurable value.
>>
>> On a 40Gbps flow, 1ms represents 5 MB, which is insane.
>>
>> We do not want to queue 5 MB of traffic, this would destroy latencies
>> for all concurrent flows. (Or would require having fq_codel or fq as
>> packet schedulers, instead of default pfifo_fast)
>>
>> This is why having 1.5 ms delay between the transmit and TX completion
>> is a problem in your case.

I do get your point. But 1.5ms is really tough on Wi-Fi.

Just look at this:

; ping 192.168.1.2 -c 3
PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=1.83 ms
64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=2.02 ms
64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=1.98 ms

; ping 192.168.1.2 -c 3 -Q 224
PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=0.939 ms
64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=0.906 ms
64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=0.946 ms

This was run with no load so batching code in the driver itself should
have no measurable effect. The channel was near-ideal: low noise
floor, cabled rf, no other traffic.

The lower latency ping is when 802.11 QoS Voice Access Category is
used. I also get 400mbps instead of 250mbps in this case with 5 flows
(net/master).

Dealing with black box firmware blobs is a pain.


> Note that TCP stack could detect when this happens, *if* ACK where
> delivered before the TX completions, or when TX completion happens,
> we could detect that the clone of the freed packet was freed.
>
> In my test, when I did "ethtool -C eth0 tx-usecs 1024 tx-frames 64", and
> disabling GSO, TCP stack sends a bunch of packets (a bit less than 64),
> blocks on tcp_limit_output_bytes.
>
> Then we receive 2 stretch ACKS after ~50 usec.
>
> TCP stack tries to push again some packets but blocks on
> tcp_limit_output_bytes again.
>
> 1ms later, TX completion happens, tcp_wfree() is called, and TCP stack
> push following ~60 packets.
>
>
> TCP could  eventually dynamically adjust the tcp_limit_output_bytes,
> using a per flow dynamic value, but I would rather not add a kludge in
> TCP stack only to deal with a possible bug in ath10k driver.
>
> niu has a similar issue and simply had to call skb_orphan() :
>
> drivers/net/ethernet/sun/niu.c:6669:            skb_orphan(skb);

Ok. I tried calling skb_orphan() right after I submit each Tx frame
(similar to niu which does this in start_xmit):

 err_unmap_msdu:


Now, with {net/master + ath10k GRO + the above} I get 620mbps on a
single flow (even better then before). Wow.

Does this look ok/safe as a solution to you?


Micha?
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric Dumazet Feb. 5, 2015, 2:41 p.m. UTC | #1

On Thu, 2015-02-05 at 14:44 +0100, Michal Kazior wrote:

> Ok. I tried calling skb_orphan() right after I submit each Tx frame
> (similar to niu which does this in start_xmit):
> 
> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
> @@ -564,6 +564,8 @@ int ath10k_htt_tx(struct ath10k_htt *htt, struct
> sk_buff *msdu)
>         if (res)
>                 goto err_unmap_msdu;
> 
> +       skb_orphan(msdu);
> +
>         return 0;
> 
>  err_unmap_msdu:
> 
> 
> Now, with {net/master + ath10k GRO + the above} I get 620mbps on a
> single flow (even better then before). Wow.
> 
> Does this look ok/safe as a solution to you?

Not at all. This basically removes backpressure.

A single UDP socket can now blast packets regardless of SO_SNDBUF
limits.

This basically remove years of work trying to fix bufferbloat.

I still do not understand why increasing tcp_limit_output_bytes is not
working for you.




--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric Dumazet Feb. 5, 2015, 2:48 p.m. UTC | #2

On Thu, 2015-02-05 at 14:44 +0100, Michal Kazior wrote:

> I do get your point. But 1.5ms is really tough on Wi-Fi.
> 
> Just look at this:
> 
> ; ping 192.168.1.2 -c 3
> PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
> 64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=1.83 ms
> 64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=2.02 ms
> 64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=1.98 ms

Thats a different point.

I dont care about rtt but TX completions. (usually much much lower than
rtt)

I can have a 4 usec delay from the moment a NIC submits a packet to the
wire and I get TX completion IRQ, free the packet.

Yet the pong reply can come 100 ms later.

It does not mean the 4 usec delay is a problem.

--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Dave Taht Feb. 5, 2015, 7:50 p.m. UTC | #3

On Fri, Feb 6, 2015 at 2:44 AM, Michal Kazior <michal.kazior@tieto.com> wrote:
>
> On 5 February 2015 at 14:19, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Thu, 2015-02-05 at 04:57 -0800, Eric Dumazet wrote:
> >
> >> The intention is to control the queues to the following :
> >>
> >> 1 ms of buffering, but limited to a configurable value.
> >>
> >> On a 40Gbps flow, 1ms represents 5 MB, which is insane.
> >>
> >> We do not want to queue 5 MB of traffic, this would destroy latencies
> >> for all concurrent flows. (Or would require having fq_codel or fq as
> >> packet schedulers, instead of default pfifo_fast)
> >>
> >> This is why having 1.5 ms delay between the transmit and TX completion
> >> is a problem in your case.
>
> I do get your point. But 1.5ms is really tough on Wi-Fi.
>
> Just look at this:
>
> ; ping 192.168.1.2 -c 3
> PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
> 64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=1.83 ms
> 64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=2.02 ms
> 64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=1.98 ms
>
> ; ping 192.168.1.2 -c 3 -Q 224
> PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
> 64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=0.939 ms
> 64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=0.906 ms
> 64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=0.946 ms
>
> This was run with no load so batching code in the driver itself should
> have no measurable effect. The channel was near-ideal: low noise
> floor, cabled rf, no other traffic.
>
> The lower latency ping is when 802.11 QoS Voice Access Category is
> used. I also get 400mbps instead of 250mbps in this case with 5 flows
> (net/master).
>

The VO queue is now nearly useless in a real world environment. Whlle
it does grab the media mildly faster in some cases, on a good day with
no other competing APs, it cannot aggregate packets, and wastes TXOPS.
It is far saner to aim for better aggregate (use the VI queue if you
must try to get better media acquisition).

It is disabled in multiple products I know of.

And I really, really, really wish, that just once during this thread,
someone had bothered to try running a test
at a real world MCS rate - say MCS1, or MCS4, and measured the latency
under load of that...

or tried talking to two or more stations at the same time.

Instead of trying for 1.5Gbits in a faraday cage.


>
> Dealing with black box firmware blobs is a pain.
>

+10

>
>
> > Note that TCP stack could detect when this happens, *if* ACK where
> > delivered before the TX completions, or when TX completion happens,
> > we could detect that the clone of the freed packet was freed.
> >
> > In my test, when I did "ethtool -C eth0 tx-usecs 1024 tx-frames 64", and
> > disabling GSO, TCP stack sends a bunch of packets (a bit less than 64),
> > blocks on tcp_limit_output_bytes.
> >
> > Then we receive 2 stretch ACKS after ~50 usec.
> >
> > TCP stack tries to push again some packets but blocks on
> > tcp_limit_output_bytes again.
> >
> > 1ms later, TX completion happens, tcp_wfree() is called, and TCP stack
> > push following ~60 packets.
> >
> >
> > TCP could  eventually dynamically adjust the tcp_limit_output_bytes,
> > using a per flow dynamic value, but I would rather not add a kludge in
> > TCP stack only to deal with a possible bug in ath10k driver.
> >
> > niu has a similar issue and simply had to call skb_orphan() :
> >
> > drivers/net/ethernet/sun/niu.c:6669:            skb_orphan(skb);
>
> Ok. I tried calling skb_orphan() right after I submit each Tx frame
> (similar to niu which does this in start_xmit):
>
> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
> @@ -564,6 +564,8 @@ int ath10k_htt_tx(struct ath10k_htt *htt, struct
> sk_buff *msdu)
>         if (res)
>                 goto err_unmap_msdu;
>
> +       skb_orphan(msdu);
> +
>         return 0;
>
>  err_unmap_msdu:
>
>
> Now, with {net/master + ath10k GRO + the above} I get 620mbps on a
> single flow (even better then before). Wow.
>
> Does this look ok/safe as a solution to you?
>
>
> Micha?
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Nicolas Cavallari Feb. 6, 2015, 9:39 a.m. UTC | #4

On 05/02/2015 15:48, Eric Dumazet wrote:
> On Thu, 2015-02-05 at 14:44 +0100, Michal Kazior wrote:
> 
>> I do get your point. But 1.5ms is really tough on Wi-Fi.
>>
>> Just look at this:
>>
>> ; ping 192.168.1.2 -c 3
>> PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
>> 64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=1.83 ms
>> 64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=2.02 ms
>> 64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=1.98 ms
> 
> Thats a different point.
> 
> I dont care about rtt but TX completions. (usually much much lower than
> rtt)

On wired network perhaps, but definitely not on Wi-Fi.

With aggregation, you may send up to 4ms of data before the receiver
can acknowledge anything. But you have to gain access to the channel
first, so you may wait while others finish off their 4ms
transmissions. And this does not account for retransmissions.

And aggregation is not the only problem as far as bufferbloat is
concerned. I don't even want to think about powersave.
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Michal Kazior Feb. 6, 2015, 9:57 a.m. UTC | #5

On 5 February 2015 at 20:50, Dave Taht <dave.taht@gmail.com> wrote:
[...]
> And I really, really, really wish, that just once during this thread,
> someone had bothered to try running a test
> at a real world MCS rate - say MCS1, or MCS4, and measured the latency
> under load of that...

Time between frame submission to firmware and tx-completion on one of
my ath10k machines:

Legacy 54mbps: ~18ms
Legacy 6mbps: ~37ms
11n MCS 3 (nss=0): ~13ms
11n MCS 8 (nss=1): ~6-8ms
11ac NSS=1 MCS=2: ~4-6ms
11ac NSS=2 MCS=0: ~5-8ms

Keep in mind this is a clean room environment so retransmissions are
kept at minimum. Obviously with a noisy environment you'll get retries
at different rates and higher latency.


Micha?
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Throughput regression with `tcp: refine TSO autosizing`

Commit Message

Comments

Patch