
Throughput regression with `tcp: refine TSO autosizing`

Message ID 1423084303.31870.15.camel@edumazet-glaptop2.roam.corp.google.com (mailing list archive)
State Not Applicable

Commit Message

Eric Dumazet Feb. 4, 2015, 9:11 p.m. UTC
I do not see how a TSO patch could hurt a flow not using TSO/GSO.

This makes no sense.

ath10k tx completions being batched/deferred to a tasklet might increase
the probability of hitting this condition in tcp_wfree():

        /* If this softirq is serviced by ksoftirqd, we are likely under stress.
         * Wait until our queues (qdisc + devices) are drained.
         * This gives :
         * - less callbacks to tcp_write_xmit(), reducing stress (batches)
         * - chance for incoming ACK (processed by another cpu maybe)
         *   to migrate this flow (skb->ooo_okay will be eventually set)
         */
        if (wmem >= SKB_TRUESIZE(1) && this_cpu_ksoftirqd() == current)
                goto out;

Meaning the TCP stack waits until all skbs have left the qdisc/NIC queues
before queuing additional packets.

I would try to call skb_orphan() in ath10k if you really want to keep
these batches.
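
A minimal sketch of the idea, purely illustrative (the driver and function
names are hypothetical; only skb_orphan() and the netdev types are real
kernel API, and this is not how ath10k actually submits frames):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

static netdev_tx_t example_start_xmit(struct sk_buff *skb,
                                      struct net_device *dev)
{
        /* Run skb->destructor now (tcp_wfree() for TCP skbs), releasing
         * the socket's sk_wmem_alloc accounting immediately instead of
         * waiting for a batched/deferred TX completion.  The driver
         * still owns the skb until its own completion fires.
         */
        skb_orphan(skb);

        /* ... hand the skb to the hardware/firmware here ... */

        return NETDEV_TX_OK;
}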

I have a hard time understanding why tx completed packets go through
ath10k_htc_rx_completion_handler()... anyway...

The most conservative patch would be:




Comments

Michal Kazior Feb. 5, 2015, 6:46 a.m. UTC | #1
On 4 February 2015 at 22:11, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> I do not see how a TSO patch could hurt a flow not using TSO/GSO.
>
> This makes no sense.
>
> ath10k tx completions being batched/deferred to a tasklet might increase
> the probability of hitting this condition in tcp_wfree():
>
>         /* If this softirq is serviced by ksoftirqd, we are likely under stress.
>          * Wait until our queues (qdisc + devices) are drained.
>          * This gives :
>          * - less callbacks to tcp_write_xmit(), reducing stress (batches)
>          * - chance for incoming ACK (processed by another cpu maybe)
>          *   to migrate this flow (skb->ooo_okay will be eventually set)
>          */
>         if (wmem >= SKB_TRUESIZE(1) && this_cpu_ksoftirqd() == current)
>                 goto out;
>
> Meaning the TCP stack waits until all skbs have left the qdisc/NIC queues
> before queuing additional packets.
>
> I would try to call skb_orphan() in ath10k if you really want to keep
> these batches.
>
> I have a hard time understanding why tx completed packets go through
> ath10k_htc_rx_completion_handler()... anyway...

There are a couple of layers to the host-firmware communication. The
transport layer (e.g. PCI) delivers HTC packets. These contain WMI
(configuration stuff) or HTT (traffic stuff). HTT can contain
different events (tx complete, rx complete, etc.). An HTT tx completion
contains a list of ids which refer to frames that have been completed
(either sent or dropped).
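
To make the layering concrete, a rough standalone sketch with made-up
names (these are not the real ath10k definitions, just the nesting
described above):

#include <stdio.h>
#include <stdint.h>

enum htc_payload_kind { HTC_WMI /* configuration */, HTC_HTT /* traffic */ };

struct htt_tx_complete_ind {
        uint16_t num_ids;
        uint16_t ids[8];        /* frames that were either sent or dropped */
};

struct htc_packet {
        enum htc_payload_kind kind;
        void *payload;          /* WMI command or HTT event */
};

int main(void)
{
        struct htt_tx_complete_ind ind = { .num_ids = 2, .ids = { 42, 43 } };
        struct htc_packet pkt = { .kind = HTC_HTT, .payload = &ind };

        if (pkt.kind == HTC_HTT) {
                const struct htt_tx_complete_ind *c = pkt.payload;

                for (uint16_t i = 0; i < c->num_ids; i++)
                        printf("frame id %u completed\n", c->ids[i]);
        }
        return 0;
}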

I've tried reverting the tx/rx tasklet batching. No change in throughput.
I can get a tcpdump if you're interested.


> The most conservative patch would be:
>
> diff --git a/drivers/net/wireless/ath/ath10k/htt_rx.c b/drivers/net/wireless/ath/ath10k/htt_rx.c
> index 9c782a42665e1aaf43bfbca441631ee58da50c09..6a36317d6bb0447202dee15528130bd5e21248c4 100644
> --- a/drivers/net/wireless/ath/ath10k/htt_rx.c
> +++ b/drivers/net/wireless/ath/ath10k/htt_rx.c
> @@ -1642,6 +1642,7 @@ void ath10k_htt_t2h_msg_handler(struct ath10k *ar, struct sk_buff *skb)
>                 break;
>         }
>         case HTT_T2H_MSG_TYPE_TX_COMPL_IND:
> +               skb_orphan(skb);
>                 spin_lock_bh(&htt->tx_lock);
>                 __skb_queue_tail(&htt->tx_compl_q, skb);
>                 spin_unlock_bh(&htt->tx_lock);

I suppose you want to call skb_orphan() on actual data packets, right?
This skb is just a host-firmware communication buffer.


Michał
Michal Kazior Feb. 5, 2015, 8:38 a.m. UTC | #2
On 4 February 2015 at 22:11, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> I do not see how a TSO patch could hurt a flow not using TSO/GSO.
>
> This makes no sense.

Hmm..

@@ -2018,8 +2053,8 @@ static bool tcp_write_xmit(struct sock *sk,
unsigned int mss_now, int nonagle,
                 * of queued bytes to ensure line rate.
                 * One example is wifi aggregation (802.11 AMPDU)
                 */
-               limit = max_t(unsigned int, sysctl_tcp_limit_output_bytes,
-                             sk->sk_pacing_rate >> 10);
+               limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
+               limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);

                if (atomic_read(&sk->sk_wmem_alloc) > limit) {
                        set_bit(TSQ_THROTTLED, &tp->tsq_flags);

Doesn't this effectively invert how tcp_limit_output_bytes is used?
This would explain why raising the limit wasn't changing anything
anymore when you asked me to do so. Only decreasing it yielded any
change.

I've added a printk to show the new and old values. An excerpt from the logs:

[  114.782740] (4608 39126 131072 = 39126) vs (131072 39126 = 131072)

(2*truesize, pacing_rate, tcp_limit = limit) vs (tcp_limit, pacing_rate = limit)
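
For reference, a standalone recomputation of both formulas with the
values from that printk (plain userspace C, not kernel code):

#include <stdio.h>

static unsigned int max_u(unsigned int a, unsigned int b) { return a > b ? a : b; }
static unsigned int min_u(unsigned int a, unsigned int b) { return a < b ? a : b; }

int main(void)
{
        unsigned int truesize2 = 4608;     /* 2 * skb->truesize             */
        unsigned int pacing    = 39126;    /* sk->sk_pacing_rate >> 10      */
        unsigned int tcp_limit = 131072;   /* sysctl_tcp_limit_output_bytes */

        /* Old code: the sysctl acts as a floor on the limit. */
        unsigned int old_limit = max_u(tcp_limit, pacing);

        /* New code: the sysctl acts as a ceiling on the limit. */
        unsigned int new_limit = min_u(max_u(truesize2, pacing), tcp_limit);

        /* Prints: old limit = 131072, new limit = 39126 */
        printf("old limit = %u, new limit = %u\n", old_limit, new_limit);
        return 0;
}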

Reverting this patch hunk alone fixes my TCP problem. Not that I'm
saying the old logic was correct (it seems it wasn't, a limit should
be applied as min(value, max_value), right?).

Anyway, the change doesn't seem to be TSO-only, which would explain
the "makes no sense".


Michał
Eric Dumazet Feb. 5, 2015, 12:57 p.m. UTC | #3
On Thu, 2015-02-05 at 09:38 +0100, Michal Kazior wrote:
> On 4 February 2015 at 22:11, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > I do not see how a TSO patch could hurt a flow not using TSO/GSO.
> >
> > This makes no sense.
> 
> Hmm..
> 
> @@ -2018,8 +2053,8 @@ static bool tcp_write_xmit(struct sock *sk,
> unsigned int mss_now, int nonagle,
>                  * of queued bytes to ensure line rate.
>                  * One example is wifi aggregation (802.11 AMPDU)
>                  */
> -               limit = max_t(unsigned int, sysctl_tcp_limit_output_bytes,
> -                             sk->sk_pacing_rate >> 10);
> +               limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
> +               limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
> 
>                 if (atomic_read(&sk->sk_wmem_alloc) > limit) {
>                         set_bit(TSQ_THROTTLED, &tp->tsq_flags);
> 
> Doesn't this effectively invert how tcp_limit_output_bytes is used?
> This would explain why raising the limit wasn't changing anything
> anymore when you asked me to do so. Only decreasing it yielded any
> change.
> 
> I've added a printk to show the new and old values. An excerpt from the logs:
> 
> [  114.782740] (4608 39126 131072 = 39126) vs (131072 39126 = 131072)
> 
> (2*truesize, pacing_rate, tcp_limit = limit) vs (tcp_limit, pacing_rate = limit)
> 
> Reverting this patch hunk alone fixes my TCP problem. Not that I'm
> saying the old logic was correct (it seems it wasn't, a limit should
> be applied as min(value, max_value), right?).
> 
> Anyway, the change doesn't seem to be TSO-only, which would explain
> the "makes no sense".


The intention is to limit the queues to the following:

1 ms of buffering, but limited to a configurable value.

On a 40 Gbps flow, 1 ms represents 5 MB, which is insane.

We do not want to queue 5 MB of traffic; this would destroy latencies
for all concurrent flows. (Or it would require having fq_codel or fq as
the packet scheduler, instead of the default pfifo_fast.)

This is why having a 1.5 ms delay between transmit and TX completion
is a problem in your case.
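
(The 5 MB figure is just rate times delay; a trivial userspace check:)

#include <stdio.h>

int main(void)
{
        double rate_bps = 40e9;         /* 40 Gbit/s         */
        double delay_s  = 1e-3;         /* 1 ms of buffering */

        /* bytes buffered = bits per second / 8 * seconds */
        printf("%.1f MB\n", rate_bps / 8.0 * delay_s / 1e6);   /* 5.0 MB */
        return 0;
}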

 



Eric Dumazet Feb. 5, 2015, 1:03 p.m. UTC | #4
On Thu, 2015-02-05 at 07:46 +0100, Michal Kazior wrote:
> On 4 February 2015 at 22:11, Eric Dumazet <eric.dumazet@gmail.com> wrote:

> > The most conservative patch would be:
> >
> > diff --git a/drivers/net/wireless/ath/ath10k/htt_rx.c b/drivers/net/wireless/ath/ath10k/htt_rx.c
> > index 9c782a42665e1aaf43bfbca441631ee58da50c09..6a36317d6bb0447202dee15528130bd5e21248c4 100644
> > --- a/drivers/net/wireless/ath/ath10k/htt_rx.c
> > +++ b/drivers/net/wireless/ath/ath10k/htt_rx.c
> > @@ -1642,6 +1642,7 @@ void ath10k_htt_t2h_msg_handler(struct ath10k *ar, struct sk_buff *skb)
> >                 break;
> >         }
> >         case HTT_T2H_MSG_TYPE_TX_COMPL_IND:
> > +               skb_orphan(skb);
> >                 spin_lock_bh(&htt->tx_lock);
> >                 __skb_queue_tail(&htt->tx_compl_q, skb);
> >                 spin_unlock_bh(&htt->tx_lock);
> 
> I suppose you want to call skb_orphan() on actual data packets, right?
> This skb is just a host-firmware communication buffer.

Right. I have no idea how you find the actual data packet at this stage.


Eric Dumazet Feb. 5, 2015, 1:19 p.m. UTC | #5
On Thu, 2015-02-05 at 04:57 -0800, Eric Dumazet wrote:

> The intention is to limit the queues to the following:
> 
> 1 ms of buffering, but limited to a configurable value.
> 
> On a 40 Gbps flow, 1 ms represents 5 MB, which is insane.
> 
> We do not want to queue 5 MB of traffic; this would destroy latencies
> for all concurrent flows. (Or it would require having fq_codel or fq as
> the packet scheduler, instead of the default pfifo_fast.)
> 
> This is why having a 1.5 ms delay between transmit and TX completion
> is a problem in your case.

Note that the TCP stack could detect when this happens, *if* ACKs were
delivered before the TX completions; or, when the TX completion happens,
we could detect that the clone of the freed packet was already freed.

In my test, with "ethtool -C eth0 tx-usecs 1024 tx-frames 64" and GSO
disabled, the TCP stack sends a bunch of packets (a bit fewer than 64),
then blocks on tcp_limit_output_bytes.

Then we receive 2 stretch ACKs after ~50 usec.

The TCP stack tries to push some more packets but blocks on
tcp_limit_output_bytes again.

1 ms later, the TX completion happens, tcp_wfree() is called, and the
TCP stack pushes the following ~60 packets.


TCP could eventually adjust tcp_limit_output_bytes dynamically, using a
per-flow value, but I would rather not add a kludge to the TCP stack
only to deal with a possible bug in the ath10k driver.

niu had a similar issue and simply had to call skb_orphan():

drivers/net/ethernet/sun/niu.c:6669:            skb_orphan(skb);




Patch

diff --git a/drivers/net/wireless/ath/ath10k/htt_rx.c b/drivers/net/wireless/ath/ath10k/htt_rx.c
index 9c782a42665e1aaf43bfbca441631ee58da50c09..6a36317d6bb0447202dee15528130bd5e21248c4 100644
--- a/drivers/net/wireless/ath/ath10k/htt_rx.c
+++ b/drivers/net/wireless/ath/ath10k/htt_rx.c
@@ -1642,6 +1642,7 @@  void ath10k_htt_t2h_msg_handler(struct ath10k *ar, struct sk_buff *skb)
 		break;
 	}
 	case HTT_T2H_MSG_TYPE_TX_COMPL_IND:
+		skb_orphan(skb);
 		spin_lock_bh(&htt->tx_lock);
 		__skb_queue_tail(&htt->tx_compl_q, skb);
 		spin_unlock_bh(&htt->tx_lock);