Message ID | 1423156205.31870.86.camel@edumazet-glaptop2.roam.corp.google.com (mailing list archive) |
---|---|
State | Not Applicable |
On 5 February 2015 at 18:10, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2015-02-05 at 06:41 -0800, Eric Dumazet wrote:
>
>> Not at all. This basically removes backpressure.
>>
>> A single UDP socket can now blast packets regardless of SO_SNDBUF
>> limits.
>>
>> This basically remove years of work trying to fix bufferbloat.
>>
>> I still do not understand why increasing tcp_limit_output_bytes is not
>> working for you.
>
> Oh well, tcp_limit_output_bytes might be ok.
>
> In fact, the problem comes from GSO assumption. Maybe Herbert was right,
> when he suggested TCP would be simpler if we enforced GSO...
>
> When GSO is used, the thing works because 2*skb->truesize is roughly 2
> ms worth of traffic.
>
> Because you do not use GSO, and tx completions are slow, we need this :
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 65caf8b95e17..ac01b4cd0035 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2044,7 +2044,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>                                 break;
>
>                 /* TCP Small Queues :
> -                * Control number of packets in qdisc/devices to two packets / or ~1 ms.
> +                * Control number of packets in qdisc/devices to two packets /
> +                * or ~2 ms (sk->sk_pacing_rate >> 9) in case GSO is off.
>                  * This allows for :
>                  *  - better RTT estimation and ACK scheduling
>                  *  - faster recovery
> @@ -2053,7 +2054,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>                  * of queued bytes to ensure line rate.
>                  * One example is wifi aggregation (802.11 AMPDU)
>                  */
> -               limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
> +               limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);
>                 limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
>
>                 if (atomic_read(&sk->sk_wmem_alloc) > limit) {
>

The above brings back previous behaviour, i.e. I can get 600mbps TCP
on 5 flows again. Single flow is still (as it was before TSO
autosizing) limited to roughly ~280mbps.

I never really bothered before to understand why I need to push a few
flows through ath10k to max it out, i.e. if I run a single UDP flow I
get ~300mbps while with, e.g. 5 I get 670mbps easily.

I guess it was the tx completion latency all along.

I just put an extra debug to ath10k to see the latency between
submission and completion. Here's a log
(http://www.filedropper.com/complete-log) of 2s run of UDP iperf
trying to push 1gbps but managing only 300mbps.

I've made sure to not hold any locks nor introduce internal to ath10k
delays. Frames get completed between 2-4ms in average during load.

When I tried using different ath10k hw&fw I got between 1-2ms of
latency for tx completions yielding ~430mbps while max should be
around 670mbps.


Michał
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2015-02-06 at 10:42 +0100, Michal Kazior wrote:

> The above brings back previous behaviour, i.e. I can get 600mbps TCP
> on 5 flows again. Single flow is still (as it was before TSO
> autosizing) limited to roughly ~280mbps.
>
> I never really bothered before to understand why I need to push a few
> flows through ath10k to max it out, i.e. if I run a single UDP flow I
> get ~300mbps while with, e.g. 5 I get 670mbps easily.
>

For single UDP flow, tweaking /proc/sys/net/core/wmem_default might be
enough : UDP has no callback from TX completion to feed following frames
(No write queue like TCP)

# cat /proc/sys/net/core/wmem_default
212992
# ethtool -C eth1 tx-usecs 1024 tx-frames 120
# ./netperf -H remote -t UDP_STREAM -- -m 1450
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992    1450   10.00      697705      0     809.27
212992           10.00      673412            781.09

# echo 800000 >/proc/sys/net/core/wmem_default
# ./netperf -H remote -t UDP_STREAM -- -m 1450
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

800000    1450   10.00     7329221      0    8501.84
212992           10.00     7284051           8449.44

> I guess it was the tx completion latency all along.
>
> I just put an extra debug to ath10k to see the latency between
> submission and completion. Here's a log
> (http://www.filedropper.com/complete-log) of 2s run of UDP iperf
> trying to push 1gbps but managing only 300mbps.
>
> I've made sure to not hold any locks nor introduce internal to ath10k
> delays. Frames get completed between 2-4ms in average during load.

tcp_wfree() could maintain in tp->tx_completion_delay_ms an EWMA
of TX completion delay. But this would require yet another expensive
call to ktime_get() if HZ < 1000.

Then tcp_write_xmit() could use it to adjust :

   limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);

to

   amount = (2 + tp->tx_completion_delay_ms) * sk->sk_pacing_rate

   limit = max(2 * skb->truesize, amount / 1000);

I'll cook a patch.

Thanks.
On Fri, 2015-02-06 at 05:40 -0800, Eric Dumazet wrote:

> tcp_wfree() could maintain in tp->tx_completion_delay_ms an EWMA
> of TX completion delay. But this would require yet another expensive
> call to ktime_get() if HZ < 1000.
>
> Then tcp_write_xmit() could use it to adjust :
>
>    limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);
>
> to
>
>    amount = (2 + tp->tx_completion_delay_ms) * sk->sk_pacing_rate
>
>    limit = max(2 * skb->truesize, amount / 1000);
>
> I'll cook a patch.

Hmm... doing this in all protocols would be too expensive,
and we do not want to include time spent in qdiscs.

wifi could eventually do that, providing in skb->tx_completion_delay_us
the time spent in wifi driver.

This way, we would have no penalty for network devices doing normal skb
orphaning (loopback interface, ethernet, ...)
On 6 February 2015 at 14:40, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2015-02-06 at 10:42 +0100, Michal Kazior wrote:
>
>> The above brings back previous behaviour, i.e. I can get 600mbps TCP
>> on 5 flows again. Single flow is still (as it was before TSO
>> autosizing) limited to roughly ~280mbps.
>>
>> I never really bothered before to understand why I need to push a few
>> flows through ath10k to max it out, i.e. if I run a single UDP flow I
>> get ~300mbps while with, e.g. 5 I get 670mbps easily.
>>
>
> For single UDP flow, tweaking /proc/sys/net/core/wmem_default might be
> enough : UDP has no callback from TX completion to feed following frames
> (No write queue like TCP)
>
> # cat /proc/sys/net/core/wmem_default
> 212992
> # ethtool -C eth1 tx-usecs 1024 tx-frames 120
> # ./netperf -H remote -t UDP_STREAM -- -m 1450
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec
>
> 212992    1450   10.00      697705      0     809.27
> 212992           10.00      673412            781.09
>
> # echo 800000 >/proc/sys/net/core/wmem_default
> # ./netperf -H remote -t UDP_STREAM -- -m 1450
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec
>
> 800000    1450   10.00     7329221      0    8501.84
> 212992           10.00     7284051           8449.44

Hmm.. I confirm it works. However the value at which I get full rate
on a single flow is more than 2048K. Also using non-default
wmem_default seems to introduce packet loss as per iperf reports at
the receiver. I suppose this is kind of expected but on the other hand
wmem_default=262992 and 5 flows of UDP max the device out with 0
packet loss.


Michał
On 6 February 2015 at 14:53, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2015-02-06 at 05:40 -0800, Eric Dumazet wrote:
>
>> tcp_wfree() could maintain in tp->tx_completion_delay_ms an EWMA
>> of TX completion delay. But this would require yet another expensive
>> call to ktime_get() if HZ < 1000.
>>
>> Then tcp_write_xmit() could use it to adjust :
>>
>>    limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);
>>
>> to
>>
>>    amount = (2 + tp->tx_completion_delay_ms) * sk->sk_pacing_rate
>>
>>    limit = max(2 * skb->truesize, amount / 1000);
>>
>> I'll cook a patch.
>
> Hmm... doing this in all protocols would be too expensive,
> and we do not want to include time spent in qdiscs.
>
> wifi could eventually do that, providing in skb->tx_completion_delay_us
> the time spent in wifi driver.
>
> This way, we would have no penalty for network devices doing normal skb
> orphaning (loopback interface, ethernet, ...)

I'll play around with this idea and report back later.


Michał
On Fri, 2015-02-06 at 05:53 -0800, Eric Dumazet wrote:

> wifi could eventually do that, providing in skb->tx_completion_delay_us
> the time spent in wifi driver.
>
> This way, we would have no penalty for network devices doing normal skb
> orphaning (loopback interface, ethernet, ...)

Another way would be that wifi does an automatic orphaning after 1 or
2ms.
From: Eric Dumazet
> On Fri, 2015-02-06 at 05:53 -0800, Eric Dumazet wrote:
>
> > wifi could eventually do that, providing in skb->tx_completion_delay_us
> > the time spent in wifi driver.
> >
> > This way, we would have no penalty for network devices doing normal skb
> > orphaning (loopback interface, ethernet, ...)
>
> Another way would be that wifi does an automatic orphaning after 1 or
> 2ms.

Couldn't you do byte counting?
So orphan enough packets to keep a few ms of tx traffic (at the current
tx rate) orphaned.
You might need to give the hardware both orphaned and non-orphaned (parented?)
packets and orphan some when you get a tx complete for an orphaned packet.

David
On Fri, 2015-02-06 at 15:08 +0100, Michal Kazior wrote:

> Hmm.. I confirm it works. However the value at which I get full rate
> on a single flow is more than 2048K. Also using non-default
> wmem_default seems to introduce packet loss as per iperf reports at
> the receiver. I suppose this is kind of expected but on the other hand
> wmem_default=262992 and 5 flows of UDP max the device out with 0
> packet loss.

If you increase ability to flood on one flow, then you need to make sure
receiver has big rcvbuf as well.

echo 2000000 >/proc/sys/net/core/rmem_default

Otherwise it might drop bursts.

This is the kind of things that TCP does automatically, not UDP.
On Fri, 2015-02-06 at 14:31 +0000, David Laight wrote:
> From: Eric Dumazet
> > On Fri, 2015-02-06 at 05:53 -0800, Eric Dumazet wrote:
> >
> > > wifi could eventually do that, providing in skb->tx_completion_delay_us
> > > the time spent in wifi driver.
> > >
> > > This way, we would have no penalty for network devices doing normal skb
> > > orphaning (loopback interface, ethernet, ...)
> >
> > Another way would be that wifi does an automatic orphaning after 1 or
> > 2ms.
>
> Couldn't you do byte counting?
> So orphan enough packets to keep a few ms of tx traffic (at the current
> tx rate) orphaned.
> You might need to give the hardware both orphaned and non-orphaned (parented?)
> packets and orphan some when you get a tx complete for an orphaned packet.

We already have byte counting.

The thing is : A driver can keep an skb for itself, while calling
skb_orphan() in time to allow a socket to send more packets.

For say a UDP server, it would be quite mandatory, as it usually uses a
single UDP socket to receive and send messages.
> If you increase ability to flood on one flow, then you need to make sure
> receiver has big rcvbuf as well.
>
> echo 2000000 >/proc/sys/net/core/rmem_default
>
> Otherwise it might drop bursts.
>
> This is the kind of things that TCP does automatically, not UDP.

An alternative, if the application involved can make explicit
setsockopt() calls to set SO_SNDBUF and/or SO_RCVBUF, is to tweak
rmem_max and wmem_max and then let the application make the setsockopt()
calls. Which path one would take would depend on circumstances I suspect.

rick jones
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 65caf8b95e17..ac01b4cd0035 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2044,7 +2044,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
                                break;

                /* TCP Small Queues :
-                * Control number of packets in qdisc/devices to two packets / or ~1 ms.
+                * Control number of packets in qdisc/devices to two packets /
+                * or ~2 ms (sk->sk_pacing_rate >> 9) in case GSO is off.
                 * This allows for :
                 *  - better RTT estimation and ACK scheduling
                 *  - faster recovery
@@ -2053,7 +2054,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
                 * of queued bytes to ensure line rate.
                 * One example is wifi aggregation (802.11 AMPDU)
                 */
-               limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
+               limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);
                limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);

                if (atomic_read(&sk->sk_wmem_alloc) > limit) {