[net] net_sched: sch_fq: don't follow the fast path if Tx is behind now

Message ID 20241122162108.2697803-1-kuba@kernel.org (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series [net] net_sched: sch_fq: don't follow the fast path if Tx is behind now | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 3 this patch: 3
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers warning 1 maintainers not CCed: horms@kernel.org
netdev/build_clang success Errors and warnings before: 3 this patch: 3
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 4 this patch: 4
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 12 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Jakub Kicinski Nov. 22, 2024, 4:21 p.m. UTC
Recent kernels cause a lot of TCP retransmissions:

[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  2.24 GBytes  19.2 Gbits/sec  2767    442 KBytes
[  5]   1.00-2.00   sec  2.23 GBytes  19.1 Gbits/sec  2312    350 KBytes
                                                      ^^^^

Replacing the qdisc with pfifo makes them go away. It appears that
a flow may get throttled with a very near unthrottle time.
Later we may get busy processing Rx and the unthrottle time will
pass, but we won't service Tx since the core is busy with Rx.
If Rx sees an ACK and we try to push more data for the throttled flow
we may fastpath the skb, not realizing that there are already "ready
to send" packets for this flow sitting in the qdisc.
At least this is my theory on what happens.

Don't trust the fastpath if we are "behind" according to the projected
unthrottle time for some flow waiting in the Qdisc.

Qdisc config:

qdisc fq 8001: dev eth0 parent 1234:1 limit 10000p flow_limit 100p \
  buckets 32768 orphan_mask 1023 bands 3 \
  priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 \
  weights 589824 196608 65536 quantum 3028b initial_quantum 15140b \
  low_rate_threshold 550Kbit \
  refill_delay 40ms timer_slack 10us horizon 10s horizon_drop

For iperf this change seems to do fine; the reordering is gone.
The fastpath still gets used most of the time:

  gc 0 highprio 0 fastpath 142614 throttled 418309 latency 19.1us
 xx_behind 2731

where "xx_behind" counts how many times we hit the new return false.

Fixes: 076433bd78d7 ("net_sched: sch_fq: add fast path for mostly idle qdisc")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
CC: jhs@mojatatu.com
CC: xiyou.wangcong@gmail.com
CC: jiri@resnulli.us
---
 net/sched/sch_fq.c | 6 ++++++
 1 file changed, 6 insertions(+)
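
To make the suspected race concrete, below is a toy, self-contained C
model of the fast-path decision. The struct and field names echo the
kernel's, but the program is illustrative only, not kernel code: it
models a flow throttled until t = 5 whose enqueue is consulted at
t = 7, after Rx processing has delayed Tx:

  /* Toy model of the race; build with: cc -o toy_fq toy_fq.c */
  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  struct toy_fq {
      unsigned int throttled_flows;    /* flows parked on the watchdog */
      uint64_t time_next_delayed_flow; /* earliest unthrottle time */
  };

  /* Mirrors the added check: refuse the fast path when some throttled
   * flow is already due, since fastpathing a fresh skb would let it
   * jump ahead of that flow's "ready to send" packets.
   */
  static bool fastpath_ok(const struct toy_fq *q, uint64_t now)
  {
      if (q->throttled_flows && q->time_next_delayed_flow <= now)
          return false;
      return true;
  }

  int main(void)
  {
      struct toy_fq q = {
          .throttled_flows = 1,
          .time_next_delayed_flow = 5, /* was due at t = 5 */
      };

      printf("fastpath at t=7: %s\n",
             fastpath_ok(&q, 7) ? "taken (would reorder)"
                                : "refused (patched behavior)");
      return 0;
  }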

Comments

Eric Dumazet Nov. 22, 2024, 4:44 p.m. UTC | #1
On Fri, Nov 22, 2024 at 5:21 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> Recent kernels cause a lot of TCP retransmissions:
>
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec  2.24 GBytes  19.2 Gbits/sec  2767    442 KBytes
> [  5]   1.00-2.00   sec  2.23 GBytes  19.1 Gbits/sec  2312    350 KBytes
>                                                       ^^^^
>
> Replacing the qdisc with pfifo makes them go away. It appears that
> a flow may get throttled with a very near unthrottle time.
> Later we may get busy processing Rx and the unthrottle time will
> pass, but we won't service Tx since the core is busy with Rx.
> If Rx sees an ACK and we try to push more data for the throttled flow
> we may fastpath the skb, not realizing that there are already "ready
> to send" packets for this flow sitting in the qdisc.
> At least this is my theory on what happens.
>
> Don't trust the fastpath if we are "behind" according to the projected
> unthrottle time for some flow waiting in the Qdisc.
>
> Qdisc config:
>
> qdisc fq 8001: dev eth0 parent 1234:1 limit 10000p flow_limit 100p \
>   buckets 32768 orphan_mask 1023 bands 3 \
>   priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 \
>   weights 589824 196608 65536 quantum 3028b initial_quantum 15140b \
>   low_rate_threshold 550Kbit \
>   refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
>
> For iperf this change seems to do fine; the reordering is gone.
> The fastpath still gets used most of the time:
>
>   gc 0 highprio 0 fastpath 142614 throttled 418309 latency 19.1us
>  xx_behind 2731
>
> where "xx_behind" counts how many times we hit the new return false.
>
> Fixes: 076433bd78d7 ("net_sched: sch_fq: add fast path for mostly idle qdisc")
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> CC: jhs@mojatatu.com
> CC: xiyou.wangcong@gmail.com
> CC: jiri@resnulli.us
> ---
>  net/sched/sch_fq.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
> index 19a49af5a9e5..3d932b262159 100644
> --- a/net/sched/sch_fq.c
> +++ b/net/sched/sch_fq.c
> @@ -331,6 +331,12 @@ static bool fq_fastpath_check(const struct Qdisc *sch, struct sk_buff *skb,
>                  */
>                 if (q->internal.qlen >= 8)
>                         return false;
> +
> +               /* Ordering invariants fall apart if some throttled flows
> +                * are ready but we haven't serviced them, yet.
> +                */
> +               if (q->throttled_flows && q->time_next_delayed_flow <= now)
> +                       return false;
>         }

Interesting... I guess we could also call fq_check_throttled() to
refresh a better view of the qdisc state ?

But perhaps your patch is simpler. I guess it could be reduced to

if (q->time_next_delayed_flow <= now + q->offload_horizon)
      return false;

(Note the + q->offload_horizon)

I do not think testing q->throttled_flows is strictly needed:
if it is 0, then q->time_next_delayed_flow is set to ~0ULL.
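
A minimal runnable sketch of the simplification suggested above; the
parameter names mirror the sch_fq fields under discussion, but this is
a toy model of the comparison, not the respun kernel patch. It shows
why the explicit throttled_flows test is redundant (the ~0ULL sentinel
can never be <= now) and how offload_horizon widens the window:

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Suggested check: refuse the fast path for deadlines within the
   * offload horizon, not only those already in the past.
   */
  static bool fastpath_ok(uint64_t time_next_delayed_flow,
                          uint64_t now, uint64_t offload_horizon)
  {
      return time_next_delayed_flow > now + offload_horizon;
  }

  int main(void)
  {
      /* No throttled flows: the ~0ULL sentinel never trips the check. */
      printf("idle:   %d\n", fastpath_ok(~0ULL, 1000, 0));
      /* Flow due at t = 5, consulted at t = 7: fast path refused. */
      printf("behind: %d\n", fastpath_ok(5, 7, 0));
      return 0;
  }
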
Jakub Kicinski Nov. 22, 2024, 5:31 p.m. UTC | #2
On Fri, 22 Nov 2024 17:44:33 +0100 Eric Dumazet wrote:
> Interesting... I guess we could also call fq_check_throttled() to
> refresh a better view of the qdisc state ?
> 
> But perhaps your patch is simpler. I guess it could be reduced to
> 
> if (q->time_next_delayed_flow <= now + q->offload_horizon)
>       return false;
> 
> (Note the + q->offload_horizon)
> 
> I do not think testing q->throttled_flows is strictly needed:
> if it is 0, then q->time_next_delayed_flow is set to ~0ULL.

Makes sense, I'll respin using your check tomorrow.
Eric Dumazet Nov. 22, 2024, 6:09 p.m. UTC | #3
On Fri, Nov 22, 2024 at 6:31 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 22 Nov 2024 17:44:33 +0100 Eric Dumazet wrote:
> > Interesting... I guess we could also call fq_check_throttled() to
> > refresh a better view of the qdisc state ?
> >
> > But perhaps your patch is simpler. I guess it could be reduced to
> >
> > if (q->time_next_delayed_flow <= now + q->offload_horizon)
> >       return false;
> >
> > (Note the + q->offload_horizon)
> >
> > I do not think testing q->throttled_flows is strictly needed:
> > if it is 0, then q->time_next_delayed_flow is set to ~0ULL.
>
> Makes sense, I'll respin using your check tomorrow.

Great. I confirm that the fix reduces the recent increase we had seen
in the TcpExtTCPSACKReorder SNMP counter.

Also "ss -temoi" was showing suspect reordering:300 values that I had
not yet had time to investigate.

Thanks a lot for finding this!
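
For anyone reproducing this, the signals mentioned above can be watched
with stock iproute2 tooling; a suggested invocation (output formats
vary by version):

  # SNMP counter for SACK-detected reordering; -a dumps absolute
  # values, -z includes zero counters.
  nstat -az TcpExtTCPSACKReorder

  # Per-socket details; the suspect "reordering:300" values show up here.
  ss -temoi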

Patch

diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index 19a49af5a9e5..3d932b262159 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -331,6 +331,12 @@  static bool fq_fastpath_check(const struct Qdisc *sch, struct sk_buff *skb,
 		 */
 		if (q->internal.qlen >= 8)
 			return false;
+
+		/* Ordering invariants fall apart if some throttled flows
+		 * are ready but we haven't serviced them, yet.
+		 */
+		if (q->throttled_flows && q->time_next_delayed_flow <= now)
+			return false;
 	}
 
 	sk = skb->sk;