Message ID | 20240817163400.2616134-3-mrzhang97@gmail.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | tcp_cubic: fix to achieve at least the same throughput as Reno | expand |
On Sat, Aug 17, 2024 at 6:35 PM Mingrui Zhang <mrzhang97@gmail.com> wrote: > > The original code follows RFC 8312 (obsoleted CUBIC RFC). > > The patched code follows RFC 9438 (new CUBIC RFC): Please give the precise location in the RFC (4.3 Reno-Friendly Region) > "Once _W_est_ has grown to reach the _cwnd_ at the time of most > recently setting _ssthresh_ -- that is, _W_est_ >= _cwnd_prior_ -- > the sender SHOULD set α__cubic_ to 1 to ensure that it can achieve > the same congestion window increment rate as Reno, which uses AIMD > (1,0.5)." > > Add new field 'cwnd_prior' in bictcp to hold cwnd before a loss event > > Fixes: 89b3d9aaf467 ("[TCP] cubic: precompute constants") RFC 9438 is brand new, I think we should not backport this patch to stable linux versions. This would target net-next, unless there is clear evidence that it is absolutely safe. Note the existence of tools/testing/selftests/bpf/progs/bpf_cc_cubic.c and tools/testing/selftests/bpf/progs/bpf_cubic.c If this patch was a fix, I presume we would need to fix these files ?
On 8/19/24 03:22, Eric Dumazet wrote: > On Sat, Aug 17, 2024 at 6:35 PM Mingrui Zhang <mrzhang97@gmail.com> wrote: >> The original code follows RFC 8312 (obsoleted CUBIC RFC). >> >> The patched code follows RFC 9438 (new CUBIC RFC): > Please give the precise location in the RFC (4.3 Reno-Friendly Region) Thank you, Eric, I will write it more clearly in the next version patch to submit. > >> "Once _W_est_ has grown to reach the _cwnd_ at the time of most >> recently setting _ssthresh_ -- that is, _W_est_ >= _cwnd_prior_ -- >> the sender SHOULD set α__cubic_ to 1 to ensure that it can achieve >> the same congestion window increment rate as Reno, which uses AIMD >> (1,0.5)." >> >> Add new field 'cwnd_prior' in bictcp to hold cwnd before a loss event >> >> Fixes: 89b3d9aaf467 ("[TCP] cubic: precompute constants") > RFC 9438 is brand new, I think we should not backport this patch to > stable linux versions. > > This would target net-next, unless there is clear evidence that it is > absolutely safe. I agree with you that this patch would target net-next. > Note the existence of tools/testing/selftests/bpf/progs/bpf_cc_cubic.c > and tools/testing/selftests/bpf/progs/bpf_cubic.c > > If this patch was a fix, I presume we would need to fix these files ? In my understanding, the bpf_cubic.c and bpf_cc_cubic.c are not designed to create a fully equivalent version of tcp_cubic, but more focus on BPF logic testing usage. For example, the up-to-date bpf_cubic does not involve the changes in commit 9957b38b5e7a ("tcp_cubic: make hystart_ack_delay() aware of BIG TCP") Maybe we would ask BPF maintainers whether to fix these BPF files?
On Mon, Aug 19, 2024 at 11:03 PM Mingrui Zhang <mrzhang97@gmail.com> wrote: > > On 8/19/24 03:22, Eric Dumazet wrote: > > On Sat, Aug 17, 2024 at 6:35 PM Mingrui Zhang <mrzhang97@gmail.com> wrote: > >> The original code follows RFC 8312 (obsoleted CUBIC RFC). > >> > >> The patched code follows RFC 9438 (new CUBIC RFC): > > Please give the precise location in the RFC (4.3 Reno-Friendly Region) > > Thank you, Eric, > I will write it more clearly in the next version patch to submit. > > > > >> "Once _W_est_ has grown to reach the _cwnd_ at the time of most > >> recently setting _ssthresh_ -- that is, _W_est_ >= _cwnd_prior_ -- > >> the sender SHOULD set α__cubic_ to 1 to ensure that it can achieve > >> the same congestion window increment rate as Reno, which uses AIMD > >> (1,0.5)." > >> > >> Add new field 'cwnd_prior' in bictcp to hold cwnd before a loss event > >> > >> Fixes: 89b3d9aaf467 ("[TCP] cubic: precompute constants") > > RFC 9438 is brand new, I think we should not backport this patch to > > stable linux versions. > > > > This would target net-next, unless there is clear evidence that it is > > absolutely safe. > > I agree with you that this patch would target net-next. > > > Note the existence of tools/testing/selftests/bpf/progs/bpf_cc_cubic.c > > and tools/testing/selftests/bpf/progs/bpf_cubic.c > > > > If this patch was a fix, I presume we would need to fix these files ? > In my understanding, the bpf_cubic.c and bpf_cc_cubic.c are not designed to create a fully equivalent version of tcp_cubic, but more focus on BPF logic testing usage. > For example, the up-to-date bpf_cubic does not involve the changes in commit 9957b38b5e7a ("tcp_cubic: make hystart_ack_delay() aware of BIG TCP") > > Maybe we would ask BPF maintainers whether to fix these BPF files? We try (as TCP maintainers) to keep tools/testing/selftests/bpf/progs/bpf_cubic.c up to date with the kernel C code. Because _if_ someone is really using BPF based cubic, they should get the fix eventually. See for instance commit 7d21d54d624777358ab6c7be7ff778808fef70ba Author: Neal Cardwell <ncardwell@google.com> Date: Wed Jun 24 12:42:03 2020 -0400 bpf: tcp: bpf_cubic: fix spurious HYSTART_DELAY exit upon drop in min RTT Apply the fix from: "tcp_cubic: fix spurious HYSTART_DELAY exit upon drop in min RTT" to the BPF implementation of TCP CUBIC congestion control. Repeating the commit description here for completeness: Mirja Kuehlewind reported a bug in Linux TCP CUBIC Hystart, where Hystart HYSTART_DELAY mechanism can exit Slow Start spuriously on an ACK when the minimum rtt of a connection goes down. From inspection it is clear from the existing code that this could happen in an example like the following: o The first 8 RTT samples in a round trip are 150ms, resulting in a curr_rtt of 150ms and a delay_min of 150ms. o The 9th RTT sample is 100ms. The curr_rtt does not change after the first 8 samples, so curr_rtt remains 150ms. But delay_min can be lowered at any time, so delay_min falls to 100ms. The code executes the HYSTART_DELAY comparison between curr_rtt of 150ms and delay_min of 100ms, and the curr_rtt is declared far enough above delay_min to force a (spurious) exit of Slow start. The fix here is simple: allow every RTT sample in a round trip to lower the curr_rtt. Fixes: 6de4a9c430b5 ("bpf: tcp: Add bpf_cubic example") Reported-by: Mirja Kuehlewind <mirja.kuehlewind@ericsson.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
On 8/20/24 07:56, Eric Dumazet wrote: > On Mon, Aug 19, 2024 at 11:03 PM Mingrui Zhang <mrzhang97@gmail.com> wrote: >> On 8/19/24 03:22, Eric Dumazet wrote: >>> On Sat, Aug 17, 2024 at 6:35 PM Mingrui Zhang <mrzhang97@gmail.com> wrote: >>>> The original code follows RFC 8312 (obsoleted CUBIC RFC). >>>> >>>> The patched code follows RFC 9438 (new CUBIC RFC): >>> Please give the precise location in the RFC (4.3 Reno-Friendly Region) >> Thank you, Eric, >> I will write it more clearly in the next version patch to submit. >> >>>> "Once _W_est_ has grown to reach the _cwnd_ at the time of most >>>> recently setting _ssthresh_ -- that is, _W_est_ >= _cwnd_prior_ -- >>>> the sender SHOULD set α__cubic_ to 1 to ensure that it can achieve >>>> the same congestion window increment rate as Reno, which uses AIMD >>>> (1,0.5)." >>>> >>>> Add new field 'cwnd_prior' in bictcp to hold cwnd before a loss event >>>> >>>> Fixes: 89b3d9aaf467 ("[TCP] cubic: precompute constants") >>> RFC 9438 is brand new, I think we should not backport this patch to >>> stable linux versions. >>> >>> This would target net-next, unless there is clear evidence that it is >>> absolutely safe. >> I agree with you that this patch would target net-next. >> >>> Note the existence of tools/testing/selftests/bpf/progs/bpf_cc_cubic.c >>> and tools/testing/selftests/bpf/progs/bpf_cubic.c >>> >>> If this patch was a fix, I presume we would need to fix these files ? >> In my understanding, the bpf_cubic.c and bpf_cc_cubic.c are not designed to create a fully equivalent version of tcp_cubic, but more focus on BPF logic testing usage. >> For example, the up-to-date bpf_cubic does not involve the changes in commit 9957b38b5e7a ("tcp_cubic: make hystart_ack_delay() aware of BIG TCP") >> >> Maybe we would ask BPF maintainers whether to fix these BPF files? > We try (as TCP maintainers) to keep > tools/testing/selftests/bpf/progs/bpf_cubic.c up to date with the > kernel C code. > Because _if_ someone is really using BPF based cubic, they should get > the fix eventually. > I got your point. Yes, we should fix those BPF based cubic files if this patch was confirmed. > See for instance > > commit 7d21d54d624777358ab6c7be7ff778808fef70ba > Author: Neal Cardwell <ncardwell@google.com> > Date: Wed Jun 24 12:42:03 2020 -0400 > > bpf: tcp: bpf_cubic: fix spurious HYSTART_DELAY exit upon drop in min RTT > > Apply the fix from: > "tcp_cubic: fix spurious HYSTART_DELAY exit upon drop in min RTT" > to the BPF implementation of TCP CUBIC congestion control. > > Repeating the commit description here for completeness: > > Mirja Kuehlewind reported a bug in Linux TCP CUBIC Hystart, where > Hystart HYSTART_DELAY mechanism can exit Slow Start spuriously on an > ACK when the minimum rtt of a connection goes down. From inspection it > is clear from the existing code that this could happen in an example > like the following: > > o The first 8 RTT samples in a round trip are 150ms, resulting in a > curr_rtt of 150ms and a delay_min of 150ms. > > o The 9th RTT sample is 100ms. The curr_rtt does not change after the > first 8 samples, so curr_rtt remains 150ms. But delay_min can be > lowered at any time, so delay_min falls to 100ms. The code executes > the HYSTART_DELAY comparison between curr_rtt of 150ms and delay_min > of 100ms, and the curr_rtt is declared far enough above delay_min to > force a (spurious) exit of Slow start. > > The fix here is simple: allow every RTT sample in a round trip to > lower the curr_rtt. > > Fixes: 6de4a9c430b5 ("bpf: tcp: Add bpf_cubic example") > Reported-by: Mirja Kuehlewind <mirja.kuehlewind@ericsson.com> > Signed-off-by: Neal Cardwell <ncardwell@google.com> > Signed-off-by: Eric Dumazet <edumazet@google.com> > Acked-by: Soheil Hassas Yeganeh <soheil@google.com> > Signed-off-by: David S. Miller <davem@davemloft.net> Thank you for pointing out this patch as an example.
diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c index 00da7d592032..03cfbad37dab 100644 --- a/net/ipv4/tcp_cubic.c +++ b/net/ipv4/tcp_cubic.c @@ -102,6 +102,7 @@ struct bictcp { u32 end_seq; /* end_seq of the round */ u32 last_ack; /* last time when the ACK spacing is close */ u32 curr_rtt; /* the minimum rtt of current round */ + u32 cwnd_prior; /* cwnd before a loss event */ }; static inline void bictcp_reset(struct bictcp *ca) @@ -305,7 +306,10 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked) if (tcp_friendliness) { u32 scale = beta_scale; - delta = (cwnd * scale) >> 3; + if (cwnd < ca->cwnd_prior) + delta = (cwnd * scale) >> 3; /* CUBIC additive increment */ + else + delta = cwnd; /* Reno additive increment */ while (ca->ack_cnt > delta) { /* update tcp cwnd */ ca->ack_cnt -= delta; ca->tcp_cwnd++; @@ -355,6 +359,7 @@ __bpf_kfunc static u32 cubictcp_recalc_ssthresh(struct sock *sk) / (2 * BICTCP_BETA_SCALE); else ca->last_max_cwnd = tcp_snd_cwnd(tp); + ca->cwnd_prior = tcp_snd_cwnd(tp); return max((tcp_snd_cwnd(tp) * beta) / BICTCP_BETA_SCALE, 2U); }