Message ID | 20240810223130.379146-1-mrzhang97@gmail.com (mailing list archive)
---|---
State | Changes Requested
Delegated to: | Netdev Maintainers
Series | [net] tcp_cubic fix to achieve at least the same throughput as Reno
On Sat, Aug 10, 2024 at 6:34 PM Mingrui Zhang <mrzhang97@gmail.com> wrote:
>
> This patch fixes some CUBIC bugs so that "CUBIC achieves at least
> the same throughput as Reno in small-BDP networks"
> [RFC 9438: https://www.rfc-editor.org/rfc/rfc9438.html]
>
> It consists of three bug fixes, all changing function bictcp_update()
> of tcp_cubic.c, which controls how fast CUBIC increases its
> congestion window size snd_cwnd.

Thanks for these fixes! I had some patches under development for bugs
(2) and (3), but had not found the time to finish them up and send
them out... And I had not noticed bug (1). :-) So thanks for sending
out these fixes! :-)

A few comments:

(1) Since these are three separate bug fixes, in accordance with Linux
development standards, which use minimal/narrow/focused patches,
please submit each of the 3 separate fixes as a separate patch, as
part of a patch series (ideally using git format-patch).

(2) I would suggest using the --cover-letter flag to git format-patch
to create a cover letter for the 3-patch series. You could put the
nice experiment data in the cover letter, since the cover letter
applies to all patches, and so do the experiment results.

(3) Please include a "Fixes:" footer in the git commit message
footers, to indicate which patch this is fixing, to enable backports
of these fixes to stable kernels. You can see examples in the Linux
git history. For example, you might use something like:

Fixes: df3271f3361b ("[TCP] BIC: CUBIC window growth (2.0)")

You can use the following to find the SHA1 of the patch you are fixing:

git blame -- net/ipv4/tcp_cubic.c

(4) For each patch title, please use "tcp_cubic:" as the first token
in the patch title to indicate the area of the kernel you are fixing,
and have a brief description of the specifics of the fix. For example,
some suggested titles:

tcp_cubic: fix to run bictcp_update() at least once per round

tcp_cubic: fix to match Reno additive increment

tcp_cubic: fix to use emulated Reno cwnd one round in the future

(5) Please run ./scripts/checkpatch.pl to look for issues before sending:

./scripts/checkpatch.pl *patch

(6) Please re-test before sending.

> Bug fix 1:
>  ca->ack_cnt += acked; /* count the number of ACKed packets */
>
>  if (ca->last_cwnd == cwnd &&
> -    (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32)
> +    (s32)(tcp_jiffies32 - ca->last_time) <= min(HZ / 32, usecs_to_jiffies(ca->delay_min)))
>   return;
>
>  /* The CUBIC function can update ca->cnt at most once per jiffy.
>
> The original code bypasses bictcp_update() under certain conditions
> to reduce the CPU overhead. Intuitively, when last_cwnd==cwnd,
> bictcp_update() is executed 32 times per second. As a result,
> it is possible that bictcp_update() is not executed for several
> RTTs when RTT is short (specifically < 1/32 second = 31 ms and
> last_cwnd==cwnd, which may happen in small-BDP networks),
> thus leading to low throughput in these RTTs.
>
> The patched code executes bictcp_update() 32 times per second
> if RTT > 31 ms, or every RTT if RTT < 31 ms, when last_cwnd==cwnd.
>
> Bug fix 2:
>  if (tcp_friendliness) {
>   u32 scale = beta_scale;
>
> - delta = (cwnd * scale) >> 3;
> + if (cwnd < ca->bic_origin_point)
> +  delta = (cwnd * scale) >> 3;
> + else
> +  delta = cwnd;
>   while (ca->ack_cnt > delta) { /* update tcp cwnd */
>    ca->ack_cnt -= delta;
>    ca->tcp_cwnd++;
>   }
>
> The original code follows RFC 8312 (obsoleted CUBIC RFC).
>
> The patched code follows RFC 9438 (new CUBIC RFC).
> > "Once _W_est_ has grown to reach the _cwnd_ at the time of most > recently setting _ssthresh_ -- that is, _W_est_ >= _cwnd_prior_ -- > the sender SHOULD set α__cubic_ to 1 to ensure that it can achieve > the same congestion window increment rate as Reno, which uses AIMD > (1,0.5)." Since ca->bic_origin_point does not really hold _cwnd_prior_ in the case of fast_convergence (which is very common), I would suggest using a new field, perhaps called ca->cwnd_prior, to hold the actual _cwnd_prior_ value. Something like the following: - delta = (cwnd * scale) >> 3; + if (cwnd < ca->cwnd_prior) + delta = (cwnd * scale) >> 3; + else + delta = cwnd; Then, in __bpf_kfunc static u32 cubictcp_recalc_ssthresh(struct sock *sk): else ca->last_max_cwnd = tcp_snd_cwnd(tp); +ca->cwnd_prior = tcp_snd_cwnd(tp); How does that sound? Thanks again for these fixes! Best regards, neal
On 8/13/24 20:59, Neal Cardwell wrote:
> On Sat, Aug 10, 2024 at 6:34 PM Mingrui Zhang <mrzhang97@gmail.com> wrote:
>> This patch fixes some CUBIC bugs so that "CUBIC achieves at least
>> the same throughput as Reno in small-BDP networks"
>> [RFC 9438: https://www.rfc-editor.org/rfc/rfc9438.html]
>>
>> It consists of three bug fixes, all changing function bictcp_update()
>> of tcp_cubic.c, which controls how fast CUBIC increases its
>> congestion window size snd_cwnd.
> Thanks for these fixes! I had some patches under development for bugs
> (2) and (3), but had not found the time to finish them up and send
> them out... And I had not noticed bug (1). :-) So thanks for sending
> out these fixes! :-)
>
> A few comments:
>
> (1) Since these are three separate bug fixes, in accordance with Linux
> development standards, which use minimal/narrow/focused patches,
> please submit each of the 3 separate fixes as a separate patch, as
> part of a patch series (ideally using git format-patch).
>
> (2) I would suggest using the --cover-letter flag to git format-patch
> to create a cover letter for the 3-patch series. You could put the
> nice experiment data in the cover letter, since the cover letter
> applies to all patches, and so do the experiment results.
>
> (3) Please include a "Fixes:" footer in the git commit message
> footers, to indicate which patch this is fixing, to enable backports
> of these fixes to stable kernels. You can see examples in the Linux
> git history. For example, you might use something like:
>
> Fixes: df3271f3361b ("[TCP] BIC: CUBIC window growth (2.0)")
>
> You can use the following to find the SHA1 of the patch you are fixing:
>
> git blame -- net/ipv4/tcp_cubic.c
>
> (4) For each patch title, please use "tcp_cubic:" as the first token
> in the patch title to indicate the area of the kernel you are fixing,
> and have a brief description of the specifics of the fix. For example,
> some suggested titles:
>
> tcp_cubic: fix to run bictcp_update() at least once per round
>
> tcp_cubic: fix to match Reno additive increment
>
> tcp_cubic: fix to use emulated Reno cwnd one round in the future
>
> (5) Please run ./scripts/checkpatch.pl to look for issues before sending:
>
> ./scripts/checkpatch.pl *patch
>
> (6) Please re-test before sending.

Thank you, Neal. I appreciate your comments on the patch format, and I
will follow your detailed instructions on patches.

>> Bug fix 1:
>>  ca->ack_cnt += acked; /* count the number of ACKed packets */
>>
>>  if (ca->last_cwnd == cwnd &&
>> -    (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32)
>> +    (s32)(tcp_jiffies32 - ca->last_time) <= min(HZ / 32, usecs_to_jiffies(ca->delay_min)))
>>   return;
>>
>>  /* The CUBIC function can update ca->cnt at most once per jiffy.
>>
>> The original code bypasses bictcp_update() under certain conditions
>> to reduce the CPU overhead. Intuitively, when last_cwnd==cwnd,
>> bictcp_update() is executed 32 times per second. As a result,
>> it is possible that bictcp_update() is not executed for several
>> RTTs when RTT is short (specifically < 1/32 second = 31 ms and
>> last_cwnd==cwnd, which may happen in small-BDP networks),
>> thus leading to low throughput in these RTTs.
>>
>> The patched code executes bictcp_update() 32 times per second
>> if RTT > 31 ms, or every RTT if RTT < 31 ms, when last_cwnd==cwnd.
>>
>> Bug fix 2:
>>  if (tcp_friendliness) {
>>   u32 scale = beta_scale;
>>
>> - delta = (cwnd * scale) >> 3;
>> + if (cwnd < ca->bic_origin_point)
>> +  delta = (cwnd * scale) >> 3;
>> + else
>> +  delta = cwnd;
>>   while (ca->ack_cnt > delta) { /* update tcp cwnd */
>>    ca->ack_cnt -= delta;
>>    ca->tcp_cwnd++;
>>   }
>>
>> The original code follows RFC 8312 (obsoleted CUBIC RFC).
>>
>> The patched code follows RFC 9438 (new CUBIC RFC).
>>
>> "Once W_est has grown to reach the cwnd at the time of most
>> recently setting ssthresh -- that is, W_est >= cwnd_prior --
>> the sender SHOULD set α_cubic to 1 to ensure that it can achieve
>> the same congestion window increment rate as Reno, which uses AIMD
>> (1,0.5)."
> Since ca->bic_origin_point does not really hold cwnd_prior in the
> case of fast_convergence (which is very common), I would suggest using
> a new field, perhaps called ca->cwnd_prior, to hold the actual
> cwnd_prior value. Something like the following:
>
> -               delta = (cwnd * scale) >> 3;
> +               if (cwnd < ca->cwnd_prior)
> +                       delta = (cwnd * scale) >> 3;
> +               else
> +                       delta = cwnd;
>
> Then, in __bpf_kfunc static u32 cubictcp_recalc_ssthresh(struct sock *sk):
>
>         else
>                 ca->last_max_cwnd = tcp_snd_cwnd(tp);
> +       ca->cwnd_prior = tcp_snd_cwnd(tp);
>
> How does that sound?

Adding a new field is a good suggestion, and we will refine the patch
accordingly.

>
> Thanks again for these fixes!
>
> Best regards,
> neal

Thank you for your valuable suggestions.

Best,
Mingrui
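The timing gate discussed under bug fix 1 can be sanity-checked with a
similar stand-alone sketch. HZ = 1000 and the 10 ms minimum RTT are
assumed example values, and usecs_to_jiffies() here is a simplified
user-space stand-in for the kernel helper of the same name.

/* Model of the bictcp_update() rate limit before and after the fix. */
#include <stdio.h>

#define HZ 1000	/* assumed tick rate, in jiffies per second */

/* crude stand-in for the kernel's usecs_to_jiffies(): round up */
static unsigned int usecs_to_jiffies(unsigned int us)
{
	return (us * HZ + 999999) / 1000000;
}

int main(void)
{
	unsigned int delay_min_us = 10000;	/* example min RTT: 10 ms */
	unsigned int old_gate = HZ / 32;	/* old limit: ~31 ms */
	unsigned int new_gate = old_gate;

	/* the patch clamps the gate to one minimum RTT when that is shorter */
	if (usecs_to_jiffies(delay_min_us) < new_gate)
		new_gate = usecs_to_jiffies(delay_min_us);

	printf("old gate: %u jiffies (~%u ms)\n", old_gate, old_gate * 1000 / HZ);
	printf("new gate: %u jiffies (~%u ms)\n", new_gate, new_gate * 1000 / HZ);
	return 0;
}

With these numbers the old code could skip bictcp_update() for roughly
three consecutive 10 ms RTTs when last_cwnd == cwnd, while the clamped
gate lets it run about once per RTT, which is the behaviour the patch
description aims for.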
diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
index 5dbed91c6178..2b5723194bfa 100644
--- a/net/ipv4/tcp_cubic.c
+++ b/net/ipv4/tcp_cubic.c
@@ -219,7 +219,7 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked)
 	ca->ack_cnt += acked;	/* count the number of ACKed packets */
 
 	if (ca->last_cwnd == cwnd &&
-	    (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32)
+	    (s32)(tcp_jiffies32 - ca->last_time) <= min(HZ / 32, usecs_to_jiffies(ca->delay_min)))
 		return;
 
 	/* The CUBIC function can update ca->cnt at most once per jiffy.
@@ -301,14 +301,19 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked)
 	if (tcp_friendliness) {
 		u32 scale = beta_scale;
 
-		delta = (cwnd * scale) >> 3;
+		if (cwnd < ca->bic_origin_point)
+			delta = (cwnd * scale) >> 3;
+		else
+			delta = cwnd;
 		while (ca->ack_cnt > delta) {		/* update tcp cwnd */
 			ca->ack_cnt -= delta;
 			ca->tcp_cwnd++;
 		}
 
-		if (ca->tcp_cwnd > cwnd) {	/* if bic is slower than tcp */
-			delta = ca->tcp_cwnd - cwnd;
+		u32 tcp_cwnd_next_rtt = ca->tcp_cwnd + (ca->ack_cnt + cwnd) / delta;
+
+		if (tcp_cwnd_next_rtt > cwnd) {	/* if bic is slower than tcp */
+			delta = tcp_cwnd_next_rtt - cwnd;
 			max_cnt = cwnd / delta;
 			if (ca->cnt > max_cnt)
 				ca->cnt = max_cnt;
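To see what the last part of the second hunk (projecting the emulated
Reno cwnd one RTT ahead) does numerically, here is a stand-alone sketch
with made-up sample values; only the tcp_cwnd_next_rtt formula and the
cnt derivation mirror the diff above, the rest is illustration.

/* Worked example of the "emulated Reno cwnd one round in the future" fix. */
#include <stdio.h>

int main(void)
{
	unsigned int cwnd = 100;	/* current CUBIC cwnd (packets), example */
	unsigned int tcp_cwnd = 100;	/* emulated Reno cwnd right now, example */
	unsigned int ack_cnt = 40;	/* ACKs counted toward the next step, example */
	unsigned int delta = cwnd;	/* ACKs per emulated-Reno increment (alpha = 1) */

	/* The old check (ca->tcp_cwnd > cwnd) does not fire here, so no Reno
	 * clamp is applied even though Reno would gain a packet this round.
	 */

	/* New check: project where the emulated Reno cwnd will be after the
	 * cwnd ACKs of the coming round have also been counted.
	 */
	unsigned int tcp_cwnd_next_rtt = tcp_cwnd + (ack_cnt + cwnd) / delta;

	if (tcp_cwnd_next_rtt > cwnd) {
		unsigned int gap = tcp_cwnd_next_rtt - cwnd;
		/* cwnd grows by 1 per cnt ACKs, so cnt = cwnd / gap closes
		 * the gap within roughly one RTT.
		 */
		unsigned int cnt = cwnd / gap;

		printf("tcp_cwnd_next_rtt=%u gap=%u cnt=%u\n",
		       tcp_cwnd_next_rtt, gap, cnt);
	}
	return 0;
}

In this example the projected Reno cwnd is 101, so cnt becomes 100 and
cwnd gains about one packet over the coming round, whereas the unpatched
comparison against the current tcp_cwnd would leave CUBIC one round
behind the Reno pace.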