
[net] tcp_cubic fix to achieve at least the same throughput as Reno

Message ID: 20240810223130.379146-1-mrzhang97@gmail.com (mailing list archive)
State: Changes Requested
Delegated to: Netdev Maintainers
Series: [net] tcp_cubic fix to achieve at least the same throughput as Reno

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present fail Series targets non-next tree, but doesn't contain any Fixes tags
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 29 this patch: 29
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers warning 3 maintainers not CCed: kuba@kernel.org dsahern@kernel.org pabeni@redhat.com
netdev/build_clang success Errors and warnings before: 29 this patch: 29
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 29 this patch: 29
netdev/checkpatch warning WARNING: line length of 84 exceeds 80 columns WARNING: line length of 98 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-08-11--21-00 (tests: 707)

Commit Message

Mingrui Zhang Aug. 10, 2024, 10:31 p.m. UTC
This patch fixes three CUBIC bugs so that "CUBIC achieves at least
the same throughput as Reno in small-BDP networks"
[RFC 9438: https://www.rfc-editor.org/rfc/rfc9438.html]

It consists of three bug fixes, all changing function bictcp_update()
of tcp_cubic.c, which controls how fast CUBIC increases its
congestion window size snd_cwnd.

Bug fix 1:
 	ca->ack_cnt += acked;	/* count the number of ACKed packets */
 
 	if (ca->last_cwnd == cwnd &&
-	    (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32)
+	    (s32)(tcp_jiffies32 - ca->last_time) <= min(HZ / 32, usecs_to_jiffies(ca->delay_min)))
 		return;
 
 	/* The CUBIC function can update ca->cnt at most once per jiffy.

The original code bypasses bictcp_update() under certain conditions
to reduce CPU overhead. Intuitively, when last_cwnd==cwnd,
bictcp_update() is executed at most 32 times per second. As a result,
bictcp_update() may not be executed for several consecutive RTTs
when the RTT is short (specifically, below 1/32 second ~= 31 ms) and
last_cwnd==cwnd, which may happen in small-BDP networks, thus
leading to low throughput during those RTTs.

The patched code, when last_cwnd==cwnd, executes bictcp_update()
at most 32 times per second if RTT > 31 ms, or once per RTT if
RTT < 31 ms.
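
A stand-alone illustration of the new gating condition (user-space C,
not kernel code; it assumes HZ = 1000 so that 1 jiffy = 1 ms, and
approximates the kernel's usecs_to_jiffies() by rounding up):

  #include <stdio.h>

  #define HZ 1000                 /* assumed tick rate: 1 jiffy = 1 ms */

  /* rough stand-in for the kernel's usecs_to_jiffies() */
  static unsigned long usecs_to_jiffies_sketch(unsigned int us)
  {
          return (us + (1000000 / HZ) - 1) / (1000000 / HZ);
  }

  int main(void)
  {
          unsigned int delay_min_us[] = { 4000, 31250, 100000 };

          for (int i = 0; i < 3; i++) {
                  unsigned long rtt = usecs_to_jiffies_sketch(delay_min_us[i]);
                  unsigned long gate = rtt < HZ / 32 ? rtt : HZ / 32;

                  /* old gate: always HZ / 32 = 31 jiffies */
                  printf("min RTT %6u us -> update every %2lu jiffies (was %d)\n",
                         delay_min_us[i], gate, HZ / 32);
          }
          return 0;
  }

For a 4 ms RTT the update interval drops from 31 jiffies (about 8 RTTs)
to 4 jiffies (about 1 RTT); for RTTs above 31 ms nothing changes.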

Bug fix 2:
 	if (tcp_friendliness) {
 		u32 scale = beta_scale;
 
-		delta = (cwnd * scale) >> 3;
+		if (cwnd < ca->bic_origin_point)
+			delta = (cwnd * scale) >> 3;
+		else
+			delta = cwnd;
 		while (ca->ack_cnt > delta) {		/* update tcp cwnd */
 			ca->ack_cnt -= delta;
 			ca->tcp_cwnd++;
 		}

The original code follows RFC 8312 (the obsoleted CUBIC RFC).

The patched code follows RFC 9438 (the current CUBIC RFC), which says:

"Once _W_est_ has grown to reach the _cwnd_ at the time of most
recently setting _ssthresh_ -- that is, _W_est_ >= _cwnd_prior_ --
the sender SHOULD set α__cubic_ to 1 to ensure that it can achieve
the same congestion window increment rate as Reno, which uses AIMD
(1,0.5)."
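
To make the constants concrete: with the kernel default beta = 717
(i.e., beta ~= 717/1024 ~= 0.7), RFC 8312's alpha = 3(1-beta)/(1+beta)
~= 0.53, so the emulated Reno window advances once per cwnd/0.53 ~=
1.875*cwnd ACKs, whereas true Reno (alpha = 1) advances once per cwnd
ACKs. A small stand-alone sketch (the constants are copied from
tcp_cubic.c; the rest is illustrative user-space C):

  #include <stdio.h>

  #define BICTCP_BETA_SCALE 1024
  static const int beta = 717;    /* ~= 0.7 * 1024, kernel default */

  int main(void)
  {
          /* fixed-point 8/alpha, where alpha = 3*(1-beta)/(1+beta) */
          int beta_scale = 8 * (BICTCP_BETA_SCALE + beta) / 3
                           / (BICTCP_BETA_SCALE - beta);
          unsigned int cwnd = 100;        /* example cwnd, in packets */

          /* tcp_cwnd increments once per "delta" ACKed packets */
          unsigned int delta_aimd = (cwnd * beta_scale) >> 3;  /* pre-fix */
          unsigned int delta_reno = cwnd;                      /* post-fix */

          printf("beta_scale=%d: aimd delta=%u acks, reno delta=%u acks\n",
                 beta_scale, delta_aimd, delta_reno);
          return 0;
  }

With cwnd = 100 this prints a delta of 187 ACKs before the fix versus
100 ACKs after it, i.e., the fix lets the emulated Reno window grow at
Reno's full rate once the window has passed the origin point.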

Bug fix 3:
-		if (ca->tcp_cwnd > cwnd) {	/* if bic is slower than tcp */
-			delta = ca->tcp_cwnd - cwnd;
+		u32 tcp_cwnd_next_rtt = ca->tcp_cwnd + (ca->ack_cnt + cwnd) / delta;
+
+		if (tcp_cwnd_next_rtt > cwnd) {  /* if bic is slower than tcp */
+			delta = tcp_cwnd_next_rtt - cwnd;
 			max_cnt = cwnd / delta;
 			if (ca->cnt > max_cnt)
 				ca->cnt = max_cnt;
 
The original code estimates the Reno snd_cwnd using the emulated
Reno snd_cwnd at the current time (i.e., tcp_cwnd).

The patched code instead uses the emulated Reno snd_cwnd one RTT in
the future (i.e., tcp_cwnd_next_rtt), because ca->cnt is used to
increase snd_cwnd during the next RTT.
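
A numeric sketch of the difference (the values are made up for
illustration; delta = cwnd corresponds to the Reno regime of bug
fix 2 above):

  #include <stdio.h>

  int main(void)
  {
          unsigned int cwnd = 100;        /* current cwnd (packets) */
          unsigned int tcp_cwnd = 100;    /* emulated Reno cwnd, now */
          unsigned int ack_cnt = 50;      /* ACKs toward the next increment */
          unsigned int delta = cwnd;      /* one increment per cwnd ACKs */

          /* after ~one more RTT (~cwnd more ACKs), Reno would reach: */
          unsigned int next = tcp_cwnd + (ack_cnt + cwnd) / delta;

          /* old check: tcp_cwnd > cwnd is false -> cnt is never clamped
           * new check: next > cwnd is true -> cnt clamped to cwnd/delta,
           * so cwnd grows by at least (next - cwnd) packets this RTT
           */
          printf("now=%u, next_rtt=%u -> grow by >= %u packet(s)\n",
                 tcp_cwnd, next, next - cwnd);
          return 0;
  }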

Experiments:

Below are Mininet experiments to demonstrate the performance difference
between the original CUBIC and patched CUBIC.

Network: link capacity = 100 Mbps, RTT = 4 ms

TCP flows: one Reno and one CUBIC, each with an initial cwnd of
10 packets. The first data packet of each flow is lost.

snd_cwnd of the Reno and original CUBIC flows:

https://github.com/zmrui/tcp_cubic_fix/blob/main/reno_and_cubic.jpg

snd_cwnd of the Reno and patched CUBIC (with bug fixes 1, 2, and 3) flows:

https://github.com/zmrui/tcp_cubic_fix/blob/main/reno_and_cubic_fixb1b2b3.jpg

Results for the patched CUBIC with different combinations of
bug fixes 1, 2, and 3, along with further experiment results, can be
found at the following link:

https://github.com/zmrui/tcp_cubic_fix


Thanks,
Mingrui and Lisong

Signed-off-by: Mingrui Zhang <mrzhang97@gmail.com>
Signed-off-by: Lisong Xu <xu@unl.edu>
---
 net/ipv4/tcp_cubic.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

Comments

Neal Cardwell Aug. 14, 2024, 1:59 a.m. UTC | #1
On Sat, Aug 10, 2024 at 6:34 PM Mingrui Zhang <mrzhang97@gmail.com> wrote:
>
> This patch fixes some CUBIC bugs so that  "CUBIC achieves at least
> the same throughput as Reno in small-BDP networks"
> [RFC 9438: https://www.rfc-editor.org/rfc/rfc9438.html]
>
> It consists of three bug fixes, all changing function bictcp_update()
> of tcp_cubic.c, which controls how fast CUBIC increases its
> congestion window size snd_cwnd.

Thanks for these fixes! I had some patches under development for bugs
(2) and (3), but had not found the time to finish them up and send
them out... And I had not noticed bug (1).  :-) So thanks for sending
out these fixes! :-)

A few comments:

(1) Since these are three separate bug fixes, in accordance with Linux
development standards, which use minimal/narrow/focused patches,
please submit each of the 3 separate fixes as a separate patch, as
part of a patch series (ideally using git format-patch).

(2) I would suggest using the --cover-letter flag to git format-patch
to create a cover letter for the 3-patch series. You could put the
nice experiment data in the cover letter, since the cover letter
applies to all patches, and so do the experiment results.

(3) Please include a "Fixes:" footer in the git commit message
footers, to indicate which patch this is fixing, to enable backports
of these fixes to stable kernels. You can see examples in the Linux
git history. For example, you might use something like:

Fixes: df3271f3361b ("[TCP] BIC: CUBIC window growth (2.0)")

You can use the following to find the SHA1 of the patch you are fixing:

   git blame -- net/ipv4/tcp_cubic.c

(4) For each patch title, please use "tcp_cubic:" as the first token
in the patch title to indicate the area of the kernel you are fixing,
and have a brief description of the specifics of the fix. For example,
some suggested titles:

tcp_cubic: fix to run bictcp_update() at least once per round

tcp_cubic: fix to match Reno additive increment

tcp_cubic: fix to use emulated Reno cwnd one round in the future

(5) Please run ./scripts/checkpatch.pl to look for issues before sending:

  ./scripts/checkpatch.pl *patch

(6) Please re-test before sending.

> Bug fix 1:
>         ca->ack_cnt += acked;   /* count the number of ACKed packets */
>
>         if (ca->last_cwnd == cwnd &&
> -           (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32)
> +           (s32)(tcp_jiffies32 - ca->last_time) <= min(HZ / 32, usecs_to_jiffies(ca->delay_min)))
>                 return;
>
>         /* The CUBIC function can update ca->cnt at most once per jiffy.
>
> The original code bypasses bictcp_update() under certain conditions
> to reduce the CPU overhead. Intuitively, when last_cwnd==cwnd,
> bictcp_update() is executed 32 times per second. As a result,
> it is possible that bictcp_update() is not executed for several
> RTTs when RTT is short (specifically < 1/32 second = 31 ms and
> last_cwnd==cwnd which may happen in small-BDP networks),
> thus leading to low throughput in these RTTs.
>
> The patched code executes bictcp_update() 32 times per second
> if RTT > 31 ms or every RTT if RTT < 31 ms, when last_cwnd==cwnd.
>
> Bug fix 2:
>         if (tcp_friendliness) {
>                 u32 scale = beta_scale;
>
> -               delta = (cwnd * scale) >> 3;
> +               if (cwnd < ca->bic_origin_point)
> +                       delta = (cwnd * scale) >> 3;
> +               else
> +                       delta = cwnd;
>                 while (ca->ack_cnt > delta) {           /* update tcp cwnd */
>                         ca->ack_cnt -= delta;
>                         ca->tcp_cwnd++;
>                 }
>
> The original code follows RFC 8312 (obsoleted CUBIC RFC).
>
> The patched code follows RFC 9438 (new CUBIC RFC).
>
> "Once _W_est_ has grown to reach the _cwnd_ at the time of most
> recently setting _ssthresh_ -- that is, _W_est_ >= _cwnd_prior_ --
> the sender SHOULD set α__cubic_ to 1 to ensure that it can achieve
> the same congestion window increment rate as Reno, which uses AIMD
> (1,0.5)."

Since ca->bic_origin_point does not really hold _cwnd_prior_ in the
case of fast_convergence (which is very common), I would suggest using
a new field, perhaps called ca->cwnd_prior, to hold the actual
_cwnd_prior_ value. Something like the following:

-               delta = (cwnd * scale) >> 3;
+               if (cwnd < ca->cwnd_prior)
+                       delta = (cwnd * scale) >> 3;
+               else
+                       delta = cwnd;

Then, in __bpf_kfunc static u32 cubictcp_recalc_ssthresh(struct sock *sk):

         else
                 ca->last_max_cwnd = tcp_snd_cwnd(tp);
        +ca->cwnd_prior = tcp_snd_cwnd(tp);

How does that sound?
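
For concreteness, here is a consolidated sketch of the above (pieced
together from the snippets; an illustration rather than a tested
patch, and it assumes a new "u32 cwnd_prior" field added to struct
bictcp):

  __bpf_kfunc static u32 cubictcp_recalc_ssthresh(struct sock *sk)
  {
          const struct tcp_sock *tp = tcp_sk(sk);
          struct bictcp *ca = inet_csk_ca(sk);

          ca->epoch_start = 0;    /* end of epoch */

          /* Wmax and fast convergence */
          if (tcp_snd_cwnd(tp) < ca->last_max_cwnd && fast_convergence)
                  ca->last_max_cwnd = (tcp_snd_cwnd(tp) * (BICTCP_BETA_SCALE + beta))
                          / (2 * BICTCP_BETA_SCALE);
          else
                  ca->last_max_cwnd = tcp_snd_cwnd(tp);

          ca->cwnd_prior = tcp_snd_cwnd(tp);      /* actual _cwnd_prior_ */

          return max((tcp_snd_cwnd(tp) * beta) / BICTCP_BETA_SCALE, 2U);
  }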


Thanks again for these fixes!

Best regards,
neal
Mingrui Zhang Aug. 14, 2024, 7:33 p.m. UTC | #2
On 8/13/24 20:59, Neal Cardwell wrote:
> On Sat, Aug 10, 2024 at 6:34 PM Mingrui Zhang <mrzhang97@gmail.com> wrote:
>> This patch fixes some CUBIC bugs so that  "CUBIC achieves at least
>> the same throughput as Reno in small-BDP networks"
>> [RFC 9438: https://www.rfc-editor.org/rfc/rfc9438.html]
>>
>> It consists of three bug fixes, all changing function bictcp_update()
>> of tcp_cubic.c, which controls how fast CUBIC increases its
>> congestion window size snd_cwnd.
> Thanks for these fixes! I had some patches under development for bugs
> (2) and (3), but had not found the time to finish them up and send
> them out... And I had not noticed bug (1).  :-) So thanks for sending
> out these fixes! :-)
>
> A few comments:
>
> (1) Since these are three separate bug fixes, in accordance with Linux
> development standards, which use minimal/narrow/focused patches,
> please submit each of the 3 separate fixes as a separate patch, as
> part of a patch series (ideally using git format-patch).
>
> (2) I would suggest using the --cover-letter flag to git format-patch
> to create a cover letter for the 3-patch series. You could put the
> nice experiment data in the cover letter, since the cover letter
> applies to all patches, and so do the experiment results.
>
> (3) Please include a "Fixes:" footer in the git commit message
> footers, to indicate which patch this is fixing, to enable backports
> of these fixes to stable kernels. You can see examples in the Linux
> git history. For example, you might use something like:
>
> Fixes: df3271f3361b ("[TCP] BIC: CUBIC window growth (2.0)")
>
> You can use the following to find the SHA1 of the patch you are fixing:
>
>    git blame -- net/ipv4/tcp_cubic.c
>
> (4) For each patch title, please use "tcp_cubic:" as the first token
> in the patch title to indicate the area of the kernel you are fixing,
> and have a brief description of the specifics of the fix. For example,
> some suggested titles:
>
> tcp_cubic: fix to run bictcp_update() at least once per round
>
> tcp_cubic: fix to match Reno additive increment
>
> tcp_cubic: fix to use emulated Reno cwnd one round in the future
>
> (5) Please run ./scripts/checkpatch.pl to look for issues before sending:
>
>   ./scripts/checkpatch.pl *patch
>
> (6) Please re-test before sending.

Thank you, Neal.
I appreciate your comments on the patch format, and I will follow your detailed instructions.

>> Bug fix 1:
>>         ca->ack_cnt += acked;   /* count the number of ACKed packets */
>>
>>         if (ca->last_cwnd == cwnd &&
>> -           (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32)
>> +           (s32)(tcp_jiffies32 - ca->last_time) <= min(HZ / 32, usecs_to_jiffies(ca->delay_min)))
>>                 return;
>>
>>         /* The CUBIC function can update ca->cnt at most once per jiffy.
>>
>> The original code bypasses bictcp_update() under certain conditions
>> to reduce the CPU overhead. Intuitively, when last_cwnd==cwnd,
>> bictcp_update() is executed 32 times per second. As a result,
>> it is possible that bictcp_update() is not executed for several
>> RTTs when RTT is short (specifically < 1/32 second = 31 ms and
>> last_cwnd==cwnd which may happen in small-BDP networks),
>> thus leading to low throughput in these RTTs.
>>
>> The patched code executes bictcp_update() 32 times per second
>> if RTT > 31 ms or every RTT if RTT < 31 ms, when last_cwnd==cwnd.
>>
>> Bug fix 2:
>>         if (tcp_friendliness) {
>>                 u32 scale = beta_scale;
>>
>> -               delta = (cwnd * scale) >> 3;
>> +               if (cwnd < ca->bic_origin_point)
>> +                       delta = (cwnd * scale) >> 3;
>> +               else
>> +                       delta = cwnd;
>>                 while (ca->ack_cnt > delta) {           /* update tcp cwnd */
>>                         ca->ack_cnt -= delta;
>>                         ca->tcp_cwnd++;
>>                 }
>>
>> The original code follows RFC 8312 (obsoleted CUBIC RFC).
>>
>> The patched code follows RFC 9438 (new CUBIC RFC).
>>
>> "Once _W_est_ has grown to reach the _cwnd_ at the time of most
>> recently setting _ssthresh_ -- that is, _W_est_ >= _cwnd_prior_ --
>> the sender SHOULD set α__cubic_ to 1 to ensure that it can achieve
>> the same congestion window increment rate as Reno, which uses AIMD
>> (1,0.5)."
> Since ca->bic_origin_point does not really hold _cwnd_prior_ in the
> case of fast_convergence (which is very common), I would suggest using
> a new field, perhaps called ca->cwnd_prior, to hold the actual
> _cwnd_prior_ value. Something like the following:
>
> -               delta = (cwnd * scale) >> 3;
> +               if (cwnd < ca->cwnd_prior)
> +                       delta = (cwnd * scale) >> 3;
> +               else
> +                       delta = cwnd;
>
> Then, in __bpf_kfunc static u32 cubictcp_recalc_ssthresh(struct sock *sk):
>
>          else
>                  ca->last_max_cwnd = tcp_snd_cwnd(tp);
>         +ca->cwnd_prior = tcp_snd_cwnd(tp);
>
> How does that sound?

Adding a new field is a good suggestion, and we will refine the patch accordingly.

>
> Thanks again for these fixes!
>
> Best regards,
> neal

Thank you for your valuable suggestions.

Best,
Mingrui

Patch

diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
index 5dbed91c6178..2b5723194bfa 100644
--- a/net/ipv4/tcp_cubic.c
+++ b/net/ipv4/tcp_cubic.c
@@ -219,7 +219,7 @@  static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked)
 	ca->ack_cnt += acked;	/* count the number of ACKed packets */
 
 	if (ca->last_cwnd == cwnd &&
-	    (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32)
+	    (s32)(tcp_jiffies32 - ca->last_time) <= min(HZ / 32, usecs_to_jiffies(ca->delay_min)))
 		return;
 
 	/* The CUBIC function can update ca->cnt at most once per jiffy.
@@ -301,14 +301,19 @@  static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked)
 	if (tcp_friendliness) {
 		u32 scale = beta_scale;
 
-		delta = (cwnd * scale) >> 3;
+		if (cwnd < ca->bic_origin_point)
+			delta = (cwnd * scale) >> 3;
+		else
+			delta = cwnd;
 		while (ca->ack_cnt > delta) {		/* update tcp cwnd */
 			ca->ack_cnt -= delta;
 			ca->tcp_cwnd++;
 		}
 
-		if (ca->tcp_cwnd > cwnd) {	/* if bic is slower than tcp */
-			delta = ca->tcp_cwnd - cwnd;
+		u32 tcp_cwnd_next_rtt = ca->tcp_cwnd + (ca->ack_cnt + cwnd) / delta;
+
+		if (tcp_cwnd_next_rtt > cwnd) {  /* if bic is slower than tcp */
+			delta = tcp_cwnd_next_rtt - cwnd;
 			max_cnt = cwnd / delta;
 			if (ca->cnt > max_cnt)
 				ca->cnt = max_cnt;