| Message ID | 20240826092707.2661435-1-edumazet@google.com (mailing list archive) |
| --- | --- |
| State | Changes Requested |
| Delegated to | Netdev Maintainers |
| Series | [net] tcp_cubic: switch ca->last_time to usec resolution |
On Mon, Aug 26, 2024 at 5:27 AM Eric Dumazet <edumazet@google.com> wrote:
>
> bictcp_update() uses ca->last_time as a timestamp
> to decide on several heuristics.
>
> Historically this timestamp has been fed with jiffies,
> which has too coarse a resolution; some distros are
> still using CONFIG_HZ_250=y.
>
> It is time to switch to usec resolution, now that the TCP stack
> already caches the high-resolution time in tp->tcp_mstamp.
>
> Also remove the 'inline' qualifier; this helper is used
> once and compilers are smart.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Link: https://lore.kernel.org/netdev/20240817163400.2616134-1-mrzhang97@gmail.com/T/#mb6a64c9e2309eb98eaeeeb4b085c4a2270b6789d
> Cc: Mingrui Zhang <mrzhang97@gmail.com>
> Cc: Lisong Xu <xu@unl.edu>
> ---
>  net/ipv4/tcp_cubic.c | 18 ++++++++++--------
>  1 file changed, 10 insertions(+), 8 deletions(-)
>
> diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
> index 5dbed91c6178257df8d2ccd1c8690a10bdbaf56a..3b1845103ee1866a316926a130c212e6f5e78ef0 100644
> --- a/net/ipv4/tcp_cubic.c
> +++ b/net/ipv4/tcp_cubic.c
> @@ -87,7 +87,7 @@ struct bictcp {
>         u32     cnt;            /* increase cwnd by 1 after ACKs */
>         u32     last_max_cwnd;  /* last maximum snd_cwnd */
>         u32     last_cwnd;      /* the last snd_cwnd */
> -       u32     last_time;      /* time when updated last_cwnd */
> +       u32     last_time;      /* time when updated last_cwnd (usec) */
>         u32     bic_origin_point;/* origin point of bic function */
>         u32     bic_K;          /* time to origin point
>                                    from the beginning of the current epoch */
> @@ -211,26 +211,28 @@ static u32 cubic_root(u64 a)
>  /*
>   * Compute congestion window to use.
>   */
> -static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked)
> +static void bictcp_update(struct sock *sk, u32 cwnd, u32 acked)
>  {
> +       const struct tcp_sock *tp = tcp_sk(sk);
> +       struct bictcp *ca = inet_csk_ca(sk);
>         u32 delta, bic_target, max_cnt;
>         u64 offs, t;
>
>         ca->ack_cnt += acked;   /* count the number of ACKed packets */
>
> -       if (ca->last_cwnd == cwnd &&
> -           (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32)
> +       delta = tp->tcp_mstamp - ca->last_time;
> +       if (ca->last_cwnd == cwnd && delta <= USEC_PER_SEC / 32)
>                 return;
>
> -       /* The CUBIC function can update ca->cnt at most once per jiffy.
> +       /* The CUBIC function can update ca->cnt at most once per ms.
>          * On all cwnd reduction events, ca->epoch_start is set to 0,
>          * which will force a recalculation of ca->cnt.
>          */
> -       if (ca->epoch_start && tcp_jiffies32 == ca->last_time)
> +       if (ca->epoch_start && delta < USEC_PER_MSEC)
>                 goto tcp_friendliness;

AFAICT there is a problem here. It is switching this line of code to
use microsecond resolution without also changing the core CUBIC slope
(ca->cnt) calculation to use microseconds. AFAICT that means we
would be re-introducing the bug that was fixed in 2015 in
d6b1a8a92a1417f8859a6937d2e6ffe2dfab4e6d (see below). Basically, if
the CUBIC slope (ca->cnt) calculation uses jiffies, then we should
only run that code once per jiffy, to avoid getting the wrong answer
for the slope:

commit d6b1a8a92a1417f8859a6937d2e6ffe2dfab4e6d
Author: Neal Cardwell <ncardwell@google.com>
Date:   Wed Jan 28 20:01:39 2015 -0500

    tcp: fix timing issue in CUBIC slope calculation

    This patch fixes a bug in CUBIC that causes cwnd to increase slightly
    too slowly when multiple ACKs arrive in the same jiffy.

    If cwnd is supposed to increase at a rate of more than once per jiffy,
    then CUBIC was sometimes too slow.
    Because the bic_target is calculated for a future point in time,
    calculated with time in jiffies, the cwnd can increase over the course
    of the jiffy while the bic_target calculated as the proper CUBIC cwnd
    at time t=tcp_time_stamp+rtt does not increase, because tcp_time_stamp
    only increases on jiffy tick boundaries.

    So since the cnt is set to:

        ca->cnt = cwnd / (bic_target - cwnd);

    as cwnd increases but bic_target does not increase due to jiffy
    granularity, the cnt becomes too large, causing cwnd to increase
    too slowly.

    For example:
     - suppose at the beginning of a jiffy, cwnd=40, bic_target=44
     - so CUBIC sets:
         ca->cnt = cwnd / (bic_target - cwnd) = 40 / (44 - 40) = 40/4 = 10
     - suppose we get 10 acks, each for 1 segment, so tcp_cong_avoid_ai()
       increases cwnd to 41
     - so CUBIC sets:
         ca->cnt = cwnd / (bic_target - cwnd) = 41 / (44 - 41) = 41/3 = 13

    So now CUBIC will wait for 13 packets to be ACKed before increasing
    cwnd to 42, instead of 10 as it should.

    The fix is to avoid adjusting the slope (determined by ca->cnt)
    multiple times within a jiffy, and instead skip to compute the Reno
    cwnd, the "TCP friendliness" code path.

    Reported-by: Eyal Perry <eyalpe@mellanox.com>
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: Yuchung Cheng <ycheng@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
index ffc045da2fd5..4b276d1ed980 100644
--- a/net/ipv4/tcp_cubic.c
+++ b/net/ipv4/tcp_cubic.c
@@ -213,6 +213,13 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked)
             (s32)(tcp_time_stamp - ca->last_time) <= HZ / 32)
                 return;

+        /* The CUBIC function can update ca->cnt at most once per jiffy.
+         * On all cwnd reduction events, ca->epoch_start is set to 0,
+         * which will force a recalculation of ca->cnt.
+         */
+        if (ca->epoch_start && tcp_time_stamp == ca->last_time)
+                goto tcp_friendliness;
+
         ca->last_cwnd = cwnd;
         ca->last_time = tcp_time_stamp;

@@ -280,6 +287,7 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked)
         if (ca->last_max_cwnd == 0 && ca->cnt > 20)
                 ca->cnt = 20;   /* increase cwnd 5% per RTT */

+tcp_friendliness:
         /* TCP Friendly */
         if (tcp_friendliness) {
                 u32 scale = beta_scale;

---
neal
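The slope inflation in the worked example above is easy to reproduce outside the kernel. Below is a minimal standalone C sketch; cubic_cnt() is a hypothetical stand-in, not kernel code, mirroring only the ca->cnt = cwnd / (bic_target - cwnd) step and the kernel's small-increment fallback for bic_target <= cwnd:

#include <stdio.h>

/* Illustrative stand-in for the CUBIC slope computation: while
 * bic_target is frozen at a jiffy-granularity value, cnt inflates
 * as cwnd grows within the same jiffy.
 */
static unsigned int cubic_cnt(unsigned int cwnd, unsigned int bic_target)
{
        if (bic_target > cwnd)
                return cwnd / (bic_target - cwnd);
        return 100 * cwnd;      /* very small increment, as in bictcp_update() */
}

int main(void)
{
        unsigned int bic_target = 44;   /* frozen for the whole jiffy */

        /* Start of the jiffy: cwnd=40 -> cnt = 40 / 4 = 10 */
        printf("cnt at cwnd=40: %u\n", cubic_cnt(40, bic_target));

        /* After 10 ACKs, cwnd=41 but bic_target has not moved:
         * cnt = 41 / 3 = 13, so cwnd now grows too slowly.
         */
        printf("cnt at cwnd=41: %u\n", cubic_cnt(41, bic_target));
        return 0;
}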
On Mon, Aug 26, 2024 at 3:26 PM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Mon, Aug 26, 2024 at 5:27 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > bictcp_update() uses ca->last_time as a timestamp
> > to decide on several heuristics.
> >
> > Historically this timestamp has been fed with jiffies,
> > which has too coarse a resolution; some distros are
> > still using CONFIG_HZ_250=y.
> >
> > It is time to switch to usec resolution, now that the TCP stack
> > already caches the high-resolution time in tp->tcp_mstamp.
> >
> > Also remove the 'inline' qualifier; this helper is used
> > once and compilers are smart.
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Link: https://lore.kernel.org/netdev/20240817163400.2616134-1-mrzhang97@gmail.com/T/#mb6a64c9e2309eb98eaeeeb4b085c4a2270b6789d
> > Cc: Mingrui Zhang <mrzhang97@gmail.com>
> > Cc: Lisong Xu <xu@unl.edu>
> > ---
> >  net/ipv4/tcp_cubic.c | 18 ++++++++++--------
> >  1 file changed, 10 insertions(+), 8 deletions(-)
> >
> > diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
> > index 5dbed91c6178257df8d2ccd1c8690a10bdbaf56a..3b1845103ee1866a316926a130c212e6f5e78ef0 100644
> > --- a/net/ipv4/tcp_cubic.c
> > +++ b/net/ipv4/tcp_cubic.c
> > @@ -87,7 +87,7 @@ struct bictcp {
> >         u32     cnt;            /* increase cwnd by 1 after ACKs */
> >         u32     last_max_cwnd;  /* last maximum snd_cwnd */
> >         u32     last_cwnd;      /* the last snd_cwnd */
> > -       u32     last_time;      /* time when updated last_cwnd */
> > +       u32     last_time;      /* time when updated last_cwnd (usec) */
> >         u32     bic_origin_point;/* origin point of bic function */
> >         u32     bic_K;          /* time to origin point
> >                                    from the beginning of the current epoch */
> > @@ -211,26 +211,28 @@ static u32 cubic_root(u64 a)
> >  /*
> >   * Compute congestion window to use.
> >   */
> > -static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked)
> > +static void bictcp_update(struct sock *sk, u32 cwnd, u32 acked)
> >  {
> > +       const struct tcp_sock *tp = tcp_sk(sk);
> > +       struct bictcp *ca = inet_csk_ca(sk);
> >         u32 delta, bic_target, max_cnt;
> >         u64 offs, t;
> >
> >         ca->ack_cnt += acked;   /* count the number of ACKed packets */
> >
> > -       if (ca->last_cwnd == cwnd &&
> > -           (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32)
> > +       delta = tp->tcp_mstamp - ca->last_time;
> > +       if (ca->last_cwnd == cwnd && delta <= USEC_PER_SEC / 32)
> >                 return;
> >
> > -       /* The CUBIC function can update ca->cnt at most once per jiffy.
> > +       /* The CUBIC function can update ca->cnt at most once per ms.
> >          * On all cwnd reduction events, ca->epoch_start is set to 0,
> >          * which will force a recalculation of ca->cnt.
> >          */
> > -       if (ca->epoch_start && tcp_jiffies32 == ca->last_time)
> > +       if (ca->epoch_start && delta < USEC_PER_MSEC)
> >                 goto tcp_friendliness;
>
> AFAICT there is a problem here. It is switching this line of code to
> use microsecond resolution without also changing the core CUBIC slope
> (ca->cnt) calculation to use microseconds. AFAICT that means we
> would be re-introducing the bug that was fixed in 2015 in
> d6b1a8a92a1417f8859a6937d2e6ffe2dfab4e6d (see below). Basically, if
> the CUBIC slope (ca->cnt) calculation uses jiffies, then we should
> only run that code once per jiffy, to avoid getting the wrong answer
> for the slope:

Interesting.... would adding the following part deal with this
problem, or is it something else ?
diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
index 3b1845103ee1866a316926a130c212e6f5e78ef0..bff5688ba5109fa5a0bbff7dc529525b2752dc46 100644
--- a/net/ipv4/tcp_cubic.c
+++ b/net/ipv4/tcp_cubic.c
@@ -268,9 +268,10 @@ static void bictcp_update(struct sock *sk, u32 cwnd, u32 acked)

         t = (s32)(tcp_jiffies32 - ca->epoch_start);
         t += usecs_to_jiffies(ca->delay_min);
-        /* change the unit from HZ to bictcp_HZ */
+        t = jiffies_to_msecs(t);
+        /* change the unit from ms to bictcp_HZ */
         t <<= BICTCP_HZ;
-        do_div(t, HZ);
+        do_div(t, MSEC_PER_SEC);

         if (t < ca->bic_K)      /* t - K */
                 offs = ca->bic_K - t;
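What this hunk changes numerically can be checked in isolation. A standalone sketch (assumptions: HZ=250, and jiffies_to_msecs() inlined as a plain multiply, valid for HZ values that divide 1000): with a jiffy-granular t as input, the old and the proposed scalings land on the same bictcp_HZ value, which is consistent with Neal's follow-up below.

#include <stdio.h>
#include <stdint.h>

#define BICTCP_HZ 10    /* BIC HZ 2^10 = 1024, as in tcp_cubic.c */

/* Old path: scale a jiffies value straight to bictcp_HZ units. */
static uint64_t t_old(uint64_t jiffies, unsigned int hz)
{
        return (jiffies << BICTCP_HZ) / hz;
}

/* Proposed path: jiffies -> msecs -> bictcp_HZ units. */
static uint64_t t_new(uint64_t jiffies, unsigned int hz)
{
        uint64_t msecs = jiffies * (1000 / hz); /* jiffies_to_msecs() for HZ | 1000 */

        return (msecs << BICTCP_HZ) / 1000;
}

int main(void)
{
        unsigned int hz = 250;  /* CONFIG_HZ_250 */

        /* 1024/250 == 4096/1000, so both divisions floor to the same
         * value: the conversion alone does not add resolution to t.
         */
        for (uint64_t j = 0; j < 6; j++)
                printf("t=%llu jiffies: old=%llu new=%llu\n",
                       (unsigned long long)j,
                       (unsigned long long)t_old(j, hz),
                       (unsigned long long)t_new(j, hz));
        return 0;
}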
On Mon, Aug 26, 2024 at 1:27 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Aug 26, 2024 at 3:26 PM Neal Cardwell <ncardwell@google.com> wrote:
> >
> > On Mon, Aug 26, 2024 at 5:27 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > bictcp_update() uses ca->last_time as a timestamp
> > > to decide on several heuristics.
> > >
> > > Historically this timestamp has been fed with jiffies,
> > > which has too coarse a resolution; some distros are
> > > still using CONFIG_HZ_250=y.
> > >
> > > It is time to switch to usec resolution, now that the TCP stack
> > > already caches the high-resolution time in tp->tcp_mstamp.
> > >
> > > Also remove the 'inline' qualifier; this helper is used
> > > once and compilers are smart.
> > >
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > Link: https://lore.kernel.org/netdev/20240817163400.2616134-1-mrzhang97@gmail.com/T/#mb6a64c9e2309eb98eaeeeb4b085c4a2270b6789d
> > > Cc: Mingrui Zhang <mrzhang97@gmail.com>
> > > Cc: Lisong Xu <xu@unl.edu>
> > > ---
> > >  net/ipv4/tcp_cubic.c | 18 ++++++++++--------
> > >  1 file changed, 10 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
> > > index 5dbed91c6178257df8d2ccd1c8690a10bdbaf56a..3b1845103ee1866a316926a130c212e6f5e78ef0 100644
> > > --- a/net/ipv4/tcp_cubic.c
> > > +++ b/net/ipv4/tcp_cubic.c
> > > @@ -87,7 +87,7 @@ struct bictcp {
> > >         u32     cnt;            /* increase cwnd by 1 after ACKs */
> > >         u32     last_max_cwnd;  /* last maximum snd_cwnd */
> > >         u32     last_cwnd;      /* the last snd_cwnd */
> > > -       u32     last_time;      /* time when updated last_cwnd */
> > > +       u32     last_time;      /* time when updated last_cwnd (usec) */
> > >         u32     bic_origin_point;/* origin point of bic function */
> > >         u32     bic_K;          /* time to origin point
> > >                                    from the beginning of the current epoch */
> > > @@ -211,26 +211,28 @@ static u32 cubic_root(u64 a)
> > >  /*
> > >   * Compute congestion window to use.
> > >   */
> > > -static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked)
> > > +static void bictcp_update(struct sock *sk, u32 cwnd, u32 acked)
> > >  {
> > > +       const struct tcp_sock *tp = tcp_sk(sk);
> > > +       struct bictcp *ca = inet_csk_ca(sk);
> > >         u32 delta, bic_target, max_cnt;
> > >         u64 offs, t;
> > >
> > >         ca->ack_cnt += acked;   /* count the number of ACKed packets */
> > >
> > > -       if (ca->last_cwnd == cwnd &&
> > > -           (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32)
> > > +       delta = tp->tcp_mstamp - ca->last_time;
> > > +       if (ca->last_cwnd == cwnd && delta <= USEC_PER_SEC / 32)
> > >                 return;
> > >
> > > -       /* The CUBIC function can update ca->cnt at most once per jiffy.
> > > +       /* The CUBIC function can update ca->cnt at most once per ms.
> > >          * On all cwnd reduction events, ca->epoch_start is set to 0,
> > >          * which will force a recalculation of ca->cnt.
> > >          */
> > > -       if (ca->epoch_start && tcp_jiffies32 == ca->last_time)
> > > +       if (ca->epoch_start && delta < USEC_PER_MSEC)
> > >                 goto tcp_friendliness;
> >
> > AFAICT there is a problem here. It is switching this line of code to
> > use microsecond resolution without also changing the core CUBIC slope
> > (ca->cnt) calculation to use microseconds. AFAICT that means we
> > would be re-introducing the bug that was fixed in 2015 in
> > d6b1a8a92a1417f8859a6937d2e6ffe2dfab4e6d (see below). Basically, if
> > the CUBIC slope (ca->cnt) calculation uses jiffies, then we should
> > only run that code once per jiffy, to avoid getting the wrong answer
> > for the slope:
>
> Interesting....
> would adding the following part deal with this
> problem, or is it something else ?
>
> diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
> index 3b1845103ee1866a316926a130c212e6f5e78ef0..bff5688ba5109fa5a0bbff7dc529525b2752dc46 100644
> --- a/net/ipv4/tcp_cubic.c
> +++ b/net/ipv4/tcp_cubic.c
> @@ -268,9 +268,10 @@ static void bictcp_update(struct sock *sk, u32 cwnd, u32 acked)
>
>         t = (s32)(tcp_jiffies32 - ca->epoch_start);
>         t += usecs_to_jiffies(ca->delay_min);
> -       /* change the unit from HZ to bictcp_HZ */
> +       t = jiffies_to_msecs(t);
> +       /* change the unit from ms to bictcp_HZ */
>         t <<= BICTCP_HZ;
> -       do_div(t, HZ);
> +       do_div(t, MSEC_PER_SEC);
>
>         if (t < ca->bic_K)      /* t - K */
>                 offs = ca->bic_K - t;

I don't think that would be sufficient to take care of the issue.

The issue (addressed in d6b1a8a92a1417f8859a6937d2e6ffe2dfab4e6d) is
that in the CUBIC bictcp_update() computation of bic_target the input
is tcp_jiffies32. That means that the output bic_target will only
change when tcp_jiffies32 increments to a new jiffies value. That
means that if we were to go back to executing the bic_target and
ca->cnt computations more than once per jiffy, the ca->cnt "slope"
value would become increasingly incorrect over the course of each
jiffy, due to the ca->cnt computation looking like:

  ca->cnt = cwnd / (bic_target - cwnd);

...and the fact that cwnd can update for each ACK event, while
bic_target is "stuck" during the course of the jiffy due to the jiffy
granularity.

I guess one approach to trying to avoid this issue would be to change
the initial computation of the "t" variable to be in microseconds and
to increase BICTCP_HZ from 10 to 20, so that the final value of t also
increases roughly once per microsecond. But then I suspect a lot of
code would have to be tweaked to avoid overflows... e.g., AFAICT with
microsecond units the core logic to cube the offs value would
overflow quite often:

  delta = (cube_rtt_scale * offs * offs * offs) >> (10+3*BICTCP_HZ)

IMHO it's safest to just leave last_time in jiffies. :-)

neal
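Neal's overflow estimate can be checked with simple arithmetic. A minimal standalone sketch (assumptions: the kernel default bic_scale of 41, so cube_rtt_scale = bic_scale * 10 = 410; BICTCP_HZ=20 is the hypothetical usec-based constant discussed above; GCC/Clang's unsigned __int128 is used to detect the wrap):

#include <stdio.h>
#include <stdint.h>

/* Does cube_rtt_scale * offs^3 wrap a u64 before the final shift? */
static int cube_overflows(unsigned int bictcp_hz, uint64_t seconds)
{
        uint64_t cube_rtt_scale = 410;          /* bic_scale * 10, default */
        uint64_t offs = seconds << bictcp_hz;   /* offs in 2^bictcp_hz units/sec */
        unsigned __int128 full = (unsigned __int128)cube_rtt_scale
                                        * offs * offs * offs;

        return (uint64_t)(full >> 64) != 0;     /* high bits lost in a u64? */
}

int main(void)
{
        /* Current jiffy-era scaling survives even a long excursion... */
        printf("BICTCP_HZ=10, offs=2s:  overflow=%d\n", cube_overflows(10, 2));
        printf("BICTCP_HZ=10, offs=60s: overflow=%d\n", cube_overflows(10, 60));

        /* ...but with usec-based units, a couple of seconds already wraps:
         * 410 * (2^21)^3 ~= 2^72 > 2^64.
         */
        printf("BICTCP_HZ=20, offs=2s:  overflow=%d\n", cube_overflows(20, 2));
        return 0;
}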
diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
index 5dbed91c6178257df8d2ccd1c8690a10bdbaf56a..3b1845103ee1866a316926a130c212e6f5e78ef0 100644
--- a/net/ipv4/tcp_cubic.c
+++ b/net/ipv4/tcp_cubic.c
@@ -87,7 +87,7 @@ struct bictcp {
         u32     cnt;            /* increase cwnd by 1 after ACKs */
         u32     last_max_cwnd;  /* last maximum snd_cwnd */
         u32     last_cwnd;      /* the last snd_cwnd */
-        u32     last_time;      /* time when updated last_cwnd */
+        u32     last_time;      /* time when updated last_cwnd (usec) */
         u32     bic_origin_point;/* origin point of bic function */
         u32     bic_K;          /* time to origin point
                                    from the beginning of the current epoch */
@@ -211,26 +211,28 @@ static u32 cubic_root(u64 a)
 /*
  * Compute congestion window to use.
  */
-static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked)
+static void bictcp_update(struct sock *sk, u32 cwnd, u32 acked)
 {
+        const struct tcp_sock *tp = tcp_sk(sk);
+        struct bictcp *ca = inet_csk_ca(sk);
         u32 delta, bic_target, max_cnt;
         u64 offs, t;

         ca->ack_cnt += acked;   /* count the number of ACKed packets */

-        if (ca->last_cwnd == cwnd &&
-            (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32)
+        delta = tp->tcp_mstamp - ca->last_time;
+        if (ca->last_cwnd == cwnd && delta <= USEC_PER_SEC / 32)
                 return;

-        /* The CUBIC function can update ca->cnt at most once per jiffy.
+        /* The CUBIC function can update ca->cnt at most once per ms.
          * On all cwnd reduction events, ca->epoch_start is set to 0,
          * which will force a recalculation of ca->cnt.
          */
-        if (ca->epoch_start && tcp_jiffies32 == ca->last_time)
+        if (ca->epoch_start && delta < USEC_PER_MSEC)
                 goto tcp_friendliness;

         ca->last_cwnd = cwnd;
-        ca->last_time = tcp_jiffies32;
+        ca->last_time = tp->tcp_mstamp;

         if (ca->epoch_start == 0) {
                 ca->epoch_start = tcp_jiffies32;        /* record beginning */
@@ -334,7 +336,7 @@ __bpf_kfunc static void cubictcp_cong_avoid(struct sock *sk, u32 ack, u32 acked)
                 if (!acked)
                         return;
         }
-        bictcp_update(ca, tcp_snd_cwnd(tp), acked);
+        bictcp_update(sk, tcp_snd_cwnd(tp), acked);
         tcp_cong_avoid_ai(tp, ca->cnt, acked);
 }
bictcp_update() uses ca->last_time as a timestamp
to decide on several heuristics.

Historically this timestamp has been fed with jiffies,
which has too coarse a resolution; some distros are
still using CONFIG_HZ_250=y.

It is time to switch to usec resolution, now that the TCP stack
already caches the high-resolution time in tp->tcp_mstamp.

Also remove the 'inline' qualifier; this helper is used
once and compilers are smart.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/netdev/20240817163400.2616134-1-mrzhang97@gmail.com/T/#mb6a64c9e2309eb98eaeeeb4b085c4a2270b6789d
Cc: Mingrui Zhang <mrzhang97@gmail.com>
Cc: Lisong Xu <xu@unl.edu>
---
 net/ipv4/tcp_cubic.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)
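The resolution gap this changelog describes can be made concrete with a little arithmetic; a standalone sketch assuming CONFIG_HZ_250=y, the configuration called out above:

#include <stdio.h>

int main(void)
{
        /* With CONFIG_HZ_250 a jiffy is 4000 us, so every jiffies-based
         * threshold in bictcp_update() is quantized to 4 ms steps.
         */
        const unsigned int hz = 250;
        const unsigned int usec_per_sec = 1000000;
        const unsigned int jiffy_us = usec_per_sec / hz;

        printf("jiffy length:                      %u us\n", jiffy_us);
        printf("old ratelimit (HZ / 32):           %u us\n",
               (hz / 32) * jiffy_us);                   /* 7 jiffies = 28 ms */
        printf("new ratelimit (USEC_PER_SEC / 32): %u us\n",
               usec_per_sec / 32);                      /* 31.25 ms */
        printf("old min ca->cnt update period:     %u us\n", jiffy_us);
        printf("new min ca->cnt update period:     %u us\n", 1000u);
        return 0;
}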