Message ID | 20240817163400.2616134-2-mrzhang97@gmail.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | tcp_cubic: fix to achieve at least the same throughput as Reno | expand |
On Sat, Aug 17, 2024 at 6:35 PM Mingrui Zhang <mrzhang97@gmail.com> wrote: > > The original code bypasses bictcp_update() under certain conditions > to reduce the CPU overhead. Intuitively, when last_cwnd==cwnd, > bictcp_update() is executed 32 times per second. As a result, > it is possible that bictcp_update() is not executed for several > RTTs when RTT is short (specifically < 1/32 second = 31 ms and > last_cwnd==cwnd which may happen in small-BDP networks), > thus leading to low throughput in these RTTs. > > The patched code executes bictcp_update() 32 times per second > if RTT > 31 ms or every RTT if RTT < 31 ms, when last_cwnd==cwnd. > > Fixes: df3271f3361b ("[TCP] BIC: CUBIC window growth (2.0)") > Fixes: ac35f562203a ("tcp: bic, cubic: use tcp_jiffies32 instead of tcp_time_stamp") I do not understand this Fixes: tag ? Commit ac35f562203a was essentially a nop at that time... > Signed-off-by: Mingrui Zhang <mrzhang97@gmail.com> > Signed-off-by: Lisong Xu <xu@unl.edu> > --- > v3->v4: Replace min() with min_t() > v2->v3: Correct the "Fixes:" footer content > v1->v2: Separate patches > > net/ipv4/tcp_cubic.c | 6 +++++- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c > index 5dbed91c6178..00da7d592032 100644 > --- a/net/ipv4/tcp_cubic.c > +++ b/net/ipv4/tcp_cubic.c > @@ -218,8 +218,12 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked) > > ca->ack_cnt += acked; /* count the number of ACKed packets */ > > + /* Update 32 times per second if RTT > 1/32 second, > + * or every RTT if RTT < 1/32 second even when last_cwnd == cwnd > + */ > if (ca->last_cwnd == cwnd && > - (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32) > + (s32)(tcp_jiffies32 - ca->last_time) <= > + min_t(s32, HZ / 32, usecs_to_jiffies(ca->delay_min))) This looks convoluted to me and still limited if HZ=250 (some distros still use 250 jiffies per second :/ ) I would suggest switching to usec right away. diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c index 5dbed91c6178257df8d2ccd1c8690a10bdbaf56a..fae000a57bf7d3803c5dd854af64b6933c4e26dd 100644 --- a/net/ipv4/tcp_cubic.c +++ b/net/ipv4/tcp_cubic.c @@ -211,26 +211,27 @@ static u32 cubic_root(u64 a) /* * Compute congestion window to use. */ -static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked) +static inline void bictcp_update(struct tcp_sock *tp, struct bictcp *ca, + u32 cwnd, u32 acked) { u32 delta, bic_target, max_cnt; u64 offs, t; ca->ack_cnt += acked; /* count the number of ACKed packets */ - if (ca->last_cwnd == cwnd && - (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32) + delta = tp->tcp_mstamp - ca->last_time; + if (ca->last_cwnd == cwnd && delta <= ca->delay_min) return; - /* The CUBIC function can update ca->cnt at most once per jiffy. + /* The CUBIC function can update ca->cnt at most once per ms. * On all cwnd reduction events, ca->epoch_start is set to 0, * which will force a recalculation of ca->cnt. */ - if (ca->epoch_start && tcp_jiffies32 == ca->last_time) + if (ca->epoch_start && delta < USEC_PER_MSEC) goto tcp_friendliness; ca->last_cwnd = cwnd; - ca->last_time = tcp_jiffies32; + ca->last_time = tp->tcp_mstamp; if (ca->epoch_start == 0) { ca->epoch_start = tcp_jiffies32; /* record beginning */ @@ -334,7 +335,7 @@ __bpf_kfunc static void cubictcp_cong_avoid(struct sock *sk, u32 ack, u32 acked) if (!acked) return; } - bictcp_update(ca, tcp_snd_cwnd(tp), acked); + bictcp_update(tp, ca, tcp_snd_cwnd(tp), acked); tcp_cong_avoid_ai(tp, ca->cnt, acked); }
On 8/19/24 04:00, Eric Dumazet wrote: > On Sat, Aug 17, 2024 at 6:35 PM Mingrui Zhang <mrzhang97@gmail.com> wrote: >> The original code bypasses bictcp_update() under certain conditions >> to reduce the CPU overhead. Intuitively, when last_cwnd==cwnd, >> bictcp_update() is executed 32 times per second. As a result, >> it is possible that bictcp_update() is not executed for several >> RTTs when RTT is short (specifically < 1/32 second = 31 ms and >> last_cwnd==cwnd which may happen in small-BDP networks), >> thus leading to low throughput in these RTTs. >> >> The patched code executes bictcp_update() 32 times per second >> if RTT > 31 ms or every RTT if RTT < 31 ms, when last_cwnd==cwnd. >> >> Fixes: df3271f3361b ("[TCP] BIC: CUBIC window growth (2.0)") >> Fixes: ac35f562203a ("tcp: bic, cubic: use tcp_jiffies32 instead of tcp_time_stamp") > I do not understand this Fixes: tag ? > > Commit ac35f562203a was essentially a nop at that time... > I may misunderstood the use of Fixes tag and choose the latest commit of that line. Shall it supposed to be the very first commit with that behavior? That is, the very first commit (df3271f3361b ("[TCP] BIC: CUBIC window growth (2.0)")) when the code was first introduced? >> Signed-off-by: Mingrui Zhang <mrzhang97@gmail.com> >> Signed-off-by: Lisong Xu <xu@unl.edu> >> --- >> v3->v4: Replace min() with min_t() >> v2->v3: Correct the "Fixes:" footer content >> v1->v2: Separate patches >> >> net/ipv4/tcp_cubic.c | 6 +++++- >> 1 file changed, 5 insertions(+), 1 deletion(-) >> >> diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c >> index 5dbed91c6178..00da7d592032 100644 >> --- a/net/ipv4/tcp_cubic.c >> +++ b/net/ipv4/tcp_cubic.c >> @@ -218,8 +218,12 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked) >> >> ca->ack_cnt += acked; /* count the number of ACKed packets */ >> >> + /* Update 32 times per second if RTT > 1/32 second, >> + * or every RTT if RTT < 1/32 second even when last_cwnd == cwnd >> + */ >> if (ca->last_cwnd == cwnd && >> - (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32) >> + (s32)(tcp_jiffies32 - ca->last_time) <= >> + min_t(s32, HZ / 32, usecs_to_jiffies(ca->delay_min))) > This looks convoluted to me and still limited if HZ=250 (some distros > still use 250 jiffies per second :/ ) > > I would suggest switching to usec right away. Thank you for the suggestion, however, I may need more time to discuss with another author for this revision. :) Thank you > > diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c > index 5dbed91c6178257df8d2ccd1c8690a10bdbaf56a..fae000a57bf7d3803c5dd854af64b6933c4e26dd > 100644 > --- a/net/ipv4/tcp_cubic.c > +++ b/net/ipv4/tcp_cubic.c > @@ -211,26 +211,27 @@ static u32 cubic_root(u64 a) > /* > * Compute congestion window to use. > */ > -static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked) > +static inline void bictcp_update(struct tcp_sock *tp, struct bictcp *ca, > + u32 cwnd, u32 acked) > { > u32 delta, bic_target, max_cnt; > u64 offs, t; > > ca->ack_cnt += acked; /* count the number of ACKed packets */ > > - if (ca->last_cwnd == cwnd && > - (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32) > + delta = tp->tcp_mstamp - ca->last_time; > + if (ca->last_cwnd == cwnd && delta <= ca->delay_min) > return; > > - /* The CUBIC function can update ca->cnt at most once per jiffy. > + /* The CUBIC function can update ca->cnt at most once per ms. > * On all cwnd reduction events, ca->epoch_start is set to 0, > * which will force a recalculation of ca->cnt. > */ > - if (ca->epoch_start && tcp_jiffies32 == ca->last_time) > + if (ca->epoch_start && delta < USEC_PER_MSEC) > goto tcp_friendliness; > > ca->last_cwnd = cwnd; > - ca->last_time = tcp_jiffies32; > + ca->last_time = tp->tcp_mstamp; > > if (ca->epoch_start == 0) { > ca->epoch_start = tcp_jiffies32; /* record beginning */ > @@ -334,7 +335,7 @@ __bpf_kfunc static void cubictcp_cong_avoid(struct > sock *sk, u32 ack, u32 acked) > if (!acked) > return; > } > - bictcp_update(ca, tcp_snd_cwnd(tp), acked); > + bictcp_update(tp, ca, tcp_snd_cwnd(tp), acked); > tcp_cong_avoid_ai(tp, ca->cnt, acked); > }
On Mon, Aug 19, 2024 at 10:36 PM Mingrui Zhang <mrzhang97@gmail.com> wrote: > > On 8/19/24 04:00, Eric Dumazet wrote: > > On Sat, Aug 17, 2024 at 6:35 PM Mingrui Zhang <mrzhang97@gmail.com> wrote: > >> The original code bypasses bictcp_update() under certain conditions > >> to reduce the CPU overhead. Intuitively, when last_cwnd==cwnd, > >> bictcp_update() is executed 32 times per second. As a result, > >> it is possible that bictcp_update() is not executed for several > >> RTTs when RTT is short (specifically < 1/32 second = 31 ms and > >> last_cwnd==cwnd which may happen in small-BDP networks), > >> thus leading to low throughput in these RTTs. > >> > >> The patched code executes bictcp_update() 32 times per second > >> if RTT > 31 ms or every RTT if RTT < 31 ms, when last_cwnd==cwnd. > >> > >> Fixes: df3271f3361b ("[TCP] BIC: CUBIC window growth (2.0)") > >> Fixes: ac35f562203a ("tcp: bic, cubic: use tcp_jiffies32 instead of tcp_time_stamp") > > I do not understand this Fixes: tag ? > > > > Commit ac35f562203a was essentially a nop at that time... > > > I may misunderstood the use of Fixes tag and choose the latest commit of that line. > > Shall it supposed to be the very first commit with that behavior? > That is, the very first commit (df3271f3361b ("[TCP] BIC: CUBIC window growth (2.0)")) when the code was first introduced? I was referring to this line : Fixes: ac35f562203a ("tcp: bic, cubic: use tcp_jiffies32 instead of tcp_time_stamp") Commit ac35f562203a did not change the behavior at all. I see no particular reason to mention it, this is confusing. > >> Signed-off-by: Mingrui Zhang <mrzhang97@gmail.com> > >> Signed-off-by: Lisong Xu <xu@unl.edu> > >> --- > >> v3->v4: Replace min() with min_t() > >> v2->v3: Correct the "Fixes:" footer content > >> v1->v2: Separate patches > >> > >> net/ipv4/tcp_cubic.c | 6 +++++- > >> 1 file changed, 5 insertions(+), 1 deletion(-) > >> > >> diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c > >> index 5dbed91c6178..00da7d592032 100644 > >> --- a/net/ipv4/tcp_cubic.c > >> +++ b/net/ipv4/tcp_cubic.c > >> @@ -218,8 +218,12 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked) > >> > >> ca->ack_cnt += acked; /* count the number of ACKed packets */ > >> > >> + /* Update 32 times per second if RTT > 1/32 second, > >> + * or every RTT if RTT < 1/32 second even when last_cwnd == cwnd > >> + */ > >> if (ca->last_cwnd == cwnd && > >> - (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32) > >> + (s32)(tcp_jiffies32 - ca->last_time) <= > >> + min_t(s32, HZ / 32, usecs_to_jiffies(ca->delay_min))) > > This looks convoluted to me and still limited if HZ=250 (some distros > > still use 250 jiffies per second :/ ) > > > > I would suggest switching to usec right away. > Thank you for the suggestion, however, I may need more time to discuss with another author for this revision. :) > Thank you No problem, there is no hurry.
On 8/20/24 07:53, Eric Dumazet wrote: > On Mon, Aug 19, 2024 at 10:36 PM Mingrui Zhang <mrzhang97@gmail.com> wrote: >> On 8/19/24 04:00, Eric Dumazet wrote: >>> On Sat, Aug 17, 2024 at 6:35 PM Mingrui Zhang <mrzhang97@gmail.com> wrote: >>>> The original code bypasses bictcp_update() under certain conditions >>>> to reduce the CPU overhead. Intuitively, when last_cwnd==cwnd, >>>> bictcp_update() is executed 32 times per second. As a result, >>>> it is possible that bictcp_update() is not executed for several >>>> RTTs when RTT is short (specifically < 1/32 second = 31 ms and >>>> last_cwnd==cwnd which may happen in small-BDP networks), >>>> thus leading to low throughput in these RTTs. >>>> >>>> The patched code executes bictcp_update() 32 times per second >>>> if RTT > 31 ms or every RTT if RTT < 31 ms, when last_cwnd==cwnd. >>>> >>>> Fixes: df3271f3361b ("[TCP] BIC: CUBIC window growth (2.0)") >>>> Fixes: ac35f562203a ("tcp: bic, cubic: use tcp_jiffies32 instead of tcp_time_stamp") >>> I do not understand this Fixes: tag ? >>> >>> Commit ac35f562203a was essentially a nop at that time... >>> >> I may misunderstood the use of Fixes tag and choose the latest commit of that line. >> >> Shall it supposed to be the very first commit with that behavior? >> That is, the very first commit (df3271f3361b ("[TCP] BIC: CUBIC window growth (2.0)")) when the code was first introduced? > I was referring to this line : Fixes: ac35f562203a ("tcp: bic, cubic: > use tcp_jiffies32 instead of tcp_time_stamp") > > Commit ac35f562203a did not change the behavior at all. > > I see no particular reason to mention it, this is confusing. > > >>>> Signed-off-by: Mingrui Zhang <mrzhang97@gmail.com> >>>> Signed-off-by: Lisong Xu <xu@unl.edu> >>>> --- >>>> v3->v4: Replace min() with min_t() >>>> v2->v3: Correct the "Fixes:" footer content >>>> v1->v2: Separate patches >>>> >>>> net/ipv4/tcp_cubic.c | 6 +++++- >>>> 1 file changed, 5 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c >>>> index 5dbed91c6178..00da7d592032 100644 >>>> --- a/net/ipv4/tcp_cubic.c >>>> +++ b/net/ipv4/tcp_cubic.c >>>> @@ -218,8 +218,12 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked) >>>> >>>> ca->ack_cnt += acked; /* count the number of ACKed packets */ >>>> >>>> + /* Update 32 times per second if RTT > 1/32 second, >>>> + * or every RTT if RTT < 1/32 second even when last_cwnd == cwnd >>>> + */ >>>> if (ca->last_cwnd == cwnd && >>>> - (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32) >>>> + (s32)(tcp_jiffies32 - ca->last_time) <= >>>> + min_t(s32, HZ / 32, usecs_to_jiffies(ca->delay_min))) >>> This looks convoluted to me and still limited if HZ=250 (some distros >>> still use 250 jiffies per second :/ ) >>> >>> I would suggest switching to usec right away. >> Thank you for the suggestion, however, I may need more time to discuss with another author for this revision. :) >> Thank you > No problem, there is no hurry. Thank you, Eric, for your suggestion (switching ca->last_time from jiffies to usec)! We thought about it and feel that it is more complicated and beyond the scope of this patch. There are two blocks of code in bictcp_update(). * Block 1: cubic calculation, which is computationally intensive. * Block 2: tcp friendliness, which emulates RENO. There are two if statements to control how often these two blocks are called to reduce CPU overhead. * If statement 1: if the condition is true, none of the two blocks are called. if (ca->last_cwnd == cwnd && (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32) return; * If statement 2: If the condition is true, block 1 is not called. Intuitively, block 1 is called at most once per jiffy. if (ca->epoch_start && tcp_jiffies32 == ca->last_time) goto tcp_friendliness; This patch changes only the first if statement. If we switch ca->last_time from jiffies to usec, we need to change not only the first if statement but also the second if statement, as well as block 1. * change the first if statement from jiffies to usec. * change the second if statement from jiffies to usec. Need to determine how often (in usec) block 1 is called * change block 1 from jiffies to usec. Should be fine, but need to make sure no calculation overflow. Therefore, it might be better to keep the current patch as it is, and address the switch from jiffies to usec in future patches. Thank you! Lisong (Mingrui Send On Behalf of)
On Sun, Aug 25, 2024 at 7:47 PM Mingrui Zhang <mrzhang97@gmail.com> wrote: > > On 8/20/24 07:53, Eric Dumazet wrote: > > On Mon, Aug 19, 2024 at 10:36 PM Mingrui Zhang <mrzhang97@gmail.com> wrote: > >> On 8/19/24 04:00, Eric Dumazet wrote: > >>> On Sat, Aug 17, 2024 at 6:35 PM Mingrui Zhang <mrzhang97@gmail.com> wrote: > >>>> The original code bypasses bictcp_update() under certain conditions > >>>> to reduce the CPU overhead. Intuitively, when last_cwnd==cwnd, > >>>> bictcp_update() is executed 32 times per second. As a result, > >>>> it is possible that bictcp_update() is not executed for several > >>>> RTTs when RTT is short (specifically < 1/32 second = 31 ms and > >>>> last_cwnd==cwnd which may happen in small-BDP networks), > >>>> thus leading to low throughput in these RTTs. > >>>> > >>>> The patched code executes bictcp_update() 32 times per second > >>>> if RTT > 31 ms or every RTT if RTT < 31 ms, when last_cwnd==cwnd. > >>>> > >>>> Fixes: df3271f3361b ("[TCP] BIC: CUBIC window growth (2.0)") > >>>> Fixes: ac35f562203a ("tcp: bic, cubic: use tcp_jiffies32 instead of tcp_time_stamp") > >>> I do not understand this Fixes: tag ? > >>> > >>> Commit ac35f562203a was essentially a nop at that time... > >>> > >> I may misunderstood the use of Fixes tag and choose the latest commit of that line. > >> > >> Shall it supposed to be the very first commit with that behavior? > >> That is, the very first commit (df3271f3361b ("[TCP] BIC: CUBIC window growth (2.0)")) when the code was first introduced? > > I was referring to this line : Fixes: ac35f562203a ("tcp: bic, cubic: > > use tcp_jiffies32 instead of tcp_time_stamp") > > > > Commit ac35f562203a did not change the behavior at all. > > > > I see no particular reason to mention it, this is confusing. > > > > > >>>> Signed-off-by: Mingrui Zhang <mrzhang97@gmail.com> > >>>> Signed-off-by: Lisong Xu <xu@unl.edu> > >>>> --- > >>>> v3->v4: Replace min() with min_t() > >>>> v2->v3: Correct the "Fixes:" footer content > >>>> v1->v2: Separate patches > >>>> > >>>> net/ipv4/tcp_cubic.c | 6 +++++- > >>>> 1 file changed, 5 insertions(+), 1 deletion(-) > >>>> > >>>> diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c > >>>> index 5dbed91c6178..00da7d592032 100644 > >>>> --- a/net/ipv4/tcp_cubic.c > >>>> +++ b/net/ipv4/tcp_cubic.c > >>>> @@ -218,8 +218,12 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked) > >>>> > >>>> ca->ack_cnt += acked; /* count the number of ACKed packets */ > >>>> > >>>> + /* Update 32 times per second if RTT > 1/32 second, > >>>> + * or every RTT if RTT < 1/32 second even when last_cwnd == cwnd > >>>> + */ > >>>> if (ca->last_cwnd == cwnd && > >>>> - (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32) > >>>> + (s32)(tcp_jiffies32 - ca->last_time) <= > >>>> + min_t(s32, HZ / 32, usecs_to_jiffies(ca->delay_min))) > >>> This looks convoluted to me and still limited if HZ=250 (some distros > >>> still use 250 jiffies per second :/ ) > >>> > >>> I would suggest switching to usec right away. > >> Thank you for the suggestion, however, I may need more time to discuss with another author for this revision. :) > >> Thank you > > No problem, there is no hurry. > > Thank you, Eric, for your suggestion (switching ca->last_time from jiffies to usec)! > We thought about it and feel that it is more complicated and beyond the scope of this patch. > > There are two blocks of code in bictcp_update(). > * Block 1: cubic calculation, which is computationally intensive. > * Block 2: tcp friendliness, which emulates RENO. > > There are two if statements to control how often these two blocks are called to reduce CPU overhead. > * If statement 1: if the condition is true, none of the two blocks are called. > if (ca->last_cwnd == cwnd && > (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32) > return; > > * If statement 2: If the condition is true, block 1 is not called. Intuitively, block 1 is called at most once per jiffy. > if (ca->epoch_start && tcp_jiffies32 == ca->last_time) > goto tcp_friendliness; > > > This patch changes only the first if statement. If we switch ca->last_time from jiffies to usec, > we need to change not only the first if statement but also the second if statement, as well as block 1. > * change the first if statement from jiffies to usec. > * change the second if statement from jiffies to usec. Need to determine how often (in usec) block 1 is called > * change block 1 from jiffies to usec. Should be fine, but need to make sure no calculation overflow. No problem, I can take care of the jiffies -> usec conversion, you can then send your patch on top of it. > > Therefore, it might be better to keep the current patch as it is, and address the switch from jiffies to usec in future patches. I prefer you rebase your patch after mine is merged. There is a common misconception with jiffies. It can change in less than 20 nsec. Assuming that delta(jiffies) == 1 means that 1ms has elapsed is plain wrong. In the old days, linux TCP only could rely on jiffies and we had to accept its limits. We now can switch to high resolution clocks, without extra costs, because we already cache in tcp->tcp_mstamp the usec timestamp for the current time. Some distros are using CONFIG_HZ_250=y or CONFIG_HZ_100=y, this means current logic in cubic is more fuzzy for them. Without ca->last_time conversion to jiffies, your patch would still be limited to jiffies resolution: usecs_to_jiffies(ca->delay_min) would round up to much bigger values for DC communications.
On Mon, Aug 26, 2024 at 5:26 AM Eric Dumazet <edumazet@google.com> wrote: ... > I prefer you rebase your patch after mine is merged. > > There is a common misconception with jiffies. > It can change in less than 20 nsec. > Assuming that delta(jiffies) == 1 means that 1ms has elapsed is plain wrong. > In the old days, linux TCP only could rely on jiffies and we had to > accept its limits. > We now can switch to high resolution clocks, without extra costs, > because we already cache in tcp->tcp_mstamp > the usec timestamp for the current time. > > Some distros are using CONFIG_HZ_250=y or CONFIG_HZ_100=y, this means > current logic in cubic is more fuzzy for them. > > Without ca->last_time conversion to jiffies, your patch would still be > limited to jiffies resolution: > usecs_to_jiffies(ca->delay_min) would round up to much bigger values > for DC communications. Even given Eric's excellent point that is raised above, that an increase of jiffies by one can happen even though only O(us) or less may have elapsed, AFAICT the patch should be fine in practice. The patch says: + /* Update 32 times per second if RTT > 1/32 second, + * or every RTT if RTT < 1/32 second even when last_cwnd == cwnd + */ if (ca->last_cwnd == cwnd && - (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32) + (s32)(tcp_jiffies32 - ca->last_time) <= + min_t(s32, HZ / 32, usecs_to_jiffies(ca->delay_min))) return; So, basically, we only run fall through and try to run the core of bictcp_update() if cwnd has increased since ca-> last_cwnd, or tcp_jiffies32 has increased by more than min_t(s32, HZ / 32, usecs_to_jiffies(ca->delay_min)) since ca->last_time. AFAICT this works out OK because the logic is looking for "more than" and usecs_to_jiffies() rounds up. That means that in the interesting/tricky/common case where ca->delay_min is less than a jiffy, usecs_to_jiffies(ca->delay_min) will return 1 jiffy. That means that in this case we will only fall through and try to run the core of bictcp_update() if cwnd has increased since ca-> last_cwnd, or tcp_jiffies32 has increased by more than 1 jiffy (i.e., 2 or more jiffies). AFAICT the fact that this check is triggering only if tcp_jiffies32 has increased by 2 or more means that at least one full jiffy has elapsed between when we set ca->last_time and the time when this check triggers running the core of bictcp_update(). So AFAICT this logic is not tricked by the fact that a single increment of tcp_jiffies32 can happen over O(us) or less. At first glance it may sound like if the RTT is much less than a jiffy, many RTTs could elapse before we run the core of bictcp_update(). However, AFAIK if the RTT is much less than a jiffy then CUBIC is very likely in Reno mode, and so is very likely to increase cwnd by roughly 1 packet per round trip (the behavior of Reno), so the (ca->last_cwnd == cwnd) condition should fail roughly once per round trip and allow recomputation of the ca->cnt slope. So AFAICT this patch should be OK in practice. Given those considerations, Eric, do you think it would be OK to accept the patch as-is? Thanks! neal
On Wed, Aug 28, 2024 at 10:32 PM Neal Cardwell <ncardwell@google.com> wrote: > > On Mon, Aug 26, 2024 at 5:26 AM Eric Dumazet <edumazet@google.com> wrote: > ... > > I prefer you rebase your patch after mine is merged. > > > > There is a common misconception with jiffies. > > It can change in less than 20 nsec. > > Assuming that delta(jiffies) == 1 means that 1ms has elapsed is plain wrong. > > In the old days, linux TCP only could rely on jiffies and we had to > > accept its limits. > > We now can switch to high resolution clocks, without extra costs, > > because we already cache in tcp->tcp_mstamp > > the usec timestamp for the current time. > > > > Some distros are using CONFIG_HZ_250=y or CONFIG_HZ_100=y, this means > > current logic in cubic is more fuzzy for them. > > > > Without ca->last_time conversion to jiffies, your patch would still be > > limited to jiffies resolution: > > usecs_to_jiffies(ca->delay_min) would round up to much bigger values > > for DC communications. > > Even given Eric's excellent point that is raised above, that an > increase of jiffies by one can happen even though only O(us) or less > may have elapsed, AFAICT the patch should be fine in practice. > > The patch says: > > + /* Update 32 times per second if RTT > 1/32 second, > + * or every RTT if RTT < 1/32 second even when last_cwnd == cwnd > + */ > if (ca->last_cwnd == cwnd && > - (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32) > + (s32)(tcp_jiffies32 - ca->last_time) <= > + min_t(s32, HZ / 32, usecs_to_jiffies(ca->delay_min))) > return; > > So, basically, we only run fall through and try to run the core of > bictcp_update() if cwnd has increased since ca-> last_cwnd, or > tcp_jiffies32 has increased by more than > min_t(s32, HZ / 32, usecs_to_jiffies(ca->delay_min)) since ca->last_time. > > AFAICT this works out OK because the logic is looking for "more than" > and usecs_to_jiffies() rounds up. That means that in the > interesting/tricky/common case where ca->delay_min is less than a > jiffy, usecs_to_jiffies(ca->delay_min) will return 1 jiffy. That means > that in this case we will only fall through and try to run the core of > bictcp_update() if cwnd has increased since ca-> last_cwnd, or > tcp_jiffies32 has increased by more than 1 jiffy (i.e., 2 or more > jiffies). > > AFAICT the fact that this check is triggering only if tcp_jiffies32 > has increased by 2 or more means that at least one full jiffy has > elapsed between when we set ca->last_time and the time when this check > triggers running the core of bictcp_update(). > > So AFAICT this logic is not tricked by the fact that a single > increment of tcp_jiffies32 can happen over O(us) or less. > > At first glance it may sound like if the RTT is much less than a > jiffy, many RTTs could elapse before we run the core of > bictcp_update(). However, AFAIK if the RTT is much less than a jiffy > then CUBIC is very likely in Reno mode, and so is very likely to > increase cwnd by roughly 1 packet per round trip (the behavior of > Reno), so the (ca->last_cwnd == cwnd) condition should fail roughly > once per round trip and allow recomputation of the ca->cnt slope. > > So AFAICT this patch should be OK in practice. > > Given those considerations, Eric, do you think it would be OK to > accept the patch as-is? > Ok, what about updating net/ipv4/tcp_bic.c at the same time ?
diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c index 5dbed91c6178..00da7d592032 100644 --- a/net/ipv4/tcp_cubic.c +++ b/net/ipv4/tcp_cubic.c @@ -218,8 +218,12 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked) ca->ack_cnt += acked; /* count the number of ACKed packets */ + /* Update 32 times per second if RTT > 1/32 second, + * or every RTT if RTT < 1/32 second even when last_cwnd == cwnd + */ if (ca->last_cwnd == cwnd && - (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32) + (s32)(tcp_jiffies32 - ca->last_time) <= + min_t(s32, HZ / 32, usecs_to_jiffies(ca->delay_min))) return; /* The CUBIC function can update ca->cnt at most once per jiffy.