diff mbox series

[net,v4,1/3] tcp_cubic: fix to run bictcp_update() at least once per RTT

Message ID 20240817163400.2616134-2-mrzhang97@gmail.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series tcp_cubic: fix to achieve at least the same throughput as Reno | expand

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 16 this patch: 16
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers fail 2 blamed authors not CCed: soheil@google.com stephen@networkplumber.org; 5 maintainers not CCed: pabeni@redhat.com kuba@kernel.org stephen@networkplumber.org soheil@google.com dsahern@kernel.org
netdev/build_clang success Errors and warnings before: 16 this patch: 16
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 16 this patch: 16
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 13 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-08-17--18-00 (tests: 709)

Commit Message

Mingrui Zhang Aug. 17, 2024, 4:33 p.m. UTC
The original code bypasses bictcp_update() under certain conditions
to reduce the CPU overhead. Intuitively, when last_cwnd==cwnd,
bictcp_update() is executed 32 times per second. As a result, 
it is possible that bictcp_update() is not executed for several 
RTTs when RTT is short (specifically < 1/32 second = 31 ms and 
last_cwnd==cwnd which may happen in small-BDP networks), 
thus leading to low throughput in these RTTs.

The patched code executes bictcp_update() 32 times per second
if RTT > 31 ms or every RTT if RTT < 31 ms, when last_cwnd==cwnd.

Fixes: df3271f3361b ("[TCP] BIC: CUBIC window growth (2.0)")
Fixes: ac35f562203a ("tcp: bic, cubic: use tcp_jiffies32 instead of tcp_time_stamp")
Signed-off-by: Mingrui Zhang <mrzhang97@gmail.com>
Signed-off-by: Lisong Xu <xu@unl.edu>
---
v3->v4: Replace min() with min_t()
v2->v3: Correct the "Fixes:" footer content
v1->v2: Separate patches

 net/ipv4/tcp_cubic.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Comments

Eric Dumazet Aug. 19, 2024, 9 a.m. UTC | #1
On Sat, Aug 17, 2024 at 6:35 PM Mingrui Zhang <mrzhang97@gmail.com> wrote:
>
> The original code bypasses bictcp_update() under certain conditions
> to reduce the CPU overhead. Intuitively, when last_cwnd==cwnd,
> bictcp_update() is executed 32 times per second. As a result,
> it is possible that bictcp_update() is not executed for several
> RTTs when RTT is short (specifically < 1/32 second = 31 ms and
> last_cwnd==cwnd which may happen in small-BDP networks),
> thus leading to low throughput in these RTTs.
>
> The patched code executes bictcp_update() 32 times per second
> if RTT > 31 ms or every RTT if RTT < 31 ms, when last_cwnd==cwnd.
>
> Fixes: df3271f3361b ("[TCP] BIC: CUBIC window growth (2.0)")
> Fixes: ac35f562203a ("tcp: bic, cubic: use tcp_jiffies32 instead of tcp_time_stamp")

I do not understand this Fixes: tag ?

Commit  ac35f562203a was essentially a nop at that time...


> Signed-off-by: Mingrui Zhang <mrzhang97@gmail.com>
> Signed-off-by: Lisong Xu <xu@unl.edu>
> ---
> v3->v4: Replace min() with min_t()
> v2->v3: Correct the "Fixes:" footer content
> v1->v2: Separate patches
>
>  net/ipv4/tcp_cubic.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
> index 5dbed91c6178..00da7d592032 100644
> --- a/net/ipv4/tcp_cubic.c
> +++ b/net/ipv4/tcp_cubic.c
> @@ -218,8 +218,12 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked)
>
>         ca->ack_cnt += acked;   /* count the number of ACKed packets */
>
> +       /* Update 32 times per second if RTT > 1/32 second,
> +        * or every RTT if RTT < 1/32 second even when last_cwnd == cwnd
> +        */
>         if (ca->last_cwnd == cwnd &&
> -           (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32)
> +           (s32)(tcp_jiffies32 - ca->last_time) <=
> +           min_t(s32, HZ / 32, usecs_to_jiffies(ca->delay_min)))

This looks convoluted to me and still limited if HZ=250 (some distros
still use 250 jiffies per second :/ )

I would suggest switching to usec right away.


diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
index 5dbed91c6178257df8d2ccd1c8690a10bdbaf56a..fae000a57bf7d3803c5dd854af64b6933c4e26dd
100644
--- a/net/ipv4/tcp_cubic.c
+++ b/net/ipv4/tcp_cubic.c
@@ -211,26 +211,27 @@ static u32 cubic_root(u64 a)
 /*
  * Compute congestion window to use.
  */
-static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked)
+static inline void bictcp_update(struct tcp_sock *tp, struct bictcp *ca,
+                                u32 cwnd, u32 acked)
 {
        u32 delta, bic_target, max_cnt;
        u64 offs, t;

        ca->ack_cnt += acked;   /* count the number of ACKed packets */

-       if (ca->last_cwnd == cwnd &&
-           (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32)
+       delta = tp->tcp_mstamp - ca->last_time;
+       if (ca->last_cwnd == cwnd && delta <= ca->delay_min)
                return;

-       /* The CUBIC function can update ca->cnt at most once per jiffy.
+       /* The CUBIC function can update ca->cnt at most once per ms.
         * On all cwnd reduction events, ca->epoch_start is set to 0,
         * which will force a recalculation of ca->cnt.
         */
-       if (ca->epoch_start && tcp_jiffies32 == ca->last_time)
+       if (ca->epoch_start && delta < USEC_PER_MSEC)
                goto tcp_friendliness;

        ca->last_cwnd = cwnd;
-       ca->last_time = tcp_jiffies32;
+       ca->last_time = tp->tcp_mstamp;

        if (ca->epoch_start == 0) {
                ca->epoch_start = tcp_jiffies32;        /* record beginning */
@@ -334,7 +335,7 @@ __bpf_kfunc static void cubictcp_cong_avoid(struct
sock *sk, u32 ack, u32 acked)
                if (!acked)
                        return;
        }
-       bictcp_update(ca, tcp_snd_cwnd(tp), acked);
+       bictcp_update(tp, ca, tcp_snd_cwnd(tp), acked);
        tcp_cong_avoid_ai(tp, ca->cnt, acked);
 }
Mingrui Zhang Aug. 19, 2024, 8:36 p.m. UTC | #2
On 8/19/24 04:00, Eric Dumazet wrote:
> On Sat, Aug 17, 2024 at 6:35 PM Mingrui Zhang <mrzhang97@gmail.com> wrote:
>> The original code bypasses bictcp_update() under certain conditions
>> to reduce the CPU overhead. Intuitively, when last_cwnd==cwnd,
>> bictcp_update() is executed 32 times per second. As a result,
>> it is possible that bictcp_update() is not executed for several
>> RTTs when RTT is short (specifically < 1/32 second = 31 ms and
>> last_cwnd==cwnd which may happen in small-BDP networks),
>> thus leading to low throughput in these RTTs.
>>
>> The patched code executes bictcp_update() 32 times per second
>> if RTT > 31 ms or every RTT if RTT < 31 ms, when last_cwnd==cwnd.
>>
>> Fixes: df3271f3361b ("[TCP] BIC: CUBIC window growth (2.0)")
>> Fixes: ac35f562203a ("tcp: bic, cubic: use tcp_jiffies32 instead of tcp_time_stamp")
> I do not understand this Fixes: tag ?
>
> Commit  ac35f562203a was essentially a nop at that time...
>
I may misunderstood the use of Fixes tag and choose the latest commit of that line.
 
Shall it supposed to be the very first commit with that behavior?
That is, the very first commit (df3271f3361b ("[TCP] BIC: CUBIC window growth (2.0)")) when the code was first introduced?
>> Signed-off-by: Mingrui Zhang <mrzhang97@gmail.com>
>> Signed-off-by: Lisong Xu <xu@unl.edu>
>> ---
>> v3->v4: Replace min() with min_t()
>> v2->v3: Correct the "Fixes:" footer content
>> v1->v2: Separate patches
>>
>>  net/ipv4/tcp_cubic.c | 6 +++++-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
>> index 5dbed91c6178..00da7d592032 100644
>> --- a/net/ipv4/tcp_cubic.c
>> +++ b/net/ipv4/tcp_cubic.c
>> @@ -218,8 +218,12 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked)
>>
>>         ca->ack_cnt += acked;   /* count the number of ACKed packets */
>>
>> +       /* Update 32 times per second if RTT > 1/32 second,
>> +        * or every RTT if RTT < 1/32 second even when last_cwnd == cwnd
>> +        */
>>         if (ca->last_cwnd == cwnd &&
>> -           (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32)
>> +           (s32)(tcp_jiffies32 - ca->last_time) <=
>> +           min_t(s32, HZ / 32, usecs_to_jiffies(ca->delay_min)))
> This looks convoluted to me and still limited if HZ=250 (some distros
> still use 250 jiffies per second :/ )
>
> I would suggest switching to usec right away.
Thank you for the suggestion, however, I may need more time to discuss with another author for this revision. :)
Thank you
>
> diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
> index 5dbed91c6178257df8d2ccd1c8690a10bdbaf56a..fae000a57bf7d3803c5dd854af64b6933c4e26dd
> 100644
> --- a/net/ipv4/tcp_cubic.c
> +++ b/net/ipv4/tcp_cubic.c
> @@ -211,26 +211,27 @@ static u32 cubic_root(u64 a)
>  /*
>   * Compute congestion window to use.
>   */
> -static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked)
> +static inline void bictcp_update(struct tcp_sock *tp, struct bictcp *ca,
> +                                u32 cwnd, u32 acked)
>  {
>         u32 delta, bic_target, max_cnt;
>         u64 offs, t;
>
>         ca->ack_cnt += acked;   /* count the number of ACKed packets */
>
> -       if (ca->last_cwnd == cwnd &&
> -           (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32)
> +       delta = tp->tcp_mstamp - ca->last_time;
> +       if (ca->last_cwnd == cwnd && delta <= ca->delay_min)
>                 return;
>
> -       /* The CUBIC function can update ca->cnt at most once per jiffy.
> +       /* The CUBIC function can update ca->cnt at most once per ms.
>          * On all cwnd reduction events, ca->epoch_start is set to 0,
>          * which will force a recalculation of ca->cnt.
>          */
> -       if (ca->epoch_start && tcp_jiffies32 == ca->last_time)
> +       if (ca->epoch_start && delta < USEC_PER_MSEC)
>                 goto tcp_friendliness;
>
>         ca->last_cwnd = cwnd;
> -       ca->last_time = tcp_jiffies32;
> +       ca->last_time = tp->tcp_mstamp;
>
>         if (ca->epoch_start == 0) {
>                 ca->epoch_start = tcp_jiffies32;        /* record beginning */
> @@ -334,7 +335,7 @@ __bpf_kfunc static void cubictcp_cong_avoid(struct
> sock *sk, u32 ack, u32 acked)
>                 if (!acked)
>                         return;
>         }
> -       bictcp_update(ca, tcp_snd_cwnd(tp), acked);
> +       bictcp_update(tp, ca, tcp_snd_cwnd(tp), acked);
>         tcp_cong_avoid_ai(tp, ca->cnt, acked);
>  }
Eric Dumazet Aug. 20, 2024, 12:53 p.m. UTC | #3
On Mon, Aug 19, 2024 at 10:36 PM Mingrui Zhang <mrzhang97@gmail.com> wrote:
>
> On 8/19/24 04:00, Eric Dumazet wrote:
> > On Sat, Aug 17, 2024 at 6:35 PM Mingrui Zhang <mrzhang97@gmail.com> wrote:
> >> The original code bypasses bictcp_update() under certain conditions
> >> to reduce the CPU overhead. Intuitively, when last_cwnd==cwnd,
> >> bictcp_update() is executed 32 times per second. As a result,
> >> it is possible that bictcp_update() is not executed for several
> >> RTTs when RTT is short (specifically < 1/32 second = 31 ms and
> >> last_cwnd==cwnd which may happen in small-BDP networks),
> >> thus leading to low throughput in these RTTs.
> >>
> >> The patched code executes bictcp_update() 32 times per second
> >> if RTT > 31 ms or every RTT if RTT < 31 ms, when last_cwnd==cwnd.
> >>
> >> Fixes: df3271f3361b ("[TCP] BIC: CUBIC window growth (2.0)")
> >> Fixes: ac35f562203a ("tcp: bic, cubic: use tcp_jiffies32 instead of tcp_time_stamp")
> > I do not understand this Fixes: tag ?
> >
> > Commit  ac35f562203a was essentially a nop at that time...
> >
> I may misunderstood the use of Fixes tag and choose the latest commit of that line.
>
> Shall it supposed to be the very first commit with that behavior?
> That is, the very first commit (df3271f3361b ("[TCP] BIC: CUBIC window growth (2.0)")) when the code was first introduced?

I was referring to this line : Fixes: ac35f562203a ("tcp: bic, cubic:
use tcp_jiffies32 instead of tcp_time_stamp")

Commit ac35f562203a did not change the behavior at all.

I see no particular reason to mention it, this is confusing.


> >> Signed-off-by: Mingrui Zhang <mrzhang97@gmail.com>
> >> Signed-off-by: Lisong Xu <xu@unl.edu>
> >> ---
> >> v3->v4: Replace min() with min_t()
> >> v2->v3: Correct the "Fixes:" footer content
> >> v1->v2: Separate patches
> >>
> >>  net/ipv4/tcp_cubic.c | 6 +++++-
> >>  1 file changed, 5 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
> >> index 5dbed91c6178..00da7d592032 100644
> >> --- a/net/ipv4/tcp_cubic.c
> >> +++ b/net/ipv4/tcp_cubic.c
> >> @@ -218,8 +218,12 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked)
> >>
> >>         ca->ack_cnt += acked;   /* count the number of ACKed packets */
> >>
> >> +       /* Update 32 times per second if RTT > 1/32 second,
> >> +        * or every RTT if RTT < 1/32 second even when last_cwnd == cwnd
> >> +        */
> >>         if (ca->last_cwnd == cwnd &&
> >> -           (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32)
> >> +           (s32)(tcp_jiffies32 - ca->last_time) <=
> >> +           min_t(s32, HZ / 32, usecs_to_jiffies(ca->delay_min)))
> > This looks convoluted to me and still limited if HZ=250 (some distros
> > still use 250 jiffies per second :/ )
> >
> > I would suggest switching to usec right away.
> Thank you for the suggestion, however, I may need more time to discuss with another author for this revision. :)
> Thank you

No problem, there is no hurry.
Mingrui Zhang Aug. 25, 2024, 5:47 p.m. UTC | #4
On 8/20/24 07:53, Eric Dumazet wrote:
> On Mon, Aug 19, 2024 at 10:36 PM Mingrui Zhang <mrzhang97@gmail.com> wrote:
>> On 8/19/24 04:00, Eric Dumazet wrote:
>>> On Sat, Aug 17, 2024 at 6:35 PM Mingrui Zhang <mrzhang97@gmail.com> wrote:
>>>> The original code bypasses bictcp_update() under certain conditions
>>>> to reduce the CPU overhead. Intuitively, when last_cwnd==cwnd,
>>>> bictcp_update() is executed 32 times per second. As a result,
>>>> it is possible that bictcp_update() is not executed for several
>>>> RTTs when RTT is short (specifically < 1/32 second = 31 ms and
>>>> last_cwnd==cwnd which may happen in small-BDP networks),
>>>> thus leading to low throughput in these RTTs.
>>>>
>>>> The patched code executes bictcp_update() 32 times per second
>>>> if RTT > 31 ms or every RTT if RTT < 31 ms, when last_cwnd==cwnd.
>>>>
>>>> Fixes: df3271f3361b ("[TCP] BIC: CUBIC window growth (2.0)")
>>>> Fixes: ac35f562203a ("tcp: bic, cubic: use tcp_jiffies32 instead of tcp_time_stamp")
>>> I do not understand this Fixes: tag ?
>>>
>>> Commit  ac35f562203a was essentially a nop at that time...
>>>
>> I may misunderstood the use of Fixes tag and choose the latest commit of that line.
>>
>> Shall it supposed to be the very first commit with that behavior?
>> That is, the very first commit (df3271f3361b ("[TCP] BIC: CUBIC window growth (2.0)")) when the code was first introduced?
> I was referring to this line : Fixes: ac35f562203a ("tcp: bic, cubic:
> use tcp_jiffies32 instead of tcp_time_stamp")
>
> Commit ac35f562203a did not change the behavior at all.
>
> I see no particular reason to mention it, this is confusing.
>
>
>>>> Signed-off-by: Mingrui Zhang <mrzhang97@gmail.com>
>>>> Signed-off-by: Lisong Xu <xu@unl.edu>
>>>> ---
>>>> v3->v4: Replace min() with min_t()
>>>> v2->v3: Correct the "Fixes:" footer content
>>>> v1->v2: Separate patches
>>>>
>>>>  net/ipv4/tcp_cubic.c | 6 +++++-
>>>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
>>>> index 5dbed91c6178..00da7d592032 100644
>>>> --- a/net/ipv4/tcp_cubic.c
>>>> +++ b/net/ipv4/tcp_cubic.c
>>>> @@ -218,8 +218,12 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked)
>>>>
>>>>         ca->ack_cnt += acked;   /* count the number of ACKed packets */
>>>>
>>>> +       /* Update 32 times per second if RTT > 1/32 second,
>>>> +        * or every RTT if RTT < 1/32 second even when last_cwnd == cwnd
>>>> +        */
>>>>         if (ca->last_cwnd == cwnd &&
>>>> -           (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32)
>>>> +           (s32)(tcp_jiffies32 - ca->last_time) <=
>>>> +           min_t(s32, HZ / 32, usecs_to_jiffies(ca->delay_min)))
>>> This looks convoluted to me and still limited if HZ=250 (some distros
>>> still use 250 jiffies per second :/ )
>>>
>>> I would suggest switching to usec right away.
>> Thank you for the suggestion, however, I may need more time to discuss with another author for this revision. :)
>> Thank you
> No problem, there is no hurry.

Thank you, Eric, for your suggestion (switching ca->last_time from jiffies to usec)!
We thought about it and feel that it is more complicated and beyond the scope of this patch.

There are two blocks of code in bictcp_update().
* Block 1: cubic calculation, which is computationally intensive.
* Block 2: tcp friendliness, which emulates RENO.

There are two if statements to control how often these two blocks are called to reduce CPU overhead.
 * If statement 1:  if the condition is true, none of the two blocks are called.
if (ca->last_cwnd == cwnd &&
                    (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32)
                                return;

* If statement 2: If the condition is true, block 1 is not called. Intuitively, block 1 is called at most once per jiffy.
if (ca->epoch_start && tcp_jiffies32 == ca->last_time)
                                goto tcp_friendliness;


This patch changes only the first if statement. If we switch ca->last_time from jiffies to usec,
we need to change not only the first if statement but also the second if statement, as well as block 1.
* change the first if statement from jiffies to usec.
* change the second if statement from jiffies to usec. Need to determine how often (in usec) block 1 is called
* change block 1 from jiffies to usec. Should be fine, but need to make sure no calculation overflow.

Therefore, it might be better to keep the current patch as it is, and address the switch from jiffies to usec in future patches.

Thank you!
Lisong (Mingrui Send On Behalf of)
Eric Dumazet Aug. 26, 2024, 9:25 a.m. UTC | #5
On Sun, Aug 25, 2024 at 7:47 PM Mingrui Zhang <mrzhang97@gmail.com> wrote:
>
> On 8/20/24 07:53, Eric Dumazet wrote:
> > On Mon, Aug 19, 2024 at 10:36 PM Mingrui Zhang <mrzhang97@gmail.com> wrote:
> >> On 8/19/24 04:00, Eric Dumazet wrote:
> >>> On Sat, Aug 17, 2024 at 6:35 PM Mingrui Zhang <mrzhang97@gmail.com> wrote:
> >>>> The original code bypasses bictcp_update() under certain conditions
> >>>> to reduce the CPU overhead. Intuitively, when last_cwnd==cwnd,
> >>>> bictcp_update() is executed 32 times per second. As a result,
> >>>> it is possible that bictcp_update() is not executed for several
> >>>> RTTs when RTT is short (specifically < 1/32 second = 31 ms and
> >>>> last_cwnd==cwnd which may happen in small-BDP networks),
> >>>> thus leading to low throughput in these RTTs.
> >>>>
> >>>> The patched code executes bictcp_update() 32 times per second
> >>>> if RTT > 31 ms or every RTT if RTT < 31 ms, when last_cwnd==cwnd.
> >>>>
> >>>> Fixes: df3271f3361b ("[TCP] BIC: CUBIC window growth (2.0)")
> >>>> Fixes: ac35f562203a ("tcp: bic, cubic: use tcp_jiffies32 instead of tcp_time_stamp")
> >>> I do not understand this Fixes: tag ?
> >>>
> >>> Commit  ac35f562203a was essentially a nop at that time...
> >>>
> >> I may misunderstood the use of Fixes tag and choose the latest commit of that line.
> >>
> >> Shall it supposed to be the very first commit with that behavior?
> >> That is, the very first commit (df3271f3361b ("[TCP] BIC: CUBIC window growth (2.0)")) when the code was first introduced?
> > I was referring to this line : Fixes: ac35f562203a ("tcp: bic, cubic:
> > use tcp_jiffies32 instead of tcp_time_stamp")
> >
> > Commit ac35f562203a did not change the behavior at all.
> >
> > I see no particular reason to mention it, this is confusing.
> >
> >
> >>>> Signed-off-by: Mingrui Zhang <mrzhang97@gmail.com>
> >>>> Signed-off-by: Lisong Xu <xu@unl.edu>
> >>>> ---
> >>>> v3->v4: Replace min() with min_t()
> >>>> v2->v3: Correct the "Fixes:" footer content
> >>>> v1->v2: Separate patches
> >>>>
> >>>>  net/ipv4/tcp_cubic.c | 6 +++++-
> >>>>  1 file changed, 5 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
> >>>> index 5dbed91c6178..00da7d592032 100644
> >>>> --- a/net/ipv4/tcp_cubic.c
> >>>> +++ b/net/ipv4/tcp_cubic.c
> >>>> @@ -218,8 +218,12 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked)
> >>>>
> >>>>         ca->ack_cnt += acked;   /* count the number of ACKed packets */
> >>>>
> >>>> +       /* Update 32 times per second if RTT > 1/32 second,
> >>>> +        * or every RTT if RTT < 1/32 second even when last_cwnd == cwnd
> >>>> +        */
> >>>>         if (ca->last_cwnd == cwnd &&
> >>>> -           (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32)
> >>>> +           (s32)(tcp_jiffies32 - ca->last_time) <=
> >>>> +           min_t(s32, HZ / 32, usecs_to_jiffies(ca->delay_min)))
> >>> This looks convoluted to me and still limited if HZ=250 (some distros
> >>> still use 250 jiffies per second :/ )
> >>>
> >>> I would suggest switching to usec right away.
> >> Thank you for the suggestion, however, I may need more time to discuss with another author for this revision. :)
> >> Thank you
> > No problem, there is no hurry.
>
> Thank you, Eric, for your suggestion (switching ca->last_time from jiffies to usec)!
> We thought about it and feel that it is more complicated and beyond the scope of this patch.
>
> There are two blocks of code in bictcp_update().
> * Block 1: cubic calculation, which is computationally intensive.
> * Block 2: tcp friendliness, which emulates RENO.
>
> There are two if statements to control how often these two blocks are called to reduce CPU overhead.
>  * If statement 1:  if the condition is true, none of the two blocks are called.
> if (ca->last_cwnd == cwnd &&
>                     (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32)
>                                 return;
>
> * If statement 2: If the condition is true, block 1 is not called. Intuitively, block 1 is called at most once per jiffy.
> if (ca->epoch_start && tcp_jiffies32 == ca->last_time)
>                                 goto tcp_friendliness;
>
>
> This patch changes only the first if statement. If we switch ca->last_time from jiffies to usec,
> we need to change not only the first if statement but also the second if statement, as well as block 1.
> * change the first if statement from jiffies to usec.
> * change the second if statement from jiffies to usec. Need to determine how often (in usec) block 1 is called
> * change block 1 from jiffies to usec. Should be fine, but need to make sure no calculation overflow.

No problem, I can take care of the jiffies -> usec conversion, you can
then send your patch on top of it.

>
> Therefore, it might be better to keep the current patch as it is, and address the switch from jiffies to usec in future patches.

I prefer you rebase your patch after mine is merged.

There is a common misconception with jiffies.
It can change in less than 20 nsec.
Assuming that delta(jiffies) == 1 means that 1ms has elapsed is plain wrong.
In the old days, linux TCP only could rely on jiffies and we had to
accept its limits.
We now can switch to high resolution clocks, without extra costs,
because we already cache in tcp->tcp_mstamp
the usec timestamp for the current time.

Some distros are using CONFIG_HZ_250=y or CONFIG_HZ_100=y, this means
current logic in cubic is more fuzzy for them.

Without ca->last_time conversion to jiffies, your patch would still be
limited to jiffies resolution:
usecs_to_jiffies(ca->delay_min) would round up to much bigger values
for DC communications.
Neal Cardwell Aug. 28, 2024, 8:32 p.m. UTC | #6
On Mon, Aug 26, 2024 at 5:26 AM Eric Dumazet <edumazet@google.com> wrote:
...
> I prefer you rebase your patch after mine is merged.
>
> There is a common misconception with jiffies.
> It can change in less than 20 nsec.
> Assuming that delta(jiffies) == 1 means that 1ms has elapsed is plain wrong.
> In the old days, linux TCP only could rely on jiffies and we had to
> accept its limits.
> We now can switch to high resolution clocks, without extra costs,
> because we already cache in tcp->tcp_mstamp
> the usec timestamp for the current time.
>
> Some distros are using CONFIG_HZ_250=y or CONFIG_HZ_100=y, this means
> current logic in cubic is more fuzzy for them.
>
> Without ca->last_time conversion to jiffies, your patch would still be
> limited to jiffies resolution:
> usecs_to_jiffies(ca->delay_min) would round up to much bigger values
> for DC communications.

Even given Eric's excellent point that is raised above, that an
increase of jiffies by one can happen even though only O(us) or less
may have elapsed, AFAICT the patch should be fine in practice.

The patch says:

+       /* Update 32 times per second if RTT > 1/32 second,
+        * or every RTT if RTT < 1/32 second even when last_cwnd == cwnd
+        */
        if (ca->last_cwnd == cwnd &&
-           (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32)
+           (s32)(tcp_jiffies32 - ca->last_time) <=
+           min_t(s32, HZ / 32, usecs_to_jiffies(ca->delay_min)))
                return;

So, basically, we only run fall through and try to run the core of
bictcp_update() if cwnd has increased since ca-> last_cwnd, or
tcp_jiffies32 has increased by more than
min_t(s32, HZ / 32, usecs_to_jiffies(ca->delay_min)) since ca->last_time.

AFAICT  this works out OK because the logic is looking for "more than"
and usecs_to_jiffies() rounds up. That means that in the
interesting/tricky/common case where ca->delay_min is less than a
jiffy, usecs_to_jiffies(ca->delay_min) will return 1 jiffy. That means
that in this case we will only fall through and try to run the core of
bictcp_update() if cwnd has increased since ca-> last_cwnd, or
tcp_jiffies32 has increased by more than 1 jiffy (i.e., 2 or more
jiffies).

AFAICT the fact that this check is triggering only if tcp_jiffies32
has increased by 2 or more means that  at least one full jiffy has
elapsed between when we set ca->last_time and the time when this check
triggers running the core of bictcp_update().

So AFAICT this logic is not tricked by the fact that a single
increment of tcp_jiffies32 can happen over O(us) or less.

At first glance it may sound like if the RTT is much less than a
jiffy, many RTTs could elapse before we run the core of
bictcp_update(). However,  AFAIK if the RTT is much less than a jiffy
then CUBIC is very likely in Reno mode, and so is very likely to
increase cwnd by roughly 1 packet per round trip (the behavior of
Reno), so the (ca->last_cwnd == cwnd) condition should fail roughly
once per round trip and allow recomputation of the ca->cnt slope.

So AFAICT this patch should be OK in practice.

Given those considerations, Eric, do you think it would be OK to
accept the patch as-is?

Thanks!

neal
Eric Dumazet Sept. 30, 2024, 4:24 p.m. UTC | #7
On Wed, Aug 28, 2024 at 10:32 PM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Mon, Aug 26, 2024 at 5:26 AM Eric Dumazet <edumazet@google.com> wrote:
> ...
> > I prefer you rebase your patch after mine is merged.
> >
> > There is a common misconception with jiffies.
> > It can change in less than 20 nsec.
> > Assuming that delta(jiffies) == 1 means that 1ms has elapsed is plain wrong.
> > In the old days, linux TCP only could rely on jiffies and we had to
> > accept its limits.
> > We now can switch to high resolution clocks, without extra costs,
> > because we already cache in tcp->tcp_mstamp
> > the usec timestamp for the current time.
> >
> > Some distros are using CONFIG_HZ_250=y or CONFIG_HZ_100=y, this means
> > current logic in cubic is more fuzzy for them.
> >
> > Without ca->last_time conversion to jiffies, your patch would still be
> > limited to jiffies resolution:
> > usecs_to_jiffies(ca->delay_min) would round up to much bigger values
> > for DC communications.
>
> Even given Eric's excellent point that is raised above, that an
> increase of jiffies by one can happen even though only O(us) or less
> may have elapsed, AFAICT the patch should be fine in practice.
>
> The patch says:
>
> +       /* Update 32 times per second if RTT > 1/32 second,
> +        * or every RTT if RTT < 1/32 second even when last_cwnd == cwnd
> +        */
>         if (ca->last_cwnd == cwnd &&
> -           (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32)
> +           (s32)(tcp_jiffies32 - ca->last_time) <=
> +           min_t(s32, HZ / 32, usecs_to_jiffies(ca->delay_min)))
>                 return;
>
> So, basically, we only run fall through and try to run the core of
> bictcp_update() if cwnd has increased since ca-> last_cwnd, or
> tcp_jiffies32 has increased by more than
> min_t(s32, HZ / 32, usecs_to_jiffies(ca->delay_min)) since ca->last_time.
>
> AFAICT  this works out OK because the logic is looking for "more than"
> and usecs_to_jiffies() rounds up. That means that in the
> interesting/tricky/common case where ca->delay_min is less than a
> jiffy, usecs_to_jiffies(ca->delay_min) will return 1 jiffy. That means
> that in this case we will only fall through and try to run the core of
> bictcp_update() if cwnd has increased since ca-> last_cwnd, or
> tcp_jiffies32 has increased by more than 1 jiffy (i.e., 2 or more
> jiffies).
>
> AFAICT the fact that this check is triggering only if tcp_jiffies32
> has increased by 2 or more means that  at least one full jiffy has
> elapsed between when we set ca->last_time and the time when this check
> triggers running the core of bictcp_update().
>
> So AFAICT this logic is not tricked by the fact that a single
> increment of tcp_jiffies32 can happen over O(us) or less.
>
> At first glance it may sound like if the RTT is much less than a
> jiffy, many RTTs could elapse before we run the core of
> bictcp_update(). However,  AFAIK if the RTT is much less than a jiffy
> then CUBIC is very likely in Reno mode, and so is very likely to
> increase cwnd by roughly 1 packet per round trip (the behavior of
> Reno), so the (ca->last_cwnd == cwnd) condition should fail roughly
> once per round trip and allow recomputation of the ca->cnt slope.
>
> So AFAICT this patch should be OK in practice.
>
> Given those considerations, Eric, do you think it would be OK to
> accept the patch as-is?
>

Ok, what about updating net/ipv4/tcp_bic.c at the same time ?
diff mbox series

Patch

diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
index 5dbed91c6178..00da7d592032 100644
--- a/net/ipv4/tcp_cubic.c
+++ b/net/ipv4/tcp_cubic.c
@@ -218,8 +218,12 @@  static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked)
 
 	ca->ack_cnt += acked;	/* count the number of ACKed packets */
 
+	/* Update 32 times per second if RTT > 1/32 second,
+	 * or every RTT if RTT < 1/32 second even when last_cwnd == cwnd
+	 */
 	if (ca->last_cwnd == cwnd &&
-	    (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32)
+	    (s32)(tcp_jiffies32 - ca->last_time) <=
+	    min_t(s32, HZ / 32, usecs_to_jiffies(ca->delay_min)))
 		return;
 
 	/* The CUBIC function can update ca->cnt at most once per jiffy.