[net-next,0/2] tcp: add sysctl_tcp_rto_min_us

Message ID 20240528171320.1332292-1-yyd@google.com (mailing list archive)

Message

Kevin Yang May 28, 2024, 5:13 p.m. UTC
Add a sysctl knob that lets the user specify a default
rto_min at socket init time.

After this patch series, rto_min has multiple sources:
the route option has the highest precedence, followed by the
TCP_BPF_RTO_MIN socket option, followed by the new
tcp_rto_min_us sysctl.
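
The precedence described above can be sketched as a small userspace model. This is only an illustration under stated assumptions, not the kernel code from the series: the struct, its field names, and the function name are all hypothetical, and values are taken to be in microseconds to match the sysctl name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the three rto_min sources named in the cover letter. */
struct rto_min_sources {
    bool     route_set;          /* route option present (highest precedence) */
    uint32_t route_rto_min_us;
    bool     bpf_set;            /* TCP_BPF_RTO_MIN applied to the socket */
    uint32_t bpf_rto_min_us;
    uint32_t sysctl_rto_min_us;  /* tcp_rto_min_us default read at socket init */
};

/* Return the rto_min that wins: route, then BPF socket option, then sysctl. */
uint32_t effective_rto_min_us(const struct rto_min_sources *s)
{
    assert(s != NULL);

    if (s->route_set)
        return s->route_rto_min_us;
    if (s->bpf_set)
        return s->bpf_rto_min_us;
    return s->sysctl_rto_min_us;
}
```

In this model the sysctl only supplies the fallback, so setting it does not change the result for a socket that already has a route option or a BPF-set value.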

Kevin Yang (2):
  tcp: derive delack_max with tcp_rto_min helper
  tcp: add sysctl_tcp_rto_min_us

 Documentation/networking/ip-sysctl.rst | 13 +++++++++++++
 include/net/netns/ipv4.h               |  1 +
 net/ipv4/sysctl_net_ipv4.c             |  8 ++++++++
 net/ipv4/tcp.c                         |  3 ++-
 net/ipv4/tcp_ipv4.c                    |  1 +
 net/ipv4/tcp_output.c                  | 11 ++---------
 6 files changed, 27 insertions(+), 10 deletions(-)

Comments

Jason Xing May 29, 2024, 6:43 a.m. UTC | #1
Hello Kevin,

On Wed, May 29, 2024 at 1:13 AM Kevin Yang <yyd@google.com> wrote:
>
> Adding a sysctl knob to allow user to specify a default
> rto_min at socket init time.

What is the advantage of this new sysctl knob, given that we already
have BPF and similar mechanisms to tweak the rto min?

There are so many places/parameters of the TCP stack that can be
exposed to the user side and adjusted by new sysctls...

Thanks,
Jason

Jason Xing May 29, 2024, 6:59 a.m. UTC | #2
On Wed, May 29, 2024 at 2:43 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> Hello Kevin,
>
> On Wed, May 29, 2024 at 1:13 AM Kevin Yang <yyd@google.com> wrote:
> >
> > Adding a sysctl knob to allow user to specify a default
> > rto_min at socket init time.
>
> I wonder what the advantage of this new sysctl knob is since we have
> had BPF or something like that to tweak the rto min already?
>
> There are so many places/parameters of the TCP stack that can be
> exposed to the user side and adjusted by new sysctls...
>
> Thanks,
> Jason
>

Oh, I think you should have added Paolo as well.

+Paolo Abeni
Tony Lu May 29, 2024, 7:21 a.m. UTC | #3
On Tue, May 28, 2024 at 05:13:18PM +0000, Kevin Yang wrote:
> Adding a sysctl knob to allow user to specify a default
> rto_min at socket init time.
> 
> After this patch series, the rto_min will has multiple sources:
> route option has the highest precedence, followed by the
> TCP_BPF_RTO_MIN socket option, followed by this new
> tcp_rto_min_us sysctl.

For series:

Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>

I strongly support these patches. For those who use cgroup v1 and want
a simple setting that just takes effect, a sysctl is a good approach.

Reducing rto_min also helps latency-sensitive applications such as
Redis, and a net-namespace-level sysctl knob is enough for that.

Eric Dumazet May 29, 2024, 7:39 a.m. UTC | #4
On Wed, May 29, 2024 at 9:00 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Wed, May 29, 2024 at 2:43 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > Hello Kevin,
> >
> > On Wed, May 29, 2024 at 1:13 AM Kevin Yang <yyd@google.com> wrote:
> > >
> > > Adding a sysctl knob to allow user to specify a default
> > > rto_min at socket init time.
> >
> > I wonder what the advantage of this new sysctl knob is since we have
> > had BPF or something like that to tweak the rto min already?
> >
> > There are so many places/parameters of the TCP stack that can be
> > exposed to the user side and adjusted by new sysctls...
> >
> > Thanks,
> > Jason
> >
>
> Oh, I think you should have added Paolo as well.
>
> +Paolo Abeni

Many cloud customers do not have any BPF expertise.
If they use existing BPF programs (added by a product), they might not
have the ability to change it.

We tried advising them to use route attributes, after
commit bbf80d713fe75cfbecda26e7c03a9a8d22af2f4f ("tcp: derive
delack_max from rto_min")
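
A minimal sketch of the idea named in that commit title: if the sender's rto_min is lowered, the receiver's maximum delayed-ACK timeout has to stay below it, or the sender may retransmit before the delayed ACK arrives. The function name, the abstract tick units, and the exact clamp below are illustrative assumptions, not the kernel's implementation.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical userspace model: cap the delayed-ACK maximum just below
 * rto_min. Both values are in the same abstract timer ticks.
 */
uint32_t delack_max_from_rto_min(uint32_t delack_max, uint32_t rto_min)
{
    uint32_t cap;

    assert(delack_max > 0);

    /* Keep the cap at least 1 tick, and below rto_min whenever rto_min > 1. */
    cap = (rto_min > 2 ? rto_min : 2) - 1;

    return delack_max < cap ? delack_max : cap;
}
```

With a delayed-ACK maximum of 40 ticks, an rto_min of 200 leaves it untouched, while an rto_min of 5 pulls it down to 4.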

Alas, dhcpd was adding its own routes, without the "rto_min 5"
attribute, then systemd came...
Lots of frustration, lots of wasted time, for something that has been
used for more than a decade
in Google DC.

With a sysctl, we could have saved months of SWE, and helped our
customers sooner.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Tony Lu May 29, 2024, 7:56 a.m. UTC | #5
On Wed, May 29, 2024 at 09:39:02AM +0200, Eric Dumazet wrote:
> On Wed, May 29, 2024 at 9:00 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Wed, May 29, 2024 at 2:43 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >
> > > Hello Kevin,
> > >
> > > On Wed, May 29, 2024 at 1:13 AM Kevin Yang <yyd@google.com> wrote:
> > > >
> > > > Adding a sysctl knob to allow user to specify a default
> > > > rto_min at socket init time.
> > >
> > > I wonder what the advantage of this new sysctl knob is since we have
> > > had BPF or something like that to tweak the rto min already?
> > >
> > > There are so many places/parameters of the TCP stack that can be
> > > exposed to the user side and adjusted by new sysctls...
> > >
> > > Thanks,
> > > Jason
> > >
> >
> > Oh, I think you should have added Paolo as well.
> >
> > +Paolo Abeni
> 
> Many cloud customers do not have any BPF expertise.
> If they use existing BPF programs (added by a product), they might not
> have the ability to change it.

+1. eBPF is still not easy to write, debug, and manage.
Sysctls are easy to use: just put them into /etc/sysctl.conf and save that
into users' customized images or templates. AFAIK, there is no standard
system kit for handling eBPF in most OS distros.

> 
> We tried advising them to use route attributes, after
> commit bbf80d713fe75cfbecda26e7c03a9a8d22af2f4f ("tcp: derive
> delack_max from rto_min")
> 
> Alas, dhcpd was adding its own routes, without the "rto_min 5"
> attribute, then systemd came...
> Lots of frustration, lots of wasted time, for something that has been
> used for more than a decade
> in Google DC.
> 
> With a sysctl, we could have saved months of SWE, and helped our
> customers sooner.
> 
> Reviewed-by: Eric Dumazet <edumazet@google.com>
Jason Xing May 29, 2024, 8:43 a.m. UTC | #6
Hello Eric,

On Wed, May 29, 2024 at 3:39 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, May 29, 2024 at 9:00 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Wed, May 29, 2024 at 2:43 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >
> > > Hello Kevin,
> > >
> > > On Wed, May 29, 2024 at 1:13 AM Kevin Yang <yyd@google.com> wrote:
> > > >
> > > > Adding a sysctl knob to allow user to specify a default
> > > > rto_min at socket init time.
> > >
> > > I wonder what the advantage of this new sysctl knob is since we have
> > > had BPF or something like that to tweak the rto min already?
> > >
> > > There are so many places/parameters of the TCP stack that can be
> > > exposed to the user side and adjusted by new sysctls...
> > >
> > > Thanks,
> > > Jason
> > >
> >
> > Oh, I think you should have added Paolo as well.
> >
> > +Paolo Abeni
>
> Many cloud customers do not have any BPF expertise.
> If they use existing BPF programs (added by a product), they might not
> have the ability to change it.
>
> We tried advising them to use route attributes, after
> commit bbf80d713fe75cfbecda26e7c03a9a8d22af2f4f ("tcp: derive
> delack_max from rto_min")
>
> Alas, dhcpd was adding its own routes, without the "rto_min 5"
> attribute, then systemd came...
> Lots of frustration, lots of wasted time, for something that has been
> used for more than a decade
> in Google DC.
>
> With a sysctl, we could have saved months of SWE, and helped our
> customers sooner.

I'm definitely aware of the importance of this kind of sysctl knob.
Around 6 or 7 years ago, we implemented something similar in our
private kernel.

For a long time, netdev folks have often raised the same question I did
in the previous email. I'm not against the patch; I'm just repeating the
question and asking ourselves again: is it really necessary? There are
still a lot of places we could tune/control by introducing new sysctls.

For a long time, there have been plenty of papers studying different
combinations of parameters in the TCP stack so that they can
serve one particular case well.

Do we also need to expose the remaining possible parameters to the user
side? Just curious...

Thanks,
Jason
Jason Xing May 29, 2024, 8:49 a.m. UTC | #7
On Wed, May 29, 2024 at 3:21 PM Tony Lu <tonylu@linux.alibaba.com> wrote:
>
> On Tue, May 28, 2024 at 05:13:18PM +0000, Kevin Yang wrote:
> > Adding a sysctl knob to allow user to specify a default
> > rto_min at socket init time.
> >
> > After this patch series, the rto_min will has multiple sources:
> > route option has the highest precedence, followed by the
> > TCP_BPF_RTO_MIN socket option, followed by this new
> > tcp_rto_min_us sysctl.
>
> For series:
>
> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
>
> I strongly support those patches. For those who use cgroup v1 and want
> to take effect with simple settings, sysctl is a good way.

That's not a good reason to use a sysctl.

By that logic, why not introduce many sysctls to replace setsockopt
operations? For example, would we introduce a new sysctl that disables
delayed ACKs to speed up transmission in some cases, just for ease
of use? No, I don't believe that would be right.

>
> And reducing it is helpful for latency-sensitive applications such as
> Redis, net namespace level sysctl knob is enough.

Sure, these key parameters play a big role in the TCP stack.

Eric Dumazet May 29, 2024, 9:23 a.m. UTC | #8
On Wed, May 29, 2024 at 10:44 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> Hello Eric,
>
> On Wed, May 29, 2024 at 3:39 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Wed, May 29, 2024 at 9:00 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >
> > > On Wed, May 29, 2024 at 2:43 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > >
> > > > Hello Kevin,
> > > >
> > > > On Wed, May 29, 2024 at 1:13 AM Kevin Yang <yyd@google.com> wrote:
> > > > >
> > > > > Adding a sysctl knob to allow user to specify a default
> > > > > rto_min at socket init time.
> > > >
> > > > I wonder what the advantage of this new sysctl knob is since we have
> > > > had BPF or something like that to tweak the rto min already?
> > > >
> > > > There are so many places/parameters of the TCP stack that can be
> > > > exposed to the user side and adjusted by new sysctls...
> > > >
> > > > Thanks,
> > > > Jason
> > > >
> > >
> > > Oh, I think you should have added Paolo as well.
> > >
> > > +Paolo Abeni
> >
> > Many cloud customers do not have any BPF expertise.
> > If they use existing BPF programs (added by a product), they might not
> > have the ability to change it.
> >
> > We tried advising them to use route attributes, after
> > commit bbf80d713fe75cfbecda26e7c03a9a8d22af2f4f ("tcp: derive
> > delack_max from rto_min")
> >
> > Alas, dhcpd was adding its own routes, without the "rto_min 5"
> > attribute, then systemd came...
> > Lots of frustration, lots of wasted time, for something that has been
> > used for more than a decade
> > in Google DC.
> >
> > With a sysctl, we could have saved months of SWE, and helped our
> > customers sooner.
>
> I'm definitely aware of the importance of this kind of sysctl knob.
> Many years ago (around 6 or 7 years ago), we already implemented
> similar things in the private kernel.
>
> For a long time, netdev guys often proposed the question as I did in
> the previous email. I'm not against it, just repeating the same
> question and asking ourselves again: is it really necessary? We still
> have a lot of places to tune/control by introducing new sysctl.
>
> For a long time, there have been plenty of papers studying different
> combinations of different parameters in TCP stack so that they can
> serve one particular case well.
>
> Do we also need to expose remaining possible parameters to the user
> side? Just curious...

You know, counting CLOSE_WAIT can be done with an eBPF program just fine.

I think long-time TCP maintainers like Eric Dumazet, Neal Cardwell,
and Yuchung Cheng know better,
you will have to trust us.

If you do not want to use the sysctl, this is fine, we do not force you.
Jason Xing May 29, 2024, 9:30 a.m. UTC | #9
On Wed, May 29, 2024 at 5:23 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, May 29, 2024 at 10:44 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > Hello Eric,
> >
> > On Wed, May 29, 2024 at 3:39 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Wed, May 29, 2024 at 9:00 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > >
> > > > On Wed, May 29, 2024 at 2:43 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > > >
> > > > > Hello Kevin,
> > > > >
> > > > > On Wed, May 29, 2024 at 1:13 AM Kevin Yang <yyd@google.com> wrote:
> > > > > >
> > > > > > Adding a sysctl knob to allow user to specify a default
> > > > > > rto_min at socket init time.
> > > > >
> > > > > I wonder what the advantage of this new sysctl knob is since we have
> > > > > had BPF or something like that to tweak the rto min already?
> > > > >
> > > > > There are so many places/parameters of the TCP stack that can be
> > > > > exposed to the user side and adjusted by new sysctls...
> > > > >
> > > > > Thanks,
> > > > > Jason
> > > > >
> > > >
> > > > Oh, I think you should have added Paolo as well.
> > > >
> > > > +Paolo Abeni
> > >
> > > Many cloud customers do not have any BPF expertise.
> > > If they use existing BPF programs (added by a product), they might not
> > > have the ability to change it.
> > >
> > > We tried advising them to use route attributes, after
> > > commit bbf80d713fe75cfbecda26e7c03a9a8d22af2f4f ("tcp: derive
> > > delack_max from rto_min")
> > >
> > > Alas, dhcpd was adding its own routes, without the "rto_min 5"
> > > attribute, then systemd came...
> > > Lots of frustration, lots of wasted time, for something that has been
> > > used for more than a decade
> > > in Google DC.
> > >
> > > With a sysctl, we could have saved months of SWE, and helped our
> > > customers sooner.
> >
> > I'm definitely aware of the importance of this kind of sysctl knob.
> > Many years ago (around 6 or 7 years ago), we already implemented
> > similar things in the private kernel.
> >
> > For a long time, netdev guys often proposed the question as I did in
> > the previous email. I'm not against it, just repeating the same
> > question and asking ourselves again: is it really necessary? We still
> > have a lot of places to tune/control by introducing new sysctl.
> >
> > For a long time, there have been plenty of papers studying different
> > combinations of different parameters in TCP stack so that they can
> > serve one particular case well.
> >
> > Do we also need to expose remaining possible parameters to the user
> > side? Just curious...
>
> You know, counting CLOSE_WAIT can be done with  eBPF program just fine.
>
> I think long-time TCP maintainers like Eric Dumazet, Neal Cardwell,
> and Yuchung Cheng know better,
> you will have to trust us.

You misunderstand me, Eric. I trust you, of course. I'm only asking out
of curiosity because I have seen other threads face the same question.

And, as I said, we have used this in our private kernel for a long
time, so it is useful.

BTW, since you mention it, what do you think of that close-wait patch?

Thanks,
Jason
Kevin Yang May 29, 2024, 7:56 p.m. UTC | #10
Hello Jason,

I guess I don't have anything more to add beyond what Eric and Tony mentioned.

The purpose of this sysctl is to resolve problems in a more straightforward
way than other existing methods.

Thanks
kevin

Tony Lu May 30, 2024, 1:08 a.m. UTC | #11
On Wed, May 29, 2024 at 04:49:39PM +0800, Jason Xing wrote:
> On Wed, May 29, 2024 at 3:21 PM Tony Lu <tonylu@linux.alibaba.com> wrote:
> >
> > On Tue, May 28, 2024 at 05:13:18PM +0000, Kevin Yang wrote:
> > > Adding a sysctl knob to allow user to specify a default
> > > rto_min at socket init time.
> > >
> > > After this patch series, the rto_min will has multiple sources:
> > > route option has the highest precedence, followed by the
> > > TCP_BPF_RTO_MIN socket option, followed by this new
> > > tcp_rto_min_us sysctl.
> >
> > For series:
> >
> > Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
> >
> > I strongly support those patches. For those who use cgroup v1 and want
> > to take effect with simple settings, sysctl is a good way.
> 
> It's not a good reason to use sysctl.
> 
> If you say so, why not introduce many sysctls to replace setsockopt
> operations. For example, introducing a new sysctl to disable delayed
> ack to improve the speed of transmission in some cases just for ease
> of use? No, it's not right, I believe.
> 

What I meant is this: if I am a kernel engineer or SRE helping
users troubleshoot latency issues and I need to tune tcp_rto_min, my
only means of not intruding on the application are eBPF or the sysctl
proposed here.

Comparing the two, I prefer a sysctl isolated by net namespace,
which can be modified and verified more directly and quickly. eBPF is
powerful, but it is not easy to write, debug, and manage.

> >
> > And reducing it is helpful for latency-sensitive applications such as
> > Redis, net namespace level sysctl knob is enough.
> 
> Sure, these key parameters play a big role in the TCP stack.
> 
> >
> > >
> > > Kevin Yang (2):
> > >   tcp: derive delack_max with tcp_rto_min helper
> > >   tcp: add sysctl_tcp_rto_min_us
> > >
> > >  Documentation/networking/ip-sysctl.rst | 13 +++++++++++++
> > >  include/net/netns/ipv4.h               |  1 +
> > >  net/ipv4/sysctl_net_ipv4.c             |  8 ++++++++
> > >  net/ipv4/tcp.c                         |  3 ++-
> > >  net/ipv4/tcp_ipv4.c                    |  1 +
> > >  net/ipv4/tcp_output.c                  | 11 ++---------
> > >  6 files changed, 27 insertions(+), 10 deletions(-)
> > >
> > > --
> > > 2.45.1.288.g0e0cd299f1-goog
> > >
> >