| Message ID | 20220215103639.11739-1-kerneljasonxing@gmail.com (mailing list archive) |
|---|---|
| State | Changes Requested |
| Delegated to | BPF |
| Series | net: do not set SOCK_RCVBUF_LOCK if sk_rcvbuf isn't reduced |
On Tue, Feb 15, 2022 at 2:37 AM <kerneljasonxing@gmail.com> wrote:
>
> From: Jason Xing <xingwanli@kuaishou.com>
>
> Normally, users don't care about the logic behind the kernel when they
> set the receive buffer via setsockopt. However, even if the new value
> of the receive buffer is not smaller than the initial value, which is
> sysctl_tcp_rmem[1] as used in tcp_rcv_space_adjust(), the server's
> wscale will shrink and then lead to bad bandwidth. I think that is
> not appropriate.

Then do not use SO_RCVBUF?

It is working as intended, really.

>
> Here are some numbers:
> $ sysctl -a | grep rmem
> net.core.rmem_default = 212992
> net.core.rmem_max = 40880000
> net.ipv4.tcp_rmem = 4096 425984 40880000
>
> Case 1
> on the server side
> # iperf -s -p 5201
> on the client side
> # iperf -c [server ip] -p 5201
> It turns out that the bandwidth is 9.34 Gbits/sec while the wscale of
> the server side is 10. It's good.
>
> Case 2
> on the server side
> # iperf -s -p 5201 -w 425984
> on the client side
> # iperf -c [server ip] -p 5201
> It turns out that the bandwidth is reduced to 2.73 Gbits/sec while the
> wscale is 2, even though the receive buffer is not changed at all at
> the very beginning.

Great, you discovered auto tuning is working as intended.

>
> Therefore, I added one condition so that the lock is only set when the
> user is trying to set a smaller rx buffer. After this patch is applied,
> the bandwidth of case 2 is recovered to 9.34 Gbits/sec.
>
> Fixes: e88c64f0a425 ("tcp: allow effective reduction of TCP's rcv-buffer via setsockopt")

This commit has nothing to do with your patch or feature.

> Signed-off-by: Jason Xing <xingwanli@kuaishou.com>
> ---
>  net/core/filter.c | 7 ++++---
>  net/core/sock.c   | 8 +++++---
>  2 files changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 4603b7c..99f5d9c 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4795,9 +4795,10 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname,
> 	case SO_RCVBUF:
> 		val = min_t(u32, val, sysctl_rmem_max);
> 		val = min_t(int, val, INT_MAX / 2);
> -		sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> -		WRITE_ONCE(sk->sk_rcvbuf,
> -			   max_t(int, val * 2, SOCK_MIN_RCVBUF));
> +		val = max_t(int, val * 2, SOCK_MIN_RCVBUF);
> +		if (val < sock_net(sk)->ipv4.sysctl_tcp_rmem[1])
> +			sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> +		WRITE_ONCE(sk->sk_rcvbuf, val);
> 		break;
> 	case SO_SNDBUF:
> 		val = min_t(u32, val, sysctl_wmem_max);
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 4ff806d..e5e9cb0 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -923,8 +923,6 @@ static void __sock_set_rcvbuf(struct sock *sk, int val)
> 	 * as a negative value.
> 	 */
> 	val = min_t(int, val, INT_MAX / 2);
> -	sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> -
> 	/* We double it on the way in to account for "struct sk_buff" etc.
> 	 * overhead. Applications assume that the SO_RCVBUF setting they make
> 	 * will allow that much actual data to be received on that socket.
> @@ -935,7 +933,11 @@ static void __sock_set_rcvbuf(struct sock *sk, int val)
> 	 * And after considering the possible alternatives, returning the value
> 	 * we actually used in getsockopt is the most desirable behavior.
> 	 */
> -	WRITE_ONCE(sk->sk_rcvbuf, max_t(int, val * 2, SOCK_MIN_RCVBUF));
> +	val = max_t(int, val * 2, SOCK_MIN_RCVBUF);
> +	if (val < sock_net(sk)->ipv4.sysctl_tcp_rmem[1])
> +		sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> +
> +	WRITE_ONCE(sk->sk_rcvbuf, val);
> }
>
> void sock_set_rcvbuf(struct sock *sk, int val)

You are breaking applications that want to set sk->sk_rcvbuf to a fixed
value, to control memory usage on millions of active sockets in a host.

I think that you want new functionality, with new SO_ socket options,
targeting the net-next tree (no spurious Fixes: tag).

For instance, letting an application set or unset SOCK_RCVBUF_LOCK
would be more useful, and would not break applications.
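For context, what "iperf -s -w 425984" does under the hood is a single
setsockopt(SO_RCVBUF) call on the listening socket before listen(). A
minimal userspace sketch follows (an illustration only, not part of the
patch; the port and buffer size are the values used in this thread).
With the current kernel this one call sets SOCK_RCVBUF_LOCK, which is
inherited by accepted sockets, so tcp_rcv_space_adjust() never grows
sk_rcvbuf and the advertised window scale stays small:

#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port   = htons(5201),	/* iperf port from the thread */
	};
	int val = 425984;		/* equals net.ipv4.tcp_rmem[1] above */
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return 1;

	/* The kernel doubles val to account for sk_buff overhead and,
	 * today, sets SOCK_RCVBUF_LOCK unconditionally, freezing
	 * receive-buffer autotuning for every accepted socket.
	 */
	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &val, sizeof(val)) < 0)
		perror("setsockopt(SO_RCVBUF)");

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(fd, 16) < 0)
		perror("bind/listen");

	close(fd);
	return 0;
}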
On Tue, Feb 15, 2022 at 11:25 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Feb 15, 2022 at 2:37 AM <kerneljasonxing@gmail.com> wrote:
> >
> > From: Jason Xing <xingwanli@kuaishou.com>
> >
> > Normally, users don't care about the logic behind the kernel when they
> > set the receive buffer via setsockopt. However, even if the new value
> > of the receive buffer is not smaller than the initial value, which is
> > sysctl_tcp_rmem[1] as used in tcp_rcv_space_adjust(), the server's
> > wscale will shrink and then lead to bad bandwidth. I think that is
> > not appropriate.
>
> Then do not use SO_RCVBUF?
>
> It is working as intended, really.
>
> >
> > Here are some numbers:
> > $ sysctl -a | grep rmem
> > net.core.rmem_default = 212992
> > net.core.rmem_max = 40880000
> > net.ipv4.tcp_rmem = 4096 425984 40880000
> >
> > Case 1
> > on the server side
> > # iperf -s -p 5201
> > on the client side
> > # iperf -c [server ip] -p 5201
> > It turns out that the bandwidth is 9.34 Gbits/sec while the wscale of
> > the server side is 10. It's good.
> >
> > Case 2
> > on the server side
> > # iperf -s -p 5201 -w 425984
> > on the client side
> > # iperf -c [server ip] -p 5201
> > It turns out that the bandwidth is reduced to 2.73 Gbits/sec while the
> > wscale is 2, even though the receive buffer is not changed at all at
> > the very beginning.
>
> Great, you discovered auto tuning is working as intended.
>

Thanks.

> >
> > Therefore, I added one condition so that the lock is only set when the
> > user is trying to set a smaller rx buffer. After this patch is applied,
> > the bandwidth of case 2 is recovered to 9.34 Gbits/sec.
> >
> > Fixes: e88c64f0a425 ("tcp: allow effective reduction of TCP's rcv-buffer via setsockopt")
>
> This commit has nothing to do with your patch or feature.
>
> > Signed-off-by: Jason Xing <xingwanli@kuaishou.com>
> > ---
> >  net/core/filter.c | 7 ++++---
> >  net/core/sock.c   | 8 +++++---
> >  2 files changed, 9 insertions(+), 6 deletions(-)
> >
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 4603b7c..99f5d9c 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -4795,9 +4795,10 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname,
> > 	case SO_RCVBUF:
> > 		val = min_t(u32, val, sysctl_rmem_max);
> > 		val = min_t(int, val, INT_MAX / 2);
> > -		sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> > -		WRITE_ONCE(sk->sk_rcvbuf,
> > -			   max_t(int, val * 2, SOCK_MIN_RCVBUF));
> > +		val = max_t(int, val * 2, SOCK_MIN_RCVBUF);
> > +		if (val < sock_net(sk)->ipv4.sysctl_tcp_rmem[1])
> > +			sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> > +		WRITE_ONCE(sk->sk_rcvbuf, val);
> > 		break;
> > 	case SO_SNDBUF:
> > 		val = min_t(u32, val, sysctl_wmem_max);
> > diff --git a/net/core/sock.c b/net/core/sock.c
> > index 4ff806d..e5e9cb0 100644
> > --- a/net/core/sock.c
> > +++ b/net/core/sock.c
> > @@ -923,8 +923,6 @@ static void __sock_set_rcvbuf(struct sock *sk, int val)
> > 	 * as a negative value.
> > 	 */
> > 	val = min_t(int, val, INT_MAX / 2);
> > -	sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> > -
> > 	/* We double it on the way in to account for "struct sk_buff" etc.
> > 	 * overhead. Applications assume that the SO_RCVBUF setting they make
> > 	 * will allow that much actual data to be received on that socket.
> > @@ -935,7 +933,11 @@ static void __sock_set_rcvbuf(struct sock *sk, int val)
> > 	 * And after considering the possible alternatives, returning the value
> > 	 * we actually used in getsockopt is the most desirable behavior.
> > 	 */
> > -	WRITE_ONCE(sk->sk_rcvbuf, max_t(int, val * 2, SOCK_MIN_RCVBUF));
> > +	val = max_t(int, val * 2, SOCK_MIN_RCVBUF);
> > +	if (val < sock_net(sk)->ipv4.sysctl_tcp_rmem[1])
> > +		sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> > +
> > +	WRITE_ONCE(sk->sk_rcvbuf, val);
> > }
> >
> > void sock_set_rcvbuf(struct sock *sk, int val)
>
> You are breaking applications that want to set sk->sk_rcvbuf to a fixed
> value, to control memory usage on millions of active sockets in a host.
>
> I think that you want new functionality, with new SO_ socket options,
> targeting the net-next tree (no spurious Fixes: tag).
>

I agree with you.

> For instance, letting an application set or unset SOCK_RCVBUF_LOCK
> would be more useful, and would not break applications.

I would like to change the title for the v2 patch, since the title of
this version is a little misleading, say, "net: introduce SO_RCVBUFAUTO
to unset the SOCK_RCVBUF_LOCK flag".

Eric, what do you think of SO_RCVBUFAUTO as the name for this new
socket option?

Thanks,
Jason
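To make the suggestion concrete, a hypothetical sketch of such a
handler inside sock_setsockopt() is shown below. SO_RCVBUFAUTO is only
a name floated in this thread, not an existing kernel option, and the
semantics here are an assumption:

	/* Hypothetical sketch: SO_RCVBUFAUTO is not a real kernel
	 * option. Re-enabling autotuning means clearing
	 * SOCK_RCVBUF_LOCK so that tcp_rcv_space_adjust() may grow
	 * sk_rcvbuf again; unsetting it restores the locked behavior.
	 */
	case SO_RCVBUFAUTO:
		if (valbool)
			sk->sk_userlocks &= ~SOCK_RCVBUF_LOCK;
		else
			sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
		break;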
diff --git a/net/core/filter.c b/net/core/filter.c
index 4603b7c..99f5d9c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4795,9 +4795,10 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname,
 	case SO_RCVBUF:
 		val = min_t(u32, val, sysctl_rmem_max);
 		val = min_t(int, val, INT_MAX / 2);
-		sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
-		WRITE_ONCE(sk->sk_rcvbuf,
-			   max_t(int, val * 2, SOCK_MIN_RCVBUF));
+		val = max_t(int, val * 2, SOCK_MIN_RCVBUF);
+		if (val < sock_net(sk)->ipv4.sysctl_tcp_rmem[1])
+			sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
+		WRITE_ONCE(sk->sk_rcvbuf, val);
 		break;
 	case SO_SNDBUF:
 		val = min_t(u32, val, sysctl_wmem_max);
diff --git a/net/core/sock.c b/net/core/sock.c
index 4ff806d..e5e9cb0 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -923,8 +923,6 @@ static void __sock_set_rcvbuf(struct sock *sk, int val)
 	 * as a negative value.
 	 */
 	val = min_t(int, val, INT_MAX / 2);
-	sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
-
 	/* We double it on the way in to account for "struct sk_buff" etc.
 	 * overhead. Applications assume that the SO_RCVBUF setting they make
 	 * will allow that much actual data to be received on that socket.
@@ -935,7 +933,11 @@ static void __sock_set_rcvbuf(struct sock *sk, int val)
 	 * And after considering the possible alternatives, returning the value
 	 * we actually used in getsockopt is the most desirable behavior.
 	 */
-	WRITE_ONCE(sk->sk_rcvbuf, max_t(int, val * 2, SOCK_MIN_RCVBUF));
+	val = max_t(int, val * 2, SOCK_MIN_RCVBUF);
+	if (val < sock_net(sk)->ipv4.sysctl_tcp_rmem[1])
+		sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
+
+	WRITE_ONCE(sk->sk_rcvbuf, val);
 }

 void sock_set_rcvbuf(struct sock *sk, int val)
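A side note on the comment kept in the second hunk: the doubling is
observable from userspace. A small sketch (assuming default Linux
limits; the 64 KiB request is an arbitrary example) that asks for N
bytes and reads back roughly 2*N via getsockopt():

#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	int req = 65536, eff = 0;
	socklen_t len = sizeof(eff);

	if (fd < 0)
		return 1;

	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &req, sizeof(req)) < 0)
		perror("setsockopt(SO_RCVBUF)");
	if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &eff, &len) < 0)
		perror("getsockopt(SO_RCVBUF)");

	/* Typically prints 131072: the doubled value the kernel stored,
	 * clamped by net.core.rmem_max and floored at SOCK_MIN_RCVBUF.
	 */
	printf("requested %d, effective %d\n", req, eff);

	close(fd);
	return 0;
}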