
net: do not set SOCK_RCVBUF_LOCK if sk_rcvbuf isn't reduced

Message ID 20220215103639.11739-1-kerneljasonxing@gmail.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Series net: do not set SOCK_RCVBUF_LOCK if sk_rcvbuf isn't reduced

Checks

Context Check Description
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix warning Target tree name not specified in the subject
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 29 this patch: 29
netdev/cc_maintainers fail 1 blamed authors not CCed: hagen@jauu.net; 1 maintainers not CCed: hagen@jauu.net
netdev/build_clang success Errors and warnings before: 18 this patch: 18
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 34 this patch: 34
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 33 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next success VM_Test
bpf/vmtest-bpf-next-PR success PR summary
netdev/tree_selection success Guessing tree name failed - patch did not apply

Commit Message

Jason Xing Feb. 15, 2022, 10:36 a.m. UTC
From: Jason Xing <xingwanli@kuaishou.com>

Normally, users don't care about the logic behind the kernel when they
set the receive buffer via setsockopt(). However, even if the new value
of the receive buffer is not smaller than the initial value, which is
sysctl_tcp_rmem[1] as used in tcp_rcv_space_adjust(), the buffer gets
locked, the server's wscale shrinks, and bandwidth suffers. I think
that is not appropriate.

Here are some numbers:
$ sysctl -a | grep rmem
net.core.rmem_default = 212992
net.core.rmem_max = 40880000
net.ipv4.tcp_rmem = 4096	425984	40880000

Case 1
on the server side
    # iperf -s -p 5201
on the client side
    # iperf -c [server ip] -p 5201
It turns out that the bandwidth is 9.34 Gbits/sec while the wscale on
the server side is 10. It's good.

Case 2
on the server side
    # iperf -s -p 5201 -w 425984
on the client side
    # iperf -c [server ip] -p 5201
It turns out that the bandwidth drops to 2.73 Gbits/sec while the
wscale is 2, even though the receive buffer was not reduced at all
from its initial value.

Therefore, I added a condition so that SOCK_RCVBUF_LOCK is only set
when the user is shrinking the rx buffer. After this patch is applied,
the bandwidth in case 2 recovers to 9.34 Gbits/sec.
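
For reference (not part of the patch): iperf's -w option roughly boils
down to a plain SO_RCVBUF setsockopt() before listen(). A minimal
userspace sketch that reproduces the case 2 setup:

    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = socket(AF_INET, SOCK_STREAM, 0);
            int val = 425984;               /* == net.ipv4.tcp_rmem[1] */
            socklen_t len = sizeof(val);
            struct sockaddr_in addr = {
                    .sin_family = AF_INET,
                    .sin_port = htons(5201),        /* iperf port */
            };

            /* This is the call that sets SOCK_RCVBUF_LOCK and turns off
             * receive buffer autotuning, even though val matches the
             * autotuning default.
             */
            setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &val, sizeof(val));

            bind(fd, (struct sockaddr *)&addr, sizeof(addr));
            listen(fd, 1);

            /* getsockopt() reports the doubled value the kernel kept */
            getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &val, &len);
            printf("effective SO_RCVBUF: %d\n", val);
            close(fd);
            return 0;
    }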

Fixes: e88c64f0a425 ("tcp: allow effective reduction of TCP's rcv-buffer via setsockopt")
Signed-off-by: Jason Xing <xingwanli@kuaishou.com>
---
 net/core/filter.c | 7 ++++---
 net/core/sock.c   | 8 +++++---
 2 files changed, 9 insertions(+), 6 deletions(-)

Comments

Eric Dumazet Feb. 15, 2022, 3:25 p.m. UTC | #1
On Tue, Feb 15, 2022 at 2:37 AM <kerneljasonxing@gmail.com> wrote:
>
> From: Jason Xing <xingwanli@kuaishou.com>
>
> Normally, users don't care about the logic behind the kernel when they
> set the receive buffer via setsockopt(). However, even if the new value
> of the receive buffer is not smaller than the initial value, which is
> sysctl_tcp_rmem[1] as used in tcp_rcv_space_adjust(), the buffer gets
> locked, the server's wscale shrinks, and bandwidth suffers. I think
> that is not appropriate.

Then do not use SO_RCVBUF ?

It is working as intended really.

>
> Here are some numbers:
> $ sysctl -a | grep rmem
> net.core.rmem_default = 212992
> net.core.rmem_max = 40880000
> net.ipv4.tcp_rmem = 4096        425984  40880000
>
> Case 1
> on the server side
>     # iperf -s -p 5201
> on the client side
>     # iperf -c [server ip] -p 5201
> It turns out that the bandwidth is 9.34 Gbits/sec while the wscale on
> the server side is 10. It's good.
>
> Case 2
> on the server side
>     # iperf -s -p 5201 -w 425984
> on the client side
>     # iperf -c [server ip] -p 5201
> It turns out that the bandwidth drops to 2.73 Gbits/sec while the
> wscale is 2, even though the receive buffer was not reduced at all
> from its initial value.

Great, you discovered auto tuning is working as intended.

>
> Therefore, I added a condition so that SOCK_RCVBUF_LOCK is only set
> when the user is shrinking the rx buffer. After this patch is applied,
> the bandwidth in case 2 recovers to 9.34 Gbits/sec.
>
> Fixes: e88c64f0a425 ("tcp: allow effective reduction of TCP's rcv-buffer via setsockopt")

This commit has nothing to do with your patch or feature.

> Signed-off-by: Jason Xing <xingwanli@kuaishou.com>
> ---
>  net/core/filter.c | 7 ++++---
>  net/core/sock.c   | 8 +++++---
>  2 files changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 4603b7c..99f5d9c 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4795,9 +4795,10 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname,
>                 case SO_RCVBUF:
>                         val = min_t(u32, val, sysctl_rmem_max);
>                         val = min_t(int, val, INT_MAX / 2);
> -                       sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> -                       WRITE_ONCE(sk->sk_rcvbuf,
> -                                  max_t(int, val * 2, SOCK_MIN_RCVBUF));
> +                       val = max_t(int, val * 2, SOCK_MIN_RCVBUF);
> +                       if (val < sock_net(sk)->ipv4.sysctl_tcp_rmem[1])
> +                               sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> +                       WRITE_ONCE(sk->sk_rcvbuf, val);
>                         break;
>                 case SO_SNDBUF:
>                         val = min_t(u32, val, sysctl_wmem_max);
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 4ff806d..e5e9cb0 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -923,8 +923,6 @@ static void __sock_set_rcvbuf(struct sock *sk, int val)
>          * as a negative value.
>          */
>         val = min_t(int, val, INT_MAX / 2);
> -       sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> -
>         /* We double it on the way in to account for "struct sk_buff" etc.
>          * overhead.   Applications assume that the SO_RCVBUF setting they make
>          * will allow that much actual data to be received on that socket.
> @@ -935,7 +933,11 @@ static void __sock_set_rcvbuf(struct sock *sk, int val)
>          * And after considering the possible alternatives, returning the value
>          * we actually used in getsockopt is the most desirable behavior.
>          */
> -       WRITE_ONCE(sk->sk_rcvbuf, max_t(int, val * 2, SOCK_MIN_RCVBUF));
> +       val = max_t(int, val * 2, SOCK_MIN_RCVBUF);
> +       if (val < sock_net(sk)->ipv4.sysctl_tcp_rmem[1])
> +               sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> +
> +       WRITE_ONCE(sk->sk_rcvbuf, val);
>  }
>
>  void sock_set_rcvbuf(struct sock *sk, int val)

You are breaking applications that want to set sk->sk_rcvbuf to a fixed value,
to control memory usage on millions of active sockets in a host.

I think you want new functionality, with new SO_ socket options,
targeting the net-next tree (no spurious Fixes: tag).

For instance, letting an application set or unset SOCK_RCVBUF_LOCK
would be more useful, and would not break applications.

Jason Xing Feb. 15, 2022, 4 p.m. UTC | #2
On Tue, Feb 15, 2022 at 11:25 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Feb 15, 2022 at 2:37 AM <kerneljasonxing@gmail.com> wrote:
> >
> > From: Jason Xing <xingwanli@kuaishou.com>
> >
> > Normally, users don't care about the logic behind the kernel when they
> > set the receive buffer via setsockopt(). However, even if the new value
> > of the receive buffer is not smaller than the initial value, which is
> > sysctl_tcp_rmem[1] as used in tcp_rcv_space_adjust(), the buffer gets
> > locked, the server's wscale shrinks, and bandwidth suffers. I think
> > that is not appropriate.
>
> Then do not use SO_RCVBUF ?
>
> It is working as intended really.
>
> >
> > Here are some numbers:
> > $ sysctl -a | grep rmem
> > net.core.rmem_default = 212992
> > net.core.rmem_max = 40880000
> > net.ipv4.tcp_rmem = 4096        425984  40880000
> >
> > Case 1
> > on the server side
> >     # iperf -s -p 5201
> > on the client side
> >     # iperf -c [server ip] -p 5201
> > It turns out that the bandwidth is 9.34 Gbits/sec while the wscale on
> > the server side is 10. It's good.
> >
> > Case 2
> > on the server side
> >     # iperf -s -p 5201 -w 425984
> > on the client side
> >     # iperf -c [server ip] -p 5201
> > It turns out that the bandwidth drops to 2.73 Gbits/sec while the
> > wscale is 2, even though the receive buffer was not reduced at all
> > from its initial value.
>
> Great, you discovered auto tuning is working as intended.
>

Thanks.

> >
> > Therefore, I added a condition so that SOCK_RCVBUF_LOCK is only set
> > when the user is shrinking the rx buffer. After this patch is applied,
> > the bandwidth in case 2 recovers to 9.34 Gbits/sec.
> >
> > Fixes: e88c64f0a425 ("tcp: allow effective reduction of TCP's rcv-buffer via setsockopt")
>
> This commit has nothing to do with your patch or feature.
>
> > [... full patch snipped; quoted above ...]
>
> You are breaking applications that want to set sk->sk_rcvbuf to a fixed value,
> to control memory usage on millions of active sockets in a host.
>
> I think you want new functionality, with new SO_ socket options,
> targeting the net-next tree (no spurious Fixes: tag).

I agree with you.

> For instance, letting an application set or unset SOCK_RCVBUF_LOCK
> would be more useful, and would not break applications.

I would like to change the title for the v2 patch, since the title of
this version is a little misleading, to something like "net: introduce
SO_RCVBUFAUTO to unset SOCK_RCVBUF_LOCK flag".

Eric, what do you think of SO_RCVBUFAUTO as the name for this new
socket option?

Thanks,
Jason
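
(A purely hypothetical sketch of what such an option could look like
from userspace; SO_RCVBUFAUTO and its numeric value are invented here
for illustration and exist in no kernel:)

    #include <sys/socket.h>

    #ifndef SO_RCVBUFAUTO
    #define SO_RCVBUFAUTO 74        /* made-up option number */
    #endif

    static void pin_then_autotune(int fd)
    {
            int val = 425984;
            int on = 1;

            /* Pin the buffer: sets SOCK_RCVBUF_LOCK, as today */
            setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &val, sizeof(val));

            /* ... run with a fixed, predictable memory footprint ... */

            /* Proposed behavior: clear SOCK_RCVBUF_LOCK so that
             * autotuning may grow the buffer again.
             */
            setsockopt(fd, SOL_SOCKET, SO_RCVBUFAUTO, &on, sizeof(on));
    }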

Patch

diff --git a/net/core/filter.c b/net/core/filter.c
index 4603b7c..99f5d9c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4795,9 +4795,10 @@  static int _bpf_setsockopt(struct sock *sk, int level, int optname,
 		case SO_RCVBUF:
 			val = min_t(u32, val, sysctl_rmem_max);
 			val = min_t(int, val, INT_MAX / 2);
-			sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
-			WRITE_ONCE(sk->sk_rcvbuf,
-				   max_t(int, val * 2, SOCK_MIN_RCVBUF));
+			val = max_t(int, val * 2, SOCK_MIN_RCVBUF);
+			if (val < sock_net(sk)->ipv4.sysctl_tcp_rmem[1])
+				sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
+			WRITE_ONCE(sk->sk_rcvbuf, val);
 			break;
 		case SO_SNDBUF:
 			val = min_t(u32, val, sysctl_wmem_max);
diff --git a/net/core/sock.c b/net/core/sock.c
index 4ff806d..e5e9cb0 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -923,8 +923,6 @@  static void __sock_set_rcvbuf(struct sock *sk, int val)
 	 * as a negative value.
 	 */
 	val = min_t(int, val, INT_MAX / 2);
-	sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
-
 	/* We double it on the way in to account for "struct sk_buff" etc.
 	 * overhead.   Applications assume that the SO_RCVBUF setting they make
 	 * will allow that much actual data to be received on that socket.
@@ -935,7 +933,11 @@  static void __sock_set_rcvbuf(struct sock *sk, int val)
 	 * And after considering the possible alternatives, returning the value
 	 * we actually used in getsockopt is the most desirable behavior.
 	 */
-	WRITE_ONCE(sk->sk_rcvbuf, max_t(int, val * 2, SOCK_MIN_RCVBUF));
+	val = max_t(int, val * 2, SOCK_MIN_RCVBUF);
+	if (val < sock_net(sk)->ipv4.sysctl_tcp_rmem[1])
+		sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
+
+	WRITE_ONCE(sk->sk_rcvbuf, val);
 }
 
 void sock_set_rcvbuf(struct sock *sk, int val)
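
To see the new condition with the commit-message numbers: a standalone
sketch of the value transformation above (SOCK_MIN_RCVBUF is given an
illustrative value here; the real one is derived from skb truesizes):

    #include <stdio.h>

    #define SOCK_MIN_RCVBUF 2304            /* illustrative only */

    int main(void)
    {
            int rmem_max   = 40880000;      /* net.core.rmem_max    */
            int tcp_rmem_1 = 425984;        /* net.ipv4.tcp_rmem[1] */
            int val        = 425984;        /* iperf -w 425984      */

            /* Same clamping and doubling as the patched paths */
            if (val > rmem_max)
                    val = rmem_max;
            if (val > 2147483647 / 2)       /* INT_MAX / 2 */
                    val = 2147483647 / 2;
            val = val * 2 > SOCK_MIN_RCVBUF ? val * 2 : SOCK_MIN_RCVBUF;

            /* 851968 >= 425984, so with this patch SOCK_RCVBUF_LOCK
             * stays unset and autotuning can still grow the buffer.
             */
            printf("sk_rcvbuf = %d, lock %s\n", val,
                   val < tcp_rmem_1 ? "set" : "not set");
            return 0;
    }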