diff mbox series

[v2,net-next] net: introduce SO_RCVBUFAUTO to let the rcv_buf tune automatically

Message ID 20220216050320.3222-1-kerneljasonxing@gmail.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series [v2,net-next] net: introduce SO_RCVBUFAUTO to let the rcv_buf tune automatically | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit fail Errors and warnings before: 6800 this patch: 6529
netdev/cc_maintainers warning 2 maintainers not CCed: arnd@arndb.de linux-arch@vger.kernel.org
netdev/build_clang fail Errors and warnings before: 1458 this patch: 1424
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn fail Errors and warnings before: 12066 this patch: 11755
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 54 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Jason Xing Feb. 16, 2022, 5:03 a.m. UTC
From: Jason Xing <xingwanli@kuaishou.com>

Normally, user doesn't care the logic behind the kernel if they're
trying to set receive buffer via setsockopt. However, once the new
value of the receive buffer is set even though it's not smaller than
the initial value which is sysctl_tcp_rmem[1] implemented in
tcp_rcv_space_adjust(),, the server's wscale will shrink and then
lead to the bad bandwidth as intended.

For now, introducing a new socket option to let the receive buffer
grow automatically no matter what the new value is can solve
the bad bandwidth issue meanwhile it's not breaking the application
with SO_RCVBUF option set.

Here are some numbers:
$ sysctl -a | grep rmem
net.core.rmem_default = 212992
net.core.rmem_max = 40880000
net.ipv4.tcp_rmem = 4096	425984	40880000

Case 1
on the server side
    # iperf -s -p 5201
on the client side
    # iperf -c [client ip] -p 5201
It turns out that the bandwidth is 9.34 Gbits/sec while the wscale of
server side is 10. It's good.

Case 2
on the server side
    #iperf -s -p 5201 -w 425984
on the client side
    # iperf -c [client ip] -p 5201
It turns out that the bandwidth is reduced to 2.73 Gbits/sec while the
wcale is 2, even though the receive buffer is not changed at all at the
very beginning.

After this patch is applied, the bandwidth of case 2 is recovered to
9.34 Gbits/sec as expected at the cost of consuming more memory per
socket.

Signed-off-by: Jason Xing <xingwanli@kuaishou.com>
--
v2: suggested by Eric
- introduce new socket option instead of breaking the logic in SO_RCVBUF
- Adjust the title and description of this patch
link: https://lore.kernel.org/lkml/CANn89iL8vOUOH9bZaiA-cKcms+PotuKCxv7LpVx3RF0dDDSnmg@mail.gmail.com/
---
 include/uapi/asm-generic/socket.h |  1 +
 net/core/sock.c                   | 18 +++++++++++++-----
 2 files changed, 14 insertions(+), 5 deletions(-)

Comments

Eric Dumazet Feb. 16, 2022, 6:24 a.m. UTC | #1
On Tue, Feb 15, 2022 at 9:03 PM <kerneljasonxing@gmail.com> wrote:
>
> From: Jason Xing <xingwanli@kuaishou.com>
>
> Normally, user doesn't care the logic behind the kernel if they're
> trying to set receive buffer via setsockopt. However, once the new
> value of the receive buffer is set even though it's not smaller than
> the initial value which is sysctl_tcp_rmem[1] implemented in
> tcp_rcv_space_adjust(),, the server's wscale will shrink and then
> lead to the bad bandwidth as intended.

Quite confusing changelog, honestly.

Users of SO_RCVBUF specifically told the kernel : I want to use _this_
buffer size, I do not want the kernel to decide for me.

Also, I think your changelog does not really explain that _if_ you set
SO_RCVBUF to a small value before
connect() or in general the 3WHS, the chosen wscale will be small, and
this won't allow future 10x increase
of the effective RWIN.


>
> For now, introducing a new socket option to let the receive buffer
> grow automatically no matter what the new value is can solve
> the bad bandwidth issue meanwhile it's not breaking the application
> with SO_RCVBUF option set.
>
> Here are some numbers:
> $ sysctl -a | grep rmem
> net.core.rmem_default = 212992
> net.core.rmem_max = 40880000
> net.ipv4.tcp_rmem = 4096        425984  40880000
>
> Case 1
> on the server side
>     # iperf -s -p 5201
> on the client side
>     # iperf -c [client ip] -p 5201
> It turns out that the bandwidth is 9.34 Gbits/sec while the wscale of
> server side is 10. It's good.
>
> Case 2
> on the server side
>     #iperf -s -p 5201 -w 425984
> on the client side
>     # iperf -c [client ip] -p 5201
> It turns out that the bandwidth is reduced to 2.73 Gbits/sec while the
> wcale is 2, even though the receive buffer is not changed at all at the
> very beginning.
>
> After this patch is applied, the bandwidth of case 2 is recovered to
> 9.34 Gbits/sec as expected at the cost of consuming more memory per
> socket.

How does your patch allow wscale to increase after flow is established ?

I would remove from the changelog these experimental numbers that look
quite wrong,
maybe copy/pasted from your prior version.

Instead I would describe why an application might want to clear the
'receive buffer size is locked' socket attribute.

>
> Signed-off-by: Jason Xing <xingwanli@kuaishou.com>
> --
> v2: suggested by Eric
> - introduce new socket option instead of breaking the logic in SO_RCVBUF
> - Adjust the title and description of this patch
> link: https://lore.kernel.org/lkml/CANn89iL8vOUOH9bZaiA-cKcms+PotuKCxv7LpVx3RF0dDDSnmg@mail.gmail.com/
> ---
>

I think adding another parallel SO_RCVBUF option is not good. It is
adding confusion (and net/core/filter.c has been unchanged)

Also we want CRIU to work correctly.

So if you have a SO_XXXX setsockopt() call, you also need to provide
getsockopt() implementation.

I would suggest an option to clear or set SOCK_RCVBUF_LOCK,  and
getsockopt() would return if the bit is currently set or not.

Something clearly describing the intent, like SO_RCVBUF_LOCK maybe.
Jason Xing Feb. 16, 2022, 6:57 a.m. UTC | #2
On Wed, Feb 16, 2022 at 2:25 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Feb 15, 2022 at 9:03 PM <kerneljasonxing@gmail.com> wrote:
> >
> > From: Jason Xing <xingwanli@kuaishou.com>
> >
> > Normally, user doesn't care the logic behind the kernel if they're
> > trying to set receive buffer via setsockopt. However, once the new
> > value of the receive buffer is set even though it's not smaller than
> > the initial value which is sysctl_tcp_rmem[1] implemented in
> > tcp_rcv_space_adjust(),, the server's wscale will shrink and then
> > lead to the bad bandwidth as intended.
>
> Quite confusing changelog, honestly.
>
> Users of SO_RCVBUF specifically told the kernel : I want to use _this_
> buffer size, I do not want the kernel to decide for me.
>
> Also, I think your changelog does not really explain that _if_ you set
> SO_RCVBUF to a small value before
> connect() or in general the 3WHS, the chosen wscale will be small, and
> this won't allow future 10x increase
> of the effective RWIN.
>

Yes, you hit the point really.

>
> >
> > For now, introducing a new socket option to let the receive buffer
> > grow automatically no matter what the new value is can solve
> > the bad bandwidth issue meanwhile it's not breaking the application
> > with SO_RCVBUF option set.
> >
> > Here are some numbers:
> > $ sysctl -a | grep rmem
> > net.core.rmem_default = 212992
> > net.core.rmem_max = 40880000
> > net.ipv4.tcp_rmem = 4096        425984  40880000
> >
> > Case 1
> > on the server side
> >     # iperf -s -p 5201
> > on the client side
> >     # iperf -c [client ip] -p 5201
> > It turns out that the bandwidth is 9.34 Gbits/sec while the wscale of
> > server side is 10. It's good.
> >
> > Case 2
> > on the server side
> >     #iperf -s -p 5201 -w 425984
> > on the client side
> >     # iperf -c [client ip] -p 5201
> > It turns out that the bandwidth is reduced to 2.73 Gbits/sec while the
> > wcale is 2, even though the receive buffer is not changed at all at the
> > very beginning.
> >
> > After this patch is applied, the bandwidth of case 2 is recovered to
> > 9.34 Gbits/sec as expected at the cost of consuming more memory per
> > socket.
>
> How does your patch allow wscale to increase after flow is established ?
>
> I would remove from the changelog these experimental numbers that look
> quite wrong,
> maybe copy/pasted from your prior version.
>

My fault. I should have removed this part.

> Instead I would describe why an application might want to clear the
> 'receive buffer size is locked' socket attribute.
>
> >
> > Signed-off-by: Jason Xing <xingwanli@kuaishou.com>
> > --
> > v2: suggested by Eric
> > - introduce new socket option instead of breaking the logic in SO_RCVBUF
> > - Adjust the title and description of this patch
> > link: https://lore.kernel.org/lkml/CANn89iL8vOUOH9bZaiA-cKcms+PotuKCxv7LpVx3RF0dDDSnmg@mail.gmail.com/
> > ---
> >
>
> I think adding another parallel SO_RCVBUF option is not good. It is
> adding confusion (and net/core/filter.c has been unchanged)

I'll change the filter.c altogether in the next submission.

>
> Also we want CRIU to work correctly.
>
> So if you have a SO_XXXX setsockopt() call, you also need to provide
> getsockopt() implementation.
>
> I would suggest an option to clear or set SOCK_RCVBUF_LOCK,  and
> getsockopt() would return if the bit is currently set or not.
>
> Something clearly describing the intent, like SO_RCVBUF_LOCK maybe.

Just now, I found out that the latest kernel has merged a similar
patch (commit 04190bf89) about three months ago.

Is it still necessary to add another separate option to clear the
SOCK_RCVBUF_LOCK explicitly?

Thanks,
Jason
Eric Dumazet Feb. 16, 2022, 4:56 p.m. UTC | #3
On Tue, Feb 15, 2022 at 10:58 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> Just now, I found out that the latest kernel has merged a similar
> patch (commit 04190bf89) about three months ago.

There you go :)

>
> Is it still necessary to add another separate option to clear the
> SOCK_RCVBUF_LOCK explicitly?

What do you mean, SO_BUF_LOCK is all that is needed.
Jason Xing Feb. 17, 2022, 2:15 a.m. UTC | #4
On Thu, Feb 17, 2022 at 12:56 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Feb 15, 2022 at 10:58 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > Just now, I found out that the latest kernel has merged a similar
> > patch (commit 04190bf89) about three months ago.
>
> There you go :)
>
> >
> > Is it still necessary to add another separate option to clear the
> > SOCK_RCVBUF_LOCK explicitly?
>
> What do you mean, SO_BUF_LOCK is all that is needed.

Yeah, I think SO_BUF_LOCK is enough and we don't have to add a new
option like SOCK_RCVBUF_LOCK as we've talked about before. Thanks,
Eric.
diff mbox series

Patch

diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index c77a131..f4ce79b 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -18,6 +18,7 @@ 
 #define SO_RCVBUF	8
 #define SO_SNDBUFFORCE	32
 #define SO_RCVBUFFORCE	33
+#define SO_RCVBUFAUTO	74
 #define SO_KEEPALIVE	9
 #define SO_OOBINLINE	10
 #define SO_NO_CHECK	11
diff --git a/net/core/sock.c b/net/core/sock.c
index 4ff806d..8565684 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -917,13 +917,14 @@  void sock_set_keepalive(struct sock *sk)
 }
 EXPORT_SYMBOL(sock_set_keepalive);
 
-static void __sock_set_rcvbuf(struct sock *sk, int val)
+static void __sock_set_rcvbuf(struct sock *sk, int val, bool is_set)
 {
 	/* Ensure val * 2 fits into an int, to prevent max_t() from treating it
 	 * as a negative value.
 	 */
 	val = min_t(int, val, INT_MAX / 2);
-	sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
+	if (is_set)
+		sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
 
 	/* We double it on the way in to account for "struct sk_buff" etc.
 	 * overhead.   Applications assume that the SO_RCVBUF setting they make
@@ -941,7 +942,7 @@  static void __sock_set_rcvbuf(struct sock *sk, int val)
 void sock_set_rcvbuf(struct sock *sk, int val)
 {
 	lock_sock(sk);
-	__sock_set_rcvbuf(sk, val);
+	__sock_set_rcvbuf(sk, val, true);
 	release_sock(sk);
 }
 EXPORT_SYMBOL(sock_set_rcvbuf);
@@ -1106,7 +1107,7 @@  int sock_setsockopt(struct socket *sock, int level, int optname,
 		 * play 'guess the biggest size' games. RCVBUF/SNDBUF
 		 * are treated in BSD as hints
 		 */
-		__sock_set_rcvbuf(sk, min_t(u32, val, sysctl_rmem_max));
+		__sock_set_rcvbuf(sk, min_t(u32, val, sysctl_rmem_max), true);
 		break;
 
 	case SO_RCVBUFFORCE:
@@ -1118,7 +1119,14 @@  int sock_setsockopt(struct socket *sock, int level, int optname,
 		/* No negative values (to prevent underflow, as val will be
 		 * multiplied by 2).
 		 */
-		__sock_set_rcvbuf(sk, max(val, 0));
+		__sock_set_rcvbuf(sk, max(val, 0), true);
+		break;
+
+	case SO_RCVBUFAUTO:
+		/* Though similar to SO_RCVBUF, we do not use userlocks in
+		 * order to let the receive buffer tune automatically.
+		 */
+		__sock_set_rcvbuf(sk, min_t(u32, val, sysctl_rmem_max), false);
 		break;
 
 	case SO_KEEPALIVE: