[net] tcp: fix receive buffer autotuning to trigger for any valid advertised MSS

Message ID 20201207114049.7634-1-abuehaze@amazon.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series [net] tcp: fix receive buffer autotuning to trigger for any valid advertised MSS | expand

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present fail Series targets non-next tree, but doesn't contain any Fixes tags
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for net
netdev/subject_prefix success Link
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 2 this patch: 2
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch warning WARNING: line length of 95 exceeds 80 columns
netdev/build_allmodconfig_warn success Errors and warnings before: 2 this patch: 2
netdev/header_inline success Link
netdev/stable success Stable not CCed

Commit Message

Mohamed Abuelfotoh, Hazem Dec. 7, 2020, 11:40 a.m. UTC
Previously, receiver buffer auto-tuning started only after one
    advertised window worth of data had been received. After the
    initial receive buffer was raised by commit a337531b942b ("tcp: up
    initial rmem to 128KB and SYN rwin to around 64KB"), it may take
    too long for TCP autotuning to start raising the receive buffer
    size.
    Commit 041a14d26715 ("tcp: start receiver buffer autotuning sooner")
    tried to lower the threshold at which TCP auto-tuning starts, but it
    does not work well in some environments where the receiver has a
    large MTU (9001), especially on high-RTT connections: there,
    rcvq_space.space ends up equal to rcv_wnd, so TCP autotuning never
    starts because the sender cannot send more than rcv_wnd bytes in
    one round trip.
    To address this, decrease the initial rcvq_space.space so that TCP
    autotuning kicks in whenever the sender is able to send more than
    5360 bytes in one round trip, regardless of the receiver's
    configured MTU.

    Fixes: a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
    Fixes: 041a14d26715 ("tcp: start receiver buffer autotuning sooner")

Signed-off-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com>
---
 net/ipv4/tcp_input.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
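
To put numbers on the changelog, the standalone sketch below mirrors the
min_t() expression in tcp_init_buffer_space() before and after the patch.
The concrete values are assumptions, not taken from the patch: rcv_wnd =
65535 (the ~64 KB initial window), advmss = 8961 (a 9001-byte MTU minus 40
bytes of IPv4/TCP headers), and an initial rcv_mss of 536 bytes
(TCP_MSS_DEFAULT, before any data samples exist).

#include <stdio.h>

#define TCP_INIT_CWND 10U

/* Mirrors min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * mss): mss is
 * tp->advmss before the patch, icsk->icsk_ack.rcv_mss after it. */
static unsigned int initial_rcvq_space(unsigned int rcv_wnd, unsigned int mss)
{
	unsigned int win = TCP_INIT_CWND * mss;

	return win < rcv_wnd ? win : rcv_wnd;
}

int main(void)
{
	printf("before: %u\n", initial_rcvq_space(65535, 8961)); /* 65535 == rcv_wnd */
	printf("after:  %u\n", initial_rcvq_space(65535, 536));  /* 5360 */
	return 0;
}

With the pre-patch value equal to rcv_wnd, the bytes copied to userspace in
one RTT can never exceed rcvq_space.space, so autotuning never triggers;
with 5360 it triggers as soon as the sender gets more than 5360 bytes
through in one round trip.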

Comments

Eric Dumazet Dec. 7, 2020, 3:37 p.m. UTC | #1
On Mon, Dec 7, 2020 at 12:41 PM Hazem Mohamed Abuelfotoh
<abuehaze@amazon.com> wrote:
>
>     Previously receiver buffer auto-tuning starts after receiving
>     one advertised window amount of data.After the initial
>     receiver buffer was raised by
>     commit a337531b942b ("tcp: up initial rmem to 128KB
>     and SYN rwin to around 64KB"),the receiver buffer may
>     take too long for TCP autotuning to start raising
>     the receiver buffer size.
>     commit 041a14d26715 ("tcp: start receiver buffer autotuning sooner")
>     tried to decrease the threshold at which TCP auto-tuning starts
>     but it's doesn't work well in some environments
>     where the receiver has large MTU (9001) especially with high RTT
>     connections as in these environments rcvq_space.space will be the same
>     as rcv_wnd so TCP autotuning will never start because
>     sender can't send more than rcv_wnd size in one round trip.
>     To address this issue this patch is decreasing the initial
>     rcvq_space.space so TCP autotuning kicks in whenever the sender is
>     able to send more than 5360 bytes in one round trip regardless the
>     receiver's configured MTU.
>
>     Fixes: a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
>     Fixes: 041a14d26715 ("tcp: start receiver buffer autotuning sooner")
>
> Signed-off-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com>
> ---
>  net/ipv4/tcp_input.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 389d1b340248..f0ffac9e937b 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -504,13 +504,14 @@ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
>  static void tcp_init_buffer_space(struct sock *sk)
>  {
>         int tcp_app_win = sock_net(sk)->ipv4.sysctl_tcp_app_win;
> +       struct inet_connection_sock *icsk = inet_csk(sk);
>         struct tcp_sock *tp = tcp_sk(sk);
>         int maxwin;
>
>         if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK))
>                 tcp_sndbuf_expand(sk);
>
> -       tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * tp->advmss);
> +       tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * icsk->icsk_ack.rcv_mss);

I find using icsk->icsk_ack.rcv_mss misleading.

I would either use TCP_MSS_DEFAULT, or maybe simply 0; since we have no
samples yet, there is little point in using a magic value.

Note that if a driver uses 16KB of memory to hold a 1500-byte packet,
then a 10-MSS GRO packet consumes 160 KB of memory,
which is bigger than tcp_rmem[1]. TCP could decide to drop these fat packets.
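
Spelling out that arithmetic (the 16 KB per-frame truesize and the 128 KB
tcp_rmem[1] default are taken from the example above, not measured):

#include <stdio.h>

int main(void)
{
	unsigned int truesize_per_frame = 16 * 1024; /* driver memory per 1500-byte frame */
	unsigned int gro_segs = 10;                  /* one 10-MSS GRO aggregate */
	unsigned int tcp_rmem_1 = 128 * 1024;        /* default tcp_rmem[1] since a337531b942b */

	/* 163840 > 131072: one GRO packet can already exceed tcp_rmem[1] */
	printf("GRO truesize: %u, tcp_rmem[1]: %u\n",
	       gro_segs * truesize_per_frame, tcp_rmem_1);
	return 0;
}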

I wonder if your patch does not work around a more fundamental issue;
I am still unable to reproduce the issue.
Eric Dumazet Dec. 7, 2020, 4:08 p.m. UTC | #2
On Mon, Dec 7, 2020 at 4:37 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Dec 7, 2020 at 12:41 PM Hazem Mohamed Abuelfotoh
> <abuehaze@amazon.com> wrote:
> >
> >     Previously receiver buffer auto-tuning starts after receiving
> >     one advertised window amount of data.After the initial
> >     receiver buffer was raised by
> >     commit a337531b942b ("tcp: up initial rmem to 128KB
> >     and SYN rwin to around 64KB"),the receiver buffer may
> >     take too long for TCP autotuning to start raising
> >     the receiver buffer size.
> >     commit 041a14d26715 ("tcp: start receiver buffer autotuning sooner")
> >     tried to decrease the threshold at which TCP auto-tuning starts
> >     but it's doesn't work well in some environments
> >     where the receiver has large MTU (9001) especially with high RTT
> >     connections as in these environments rcvq_space.space will be the same
> >     as rcv_wnd so TCP autotuning will never start because
> >     sender can't send more than rcv_wnd size in one round trip.
> >     To address this issue this patch is decreasing the initial
> >     rcvq_space.space so TCP autotuning kicks in whenever the sender is
> >     able to send more than 5360 bytes in one round trip regardless the
> >     receiver's configured MTU.
> >
> >     Fixes: a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
> >     Fixes: 041a14d26715 ("tcp: start receiver buffer autotuning sooner")
> >
> > Signed-off-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com>
> > ---
> >  net/ipv4/tcp_input.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 389d1b340248..f0ffac9e937b 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -504,13 +504,14 @@ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
> >  static void tcp_init_buffer_space(struct sock *sk)
> >  {
> >         int tcp_app_win = sock_net(sk)->ipv4.sysctl_tcp_app_win;
> > +       struct inet_connection_sock *icsk = inet_csk(sk);
> >         struct tcp_sock *tp = tcp_sk(sk);
> >         int maxwin;
> >
> >         if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK))
> >                 tcp_sndbuf_expand(sk);
> >
> > -       tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * tp->advmss);
> > +       tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * icsk->icsk_ack.rcv_mss);
>
> I find using icsk->icsk_ack.rcv_mss misleading.
>
> I would either use TCP_MSS_DEFAULT , or maybe simply 0, since we had
> no samples yet, there is little point to use a magic value.

0 will not work, since we use a do_div(grow, tp->rcvq_space.space)

>
> Note that if a driver uses 16KB of memory to hold a 1500 bytes packet,
> then a 10 MSS GRO packet is consuming 160 KB of memory,
> which is bigger than tcp_rmem[1]. TCP could decide to drop these fat packets.
>
> I wonder if your patch does not work around a more fundamental issue,
> I am still unable to reproduce the issue.
Mohamed Abuelfotoh, Hazem Dec. 7, 2020, 5:49 p.m. UTC | #3
> I find using icsk->icsk_ack.rcv_mss misleading.
> I would either use TCP_MSS_DEFAULT , or maybe simply 0, since we had
> no samples yet, there is little point to use a magic value.

My point is that, by definition, rcvq_space is used in TCP's internal auto-tuning to grow the socket buffers based on how much data the kernel estimates the sender can send, so we are talking about an estimate anyway, and it won't be 100% accurate, especially during connection initialization. I proposed using TCP_INIT_CWND * icsk->icsk_ack.rcv_mss as the initial receive space because it is the minimum the sender can send, assuming they fill their initial congestion window. In my opinion this shouldn't have a security impact, because the receive buffer and receive window won't scale unless the sender delivers more than 5360 bytes in one round trip, so anything lower than that won't trigger TCP autotuning.

Another point is that the proposed patch only affects the initial rcvq_space.space; it has no impact on how the receive window/receive buffer will scale while the connection is ongoing.
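
For context, the growth decision in question lives in tcp_rcv_space_adjust(); below is a heavily trimmed sketch keeping only the lines quoted in the options that follow (the real function also rate-limits on the measured receive RTT and clamps the result against tcp_rmem[2]):

/* net/ipv4/tcp_input.c, trimmed for illustration */
static void tcp_rcv_space_adjust(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);
	int copied;

	/* Number of bytes copied to user in last RTT */
	copied = tp->copied_seq - tp->rcvq_space.seq;
	if (copied <= tp->rcvq_space.space)
		goto new_measure;	/* per-RTT threshold not met: no growth */

	/* ... grow sk->sk_rcvbuf and tp->rcv_ssthresh ... */
	tp->rcvq_space.space = copied;

new_measure:
	tp->rcvq_space.seq = tp->copied_seq;
	tp->rcvq_space.time = tp->tcp_mstamp;
}

When rcvq_space.space starts out equal to rcv_wnd, copied (at most one window per RTT) never exceeds the threshold and the function always takes the new_measure path, which is the stall described in the changelog.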

I was actually considering the options below before proposing the final patch, but I am afraid they either come with a higher memory footprint or let the connection get stuck later under packet loss. I am listing them here as well.


A) The patch below would be enough, as the user-copied data per RTT would usually be more than 50KB, so the window will scale.


# diff -u a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c 
--- a/net/ipv4/tcp_input.c	2020-11-18 19:54:23.624306129 +0000
+++ b/net/ipv4/tcp_input.c	2020-11-18 19:55:05.032259419 +0000
@@ -605,7 +605,7 @@
 
 	/* Number of bytes copied to user in last RTT */
 	copied = tp->copied_seq - tp->rcvq_space.seq;
-	if (copied <= tp->rcvq_space.space)
+	if (copied <= (tp->rcvq_space.space >> 1))
 		goto new_measure;


Pros:
- It halves the threshold at which we scale up the receive buffers and the advertised window, so connections should no longer get stuck with the current default kernel configuration on high-RTT paths.
- The likelihood of getting stuck later is low because the change is dynamic and affects DRS over the whole connection lifetime, not just during the initialization phase.

Cons:
- This may have a higher memory footprint because we would grow the receive buffer at a lower threshold than before.


################################################################################################################################

B) Otherwise we could look into decreasing the initial rcv_wnd, and accordingly rcvq_space.space, by lowering the default ipv4.sysctl_tcp_rmem[1] from 131072 back to 87380, as it was before https://lore.kernel.org/patchwork/patch/1157936/.

#################################################################################################################################


C) Another solution would be to bound the current rcvq_space.space by the user-copied amount of the previous RTT; something like the below would also work.


--- a/net/ipv4/tcp_input.c      2020-11-19 15:43:10.441524021 +0000
+++ b/net/ipv4/tcp_input.c      2020-11-19 15:45:42.772614521 +0000
@@ -649,6 +649,7 @@
        tp->rcvq_space.space = copied;
 
 new_measure:
+       tp->rcvq_space.space = copied;
        tp->rcvq_space.seq = tp->copied_seq;
        tp->rcvq_space.time = tp->tcp_mstamp;
 }

Cons:
- When I emulated packet loss after connection establishment, I could see the connection get stuck at low speed for a long time.


On 07/12/2020, 16:09, "Eric Dumazet" <edumazet@google.com> wrote:

    On Mon, Dec 7, 2020 at 4:37 PM Eric Dumazet <edumazet@google.com> wrote:
    >
    > On Mon, Dec 7, 2020 at 12:41 PM Hazem Mohamed Abuelfotoh
    > <abuehaze@amazon.com> wrote:
    > >
    > >     Previously receiver buffer auto-tuning starts after receiving
    > >     one advertised window amount of data.After the initial
    > >     receiver buffer was raised by
    > >     commit a337531b942b ("tcp: up initial rmem to 128KB
    > >     and SYN rwin to around 64KB"),the receiver buffer may
    > >     take too long for TCP autotuning to start raising
    > >     the receiver buffer size.
    > >     commit 041a14d26715 ("tcp: start receiver buffer autotuning sooner")
    > >     tried to decrease the threshold at which TCP auto-tuning starts
    > >     but it's doesn't work well in some environments
    > >     where the receiver has large MTU (9001) especially with high RTT
    > >     connections as in these environments rcvq_space.space will be the same
    > >     as rcv_wnd so TCP autotuning will never start because
    > >     sender can't send more than rcv_wnd size in one round trip.
    > >     To address this issue this patch is decreasing the initial
    > >     rcvq_space.space so TCP autotuning kicks in whenever the sender is
    > >     able to send more than 5360 bytes in one round trip regardless the
    > >     receiver's configured MTU.
    > >
    > >     Fixes: a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
    > >     Fixes: 041a14d26715 ("tcp: start receiver buffer autotuning sooner")
    > >
    > > Signed-off-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com>
    > > ---
    > >  net/ipv4/tcp_input.c | 3 ++-
    > >  1 file changed, 2 insertions(+), 1 deletion(-)
    > >
    > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
    > > index 389d1b340248..f0ffac9e937b 100644
    > > --- a/net/ipv4/tcp_input.c
    > > +++ b/net/ipv4/tcp_input.c
    > > @@ -504,13 +504,14 @@ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
    > >  static void tcp_init_buffer_space(struct sock *sk)
    > >  {
    > >         int tcp_app_win = sock_net(sk)->ipv4.sysctl_tcp_app_win;
    > > +       struct inet_connection_sock *icsk = inet_csk(sk);
    > >         struct tcp_sock *tp = tcp_sk(sk);
    > >         int maxwin;
    > >
    > >         if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK))
    > >                 tcp_sndbuf_expand(sk);
    > >
    > > -       tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * tp->advmss);
    > > +       tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * icsk->icsk_ack.rcv_mss);
    >
    > I find using icsk->icsk_ack.rcv_mss misleading.
    >
    > I would either use TCP_MSS_DEFAULT , or maybe simply 0, since we had
    > no samples yet, there is little point to use a magic value.

    0 will not work, since we use a do_div(grow, tp->rcvq_space.space)

    >
    > Note that if a driver uses 16KB of memory to hold a 1500 bytes packet,
    > then a 10 MSS GRO packet is consuming 160 KB of memory,
    > which is bigger than tcp_rmem[1]. TCP could decide to drop these fat packets.
    >
    > I wonder if your patch does not work around a more fundamental issue,
    > I am still unable to reproduce the issue.

Patch

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 389d1b340248..f0ffac9e937b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -504,13 +504,14 @@  static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
 static void tcp_init_buffer_space(struct sock *sk)
 {
 	int tcp_app_win = sock_net(sk)->ipv4.sysctl_tcp_app_win;
+	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 	int maxwin;
 
 	if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK))
 		tcp_sndbuf_expand(sk);
 
-	tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * tp->advmss);
+	tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * icsk->icsk_ack.rcv_mss);
 	tcp_mstamp_refresh(tp);
 	tp->rcvq_space.time = tp->tcp_mstamp;
 	tp->rcvq_space.seq = tp->copied_seq;