
[net-next,2/2] tcp: correct handling of extreme memory squeeze

Message ID 20240406182107.261472-3-jmaloy@redhat.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series tcp: add support for SO_PEEK_OFF socket option

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 944 this patch: 944
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers warning 2 maintainers not CCed: dsahern@kernel.org pabeni@redhat.com
netdev/build_clang success Errors and warnings before: 956 this patch: 956
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 955 this patch: 955
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 27 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 1 this patch: 1
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-04-07--00-00 (tests: 956)

Commit Message

Jon Maloy April 6, 2024, 6:21 p.m. UTC
From: Jon Maloy <jmaloy@redhat.com>

Testing of the previous commit ("tcp: add support for SO_PEEK_OFF")
in this series along with the pasta protocol splicer revealed a bug in
the way tcp handles window advertising during extreme memory squeeze
situations.

The excerpt of the logging session below shows what is happening:

[5201<->54494]:     ==== Activating log @ tcp_select_window()/268 ====
[5201<->54494]:     (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) --> TRUE
[5201<->54494]:   tcp_select_window(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354, returning 0
[5201<->54494]:   ADVERTISING WINDOW SIZE 0
[5201<->54494]: __tcp_transmit_skb(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354

[5201<->54494]: tcp_recvmsg_locked(->)
[5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
[5201<->54494]:     NOT calling tcp_send_ack()
[5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: tcp_recvmsg_locked(<-) returning 131072 bytes, window now: 250164, qlen: 83

[...]

[5201<->54494]: tcp_recvmsg_locked(->)
[5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
[5201<->54494]:     NOT calling tcp_send_ack()
[5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: tcp_recvmsg_locked(<-) returning 131072 bytes, window now: 250164, qlen: 1

[5201<->54494]: tcp_recvmsg_locked(->)
[5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
[5201<->54494]:     NOT calling tcp_send_ack()
[5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: tcp_recvmsg_locked(<-) returning 57036 bytes, window now: 250164, qlen: 0

[5201<->54494]: tcp_recvmsg_locked(->)
[5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]:     NOT calling tcp_send_ack()
[5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: tcp_recvmsg_locked(<-) returning -11 bytes, window now: 250164, qlen: 0

We can see that although we are advertising a window size of zero,
tp->rcv_wnd is not updated accordingly. This leads to a discrepancy
between this side's and the peer's view of the current window size.
- The peer thinks the window is zero, and stops sending.
- This side ends up in a cycle where it repeatedly calculates a new
  window size it finds too small to advertise.

Hence no messages are received, no acknowledgments are sent, and
the situation remains locked even after the last queued receive buffer
has been consumed.
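
To make the lock-up concrete, the numbers from the log can be plugged
into the ACK heuristic directly. Below is a standalone illustration
(not kernel code); the condition mirrors the one printed by the
instrumented __tcp_cleanup_rbuf() above:

/* gcc -o squeeze-demo squeeze-demo.c && ./squeeze-demo */
#include <stdio.h>

int main(void)
{
	/* values taken from the log excerpt above */
	unsigned int rcv_wup = 2812454294u;
	unsigned int rcv_wnd = 5812224u;	/* stale: never set to 0 */
	unsigned int rcv_nxt = 2818016354u;
	unsigned int new_win = 262144u;		/* what __tcp_select_window() computed */

	/* tcp_receive_window(): space we still believe we have announced */
	unsigned int win_now = rcv_wup + rcv_wnd - rcv_nxt;	/* = 250164 */

	/* __tcp_cleanup_rbuf() only ACKs if the window at least doubles */
	int time_to_ack = new_win && new_win >= 2 * win_now;

	/* prints: win_now=250164 new_win=262144 time_to_ack=0 */
	printf("win_now=%u new_win=%u time_to_ack=%d\n",
	       win_now, new_win, time_to_ack);
	return 0;
}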

We fix this by setting tp->rcv_wnd to 0 before we return from the
function tcp_select_window() in this particular case.
Further testing shows that the connection recovers neatly from the
squeeze situation, and traffic can continue indefinitely.

Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
---
 net/ipv4/tcp_output.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

Comments

Eric Dumazet April 6, 2024, 6:37 p.m. UTC | #1
On Sat, Apr 6, 2024 at 8:21 PM <jmaloy@redhat.com> wrote:
>
> From: Jon Maloy <jmaloy@redhat.com>
>
> Testing of the previous commit ("tcp: add support for SO_PEEK_OFF")
> in this series along with the pasta protocol splicer revealed a bug in
> the way tcp handles window advertising during extreme memory squeeze
> situations.
>
> The excerpt of the logging session below shows what is happening:
>
> [5201<->54494]:     ==== Activating log @ tcp_select_window()/268 ====
> [5201<->54494]:     (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) --> TRUE
> [5201<->54494]:   tcp_select_window(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354, returning 0
> [5201<->54494]:   ADVERTISING WINDOW SIZE 0
> [5201<->54494]: __tcp_transmit_skb(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
>
> [5201<->54494]: tcp_recvmsg_locked(->)
> [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> [5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
> [5201<->54494]:     NOT calling tcp_send_ack()
> [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> [5201<->54494]: tcp_recvmsg_locked(<-) returning 131072 bytes, window now: 250164, qlen: 83
>
> [...]

I would prefer a packetdrill test, it is not clear what is happening...

In particular, have you used SO_RCVBUF ?

>
> [5201<->54494]: tcp_recvmsg_locked(->)
> [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> [5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
> [5201<->54494]:     NOT calling tcp_send_ack()
> [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> [5201<->54494]: tcp_recvmsg_locked(<-) returning 131072 bytes, window now: 250164, qlen: 1
>
> [5201<->54494]: tcp_recvmsg_locked(->)
> [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> [5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
> [5201<->54494]:     NOT calling tcp_send_ack()
> [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> [5201<->54494]: tcp_recvmsg_locked(<-) returning 57036 bytes, window now: 250164, qlen: 0
>
> [5201<->54494]: tcp_recvmsg_locked(->)
> [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> [5201<->54494]:     NOT calling tcp_send_ack()
> [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> [5201<->54494]: tcp_recvmsg_locked(<-) returning -11 bytes, window now: 250164, qlen: 0
>
> We can see that although we are advertising a window size of zero,
> tp->rcv_wnd is not updated accordingly. This leads to a discrepancy
> between this side's and the peer's view of the current window size.
> - The peer thinks the window is zero, and stops sending.
> - This side ends up in a cycle where it repeatedly calculates a new
>   window size it finds too small to advertise.
>
> Hence no messages are received, no acknowledgments are sent, and
> the situation remains locked even after the last queued receive buffer
> has been consumed.
>
> We fix this by setting tp->rcv_wnd to 0 before we return from the
> function tcp_select_window() in this particular case.
> Further testing shows that the connection recovers neatly from the
> squeeze situation, and traffic can continue indefinitely.
>
> Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> ---
>  net/ipv4/tcp_output.c | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 9282fafc0e61..57ead8f3c334 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -263,11 +263,15 @@ static u16 tcp_select_window(struct sock *sk)
>         u32 cur_win, new_win;
>
>         /* Make the window 0 if we failed to queue the data because we
> -        * are out of memory. The window is temporary, so we don't store
> -        * it on the socket.
> +        * are out of memory. The window needs to be stored in the socket
> +        * for the connection to recover.
>          */
> -       if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
> -               return 0;
> +       if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
> +               new_win = 0;
> +               tp->rcv_wnd = 0;
> +               tp->rcv_wup = tp->rcv_nxt;
> +               goto out;
> +       }
>
>         cur_win = tcp_receive_window(tp);
>         new_win = __tcp_select_window(sk);
> @@ -301,7 +305,7 @@ static u16 tcp_select_window(struct sock *sk)
>
>         /* RFC1323 scaling applied */
>         new_win >>= tp->rx_opt.rcv_wscale;
> -
> +out:
>         /* If we advertise zero window, disable fast path. */
>         if (new_win == 0) {
>                 tp->pred_flags = 0;
> --
> 2.42.0
>

Any particular reason to not cc Menglong Dong ?
(I just did)

This code was added in

commit e2142825c120d4317abf7160a0fc34b3de532586
Author: Menglong Dong <imagedong@tencent.com>
Date:   Fri Aug 11 10:55:27 2023 +0800

    net: tcp: send zero-window ACK when no memory

    For now, skb will be dropped when no memory, which makes client keep
    retrans util timeout and it's not friendly to the users.

    In this patch, we reply an ACK with zero-window in this case to update
    the snd_wnd of the sender to 0. Therefore, the sender won't timeout the
    connection and will probe the zero-window with the retransmits.

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
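
For context, the mechanism that commit introduced works roughly like the
sketch below (a rough illustration of the idea only, not the exact hunks
of e2142825c120): when queueing the skb fails for lack of memory, the
socket is flagged so that the next ACK goes out advertising a zero
window instead of the segment being dropped silently.

/* receive path, on a memory-squeeze drop (illustrative sketch only) */
if (unlikely(tcp_try_rmem_schedule(sk, skb, skb->truesize))) {
	/* make tcp_select_window() advertise 0 on the next ACK */
	inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOMEM;
	inet_csk_schedule_ack(sk);
	goto drop;
}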
Jason Xing April 7, 2024, 6:51 a.m. UTC | #2
On Sun, Apr 7, 2024 at 2:38 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Sat, Apr 6, 2024 at 8:21 PM <jmaloy@redhat.com> wrote:
> >
> > From: Jon Maloy <jmaloy@redhat.com>
> >
> > Testing of the previous commit ("tcp: add support for SO_PEEK_OFF")
> > in this series along with the pasta protocol splicer revealed a bug in
> > the way tcp handles window advertising during extreme memory squeeze
> > situations.
> >
> > The excerpt of the logging session below shows what is happening:
> >
> > [5201<->54494]:     ==== Activating log @ tcp_select_window()/268 ====
> > [5201<->54494]:     (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) --> TRUE
> > [5201<->54494]:   tcp_select_window(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354, returning 0
> > [5201<->54494]:   ADVERTISING WINDOW SIZE 0
> > [5201<->54494]: __tcp_transmit_skb(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> >
> > [5201<->54494]: tcp_recvmsg_locked(->)
> > [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > [5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
> > [5201<->54494]:     NOT calling tcp_send_ack()
> > [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > [5201<->54494]: tcp_recvmsg_locked(<-) returning 131072 bytes, window now: 250164, qlen: 83
> >
> > [...]
>
> I would prefer a packetdrill test, it is not clear what is happening...
>
> In particular, have you used SO_RCVBUF ?
>
> >
> > [5201<->54494]: tcp_recvmsg_locked(->)
> > [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > [5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
> > [5201<->54494]:     NOT calling tcp_send_ack()
> > [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > [5201<->54494]: tcp_recvmsg_locked(<-) returning 131072 bytes, window now: 250164, qlen: 1
> >
> > [5201<->54494]: tcp_recvmsg_locked(->)
> > [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > [5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
> > [5201<->54494]:     NOT calling tcp_send_ack()
> > [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > [5201<->54494]: tcp_recvmsg_locked(<-) returning 57036 bytes, window now: 250164, qlen: 0
> >
> > [5201<->54494]: tcp_recvmsg_locked(->)
> > [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > [5201<->54494]:     NOT calling tcp_send_ack()
> > [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > [5201<->54494]: tcp_recvmsg_locked(<-) returning -11 bytes, window now: 250164, qlen: 0
> >
> > We can see that although we are advertising a window size of zero,
> > tp->rcv_wnd is not updated accordingly. This leads to a discrepancy
> > between this side's and the peer's view of the current window size.
> > - The peer thinks the window is zero, and stops sending.
> > - This side ends up in a cycle where it repeatedly calculates a new
> >   window size it finds too small to advertise.
> >
> > Hence no messages are received, no acknowledgments are sent, and
> > the situation remains locked even after the last queued receive buffer
> > has been consumed.
> >
> > We fix this by setting tp->rcv_wnd to 0 before we return from the
> > function tcp_select_window() in this particular case.
> > Further testing shows that the connection recovers neatly from the
> > squeeze situation, and traffic can continue indefinitely.
> >
> > Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
> > Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> > ---
> >  net/ipv4/tcp_output.c | 14 +++++++++-----
> >  1 file changed, 9 insertions(+), 5 deletions(-)
> >
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index 9282fafc0e61..57ead8f3c334 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -263,11 +263,15 @@ static u16 tcp_select_window(struct sock *sk)
> >         u32 cur_win, new_win;
> >
> >         /* Make the window 0 if we failed to queue the data because we
> > -        * are out of memory. The window is temporary, so we don't store
> > -        * it on the socket.
> > +        * are out of memory. The window needs to be stored in the socket
> > +        * for the connection to recover.
> >          */
> > -       if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
> > -               return 0;
> > +       if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
> > +               new_win = 0;
> > +               tp->rcv_wnd = 0;
> > +               tp->rcv_wup = tp->rcv_nxt;
> > +               goto out;
> > +       }
> >
> >         cur_win = tcp_receive_window(tp);
> >         new_win = __tcp_select_window(sk);
> > @@ -301,7 +305,7 @@ static u16 tcp_select_window(struct sock *sk)
> >
> >         /* RFC1323 scaling applied */
> >         new_win >>= tp->rx_opt.rcv_wscale;
> > -
> > +out:
> >         /* If we advertise zero window, disable fast path. */
> >         if (new_win == 0) {
> >                 tp->pred_flags = 0;
> > --
> > 2.42.0
> >
>
> Any particular reason to not cc Menglong Dong ?
> (I just did)

He is not working at Tencent any more. Let me CC here one more time.
Menglong Dong April 7, 2024, 7:51 a.m. UTC | #3
On Sun, Apr 7, 2024 at 2:52 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Sun, Apr 7, 2024 at 2:38 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Sat, Apr 6, 2024 at 8:21 PM <jmaloy@redhat.com> wrote:
> > >
> > > From: Jon Maloy <jmaloy@redhat.com>
> > >
> > > Testing of the previous commit ("tcp: add support for SO_PEEK_OFF")
> > > in this series along with the pasta protocol splicer revealed a bug in
> > > the way tcp handles window advertising during extreme memory squeeze
> > > situations.
> > >
> > > The excerpt of the logging session below shows what is happening:
> > >
> > > [5201<->54494]:     ==== Activating log @ tcp_select_window()/268 ====
> > > [5201<->54494]:     (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) --> TRUE
> > > [5201<->54494]:   tcp_select_window(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354, returning 0
> > > [5201<->54494]:   ADVERTISING WINDOW SIZE 0
> > > [5201<->54494]: __tcp_transmit_skb(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > >
> > > [5201<->54494]: tcp_recvmsg_locked(->)
> > > [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > > [5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
> > > [5201<->54494]:     NOT calling tcp_send_ack()
> > > [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > > [5201<->54494]: tcp_recvmsg_locked(<-) returning 131072 bytes, window now: 250164, qlen: 83
> > >
> > > [...]
> >
> > I would prefer a packetdrill test, it is not clear what is happening...
> >
> > In particular, have you used SO_RCVBUF ?
> >
> > >
> > > [5201<->54494]: tcp_recvmsg_locked(->)
> > > [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > > [5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
> > > [5201<->54494]:     NOT calling tcp_send_ack()
> > > [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > > [5201<->54494]: tcp_recvmsg_locked(<-) returning 131072 bytes, window now: 250164, qlen: 1
> > >
> > > [5201<->54494]: tcp_recvmsg_locked(->)
> > > [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > > [5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
> > > [5201<->54494]:     NOT calling tcp_send_ack()
> > > [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > > [5201<->54494]: tcp_recvmsg_locked(<-) returning 57036 bytes, window now: 250164, qlen: 0
> > >
> > > [5201<->54494]: tcp_recvmsg_locked(->)
> > > [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > > [5201<->54494]:     NOT calling tcp_send_ack()
> > > [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > > [5201<->54494]: tcp_recvmsg_locked(<-) returning -11 bytes, window now: 250164, qlen: 0
> > >
> > > We can see that although we are advertising a window size of zero,
> > > tp->rcv_wnd is not updated accordingly. This leads to a discrepancy
> > > between this side's and the peer's view of the current window size.
> > > - The peer thinks the window is zero, and stops sending.

Hi!

In my original logic, the client will send a zero-window
ack when it drops the skb because it is out of
memory. And the peer SHOULD keep retransmitting the
dropped packet.

Does the peer do the retransmission in this case? The receive
window of the peer SHOULD recover once the
retransmission is successful.

> > > - This side ends up in a cycle where it repeatedly calculates a new
> > >   window size it finds too small to advertise.

Yeah, the zero window suppressed the sending of the ACK in
__tcp_cleanup_rbuf, which I wasn't aware of.

The ACK will recover the receive window of the peer. Does
it make the peer retransmit the dropped data immediately?
In my opinion, the peer still has to wait for the retransmission
timer to expire before retransmitting the dropped packet. Doesn't it?

If so, maybe we can do the retransmission immediately
when we are in a zero window caused by a window shrink, which can
make the recovery faster.

[......]
> > Any particular reason to not cc Menglong Dong ?
> > (I just did)
>
> He is not working at Tencent any more. Let me CC here one more time.

Thanks for CCing my new email, it's very kind of you,
Xing :/
Eric Dumazet April 8, 2024, 10:03 a.m. UTC | #4
On Sat, Apr 6, 2024 at 8:37 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Sat, Apr 6, 2024 at 8:21 PM <jmaloy@redhat.com> wrote:
> >
> > From: Jon Maloy <jmaloy@redhat.com>
> >
> > Testing of the previous commit ("tcp: add support for SO_PEEK_OFF")
> > in this series along with the pasta protocol splicer revealed a bug in
> > the way tcp handles window advertising during extreme memory squeeze
> > situations.
> >
> > The excerpt of the logging session below shows what is happening:
> >
> > [5201<->54494]:     ==== Activating log @ tcp_select_window()/268 ====
> > [5201<->54494]:     (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) --> TRUE
> > [5201<->54494]:   tcp_select_window(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354, returning 0
> > [5201<->54494]:   ADVERTISING WINDOW SIZE 0
> > [5201<->54494]: __tcp_transmit_skb(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> >
> > [5201<->54494]: tcp_recvmsg_locked(->)
> > [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > [5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
> > [5201<->54494]:     NOT calling tcp_send_ack()
> > [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > [5201<->54494]: tcp_recvmsg_locked(<-) returning 131072 bytes, window now: 250164, qlen: 83
> >
> > [...]
>
> I would prefer a packetdrill test, it is not clear what is happening...
>
> In particular, have you used SO_RCVBUF ?
>
> >
> > [5201<->54494]: tcp_recvmsg_locked(->)
> > [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > [5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
> > [5201<->54494]:     NOT calling tcp_send_ack()
> > [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > [5201<->54494]: tcp_recvmsg_locked(<-) returning 131072 bytes, window now: 250164, qlen: 1
> >
> > [5201<->54494]: tcp_recvmsg_locked(->)
> > [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > [5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
> > [5201<->54494]:     NOT calling tcp_send_ack()
> > [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > [5201<->54494]: tcp_recvmsg_locked(<-) returning 57036 bytes, window now: 250164, qlen: 0
> >
> > [5201<->54494]: tcp_recvmsg_locked(->)
> > [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > [5201<->54494]:     NOT calling tcp_send_ack()
> > [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > [5201<->54494]: tcp_recvmsg_locked(<-) returning -11 bytes, window now: 250164, qlen: 0
> >
> > We can see that although we are advertising a window size of zero,
> > tp->rcv_wnd is not updated accordingly. This leads to a discrepancy
> > between this side's and the peer's view of the current window size.
> > - The peer thinks the window is zero, and stops sending.
> > - This side ends up in a cycle where it repeatedly calculates a new
> >   window size it finds too small to advertise.
> >
> > Hence no messages are received, no acknowledgments are sent, and
> > the situation remains locked even after the last queued receive buffer
> > has been consumed.
> >
> > We fix this by setting tp->rcv_wnd to 0 before we return from the
> > function tcp_select_window() in this particular case.
> > Further testing shows that the connection recovers neatly from the
> > squeeze situation, and traffic can continue indefinitely.
> >
> > Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
> > Signed-off-by: Jon Maloy <jmaloy@redhat.com>

I do not think this patch is good. If we reach zero window, it is a
sign something is wrong.

TCP has heuristics to slow down the sender if the receiver does not
drain the receive queue fast enough.

MSG_PEEK is an obvious reason, and SO_RCVLOWAT too.

I suggest you take a look at tcp_set_rcvlowat(), see what is needed
for SO_PEEK_OFF (ab)use ?

In short, when SO_PEEK_OFF is in action :
- TCP needs to not delay ACK when receive queue starts to fill
- TCP needs to make sure sk_rcvbuf and tp->window_clamp grow (if
autotuning is enabled)
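
The kind of adjustment hinted at by the two points above could look
roughly like the sketch below. This is purely illustrative: the helper
name is invented, the sizing heuristic is a placeholder loosely modeled
on the way tcp_set_rcvlowat() grows the receive buffer for SO_RCVLOWAT,
and the "do not delay ACKs" part is left out.

/* Hypothetical helper (invented name): grow sk_rcvbuf and
 * tp->window_clamp so that peeked-but-unconsumed data does not end up
 * squeezing the advertised window. Does nothing if the user pinned the
 * buffer size with SO_RCVBUF.
 */
static void tcp_peek_off_grow_rcvbuf(struct sock *sk, int wanted_win)
{
	struct tcp_sock *tp = tcp_sk(sk);
	int space;

	if (sk->sk_userlocks & SOCK_RCVBUF_LOCK)
		return;

	/* crude overhead margin; the real conversion is more involved */
	space = wanted_win + (wanted_win >> 1);
	if (space > READ_ONCE(sk->sk_rcvbuf)) {
		WRITE_ONCE(sk->sk_rcvbuf, space);
		if (wanted_win > tp->window_clamp)
			tp->window_clamp = wanted_win;
	}
}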
Jon Maloy April 8, 2024, 11:01 a.m. UTC | #5
On 2024-04-07 03:51, Menglong Dong wrote:
> On Sun, Apr 7, 2024 at 2:52 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>> On Sun, Apr 7, 2024 at 2:38 AM Eric Dumazet <edumazet@google.com> wrote:
[...]
>>>> [5201<->54494]: tcp_recvmsg_locked(<-) returning -11 bytes, window now: 250164, qlen: 0
>>>>
>>>> We can see that although we are advertising a window size of zero,
>>>> tp->rcv_wnd is not updated accordingly. This leads to a discrepancy
>>>> between this side's and the peer's view of the current window size.
>>>> - The peer thinks the window is zero, and stops sending.
> Hi!
>
> In my original logic, the client will send a zero-window
> ack when it drops the skb because it is out of
> memory. And the peer SHOULD keep retransmitting the
> dropped packet.
>
> Does the peer do the retransmission in this case? The receive
> window of the peer SHOULD recover once the
> retransmission is successful.
The "peer" is this case is our user-space protocol splicer, emulating
the behavior of of the remote end socket.
At a first glance, it looks like it is *not* performing any retransmits
at all when it sees a zero window at the receiver, so this might indeed
be the problem.
I will be out of office today, but I will test this later this week.

///jon

>
>>>> - This side ends up in a cycle where it repeatedly calculates a new
>>>>    window size it finds too small to advertise.
> Yeah, the zero window suppressed the sending of the ACK in
> __tcp_cleanup_rbuf, which I wasn't aware of.
>
> The ACK will recover the receive window of the peer. Does
> it make the peer retransmit the dropped data immediately?
> In my opinion, the peer still has to wait for the retransmission
> timer to expire before retransmitting the dropped packet. Doesn't it?
>
> If so, maybe we can do the retransmission immediately
> when we are in a zero window caused by a window shrink, which can
> make the recovery faster.
>
> [......]
>>> Any particular reason to not cc Menglong Dong ?
>>> (I just did)
>> He is not working at Tencent any more. Let me CC here one more time.
> Thanks for CCing my new email, it's very kind of you,
> Xing :/
>
Jon Maloy April 8, 2024, 11:13 a.m. UTC | #6
On 2024-04-08 06:03, Eric Dumazet wrote:
> On Sat, Apr 6, 2024 at 8:37 PM Eric Dumazet <edumazet@google.com> wrote:
>> On Sat, Apr 6, 2024 at 8:21 PM <jmaloy@redhat.com> wrote:
[...]
>>> [5201<->54494]: tcp_recvmsg_locked(<-) returning 57036 bytes, window now: 250164, qlen: 0
>>>
>>> [5201<->54494]: tcp_recvmsg_locked(->)
>>> [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
>>> [5201<->54494]:     NOT calling tcp_send_ack()
>>> [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
>>> [5201<->54494]: tcp_recvmsg_locked(<-) returning -11 bytes, window now: 250164, qlen: 0
>>>
>>> We can see that although we are advertising a window size of zero,
>>> tp->rcv_wnd is not updated accordingly. This leads to a discrepancy
>>> between this side's and the peer's view of the current window size.
>>> - The peer thinks the window is zero, and stops sending.
>>> - This side ends up in a cycle where it repeatedly calculates a new
>>>    window size it finds too small to advertise.
>>>
>>> Hence no messages are received, no acknowledgments are sent, and
>>> the situation remains locked even after the last queued receive buffer
>>> has been consumed.
>>>
>>> We fix this by setting tp->rcv_wnd to 0 before we return from the
>>> function tcp_select_window() in this particular case.
>>> Further testing shows that the connection recovers neatly from the
>>> squeeze situation, and traffic can continue indefinitely.
>>>
>>> Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
>>> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> I do not think this patch is good. If we reach zero window, it is a
> sign something is wrong.
>
> TCP has heuristics to slow down the sender if the receiver does not
> drain the receive queue fast enough.
>
> MSG_PEEK is an obvious reason, and SO_RCVLOWAT too.
>
> I suggest you take a look at tcp_set_rcvlowat(), see what is needed
> for SO_PEEK_OFF (ab)use ?
>
> In short, when SO_PEEK_OFF is in action :
> - TCP needs to not delay ACK when receive queue starts to fill
> - TCP needs to make sure sk_rcvbuf and tp->window_clamp grow (if
> autotuning is enabled)
>
We are not talking about the same socket here. The one being
overloaded is the terminating socket at the guest side. This is
just a regular socket not using MSG_PEEK or SO_PEEK_OFF.

SO_PEEK_OFF is used in the intermediate socket terminating
the connection towards the remote end.  We want to preserve
the message in its receive queue until it has been acknowledged
by the guest side, so we don't need to keep a copy of it in user space.
This seems to work flawlessly.
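
For illustration, the splicer-side pattern looks roughly like the sketch
below. This is a minimal userspace sketch assuming the SO_PEEK_OFF
semantics for TCP that this series adds; the function names, buffer
sizes and the way 'acked' is obtained are made up, only the
SO_PEEK_OFF/MSG_PEEK/MSG_TRUNC usage is the point.

#include <sys/socket.h>
#include <unistd.h>

/* Called once after the connection is set up: enable the peek offset. */
static void peek_init(int tcp_fd)
{
	int off = 0;

	setsockopt(tcp_fd, SOL_SOCKET, SO_PEEK_OFF, &off, sizeof(off));
}

/* Forward the next chunk to the guest side without consuming it.
 * MSG_PEEK leaves the data queued in the kernel, and the peek offset
 * advances so the next call returns the following bytes.
 */
static ssize_t peek_and_forward(int tcp_fd, int guest_fd)
{
	char buf[65536];
	ssize_t n = recv(tcp_fd, buf, sizeof(buf), MSG_PEEK);

	if (n > 0)
		n = write(guest_fd, buf, n);
	return n;
}

/* Once the guest side has acknowledged 'acked' bytes, drop them from
 * the TCP receive queue. MSG_TRUNC discards them without another copy,
 * and the peek offset is rewound by the amount consumed.
 */
static void drop_acked(int tcp_fd, size_t acked)
{
	char scratch[65536];

	while (acked) {
		size_t chunk = acked < sizeof(scratch) ? acked : sizeof(scratch);
		ssize_t n = recv(tcp_fd, scratch, chunk, MSG_TRUNC);

		if (n <= 0)
			break;
		acked -= n;
	}
}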

Anyway, I think this is worth taking a closer look at, as you say.
I don't think this situation should occur at all.

///jon

Patch

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 9282fafc0e61..57ead8f3c334 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -263,11 +263,15 @@  static u16 tcp_select_window(struct sock *sk)
 	u32 cur_win, new_win;
 
 	/* Make the window 0 if we failed to queue the data because we
-	 * are out of memory. The window is temporary, so we don't store
-	 * it on the socket.
+	 * are out of memory. The window needs to be stored in the socket
+	 * for the connection to recover.
 	 */
-	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
-		return 0;
+	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
+		new_win = 0;
+		tp->rcv_wnd = 0;
+		tp->rcv_wup = tp->rcv_nxt;
+		goto out;
+	}
 
 	cur_win = tcp_receive_window(tp);
 	new_win = __tcp_select_window(sk);
@@ -301,7 +305,7 @@  static u16 tcp_select_window(struct sock *sk)
 
 	/* RFC1323 scaling applied */
 	new_win >>= tp->rx_opt.rcv_wscale;
-
+out:
 	/* If we advertise zero window, disable fast path. */
 	if (new_win == 0) {
 		tp->pred_flags = 0;