diff mbox series

[net] tcp: fix mid stream window clamp.

Message ID fab4d0949126683a3b6b4e04a9ec088cf9bfdbb1.1700751622.git.pabeni@redhat.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series [net] tcp: fix mid stream window clamp. | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/codegen success Generated files up to date
netdev/tree_selection success Clearly marked for net
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1117 this patch: 1117
netdev/cc_maintainers success CCed 6 of 6 maintainers
netdev/build_clang success Errors and warnings before: 1143 this patch: 1143
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 1144 this patch: 1144
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 25 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Paolo Abeni Nov. 23, 2023, 3:24 p.m. UTC
After the blamed commit below, if the user-space application performs
window clamping when tp->rcv_wnd is 0, the TCP socket will never be
able to announce a non 0 receive window, even after completely emptying
the receive buffer and re-setting the window clamp to higher values.

Refactor tcp_set_window_clamp() to address the issue: when the user
decreases the current clamp value, set rcv_ssthresh according to the
same logic used at buffer initialization time.
When increasing the clamp value, give the rcv_ssthresh a chance to grow
according to previously implemented heuristic.

Fixes: 3aa7857fe1d7 ("tcp: enable mid stream window clamp")
Reported-by: David Gibson <david@gibson.dropbear.id.au>
Reported-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Tested-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/ipv4/tcp.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

Comments

Eric Dumazet Nov. 23, 2023, 5:10 p.m. UTC | #1
CC Neal and Wei

On Thu, Nov 23, 2023 at 4:25 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> After the blamed commit below, if the user-space application performs
> window clamping when tp->rcv_wnd is 0, the TCP socket will never be
> able to announce a non 0 receive window, even after completely emptying
> the receive buffer and re-setting the window clamp to higher values.
>
> Refactor tcp_set_window_clamp() to address the issue: when the user
> decreases the current clamp value, set rcv_ssthresh according to the
> same logic used at buffer initialization time.
> When increasing the clamp value, give the rcv_ssthresh a chance to grow
> according to previously implemented heuristic.
>
> Fixes: 3aa7857fe1d7 ("tcp: enable mid stream window clamp")
> Reported-by: David Gibson <david@gibson.dropbear.id.au>
> Reported-by: Stefano Brivio <sbrivio@redhat.com>
> Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
> Tested-by: Stefano Brivio <sbrivio@redhat.com>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
>  net/ipv4/tcp.c | 19 ++++++++++++++++---
>  1 file changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 53bcc17c91e4..1a9b9064e080 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -3368,9 +3368,22 @@ int tcp_set_window_clamp(struct sock *sk, int val)
>                         return -EINVAL;
>                 tp->window_clamp = 0;
>         } else {
> -               tp->window_clamp = val < SOCK_MIN_RCVBUF / 2 ?
> -                       SOCK_MIN_RCVBUF / 2 : val;
> -               tp->rcv_ssthresh = min(tp->rcv_wnd, tp->window_clamp);
> +               u32 new_rcv_ssthresh, old_window_clamp = tp->window_clamp;
> +               u32 new_window_clamp = val < SOCK_MIN_RCVBUF / 2 ?
> +                                               SOCK_MIN_RCVBUF / 2 : val;
> +
> +               if (new_window_clamp == old_window_clamp)
> +                       return 0;
> +
> +               tp->window_clamp = new_window_clamp;
> +               if (new_window_clamp < old_window_clamp) {
> +                       tp->rcv_ssthresh = min(tp->rcv_ssthresh,
> +                                              new_window_clamp);
> +               } else {
> +                       new_rcv_ssthresh = min(tp->rcv_wnd, tp->window_clamp);
> +                       tp->rcv_ssthresh = max(new_rcv_ssthresh,
> +                                              tp->rcv_ssthresh);
> +               }
>         }
>         return 0;
>  }

It seems there is no provision for SO_RESERVE_MEM

I wonder if tcp_adjust_rcv_ssthresh()  could help here ?

Have you considered reverting  3aa7857fe1d7 ("tcp: enable mid stream
window clamp") ?

Thanks.
Paolo Abeni Nov. 23, 2023, 6:16 p.m. UTC | #2
On Thu, 2023-11-23 at 18:10 +0100, Eric Dumazet wrote:
> CC Neal and Wei
> 
> On Thu, Nov 23, 2023 at 4:25 PM Paolo Abeni <pabeni@redhat.com> wrote:
> > 
> > After the blamed commit below, if the user-space application performs
> > window clamping when tp->rcv_wnd is 0, the TCP socket will never be
> > able to announce a non 0 receive window, even after completely emptying
> > the receive buffer and re-setting the window clamp to higher values.
> > 
> > Refactor tcp_set_window_clamp() to address the issue: when the user
> > decreases the current clamp value, set rcv_ssthresh according to the
> > same logic used at buffer initialization time.
> > When increasing the clamp value, give the rcv_ssthresh a chance to grow
> > according to previously implemented heuristic.
> > 
> > Fixes: 3aa7857fe1d7 ("tcp: enable mid stream window clamp")
> > Reported-by: David Gibson <david@gibson.dropbear.id.au>
> > Reported-by: Stefano Brivio <sbrivio@redhat.com>
> > Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
> > Tested-by: Stefano Brivio <sbrivio@redhat.com>
> > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > ---
> >  net/ipv4/tcp.c | 19 ++++++++++++++++---
> >  1 file changed, 16 insertions(+), 3 deletions(-)
> > 
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 53bcc17c91e4..1a9b9064e080 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -3368,9 +3368,22 @@ int tcp_set_window_clamp(struct sock *sk, int val)
> >                         return -EINVAL;
> >                 tp->window_clamp = 0;
> >         } else {
> > -               tp->window_clamp = val < SOCK_MIN_RCVBUF / 2 ?
> > -                       SOCK_MIN_RCVBUF / 2 : val;
> > -               tp->rcv_ssthresh = min(tp->rcv_wnd, tp->window_clamp);
> > +               u32 new_rcv_ssthresh, old_window_clamp = tp->window_clamp;
> > +               u32 new_window_clamp = val < SOCK_MIN_RCVBUF / 2 ?
> > +                                               SOCK_MIN_RCVBUF / 2 : val;
> > +
> > +               if (new_window_clamp == old_window_clamp)
> > +                       return 0;
> > +
> > +               tp->window_clamp = new_window_clamp;
> > +               if (new_window_clamp < old_window_clamp) {
> > +                       tp->rcv_ssthresh = min(tp->rcv_ssthresh,
> > +                                              new_window_clamp);
> > +               } else {
> > +                       new_rcv_ssthresh = min(tp->rcv_wnd, tp->window_clamp);
> > +                       tp->rcv_ssthresh = max(new_rcv_ssthresh,
> > +                                              tp->rcv_ssthresh);
> > +               }
> >         }
> >         return 0;
> >  }
> 
> It seems there is no provision for SO_RESERVE_MEM

Indeed I did take that in account.

> I wonder if tcp_adjust_rcv_ssthresh()  could help here ?

I don't know how to fit it into the above. tcp_adjust_rcv_ssthresh()
tends to shrink rcv_ssthresh to low values when no memory is reserved. 

Dealing directly with SO_RESERVE_MEM when shrinking the threshold feels
easier to me, something alike:

               if (new_window_clamp == old_window_clamp)
                       return 0;

               tp->window_clamp = new_window_clamp;
               if (new_window_clamp < old_window_clamp) {
			int unused_mem = sk_unused_reserved_mem(sk);

			tp->rcv_ssthresh = min(tp->rcv_ssthresh,
                                               new_window_clamp);

			if (unused_mem)
				tp->rcv_ssthresh = max_t(u32, tp->rcv_ssthresh,
					tcp_win_from_space(sk, unused_mem));
               } else {
			new_rcv_ssthresh = min(tp->rcv_wnd, tp->window_clamp);
			tp->rcv_ssthresh = max(new_rcv_ssthresh,
                                              tp->rcv_ssthresh);
               }

Possibly the bits shared with tcp_adjust_rcv_ssthresh() could be
factored out in a common helper.

> Have you considered reverting  3aa7857fe1d7 ("tcp: enable mid stream
> window clamp") ?


That would work, too and will be simpler.

The issue at hand was noted with an application that really wants to
limit the announced window:

https://gitlab.com/dgibson/passt

I guess touching rcv_ssthresh would be a bit more effective. 

Not much more in the end, as both window_clamp and rcv_ssthresh can
later grow due to rcv buf auto-tune. Ideally we would like to prevent
tcp_rcv_space_adjust() from touching window_clamp after
TCP_WINDOW_CLAMP - but that is another matter/patch.

Thanks!

Paolo
Neil Spring Nov. 24, 2023, 5:27 a.m. UTC | #3
> 
> ________________________________________
> From: Paolo Abeni <pabeni@redhat.com>
> Sent: Thursday, November 23, 2023 10:16 AM
> To: Eric Dumazet; Neal Cardwell; Wei Wang
> Cc: netdev@vger.kernel.org; David S. Miller; David Ahern; Jakub Kicinski; Neil Spring; David Gibson
> Subject: Re: [PATCH net] tcp: fix mid stream window clamp.
> 
> !-------------------------------------------------------------------|
>   This Message Is From an External Sender
> 
> |-------------------------------------------------------------------!
> 
> On Thu, 2023-11-23 at 18:10 +0100, Eric Dumazet wrote:
> > CC Neal and Wei
> >
> > On Thu, Nov 23, 2023 at 4:25 PM Paolo Abeni <pabeni@redhat.com> wrote:
> > >
> > > After the blamed commit below, if the user-space application performs
> > > window clamping when tp->rcv_wnd is 0, the TCP socket will never be
> > > able to announce a non 0 receive window, even after completely emptying
> > > the receive buffer and re-setting the window clamp to higher values.
> > >
> > > Refactor tcp_set_window_clamp() to address the issue: when the user
> > > decreases the current clamp value, set rcv_ssthresh according to the
> > > same logic used at buffer initialization time.
> > > When increasing the clamp value, give the rcv_ssthresh a chance to grow
> > > according to previously implemented heuristic.
> > >
> > > Fixes: 3aa7857fe1d7 ("tcp: enable mid stream window clamp")
> > > Reported-by: David Gibson <david@gibson.dropbear.id.au>
> > > Reported-by: Stefano Brivio <sbrivio@redhat.com>
> > > Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
> > > Tested-by: Stefano Brivio <sbrivio@redhat.com>
> > > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > > ---
> > >  net/ipv4/tcp.c | 19 ++++++++++++++++---
> > >  1 file changed, 16 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > index 53bcc17c91e4..1a9b9064e080 100644
> > > --- a/net/ipv4/tcp.c
> > > +++ b/net/ipv4/tcp.c
> > > @@ -3368,9 +3368,22 @@ int tcp_set_window_clamp(struct sock *sk, int val)
> > >                         return -EINVAL;
> > >                 tp->window_clamp = 0;
> > >         } else {
> > > -               tp->window_clamp = val < SOCK_MIN_RCVBUF / 2 ?
> > > -                       SOCK_MIN_RCVBUF / 2 : val;
> > > -               tp->rcv_ssthresh = min(tp->rcv_wnd, tp->window_clamp);
> > > +               u32 new_rcv_ssthresh, old_window_clamp = tp->window_clamp;
> > > +               u32 new_window_clamp = val < SOCK_MIN_RCVBUF / 2 ?
> > > +                                               SOCK_MIN_RCVBUF / 2 : val;
> > > +
> > > +               if (new_window_clamp == old_window_clamp)
> > > +                       return 0;
> > > +
> > > +               tp->window_clamp = new_window_clamp;
> > > +               if (new_window_clamp < old_window_clamp) {
> > > +                       tp->rcv_ssthresh = min(tp->rcv_ssthresh,
> > > +                                              new_window_clamp);
> > > +               } else {
> > > +                       new_rcv_ssthresh = min(tp->rcv_wnd, tp->window_clamp);
> > > +                       tp->rcv_ssthresh = max(new_rcv_ssthresh,
> > > +                                              tp->rcv_ssthresh);
> > > +               }
> > >         }
> > >         return 0;
> > >  }
> >
> > It seems there is no provision for SO_RESERVE_MEM
> 
> Indeed I did take that in account.
> 
> > I wonder if tcp_adjust_rcv_ssthresh()  could help here ?
> 
> I don't know how to fit it into the above. tcp_adjust_rcv_ssthresh()
> tends to shrink rcv_ssthresh to low values when no memory is reserved.
> 
> Dealing directly with SO_RESERVE_MEM when shrinking the threshold feels
> easier to me, something alike:
> 
>                if (new_window_clamp == old_window_clamp)
>                        return 0;
> 
>                tp->window_clamp = new_window_clamp;
>                if (new_window_clamp < old_window_clamp) {
>                         int unused_mem = sk_unused_reserved_mem(sk);
> 
>                         tp->rcv_ssthresh = min(tp->rcv_ssthresh,
>                                                new_window_clamp);
> 
>                         if (unused_mem)
>                                 tp->rcv_ssthresh = max_t(u32, tp->rcv_ssthresh,
>                                         tcp_win_from_space(sk, unused_mem));
>                } else {
>                         new_rcv_ssthresh = min(tp->rcv_wnd, tp->window_clamp);
>                         tp->rcv_ssthresh = max(new_rcv_ssthresh,
>                                               tp->rcv_ssthresh);
>                }
> 
> Possibly the bits shared with tcp_adjust_rcv_ssthresh() could be
> factored out in a common helper.
> 
> > Have you considered reverting  3aa7857fe1d7 ("tcp: enable mid stream
> > window clamp") ?
> 
> 
> That would work, too and will be simpler.
> 
> The issue at hand was noted with an application that really wants to
> limit the announced window:
> 
> https://gitlab.com/dgibson/passt
> 
> I guess touching rcv_ssthresh would be a bit more effective.
> 
> Not much more in the end, as both window_clamp and rcv_ssthresh can
> later grow due to rcv buf auto-tune. Ideally we would like to prevent
> tcp_rcv_space_adjust() from touching window_clamp after
> TCP_WINDOW_CLAMP - but that is another matter/patch.
> 
> Thanks!
> 
> Paolo
> 

The patch to fix the bug where rcv_sshthresh is reduced to zero on a full receive window and cannot recover is:
-tp->rcv_ssthresh = min(tp->rcv_wnd, tp->window_clamp);
+tp->rcv_ssthresh = min(tp->rcv_ssthresh, tp->window_clamp);

This change and the patch in this thread behave the same in the following packetdrill that reproduces the problem.   

-neil

---

`../common/defaults.sh`

   // Establish a connection.
    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

  +.1 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 8>
   +0 > S. 0:0(0) ack 1 win 65535 <mss 1460,nop,nop,sackOK,nop,wscale 8>
  +.1 < . 1:1(0) ack 1 win 320
   +0 accept(3, ..., ...) = 4

   +0 setsockopt(4, SOL_SOCKET, SO_RCVBUF, [65536], 4) = 0

   // Close rcv_wnd to zero
  +.01 < . 1:64000(63999) ack 1 win 320
   +.04 > . 1:1(0) ack 64000 win 248
  +.01 < . 64000:128000(64000) ack 1 win 320
   // delayed ack waiting for user to read.
   +0.02~+0.14 > . 1:1(0) ack 128000 win 0

   +0 setsockopt(4, IPPROTO_TCP, TCP_WINDOW_CLAMP, [12800], 4) = 0

  +.01 read(4, ..., 129000) = 127999

   // read opens window ; this is the key part of the test.
   +0 > . 1:1(0) ack 128000 win 50

   // raise clamp further, above buffer size to observe growth
   +0 setsockopt(4, IPPROTO_TCP, TCP_WINDOW_CLAMP, [128000], 4) = 0
  +.01 < . 128000:140800(12800) ack 1 win 320
   +0 > . 1:1(0) ack 140800 win 150

   // Read to clear the receive buffer.
  +.01 read(4, ..., 129000) = 12800

  +.01 < . 140800:172000(31200) ack 1 win 320
   +0 > . 1:1(0) ack 172000 win 361
Paolo Abeni Nov. 24, 2023, 7:54 a.m. UTC | #4
On Fri, 2023-11-24 at 05:27 +0000, Neil Spring wrote:
> > 
> > ________________________________________
> > From: Paolo Abeni <pabeni@redhat.com>
> > Sent: Thursday, November 23, 2023 10:16 AM
> > To: Eric Dumazet; Neal Cardwell; Wei Wang
> > Cc: netdev@vger.kernel.org; David S. Miller; David Ahern; Jakub
> > Kicinski; Neil Spring; David Gibson
> > Subject: Re: [PATCH net] tcp: fix mid stream window clamp.
> > 
> > !------------------------------------------------------------------
> > -|
> >   This Message Is From an External Sender
> > 
> > > -----------------------------------------------------------------
> > > --!
> > 
> > On Thu, 2023-11-23 at 18:10 +0100, Eric Dumazet wrote:
> > > CC Neal and Wei
> > > 
> > > On Thu, Nov 23, 2023 at 4:25 PM Paolo Abeni <pabeni@redhat.com>
> > > wrote:
> > > > 
> > > > After the blamed commit below, if the user-space application
> > > > performs
> > > > window clamping when tp->rcv_wnd is 0, the TCP socket will
> > > > never be
> > > > able to announce a non 0 receive window, even after completely
> > > > emptying
> > > > the receive buffer and re-setting the window clamp to higher
> > > > values.
> > > > 
> > > > Refactor tcp_set_window_clamp() to address the issue: when the
> > > > user
> > > > decreases the current clamp value, set rcv_ssthresh according
> > > > to the
> > > > same logic used at buffer initialization time.
> > > > When increasing the clamp value, give the rcv_ssthresh a chance
> > > > to grow
> > > > according to previously implemented heuristic.
> > > > 
> > > > Fixes: 3aa7857fe1d7 ("tcp: enable mid stream window clamp")
> > > > Reported-by: David Gibson <david@gibson.dropbear.id.au>
> > > > Reported-by: Stefano Brivio <sbrivio@redhat.com>
> > > > Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
> > > > Tested-by: Stefano Brivio <sbrivio@redhat.com>
> > > > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > > > ---
> > > >  net/ipv4/tcp.c | 19 ++++++++++++++++---
> > > >  1 file changed, 16 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > > index 53bcc17c91e4..1a9b9064e080 100644
> > > > --- a/net/ipv4/tcp.c
> > > > +++ b/net/ipv4/tcp.c
> > > > @@ -3368,9 +3368,22 @@ int tcp_set_window_clamp(struct sock
> > > > *sk, int val)
> > > >                         return -EINVAL;
> > > >                 tp->window_clamp = 0;
> > > >         } else {
> > > > -               tp->window_clamp = val < SOCK_MIN_RCVBUF / 2 ?
> > > > -                       SOCK_MIN_RCVBUF / 2 : val;
> > > > -               tp->rcv_ssthresh = min(tp->rcv_wnd, tp-
> > > > >window_clamp);
> > > > +               u32 new_rcv_ssthresh, old_window_clamp = tp-
> > > > >window_clamp;
> > > > +               u32 new_window_clamp = val < SOCK_MIN_RCVBUF /
> > > > 2 ?
> > > > +                                               SOCK_MIN_RCVBUF
> > > > / 2 : val;
> > > > +
> > > > +               if (new_window_clamp == old_window_clamp)
> > > > +                       return 0;
> > > > +
> > > > +               tp->window_clamp = new_window_clamp;
> > > > +               if (new_window_clamp < old_window_clamp) {
> > > > +                       tp->rcv_ssthresh = min(tp-
> > > > >rcv_ssthresh,
> > > > +                                             
> > > > new_window_clamp);
> > > > +               } else {
> > > > +                       new_rcv_ssthresh = min(tp->rcv_wnd, tp-
> > > > >window_clamp);
> > > > +                       tp->rcv_ssthresh =
> > > > max(new_rcv_ssthresh,
> > > > +                                              tp-
> > > > >rcv_ssthresh);
> > > > +               }
> > > >         }
> > > >         return 0;
> > > >  }
> > > 
> > > It seems there is no provision for SO_RESERVE_MEM
> > 
> > Indeed I did take that in account.
> > 
> > > I wonder if tcp_adjust_rcv_ssthresh()  could help here ?
> > 
> > I don't know how to fit it into the above.
> > tcp_adjust_rcv_ssthresh()
> > tends to shrink rcv_ssthresh to low values when no memory is
> > reserved.
> > 
> > Dealing directly with SO_RESERVE_MEM when shrinking the threshold
> > feels
> > easier to me, something alike:
> > 
> >                if (new_window_clamp == old_window_clamp)
> >                        return 0;
> > 
> >                tp->window_clamp = new_window_clamp;
> >                if (new_window_clamp < old_window_clamp) {
> >                         int unused_mem =
> > sk_unused_reserved_mem(sk);
> > 
> >                         tp->rcv_ssthresh = min(tp->rcv_ssthresh,
> >                                                new_window_clamp);
> > 
> >                         if (unused_mem)
> >                                 tp->rcv_ssthresh = max_t(u32, tp-
> > >rcv_ssthresh,
> >                                         tcp_win_from_space(sk,
> > unused_mem));
> >                } else {
> >                         new_rcv_ssthresh = min(tp->rcv_wnd, tp-
> > >window_clamp);
> >                         tp->rcv_ssthresh = max(new_rcv_ssthresh,
> >                                               tp->rcv_ssthresh);
> >                }
> > 
> > Possibly the bits shared with tcp_adjust_rcv_ssthresh() could be
> > factored out in a common helper.
> > 
> > > Have you considered reverting  3aa7857fe1d7 ("tcp: enable mid
> > > stream
> > > window clamp") ?
> > 
> > 
> > That would work, too and will be simpler.
> > 
> > The issue at hand was noted with an application that really wants
> > to
> > limit the announced window:
> > 
> > https://gitlab.com/dgibson/passt
> > 
> > I guess touching rcv_ssthresh would be a bit more effective.
> > 
> > Not much more in the end, as both window_clamp and rcv_ssthresh can
> > later grow due to rcv buf auto-tune. Ideally we would like to
> > prevent
> > tcp_rcv_space_adjust() from touching window_clamp after
> > TCP_WINDOW_CLAMP - but that is another matter/patch.
> > 
> > Thanks!
> > 
> > Paolo
> > 
> 
> The patch to fix the bug where rcv_sshthresh is reduced to zero on a
> full receive window and cannot recover is:
> -tp->rcv_ssthresh = min(tp->rcv_wnd, tp->window_clamp);
> +tp->rcv_ssthresh = min(tp->rcv_ssthresh, tp->window_clamp);

FTR I considered something similar to the above, but I opted for the
present patch, as the above does not pass the pktdrill suggested by
Eric here:

https://lore.kernel.org/netdev/6070816e-f7d2-725a-ec10-9d85f15455a2@gmail.com/

Cheers,

Paolo
Neil Spring Nov. 27, 2023, 10:59 p.m. UTC | #5
> On Fri, 2023-11-24 at 05:27 +0000, Neil Spring wrote:
> > >
> > > ________________________________________
> > > From: Paolo Abeni <pabeni@redhat.com>
> > > Sent: Thursday, November 23, 2023 10:16 AM
> > > To: Eric Dumazet; Neal Cardwell; Wei Wang
> > > Cc: netdev@vger.kernel.org; David S. Miller; David Ahern; Jakub
> > > Kicinski; Neil Spring; David Gibson
> > > Subject: Re: [PATCH net] tcp: fix mid stream window clamp.
> > >
> > > !------------------------------------------------------------------
> > > -|
> > >   This Message Is From an External Sender
> > >
> > > > -----------------------------------------------------------------
> > > > --!
> > >
> > > On Thu, 2023-11-23 at 18:10 +0100, Eric Dumazet wrote:
> > > > CC Neal and Wei
> > > >
> > > > On Thu, Nov 23, 2023 at 4:25 PM Paolo Abeni <pabeni@redhat.com>
> > > > wrote:
> > > > >
> > > > > After the blamed commit below, if the user-space application
> > > > > performs
> > > > > window clamping when tp->rcv_wnd is 0, the TCP socket will
> > > > > never be
> > > > > able to announce a non 0 receive window, even after completely
> > > > > emptying
> > > > > the receive buffer and re-setting the window clamp to higher
> > > > > values.
> > > > >
> > > > > Refactor tcp_set_window_clamp() to address the issue: when the
> > > > > user
> > > > > decreases the current clamp value, set rcv_ssthresh according
> > > > > to the
> > > > > same logic used at buffer initialization time.
> > > > > When increasing the clamp value, give the rcv_ssthresh a chance
> > > > > to grow
> > > > > according to previously implemented heuristic.
> > > > >
> > > > > Fixes: 3aa7857fe1d7 ("tcp: enable mid stream window clamp")
> > > > > Reported-by: David Gibson <david@gibson.dropbear.id.au>
> > > > > Reported-by: Stefano Brivio <sbrivio@redhat.com>
> > > > > Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
> > > > > Tested-by: Stefano Brivio <sbrivio@redhat.com>
> > > > > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > > > > ---
> > > > >  net/ipv4/tcp.c | 19 ++++++++++++++++---
> > > > >  1 file changed, 16 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > > > index 53bcc17c91e4..1a9b9064e080 100644
> > > > > --- a/net/ipv4/tcp.c
> > > > > +++ b/net/ipv4/tcp.c
> > > > > @@ -3368,9 +3368,22 @@ int tcp_set_window_clamp(struct sock
> > > > > *sk, int val)
> > > > >                         return -EINVAL;
> > > > >                 tp->window_clamp = 0;
> > > > >         } else {
> > > > > -               tp->window_clamp = val < SOCK_MIN_RCVBUF / 2 ?
> > > > > -                       SOCK_MIN_RCVBUF / 2 : val;
> > > > > -               tp->rcv_ssthresh = min(tp->rcv_wnd, tp-
> > > > > >window_clamp);
> > > > > +               u32 new_rcv_ssthresh, old_window_clamp = tp-
> > > > > >window_clamp;
> > > > > +               u32 new_window_clamp = val < SOCK_MIN_RCVBUF /
> > > > > 2 ?
> > > > > +                                               SOCK_MIN_RCVBUF
> > > > > / 2 : val;
> > > > > +
> > > > > +               if (new_window_clamp == old_window_clamp)
> > > > > +                       return 0;
> > > > > +
> > > > > +               tp->window_clamp = new_window_clamp;
> > > > > +               if (new_window_clamp < old_window_clamp) {
> > > > > +                       tp->rcv_ssthresh = min(tp-
> > > > > >rcv_ssthresh,
> > > > > +                                            
> > > > > new_window_clamp);
> > > > > +               } else {
> > > > > +                       new_rcv_ssthresh = min(tp->rcv_wnd, tp-
> > > > > >window_clamp);
> > > > > +                       tp->rcv_ssthresh =
> > > > > max(new_rcv_ssthresh,
> > > > > +                                              tp-
> > > > > >rcv_ssthresh);
> > > > > +               }
> > > > >         }
> > > > >         return 0;
> > > > >  }
> > > >
> > > > It seems there is no provision for SO_RESERVE_MEM
> > >
> > > Indeed I did take that in account.
> > >
> > > > I wonder if tcp_adjust_rcv_ssthresh()  could help here ?
> > >
> > > I don't know how to fit it into the above.
> > > tcp_adjust_rcv_ssthresh()
> > > tends to shrink rcv_ssthresh to low values when no memory is
> > > reserved.
> > >
> > > Dealing directly with SO_RESERVE_MEM when shrinking the threshold
> > > feels
> > > easier to me, something alike:
> > >
> > >                if (new_window_clamp == old_window_clamp)
> > >                        return 0;
> > >
> > >                tp->window_clamp = new_window_clamp;
> > >                if (new_window_clamp < old_window_clamp) {
> > >                         int unused_mem =
> > > sk_unused_reserved_mem(sk);
> > >
> > >                         tp->rcv_ssthresh = min(tp->rcv_ssthresh,
> > >                                                new_window_clamp);
> > >
> > >                         if (unused_mem)
> > >                                 tp->rcv_ssthresh = max_t(u32, tp-
> > > >rcv_ssthresh,
> > >                                         tcp_win_from_space(sk,
> > > unused_mem));
> > >                } else {
> > >                         new_rcv_ssthresh = min(tp->rcv_wnd, tp-
> > > >window_clamp);
> > >                         tp->rcv_ssthresh = max(new_rcv_ssthresh,
> > >                                               tp->rcv_ssthresh);
> > >                }
> > >
> > > Possibly the bits shared with tcp_adjust_rcv_ssthresh() could be
> > > factored out in a common helper.
> > >
> > > > Have you considered reverting  3aa7857fe1d7 ("tcp: enable mid
> > > > stream
> > > > window clamp") ?
> > >
> > >
> > > That would work, too and will be simpler.
> > >
> > > The issue at hand was noted with an application that really wants
> > > to
> > > limit the announced window:
> > >
> > > https://gitlab.com/dgibson/passt
> > >
> > > I guess touching rcv_ssthresh would be a bit more effective.
> > >
> > > Not much more in the end, as both window_clamp and rcv_ssthresh can
> > > later grow due to rcv buf auto-tune. Ideally we would like to
> > > prevent
> > > tcp_rcv_space_adjust() from touching window_clamp after
> > > TCP_WINDOW_CLAMP - but that is another matter/patch.
> > >
> > > Thanks!
> > >
> > > Paolo
> > >
> >
> > The patch to fix the bug where rcv_sshthresh is reduced to zero on a
> > full receive window and cannot recover is:
> > -tp->rcv_ssthresh = min(tp->rcv_wnd, tp->window_clamp);
> > +tp->rcv_ssthresh = min(tp->rcv_ssthresh, tp->window_clamp);
> 
> FTR I considered something similar to the above, but I opted for the
> present patch, as the above does not pass the pktdrill suggested by
> Eric here:
> 
> https://lore.kernel.org/netdev/6070816e-f7d2-725a-ec10-9d85f15455a2@gmail.com/
> 
> Cheers,
> 
> Paolo

Ahh, thanks for pointing that out!  I see that the packetdrill also needs some fixes to address changes to the initial window.  :(

Would the following address the concern?  

tp->rcv_ssthresh = min(max(tp->rcv_ssthresh, tp->rcv_wnd), tp->window_clamp);

(that is, rcv_sshthresh must be no greater than window_clamp, but otherwise it can keep the larger of its current value or the last advertised window.)

I believe this addresses both problem cases (transient tiny clamp; closed window when clamping) and passes (slightly less picky) packetdrill tests.

-neil
Paolo Abeni Nov. 28, 2023, 8:59 a.m. UTC | #6
On Mon, 2023-11-27 at 22:59 +0000, Neil Spring wrote:
> 
> Would the following address the concern?  
> 
> tp->rcv_ssthresh = min(max(tp->rcv_ssthresh, tp->rcv_wnd), tp-
> >window_clamp);
> 
> (that is, rcv_sshthresh must be no greater than window_clamp, but
> otherwise it can keep the larger of its current value or the last
> advertised window.)
> 
> I believe this addresses both problem cases (transient tiny clamp;
> closed window when clamping) and passes (slightly less picky)
> packetdrill tests.

Note that the above is basically the patch I submitted (it yields the
same values).

Yes, it addresses the issue.

But it does not address Eric's concerns reported in this thread.

It's unclear to me if the more involved approach proposed here:

https://lore.kernel.org/netdev/ebb26a4a8a80292423c8cfc965c7b16e2aa4e201.camel@redhat.com/

would be ok?

Thanks!

Paolo
Neil Spring Nov. 28, 2023, 10:56 a.m. UTC | #7
> 
> 
> ________________________________________
> From: Paolo Abeni <pabeni@redhat.com>
> Sent: Tuesday, November 28, 2023 12:59 AM
> To: Neil Spring; Eric Dumazet; Neal Cardwell; Wei Wang
> Cc: netdev@vger.kernel.org; David S. Miller; David Ahern; Jakub Kicinski; David Gibson
> Subject: Re: [PATCH net] tcp: fix mid stream window clamp.
> 
> !-------------------------------------------------------------------|
>   This Message Is From an External Sender
> 
> |-------------------------------------------------------------------!
> 
> On Mon, 2023-11-27 at 22:59 +0000, Neil Spring wrote:
> >
> > Would the following address the concern?
> >
> > tp->rcv_ssthresh = min(max(tp->rcv_ssthresh, tp->rcv_wnd), tp-
> > >window_clamp);
> >
> > (that is, rcv_sshthresh must be no greater than window_clamp, but
> > otherwise it can keep the larger of its current value or the last
> > advertised window.)
> >
> > I believe this addresses both problem cases (transient tiny clamp;
> > closed window when clamping) and passes (slightly less picky)
> > packetdrill tests.
> 
> Note that the above is basically the patch I submitted (it yields the
> same values).

Yes! 

> 
> Yes, it addresses the issue.
> 
> But it does not address Eric's concerns reported in this thread.
> 
> It's unclear to me if the more involved approach proposed here:
> 
> https://lore.kernel.org/netdev/ebb26a4a8a80292423c8cfc965c7b16e2aa4e201.camel@redhat.com/
> 
> would be ok?

My understanding of the documentation of TCP_WINDOW_CLAMP is: "Bound the size of the advertised window to this value. The kernel imposes a minimum size of SOCK_MIN_RCVBUF/2."   So in my opinion, increasing this value above the application-requested clamp to at least the unused reserved buffer space adds complexity and is not correct.  I suspect the result is also not predictable, since it depends on whether the application has consumed the buffer.  

The clamp and allocated buffer space weren't tied like this before - I could set SO_RCVBUF large and TCP_WINDOW_CLAMP small, perhaps slowing the network transfer to match a slow receiving application, and expect the clamp to do as I asked - so I don't quite see what's new here.

I guess Eric has a different perspective and I should let you figure out what's best for the interaction with reserved memory.

Thanks again for fixing this. 

-neil

> 
> Thanks!
> 
> Paolo
> 
>
diff mbox series

Patch

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 53bcc17c91e4..1a9b9064e080 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3368,9 +3368,22 @@  int tcp_set_window_clamp(struct sock *sk, int val)
 			return -EINVAL;
 		tp->window_clamp = 0;
 	} else {
-		tp->window_clamp = val < SOCK_MIN_RCVBUF / 2 ?
-			SOCK_MIN_RCVBUF / 2 : val;
-		tp->rcv_ssthresh = min(tp->rcv_wnd, tp->window_clamp);
+		u32 new_rcv_ssthresh, old_window_clamp = tp->window_clamp;
+		u32 new_window_clamp = val < SOCK_MIN_RCVBUF / 2 ?
+						SOCK_MIN_RCVBUF / 2 : val;
+
+		if (new_window_clamp == old_window_clamp)
+			return 0;
+
+		tp->window_clamp = new_window_clamp;
+		if (new_window_clamp < old_window_clamp) {
+			tp->rcv_ssthresh = min(tp->rcv_ssthresh,
+					       new_window_clamp);
+		} else {
+			new_rcv_ssthresh = min(tp->rcv_wnd, tp->window_clamp);
+			tp->rcv_ssthresh = max(new_rcv_ssthresh,
+					       tp->rcv_ssthresh);
+		}
 	}
 	return 0;
 }