Message ID | 20210111222411.232916-5-hcaldwel@akamai.com (mailing list archive)
---|---
State | Changes Requested
Delegated to | Netdev Maintainers
Series | Fix receive window restriction
Context | Check | Description
---|---|---
netdev/cover_letter | success |
netdev/fixes_present | success |
netdev/patch_count | success |
netdev/tree_selection | success | Clearly marked for net-next
netdev/subject_prefix | success |
netdev/cc_maintainers | warning | 4 maintainers not CCed: kuba@kernel.org davem@davemloft.net kuznet@ms2.inr.ac.ru yoshfuji@linux-ipv6.org
netdev/source_inline | success | Was 0 now: 0
netdev/verify_signedoff | success |
netdev/module_param | success | Was 0 now: 0
netdev/build_32bit | success | Errors and warnings before: 1 this patch: 1
netdev/kdoc | success | Errors and warnings before: 0 this patch: 0
netdev/verify_fixes | success |
netdev/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 8 lines checked
netdev/build_allmodconfig_warn | success | Errors and warnings before: 1 this patch: 1
netdev/header_inline | success |
netdev/stable | success | Stable not CCed
On Mon, Jan 11, 2021 at 11:24 PM Heath Caldwell <hcaldwel@akamai.com> wrote:
>
> Remove the 64KB limit imposed on the initial receive window.
>
> The limit was added by commit a337531b942b ("tcp: up initial rmem to 128KB
> and SYN rwin to around 64KB").
>
> This change removes that limit so that the initial receive window can be
> arbitrarily large (within existing limits and depending on the current
> configuration).
>
> The arbitrary, internal limit can interfere with research because it
> irremediably restricts the receive window at the beginning of a connection
> below what would be expected when explicitly configuring the receive
> buffer size.
>
> -
>
> Here is a scenario to illustrate how the limit might cause undesirable
> behavior:
>
> Consider an installation where all parts of a network are either
> controlled or sufficiently monitored and there is a desired use case
> where a 1MB object is transmitted over a newly created TCP connection in
> a single initial burst.
>
> Let MSS be 1460 bytes.
>
> The initial cwnd would need to be at least:
>
>              |- 1048576 bytes -|
>  cwnd_init = | --------------- | = 719 packets
>              |  1460 bytes/pkt |
>
> Let us say that it was determined that the network could handle bursts
> of 800 full sized packets at the frequency at which the connections
> under consideration would be expected to occur, so the sending host is
> configured to use an initial cwnd of 800 for these connections.
>
> In order for the receiver to be able to receive a 1MB burst, it needs to
> have a sufficiently large receive buffer for the connection.
> Considering overhead, let us say that the receiver is configured to
> initially use a receive buffer of 2148K for TCP connections:
>
>   net.ipv4.tcp_rmem = 4096 2199552 6291456
>
> Let rtt be 50 milliseconds.
>
> If the entire object is sent in a single burst, then the theoretically
> highest achievable throughput (discounting handshake and request) should
> be:
>
>                 bits   1048576 bytes   8 bits
>  T_upperbound = ---- = ------------- * ------ =~ 168 Mbit/s
>                 rtt        0.05 s      1 byte
>
> But, if flow control limits throughput because the receive window is
> initially limited to 64KB and grows at a rate of quadrupling every
> rtt (maybe not accurate but seems to be optimistic from observation), we
> should expect the highest achievable throughput to be limited to:
>
>   bytes_sent = 65536 * (1 + 4)^(t / rtt)
>
> When bytes_sent = object size = 1048576:
>
>   1048576 = 65536 * (1 + 4)^(t / rtt)
>   t = rtt * log_5(16)
>
>                       1048576 bytes                 8 bits
>  T_limited = ------------------------------------- * ------
>                    /      |- rtt * log_5(16) -| \    1 byte
>              rtt * ( 1 +  | ------------------ | )
>                    \      |        rtt        | /
>
>                1048576 bytes     8 bits
>            = ----------------- * ------
>              0.05 s * (1 + 2)    1 byte
>
>            =~ 55.9 Mbit/s
>
> In short: for this scenario, the 64KB limit on the initial receive
> window increases the achievable acknowledged delivery time from 1 rtt
> to (optimistically) 3 rtts, reducing the achievable throughput from
> 168 Mbit/s to 55.9 Mbit/s.
>
> Here is an experimental illustration:
>
> A time sequence chart of a packet capture taken on the sender for a
> scenario similar to what is described above, where the receiver had the
> 64KB limit in place:
>
> Symbols:
>   .:' - Data packets
>   _-  - Window advertised by receiver
>
>   y-axis - Relative sequence number
>   x-axis - Time from sending of first data packet, in seconds
>
> [Time sequence chart: relative sequence number 0..3212891 vs. time
>  0.000..0.195 s]
>
> Note that the sender was not able to send the object in a single initial
> burst and that it took around 4 rtts for the object to be fully
> acknowledged.
>
> A time sequence chart of a packet capture taken for the same scenario,
> but with the limit removed:
>
> [Time sequence chart: relative sequence number 0..2147035 vs. time
>  0.000..0.057 s]
>
> Note that the sender was able to send the entire object in a single
> burst and that it was fully acknowledged after a little over 1 rtt.
>
> Signed-off-by: Heath Caldwell <hcaldwel@akamai.com>
> ---
>  net/ipv4/tcp_output.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 1d2773cd02c8..d7ab1f5f071e 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -232,7 +232,7 @@ void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss,
>  	if (sock_net(sk)->ipv4.sysctl_tcp_workaround_signed_windows)
>  		(*rcv_wnd) = min(space, MAX_TCP_WINDOW);
>  	else
> -		(*rcv_wnd) = min_t(u32, space, U16_MAX);
> +		(*rcv_wnd) = space;
>
>  	if (init_rcv_wnd)
>  		*rcv_wnd = min(*rcv_wnd, init_rcv_wnd * mss);
> --
> 2.28.0
>

I think the whole patch series is an attempt to badly break TCP stack.

Hint : 64K is really the max allowed by TCP standards. Yes, this is sad,
but this is it.

I will not spend hours of work running packetdrill tests over your
changes, but I am sure they are now quite broken.

If you believe auto tuning is broken, fix it properly, without trying to
change all the code so that you can understand it.

I strongly advise you read RFC 7323 before doing any changes in TCP
stack, and asking us to spend time reviewing your patches.

If you want to do research, this is fine, but please do not break
production TCP stack.

Thank you.
On 2021-01-12 09:30 (+0100), Eric Dumazet <edumazet@google.com> wrote:
> I think the whole patch series is an attempt to badly break TCP stack.

Can you explain the concern that you have about how these changes might
break the TCP stack?

Patches 1 and 3 fix clear bugs.  Patches 2 and 4 might be arguable,
though.

Is your objection primarily about the limit removed by patch 4?

> Hint : 64K is really the max allowed by TCP standards. Yes, this is
> sad, but this is it.

Do you mean the limit imposed by the size of the "Window Size" header
field?  This limitation is directly addressed by the check in
__tcp_transmit_skb():

	if (likely(!(tcb->tcp_flags & TCPHDR_SYN))) {
		th->window = htons(tcp_select_window(sk));
		tcp_ecn_send(sk, skb, th, tcp_header_size);
	} else {
		/* RFC1323: The window in SYN & SYN/ACK segments
		 * is never scaled.
		 */
		th->window = htons(min(tp->rcv_wnd, 65535U));
	}

and checking (and capping it there) allows for the field to not overflow
while also not artificially restricting the size of the window which
will later be advertised (once window scaling is negotiated).

> I will not spend hours of work running packetdrill tests over your
> changes, but I am sure they are now quite broken.
>
> If you believe auto tuning is broken, fix it properly, without trying
> to change all the code so that you can understand it.

The removal of the limit specifically addresses the situation where auto
tuning cannot work: on the initial burst.  There is no way to know
whether an installation desires to receive a larger first burst unless
it is specifically configured - and this limit prevents such
configuration.

> I strongly advise you read RFC 7323 before doing any changes in TCP
> stack, and asking us to spend time reviewing your patches.

Can you point out the part of the RFC which would be violated by
initially (that is, the first packet after the SYN) advertising a window
larger than 64KB?

> If you want to do research, this is fine, but please do not break
> production TCP stack.
>
> Thank you.
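For readers unfamiliar with the mechanics under discussion: window
scaling (RFC 7323) is what lets post-handshake segments advertise more
than 64KB through the 16-bit header field. Below is a minimal standalone
sketch of that encoding; the helper name and values are invented for
illustration and are not kernel code:

#include <stdint.h>
#include <stdio.h>

/* Illustration only: once both SYNs carry a window scale option, each
 * later segment's 16-bit window field is interpreted as
 * (field << rcv_wscale).  The SYN and SYN/ACK themselves are never
 * scaled, which is why the kernel code quoted above caps them at 65535.
 */
static uint16_t window_field(uint32_t rcv_wnd, int rcv_wscale)
{
	return (uint16_t)(rcv_wnd >> rcv_wscale);
}

int main(void)
{
	uint32_t rcv_wnd = 2097152;	/* a 2MB receive window */
	int wscale = 6;			/* scale factor sent in our SYN */
	uint16_t field = window_field(rcv_wnd, wscale);

	printf("16-bit field %u decodes to a %u byte window\n",
	       (unsigned)field, (unsigned)field << wscale);
	return 0;
}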
On Tue, Jan 12, 2021 at 5:02 PM Heath Caldwell <hcaldwel@akamai.com> wrote: > > On 2021-01-12 09:30 (+0100), Eric Dumazet <edumazet@google.com> wrote: > > I think the whole patch series is an attempt to badly break TCP stack. > > Can you explain the concern that you have about how these changes might > break the TCP stack? > > Patches 1 and 3 fix clear bugs. Not clear to me at least. If they were bug fixes, a FIxes: tag would be provided. You are a first time contributor to linux TCP stack, so better make sure your claims are solid. > > Patches 2 and 4 might be arguable, though. So we have to pick up whatever pleases us ? > > Is you objection primarily about the limit removed by patch 4? > > > Hint : 64K is really the max allowed by TCP standards. Yes, this is > > sad, but this is it. > > Do you mean the limit imposed by the size of the "Window Size" header > field? This limitation is directly addressed by the check in > __tcp_transmit_skb(): > > if (likely(!(tcb->tcp_flags & TCPHDR_SYN))) { > th->window = htons(tcp_select_window(sk)); > tcp_ecn_send(sk, skb, th, tcp_header_size); > } else { > /* RFC1323: The window in SYN & SYN/ACK segments > * is never scaled. > */ > th->window = htons(min(tp->rcv_wnd, 65535U)); > } > > and checking (and capping it there) allows for the field to not overflow > while also not artificially restricting the size of the window which > will later be advertised (once window scaling is negotiated). > > > I will not spend hours of work running packetdrill tests over your > > changes, but I am sure they are now quite broken. > > > > If you believe auto tuning is broken, fix it properly, without trying > > to change all the code so that you can understand it. > > The removal of the limit specifically addresses the situation where auto > tuning cannot work: on the initial burst. Which standard allows for a very big initial burst ? AFAIK, IW10 got a lot of pushback, and that was only for a burst of ~14600 bytes Allowing arbitrarily large windows needs IETF approval. > There is no way to know > whether an installation desires to receive a larger first burst unless > it is specifically configured - and this limit prevents such > configuration. > > > I strongly advise you read RFC 7323 before doing any changes in TCP > > stack, and asking us to spend time reviewing your patches. > > Can you point out the part of the RFC which would be violated by > initially (that is, the first packet after the SYN) advertising a window > larger than 64KB? We handle gradual increase of rwin based on the behavior of applications and senders (skb len/truesize ratio) Again if you believe autotuning is broken please fix it, instead of throwing it away. Changing initial RWIN can have subtle differences on how fast DRS algo can kick in. This kind of change would need to be heavily tested, with millions of TCP flows in the wild. (ie not in a lab environment) Have you done this ? What happens if flows are malicious and sent 100 bytes per segment ?
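To put rough numbers on the closing question above: the memory charged
to a receiving socket is each skb's truesize (payload plus metadata and
allocation overhead), not just the payload. A back-of-the-envelope
sketch, where the 768-byte per-skb cost is an assumed figure for
illustration, not a measured kernel constant:

#include <stdio.h>

/* Rough illustration of the "100 bytes per segment" concern: a sender
 * that fills a large advertised window with tiny segments forces the
 * receiver to pay per-skb overhead (truesize) far exceeding the data.
 */
int main(void)
{
	const long window   = 2 * 1024 * 1024;	/* bytes advertised up front */
	const long payload  = 100;		/* bytes per malicious segment */
	const long truesize = 768;		/* assumed per-skb memory cost */

	long segments = window / payload;

	printf("%ld segments: ~%ld KB charged to receive %ld KB of data\n",
	       segments, segments * truesize / 1024, window / 1024);
	return 0;
}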
On 2021-01-12 18:05 (+0100), Eric Dumazet <edumazet@google.com> wrote:
> On Tue, Jan 12, 2021 at 5:02 PM Heath Caldwell <hcaldwel@akamai.com> wrote:
> >
> > On 2021-01-12 09:30 (+0100), Eric Dumazet <edumazet@google.com> wrote:
> > > I think the whole patch series is an attempt to badly break TCP stack.
> >
> > Can you explain the concern that you have about how these changes might
> > break the TCP stack?
> >
> > Patches 1 and 3 fix clear bugs.
>
> Not clear to me at least.
>
> If they were bug fixes, a Fixes: tag would be provided.

The underlying bugs that are addressed in patches 1 and 3 are present in
1da177e4c3f4 ("Linux-2.6.12-rc2") which looks to be the earliest parent
commit in the repository.  What should I do for a Fixes: tag in this
case?

> You are a first time contributor to linux TCP stack, so better make
> sure your claims are solid.

I fear that I may not have expressed the problems and solutions in a
manner that imparted the ideas well.

Maybe I added too much detail in the description for patch 1, which may
have obscured the problem: val is capped to sysctl_rmem_max *before* it
is doubled (resulting in the possibility for sk_rcvbuf to be set to
2*sysctl_rmem_max, rather than it being capped at sysctl_rmem_max).

Maybe I was not explicit enough in the description for patch 3: space is
expanded into sock_net(sk)->ipv4.sysctl_tcp_rmem[2] and sysctl_rmem_max
without first shrinking them to discount the overhead.

> > Patches 2 and 4 might be arguable, though.
>
> So we have to pick up whatever pleases us ?

I have been treating all of these changes together because they all kind
of work together to provide a consistent model and configurability for
the initial receive window.

Patches 1 and 3 address bugs.
Patch 2 addresses an inconsistency in how overhead is treated specially
for TCP sockets.
Patch 4 addresses the 64KB limit which has been imposed.

If you think that they should be treated separately, I can separate them
to not be combined into a series.  Though tcp_space_from_win(),
introduced by patch 2, is also used by patch 3.

> > Is your objection primarily about the limit removed by patch 4?
> >
> > > Hint : 64K is really the max allowed by TCP standards. Yes, this is
> > > sad, but this is it.
> >
> > Do you mean the limit imposed by the size of the "Window Size" header
> > field?  This limitation is directly addressed by the check in
> > __tcp_transmit_skb():
> >
> > 	if (likely(!(tcb->tcp_flags & TCPHDR_SYN))) {
> > 		th->window = htons(tcp_select_window(sk));
> > 		tcp_ecn_send(sk, skb, th, tcp_header_size);
> > 	} else {
> > 		/* RFC1323: The window in SYN & SYN/ACK segments
> > 		 * is never scaled.
> > 		 */
> > 		th->window = htons(min(tp->rcv_wnd, 65535U));
> > 	}
> >
> > and checking (and capping it there) allows for the field to not
> > overflow while also not artificially restricting the size of the
> > window which will later be advertised (once window scaling is
> > negotiated).
> >
> > > I will not spend hours of work running packetdrill tests over your
> > > changes, but I am sure they are now quite broken.
> > >
> > > If you believe auto tuning is broken, fix it properly, without trying
> > > to change all the code so that you can understand it.
> >
> > The removal of the limit specifically addresses the situation where
> > auto tuning cannot work: on the initial burst.
>
> Which standard allows for a very big initial burst ?
> AFAIK, IW10 got a lot of pushback, and that was only for a burst of
> ~14600 bytes.
> Allowing arbitrarily large windows needs IETF approval.

Should not a TCP implementation be generous in what it accepts?

The removal of the limit is not trying to force an increase of the size
of an initially sent burst, it is only trying to allow for the reception
of an initial burst which may have been sent by another host.

A sender should not rely on the receiver's advertised window to prevent
it from causing congestion.

> > There is no way to know
> > whether an installation desires to receive a larger first burst unless
> > it is specifically configured - and this limit prevents such
> > configuration.
> >
> > > I strongly advise you read RFC 7323 before doing any changes in TCP
> > > stack, and asking us to spend time reviewing your patches.
> >
> > Can you point out the part of the RFC which would be violated by
> > initially (that is, the first packet after the SYN) advertising a
> > window larger than 64KB?
>
> We handle gradual increase of rwin based on the behavior of
> applications and senders (skb len/truesize ratio)
>
> Again if you believe autotuning is broken please fix it, instead of
> throwing it away.

Removing the limit is not an attempt to remove the autotuning, it is an
attempt to provide the ability for a user to increase what a TCP
connection can initially receive.  This can be used to configure where
autotuning *starts*.

> Changing initial RWIN can have subtle differences on how fast DRS algo
> can kick in.
>
> This kind of change would need to be heavily tested, with millions of
> TCP flows in the wild.
> (ie not in a lab environment)
>
> Have you done this ?  What happens if flows are malicious and sent 100
> bytes per segment ?

Removing the limit would not remove protection from such an attack.

If there was concern over such an attack at a particular installation,
the default sysctl_tcp_rmem[1] of 128KB could be depended upon (along
with the default tcp_adv_win_scale of 1) to achieve the same behavior
that the limit provides.

Instead, the removal of the limit allows for a user who is not concerned
about such an attack to configure an installation such that the
reception of larger than 64KB initial bursts is possible.

Removing the limit does not impose a choice on the user.  Leaving the
limit in place does.
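The relationship between tcp_rmem[1] and the resulting window mentioned
in the message above comes from the kernel's tcp_win_from_space()
calculation. A simplified standalone sketch of that logic (the function
below mirrors the kernel helper of that era but is illustrative, not a
copy):

#include <stdio.h>

/* Simplified sketch of tcp_win_from_space(): with tcp_adv_win_scale = 1,
 * half of the receive buffer is reserved for overhead, so the default
 * tcp_rmem[1] of 131072 yields a 65536-byte window - the same value the
 * 64KB cap enforced.
 */
static int win_from_space(int space, int tcp_adv_win_scale)
{
	return tcp_adv_win_scale <= 0 ?
		space >> (-tcp_adv_win_scale) :
		space - (space >> tcp_adv_win_scale);
}

int main(void)
{
	printf("rmem 131072, adv_win_scale 1 -> window %d bytes\n",
	       win_from_space(131072, 1));
	return 0;
}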
On Tue, Jan 12, 2021 at 8:25 PM Heath Caldwell <hcaldwel@akamai.com> wrote:
>
> On 2021-01-12 18:05 (+0100), Eric Dumazet <edumazet@google.com> wrote:
> > On Tue, Jan 12, 2021 at 5:02 PM Heath Caldwell <hcaldwel@akamai.com> wrote:
> > >
> > > On 2021-01-12 09:30 (+0100), Eric Dumazet <edumazet@google.com> wrote:
> > > > I think the whole patch series is an attempt to badly break TCP stack.
> > >
> > > Can you explain the concern that you have about how these changes
> > > might break the TCP stack?
> > >
> > > Patches 1 and 3 fix clear bugs.
> >
> > Not clear to me at least.
> >
> > If they were bug fixes, a Fixes: tag would be provided.
>
> The underlying bugs that are addressed in patches 1 and 3 are present in
> 1da177e4c3f4 ("Linux-2.6.12-rc2") which looks to be the earliest parent
> commit in the repository.  What should I do for a Fixes: tag in this
> case?
>
> > You are a first time contributor to linux TCP stack, so better make
> > sure your claims are solid.
>
> I fear that I may not have expressed the problems and solutions in a
> manner that imparted the ideas well.
>
> Maybe I added too much detail in the description for patch 1, which may
> have obscured the problem: val is capped to sysctl_rmem_max *before* it
> is doubled (resulting in the possibility for sk_rcvbuf to be set to
> 2*sysctl_rmem_max, rather than it being capped at sysctl_rmem_max).

This is fine. This has been done forever. Your change might break
applications.

I would advise documenting this fact, since existing behavior will be
kept in many linux hosts for years to come.

> Maybe I was not explicit enough in the description for patch 3: space is
> expanded into sock_net(sk)->ipv4.sysctl_tcp_rmem[2] and sysctl_rmem_max
> without first shrinking them to discount the overhead.
>
> > > Patches 2 and 4 might be arguable, though.
> >
> > So we have to pick up whatever pleases us ?
>
> I have been treating all of these changes together because they all kind
> of work together to provide a consistent model and configurability for
> the initial receive window.
>
> Patches 1 and 3 address bugs.

Maybe, but will break applications.

> Patch 2 addresses an inconsistency in how overhead is treated specially
> for TCP sockets.
> Patch 4 addresses the 64KB limit which has been imposed.

For very good reasons.

This is going nowhere. I will stop right now.
On 2021-01-12 21:26 (+0100), Eric Dumazet <edumazet@google.com> wrote:
> On Tue, Jan 12, 2021 at 8:25 PM Heath Caldwell <hcaldwel@akamai.com> wrote:
> >
> > On 2021-01-12 18:05 (+0100), Eric Dumazet <edumazet@google.com> wrote:
> > > On Tue, Jan 12, 2021 at 5:02 PM Heath Caldwell <hcaldwel@akamai.com> wrote:
> > > >
> > > > On 2021-01-12 09:30 (+0100), Eric Dumazet <edumazet@google.com> wrote:
> > > > > I think the whole patch series is an attempt to badly break TCP stack.
> > > >
> > > > Can you explain the concern that you have about how these changes
> > > > might break the TCP stack?
> > > >
> > > > Patches 1 and 3 fix clear bugs.
> > >
> > > Not clear to me at least.
> > >
> > > If they were bug fixes, a Fixes: tag would be provided.
> >
> > The underlying bugs that are addressed in patches 1 and 3 are present
> > in 1da177e4c3f4 ("Linux-2.6.12-rc2") which looks to be the earliest
> > parent commit in the repository.  What should I do for a Fixes: tag in
> > this case?
> >
> > > You are a first time contributor to linux TCP stack, so better make
> > > sure your claims are solid.
> >
> > I fear that I may not have expressed the problems and solutions in a
> > manner that imparted the ideas well.
> >
> > Maybe I added too much detail in the description for patch 1, which
> > may have obscured the problem: val is capped to sysctl_rmem_max
> > *before* it is doubled (resulting in the possibility for sk_rcvbuf to
> > be set to 2*sysctl_rmem_max, rather than it being capped at
> > sysctl_rmem_max).
>
> This is fine. This has been done forever. Your change might break
> applications.

In what way might applications be broken?

It seems to be a very strange position to allow a configured maximum to
be violated because of obscure precedent.

It does not seem to be a supportable position to allow an application to
violate an installation's configuration because of a chance that the
application may behave differently if a setsockopt() call fails.

What if a system administrator decides to reduce sysctl_rmem_max to half
of the current default?

> I would advise documenting this fact, since existing behavior will be
> kept in many linux hosts for years to come.
>
> > Maybe I was not explicit enough in the description for patch 3: space
> > is expanded into sock_net(sk)->ipv4.sysctl_tcp_rmem[2] and
> > sysctl_rmem_max without first shrinking them to discount the overhead.
> >
> > > > Patches 2 and 4 might be arguable, though.
> > >
> > > So we have to pick up whatever pleases us ?
> >
> > I have been treating all of these changes together because they all
> > kind of work together to provide a consistent model and
> > configurability for the initial receive window.
> >
> > Patches 1 and 3 address bugs.
>
> Maybe, but will break applications.

How might patch 3 break an application?  It merely will reduce the
window scale value to something lower but still capable of representing
the largest window that a particular connection might advertise.

> > Patch 2 addresses an inconsistency in how overhead is treated
> > specially for TCP sockets.
> > Patch 4 addresses the 64KB limit which has been imposed.
>
> For very good reasons.

What are the reasons?

> This is going nowhere. I will stop right now.

That is a shame :(.
On Tue, Jan 12, 2021 at 9:43 PM Heath Caldwell <hcaldwel@akamai.com> wrote:
>
> On 2021-01-12 21:26 (+0100), Eric Dumazet <edumazet@google.com> wrote:
> > This is fine. This has been done forever. Your change might break
> > applications.
>
> In what way might applications be broken?
>
> It seems to be a very strange position to allow a configured maximum to
> be violated because of obscure precedent.
>

Welcome to the real world.

Some applications use setsockopt(), followed by getsockopt() to make
sure their change has been recorded.

Some setups might break, you can be sure of this.

I doubt this issue is serious enough, the code has been there forever.

Please fix the documentation (if any).
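The application pattern described above looks roughly like this. A
minimal sketch relying only on the SO_RCVBUF behavior documented in
socket(7), where the kernel records double the requested value to allow
for bookkeeping overhead:

#include <stdio.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* The pattern Eric describes: set SO_RCVBUF, then read it back to
 * verify.  An application that checks the returned value against an
 * expected cap could notice any change in how rmem_max is applied.
 */
int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	int requested = 4 * 1024 * 1024, actual = 0;
	socklen_t len = sizeof(actual);

	setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested));
	getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len);
	printf("requested %d, kernel recorded %d\n", requested, actual);
	close(fd);
	return 0;
}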
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 1d2773cd02c8..d7ab1f5f071e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -232,7 +232,7 @@ void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss,
 	if (sock_net(sk)->ipv4.sysctl_tcp_workaround_signed_windows)
 		(*rcv_wnd) = min(space, MAX_TCP_WINDOW);
 	else
-		(*rcv_wnd) = min_t(u32, space, U16_MAX);
+		(*rcv_wnd) = space;
 
 	if (init_rcv_wnd)
 		*rcv_wnd = min(*rcv_wnd, init_rcv_wnd * mss);
Remove the 64KB limit imposed on the initial receive window.

The limit was added by commit a337531b942b ("tcp: up initial rmem to 128KB
and SYN rwin to around 64KB").

This change removes that limit so that the initial receive window can be
arbitrarily large (within existing limits and depending on the current
configuration).

The arbitrary, internal limit can interfere with research because it
irremediably restricts the receive window at the beginning of a connection
below what would be expected when explicitly configuring the receive
buffer size.

-

Here is a scenario to illustrate how the limit might cause undesirable
behavior:

Consider an installation where all parts of a network are either
controlled or sufficiently monitored and there is a desired use case
where a 1MB object is transmitted over a newly created TCP connection in
a single initial burst.

Let MSS be 1460 bytes.

The initial cwnd would need to be at least:

             |- 1048576 bytes -|
 cwnd_init = | --------------- | = 719 packets
             |  1460 bytes/pkt |

Let us say that it was determined that the network could handle bursts of
800 full sized packets at the frequency at which the connections under
consideration would be expected to occur, so the sending host is
configured to use an initial cwnd of 800 for these connections.

In order for the receiver to be able to receive a 1MB burst, it needs to
have a sufficiently large receive buffer for the connection.  Considering
overhead, let us say that the receiver is configured to initially use a
receive buffer of 2148K for TCP connections:

  net.ipv4.tcp_rmem = 4096 2199552 6291456

Let rtt be 50 milliseconds.

If the entire object is sent in a single burst, then the theoretically
highest achievable throughput (discounting handshake and request) should
be:

                bits   1048576 bytes   8 bits
 T_upperbound = ---- = ------------- * ------ =~ 168 Mbit/s
                rtt        0.05 s      1 byte

But, if flow control limits throughput because the receive window is
initially limited to 64KB and grows at a rate of quadrupling every
rtt (maybe not accurate but seems to be optimistic from observation), we
should expect the highest achievable throughput to be limited to:

  bytes_sent = 65536 * (1 + 4)^(t / rtt)

When bytes_sent = object size = 1048576:

  1048576 = 65536 * (1 + 4)^(t / rtt)
  t = rtt * log_5(16)

                      1048576 bytes                 8 bits
 T_limited = ------------------------------------- * ------
                   /      |- rtt * log_5(16) -| \    1 byte
             rtt * ( 1 +  | ------------------ | )
                   \      |        rtt        | /

               1048576 bytes     8 bits
           = ----------------- * ------
             0.05 s * (1 + 2)    1 byte

           =~ 55.9 Mbit/s

In short: for this scenario, the 64KB limit on the initial receive window
increases the achievable acknowledged delivery time from 1 rtt
to (optimistically) 3 rtts, reducing the achievable throughput from
168 Mbit/s to 55.9 Mbit/s.

Here is an experimental illustration:

A time sequence chart of a packet capture taken on the sender for a
scenario similar to what is described above, where the receiver had the
64KB limit in place:

Symbols:
  .:' - Data packets
  _-  - Window advertised by receiver

  y-axis - Relative sequence number
  x-axis - Time from sending of first data packet, in seconds

[Time sequence chart: relative sequence number 0..3212891 vs. time
 0.000..0.195 s]

Note that the sender was not able to send the object in a single initial
burst and that it took around 4 rtts for the object to be fully
acknowledged.

A time sequence chart of a packet capture taken for the same scenario,
but with the limit removed:

[Time sequence chart: relative sequence number 0..2147035 vs. time
 0.000..0.057 s]

Note that the sender was able to send the entire object in a single burst
and that it was fully acknowledged after a little over 1 rtt.

Signed-off-by: Heath Caldwell <hcaldwel@akamai.com>
---
 net/ipv4/tcp_output.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
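As a quick check of the arithmetic in the commit message above, a small
standalone sketch using the same assumed inputs (1MB object, 50 ms rtt,
64KB initial window growing fivefold per rtt; compile with -lm):

#include <math.h>
#include <stdio.h>

/* Back-of-the-envelope check of the commit message's numbers: a 1MB
 * object over a 50ms-rtt path, either sent in one burst or throttled by
 * a 64KB initial window that grows 5x each rtt.
 */
int main(void)
{
	const double object = 1048576.0;	/* bytes */
	const double rtt    = 0.05;		/* seconds */

	double t_upper = object * 8 / rtt / 1e6;

	/* rtts needed: 65536 * 5^n >= object  =>  n = ceil(log_5(16)) = 2 */
	double rtts = 1 + ceil(log(object / 65536.0) / log(5.0));
	double t_limited = object * 8 / (rtt * rtts) / 1e6;

	printf("single burst: %.0f Mbit/s; 64KB-limited: %.1f Mbit/s\n",
	       t_upper, t_limited);
	return 0;
}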