Message ID | 20240402215405.432863-1-hli@netflix.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Series | [net-next] tcp: update window_clamp together with scaling_ratio |
On Tue, 2 Apr 2024 14:54:06 -0700 Hechao Li wrote:
> After commit dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale"),
> we noticed an application-level timeout due to reduced throughput. This
> can be reproduced by the following minimal client and server program.

Hi Hechao, nice to e-meet you :)

I suspect Eric may say that SO_RCVBUF = 64k is not very reasonable. But I'll leave the technical review to him.

What I noticed is that our cryptic CI system appears to point at this change as breaking BPF tests:
https://netdev.bots.linux.dev/flakes.html?min-flip=0&pw-n=0&tn-needle=gh-bpf

We accumulate all outstanding patches and test together. BPF broke at net-next-2024-04-03--00-00, and:

$ cidiff origin/net-next-2024-04-02--21-00 \
         origin/net-next-2024-04-03--00-00
+tcp: update window_clamp together with scaling_ratio
+tools: ynl: ethtool.py: Output timestamping statistics from tsinfo-get operation
+netlink: specs: ethtool: add header-flags enumeration
+net/mlx5e: Implement ethtool hardware timestamping statistics
+net/mlx5e: Introduce timestamps statistic counter for Tx DMA layer
+net/mlx5e: Introduce lost_cqe statistic counter for PTP Tx port timestamping CQ
+ethtool: add interface to read Tx hardware timestamping statistics

The other patches are all driver stuff.. Here's the BPF CI output:
https://github.com/kernel-patches/bpf/actions/runs/8538300303
On Tue, Apr 2, 2024 at 11:56 PM Hechao Li <hli@netflix.com> wrote:
>
> After commit dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale"), we noticed an application-level timeout due to reduced throughput. This can be reproduced by the following minimal client and server program.
>
> server:
> ...
>
> Before the commit, it takes around 22 seconds to transfer 10M of data. After the commit, it takes 40 seconds. Because our application has a 30-second timeout, this regression broke the application.
>
> The reason that it takes longer to transfer data is that tp->scaling_ratio is initialized to a value that results in ~0.25 of rcvbuf. In our case, SO_RCVBUF is set to 65536 by the application, which translates to 2 * 65536 = 131,072 bytes in rcvbuf and hence a ~28k initial receive window.

What driver are you using, and what MTU is set?

If you get a 0.25 ratio, that is because a driver is oversizing rx skbs.

SO_RCVBUF 65536 would indeed map to 32768 bytes of payload.

> Later, even though the scaling_ratio is updated to a more accurate skb->len/skb->truesize, which is ~0.66 in our environment, the window stays at ~0.25 * rcvbuf. This is because tp->window_clamp does not change together with the tp->scaling_ratio update. As a result, the window size is capped at the initial window_clamp, which is also ~0.25 * rcvbuf, and never grows bigger.
>
> This patch updates window_clamp along with scaling_ratio. It changes the calculation of the initial rcv_wscale as well to make sure the scale factor is also not capped by the initial window_clamp.

This is very suspicious.

> A comment from Tycho Andersen <tycho@tycho.pizza> is "What happens if someone has done setsockopt(sk, TCP_WINDOW_CLAMP) explicitly; will this and the above not violate userspace's desire to clamp the window size?". This comment is not addressed in this patch because the existing code also updates window_clamp at several places without checking if TCP_WINDOW_CLAMP is set by user space. Adding this check now may break certain user space assumptions (similar to how the original patch broke the assumption of buffer overhead being 50%). For example, a user space program may set TCP_WINDOW_CLAMP while the application behavior relies on window_clamp being adjusted by the kernel as it is today.

Quite frankly, I would prefer we increase the tcp_rmem[] sysctls instead of trying to accommodate too-small SO_RCVBUF values.

This would benefit old applications that were written 20 years ago.
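To make the numbers above concrete, here is a rough user-space sketch of the window arithmetic being discussed. It is not kernel code; it assumes TCP_RMEM_TO_WIN_SCALE == 8 (so scaling_ratio is expressed as a fraction of 256) and uses 65 and 169 as illustrative stand-ins for the ~0.25 default and the ~0.66 measured ratio.

#include <stdint.h>
#include <stdio.h>

#define TCP_RMEM_TO_WIN_SCALE 8	/* assumed: scaling_ratio is out of 256 */

/* Rough equivalent of the kernel's rcvbuf-to-window conversion:
 * window = space * scaling_ratio / 256.
 */
static int win_from_space(uint8_t scaling_ratio, int space)
{
	return (int)(((int64_t)space * scaling_ratio) >> TCP_RMEM_TO_WIN_SCALE);
}

int main(void)
{
	/* setsockopt(SO_RCVBUF, 65536) is doubled by the kernel. */
	int rcvbuf = 2 * 65536;

	printf("~0.25 default ratio:  %d bytes of window\n",
	       win_from_space(65, rcvbuf));	/* ~32 KB */
	printf("~0.66 measured ratio: %d bytes of window\n",
	       win_from_space(169, rcvbuf));	/* ~84 KB */
	return 0;
}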
On Wed, Apr 3, 2024 at 4:22 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Apr 2, 2024 at 11:56 PM Hechao Li <hli@netflix.com> wrote:
> >
> > After commit dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale"), we noticed an application-level timeout due to reduced throughput. This can be reproduced by the following minimal client and server program.
> >
> > server:
> > ...
> >
> > Before the commit, it takes around 22 seconds to transfer 10M of data. After the commit, it takes 40 seconds. Because our application has a 30-second timeout, this regression broke the application.
> >
> > The reason that it takes longer to transfer data is that tp->scaling_ratio is initialized to a value that results in ~0.25 of rcvbuf. In our case, SO_RCVBUF is set to 65536 by the application, which translates to 2 * 65536 = 131,072 bytes in rcvbuf and hence a ~28k initial receive window.
>
> What driver are you using, and what MTU is set?
>
> If you get a 0.25 ratio, that is because a driver is oversizing rx skbs.
>
> SO_RCVBUF 65536 would indeed map to 32768 bytes of payload.
>
> > Later, even though the scaling_ratio is updated to a more accurate skb->len/skb->truesize, which is ~0.66 in our environment, the window stays at ~0.25 * rcvbuf. This is because tp->window_clamp does not change together with the tp->scaling_ratio update. As a result, the window size is capped at the initial window_clamp, which is also ~0.25 * rcvbuf, and never grows bigger.

Sorry, I missed this part. I understand better now.

I wonder if we should at least test (sk->sk_userlocks & SOCK_RCVBUF_LOCK) or something...

For autotuned flows (the majority of cases), tp->window_clamp is changed from tcp_rcv_space_adjust().

I think we need to audit all the tp->window_clamp changes a bit more.

> > This patch updates window_clamp along with scaling_ratio. It changes the calculation of the initial rcv_wscale as well to make sure the scale factor is also not capped by the initial window_clamp.
>
> This is very suspicious.
>
> > A comment from Tycho Andersen <tycho@tycho.pizza> is "What happens if someone has done setsockopt(sk, TCP_WINDOW_CLAMP) explicitly; will this and the above not violate userspace's desire to clamp the window size?". This comment is not addressed in this patch because the existing code also updates window_clamp at several places without checking if TCP_WINDOW_CLAMP is set by user space. Adding this check now may break certain user space assumptions (similar to how the original patch broke the assumption of buffer overhead being 50%). For example, a user space program may set TCP_WINDOW_CLAMP while the application behavior relies on window_clamp being adjusted by the kernel as it is today.
>
> Quite frankly, I would prefer we increase the tcp_rmem[] sysctls instead of trying to accommodate too-small SO_RCVBUF values.
>
> This would benefit old applications that were written 20 years ago.
On 24/04/03 04:49PM, Eric Dumazet wrote:
> On Wed, Apr 3, 2024 at 4:22 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Tue, Apr 2, 2024 at 11:56 PM Hechao Li <hli@netflix.com> wrote:
> > >
> > > After commit dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale"), we noticed an application-level timeout due to reduced throughput. This can be reproduced by the following minimal client and server program.
> > >
> > > server:
> > > ...
> > >
> > > Before the commit, it takes around 22 seconds to transfer 10M of data. After the commit, it takes 40 seconds. Because our application has a 30-second timeout, this regression broke the application.
> > >
> > > The reason that it takes longer to transfer data is that tp->scaling_ratio is initialized to a value that results in ~0.25 of rcvbuf. In our case, SO_RCVBUF is set to 65536 by the application, which translates to 2 * 65536 = 131,072 bytes in rcvbuf and hence a ~28k initial receive window.
> >
> > What driver are you using, and what MTU is set?

The driver is the AWS ENA driver. This is cross-region/internet traffic, so the MTU is 1500.

> > If you get a 0.25 ratio, that is because a driver is oversizing rx skbs.
> >
> > SO_RCVBUF 65536 would indeed map to 32768 bytes of payload.

The 0.25 ratio is the initial default ratio calculated using

#define TCP_DEFAULT_SCALING_RATIO ((1200 << TCP_RMEM_TO_WIN_SCALE) / \
                                   SKB_TRUESIZE(4096))

I think this is a constant 0.25, no?

Later, with skb->len/skb->truesize, we get 0.66. However, the window can't grow to this ratio because window_clamp stays at the initial value, which is the initial tcp_full_space(sk), which is roughly 0.25 * rcvbuf.

> > > Later, even though the scaling_ratio is updated to a more accurate skb->len/skb->truesize, which is ~0.66 in our environment, the window stays at ~0.25 * rcvbuf. This is because tp->window_clamp does not change together with the tp->scaling_ratio update. As a result, the window size is capped at the initial window_clamp, which is also ~0.25 * rcvbuf, and never grows bigger.
>
> Sorry, I missed this part. I understand better now.
>
> I wonder if we should at least test (sk->sk_userlocks & SOCK_RCVBUF_LOCK) or something...

In our case, the application does set SOCK_RCVBUF_LOCK. But meanwhile, we also want window_clamp to grow according to the ratio, so that the window can grow beyond the original 0.25 * rcvbuf.

> For autotuned flows (the majority of cases), tp->window_clamp is changed from tcp_rcv_space_adjust().
>
> I think we need to audit all the tp->window_clamp changes a bit more.
>
> > > This patch updates window_clamp along with scaling_ratio. It changes the calculation of the initial rcv_wscale as well to make sure the scale factor is also not capped by the initial window_clamp.
> >
> > This is very suspicious.
> >
> > > A comment from Tycho Andersen <tycho@tycho.pizza> is "What happens if someone has done setsockopt(sk, TCP_WINDOW_CLAMP) explicitly; will this and the above not violate userspace's desire to clamp the window size?". This comment is not addressed in this patch because the existing code also updates window_clamp at several places without checking if TCP_WINDOW_CLAMP is set by user space. Adding this check now may break certain user space assumptions (similar to how the original patch broke the assumption of buffer overhead being 50%). For example, a user space program may set TCP_WINDOW_CLAMP while the application behavior relies on window_clamp being adjusted by the kernel as it is today.
> >
> > Quite frankly, I would prefer we increase the tcp_rmem[] sysctls instead of trying to accommodate too-small SO_RCVBUF values.
> >
> > This would benefit old applications that were written 20 years ago.

The application is Kafka, and it has a default config of 64KB SO_RCVBUF (https://docs.confluent.io/platform/current/installation/configuration/consumer-configs.html#receive-buffer-bytes), so in this case it is limited by SO_RCVBUF and not tcp_rmem. It also has a default request timeout of 30 seconds (https://docs.confluent.io/platform/current/installation/configuration/consumer-configs.html#request-timeout-ms).

The combination of these two configs requires a certain amount of application data (in our case 10M) to be transferred within 30 seconds. But a 32k window size can't achieve this, causing an application timeout. Our goal was to upgrade the kernel without having to update applications, if possible.
On Wed, Apr 3, 2024 at 6:30 PM Hechao Li <hli@netflix.com> wrote:
>
> On 24/04/03 04:49PM, Eric Dumazet wrote:
> > On Wed, Apr 3, 2024 at 4:22 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Tue, Apr 2, 2024 at 11:56 PM Hechao Li <hli@netflix.com> wrote:
> > > >
> > > > After commit dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale"), we noticed an application-level timeout due to reduced throughput. This can be reproduced by the following minimal client and server program.
> > > >
> > > > server:
> > > > ...
> > > >
> > > > Before the commit, it takes around 22 seconds to transfer 10M of data. After the commit, it takes 40 seconds. Because our application has a 30-second timeout, this regression broke the application.
> > > >
> > > > The reason that it takes longer to transfer data is that tp->scaling_ratio is initialized to a value that results in ~0.25 of rcvbuf. In our case, SO_RCVBUF is set to 65536 by the application, which translates to 2 * 65536 = 131,072 bytes in rcvbuf and hence a ~28k initial receive window.
> > >
> > > What driver are you using, and what MTU is set?
>
> The driver is the AWS ENA driver. This is cross-region/internet traffic, so the MTU is 1500.
>
> > > If you get a 0.25 ratio, that is because a driver is oversizing rx skbs.
> > >
> > > SO_RCVBUF 65536 would indeed map to 32768 bytes of payload.
>
> The 0.25 ratio is the initial default ratio calculated using
>
> #define TCP_DEFAULT_SCALING_RATIO ((1200 << TCP_RMEM_TO_WIN_SCALE) / \
>                                    SKB_TRUESIZE(4096))
>
> I think this is a constant 0.25, no?

This depends on skb metadata size, which changes over time.

With MAX_SKB_FRAGS == 17, this is .25390625
With MAX_SKB_FRAGS == 45, this is .234375

> Later, with skb->len/skb->truesize, we get 0.66. However, the window can't grow to this ratio because window_clamp stays at the initial value, which is the initial tcp_full_space(sk), which is roughly 0.25 * rcvbuf.

Sure. Please address Jakub's feedback about the tests.
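A small illustration of why the default is not quite a constant 0.25: TCP_DEFAULT_SCALING_RATIO divides (1200 << TCP_RMEM_TO_WIN_SCALE) by SKB_TRUESIZE(4096), and the truesize term includes struct sk_buff and struct skb_shared_info overhead, the latter growing with MAX_SKB_FRAGS. The sketch below is plain user-space code; the truesize values 4672 and 5120 are assumptions chosen only to reproduce the two ratios Eric quotes, not authoritative kernel numbers.

#include <stdio.h>

#define TCP_RMEM_TO_WIN_SCALE 8	/* assumed value */

/* Mirrors (1200 << TCP_RMEM_TO_WIN_SCALE) / SKB_TRUESIZE(4096). */
static unsigned int default_scaling_ratio(unsigned int truesize_4k)
{
	return (1200u << TCP_RMEM_TO_WIN_SCALE) / truesize_4k;
}

int main(void)
{
	/* Assumed SKB_TRUESIZE(4096) for two MAX_SKB_FRAGS configurations. */
	printf("MAX_SKB_FRAGS == 17: %u/256 = %.8f\n",
	       default_scaling_ratio(4672), default_scaling_ratio(4672) / 256.0);
	printf("MAX_SKB_FRAGS == 45: %u/256 = %.6f\n",
	       default_scaling_ratio(5120), default_scaling_ratio(5120) / 256.0);
	return 0;
}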
On Wed, Apr 3, 2024 at 6:43 PM Eric Dumazet <edumazet@google.com> wrote:
>
> > Later, with skb->len/skb->truesize, we get 0.66. However, the window can't grow to this ratio because window_clamp stays at the initial value, which is the initial tcp_full_space(sk), which is roughly 0.25 * rcvbuf.
>
> Sure. Please address Jakub's feedback about the tests.

I think a less risky patch would be the following one. If you agree, send a V2 of the patch.

Also remove all the C code from the changelog; it has no real value there. A changelog should be not too small, not too big.

If you want to capture this test code, please cook a separate patch for net-next to add it to tools/testing/selftests/net. (But I guess iperf3 could be used instead; iperf3 is already used in tools/testing/selftests.)

Thanks!

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6ae35199d3b3c159ba029ff74b109c56a7c7d2fc..2bcf30381d75f84acf3b0276a4b348edeb7f0781 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1539,11 +1539,10 @@ static inline int tcp_space_from_win(const struct sock *sk, int win)
 	return __tcp_space_from_win(tcp_sk(sk)->scaling_ratio, win);
 }
 
-/* Assume a conservative default of 1200 bytes of payload per 4K page.
+/* Assume a 50% default for skb->len/skb->truesize ratio.
  * This may be adjusted later in tcp_measure_rcv_mss().
  */
-#define TCP_DEFAULT_SCALING_RATIO ((1200 << TCP_RMEM_TO_WIN_SCALE) / \
-				   SKB_TRUESIZE(4096))
+#define TCP_DEFAULT_SCALING_RATIO (1 << (TCP_RMEM_TO_WIN_SCALE - 1))
 
 static inline void tcp_scaling_ratio_init(struct sock *sk)
 {
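For comparison, a quick back-of-the-envelope look at what the suggested 50% default would mean for the reporter's 64 KB SO_RCVBUF. This is just the arithmetic under the assumption that TCP_RMEM_TO_WIN_SCALE == 8, not a claim about end-to-end behavior.

#include <stdio.h>

#define TCP_RMEM_TO_WIN_SCALE 8	/* assumed value */

int main(void)
{
	/* Suggested default: 1 << (TCP_RMEM_TO_WIN_SCALE - 1) == 128, i.e. 50%. */
	int ratio = 1 << (TCP_RMEM_TO_WIN_SCALE - 1);
	/* SO_RCVBUF 65536 as set by the application, doubled by the kernel. */
	long long rcvbuf = 2 * 65536;

	printf("initial window: %lld bytes\n",
	       (rcvbuf * ratio) >> TCP_RMEM_TO_WIN_SCALE);	/* 65536 */
	return 0;
}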
On Wed, Apr 3, 2024 at 12:30 PM Hechao Li <hli@netflix.com> wrote:
> The application is Kafka, and it has a default config of 64KB SO_RCVBUF (https://docs.confluent.io/platform/current/installation/configuration/consumer-configs.html#receive-buffer-bytes), so in this case it is limited by SO_RCVBUF and not tcp_rmem. It also has a default request timeout of 30 seconds (https://docs.confluent.io/platform/current/installation/configuration/consumer-configs.html#request-timeout-ms).
> The combination of these two configs requires a certain amount of application data (in our case 10M) to be transferred within 30 seconds. But a 32k window size can't achieve this, causing an application timeout. Our goal was to upgrade the kernel without having to update applications, if possible.

Hechao, can you please file a bug against Kafka to get them to stop using SO_RCVBUF, and allow receive buffer autotuning? This default value of 64 Kbytes will cripple performance in many scenarios, especially for WAN traffic.

I guess that would boil down to asking for the default receive.buffer.bytes to be -1 rather than 64 Kbytes.

Looks like you can file bugs here:
https://issues.apache.org/jira/browse/KAFKA

thanks,
neal
On 24/04/09 12:51PM, Neal Cardwell wrote:
> On Wed, Apr 3, 2024 at 12:30 PM Hechao Li <hli@netflix.com> wrote:
> > The application is Kafka, and it has a default config of 64KB SO_RCVBUF (https://docs.confluent.io/platform/current/installation/configuration/consumer-configs.html#receive-buffer-bytes), so in this case it is limited by SO_RCVBUF and not tcp_rmem. It also has a default request timeout of 30 seconds (https://docs.confluent.io/platform/current/installation/configuration/consumer-configs.html#request-timeout-ms).
> > The combination of these two configs requires a certain amount of application data (in our case 10M) to be transferred within 30 seconds. But a 32k window size can't achieve this, causing an application timeout. Our goal was to upgrade the kernel without having to update applications, if possible.
>
> Hechao, can you please file a bug against Kafka to get them to stop using SO_RCVBUF, and allow receive buffer autotuning? This default value of 64 Kbytes will cripple performance in many scenarios, especially for WAN traffic.
>
> I guess that would boil down to asking for the default receive.buffer.bytes to be -1 rather than 64 Kbytes.
>
> Looks like you can file bugs here:
> https://issues.apache.org/jira/browse/KAFKA
>
> thanks,
> neal

Makes sense. Filed: https://issues.apache.org/jira/browse/KAFKA-16496

I don't know why Kafka set 64k as the default in the first place. But I was wondering if there is any use case that requires the user to set SO_RCVBUF rather than relying on autotuning? Eventually, do we want to deprecate support for setting SO_RCVBUF at all?

Thank you.
Hechao
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 5d874817a78d..a0cfa2b910d5 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -237,9 +237,13 @@ static void tcp_measure_rcv_mss(struct sock *sk, const struct sk_buff *skb)
 	 */
 	if (unlikely(len != icsk->icsk_ack.rcv_mss)) {
 		u64 val = (u64)skb->len << TCP_RMEM_TO_WIN_SCALE;
+		struct tcp_sock *tp = tcp_sk(sk);
 
 		do_div(val, skb->truesize);
-		tcp_sk(sk)->scaling_ratio = val ? val : 1;
+		tp->scaling_ratio = val ? val : 1;
+
+		/* Make the window_clamp follow along. */
+		tp->window_clamp = tcp_full_space(sk);
 	}
 	icsk->icsk_ack.rcv_mss = min_t(unsigned int, len,
 				       tcp_sk(sk)->advmss);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e3167ad96567..2341e3f9db58 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -239,7 +239,7 @@ void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss,
 		/* Set window scaling on max possible window */
 		space = max_t(u32, space, READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_rmem[2]));
 		space = max_t(u32, space, READ_ONCE(sysctl_rmem_max));
-		space = min_t(u32, space, *window_clamp);
+		space = min_t(u32, space, sk->sk_rcvbuf);
 		*rcv_wscale = clamp_t(int, ilog2(space) - 15,
 				      0, TCP_MAX_WSCALE);
 	}
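As a side note on the second hunk: it changes which value caps `space` before the receive window scale factor is chosen. A rough user-space sketch of that selection follows; the specific `space` inputs are illustrative examples for a 64 KB SO_RCVBUF, not values the kernel is guaranteed to see.

#include <stdio.h>

#define TCP_MAX_WSCALE 14	/* maximum window scale per RFC 7323 */

static int ilog2_u32(unsigned int v)
{
	int r = -1;

	while (v) {
		v >>= 1;
		r++;
	}
	return r;
}

/* Mirrors: *rcv_wscale = clamp_t(int, ilog2(space) - 15, 0, TCP_MAX_WSCALE); */
static int pick_rcv_wscale(unsigned int space)
{
	int ws = ilog2_u32(space) - 15;

	if (ws < 0)
		ws = 0;
	if (ws > TCP_MAX_WSCALE)
		ws = TCP_MAX_WSCALE;
	return ws;
}

int main(void)
{
	/* Capped by a ~0.25 * rcvbuf window_clamp: no window scaling at all. */
	printf("space=%u  -> wscale=%d\n", 33280u, pick_rcv_wscale(33280u));
	/* Capped by sk_rcvbuf (2 * 65536) instead: a wscale of 2. */
	printf("space=%u -> wscale=%d\n", 131072u, pick_rcv_wscale(131072u));
	return 0;
}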