From patchwork Tue May 2 15:51:52 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Fastabend X-Patchwork-Id: 13229110 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5CDF0C77B73 for ; Tue, 2 May 2023 15:52:28 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234629AbjEBPwZ (ORCPT ); Tue, 2 May 2023 11:52:25 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58784 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234631AbjEBPwU (ORCPT ); Tue, 2 May 2023 11:52:20 -0400 Received: from mail-pg1-x535.google.com (mail-pg1-x535.google.com [IPv6:2607:f8b0:4864:20::535]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5C5851B7; Tue, 2 May 2023 08:52:13 -0700 (PDT) Received: by mail-pg1-x535.google.com with SMTP id 41be03b00d2f7-51f597c975fso3528951a12.0; Tue, 02 May 2023 08:52:13 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1683042733; x=1685634733; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=0kZKE0BIR8kU5LZVmgLjnz2d1PSXb3yg4NzK45uW/48=; b=ECMHCmM55kveaA/UrPCe7dgGrYYAecTld2nmK7k46Butwm5vMt95MQsALV41XbuzVQ RRJJRthitWe+qyEKC6+uP1qKfLhQ9jwZ/uyoAgSOrwqkEbmVFNJMeaXDWxnR16LfeISy RFUAwrvneDzRD6sdBP5RllmHl2rvHhSPxVhK/WXNgY7nwmPNfXwl2rXQxXrtnXQka7/G OsHjlgTfY1VB4+MTbYje9Ri1ZowcnlW3VWKn3E5zB0GVvfFIjjwIK86R43tm4dG99BFn BDBGkX3mML3prDnW0X8kl4bN3ryhxX0e7/md1xTbNr84TN2gz0458l4iw9oJ+Rovnsxv F18Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1683042733; x=1685634733; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=0kZKE0BIR8kU5LZVmgLjnz2d1PSXb3yg4NzK45uW/48=; b=S6TabTWQqRLZmVi9yVClQ762bGBSQJJXkAGFtg+vmrAoX8Bd71QLZwGKIbSKSHab8k l8/32ba3kRLYeZWZWAL3HTIyxlcKeDgGBr6tSTqUsygJCrCnuU+RnMBa8ead2mPpdC6j 8YDwxwPYsvtxHlaUjA7BOMrNM9MQiKrRIidayKEAD4f2RqYgayGvZxYq05UWWP743SlR etItWAx1azyso+BQM8kQYdBKBWBbWyuZDuw1p2xYsuuGbetztIolAZ5ZivLZBIIN2ElF UrZVSf6sbgUjxaMe7woOv438B3TdWOLCxgYA/4I3+YtgCGm5e6PEZPnC8rDQ2SK7+3Md F7WQ== X-Gm-Message-State: AC+VfDxCml8VxsB+R5G28xablOO+RvByFFyuORFomt7K3gv2HRYe4X/a NzDXI1eT8xSY1qiT5VXGCF4= X-Google-Smtp-Source: ACHHUZ7lpJgjgZcd7ZcesWhgnNSoy4qz82nzEulrFOr4FVS+EK2ceFw9fvTa/W+c31AaGSFkM9ErtQ== X-Received: by 2002:a17:902:ccc1:b0:1a5:2760:74ef with SMTP id z1-20020a170902ccc100b001a5276074efmr17000482ple.25.1683042733104; Tue, 02 May 2023 08:52:13 -0700 (PDT) Received: from john.lan ([2605:59c8:148:ba10:62ab:a7fd:a4e3:bd70]) by smtp.gmail.com with ESMTPSA id o3-20020a170902778300b001a1a07d04e6sm19917212pll.77.2023.05.02.08.52.11 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 02 May 2023 08:52:12 -0700 (PDT) From: John Fastabend To: jakub@cloudflare.com, daniel@iogearbox.net, lmb@isovalent.com, edumazet@google.com Cc: john.fastabend@gmail.com, bpf@vger.kernel.org, netdev@vger.kernel.org, ast@kernel.org, andrii@kernel.org, will@isovalent.com Subject: [PATCH bpf v7 06/13] bpf: sockmap, TCP data stall on recv before accept Date: Tue, 2 May 2023 08:51:52 -0700 Message-Id: <20230502155159.305437-7-john.fastabend@gmail.com> X-Mailer: git-send-email 2.33.0 In-Reply-To: <20230502155159.305437-1-john.fastabend@gmail.com> References: <20230502155159.305437-1-john.fastabend@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net A common mechanism to put a TCP socket into the sockmap is to hook the BPF_SOCK_OPS_{ACTIVE_PASSIVE}_ESTABLISHED_CB event with a BPF program that can map the socket info to the correct BPF verdict parser. When the user adds the socket to the map the psock is created and the new ops are assigned to ensure the verdict program will 'see' the sk_buffs as they arrive. Part of this process hooks the sk_data_ready op with a BPF specific handler to wake up the BPF verdict program when data is ready to read. The logic is simple enough (posted here for easy reading) static void sk_psock_verdict_data_ready(struct sock *sk) { struct socket *sock = sk->sk_socket; if (unlikely(!sock || !sock->ops || !sock->ops->read_skb)) return; sock->ops->read_skb(sk, sk_psock_verdict_recv); } The oversight here is sk->sk_socket is not assigned until the application accepts() the new socket. However, its entirely ok for the peer application to do a connect() followed immediately by sends. The socket on the receiver is sitting on the backlog queue of the listening socket until its accepted and the data is queued up. If the peer never accepts the socket or is slow it will eventually hit data limits and rate limit the session. But, important for BPF sockmap hooks when this data is received TCP stack does the sk_data_ready() call but the read_skb() for this data is never called because sk_socket is missing. The data sits on the sk_receive_queue. Then once the socket is accepted if we never receive more data from the peer there will be no further sk_data_ready calls and all the data is still on the sk_receive_queue(). Then user calls recvmsg after accept() and for TCP sockets in sockmap we use the tcp_bpf_recvmsg_parser() handler. The handler checks for data in the sk_msg ingress queue expecting that the BPF program has already run from the sk_data_ready hook and enqueued the data as needed. So we are stuck. To fix do an unlikely check in recvmsg handler for data on the sk_receive_queue and if it exists wake up data_ready. We have the sock locked in both read_skb and recvmsg so should avoid having multiple runners. Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") Reviewed-by: Jakub Sitnicki Signed-off-by: John Fastabend --- net/ipv4/tcp_bpf.c | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c index 804bd0c247d0..404857ab14cc 100644 --- a/net/ipv4/tcp_bpf.c +++ b/net/ipv4/tcp_bpf.c @@ -212,6 +212,26 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk, return tcp_recvmsg(sk, msg, len, flags, addr_len); lock_sock(sk); + + /* We may have received data on the sk_receive_queue pre-accept and + * then we can not use read_skb in this context because we haven't + * assigned a sk_socket yet so have no link to the ops. The work-around + * is to check the sk_receive_queue and in these cases read skbs off + * queue again. The read_skb hook is not running at this point because + * of lock_sock so we avoid having multiple runners in read_skb. + */ + if (unlikely(!skb_queue_empty(&sk->sk_receive_queue))) { + tcp_data_ready(sk); + /* This handles the ENOMEM errors if we both receive data + * pre accept and are already under memory pressure. At least + * let user know to retry. + */ + if (unlikely(!skb_queue_empty(&sk->sk_receive_queue))) { + copied = -EAGAIN; + goto out; + } + } + msg_bytes_ready: copied = sk_msg_recvmsg(sk, psock, msg, len, flags); /* The typical case for EFAULT is the socket was gracefully