From patchwork Thu Oct 7 19:51:47 2021
X-Patchwork-Submitter: Cong Wang
X-Patchwork-Id: 12542909
X-Patchwork-Delegate: bpf@iogearbox.net
From: Cong Wang
To: netdev@vger.kernel.org
Cc: bpf@vger.kernel.org, Cong Wang, Jiang Wang, Daniel Borkmann,
 John Fastabend, Lorenz Bauer, Jakub Sitnicki
Subject: [Patch bpf v3] skmsg: check sk_rcvbuf limit before queuing to
 ingress_skb
Date: Thu, 7 Oct 2021 12:51:47 -0700
Message-Id: <20211007195147.28462-1-xiyou.wangcong@gmail.com>
X-Mailer: git-send-email 2.30.2
X-Mailing-List: netdev@vger.kernel.org
From: Cong Wang

Jiang observed frequent OOMs when testing our AF_UNIX/UDP proxy. This is
because we do not actually limit socket memory before queueing an skb to
ingress_skb: we charge the skb memory later, when handling the psock
backlog, and that accounting is not limited either.

This patch adds a check against sk->sk_rcvbuf right before queuing to
ingress_skb, and drops or retries packets when this limit is exceeded.
This is very similar to the UDP receive path. Ideally we would set the
skb owner before this check too, but it is hard to make TCP happy with
sk_forward_alloc.

For TCP we cannot simply drop packets on error, because TCP ACKs have
already been sent for those packets before they reach ->sk_data_ready().
Instead, we retry on a best-effort basis. This works because TCP has not
yet removed the skb from its receive queue at that point, and exceeding
the sk_rcvbuf limit is a temporary situation.

Reported-by: Jiang Wang
Cc: Daniel Borkmann
Cc: John Fastabend
Cc: Lorenz Bauer
Cc: Jakub Sitnicki
Signed-off-by: Cong Wang
---
v3: add retry logic for TCP
v2: add READ_ONCE()

 net/core/skmsg.c | 15 +++++++++------
 net/ipv4/tcp.c   |  2 ++
 2 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 2d6249b28928..356c314cd60c 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -877,11 +877,12 @@ static int sk_psock_skb_redirect(struct sk_psock *from, struct sk_buff *skb)
 		return -EIO;
 	}
 	spin_lock_bh(&psock_other->ingress_lock);
-	if (!sk_psock_test_state(psock_other, SK_PSOCK_TX_ENABLED)) {
+	if (!sk_psock_test_state(psock_other, SK_PSOCK_TX_ENABLED) ||
+	    atomic_read(&sk_other->sk_rmem_alloc) > READ_ONCE(sk_other->sk_rcvbuf)) {
 		spin_unlock_bh(&psock_other->ingress_lock);
 		skb_bpf_redirect_clear(skb);
 		sock_drop(from->sk, skb);
-		return -EIO;
+		return -EAGAIN;
 	}
 
 	skb_queue_tail(&psock_other->ingress_skb, skb);
@@ -941,7 +942,7 @@ static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
 		}
 
 		skb_bpf_set_ingress(skb);
-
+		err = -EAGAIN;
 		/* If the queue is empty then we can submit directly
 		 * into the msg queue. If its not empty we have to
 		 * queue work otherwise we may get OOO data. Otherwise,
@@ -953,7 +954,8 @@ static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
 		}
 		if (err < 0) {
 			spin_lock_bh(&psock->ingress_lock);
-			if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
+			if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED) &&
+			    atomic_read(&sk_other->sk_rmem_alloc) <= READ_ONCE(sk_other->sk_rcvbuf)) {
 				skb_queue_tail(&psock->ingress_skb, skb);
 				schedule_work(&psock->work);
 				err = 0;
@@ -1141,8 +1143,9 @@ static int sk_psock_verdict_recv(read_descriptor_t *desc, struct sk_buff *skb,
 		ret = sk_psock_map_verd(ret, skb_bpf_redirect_fetch(skb));
 		skb->sk = NULL;
 	}
-	if (sk_psock_verdict_apply(psock, skb, ret) < 0)
-		len = 0;
+	ret = sk_psock_verdict_apply(psock, skb, ret);
+	if (ret < 0)
+		len = ret;
 out:
 	rcu_read_unlock();
 	return len;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e8b48df73c85..8b243fcdbb8f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1665,6 +1665,8 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 		if (used <= 0) {
 			if (!copied)
 				copied = used;
+			if (used == -EAGAIN)
+				continue;
 			break;
 		} else if (used <= len) {
 			seq += used;
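---
Note: for readers following the logic rather than the diff, below is a
minimal, self-contained userspace model (plain C, illustrative only; none
of these names -- RCVBUF_LIMIT, rmem_alloc, ingress_queue, backlog_drain --
are kernel APIs) of the drop-vs-retry semantics above. The producer checks
a receive-buffer limit before queueing and reports -EAGAIN instead of
dropping, and a TCP-like read loop retries on -EAGAIN because the data is
still sitting on its queue:

	/* rcvbuf_retry_model.c -- illustrative sketch, not kernel code. */
	#include <errno.h>
	#include <stdio.h>

	#define RCVBUF_LIMIT 2		/* stand-in for sk->sk_rcvbuf */

	static int rmem_alloc;		/* stand-in for sk->sk_rmem_alloc */

	/* Stand-in for queuing to ingress_skb with the rcvbuf check:
	 * refuse with -EAGAIN instead of queueing past the limit.
	 */
	static int ingress_queue(int len)
	{
		if (rmem_alloc + len > RCVBUF_LIMIT)
			return -EAGAIN;	/* temporary: caller may retry */
		rmem_alloc += len;
		return len;		/* bytes consumed */
	}

	/* Stand-in for the psock backlog work releasing charged memory. */
	static void backlog_drain(void)
	{
		rmem_alloc = 0;
	}

	int main(void)
	{
		int seq = 0, total = 5;

		/* Like tcp_read_sock() after this patch: on -EAGAIN,
		 * retry the same data instead of breaking out of the
		 * loop, because the queue still holds it (nothing was
		 * dropped, and its ACK is already gone out).
		 */
		while (seq < total) {
			int used = ingress_queue(1);

			if (used == -EAGAIN) {
				backlog_drain();  /* pressure is temporary */
				continue;
			}
			if (used <= 0)
				break;
			seq += used;
		}
		printf("consumed %d of %d bytes\n", seq, total);
		return 0;
	}

The property the model mirrors is that -EAGAIN is only ever a transient
condition here: the backlog eventually releases receive-buffer memory, so
retrying (rather than dropping already-ACKed data) makes forward progress.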