From patchwork Tue Mar 21 21:52:03 2023
From: John Fastabend <john.fastabend@gmail.com>
To: jakub@cloudflare.com, daniel@iogearbox.net, lmb@isovalent.com,
    cong.wang@bytedance.com
Cc: bpf@vger.kernel.org, john.fastabend@gmail.com, netdev@vger.kernel.org,
    edumazet@google.com, ast@kernel.org, andrii@kernel.org, will@isovalent.com
Subject: [PATCH bpf 02/11] bpf: sockmap, convert schedule_work into
 delayed_work
Date: Tue, 21 Mar 2023 14:52:03 -0700
Message-Id: <20230321215212.525630-3-john.fastabend@gmail.com>
In-Reply-To: <20230321215212.525630-1-john.fastabend@gmail.com>
References: <20230321215212.525630-1-john.fastabend@gmail.com>

Sk_buffs are fed into sockmap verdict programs either from a strparser
(when the user might want to decide how framing of the skb is done by
attaching another parser program) or directly through tcp_read_sock().
tcp_read_sock() is the preferred method for performance when the BPF
logic is a stream parser.

The flow for Cilium's common use case with a stream parser is,

 tcp_read_sock()
  sk_psock_verdict_recv
    ret = bpf_prog_run_pin_on_cpu()
    sk_psock_verdict_apply(sock, skb, ret)
      // if the system is under memory pressure or the app is slow we
      // may need to queue the skb. Do this queuing through ingress_skb
      // and then kick the timer to wake up the handler
      skb_queue_tail(ingress_skb, skb)
      schedule_work(work);

The work queue is wired up to sk_psock_backlog(). This will then walk
the ingress_skb list that holds the sk_buffs that could not be handled,
but should be OK to run at some later point. However, it's possible
that the workqueue doing this work still hits an error when sending the
skb. When this happens the skbuff is requeued on a temporary 'state'
struct kept with the workqueue. This is necessary because it's possible
to partially send an skbuff before hitting an error, and we need to
know how and where to restart when the workqueue runs next.

Now for the trouble: we don't rekick the workqueue. This can cause a
stall where the skbuff we just cached on the state variable might never
be sent. This happens when it's the last packet in a flow and no
further packets come along that would cause the system to kick the
workqueue from that side.

To fix this we could do a simple schedule_work(), but under memory
pressure it makes sense to back off some instead of continuing to retry
repeatedly. So instead convert schedule_work() to
schedule_delayed_work() and add backoff logic that reschedules from the
backlog queue on errors. It's not obvious what a good backoff is, so
use a delay of '1' (one jiffy).

While testing, we observed some flakes when running the NGINX
compliance test with sockmap; we attributed these failed tests to this
bug and a subsequent issue.
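For illustration, here is a minimal, self-contained sketch of the
pattern the patch below adopts. The my_psock/my_send/my_backlog names
are hypothetical stand-ins, not the sockmap code itself:

/* Sketch of the schedule_work() -> schedule_delayed_work() conversion,
 * assuming a kernel-module context. Hypothetical example, not sockmap.
 */
#include <linux/skbuff.h>
#include <linux/workqueue.h>

struct my_psock {
	struct sk_buff_head ingress_skb;
	struct delayed_work work;	/* was: struct work_struct work */
};

/* Stand-in for the real transmit path. Consumes the skb on success;
 * on -EAGAIN the skb is left for the caller to requeue.
 */
static int my_send(struct my_psock *psock, struct sk_buff *skb)
{
	kfree_skb(skb);
	return 0;
}

static void my_backlog(struct work_struct *work)
{
	/* container_of() now goes through the embedded delayed_work. */
	struct delayed_work *dwork = to_delayed_work(work);
	struct my_psock *psock = container_of(dwork, struct my_psock, work);
	struct sk_buff *skb;

	while ((skb = skb_dequeue(&psock->ingress_skb))) {
		if (my_send(psock, skb) == -EAGAIN) {
			/* Requeue and back off one jiffy rather than
			 * retrying immediately under memory pressure.
			 */
			skb_queue_head(&psock->ingress_skb, skb);
			schedule_delayed_work(&psock->work, 1);
			return;
		}
	}
}

static void my_psock_init(struct my_psock *psock)
{
	skb_queue_head_init(&psock->ingress_skb);
	INIT_DELAYED_WORK(&psock->work, my_backlog);
}

/* Enqueue path: a delay of 0 keeps the old schedule_work() behaviour
 * of running as soon as a worker is available.
 */
static void my_enqueue(struct my_psock *psock, struct sk_buff *skb)
{
	skb_queue_tail(&psock->ingress_skb, skb);
	schedule_delayed_work(&psock->work, 0);
}

Only the error path actually backs off; all of the existing callers
keep their "run now" semantics by passing a delay of 0.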
Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
Tested-by: William Findlay <will@isovalent.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 include/linux/skmsg.h |  2 +-
 net/core/skmsg.c      | 19 ++++++++++++-------
 net/core/sock_map.c   |  3 ++-
 3 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 84f787416a54..904ff9a32ad6 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -105,7 +105,7 @@ struct sk_psock {
 	struct proto			*sk_proto;
 	struct mutex			work_mutex;
 	struct sk_psock_work_state	work_state;
-	struct work_struct		work;
+	struct delayed_work		work;
 	struct rcu_work			rwork;
 };
 
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 2b6d9519ff29..96a6a3a74a67 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -481,7 +481,7 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
 	}
 out:
 	if (psock->work_state.skb && copied > 0)
-		schedule_work(&psock->work);
+		schedule_delayed_work(&psock->work, 0);
 	return copied;
 }
 EXPORT_SYMBOL_GPL(sk_msg_recvmsg);
@@ -639,7 +639,8 @@ static void sk_psock_skb_state(struct sk_psock *psock,
 
 static void sk_psock_backlog(struct work_struct *work)
 {
-	struct sk_psock *psock = container_of(work, struct sk_psock, work);
+	struct delayed_work *dwork = to_delayed_work(work);
+	struct sk_psock *psock = container_of(dwork, struct sk_psock, work);
 	struct sk_psock_work_state *state = &psock->work_state;
 	struct sk_buff *skb = NULL;
 	bool ingress;
@@ -679,6 +680,10 @@ static void sk_psock_backlog(struct work_struct *work)
 			if (ret == -EAGAIN) {
 				sk_psock_skb_state(psock, state, skb,
 						   len, off);
+
+				/* Delay slightly to prioritize any
+				 * other work that might be here. */
+				schedule_delayed_work(&psock->work, 1);
 				goto end;
 			}
 			/* Hard errors break pipe and stop xmit. */
@@ -733,7 +738,7 @@ struct sk_psock *sk_psock_init(struct sock *sk, int node)
 	INIT_LIST_HEAD(&psock->link);
 	spin_lock_init(&psock->link_lock);
 
-	INIT_WORK(&psock->work, sk_psock_backlog);
+	INIT_DELAYED_WORK(&psock->work, sk_psock_backlog);
 	mutex_init(&psock->work_mutex);
 	INIT_LIST_HEAD(&psock->ingress_msg);
 	spin_lock_init(&psock->ingress_lock);
@@ -822,7 +827,7 @@ static void sk_psock_destroy(struct work_struct *work)
 
 	sk_psock_done_strp(psock);
 
-	cancel_work_sync(&psock->work);
+	cancel_delayed_work_sync(&psock->work);
 	mutex_destroy(&psock->work_mutex);
 
 	psock_progs_drop(&psock->progs);
@@ -937,7 +942,7 @@ static int sk_psock_skb_redirect(struct sk_psock *from, struct sk_buff *skb)
 	}
 
 	skb_queue_tail(&psock_other->ingress_skb, skb);
-	schedule_work(&psock_other->work);
+	schedule_delayed_work(&psock_other->work, 0);
 	spin_unlock_bh(&psock_other->ingress_lock);
 	return 0;
 }
@@ -1017,7 +1022,7 @@ static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
 		spin_lock_bh(&psock->ingress_lock);
 		if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
 			skb_queue_tail(&psock->ingress_skb, skb);
-			schedule_work(&psock->work);
+			schedule_delayed_work(&psock->work, 0);
 			err = 0;
 		}
 		spin_unlock_bh(&psock->ingress_lock);
@@ -1048,7 +1053,7 @@ static void sk_psock_write_space(struct sock *sk)
 	psock = sk_psock(sk);
 	if (likely(psock)) {
 		if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED))
-			schedule_work(&psock->work);
+			schedule_delayed_work(&psock->work, 0);
 		write_space = psock->saved_write_space;
 	}
 	rcu_read_unlock();
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index a68a7290a3b2..d38267201892 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -1624,9 +1624,10 @@ void sock_map_close(struct sock *sk, long timeout)
 	rcu_read_unlock();
 	sk_psock_stop(psock);
 	release_sock(sk);
-	cancel_work_sync(&psock->work);
+	cancel_delayed_work_sync(&psock->work);
 	sk_psock_put(sk, psock);
 }
+
 /* Make sure we do not recurse. This is a bug.
  * Leak the socket instead of crashing on a stack overflow.
  */
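A note on the teardown side, continuing the hypothetical my_psock
sketch from above (not the sockmap code): the cancel_work_sync()
callers must be converted as well, because cancel_delayed_work_sync()
both removes an item still pending on its timer and waits for an
already-running callback to finish.

/* Teardown sketch for the hypothetical my_psock above. After this
 * returns, no callback is pending or running, so the struct may be
 * freed safely.
 */
static void my_psock_destroy(struct my_psock *psock)
{
	cancel_delayed_work_sync(&psock->work);
	skb_queue_purge(&psock->ingress_skb);
}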