From patchwork Mon Apr 14 15:45:50 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Jesper Dangaard Brouer
X-Patchwork-Id: 14050643
X-Patchwork-Delegate: kuba@kernel.org
X-Patchwork-State: RFC
Subject: [PATCH net-next RFC V3 1/2] net: sched: generalize check for no-op
 qdisc on TX queue
From: Jesper Dangaard Brouer
To: netdev@vger.kernel.org, Jakub Kicinski
Cc: Jesper Dangaard Brouer, bpf@vger.kernel.org, tom@herbertland.com,
 Eric Dumazet, "David S. Miller", Paolo Abeni, Toke Høiland-Jørgensen,
 dsahern@kernel.org, makita.toshiaki@lab.ntt.co.jp,
 kernel-team@cloudflare.com
Date: Mon, 14 Apr 2025 17:45:50 +0200
Message-ID: <174464555063.20396.9545196538212416415.stgit@firesoul>
In-Reply-To: <174464549885.20396.6987653753122223942.stgit@firesoul>
References: <174464549885.20396.6987653753122223942.stgit@firesoul>
User-Agent: StGit/1.5
Precedence: bulk
X-Mailing-List: netdev@vger.kernel.org

WARNING: Testing shows this is NOT correct!
 - I need help figuring out why this patch is incorrect.
 - Testing against &noop_qdisc is not the same as q->enqueue == NULL
 - I copied the test (txq->qdisc == &noop_qdisc) from qdisc_tx_is_noop()
 - Q: is the qdisc_tx_is_noop() function incorrect?

The vrf driver includes an open-coded check to determine whether a TX
queue (netdev_queue) has a real qdisc attached. This is done by testing
whether qdisc->enqueue is NULL, which is functionally equivalent to
checking whether the qdisc is &noop_qdisc.

This equivalence stems from noqueue_init(), which explicitly clears the
enqueue pointer to signal no-op behavior to __dev_queue_xmit().

This patch introduces a shared helper, qdisc_txq_is_noop(), to clarify
intent and make this logic reusable.
The vrf driver is updated to use this new helper. Subsequent patches will
make further use of this helper in other drivers, such as veth.

This is a non-functional change.

Signed-off-by: Jesper Dangaard Brouer
---
 drivers/net/vrf.c         |    4 +---
 include/net/sch_generic.h |    7 ++++++-
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 7168b33adadb..f0a24fc85945 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -343,15 +343,13 @@ static int vrf_ifindex_lookup_by_table_id(struct net *net, u32 table_id)
 static bool qdisc_tx_is_default(const struct net_device *dev)
 {
 	struct netdev_queue *txq;
-	struct Qdisc *qdisc;
 
 	if (dev->num_tx_queues > 1)
 		return false;
 
 	txq = netdev_get_tx_queue(dev, 0);
-	qdisc = rcu_access_pointer(txq->qdisc);
 
-	return !qdisc->enqueue;
+	return qdisc_txq_is_noop(txq);
 }
 
 /* Local traffic destined to local address. Reinsert the packet to rx
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index d48c657191cd..a1f5560350a6 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -803,6 +803,11 @@ static inline bool qdisc_tx_changing(const struct net_device *dev)
 	return false;
 }
 
+static inline bool qdisc_txq_is_noop(const struct netdev_queue *txq)
+{
+	return rcu_access_pointer(txq->qdisc) == &noop_qdisc;
+}
+
 /* Is the device using the noop qdisc on all queues?  */
 static inline bool qdisc_tx_is_noop(const struct net_device *dev)
 {
@@ -810,7 +815,7 @@ static inline bool qdisc_tx_is_noop(const struct net_device *dev)
 
 	for (i = 0; i < dev->num_tx_queues; i++) {
 		struct netdev_queue *txq = netdev_get_tx_queue(dev, i);
-		if (rcu_access_pointer(txq->qdisc) != &noop_qdisc)
+		if (!qdisc_txq_is_noop(txq))
 			return false;
 	}
 	return true;

From patchwork Mon Apr 14 15:45:56 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Jesper Dangaard Brouer
X-Patchwork-Id: 14050644
X-Patchwork-Delegate: kuba@kernel.org
X-Patchwork-State: RFC
Subject: [PATCH net-next RFC V3 2/2] veth: apply qdisc backpressure on full
 ptr_ring to reduce TX drops
From: Jesper Dangaard Brouer
To: netdev@vger.kernel.org, Jakub Kicinski
Cc: Jesper Dangaard Brouer, bpf@vger.kernel.org, tom@herbertland.com,
 Eric Dumazet, "David S. Miller", Paolo Abeni, Toke Høiland-Jørgensen,
 dsahern@kernel.org, makita.toshiaki@lab.ntt.co.jp,
 kernel-team@cloudflare.com
Date: Mon, 14 Apr 2025 17:45:56 +0200
Message-ID: <174464555655.20396.15783804183694533537.stgit@firesoul>
In-Reply-To: <174464549885.20396.6987653753122223942.stgit@firesoul>
References: <174464549885.20396.6987653753122223942.stgit@firesoul>
User-Agent: StGit/1.5
Precedence: bulk
X-Mailing-List: netdev@vger.kernel.org

In production, we're seeing TX drops on veth devices when the ptr_ring
fills up. This can occur when NAPI mode is enabled, though it's
relatively rare. However, with threaded NAPI - which we use in
production - the drops become significantly more frequent.

The underlying issue is that with threaded NAPI, the consumer often runs
on a different CPU than the producer.
This increases the likelihood of the ring filling up before the consumer
gets scheduled, especially under load, leading to drops in veth_xmit()
(ndo_start_xmit()).

This patch introduces backpressure by returning NETDEV_TX_BUSY when the
ring is full, signaling the qdisc layer to requeue the packet. The txq
(netdev queue) is stopped in this condition and restarted once
veth_poll() drains entries from the ring, ensuring coordination between
NAPI and qdisc.

Backpressure is only enabled when a qdisc is attached. Without a qdisc,
the driver retains its original behavior - dropping packets immediately
when the ring is full. This avoids unexpected behavior changes in setups
without a configured qdisc.

With a qdisc in place (e.g. fq, sfq) this allows Active Queue Management
(AQM) to fairly schedule packets across flows and reduce collateral
damage from elephant flows.

A known limitation of this approach is that the full ring sits in front
of the qdisc layer, effectively forming a FIFO buffer that introduces
base latency. While AQM still improves fairness and mitigates flow
dominance, the latency impact is measurable.

In hardware drivers, this issue is typically addressed using BQL (Byte
Queue Limits), which tracks in-flight bytes needed based on physical link
rate. However, for virtual drivers like veth, there is no fixed bandwidth
constraint - the bottleneck is CPU availability and the scheduler's
ability to run the NAPI thread. It is unclear how effective BQL would be
in this context.

This patch serves as a first step toward addressing TX drops. Future work
may explore adapting a BQL-like mechanism to better suit virtual devices
like veth.
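
For readers who want to see the stop/wake handshake in isolation, here is
a minimal userspace sketch. It is not the driver code and makes no claim
about real veth timing. Every name in it (toy_txq, toy_xmit(), toy_poll(),
toy_burst(), RING_SIZE) is hypothetical and invented for this
illustration. It assumes a single producer, a consumer that only gets to
run once the producer has been told to back off, and a tiny ring so the
effect is visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 4	/* hypothetical; chosen small to force overflow */

enum toy_tx { TOY_TX_OK, TOY_TX_BUSY, TOY_TX_DROP };

/* Toy stand-in for one TX queue, its ring, and the "stopped" state. */
struct toy_txq {
	int ring[RING_SIZE];
	size_t head, tail, count;
	bool stopped;		/* models a stopped netdev TX queue */
	bool has_qdisc;		/* models "a real qdisc is attached" */
	unsigned long dropped;
};

/* Producer: enqueue, or on a full ring either drop or ask for a requeue. */
static enum toy_tx toy_xmit(struct toy_txq *q, int pkt)
{
	if (q->count < RING_SIZE) {
		q->ring[q->tail] = pkt;
		q->tail = (q->tail + 1) % RING_SIZE;
		q->count++;
		return TOY_TX_OK;
	}
	if (!q->has_qdisc) {
		q->dropped++;		/* original behavior: drop */
		return TOY_TX_DROP;
	}
	q->stopped = true;		/* backpressure: stop the queue */
	return TOY_TX_BUSY;
}

/* Consumer: drain up to budget entries, then wake a stopped queue. */
static int toy_poll(struct toy_txq *q, int budget)
{
	int done = 0;

	while (done < budget && q->count > 0) {
		q->head = (q->head + 1) % RING_SIZE;
		q->count--;
		done++;
	}
	if (q->stopped)
		q->stopped = false;
	return done;
}

/* Send n packets back to back; returns how many were dropped. */
static unsigned long toy_burst(struct toy_txq *q, int n)
{
	for (int pkt = 0; pkt < n; pkt++) {
		enum toy_tx ret = toy_xmit(q, pkt);

		if (ret == TOY_TX_BUSY) {
			/* The "qdisc" holds the packet until the consumer
			 * has run, then retries the same packet.
			 */
			toy_poll(q, RING_SIZE);
			ret = toy_xmit(q, pkt);
			assert(ret == TOY_TX_OK);
		}
	}
	return q->dropped;
}
```

In this toy model a burst of 10 packets into a 4-entry ring loses the
packets that arrive after the ring is full when has_qdisc is false, and
loses none when has_qdisc is true, because the held packet is retried
after the consumer drains the ring.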

Reported-by: Yan Zhai
Signed-off-by: Jesper Dangaard Brouer
---
 drivers/net/veth.c |   49 +++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 41 insertions(+), 8 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 7bb53961c0ea..3455fca2f2af 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -308,11 +308,10 @@ static void __veth_xdp_flush(struct veth_rq *rq)
 static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
 {
 	if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb))) {
-		dev_kfree_skb_any(skb);
-		return NET_RX_DROP;
+		return NETDEV_TX_BUSY; /* signal qdisc layer */
 	}
 
-	return NET_RX_SUCCESS;
+	return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
 }
 
 static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
@@ -346,11 +345,11 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
 	struct veth_rq *rq = NULL;
-	int ret = NETDEV_TX_OK;
+	struct netdev_queue *txq;
 	struct net_device *rcv;
 	int length = skb->len;
 	bool use_napi = false;
-	int rxq;
+	int ret, rxq;
 
 	rcu_read_lock();
 	rcv = rcu_dereference(priv->peer);
@@ -373,17 +372,41 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 	}
 
 	skb_tx_timestamp(skb);
-	if (likely(veth_forward_skb(rcv, skb, rq, use_napi) == NET_RX_SUCCESS)) {
+
+	ret = veth_forward_skb(rcv, skb, rq, use_napi);
+	switch(ret) {
+	case NET_RX_SUCCESS: /* same as NETDEV_TX_OK */
 		if (!use_napi)
 			dev_sw_netstats_tx_add(dev, 1, length);
 		else
 			__veth_xdp_flush(rq);
-	} else {
+		break;
+	case NETDEV_TX_BUSY:
+		/* If a qdisc is attached to our virtual device, returning
+		 * NETDEV_TX_BUSY is allowed.
+		 */
+		txq = netdev_get_tx_queue(dev, rxq);
+
+		if (qdisc_txq_is_noop(txq)) {
+			dev_kfree_skb_any(skb);
+			goto drop;
+		}
+		netif_tx_stop_queue(txq);
+		/* Restore Eth hdr pulled by dev_forward_skb/eth_type_trans */
+		__skb_push(skb, ETH_HLEN);
+		if (use_napi)
+			__veth_xdp_flush(rq);
+
+		break;
+	case NET_RX_DROP: /* same as NET_XMIT_DROP */
 drop:
 		atomic64_inc(&priv->dropped);
 		ret = NET_XMIT_DROP;
+		break;
+	default:
+		net_crit_ratelimited("veth_xmit(%s): Invalid return code(%d)",
+				     dev->name, ret);
 	}
-
 	rcu_read_unlock();
 
 	return ret;
@@ -874,9 +897,16 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
 			struct veth_xdp_tx_bq *bq,
 			struct veth_stats *stats)
 {
+	struct veth_priv *priv = netdev_priv(rq->dev);
+	int queue_idx = rq->xdp_rxq.queue_index;
+	struct netdev_queue *peer_txq;
+	struct net_device *peer_dev;
 	int i, done = 0, n_xdpf = 0;
 	void *xdpf[VETH_XDP_BATCH];
 
+	peer_dev = rcu_dereference(priv->peer);
+	peer_txq = netdev_get_tx_queue(peer_dev, queue_idx);
+
 	for (i = 0; i < budget; i++) {
 		void *ptr = __ptr_ring_consume(&rq->xdp_ring);
 
@@ -925,6 +955,9 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
 	rq->stats.vs.xdp_packets += done;
 	u64_stats_update_end(&rq->stats.syncp);
 
+	if (unlikely(netif_tx_queue_stopped(peer_txq)))
+		netif_tx_wake_queue(peer_txq);
+
 	return done;
 }
 