diff mbox series

[net-next,RFC,v1,2/4] veth: use generic-XDP functions when dealing with SKBs

Message ID 169272715407.1975370.3989385869434330916.stgit@firesoul (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Series veth: reduce reallocations of SKBs when XDP bpf-prog is loaded

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1362 this patch: 1362
netdev/cc_maintainers warning 12 maintainers not CCed: daniel@iogearbox.net kpsingh@kernel.org martin.lau@linux.dev john.fastabend@gmail.com sdf@google.com song@kernel.org andrii@kernel.org yonghong.song@linux.dev bpf@vger.kernel.org jolsa@kernel.org haoluo@google.com ast@kernel.org
netdev/build_clang success Errors and warnings before: 1353 this patch: 1353
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1385 this patch: 1385
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 124 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Jesper Dangaard Brouer Aug. 22, 2023, 5:59 p.m. UTC
The root cause of the realloc issue is that the veth_xdp_rcv_skb() code path
(which handles SKBs like generic-XDP) calls the native-XDP function
xdp_do_redirect() instead of simply using xdp_do_generic_redirect(), which can
handle SKBs.

The existing code tries to steal the packet data from the SKB (and frees the SKB
itself). This causes issues, as SKBs can have different memory models that are
incompatible with the native-XDP call xdp_do_redirect(). For this reason the
checks in veth_convert_skb_to_xdp_buff() become stricter, which in turn makes
this a bad approach. Instead, simply leverage the generic-XDP helpers
generic_xdp_tx() and xdp_do_generic_redirect(); this resolves the issue, given
that the netstack can handle these different SKB memory models.

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
 drivers/net/veth.c |   69 ++++++++++++++++++++--------------------------------
 net/core/dev.c     |    1 +
 net/core/filter.c  |    1 +
 3 files changed, 28 insertions(+), 43 deletions(-)

Comments

Toke Høiland-Jørgensen Aug. 24, 2023, 10:30 a.m. UTC | #1
Jesper Dangaard Brouer <hawk@kernel.org> writes:

> The root-cause the realloc issue is that veth_xdp_rcv_skb() code path (that
> handles SKBs like generic-XDP) is calling a native-XDP function
> xdp_do_redirect(), instead of simply using xdp_do_generic_redirect() that can
> handle SKBs.
>
> The existing code tries to steal the packet-data from the SKB (and frees the SKB
> itself). This cause issues as SKBs can have different memory models that are
> incompatible with native-XDP call xdp_do_redirect(). For this reason the checks
> in veth_convert_skb_to_xdp_buff() becomes more strict. This in turn makes this a
> bad approach. Simply leveraging generic-XDP helpers e.g. generic_xdp_tx() and
> xdp_do_generic_redirect() as this resolves the issue given netstack can handle
> these different SKB memory models.

While this does solve the memory issue, it's also a subtle change of
semantics. For one thing, generic_xdp_tx() has this comment above it:

/* When doing generic XDP we have to bypass the qdisc layer and the
 * network taps in order to match in-driver-XDP behavior. This also means
 * that XDP packets are able to starve other packets going through a qdisc,
 * and DDOS attacks will be more effective. In-driver-XDP use dedicated TX
 * queues, so they do not have this starvation issue.
 */

Also, more generally, this means that if you have a setup with
XDP_REDIRECT-based forwarding on a host with a mix of physical and
veth devices, all the traffic originating from the veth devices will go
on different TXQs than that originating from a physical NIC. Or if a
veth device has a mix of xdp_frame-backed packets and skb-backed
packets, those will also go on different queues, potentially leading to
reordering.

I'm not sure exactly how much of an issue this is in practice, but at
least from a conceptual PoV it's a change in behaviour that I don't
think we should be making lightly. WDYT?

-Toke
Jesper Dangaard Brouer Aug. 29, 2023, 2:37 p.m. UTC | #2
On 24/08/2023 12.30, Toke Høiland-Jørgensen wrote:
> Jesper Dangaard Brouer <hawk@kernel.org> writes:
> 
>> The root-cause the realloc issue is that veth_xdp_rcv_skb() code path (that
>> handles SKBs like generic-XDP) is calling a native-XDP function
>> xdp_do_redirect(), instead of simply using xdp_do_generic_redirect() that can
>> handle SKBs.
>>
>> The existing code tries to steal the packet-data from the SKB (and frees the SKB
>> itself). This cause issues as SKBs can have different memory models that are
>> incompatible with native-XDP call xdp_do_redirect(). For this reason the checks
>> in veth_convert_skb_to_xdp_buff() becomes more strict. This in turn makes this a
>> bad approach. Simply leveraging generic-XDP helpers e.g. generic_xdp_tx() and
>> xdp_do_generic_redirect() as this resolves the issue given netstack can handle
>> these different SKB memory models.
> 
> While this does solve the memory issue, it's also a subtle change of
> semantics. For one thing, generic_xdp_tx() has this comment above it:
> 
> /* When doing generic XDP we have to bypass the qdisc layer and the
>   * network taps in order to match in-driver-XDP behavior. This also means
>   * that XDP packets are able to starve other packets going through a qdisc,
>   * and DDOS attacks will be more effective. In-driver-XDP use dedicated TX
>   * queues, so they do not have this starvation issue.
>   */
> 
> Also, more generally, this means that if you have a setup with
> XDP_REDIRECT-based forwarding in on a host with a mix of physical and
> veth devices, all the traffic originating from the veth devices will go
> on different TXQs than that originating from a physical NIC. Or if a
> veth device has a mix of xdp_frame-backed packets and skb-backed
> packets, those will also go on different queues, potentially leading to
> reordering.
> 

A mix of xdp_frame-backed packets and skb-backed packets (towards veth)
will naturally come from two different data paths, and the BPF developer
who redirected the xdp_frame (into veth) will have made this choice,
including the chance of reordering (given the two data/code paths).

I will claim that (for SKBs) the current code causes reordering on TXQs (as
you explain), and that my code changes actually fix this problem.

Consider a userspace app (inside a namespace) sending packets out (to the
veth peer).  Routing (or bridging) will make the netstack send them out of
device A (maybe a physical device).  On the veth peer we have an XDP-prog
running that will XDP-redirect every 2nd packet to device A.  With the
current code, TXQ reordering will occur, as calling "native"
xdp_do_redirect() will select the TXQ based on the currently running CPU,
while normal SKBs will use netdev_core_pick_tx().  After my change, using
xdp_do_generic_redirect(), the code ends up using generic_xdp_tx(), which
(looking at the code) also uses netdev_core_pick_tx() to select the TXQ.
Thus, I will claim it is more correct (even though XDP in general
doesn't give this guarantee).
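
The TXQ-selection argument above can be illustrated with a toy userspace
model. This is only a sketch, not kernel code: every toy_* name is
hypothetical, the CPU-modulo choice stands in for what a driver might do on
the native xdp_do_redirect() path, and the flow-hash-modulo choice is a
simplifying assumption standing in for netdev_core_pick_tx(). It counts how
often consecutive packets of one stream land on different TXQs when every
2nd packet is redirected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: hypothetical helpers, not the kernel's implementation. */
struct toy_pkt {
	uint32_t flow_hash;	/* stable per-flow hash */
	unsigned int cpu;	/* CPU that happens to process the packet */
};

typedef unsigned int (*toy_pick_fn)(const struct toy_pkt *, unsigned int);

/* Assumed model of the native path: TXQ derived from the running CPU. */
static unsigned int toy_pick_txq_by_cpu(const struct toy_pkt *p,
					unsigned int num_txq)
{
	assert(num_txq > 0);
	return p->cpu % num_txq;
}

/* Assumed model of the netstack path: TXQ derived from the flow hash. */
static unsigned int toy_pick_txq_by_flow(const struct toy_pkt *p,
					 unsigned int num_txq)
{
	assert(num_txq > 0);
	return p->flow_hash % num_txq;
}

/*
 * Even-indexed packets take the netstack path; odd-indexed packets are
 * "redirected" and use redirect_pick. Returns how many packets landed on
 * a different TXQ than the packet before them.
 */
static size_t toy_count_txq_switches(const struct toy_pkt *pkts, size_t n,
				     unsigned int num_txq,
				     toy_pick_fn redirect_pick)
{
	size_t switches = 0;
	unsigned int prev = 0;

	for (size_t i = 0; i < n; i++) {
		toy_pick_fn pick = (i % 2) ? redirect_pick
					   : toy_pick_txq_by_flow;
		unsigned int q = pick(&pkts[i], num_txq);

		if (i > 0 && q != prev)
			switches++;
		prev = q;
	}
	return switches;
}
```

Passing toy_pick_txq_by_cpu as redirect_pick models the current code, and
passing toy_pick_txq_by_flow models the behaviour after the change. In this
model the stream only stays on one TXQ in the second case, unless the CPU and
the flow hash happen to map to the same queue.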

> I'm not sure exactly how much of an issue this is in practice, but at
> least from a conceptual PoV it's a change in behaviour that I don't
> think we should be making lightly. WDYT?

As described above, I think this patchset is an improvement.  It might even
fix/address the concern that was raised.


[Outside the scope of this patchset]

The single XDP BPF-prog attached (RX-side) to a veth device
actually needs to handle *both* xdp_frame-backed packets and SKB-backed
packets, and it cannot tell them apart. (Easy fix: implement a kfunc
RX-metadata hint to expose this?)

Consider the use-case[1] of implementing NFV (Network Function
Virtualization) chaining via veth devices, where each veth pair's XDP
BPF-prog implements a network "function" and redirects/chains to the next
veth/container NFV.  For this use-case, I would like the ability to either
skip SKB-backed packets or turn off the BPF-prog seeing any SKB-backed
packets. There is a huge performance advantage when XDP-redirecting an
xdp_frame into veth devices in this way, approx 6Mpps for traversing 4 veth
devices as benchmarked in [1]. (P.S. I was going to improve this
performance further, but I got distracted by other work.)

  [1] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp_frame03_overhead.org

The veth-NFV-like use-cases are hampered by the SKB-based XDP code path
causing a significant slowdown for normal netstack packets.  Plus, it
needs to parse and filter those SKB-based packets too.  This patchset
"just" significantly reduces the overhead of the SKB-based XDP code path,
which IMHO is a good first step.  Afterwards, we can discuss whether we
should have a switch to turn off the SKB-based XDP code path in veth.

--Jesper
Toke Høiland-Jørgensen Sept. 1, 2023, 1:32 p.m. UTC | #3
Jesper Dangaard Brouer <hawk@kernel.org> writes:

> On 24/08/2023 12.30, Toke Høiland-Jørgensen wrote:
>> Jesper Dangaard Brouer <hawk@kernel.org> writes:
>> 
>>> The root-cause the realloc issue is that veth_xdp_rcv_skb() code path (that
>>> handles SKBs like generic-XDP) is calling a native-XDP function
>>> xdp_do_redirect(), instead of simply using xdp_do_generic_redirect() that can
>>> handle SKBs.
>>>
>>> The existing code tries to steal the packet-data from the SKB (and frees the SKB
>>> itself). This cause issues as SKBs can have different memory models that are
>>> incompatible with native-XDP call xdp_do_redirect(). For this reason the checks
>>> in veth_convert_skb_to_xdp_buff() becomes more strict. This in turn makes this a
>>> bad approach. Simply leveraging generic-XDP helpers e.g. generic_xdp_tx() and
>>> xdp_do_generic_redirect() as this resolves the issue given netstack can handle
>>> these different SKB memory models.
>> 
>> While this does solve the memory issue, it's also a subtle change of
>> semantics. For one thing, generic_xdp_tx() has this comment above it:
>> 
>> /* When doing generic XDP we have to bypass the qdisc layer and the
>>   * network taps in order to match in-driver-XDP behavior. This also means
>>   * that XDP packets are able to starve other packets going through a qdisc,
>>   * and DDOS attacks will be more effective. In-driver-XDP use dedicated TX
>>   * queues, so they do not have this starvation issue.
>>   */
>> 
>> Also, more generally, this means that if you have a setup with
>> XDP_REDIRECT-based forwarding in on a host with a mix of physical and
>> veth devices, all the traffic originating from the veth devices will go
>> on different TXQs than that originating from a physical NIC. Or if a
>> veth device has a mix of xdp_frame-backed packets and skb-backed
>> packets, those will also go on different queues, potentially leading to
>> reordering.
>> 
>
> Mixing xdp_frame-backed packets and skb-backed packet (towards veth)
> will naturally come from two different data paths, and the BPF-developer
> that redirected the xdp_frame (into veth) will have taken this choice,
> including the chance of reordering (given the two data/code paths).

I'm not sure we can quite conclude that this is a choice any XDP
developers will be actively aware of. At best it's a very implicit
choice :)

> I will claim that (for SKBs) current code cause reordering on TXQs (as
> you explain), and my code changes actually fix this problem.
>
> Consider a userspace app (inside namespace) sending packets out (to veth
> peer).  Routing (or bridging) will make netstack send out device A
> (maybe a physical device).  On veth peer we have XDP-prog running, that
> will XDP-redirect every 2nd packet to device A.  With current code TXQ
> reordering will occur, as calling "native" xdp_do_redirect() will select
> TXQ based on current-running CPU, while normal SKBs will use
> netdev_core_pick_tx().  After my change, using
> xdp_do_generic_redirect(), the code end-up using generic_xdp_tx() which
> (looking at the code) also use netdev_core_pick_tx() to select the TXQ.
> Thus, I will claim it is more correct (even-though XDP in general
> doesn't give this guarantee).
>
>> I'm not sure exactly how much of an issue this is in practice, but at
>> least from a conceptual PoV it's a change in behaviour that I don't
>> think we should be making lightly. WDYT?
>
> As desc above, I think this patchset is an improvement.  It might even
> fix/address the concern that was raised.

Well, you can obviously construct examples in both directions (i.e.,
where the old behaviour leads to reordering but the new one doesn't, and
vice versa). I believe you could also reasonably argue that either
behaviour is more "correct", so if we were just picking between
behaviours I wouldn't be objecting, I think.

However, we're not just picking between two equally good behaviours,
we're changing one long-standing behaviour to a different one, and I
worry this will introduce regressions because there are applications
that (explicitly or implicitly) rely on the old behaviour.

Also, there's the starvation issue mentioned in the comment I quoted
above: with this patch it is possible for traffic redirected from a veth
to effectively starve the host TXQ, where before it wouldn't.
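
The starvation concern can be sketched with a toy userspace model of one
transmit round. This is not kernel code: the toy_* names are hypothetical,
and the policy that qdisc-bypassing XDP traffic claims TXQ slots before
qdisc traffic is an assumption made purely for illustration. It compares a
TXQ shared between the two kinds of traffic with dedicated TXQs of the same
size.

```c
#include <assert.h>

/* Toy model only: names and the "XDP goes first" policy are assumptions. */
struct toy_tx_result {
	unsigned int xdp_sent;		/* XDP packets transmitted */
	unsigned int qdisc_sent;	/* qdisc packets transmitted */
};

static unsigned int toy_min(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* One TXQ shared by qdisc traffic and qdisc-bypassing XDP traffic. */
static struct toy_tx_result toy_shared_txq(unsigned int slots,
					   unsigned int xdp_offered,
					   unsigned int qdisc_offered)
{
	struct toy_tx_result r;

	/* XDP traffic takes slots first; qdisc traffic gets the rest. */
	r.xdp_sent = toy_min(xdp_offered, slots);
	r.qdisc_sent = toy_min(qdisc_offered, slots - r.xdp_sent);
	assert(r.xdp_sent + r.qdisc_sent <= slots);
	return r;
}

/* Separate TXQs of equal size: XDP load cannot consume qdisc slots. */
static struct toy_tx_result toy_dedicated_txqs(unsigned int slots,
					       unsigned int xdp_offered,
					       unsigned int qdisc_offered)
{
	struct toy_tx_result r;

	r.xdp_sent = toy_min(xdp_offered, slots);
	r.qdisc_sent = toy_min(qdisc_offered, slots);
	return r;
}
```

With a heavy XDP load, toy_shared_txq() leaves no slots for qdisc traffic,
while toy_dedicated_txqs() still lets the qdisc traffic through. How closely
this matches real TXQ behaviour depends on the driver and qdisc setup.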

I don't really have a good answer for how we can make sure of this
either way, but I believe it's cause for concern, which is really my
main reservation with this change :)

-Toke

Patch

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index be7b62f57087..192547035194 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -713,19 +713,6 @@  static void veth_xdp_rcv_bulk_skb(struct veth_rq *rq, void **frames,
 	}
 }
 
-static void veth_xdp_get(struct xdp_buff *xdp)
-{
-	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
-	int i;
-
-	get_page(virt_to_page(xdp->data));
-	if (likely(!xdp_buff_has_frags(xdp)))
-		return;
-
-	for (i = 0; i < sinfo->nr_frags; i++)
-		__skb_frag_ref(&sinfo->frags[i]);
-}
-
 static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
 					struct xdp_buff *xdp,
 					struct sk_buff **pskb)
@@ -837,7 +824,7 @@  static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	struct veth_xdp_buff vxbuf;
 	struct xdp_buff *xdp = &vxbuf.xdp;
 	u32 act, metalen;
-	int off;
+	int off, err;
 
 	skb_prepare_for_gro(skb);
 
@@ -860,30 +847,10 @@  static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 
 	switch (act) {
 	case XDP_PASS:
-		break;
 	case XDP_TX:
-		veth_xdp_get(xdp);
-		consume_skb(skb);
-		xdp->rxq->mem = rq->xdp_mem;
-		if (unlikely(veth_xdp_tx(rq, xdp, bq) < 0)) {
-			trace_xdp_exception(rq->dev, xdp_prog, act);
-			stats->rx_drops++;
-			goto err_xdp;
-		}
-		stats->xdp_tx++;
-		rcu_read_unlock();
-		goto xdp_xmit;
 	case XDP_REDIRECT:
-		veth_xdp_get(xdp);
-		consume_skb(skb);
-		xdp->rxq->mem = rq->xdp_mem;
-		if (xdp_do_redirect(rq->dev, xdp, xdp_prog)) {
-			stats->rx_drops++;
-			goto err_xdp;
-		}
-		stats->xdp_redirect++;
-		rcu_read_unlock();
-		goto xdp_xmit;
+		/* Postpone actions to after potential SKB geometry update */
+		break;
 	default:
 		bpf_warn_invalid_xdp_action(rq->dev, xdp_prog, act);
 		fallthrough;
@@ -894,7 +861,6 @@  static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 		stats->xdp_drops++;
 		goto xdp_drop;
 	}
-	rcu_read_unlock();
 
 	/* check if bpf_xdp_adjust_head was used */
 	off = xdp->data - orig_data;
@@ -919,11 +885,32 @@  static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	else
 		skb->data_len = 0;
 
-	skb->protocol = eth_type_trans(skb, rq->dev);
-
 	metalen = xdp->data - xdp->data_meta;
 	if (metalen)
 		skb_metadata_set(skb, metalen);
+
+	switch (act) {
+	case XDP_PASS:
+		/* This skb_pull's off mac_len, __skb_push'ed above */
+		skb->protocol = eth_type_trans(skb, rq->dev);
+		break;
+	case XDP_REDIRECT:
+		err = xdp_do_generic_redirect(rq->dev, skb, xdp, xdp_prog);
+		if (unlikely(err)) {
+			trace_xdp_exception(rq->dev, xdp_prog, act);
+			goto xdp_drop;
+		}
+		stats->xdp_redirect++;
+		rcu_read_unlock();
+		goto xdp_xmit;
+	case XDP_TX:
+		/* TODO: this can be optimized to be veth specific */
+		generic_xdp_tx(skb, xdp_prog);
+		stats->xdp_tx++;
+		rcu_read_unlock();
+		goto xdp_xmit;
+	}
+	rcu_read_unlock();
 out:
 	return skb;
 drop:
@@ -931,10 +918,6 @@  static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 xdp_drop:
 	rcu_read_unlock();
 	kfree_skb(skb);
-	return NULL;
-err_xdp:
-	rcu_read_unlock();
-	xdp_return_buff(xdp);
 xdp_xmit:
 	return NULL;
 }
diff --git a/net/core/dev.c b/net/core/dev.c
index 17e6281e408c..1187bfced9ec 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4987,6 +4987,7 @@  void generic_xdp_tx(struct sk_buff *skb, struct bpf_prog *xdp_prog)
 		kfree_skb(skb);
 	}
 }
+EXPORT_SYMBOL_GPL(generic_xdp_tx);
 
 static DEFINE_STATIC_KEY_FALSE(generic_xdp_needed_key);
 
diff --git a/net/core/filter.c b/net/core/filter.c
index a094694899c9..a6fd7ba901ba 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4443,6 +4443,7 @@  int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 	_trace_xdp_redirect_err(dev, xdp_prog, ri->tgt_index, err);
 	return err;
 }
+EXPORT_SYMBOL_GPL(xdp_do_generic_redirect);
 
 BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags)
 {