Message ID | 169272715407.1975370.3989385869434330916.stgit@firesoul (mailing list archive) |
---|---|
State | RFC |
Delegated to: | Netdev Maintainers |
Series | veth: reduce reallocations of SKBs when XDP bpf-prog is loaded |
Jesper Dangaard Brouer <hawk@kernel.org> writes:

> The root cause of the realloc issue is that the veth_xdp_rcv_skb() code path
> (which handles SKBs like generic-XDP) calls the native-XDP function
> xdp_do_redirect(), instead of simply using xdp_do_generic_redirect(), which
> can handle SKBs.
>
> The existing code tries to steal the packet data from the SKB (and frees the
> SKB itself). This causes issues, as SKBs can have different memory models
> that are incompatible with the native-XDP call xdp_do_redirect(). For this
> reason the checks in veth_convert_skb_to_xdp_buff() become more strict. This
> in turn makes this a bad approach. Instead, simply leverage the generic-XDP
> helpers, e.g. generic_xdp_tx() and xdp_do_generic_redirect(); this resolves
> the issue, given that the netstack can handle these different SKB memory
> models.

While this does solve the memory issue, it's also a subtle change of
semantics. For one thing, generic_xdp_tx() has this comment above it:

/* When doing generic XDP we have to bypass the qdisc layer and the
 * network taps in order to match in-driver-XDP behavior. This also means
 * that XDP packets are able to starve other packets going through a qdisc,
 * and DDOS attacks will be more effective. In-driver-XDP use dedicated TX
 * queues, so they do not have this starvation issue.
 */

Also, more generally, this means that if you have a setup with
XDP_REDIRECT-based forwarding on a host with a mix of physical and
veth devices, all the traffic originating from the veth devices will go
on different TXQs than that originating from a physical NIC. Or if a
veth device has a mix of xdp_frame-backed packets and skb-backed
packets, those will also go on different queues, potentially leading to
reordering.

I'm not sure exactly how much of an issue this is in practice, but at
least from a conceptual PoV it's a change in behaviour that I don't
think we should be making lightly. WDYT?

-Toke
On 24/08/2023 12.30, Toke Høiland-Jørgensen wrote:
> Jesper Dangaard Brouer <hawk@kernel.org> writes:
>
>> The root cause of the realloc issue is that the veth_xdp_rcv_skb() code path
>> (which handles SKBs like generic-XDP) calls the native-XDP function
>> xdp_do_redirect(), instead of simply using xdp_do_generic_redirect(), which
>> can handle SKBs.
>>
>> The existing code tries to steal the packet data from the SKB (and frees the
>> SKB itself). This causes issues, as SKBs can have different memory models
>> that are incompatible with the native-XDP call xdp_do_redirect(). For this
>> reason the checks in veth_convert_skb_to_xdp_buff() become more strict. This
>> in turn makes this a bad approach. Instead, simply leverage the generic-XDP
>> helpers, e.g. generic_xdp_tx() and xdp_do_generic_redirect(); this resolves
>> the issue, given that the netstack can handle these different SKB memory
>> models.
>
> While this does solve the memory issue, it's also a subtle change of
> semantics. For one thing, generic_xdp_tx() has this comment above it:
>
> /* When doing generic XDP we have to bypass the qdisc layer and the
>  * network taps in order to match in-driver-XDP behavior. This also means
>  * that XDP packets are able to starve other packets going through a qdisc,
>  * and DDOS attacks will be more effective. In-driver-XDP use dedicated TX
>  * queues, so they do not have this starvation issue.
>  */
>
> Also, more generally, this means that if you have a setup with
> XDP_REDIRECT-based forwarding on a host with a mix of physical and
> veth devices, all the traffic originating from the veth devices will go
> on different TXQs than that originating from a physical NIC. Or if a
> veth device has a mix of xdp_frame-backed packets and skb-backed
> packets, those will also go on different queues, potentially leading to
> reordering.
>

Mixing xdp_frame-backed packets and skb-backed packets (towards veth)
will naturally come from two different data paths, and the BPF-developer
that redirected the xdp_frame (into veth) will have taken this choice,
including the chance of reordering (given the two data/code paths).

I will claim that (for SKBs) the current code causes reordering on TXQs
(as you explain), and my code changes actually fix this problem.

Consider a userspace app (inside a namespace) sending packets out (to the
veth peer). Routing (or bridging) will make the netstack send out device A
(maybe a physical device). On the veth peer we have an XDP-prog running
that will XDP-redirect every 2nd packet to device A. With the current code
TXQ reordering will occur, as calling "native" xdp_do_redirect() will select
the TXQ based on the currently running CPU, while normal SKBs will use
netdev_core_pick_tx(). After my change, using
xdp_do_generic_redirect(), the code ends up using generic_xdp_tx(), which
(looking at the code) also uses netdev_core_pick_tx() to select the TXQ.
Thus, I will claim it is more correct (even though XDP in general
doesn't give this guarantee).

> I'm not sure exactly how much of an issue this is in practice, but at
> least from a conceptual PoV it's a change in behaviour that I don't
> think we should be making lightly. WDYT?

As described above, I think this patchset is an improvement. It might even
fix/address the concern that was raised.

[Outside the scope of this patchset]

The single XDP BPF-prog that gets attached (RX-side) on a veth device
actually needs to handle *both* xdp_frame-backed packets and SKB-backed
packets, and it cannot tell them apart. (Easy fix: implement a kfunc
RX-metadata hint to expose this?)

Consider the use-case[1] of implementing NFV (Network Function Virt)
chaining via veth devices, where each veth pair's XDP BPF-prog implements
a network "function" and redirects/chains to the next veth/container NFV.
For this use-case, I would like the ability to either skip SKB-backed
packets or turn off the BPF-prog seeing any SKB-backed packets. There is a
huge performance advantage when XDP-redirecting an xdp_frame into veth
devices in this way, approx 6Mpps for traversing 4 veth devices as
benchmarked in [1]. (p.s. I was going to improve this performance further,
but I got distracted by other work).

[1] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp_frame03_overhead.org

The veth-NFV-like use-cases are hampered by the SKB-based XDP code-path
causing a significant slowdown for normal netstack packets. Plus, the
BPF-prog needs to parse-and-filter those SKB-based packets too.

This patchset "just" significantly reduces the overhead of the SKB-based
XDP code path, which IMHO is a good first step. Afterwards, we can discuss
whether we should have a switch to turn off the SKB-based XDP code-path in
veth.

--Jesper
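
The TXQ-selection argument in the message above can be illustrated with a small userspace toy model in C. This is only a sketch under stated assumptions, not kernel code: the `toy_*` names are hypothetical, the native path is assumed to pick the queue as CPU modulo queue count, and netdev_core_pick_tx() is approximated as flow hash modulo queue count (the real function has more inputs, e.g. XPS configuration).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_TXQ 64

/* Hypothetical stand-in for native XDP TXQ selection: derived from the CPU. */
static unsigned int toy_pick_txq_by_cpu(unsigned int cpu, unsigned int n_txq)
{
	assert(n_txq > 0);
	return cpu % n_txq;
}

/* Hypothetical stand-in for netdev_core_pick_tx(): derived from a flow hash. */
static unsigned int toy_pick_txq_by_flow(uint32_t flow_hash, unsigned int n_txq)
{
	assert(n_txq > 0);
	return flow_hash % n_txq;
}

/*
 * Send n_pkts packets of a single flow towards device A. Every 2nd packet
 * is XDP-redirected, the others take the normal stack path. Returns the
 * number of distinct TXQs the flow touched; more than one means the flow
 * can be reordered across queues.
 */
static unsigned int toy_txqs_used(bool redirect_is_generic, unsigned int cpu,
				  uint32_t flow_hash, unsigned int n_txq,
				  unsigned int n_pkts)
{
	bool seen[TOY_MAX_TXQ] = { false };
	unsigned int used = 0;
	unsigned int i;

	assert(n_txq > 0 && n_txq <= TOY_MAX_TXQ);
	for (i = 0; i < n_pkts; i++) {
		bool redirected = (i % 2) == 1;
		unsigned int q;

		if (redirected && !redirect_is_generic)
			q = toy_pick_txq_by_cpu(cpu, n_txq);	/* old: native-style pick */
		else
			q = toy_pick_txq_by_flow(flow_hash, n_txq); /* stack-style pick */

		if (!seen[q]) {
			seen[q] = true;
			used++;
		}
	}
	return used;
}
```

In this model, with CPU 3, flow hash 0x1234 and 8 queues, toy_txqs_used(false, 3, 0x1234, 8, 10) reports two queues (the old behaviour), while toy_txqs_used(true, 3, 0x1234, 8, 10) reports one. If the CPU-derived and hash-derived queues happen to coincide, both variants report one queue.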
Jesper Dangaard Brouer <hawk@kernel.org> writes:

> On 24/08/2023 12.30, Toke Høiland-Jørgensen wrote:
>> Jesper Dangaard Brouer <hawk@kernel.org> writes:
>>
>>> The root cause of the realloc issue is that the veth_xdp_rcv_skb() code path
>>> (which handles SKBs like generic-XDP) calls the native-XDP function
>>> xdp_do_redirect(), instead of simply using xdp_do_generic_redirect(), which
>>> can handle SKBs.
>>>
>>> The existing code tries to steal the packet data from the SKB (and frees the
>>> SKB itself). This causes issues, as SKBs can have different memory models
>>> that are incompatible with the native-XDP call xdp_do_redirect(). For this
>>> reason the checks in veth_convert_skb_to_xdp_buff() become more strict. This
>>> in turn makes this a bad approach. Instead, simply leverage the generic-XDP
>>> helpers, e.g. generic_xdp_tx() and xdp_do_generic_redirect(); this resolves
>>> the issue, given that the netstack can handle these different SKB memory
>>> models.
>>
>> While this does solve the memory issue, it's also a subtle change of
>> semantics. For one thing, generic_xdp_tx() has this comment above it:
>>
>> /* When doing generic XDP we have to bypass the qdisc layer and the
>>  * network taps in order to match in-driver-XDP behavior. This also means
>>  * that XDP packets are able to starve other packets going through a qdisc,
>>  * and DDOS attacks will be more effective. In-driver-XDP use dedicated TX
>>  * queues, so they do not have this starvation issue.
>>  */
>>
>> Also, more generally, this means that if you have a setup with
>> XDP_REDIRECT-based forwarding on a host with a mix of physical and
>> veth devices, all the traffic originating from the veth devices will go
>> on different TXQs than that originating from a physical NIC. Or if a
>> veth device has a mix of xdp_frame-backed packets and skb-backed
>> packets, those will also go on different queues, potentially leading to
>> reordering.
>>
>
> Mixing xdp_frame-backed packets and skb-backed packets (towards veth)
> will naturally come from two different data paths, and the BPF-developer
> that redirected the xdp_frame (into veth) will have taken this choice,
> including the chance of reordering (given the two data/code paths).

I'm not sure we can quite conclude that this is a choice any XDP
developers will be actively aware of. At best it's a very implicit
choice :)

> I will claim that (for SKBs) the current code causes reordering on TXQs
> (as you explain), and my code changes actually fix this problem.
>
> Consider a userspace app (inside a namespace) sending packets out (to the
> veth peer). Routing (or bridging) will make the netstack send out device A
> (maybe a physical device). On the veth peer we have an XDP-prog running
> that will XDP-redirect every 2nd packet to device A. With the current code
> TXQ reordering will occur, as calling "native" xdp_do_redirect() will select
> the TXQ based on the currently running CPU, while normal SKBs will use
> netdev_core_pick_tx(). After my change, using
> xdp_do_generic_redirect(), the code ends up using generic_xdp_tx(), which
> (looking at the code) also uses netdev_core_pick_tx() to select the TXQ.
> Thus, I will claim it is more correct (even though XDP in general
> doesn't give this guarantee).
>
>> I'm not sure exactly how much of an issue this is in practice, but at
>> least from a conceptual PoV it's a change in behaviour that I don't
>> think we should be making lightly. WDYT?
>
> As described above, I think this patchset is an improvement. It might even
> fix/address the concern that was raised.

Well, you can obviously construct examples in both directions (i.e.,
where the old behaviour leads to reordering but the new one doesn't, and
vice versa). I believe you could also reasonably argue that either
behaviour is more "correct", so if we were just picking between
behaviours I wouldn't be objecting, I think. However, we're not just
picking between two equally good behaviours, we're changing one
long-standing behaviour to a different one, and I worry this will
introduce regressions because there are applications that (explicitly or
implicitly) rely on the old behaviour.

Also, there's the starvation issue mentioned in the comment I quoted
above: with this patch it is possible for traffic redirected from a veth
to effectively starve the host TXQ, where before it wouldn't.

I don't really have a good answer for how we can make sure of this
either way, but I believe it's cause for concern, which is really my
main reservation with this change :)

-Toke
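
The opposite direction raised in this thread, a veth device receiving a mix of xdp_frame-backed and skb-backed packets of the same flow, can be sketched as a toy userspace C model. This is a sketch under stated assumptions rather than kernel code: all `toy_*` names are hypothetical, both packet types are assumed to be processed on the same CPU, the native pick is modelled as CPU modulo queue count, and the generic pick as flow hash modulo queue count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* How a packet arriving on the veth is backed (hypothetical toy enum). */
enum toy_backing { TOY_XDP_FRAME, TOY_SKB };

/*
 * Toy TXQ pick for one redirected packet. xdp_frame-backed packets always
 * use the native-style per-CPU pick. skb-backed packets use it too under
 * the old behaviour, but switch to a per-flow pick when skb_uses_generic
 * is set (modelling the patched behaviour).
 */
static unsigned int toy_redirect_txq(enum toy_backing backing,
				     bool skb_uses_generic, unsigned int cpu,
				     uint32_t flow_hash, unsigned int n_txq)
{
	assert(n_txq > 0);
	if (backing == TOY_SKB && skb_uses_generic)
		return flow_hash % n_txq;	/* stack-style, per-flow pick */
	return cpu % n_txq;			/* native-style, per-CPU pick */
}

/* True if both packet types of one flow land on the same TXQ. */
static bool toy_mixed_flow_single_txq(bool skb_uses_generic, unsigned int cpu,
				      uint32_t flow_hash, unsigned int n_txq)
{
	unsigned int q_frame = toy_redirect_txq(TOY_XDP_FRAME, skb_uses_generic,
						cpu, flow_hash, n_txq);
	unsigned int q_skb = toy_redirect_txq(TOY_SKB, skb_uses_generic,
					      cpu, flow_hash, n_txq);

	return q_frame == q_skb;
}
```

Under these assumptions the old behaviour keeps the mixed flow on one queue, while the patched behaviour can split it across two whenever the CPU-derived and hash-derived queues differ. The model says nothing about the qdisc-bypass starvation concern, which depends on queue sharing rather than queue choice.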
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index be7b62f57087..192547035194 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -713,19 +713,6 @@ static void veth_xdp_rcv_bulk_skb(struct veth_rq *rq, void **frames,
 	}
 }
 
-static void veth_xdp_get(struct xdp_buff *xdp)
-{
-	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
-	int i;
-
-	get_page(virt_to_page(xdp->data));
-	if (likely(!xdp_buff_has_frags(xdp)))
-		return;
-
-	for (i = 0; i < sinfo->nr_frags; i++)
-		__skb_frag_ref(&sinfo->frags[i]);
-}
-
 static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
 					struct xdp_buff *xdp,
 					struct sk_buff **pskb)
@@ -837,7 +824,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	struct veth_xdp_buff vxbuf;
 	struct xdp_buff *xdp = &vxbuf.xdp;
 	u32 act, metalen;
-	int off;
+	int off, err;
 
 	skb_prepare_for_gro(skb);
 
@@ -860,30 +847,10 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 
 	switch (act) {
 	case XDP_PASS:
-		break;
 	case XDP_TX:
-		veth_xdp_get(xdp);
-		consume_skb(skb);
-		xdp->rxq->mem = rq->xdp_mem;
-		if (unlikely(veth_xdp_tx(rq, xdp, bq) < 0)) {
-			trace_xdp_exception(rq->dev, xdp_prog, act);
-			stats->rx_drops++;
-			goto err_xdp;
-		}
-		stats->xdp_tx++;
-		rcu_read_unlock();
-		goto xdp_xmit;
 	case XDP_REDIRECT:
-		veth_xdp_get(xdp);
-		consume_skb(skb);
-		xdp->rxq->mem = rq->xdp_mem;
-		if (xdp_do_redirect(rq->dev, xdp, xdp_prog)) {
-			stats->rx_drops++;
-			goto err_xdp;
-		}
-		stats->xdp_redirect++;
-		rcu_read_unlock();
-		goto xdp_xmit;
+		/* Postpone actions to after potential SKB geometry update */
+		break;
 	default:
 		bpf_warn_invalid_xdp_action(rq->dev, xdp_prog, act);
 		fallthrough;
@@ -894,7 +861,6 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 		stats->xdp_drops++;
 		goto xdp_drop;
 	}
-	rcu_read_unlock();
 
 	/* check if bpf_xdp_adjust_head was used */
 	off = xdp->data - orig_data;
@@ -919,11 +885,32 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	else
 		skb->data_len = 0;
 
-	skb->protocol = eth_type_trans(skb, rq->dev);
-
 	metalen = xdp->data - xdp->data_meta;
 	if (metalen)
 		skb_metadata_set(skb, metalen);
+
+	switch (act) {
+	case XDP_PASS:
+		/* This skb_pull's off mac_len, __skb_push'ed above */
+		skb->protocol = eth_type_trans(skb, rq->dev);
+		break;
+	case XDP_REDIRECT:
+		err = xdp_do_generic_redirect(rq->dev, skb, xdp, xdp_prog);
+		if (unlikely(err)) {
+			trace_xdp_exception(rq->dev, xdp_prog, act);
+			goto xdp_drop;
+		}
+		stats->xdp_redirect++;
+		rcu_read_unlock();
+		goto xdp_xmit;
+	case XDP_TX:
+		/* TODO: this can be optimized to be veth specific */
+		generic_xdp_tx(skb, xdp_prog);
+		stats->xdp_tx++;
+		rcu_read_unlock();
+		goto xdp_xmit;
+	}
+	rcu_read_unlock();
 out:
 	return skb;
 drop:
@@ -931,10 +918,6 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 xdp_drop:
 	rcu_read_unlock();
 	kfree_skb(skb);
-	return NULL;
-err_xdp:
-	rcu_read_unlock();
-	xdp_return_buff(xdp);
 xdp_xmit:
 	return NULL;
 }
diff --git a/net/core/dev.c b/net/core/dev.c
index 17e6281e408c..1187bfced9ec 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4987,6 +4987,7 @@ void generic_xdp_tx(struct sk_buff *skb, struct bpf_prog *xdp_prog)
 		kfree_skb(skb);
 	}
 }
+EXPORT_SYMBOL_GPL(generic_xdp_tx);
 
 static DEFINE_STATIC_KEY_FALSE(generic_xdp_needed_key);
 
diff --git a/net/core/filter.c b/net/core/filter.c
index a094694899c9..a6fd7ba901ba 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4443,6 +4443,7 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 	_trace_xdp_redirect_err(dev, xdp_prog, ri->tgt_index, err);
 	return err;
 }
+EXPORT_SYMBOL_GPL(xdp_do_generic_redirect);
 
 BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags)
 {
The root cause of the realloc issue is that the veth_xdp_rcv_skb() code path
(which handles SKBs like generic-XDP) calls the native-XDP function
xdp_do_redirect(), instead of simply using xdp_do_generic_redirect(), which
can handle SKBs.

The existing code tries to steal the packet data from the SKB (and frees the
SKB itself). This causes issues, as SKBs can have different memory models
that are incompatible with the native-XDP call xdp_do_redirect(). For this
reason the checks in veth_convert_skb_to_xdp_buff() become more strict. This
in turn makes this a bad approach. Instead, simply leverage the generic-XDP
helpers, e.g. generic_xdp_tx() and xdp_do_generic_redirect(); this resolves
the issue, given that the netstack can handle these different SKB memory
models.

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
 drivers/net/veth.c | 69 ++++++++++++++++++++--------------------------------
 net/core/dev.c     |  1 +
 net/core/filter.c  |  1 +
 3 files changed, 28 insertions(+), 43 deletions(-)