Message ID | 20220314141508.39952-2-xiangxia.m.yue@gmail.com (mailing list archive)
---|---
State | Superseded |
Delegated to: | Netdev Maintainers |
Series | net: sched: allow user to select txqueue
On 2022-03-14 10:15, xiangxia.m.yue@gmail.com wrote:
> From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
>
> This patch fixes an issue:
> * If we install tc filters with act_skbedit in the clsact hook, they
>   don't take effect, because netdev_core_pick_tx() overwrites
>   queue_mapping.
>
>   $ tc filter ... action skbedit queue_mapping 1
>
> And this patch is useful:
> * We can use FQ + EDT to implement efficient policies. Tx queues are
>   picked by xps, the ndo_select_queue of the netdev driver, or the skb
>   hash in netdev_core_pick_tx(). In fact, the netdev driver and the skb
>   hash are _not_ under our control. xps uses the CPU map to select Tx
>   queues, but in most cases we can't figure out which task_struct of a
>   pod/container is running on a given CPU. We can use clsact filters to
>   classify one pod/container's traffic to one Tx queue. Why?
>
>   In a container networking environment, there are two kinds of pod/
>   container/net-namespace. For one kind (e.g. P1, P2), high throughput
>   is key for the applications, but to avoid running out of network
>   resources, the outbound traffic of these pods is limited, using or
>   sharing one dedicated Tx queue assigned an HTB/TBF/FQ Qdisc. For the
>   other kind of pods (e.g. Pn), low latency of data access is key, and
>   their traffic is not limited; these pods use or share other dedicated
>   Tx queues assigned a FIFO Qdisc. This choice provides two benefits.
>   First, contention on the HTB/FQ Qdisc lock is significantly reduced
>   since fewer CPUs contend for the same queue. More importantly, Qdisc
>   contention can be eliminated completely if each CPU has its own FIFO
>   Qdisc for the second kind of pods.
>
> There must be a mechanism in place to support classifying traffic based
> on pods/containers to different Tx queues. Note that clsact runs outside
> of a Qdisc, while a Qdisc can run a classifier to select a sub-queue
> under its lock.
>
> In general, recording the decision in the skb seems a little
> heavy-handed. This patch introduces a per-CPU variable, suggested by
> Eric.
>
> The xmit.skip_txqueue flag is first cleared in __dev_queue_xmit():
> - A Tx Qdisc may also install skbedit actions; then the
>   xmit.skip_txqueue flag is set in qdisc->enqueue() even though the tx
>   queue has already been selected in netdev_tx_queue_mapping() or
>   netdev_core_pick_tx(). Clearing the flag first in __dev_queue_xmit()
>   is useful:
>   - It avoids picking the Tx queue with netdev_tx_queue_mapping() in the
>     next netdev of a stacked setup such as eth0 macvlan - eth0.3 vlan -
>     eth0 ixgbe-phy. For example, eth0 is a macvlan in a pod whose root
>     Qdisc installs skbedit queue_mapping and sends packets to eth0.3, a
>     vlan in the host. In __dev_queue_xmit() of eth0.3, the cleared flag
>     means the tx queue is not selected from skb->queue_mapping, because
>     there are no filters in the clsact or tx Qdisc of this netdev. The
>     same happens on eth0, the ixgbe device in the host.
>   - It avoids picking the Tx queue for the next packet. If we set
>     xmit.skip_txqueue in the tx Qdisc (qdisc->enqueue()), the proper way
>     to clear it is to clear it in __dev_queue_xmit() when processing the
>     next packet.
>
> For performance reasons, the static key is used; if the user does not
> enable CONFIG_NET_EGRESS, this code is not compiled in.
>
>  +----+     +----+     +----+
>  | P1 |     | P2 |     | Pn |
>  +----+     +----+     +----+
>     |          |          |
>     +----------+----------+
>                |
>                | clsact/skbedit
>                |      MQ
>                v
>     +----------+----------+
>     | q0       | q1       | qn
>     v          v          v
>   HTB/FQ     HTB/FQ  ...  FIFO
>
> Cc: Jamal Hadi Salim <jhs@mojatatu.com>
> Cc: Cong Wang <xiyou.wangcong@gmail.com>
> Cc: Jiri Pirko <jiri@resnulli.us>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Alexander Lobakin <alobakin@pm.me>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: Talal Ahmad <talalahmad@google.com>
> Cc: Kevin Hao <haokexin@gmail.com>
> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> Cc: Antoine Tenart <atenart@kernel.org>
> Cc: Wei Wang <weiwan@google.com>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Suggested-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

cheers,
jamal
On 3/14/22 3:15 PM, xiangxia.m.yue@gmail.com wrote:
[...]
>  include/linux/netdevice.h | 3 +++
>  include/linux/rtnetlink.h | 1 +
>  net/core/dev.c            | 31 +++++++++++++++++++++++++++++--
>  net/sched/act_skbedit.c   | 6 +++++-
>  4 files changed, 38 insertions(+), 3 deletions(-)
[...]
> @@ -4121,7 +4146,9 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
>  	else
>  		skb_dst_force(skb);
>
> -	txq = netdev_core_pick_tx(dev, skb, sb_dev);
> +	if (likely(!txq))

nit: Drop likely(). If the feature is used from sch_handle_egress(), then this
would always be the case.

> +		txq = netdev_core_pick_tx(dev, skb, sb_dev);
> +
>  	q = rcu_dereference_bh(txq->qdisc);

How will the `netdev_xmit_skip_txqueue(true)` be usable from BPF side (see
bpf_convert_ctx_access() -> queue_mapping)?

[...]
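For reference, the BPF side Daniel mentions is a tc program that writes
__sk_buff->queue_mapping directly; the store is rewritten by
bpf_convert_ctx_access(). A minimal sketch, with an illustrative queue index;
note that with this patch alone nothing sets xmit.skip_txqueue on this path:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("classifier")
int pick_txq(struct __sk_buff *skb)
{
        /* direct ctx write, rewritten by the verifier; with this patch,
         * only act_skbedit sets xmit.skip_txqueue, so this write is not
         * guaranteed to stick in __dev_queue_xmit() */
        skb->queue_mapping = 1;
        return 0; /* TC_ACT_OK */
}

char _license[] SEC("license") = "GPL";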
On Tue, Mar 15, 2022 at 5:59 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 3/14/22 3:15 PM, xiangxia.m.yue@gmail.com wrote:
> [...]
> > -	txq = netdev_core_pick_tx(dev, skb, sb_dev);
> > +	if (likely(!txq))
>
> nit: Drop likely(). If the feature is used from sch_handle_egress(), then
> this would always be the case.

Hi Daniel

I think in most cases we don't use skbedit queue_mapping in
sch_handle_egress(), so I added likely() in the fast path.

> > +		txq = netdev_core_pick_tx(dev, skb, sb_dev);
> > +
> >  	q = rcu_dereference_bh(txq->qdisc);
>
> How will the `netdev_xmit_skip_txqueue(true)` be usable from BPF side (see
> bpf_convert_ctx_access() -> queue_mapping)?

Good question. In another patch, I introduce bpf_netdev_skip_txqueue, so we
can use netdev_xmit_skip_txqueue from the BPF side. This is not an official
patch (I will post it once it is ready):

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 4eebea830613..ef147a1a2d62 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5117,6 +5117,21 @@ union bpf_attr {
  *		0 on success.
  *		**-EINVAL** for invalid input
  *		**-EOPNOTSUPP** for unsupported delivery_time_type and protocol
+ *
+ * void bpf_netdev_skip_txqueue(u32 skip)
+ *	Description
+ *		Set the per-CPU **xmit.skip_txqueue** flag. When *skip* is
+ *		non-zero, __dev_queue_xmit() uses the queue_mapping already
+ *		set in the skb (for example via a write to
+ *		*skb*\ **->queue_mapping**) instead of picking the Tx queue
+ *		with netdev_core_pick_tx().
+ *
+ *		The flag is cleared at the top of __dev_queue_xmit() for
+ *		each packet, so it only affects the current transmission.
+ *		The helper is only available with **CONFIG_NET_EGRESS**
+ *		enabled.
+ *	Return
+ *		Nothing. Always succeeds.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5312,6 +5327,7 @@ union bpf_attr {
 	FN(xdp_store_bytes),		\
 	FN(copy_from_user_task),	\
 	FN(skb_set_delivery_time),	\
+	FN(netdev_skip_txqueue),	\
 	/* */

 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/net/core/filter.c b/net/core/filter.c
index 88767f7da150..5845b4632b6b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2517,6 +2517,19 @@ static const struct bpf_func_proto bpf_redirect_peer_proto = {
 	.arg2_type	= ARG_ANYTHING,
 };

+BPF_CALL_1(bpf_netdev_skip_txqueue, u32, skip)
+{
+	netdev_xmit_skip_txqueue(!!skip);
+	return 0;
+};
+
+static const struct bpf_func_proto bpf_netdev_skip_txqueue_proto = {
+	.func		= bpf_netdev_skip_txqueue,
+	.gpl_only	= false,
+	.ret_type	= RET_VOID,
+	.arg1_type	= ARG_ANYTHING,
+};
+
 BPF_CALL_4(bpf_redirect_neigh, u32, ifindex, struct bpf_redir_neigh *, params,
 	   int, plen, u64, flags)
 {
@@ -7721,6 +7734,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_redirect_proto;
 	case BPF_FUNC_redirect_neigh:
 		return &bpf_redirect_neigh_proto;
+	case BPF_FUNC_netdev_skip_txqueue:
+		return &bpf_netdev_skip_txqueue_proto;
 	case BPF_FUNC_redirect_peer:
 		return &bpf_redirect_peer_proto;
 	case BPF_FUNC_get_route_realm:
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4eebea830613..ef147a1a2d62 100644
[... same two hunks as include/uapi/linux/bpf.h above ...]

[...]
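If the draft above were applied, a tc program could pair the helper with the
queue_mapping write — a hypothetical sketch only, since
bpf_netdev_skip_txqueue() is not a merged helper (its declaration would come
from the regenerated helper headers):

SEC("classifier")
int pick_txq(struct __sk_buff *skb)
{
        skb->queue_mapping = 1;       /* illustrative queue index */
        bpf_netdev_skip_txqueue(1);   /* hypothetical: ask __dev_queue_xmit()
                                       * to honor the value above */
        return 0;                     /* TC_ACT_OK */
}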
On Tue, 2022-03-15 at 20:48 +0800, Tonghao Zhang wrote:
> On Tue, Mar 15, 2022 at 5:59 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> > [...]
> > How will the `netdev_xmit_skip_txqueue(true)` be usable from BPF side (see
> > bpf_convert_ctx_access() -> queue_mapping)?
> Good question. In another patch, I introduce bpf_netdev_skip_txqueue, so we
> can use netdev_xmit_skip_txqueue from the BPF side.

@Daniel: are you ok with the above explanation?

Thanks!

Paolo
On 3/17/22 9:20 AM, Paolo Abeni wrote:
> On Tue, 2022-03-15 at 20:48 +0800, Tonghao Zhang wrote:
>> On Tue, Mar 15, 2022 at 5:59 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>> [...]
>>> nit: Drop likely(). If the feature is used from sch_handle_egress(), then
>>> this would always be the case.
>> Hi Daniel
>> I think in most cases we don't use skbedit queue_mapping in
>> sch_handle_egress(), so I added likely() in the fast path.

Yeah, but then let the branch predictor do its work? We can still change and
drop the likely() once we add support for BPF, though..

>>> How will the `netdev_xmit_skip_txqueue(true)` be usable from BPF side (see
>>> bpf_convert_ctx_access() -> queue_mapping)?
>> Good question. In another patch, I introduce bpf_netdev_skip_txqueue, so we
>> can use netdev_xmit_skip_txqueue from the BPF side.

Yeah, that bpf_netdev_skip_txqueue() won't fly. It's basically a helper doing
a quirk for an implementation detail (aka calling netdev_xmit_skip_txqueue()).
I was hoping you had something better we could use along with the context
rewrite of __sk_buff's queue_mapping, but worst case we need to rework a bit
for BPF. :/

Thanks,
Daniel
On Fri, Mar 18, 2022 at 9:36 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 3/17/22 9:20 AM, Paolo Abeni wrote:
> [...]
>
> Yeah, but then let the branch predictor do its work? We can still change and
> drop the likely() once we add support for BPF, though..

Hi, if you are ok with introducing the bpf helper shown below, I will drop
likely() in the next patch.

> Yeah, that bpf_netdev_skip_txqueue() won't fly. It's basically a helper doing
> a quirk for an implementation detail (aka calling netdev_xmit_skip_txqueue()).
> I was hoping you had something better we could use along with the context
> rewrite of __sk_buff's queue_mapping, but worst case we need to rework a bit
> for BPF. :/

Hi Daniel

I reviewed the bpf code; we already have a lot of helpers that change skb
fields:

skb_change_proto
skb_change_type
skb_change_tail
skb_pull_data
skb_change_head
skb_ecn_set_ce
skb_cgroup_classid
skb_vlan_push
skb_set_tunnel_key

Did you mean that introducing bpf_skb_set_queue_mapping would be better than
bpf_netdev_skip_txqueue? For example:

BPF_CALL_2(bpf_skb_set_queue_mapping, struct sk_buff *, skb, u32, txq)
{
	skb->queue_mapping = txq;
	netdev_xmit_skip_txqueue(true);
	return 0;
};
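For comparison, a caller of the combined helper sketched above would need only
one call — hypothetical, since bpf_skb_set_queue_mapping() is only a proposal
here — and it keeps skb->queue_mapping and the per-CPU flag in sync by
construction, so BPF users cannot set one without the other:

SEC("classifier")
int pick_txq(struct __sk_buff *skb)
{
        /* hypothetical helper from the sketch above: sets queue_mapping
         * and xmit.skip_txqueue in one step */
        bpf_skb_set_queue_mapping(skb, 1);   /* illustrative queue index */
        return 0;                            /* TC_ACT_OK */
}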
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0d994710b335..f33fb2d6712a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3065,6 +3065,9 @@ struct softnet_data {
 	struct {
 		u16 recursion;
 		u8  more;
+#ifdef CONFIG_NET_EGRESS
+		u8  skip_txqueue;
+#endif
 	} xmit;
 #ifdef CONFIG_RPS
 	/* input_queue_head should be written by cpu owning this struct,
diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 7f970b16da3a..ae2c6a3cec5d 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -100,6 +100,7 @@ void net_dec_ingress_queue(void);
 #ifdef CONFIG_NET_EGRESS
 void net_inc_egress_queue(void);
 void net_dec_egress_queue(void);
+void netdev_xmit_skip_txqueue(bool skip);
 #endif

 void rtnetlink_init(void);
diff --git a/net/core/dev.c b/net/core/dev.c
index 75bab5b0dbae..8e83b7099977 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3908,6 +3908,25 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)

 	return skb;
 }
+
+static struct netdev_queue *
+netdev_tx_queue_mapping(struct net_device *dev, struct sk_buff *skb)
+{
+	int qm = skb_get_queue_mapping(skb);
+
+	return netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, qm));
+}
+
+static bool netdev_xmit_txqueue_skipped(void)
+{
+	return __this_cpu_read(softnet_data.xmit.skip_txqueue);
+}
+
+void netdev_xmit_skip_txqueue(bool skip)
+{
+	__this_cpu_write(softnet_data.xmit.skip_txqueue, skip);
+}
+EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue);
 #endif /* CONFIG_NET_EGRESS */

 #ifdef CONFIG_XPS
@@ -4078,7 +4097,7 @@ struct netdev_queue *netdev_core_pick_tx(struct net_device *dev,
 static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 {
 	struct net_device *dev = skb->dev;
-	struct netdev_queue *txq;
+	struct netdev_queue *txq = NULL;
 	struct Qdisc *q;
 	int rc = -ENOMEM;
 	bool again = false;
@@ -4106,11 +4125,17 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 			if (!skb)
 				goto out;
 		}
+
+		netdev_xmit_skip_txqueue(false);
+
 		nf_skip_egress(skb, true);
 		skb = sch_handle_egress(skb, &rc, dev);
 		if (!skb)
 			goto out;
 		nf_skip_egress(skb, false);
+
+		if (netdev_xmit_txqueue_skipped())
+			txq = netdev_tx_queue_mapping(dev, skb);
 	}
 #endif
 	/* If device/qdisc don't need skb->dst, release it right now while
@@ -4121,7 +4146,9 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 	else
 		skb_dst_force(skb);

-	txq = netdev_core_pick_tx(dev, skb, sb_dev);
+	if (likely(!txq))
+		txq = netdev_core_pick_tx(dev, skb, sb_dev);
+
 	q = rcu_dereference_bh(txq->qdisc);

 	trace_net_dev_queue(skb);
diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c
index ceba11b198bb..d5799b4fc499 100644
--- a/net/sched/act_skbedit.c
+++ b/net/sched/act_skbedit.c
@@ -58,8 +58,12 @@ static int tcf_skbedit_act(struct sk_buff *skb, const struct tc_action *a,
 		}
 	}
 	if (params->flags & SKBEDIT_F_QUEUE_MAPPING &&
-	    skb->dev->real_num_tx_queues > params->queue_mapping)
+	    skb->dev->real_num_tx_queues > params->queue_mapping) {
+#ifdef CONFIG_NET_EGRESS
+		netdev_xmit_skip_txqueue(true);
+#endif
 		skb_set_queue_mapping(skb, params->queue_mapping);
+	}
 	if (params->flags & SKBEDIT_F_MARK) {
 		skb->mark &= ~params->mask;
 		skb->mark |= params->mark & params->mask;
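One way to exercise the patch end to end, assuming a multi-queue NIC; the
flower match and the per-queue counter names (ixgbe-style shown) vary by
driver and are illustrative:

$ tc qdisc add dev eth0 clsact
$ tc filter add dev eth0 egress protocol ip flower ip_proto tcp \
      action skbedit queue_mapping 1
$ ethtool -S eth0 | grep tx_queue_1    # matched traffic should count here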