Message ID | 20211206080512.36610-2-xiangxia.m.yue@gmail.com (mailing list archive)
---|---
State | Superseded
Delegated to: | Netdev Maintainers
Series | net: sched: allow user to select txqueue
On Mon, 6 Dec 2021 16:05:11 +0800 xiangxia.m.yue@gmail.com wrote:
> +----+      +----+      +----+
> | P1 |      | P2 |      | Pn |
> +----+      +----+      +----+
>    |           |           |
>    +-----------+-----------+
>                |
>                | clsact/skbedit
>                |      MQ
>                v
>    +-----------+-----------+
>    | q0        | q1        | qn
>    v           v           v
>   HTB         HTB   ...   FIFO

The usual suggestion these days is to try to use FQ + EDT to
implement efficient policies. You don't need dedicated qdiscs,
just modulate transmission time appropriately on egress of the
container.

In general recording the decision in the skb seems a little heavy
handed. We just need to carry the information from the egress hook
to the queue selection a few lines below. Or in fact maybe egress
hook shouldn't be used for this in the first place, and we need
a more appropriate root qdisc than simple mq? Not sure.

What I am sure of is that you need to fix these warnings:

include/linux/skbuff.h:937: warning: Function parameter or member 'tc_skip_txqueue' not described in 'sk_buff'

ERROR: spaces required around that '=' (ctx:VxW)
#103: FILE: net/sched/act_skbedit.c:42:
+		queue_mapping= (queue_mapping & 0xff) + hash % mapping_mod;
 		             ^

;)
On Tue, Dec 7, 2021 at 4:40 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 6 Dec 2021 16:05:11 +0800 xiangxia.m.yue@gmail.com wrote:
> > +----+      +----+      +----+
> > | P1 |      | P2 |      | Pn |
> > +----+      +----+      +----+
> >    |           |           |
> >    +-----------+-----------+
> >                |
> >                | clsact/skbedit
> >                |      MQ
> >                v
> >    +-----------+-----------+
> >    | q0        | q1        | qn
> >    v           v           v
> >   HTB         HTB   ...   FIFO

Hi Jakub, thanks for your comments.

> The usual suggestion these days is to try to use FQ + EDT to
> implement efficient policies. You don't need dedicated qdiscs,
> just modulate transmission time appropriately on egress of the
> container.

FQ+EDT is a good solution, but this patch is meant for a different
scenario:

1. Containers whose outbound traffic is not rate-limited want to use
   the FIFO qdisc. If that traffic shares an FQ/HTB qdisc, the qdisc
   lock will affect performance and latency.
2. We can let the user select a tx queue in a range from A to B. The
   skb hash or cgroup classid works well for load balancing.

patch 2/2:
https://patchwork.kernel.org/project/netdevbpf/patch/20211206080512.36610-3-xiangxia.m.yue@gmail.com/

> In general recording the decision in the skb seems a little heavy
> handed. We just need to carry the information from the egress hook
> to the queue selection a few lines below. Or in fact maybe egress

Yes, we can refactor netdev_core_pick_tx to:
1. select queue_index and invoke skb_set_queue_mapping, but don't
   return the txq.
2. after the egress hook, use skb_get_queue_mapping/netdev_get_tx_queue
   to get the txq.

> hook shouldn't be used for this in the first place, and we need
> a more appropriate root qdisc than simple mq?

I have no idea about mq; I think clsact makes things more flexible,
and act_bpf can also change the skb queue_mapping. queue_mapping is
included in __sk_buff.

> Not sure.
>
> What I am sure of is that you need to fix these warnings:

Ok

> include/linux/skbuff.h:937: warning: Function parameter or member 'tc_skip_txqueue' not described in 'sk_buff'
>
> ERROR: spaces required around that '=' (ctx:VxW)
> #103: FILE: net/sched/act_skbedit.c:42:
> +		queue_mapping= (queue_mapping & 0xff) + hash % mapping_mod;
>  		             ^
>
> ;)
On Tue, 7 Dec 2021 10:10:22 +0800 Tonghao Zhang wrote:
> > In general recording the decision in the skb seems a little heavy
> > handed. We just need to carry the information from the egress hook
> > to the queue selection a few lines below. Or in fact maybe egress
> Yes, we can refactor netdev_core_pick_tx to
> 1. select queue_index and invoke skb_set_queue_mapping, but don't
> return the txq.
> 2. after egress hook, use skb_get_queue_mapping/netdev_get_tx_queue to get txq.

I'm not sure that's what I meant. I meant the information you need to
store does not need to be stored in the skb; you can pass a pointer to
a stack variable to both egress handling and pick_tx.

> > hook shouldn't be used for this in the first place, and we need
> > a more appropriate root qdisc than simple mq?
> I have no idea about mq, I think clsact may make the things more flexible.
> and act_bpf can also support to change sk queue_mapping. queue_mapping
> was included in __sk_buff.

Qdiscs can run a classifier to select a sub-queue. The advantage of
the classifier run by the Qdisc is that it runs after pick_tx.
On Tue, Dec 7, 2021 at 10:33 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 7 Dec 2021 10:10:22 +0800 Tonghao Zhang wrote:
> > > In general recording the decision in the skb seems a little heavy
> > > handed. We just need to carry the information from the egress hook
> > > to the queue selection a few lines below. Or in fact maybe egress
> > Yes, we can refactor netdev_core_pick_tx to
> > 1. select queue_index and invoke skb_set_queue_mapping, but don't
> > return the txq.
> > 2. after egress hook, use skb_get_queue_mapping/netdev_get_tx_queue to get txq.
>
> I'm not sure that's what I meant, I meant the information you need to
> store does not need to be stored in the skb, you can pass a pointer to
> a stack variable to both egress handling and pick_tx.

Thanks, I got it. I think storing the txq index in skb->queue_mapping
is better, because in the egress hook act_skbedit/act_bpf can change
the skb queue_mapping. Then pick_tx can depend on queue_mapping.

> > > hook shouldn't be used for this in the first place, and we need
> > > a more appropriate root qdisc than simple mq?
> > I have no idea about mq, I think clsact may make the things more flexible.
> > and act_bpf can also support to change sk queue_mapping. queue_mapping
> > was included in __sk_buff.
>
> Qdiscs can run a classifier to select a sub-queue. The advantage of
> the classifier run by the Qdisc is that it runs after pick_tx.

Yes, but we should consider the qdisc lock too. The qdisc lock may
affect performance and latency when a classifier runs inside the
qdisc, whereas clsact runs outside the qdisc lock.
On Tue, 7 Dec 2021 11:22:28 +0800 Tonghao Zhang wrote:
> On Tue, Dec 7, 2021 at 10:33 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > On Tue, 7 Dec 2021 10:10:22 +0800 Tonghao Zhang wrote:
> > > Yes, we can refactor netdev_core_pick_tx to
> > > 1. select queue_index and invoke skb_set_queue_mapping, but don't
> > > return the txq.
> > > 2. after egress hook, use skb_get_queue_mapping/netdev_get_tx_queue to get txq.
> >
> > I'm not sure that's what I meant, I meant the information you need to
> > store does not need to be stored in the skb, you can pass a pointer to
> > a stack variable to both egress handling and pick_tx.
> Thanks, I got it. I think we store the txq index in skb->queue_mapping
> better. because in egress hook,
> act_skbedit/act_bpf can change the skb queue_mapping. Then we can
> pick_tx depending on queue_mapping.

Actually Eric pointed out in another thread that xmit_more() is now
done via a per-CPU variable, you can try that instead of plumbing a
variable all the way into actions and back out to pick_tx().

Please make sure to include the analysis of the performance impact
when the feature is _not_ used in the next version.

> > > I have no idea about mq, I think clsact may make the things more flexible.
> > > and act_bpf can also support to change sk queue_mapping. queue_mapping
> > > was included in __sk_buff.
> >
> > Qdiscs can run a classifier to select a sub-queue. The advantage of
> > the classifier run by the Qdisc is that it runs after pick_tx.
> Yes, we should consider the qdisc lock too. Qdisc lock may affect
> performance and latency when running a classifier in Qdisc
> and clsact is outside of qdisc.
On Tue, Dec 7, 2021 at 11:45 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 7 Dec 2021 11:22:28 +0800 Tonghao Zhang wrote:
> > On Tue, Dec 7, 2021 at 10:33 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > > On Tue, 7 Dec 2021 10:10:22 +0800 Tonghao Zhang wrote:
> > > > Yes, we can refactor netdev_core_pick_tx to
> > > > 1. select queue_index and invoke skb_set_queue_mapping, but don't
> > > > return the txq.
> > > > 2. after egress hook, use skb_get_queue_mapping/netdev_get_tx_queue to get txq.
> > >
> > > I'm not sure that's what I meant, I meant the information you need to
> > > store does not need to be stored in the skb, you can pass a pointer to
> > > a stack variable to both egress handling and pick_tx.
> > Thanks, I got it. I think we store the txq index in skb->queue_mapping
> > better. because in egress hook,
> > act_skbedit/act_bpf can change the skb queue_mapping. Then we can
> > pick_tx depending on queue_mapping.
>
> Actually Eric pointed out in another thread that xmit_more() is now
> done via a per-CPU variable, you can try that instead of plumbing a
> variable all the way into actions and back out to pick_tx().

Thanks Jakub, Eric. v2 is sent, please review:
https://patchwork.kernel.org/project/netdevbpf/list/?series=592341

> Please make sure to include the analysis of the performance impact
> when the feature is _not_ used in the next version.

Ok, I updated the commit message. Thanks!

> > > > I have no idea about mq, I think clsact may make the things more flexible.
> > > > and act_bpf can also support to change sk queue_mapping. queue_mapping
> > > > was included in __sk_buff.
> > >
> > > Qdiscs can run a classifier to select a sub-queue. The advantage of
> > > the classifier run by the Qdisc is that it runs after pick_tx.
> > Yes, we should consider the qdisc lock too. Qdisc lock may affect
> > performance and latency when running a classifier in Qdisc
> > and clsact is outside of qdisc.
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index eae4bd3237a4..b6ea4b920409 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -856,6 +856,7 @@ struct sk_buff {
 #endif
 #ifdef CONFIG_NET_CLS_ACT
 	__u8			tc_skip_classify:1;
+	__u8			tc_skip_txqueue:1;
 	__u8			tc_at_ingress:1;
 #endif
 	__u8			redirected:1;
diff --git a/net/core/dev.c b/net/core/dev.c
index aba8acc1238c..fb9d4eee29ee 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3975,10 +3975,16 @@ struct netdev_queue *netdev_core_pick_tx(struct net_device *dev,
 {
 	int queue_index = 0;
 
-#ifdef CONFIG_XPS
-	u32 sender_cpu = skb->sender_cpu - 1;
+#ifdef CONFIG_NET_CLS_ACT
+	if (skb->tc_skip_txqueue) {
+		queue_index = netdev_cap_txqueue(dev,
+						 skb_get_queue_mapping(skb));
+		return netdev_get_tx_queue(dev, queue_index);
+	}
+#endif
 
-	if (sender_cpu >= (u32)NR_CPUS)
+#ifdef CONFIG_XPS
+	if ((skb->sender_cpu - 1) >= (u32)NR_CPUS)
 		skb->sender_cpu = raw_smp_processor_id() + 1;
 #endif
 
diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c
index d30ecbfc8f84..940091a7c7f0 100644
--- a/net/sched/act_skbedit.c
+++ b/net/sched/act_skbedit.c
@@ -58,8 +58,10 @@ static int tcf_skbedit_act(struct sk_buff *skb, const struct tc_action *a,
 		}
 	}
 	if (params->flags & SKBEDIT_F_QUEUE_MAPPING &&
-	    skb->dev->real_num_tx_queues > params->queue_mapping)
+	    skb->dev->real_num_tx_queues > params->queue_mapping) {
+		skb->tc_skip_txqueue = 1;
 		skb_set_queue_mapping(skb, params->queue_mapping);
+	}
 	if (params->flags & SKBEDIT_F_MARK) {
 		skb->mark &= ~params->mask;
 		skb->mark |= params->mark & params->mask;