
[net-next,v1,1/2] net: sched: use queue_mapping to pick tx queue

Message ID 20211206080512.36610-2-xiangxia.m.yue@gmail.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series net: sched: allow user to select txqueue | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 5995 this patch: 5995
netdev/cc_maintainers success CCed 8 of 8 maintainers
netdev/build_clang success Errors and warnings before: 1032 this patch: 1032
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 6148 this patch: 6148
netdev/checkpatch warning WARNING: usage of NR_CPUS is often wrong - consider using cpu_possible(), num_possible_cpus(), for_each_possible_cpu(), etc
netdev/kdoc fail Errors and warnings before: 0 this patch: 1
netdev/source_inline success Was 0 now: 0

Commit Message

Tonghao Zhang Dec. 6, 2021, 8:05 a.m. UTC
From: Tonghao Zhang <xiangxia.m.yue@gmail.com>

This patch fixes an issue:
* If we install tc filters with act_skbedit at the clsact hook,
  they do not work, because netdev_core_pick_tx() will overwrite
  the queue_mapping.

  $ tc filter add dev $NETDEV egress .. action skbedit queue_mapping 1

This patch is also useful:
* In a container networking environment, one kind of pod/container/
  net-namespace (e.g. P1, P2) whose outbound traffic is rate limited
  can use a dedicated tx queue with an HTB/TBF qdisc, while another
  kind of pod (e.g. Pn) uses a different tx queue with a FIFO qdisc.
  The lock contention of the HTB/TBF qdisc then does not affect Pn.

  +----+      +----+      +----+
  | P1 |      | P2 |      | Pn |
  +----+      +----+      +----+
    |           |           |
    +-----------+-----------+
                |
                | clsact/skbedit
                |    MQ
                v
    +-----------+-----------+
    | q0        | q1        | qn
    v           v           v
   HTB         HTB   ...   FIFO
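
The topology above could be configured roughly as follows; a minimal, hypothetical sketch ($NETDEV, the source address, rates, and queue numbers are illustrative assumptions, not part of this patch):

```shell
# Hypothetical sketch: mq root with per-queue child qdiscs, and a
# clsact egress filter steering one pod's traffic via skbedit.
NETDEV=eth0

# mq root: one class (1:1, 1:2, ...) per hardware tx queue.
tc qdisc add dev $NETDEV root handle 1: mq

# Rate-limited pods (P1, P2) get HTB on their queues; repeat for 1:2.
tc qdisc add dev $NETDEV parent 1:1 handle 10: htb default 1
tc class add dev $NETDEV parent 10: classid 10:1 htb rate 100mbit

# Unlimited pods (Pn) get a plain FIFO on their queue.
tc qdisc add dev $NETDEV parent 1:3 pfifo

# Steer P1's traffic (matched here by source IP, illustrative)
# to tx queue 0, i.e. mq class 1:1.
tc qdisc add dev $NETDEV clsact
tc filter add dev $NETDEV egress protocol ip flower src_ip 10.0.0.1 \
	action skbedit queue_mapping 0
```

Without this patch the skbedit queue_mapping above is overwritten by netdev_core_pick_tx(), which is the problem being fixed.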

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexander Lobakin <alobakin@pm.me>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Talal Ahmad <talalahmad@google.com>
Cc: Kevin Hao <haokexin@gmail.com>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Antoine Tenart <atenart@kernel.org>
Cc: Wei Wang <weiwan@google.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
---
 include/linux/skbuff.h  |  1 +
 net/core/dev.c          | 12 +++++++++---
 net/sched/act_skbedit.c |  4 +++-
 3 files changed, 13 insertions(+), 4 deletions(-)

Comments

Jakub Kicinski Dec. 6, 2021, 8:40 p.m. UTC | #1
On Mon,  6 Dec 2021 16:05:11 +0800 xiangxia.m.yue@gmail.com wrote:
>   +----+      +----+      +----+
>   | P1 |      | P2 |      | Pn |
>   +----+      +----+      +----+
>     |           |           |
>     +-----------+-----------+
>                 |
>                 | clsact/skbedit
>                 |    MQ
>                 v
>     +-----------+-----------+
>     | q0        | q1        | qn
>     v           v           v
>    HTB         HTB   ...   FIFO

The usual suggestion these days is to try to use FQ + EDT to
implement efficient policies. You don't need dedicated qdiscs, 
just modulate transmission time appropriately on egress of the
container.
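
A hedged sketch of the FQ + EDT pattern being suggested ($NETDEV is illustrative; the departure timestamps themselves would be set on egress of the container, e.g. by a sender using SO_TXTIME or a clsact BPF program writing skb->tstamp):

```shell
# fq honors the earliest-departure-time (EDT) timestamp carried on
# each skb, so per-container rate policies can be expressed as
# timestamps instead of dedicated HTB/TBF qdiscs.
tc qdisc replace dev $NETDEV root fq

# Pacing decisions are then made at container egress, e.g. by a BPF
# program attached here that advances skb->tstamp per packet.
tc qdisc add dev $NETDEV clsact
```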

In general recording the decision in the skb seems a little heavy
handed. We just need to carry the information from the egress hook
to the queue selection a few lines below. Or in fact maybe egress
hook shouldn't be used for this in the first place, and we need 
a more appropriate root qdisc than simple mq?

Not sure. What I am sure of is that you need to fix these warnings:

include/linux/skbuff.h:937: warning: Function parameter or member 'tc_skip_txqueue' not described in 'sk_buff'

ERROR: spaces required around that '=' (ctx:VxW)
#103: FILE: net/sched/act_skbedit.c:42:
+	queue_mapping= (queue_mapping & 0xff) + hash % mapping_mod;
 	             ^

;)
Tonghao Zhang Dec. 7, 2021, 2:10 a.m. UTC | #2
On Tue, Dec 7, 2021 at 4:40 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon,  6 Dec 2021 16:05:11 +0800 xiangxia.m.yue@gmail.com wrote:
> >   +----+      +----+      +----+
> >   | P1 |      | P2 |      | Pn |
> >   +----+      +----+      +----+
> >     |           |           |
> >     +-----------+-----------+
> >                 |
> >                 | clsact/skbedit
> >                 |    MQ
> >                 v
> >     +-----------+-----------+
> >     | q0        | q1        | qn
> >     v           v           v
> >    HTB         HTB   ...   FIFO
Hi Jakub, thanks for your comments.
> The usual suggestion these days is to try to use FQ + EDT to
> implement efficient policies. You don't need dedicated qdiscs,
> just modulate transmission time appropriately on egress of the
> container.
FQ + EDT is a good solution, but this patch targets a different scenario:
1. Containers whose outbound traffic is not rate limited want to use a
FIFO qdisc. If their traffic shared the FQ/HTB qdisc, the qdisc lock
would hurt performance and latency.
2. We can let users select a tx queue from a range A to B; the skb hash
or cgroup classid works well for load balancing across that range.
patch 2/2: https://patchwork.kernel.org/project/netdevbpf/patch/20211206080512.36610-3-xiangxia.m.yue@gmail.com/

> In general recording the decision in the skb seems a little heavy
> handed. We just need to carry the information from the egress hook
> to the queue selection a few lines below. Or in fact maybe egress
Yes, we could refactor netdev_core_pick_tx() to:
1. select the queue_index and call skb_set_queue_mapping(), but not
return the txq;
2. after the egress hook, use skb_get_queue_mapping()/netdev_get_tx_queue()
to look up the txq.
> hook shouldn't be used for this in the first place, and we need
> a more appropriate root qdisc than simple mq?
I am not sure about mq; I think clsact makes things more flexible,
and act_bpf can also change the skb queue_mapping, since queue_mapping
is exposed in __sk_buff.

> Not sure. What I am sure of is that you need to fix these warnings:
Ok
> include/linux/skbuff.h:937: warning: Function parameter or member 'tc_skip_txqueue' not described in 'sk_buff'
>
> ERROR: spaces required around that '=' (ctx:VxW)
> #103: FILE: net/sched/act_skbedit.c:42:
> +       queue_mapping= (queue_mapping & 0xff) + hash % mapping_mod;
>                      ^
>
> ;)
Jakub Kicinski Dec. 7, 2021, 2:33 a.m. UTC | #3
On Tue, 7 Dec 2021 10:10:22 +0800 Tonghao Zhang wrote:
> > In general recording the decision in the skb seems a little heavy
> > handed. We just need to carry the information from the egress hook
> > to the queue selection a few lines below. Or in fact maybe egress  
> Yes, we can refactor netdev_core_pick_tx to
> 1. select queue_index and invoke skb_set_queue_mapping, but don't
> return the txq.
> 2. after egress hook, use skb_get_queue_mapping/netdev_get_tx_queue to get txq.

I'm not sure that's what I meant, I meant the information you need to
store does not need to be stored in the skb, you can pass a pointer to
a stack variable to both egress handling and pick_tx.

> > hook shouldn't be used for this in the first place, and we need
> > a more appropriate root qdisc than simple mq?  
> I have no idea about mq, I think clsact may make the things more flexible.
> and act_bpf can also support to change sk queue_mapping. queue_mapping
> was included in __sk_buff.

Qdiscs can run a classifier to select a sub-queue. The advantage of
the classifier run by the Qdisc is that it runs after pick_tx.
Tonghao Zhang Dec. 7, 2021, 3:22 a.m. UTC | #4
On Tue, Dec 7, 2021 at 10:33 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 7 Dec 2021 10:10:22 +0800 Tonghao Zhang wrote:
> > > In general recording the decision in the skb seems a little heavy
> > > handed. We just need to carry the information from the egress hook
> > > to the queue selection a few lines below. Or in fact maybe egress
> > Yes, we can refactor netdev_core_pick_tx to
> > 1. select queue_index and invoke skb_set_queue_mapping, but don't
> > return the txq.
> > 2. after egress hook, use skb_get_queue_mapping/netdev_get_tx_queue to get txq.
>
> I'm not sure that's what I meant, I meant the information you need to
> store does not need to be stored in the skb, you can pass a pointer to
> a stack variable to both egress handling and pick_tx.
Thanks, I got it. I still think storing the txq index in
skb->queue_mapping is better, because in the egress hook
act_skbedit/act_bpf can change the skb queue_mapping, and pick_tx
can then depend on it.

> > > hook shouldn't be used for this in the first place, and we need
> > > a more appropriate root qdisc than simple mq?
> > I have no idea about mq, I think clsact may make the things more flexible.
> > and act_bpf can also support to change sk queue_mapping. queue_mapping
> > was included in __sk_buff.
>
> Qdiscs can run a classifier to select a sub-queue. The advantage of
> the classifier run by the Qdisc is that it runs after pick_tx.
Yes, but we should consider the qdisc lock too. The qdisc lock may hurt
performance and latency when a classifier runs inside the qdisc, whereas
clsact runs outside of the qdisc lock.
Jakub Kicinski Dec. 7, 2021, 3:44 p.m. UTC | #5
On Tue, 7 Dec 2021 11:22:28 +0800 Tonghao Zhang wrote:
> On Tue, Dec 7, 2021 at 10:33 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > On Tue, 7 Dec 2021 10:10:22 +0800 Tonghao Zhang wrote:  
> > > Yes, we can refactor netdev_core_pick_tx to
> > > 1. select queue_index and invoke skb_set_queue_mapping, but don't
> > > return the txq.
> > > 2. after egress hook, use skb_get_queue_mapping/netdev_get_tx_queue to get txq.  
> >
> > I'm not sure that's what I meant, I meant the information you need to
> > store does not need to be stored in the skb, you can pass a pointer to
> > a stack variable to both egress handling and pick_tx.  
> Thanks, I got it. I think we store the txq index in skb->queue_mapping
> better. because in egress hook,
> act_skbedit/act_bpf can change the skb queue_mapping. Then we can
> pick_tx depending on queue_mapping.

Actually Eric pointed out in another thread that xmit_more() is now
done via a per-CPU variable, you can try that instead of plumbing a
variable all the way into actions and back out to pick_tx().

Please make sure to include the analysis of the performance impact 
when the feature is _not_ used in the next version.

> > > I have no idea about mq, I think clsact may make the things more flexible.
> > > and act_bpf can also support to change sk queue_mapping. queue_mapping
> > > was included in __sk_buff.  
> >
> > Qdiscs can run a classifier to select a sub-queue. The advantage of
> > the classifier run by the Qdisc is that it runs after pick_tx.  
> Yes, we should consider the qdisc lock too. Qdisc lock may affect
> performance and latency when running a classifier in Qdisc
> and clsact is outside of qdisc.
Tonghao Zhang Dec. 8, 2021, 2:54 p.m. UTC | #6
On Tue, Dec 7, 2021 at 11:45 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 7 Dec 2021 11:22:28 +0800 Tonghao Zhang wrote:
> > On Tue, Dec 7, 2021 at 10:33 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > > On Tue, 7 Dec 2021 10:10:22 +0800 Tonghao Zhang wrote:
> > > > Yes, we can refactor netdev_core_pick_tx to
> > > > 1. select queue_index and invoke skb_set_queue_mapping, but don't
> > > > return the txq.
> > > > 2. after egress hook, use skb_get_queue_mapping/netdev_get_tx_queue to get txq.
> > >
> > > I'm not sure that's what I meant, I meant the information you need to
> > > store does not need to be stored in the skb, you can pass a pointer to
> > > a stack variable to both egress handling and pick_tx.
> > Thanks, I got it. I think we store the txq index in skb->queue_mapping
> > better. because in egress hook,
> > act_skbedit/act_bpf can change the skb queue_mapping. Then we can
> > pick_tx depending on queue_mapping.
>
> Actually Eric pointed out in another thread that xmit_more() is now
> done via a per-CPU variable, you can try that instead of plumbing a
> variable all the way into actions and back out to pick_tx().
Thanks Jakub and Eric. v2 has been sent; please review:
https://patchwork.kernel.org/project/netdevbpf/list/?series=592341

> Please make sure to include the analysis of the performance impact
> when the feature is _not_ used in the next version.
Ok, I updated the commit message. Thanks!

> > > > I have no idea about mq, I think clsact may make the things more flexible.
> > > > and act_bpf can also support to change sk queue_mapping. queue_mapping
> > > > was included in __sk_buff.
> > >
> > > Qdiscs can run a classifier to select a sub-queue. The advantage of
> > > the classifier run by the Qdisc is that it runs after pick_tx.
> > Yes, we should consider the qdisc lock too. Qdisc lock may affect
> > performance and latency when running a classifier in Qdisc
> > and clsact is outside of qdisc.

Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index eae4bd3237a4..b6ea4b920409 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -856,6 +856,7 @@  struct sk_buff {
 #endif
 #ifdef CONFIG_NET_CLS_ACT
 	__u8			tc_skip_classify:1;
+	__u8			tc_skip_txqueue:1;
 	__u8			tc_at_ingress:1;
 #endif
 	__u8			redirected:1;
diff --git a/net/core/dev.c b/net/core/dev.c
index aba8acc1238c..fb9d4eee29ee 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3975,10 +3975,16 @@  struct netdev_queue *netdev_core_pick_tx(struct net_device *dev,
 {
 	int queue_index = 0;
 
-#ifdef CONFIG_XPS
-	u32 sender_cpu = skb->sender_cpu - 1;
+#ifdef CONFIG_NET_CLS_ACT
+	if (skb->tc_skip_txqueue) {
+		queue_index = netdev_cap_txqueue(dev,
+						 skb_get_queue_mapping(skb));
+		return netdev_get_tx_queue(dev, queue_index);
+	}
+#endif
 
-	if (sender_cpu >= (u32)NR_CPUS)
+#ifdef CONFIG_XPS
+	if ((skb->sender_cpu - 1) >= (u32)NR_CPUS)
 		skb->sender_cpu = raw_smp_processor_id() + 1;
 #endif
 
diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c
index d30ecbfc8f84..940091a7c7f0 100644
--- a/net/sched/act_skbedit.c
+++ b/net/sched/act_skbedit.c
@@ -58,8 +58,10 @@  static int tcf_skbedit_act(struct sk_buff *skb, const struct tc_action *a,
 		}
 	}
 	if (params->flags & SKBEDIT_F_QUEUE_MAPPING &&
-	    skb->dev->real_num_tx_queues > params->queue_mapping)
+	    skb->dev->real_num_tx_queues > params->queue_mapping) {
+		skb->tc_skip_txqueue = 1;
 		skb_set_queue_mapping(skb, params->queue_mapping);
+	}
 	if (params->flags & SKBEDIT_F_MARK) {
 		skb->mark &= ~params->mask;
 		skb->mark |= params->mark & params->mask;