diff mbox series

[net-next] net: sched: move block device tracking into tcf_block_get/put_ext()

Message ID 20240104125844.1522062-1-jiri@resnulli.us (mailing list archive)
State Accepted
Commit 94e2557d086ad831027c54bc9c2130d337c72814
Delegated to: Netdev Maintainers
Headers show
Series [net-next] net: sched: move block device tracking into tcf_block_get/put_ext() | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1128 this patch: 1128
netdev/cc_maintainers success CCed 9 of 9 maintainers
netdev/build_clang success Errors and warnings before: 1141 this patch: 1141
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 1155 this patch: 1155
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 116 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Jiri Pirko Jan. 4, 2024, 12:58 p.m. UTC
From: Jiri Pirko <jiri@nvidia.com>

Inserting the device to block xarray in qdisc_create() is not suitable
place to do this. As it requires use of tcf_block() callback, it causes
multiple issues. It is called for all qdisc types, which is incorrect.

So, instead, move it to more suitable place, which is tcf_block_get_ext()
and make sure it is only done for qdiscs that use block infrastructure
and also only for blocks which are shared.

Symmetrically, alter the cleanup path, move the xarray entry removal
into tcf_block_put_ext().

Fixes: 913b47d3424e ("net/sched: Introduce tc block netdev tracking infra")
Reported-by: Ido Schimmel <idosch@nvidia.com>
Closes: https://lore.kernel.org/all/ZY1hBb8GFwycfgvd@shredder/
Reported-by: Kui-Feng Lee <sinquersw@gmail.com>
Closes: https://lore.kernel.org/all/ce8d3e55-b8bc-409c-ace9-5cf1c4f7c88e@gmail.com/
Reported-and-tested-by: syzbot+84339b9e7330daae4d66@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/0000000000007c85f5060dcc3a28@google.com/
Reported-and-tested-by: syzbot+806b0572c8d06b66b234@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/00000000000082f2f2060dcc3a92@google.com/
Reported-and-tested-by: syzbot+0039110f932d438130f9@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/0000000000007fbc8c060dcc3a5c@google.com/
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
---
 net/sched/cls_api.c     | 14 ++++++++++++++
 net/sched/sch_api.c     | 41 -----------------------------------------
 net/sched/sch_generic.c | 14 --------------
 3 files changed, 14 insertions(+), 55 deletions(-)

Comments

Ido Schimmel Jan. 4, 2024, 2:32 p.m. UTC | #1
On Thu, Jan 04, 2024 at 01:58:44PM +0100, Jiri Pirko wrote:
> From: Jiri Pirko <jiri@nvidia.com>
> 
> Inserting the device to block xarray in qdisc_create() is not suitable
> place to do this. As it requires use of tcf_block() callback, it causes
> multiple issues. It is called for all qdisc types, which is incorrect.
> 
> So, instead, move it to more suitable place, which is tcf_block_get_ext()
> and make sure it is only done for qdiscs that use block infrastructure
> and also only for blocks which are shared.
> 
> Symmetrically, alter the cleanup path, move the xarray entry removal
> into tcf_block_put_ext().
> 
> Fixes: 913b47d3424e ("net/sched: Introduce tc block netdev tracking infra")
> Reported-by: Ido Schimmel <idosch@nvidia.com>
> Closes: https://lore.kernel.org/all/ZY1hBb8GFwycfgvd@shredder/
> Reported-by: Kui-Feng Lee <sinquersw@gmail.com>
> Closes: https://lore.kernel.org/all/ce8d3e55-b8bc-409c-ace9-5cf1c4f7c88e@gmail.com/
> Reported-and-tested-by: syzbot+84339b9e7330daae4d66@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/all/0000000000007c85f5060dcc3a28@google.com/
> Reported-and-tested-by: syzbot+806b0572c8d06b66b234@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/all/00000000000082f2f2060dcc3a92@google.com/
> Reported-and-tested-by: syzbot+0039110f932d438130f9@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/all/0000000000007fbc8c060dcc3a5c@google.com/
> Signed-off-by: Jiri Pirko <jiri@nvidia.com>

Tested-by: Ido Schimmel <idosch@nvidia.com>
Jamal Hadi Salim Jan. 4, 2024, 4:10 p.m. UTC | #2
On Thu, Jan 4, 2024 at 7:58 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> From: Jiri Pirko <jiri@nvidia.com>
>
> Inserting the device to block xarray in qdisc_create() is not suitable
> place to do this. As it requires use of tcf_block() callback, it causes
> multiple issues. It is called for all qdisc types, which is incorrect.
>
> So, instead, move it to more suitable place, which is tcf_block_get_ext()
> and make sure it is only done for qdiscs that use block infrastructure
> and also only for blocks which are shared.
>
> Symmetrically, alter the cleanup path, move the xarray entry removal
> into tcf_block_put_ext().
>
> Fixes: 913b47d3424e ("net/sched: Introduce tc block netdev tracking infra")
> Reported-by: Ido Schimmel <idosch@nvidia.com>
> Closes: https://lore.kernel.org/all/ZY1hBb8GFwycfgvd@shredder/
> Reported-by: Kui-Feng Lee <sinquersw@gmail.com>
> Closes: https://lore.kernel.org/all/ce8d3e55-b8bc-409c-ace9-5cf1c4f7c88e@gmail.com/
> Reported-and-tested-by: syzbot+84339b9e7330daae4d66@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/all/0000000000007c85f5060dcc3a28@google.com/
> Reported-and-tested-by: syzbot+806b0572c8d06b66b234@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/all/00000000000082f2f2060dcc3a92@google.com/
> Reported-and-tested-by: syzbot+0039110f932d438130f9@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/all/0000000000007fbc8c060dcc3a5c@google.com/
> Signed-off-by: Jiri Pirko <jiri@nvidia.com>

Did you get a chance to run the tdc tests?

cheers,
jamal

> ---
>  net/sched/cls_api.c     | 14 ++++++++++++++
>  net/sched/sch_api.c     | 41 -----------------------------------------
>  net/sched/sch_generic.c | 14 --------------
>  3 files changed, 14 insertions(+), 55 deletions(-)
>
> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> index adf5de1ff773..253b26f2eddd 100644
> --- a/net/sched/cls_api.c
> +++ b/net/sched/cls_api.c
> @@ -1428,6 +1428,7 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
>                       struct tcf_block_ext_info *ei,
>                       struct netlink_ext_ack *extack)
>  {
> +       struct net_device *dev = qdisc_dev(q);
>         struct net *net = qdisc_net(q);
>         struct tcf_block *block = NULL;
>         int err;
> @@ -1461,9 +1462,18 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
>         if (err)
>                 goto err_block_offload_bind;
>
> +       if (tcf_block_shared(block)) {
> +               err = xa_insert(&block->ports, dev->ifindex, dev, GFP_KERNEL);
> +               if (err) {
> +                       NL_SET_ERR_MSG(extack, "block dev insert failed");
> +                       goto err_dev_insert;
> +               }
> +       }
> +
>         *p_block = block;
>         return 0;
>
> +err_dev_insert:
>  err_block_offload_bind:
>         tcf_chain0_head_change_cb_del(block, ei);
>  err_chain0_head_change_cb_add:
> @@ -1502,8 +1512,12 @@ EXPORT_SYMBOL(tcf_block_get);
>  void tcf_block_put_ext(struct tcf_block *block, struct Qdisc *q,
>                        struct tcf_block_ext_info *ei)
>  {
> +       struct net_device *dev = qdisc_dev(q);
> +
>         if (!block)
>                 return;
> +       if (tcf_block_shared(block))
> +               xa_erase(&block->ports, dev->ifindex);
>         tcf_chain0_head_change_cb_del(block, ei);
>         tcf_block_owner_del(block, q, ei->binder_type);
>
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index 2a2a48838eb9..36b025cc4fd2 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1209,43 +1209,6 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>         return 0;
>  }
>
> -static int qdisc_block_add_dev(struct Qdisc *sch, struct net_device *dev,
> -                              struct netlink_ext_ack *extack)
> -{
> -       const struct Qdisc_class_ops *cl_ops = sch->ops->cl_ops;
> -       struct tcf_block *block;
> -       int err;
> -
> -       block = cl_ops->tcf_block(sch, TC_H_MIN_INGRESS, NULL);
> -       if (block) {
> -               err = xa_insert(&block->ports, dev->ifindex, dev, GFP_KERNEL);
> -               if (err) {
> -                       NL_SET_ERR_MSG(extack,
> -                                      "ingress block dev insert failed");
> -                       return err;
> -               }
> -       }
> -
> -       block = cl_ops->tcf_block(sch, TC_H_MIN_EGRESS, NULL);
> -       if (block) {
> -               err = xa_insert(&block->ports, dev->ifindex, dev, GFP_KERNEL);
> -               if (err) {
> -                       NL_SET_ERR_MSG(extack,
> -                                      "Egress block dev insert failed");
> -                       goto err_out;
> -               }
> -       }
> -
> -       return 0;
> -
> -err_out:
> -       block = cl_ops->tcf_block(sch, TC_H_MIN_INGRESS, NULL);
> -       if (block)
> -               xa_erase(&block->ports, dev->ifindex);
> -
> -       return err;
> -}
> -
>  static int qdisc_block_indexes_set(struct Qdisc *sch, struct nlattr **tca,
>                                    struct netlink_ext_ack *extack)
>  {
> @@ -1416,10 +1379,6 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
>         qdisc_hash_add(sch, false);
>         trace_qdisc_create(ops, dev, parent);
>
> -       err = qdisc_block_add_dev(sch, dev, extack);
> -       if (err)
> -               goto err_out4;
> -
>         return sch;
>
>  err_out4:
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index e33568df97a5..9b3e9262040b 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -1052,8 +1052,6 @@ static void __qdisc_destroy(struct Qdisc *qdisc)
>  {
>         const struct Qdisc_ops  *ops = qdisc->ops;
>         struct net_device *dev = qdisc_dev(qdisc);
> -       const struct Qdisc_class_ops *cops;
> -       struct tcf_block *block;
>
>  #ifdef CONFIG_NET_SCHED
>         qdisc_hash_del(qdisc);
> @@ -1064,18 +1062,6 @@ static void __qdisc_destroy(struct Qdisc *qdisc)
>
>         qdisc_reset(qdisc);
>
> -       cops = ops->cl_ops;
> -       if (ops->ingress_block_get) {
> -               block = cops->tcf_block(qdisc, TC_H_MIN_INGRESS, NULL);
> -               if (block)
> -                       xa_erase(&block->ports, dev->ifindex);
> -       }
> -
> -       if (ops->egress_block_get) {
> -               block = cops->tcf_block(qdisc, TC_H_MIN_EGRESS, NULL);
> -               if (block)
> -                       xa_erase(&block->ports, dev->ifindex);
> -       }
>
>         if (ops->destroy)
>                 ops->destroy(qdisc);
> --
> 2.43.0
>
Jiri Pirko Jan. 4, 2024, 6:03 p.m. UTC | #3
Thu, Jan 04, 2024 at 05:10:58PM CET, jhs@mojatatu.com wrote:
>On Thu, Jan 4, 2024 at 7:58 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>
>> From: Jiri Pirko <jiri@nvidia.com>
>>
>> Inserting the device to block xarray in qdisc_create() is not suitable
>> place to do this. As it requires use of tcf_block() callback, it causes
>> multiple issues. It is called for all qdisc types, which is incorrect.
>>
>> So, instead, move it to more suitable place, which is tcf_block_get_ext()
>> and make sure it is only done for qdiscs that use block infrastructure
>> and also only for blocks which are shared.
>>
>> Symmetrically, alter the cleanup path, move the xarray entry removal
>> into tcf_block_put_ext().
>>
>> Fixes: 913b47d3424e ("net/sched: Introduce tc block netdev tracking infra")
>> Reported-by: Ido Schimmel <idosch@nvidia.com>
>> Closes: https://lore.kernel.org/all/ZY1hBb8GFwycfgvd@shredder/
>> Reported-by: Kui-Feng Lee <sinquersw@gmail.com>
>> Closes: https://lore.kernel.org/all/ce8d3e55-b8bc-409c-ace9-5cf1c4f7c88e@gmail.com/
>> Reported-and-tested-by: syzbot+84339b9e7330daae4d66@syzkaller.appspotmail.com
>> Closes: https://lore.kernel.org/all/0000000000007c85f5060dcc3a28@google.com/
>> Reported-and-tested-by: syzbot+806b0572c8d06b66b234@syzkaller.appspotmail.com
>> Closes: https://lore.kernel.org/all/00000000000082f2f2060dcc3a92@google.com/
>> Reported-and-tested-by: syzbot+0039110f932d438130f9@syzkaller.appspotmail.com
>> Closes: https://lore.kernel.org/all/0000000000007fbc8c060dcc3a5c@google.com/
>> Signed-off-by: Jiri Pirko <jiri@nvidia.com>
>
>Did you get a chance to run the tdc tests?

I ran the TC ones we have in the net/forwarding directory.
I didn't manage to run the tdc. Readme didn't help me much.
How do you run the suite?
Jamal Hadi Salim Jan. 4, 2024, 6:22 p.m. UTC | #4
On Thu, Jan 4, 2024 at 1:03 PM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Thu, Jan 04, 2024 at 05:10:58PM CET, jhs@mojatatu.com wrote:
> >On Thu, Jan 4, 2024 at 7:58 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >>
> >> From: Jiri Pirko <jiri@nvidia.com>
> >>
> >> Inserting the device to block xarray in qdisc_create() is not suitable
> >> place to do this. As it requires use of tcf_block() callback, it causes
> >> multiple issues. It is called for all qdisc types, which is incorrect.
> >>
> >> So, instead, move it to more suitable place, which is tcf_block_get_ext()
> >> and make sure it is only done for qdiscs that use block infrastructure
> >> and also only for blocks which are shared.
> >>
> >> Symmetrically, alter the cleanup path, move the xarray entry removal
> >> into tcf_block_put_ext().
> >>
> >> Fixes: 913b47d3424e ("net/sched: Introduce tc block netdev tracking infra")
> >> Reported-by: Ido Schimmel <idosch@nvidia.com>
> >> Closes: https://lore.kernel.org/all/ZY1hBb8GFwycfgvd@shredder/
> >> Reported-by: Kui-Feng Lee <sinquersw@gmail.com>
> >> Closes: https://lore.kernel.org/all/ce8d3e55-b8bc-409c-ace9-5cf1c4f7c88e@gmail.com/
> >> Reported-and-tested-by: syzbot+84339b9e7330daae4d66@syzkaller.appspotmail.com
> >> Closes: https://lore.kernel.org/all/0000000000007c85f5060dcc3a28@google.com/
> >> Reported-and-tested-by: syzbot+806b0572c8d06b66b234@syzkaller.appspotmail.com
> >> Closes: https://lore.kernel.org/all/00000000000082f2f2060dcc3a92@google.com/
> >> Reported-and-tested-by: syzbot+0039110f932d438130f9@syzkaller.appspotmail.com
> >> Closes: https://lore.kernel.org/all/0000000000007fbc8c060dcc3a5c@google.com/
> >> Signed-off-by: Jiri Pirko <jiri@nvidia.com>
> >
> >Did you get a chance to run the tdc tests?
>
> I ran the TC ones we have in the net/forwarding directory.
> I didn't manage to run the tdc. Readme didn't help me much.
> How do you run the suite?

For next time:
make -C tools/testing/selftests TARGETS=tc-testing run_tests

We'll let you off the hook this time. We'll do the rest of the testing.

cheers,
jamal
Victor Nogueira Jan. 4, 2024, 7:26 p.m. UTC | #5
On 04/01/2024 09:58, Jiri Pirko wrote:
> From: Jiri Pirko <jiri@nvidia.com>
> 
> Inserting the device to block xarray in qdisc_create() is not suitable
> place to do this. As it requires use of tcf_block() callback, it causes
> multiple issues. It is called for all qdisc types, which is incorrect.
> 
> So, instead, move it to more suitable place, which is tcf_block_get_ext()
> and make sure it is only done for qdiscs that use block infrastructure
> and also only for blocks which are shared.
> 
> Symmetrically, alter the cleanup path, move the xarray entry removal
> into tcf_block_put_ext().
> 
> Fixes: 913b47d3424e ("net/sched: Introduce tc block netdev tracking infra")
> Reported-by: Ido Schimmel <idosch@nvidia.com>
> Closes: https://lore.kernel.org/all/ZY1hBb8GFwycfgvd@shredder/
> Reported-by: Kui-Feng Lee <sinquersw@gmail.com>
> Closes: https://lore.kernel.org/all/ce8d3e55-b8bc-409c-ace9-5cf1c4f7c88e@gmail.com/
> Reported-and-tested-by: syzbot+84339b9e7330daae4d66@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/all/0000000000007c85f5060dcc3a28@google.com/
> Reported-and-tested-by: syzbot+806b0572c8d06b66b234@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/all/00000000000082f2f2060dcc3a92@google.com/
> Reported-and-tested-by: syzbot+0039110f932d438130f9@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/all/0000000000007fbc8c060dcc3a5c@google.com/
> Signed-off-by: Jiri Pirko <jiri@nvidia.com>

Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Tested-by: Victor Nogueira <victor@mojatatu.com>
Jamal Hadi Salim Jan. 4, 2024, 7:36 p.m. UTC | #6
On Thu, Jan 4, 2024 at 2:27 PM Victor Nogueira <victor@mojatatu.com> wrote:
>
> On 04/01/2024 09:58, Jiri Pirko wrote:
> > From: Jiri Pirko <jiri@nvidia.com>
> >
> > Inserting the device to block xarray in qdisc_create() is not suitable
> > place to do this. As it requires use of tcf_block() callback, it causes
> > multiple issues. It is called for all qdisc types, which is incorrect.
> >
> > So, instead, move it to more suitable place, which is tcf_block_get_ext()
> > and make sure it is only done for qdiscs that use block infrastructure
> > and also only for blocks which are shared.
> >
> > Symmetrically, alter the cleanup path, move the xarray entry removal
> > into tcf_block_put_ext().
> >
> > Fixes: 913b47d3424e ("net/sched: Introduce tc block netdev tracking infra")
> > Reported-by: Ido Schimmel <idosch@nvidia.com>
> > Closes: https://lore.kernel.org/all/ZY1hBb8GFwycfgvd@shredder/
> > Reported-by: Kui-Feng Lee <sinquersw@gmail.com>
> > Closes: https://lore.kernel.org/all/ce8d3e55-b8bc-409c-ace9-5cf1c4f7c88e@gmail.com/
> > Reported-and-tested-by: syzbot+84339b9e7330daae4d66@syzkaller.appspotmail.com
> > Closes: https://lore.kernel.org/all/0000000000007c85f5060dcc3a28@google.com/
> > Reported-and-tested-by: syzbot+806b0572c8d06b66b234@syzkaller.appspotmail.com
> > Closes: https://lore.kernel.org/all/00000000000082f2f2060dcc3a92@google.com/
> > Reported-and-tested-by: syzbot+0039110f932d438130f9@syzkaller.appspotmail.com
> > Closes: https://lore.kernel.org/all/0000000000007fbc8c060dcc3a5c@google.com/
> > Signed-off-by: Jiri Pirko <jiri@nvidia.com>
>
> Reviewed-by: Victor Nogueira <victor@mojatatu.com>
> Tested-by: Victor Nogueira <victor@mojatatu.com>

Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

cheers,
jamal
patchwork-bot+netdevbpf@kernel.org Jan. 5, 2024, 11:20 a.m. UTC | #7
Hello:

This patch was applied to netdev/net-next.git (main)
by David S. Miller <davem@davemloft.net>:

On Thu,  4 Jan 2024 13:58:44 +0100 you wrote:
> From: Jiri Pirko <jiri@nvidia.com>
> 
> Inserting the device to block xarray in qdisc_create() is not suitable
> place to do this. As it requires use of tcf_block() callback, it causes
> multiple issues. It is called for all qdisc types, which is incorrect.
> 
> So, instead, move it to more suitable place, which is tcf_block_get_ext()
> and make sure it is only done for qdiscs that use block infrastructure
> and also only for blocks which are shared.
> 
> [...]

Here is the summary with links:
  - [net-next] net: sched: move block device tracking into tcf_block_get/put_ext()
    https://git.kernel.org/netdev/net-next/c/94e2557d086a

You are awesome, thank you!
Jiri Pirko Jan. 5, 2024, 11:24 a.m. UTC | #8
Thu, Jan 04, 2024 at 07:22:48PM CET, jhs@mojatatu.com wrote:
>On Thu, Jan 4, 2024 at 1:03 PM Jiri Pirko <jiri@resnulli.us> wrote:
>>
>> Thu, Jan 04, 2024 at 05:10:58PM CET, jhs@mojatatu.com wrote:
>> >On Thu, Jan 4, 2024 at 7:58 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >>
>> >> From: Jiri Pirko <jiri@nvidia.com>
>> >>
>> >> Inserting the device to block xarray in qdisc_create() is not suitable
>> >> place to do this. As it requires use of tcf_block() callback, it causes
>> >> multiple issues. It is called for all qdisc types, which is incorrect.
>> >>
>> >> So, instead, move it to more suitable place, which is tcf_block_get_ext()
>> >> and make sure it is only done for qdiscs that use block infrastructure
>> >> and also only for blocks which are shared.
>> >>
>> >> Symmetrically, alter the cleanup path, move the xarray entry removal
>> >> into tcf_block_put_ext().
>> >>
>> >> Fixes: 913b47d3424e ("net/sched: Introduce tc block netdev tracking infra")
>> >> Reported-by: Ido Schimmel <idosch@nvidia.com>
>> >> Closes: https://lore.kernel.org/all/ZY1hBb8GFwycfgvd@shredder/
>> >> Reported-by: Kui-Feng Lee <sinquersw@gmail.com>
>> >> Closes: https://lore.kernel.org/all/ce8d3e55-b8bc-409c-ace9-5cf1c4f7c88e@gmail.com/
>> >> Reported-and-tested-by: syzbot+84339b9e7330daae4d66@syzkaller.appspotmail.com
>> >> Closes: https://lore.kernel.org/all/0000000000007c85f5060dcc3a28@google.com/
>> >> Reported-and-tested-by: syzbot+806b0572c8d06b66b234@syzkaller.appspotmail.com
>> >> Closes: https://lore.kernel.org/all/00000000000082f2f2060dcc3a92@google.com/
>> >> Reported-and-tested-by: syzbot+0039110f932d438130f9@syzkaller.appspotmail.com
>> >> Closes: https://lore.kernel.org/all/0000000000007fbc8c060dcc3a5c@google.com/
>> >> Signed-off-by: Jiri Pirko <jiri@nvidia.com>
>> >
>> >Did you get a chance to run the tdc tests?
>>
>> I ran the TC ones we have in the net/forwarding directory.
>> I didn't manage to run the tdc. Readme didn't help me much.
>> How do you run the suite?
>
>For next time:
>make -C tools/testing/selftests TARGETS=tc-testing run_tests

Unrelated to this patch.

Running this, I'm getting lots of errors, some seem might be bugs in
tests. Here's the output:

make: Entering directory '/mnt/share156/jiri/net-next/tools/testing/selftests'
make[1]: Entering directory '/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing'
make[1]: Entering directory '/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing'
TAP version 13
1..1
# timeout set to 900
# selftests: tc-testing: tdc.sh
# netdevsim
# act_bpf
# act_connmark
# act_csum
# act_ct
# act_ctinfo
# act_gact
# act_gate
# act_mirred
# act_mpls
# act_nat
# act_pedit
# act_police
# act_sample
# act_simple
# act_skbedit
# act_skbmod
# act_tunnel_key
# act_vlan
# cls_basic
# cls_bpf
# cls_cgroup
# cls_flow
# cls_flower
# cls_fw
# cls_matchall
# cls_route
# cls_u32
# Module em_canid not found... skipping.
# em_cmp
# em_ipset
# em_ipt
# em_meta
# em_nbyte
# em_text
# em_u32
# sch_cake
# sch_cbs
# sch_choke
# sch_codel
# sch_drr
# sch_etf
# sch_ets
# sch_fq
# sch_fq_codel
# sch_fq_pie
# sch_gred
# sch_hfsc
# sch_hhf
# sch_htb
# sch_teql
# considering category actions
# !!! Consider installing pyroute2 !!!
#  -- ns/SubPlugin.__init__
#  -- scapy/SubPlugin.__init__
# Executing 528 tests in parallel and 15 in serial
# Using 18 batches and 4 workers
# 
# -----> prepare stage *** Could not execute: "$TC actions add action pass index 1"
# 
# -----> prepare stage *** Error message: "setting the network namespace "tcut-3101919570" failed: Invalid argument
# "
# 
# -----> prepare stage *** Aborting test run.
# 
# 
# <_io.BufferedReader name=6> *** stdout ***
# 
# 
# <_io.BufferedReader name=19> *** stderr ***
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 536, in test_runner
#     res = run_one_test(pm, args, index, tidx)
#           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 420, in run_one_test
#     prepare_env(tidx, args, pm, 'setup', "-----> prepare stage", tidx["setup"])
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 268, in prepare_env
#     raise PluginMgrTestFail(
# RTNETLINK answers: File exists
# Command failed -:20
# 
# -----> teardown stage *** Could not execute: "$TC action flush action ctinfo"
# 
# -----> teardown stage *** Error message: "setting the network namespace "tcut-518092810" failed: Invalid argument
# "
# 
# -----> teardown stage *** Aborting test run.
# 
# 
# <_io.BufferedReader name=6> *** stdout ***
# 
# 
# <_io.BufferedReader name=21> *** stderr ***
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 536, in test_runner
#     res = run_one_test(pm, args, index, tidx)
#           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 465, in run_one_test
#     prepare_env(tidx, args, pm, 'teardown', '-----> teardown stage', tidx['teardown'], procout)
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 268, in prepare_env
#     raise PluginMgrTestFail(
# Error: argument "v0p0id8295idbf47" is wrong: "name" not a valid ifname
# 
# -----> prepare stage *** Could not execute: "$TC actions add action simple sdata "Rock""
# 
# -----> prepare stage *** Error message: "setting the network namespace "tcut-1996746596" failed: Invalid argument
# "
# 
# -----> prepare stage *** Aborting test run.
# 
# 
# <_io.BufferedReader name=6> *** stdout ***
# 
# 
# <_io.BufferedReader name=15> *** stderr ***
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 536, in test_runner
#     res = run_one_test(pm, args, index, tidx)
#           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 420, in run_one_test
#     prepare_env(tidx, args, pm, 'setup', "-----> prepare stage", tidx["setup"])
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 268, in prepare_env
#     raise PluginMgrTestFail(
# ..RTNETLINK answers: File exists
# Command failed -:425
# Test 1c3a: Flush police actions
# Test 7326: Add police action with control continue
# Test 34fa: Add police action with control drop
# Test 8dd5: Add police action with control ok
# Test b9d1: Add police action with control reclassify
# Test c534: Add police action with control pipe
# Test b48b: Add police action with exceed goto chain control action
# Test 689e: Replace police action with invalid goto chain control
# Test cdd7: Add valid police action with packets per second rate limit
# Test f5bc: Add invalid police action with both bps and pps
# Test 7d64: Add police action with skip_hw option
# Test 7d50: Add skbmod action to set destination mac
# Test 9b29: Add skbmod action to set source mac
# Test 1724: Add skbmod action with invalid mac
# Test 3cf1: Add skbmod action with valid etype
# Test a749: Add skbmod action with invalid etype
# Test bfe6: Add skbmod action to swap mac
# Test 839b: Add skbmod action with control pipe
# Test c167: Add skbmod action with control reclassify
# Test 0c2f: Add skbmod action with control drop
# Test d113: Add skbmod action with control continue
# Test 7242: Add skbmod action with control pass
# Test 6046: Add skbmod action with control reclassify and cookie
# Test 58cb: List skbmod actions
# Test 9aa8: Get a single skbmod action from a list
# Test e93a: Delete an skbmod action
# Test 40c2: Flush skbmod actions
# Test b651: Replace skbmod action with invalid goto_chain control
# Test fe09: Add skbmod action to mark ECN bits
# Test 696a: Add simple ct action
# Test e38c: Add simple ct action with cookie
# Test 9f20: Add ct clear action
# Test 0bc1: Add ct clear action with cookie of max length
# Test 5bea: Try ct with zone
# Test d5d6: Try ct with zone, commit
# Test 029f: Try ct with zone, commit, mark
# Test a58d: Try ct with zone, commit, mark, nat
# Test 901b: Try ct with full nat ipv4 range syntax
# Test 072b: Try ct with full nat ipv6 syntax
# Test 3420: Try ct with full nat ipv6 range syntax
# Test 4470: Try ct with full nat ipv6 range syntax + force
# Test 5d88: Try ct with label
# Test 04d4: Try ct with label with mask
# Test 9751: Try ct with mark + mask
# Test 2faa: Try ct with mark + mask and cookie
# Test 3991: Add simple ct action with no_percpu flag
# Test 3992: Add ct action triggering DNAT tuple conflict
# 
# Sent 1 packets.
# 
# Sent 1 packets.
# Test 2029: Add xt action with log-prefix
# exception iproute2 exited with an error code in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
# Test ac2a: List actions
# Test 3edf: Flush gact actions
# Test 63ec: Delete pass action
# returncode 255; expected [0]
# "-----> prepare stage" did not complete successfully
# Exception <class '__main__.PluginMgrTestFail'> ('setup', None, '"-----> prepare stage" did not complete successfully') (caught in test_runner, running test 4 63ec Delete pass action stage setup)
# ---------------
# traceback
# ---------------
# ---------------
# Test 68e2: Add tunnel_key unset action
# exception iproute2 exited with an error code in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
# Test fa1c: Add mpls dec_ttl action with ttl (invalid)
# exception [Errno 32] Broken pipe in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
# Test a5a7: Add pedit action with LAYERED_OP eth set src
# exception [Errno 32] Broken pipe in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
# Test 5a31: Add vlan modify action for protocol 802.1AD
# exception [Errno 32] Broken pipe in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
# Test 8eb5: Create valid ife encode action with tcindex and continue control
# exception [Errno 32] Broken pipe in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
# Test a874: Add invalid sample action without mandatory arguments
# exception [Errno 32] Broken pipe in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
# Test d959: Add cBPF action with valid bytecode
# Test f84a: Add cBPF action with invalid bytecode
# Test e939: Add eBPF action with valid object-file
# Test 282d: Add eBPF action with invalid object-file
# Test d819: Replace cBPF bytecode and action control
# Test 6ae3: Delete cBPF action 
# Test 3e0d: List cBPF actions
# Test 55ce: Flush BPF actions
# Test ccc3: Add cBPF action with duplicate index
# Test 89c7: Add cBPF action with invalid index
# Test 7ab9: Add cBPF action with cookie
# Test b8a1: Replace bpf action with invalid goto_chain control
# Test 2893: Add pedit action with RAW_OP offset u8 quad
# Test 6795: Add pedit action with LAYERED_OP ip6 set payload_len, nexthdr, hoplimit
# Test cfcc: Add pedit action with LAYERED_OP tcp flags set
# Test b078: Add simple action
# Test 4297: Add simple action with change command
# Test 6d4c: Add simple action with duplicate index
# Test 2542: List simple actions
# returncode 255; expected [0]
# "-----> prepare stage" did not complete successfully
# Exception <class '__main__.PluginMgrTestFail'> ('setup', None, '"-----> prepare stage" did not complete successfully') (caught in test_runner, running test 5 2542 List simple actions stage setup)
# ---------------
# traceback
# ---------------
# ---------------
# Test 5232: List ctinfo actions
# Test 7702: Flush ctinfo actions
# Test 3201: Add ctinfo action with duplicate index
# Test 8295: Add ctinfo action with invalid index
# returncode 255; expected [0]
# "-----> teardown stage" did not complete successfully
# Exception <class '__main__.PluginMgrTestFail'> ('teardown', 'setting the network namespace "tcut-518092810" failed: Invalid argument\n', '"-----> teardown stage" did not complete successfully') (caught in test_runner, running test 5 8295 Add ctinfo action with invalid index stage teardown)
# ---------------
# traceback
# ---------------
# accumulated output for this test:
# setting the network namespace "tcut-518092810" failed: Invalid argument
# 
# ---------------
# Test bf47: Add csum icmp action
# exception iproute2 exited with an error code in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
# multiprocessing.pool.RemoteTraceback: 
# """
# Traceback (most recent call last):
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 142, in call_pre_case
#     pgn_inst.pre_case(caseinfo, test_skip)
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/plugin-lib/nsPlugin.py", line 63, in pre_case
#     self.prepare_test(test)
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/plugin-lib/nsPlugin.py", line 45, in prepare_test
#     self._proc_check()
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/plugin-lib/nsPlugin.py", line 219, in _proc_check
#     raise RuntimeError("iproute2 exited with an error code")
# RuntimeError: iproute2 exited with an error code
# 
# During handling of the above exception, another exception occurred:
# 
# Traceback (most recent call last):
#   File "/usr/lib64/python3.11/multiprocessing/pool.py", line 125, in worker
#     result = (True, func(*args, **kwds))
#                     ^^^^^^^^^^^^^^^^^^^
#   File "/usr/lib64/python3.11/multiprocessing/pool.py", line 48, in mapstar
#     return list(map(*args))
#            ^^^^^^^^^^^^^^^^
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 602, in __mp_runner
#     (_, tsr) = test_runner(mp_pm, mp_args, tests)
#                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 536, in test_runner
#     res = run_one_test(pm, args, index, tidx)
#           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 419, in run_one_test
#     pm.call_pre_case(tidx)
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 146, in call_pre_case
#     print('test_ordinal is {}'.format(test_ordinal))
#                                       ^^^^^^^^^^^^
# NameError: name 'test_ordinal' is not defined
# """
# 
# The above exception was the direct cause of the following exception:
# 
# Traceback (most recent call last):
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 1028, in <module>
#     main()
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 1022, in main
#     set_operation_mode(pm, parser, args, remaining)
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 964, in set_operation_mode
#     catresults = test_runner_mp(pm, args, alltests)
#                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 624, in test_runner_mp
#     pres = p.map(__mp_runner, batches)
#            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/usr/lib64/python3.11/multiprocessing/pool.py", line 367, in map
#     return self._map_async(func, iterable, mapstar, chunksize).get()
#            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/usr/lib64/python3.11/multiprocessing/pool.py", line 774, in get
#     raise self._value
# NameError: name 'test_ordinal' is not defined
# WARNING: Interface v0p0id7326 does not exist!
# WARNING: Interface v0p0id34fa does not exist!
# WARNING: more Interface v0p0id8dd5 does not exist!
# considering category qdisc
# !!! Consider installing pyroute2 !!!
#  -- ns/SubPlugin.__init__
#  -- scapy/SubPlugin.__init__
# Executing 294 tests in parallel and 33 in serial
# Using 11 batches and 4 workers
# 
# -----> prepare stage *** Could not execute: "echo "1 1 8" > /sys/bus/netdevsim/new_device"
# 
# -----> prepare stage *** Error message: "setting the network namespace "tcut-2202500200" failed: Invalid argument
# "
# 
# -----> prepare stage *** Aborting test run.
# 
# 
# <_io.BufferedReader name=6> *** stdout ***
# 
# 
# <_io.BufferedReader name=15> *** stderr ***
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 536, in test_runner
#     res = run_one_test(pm, args, index, tidx)
#           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 420, in run_one_test
#     prepare_env(tidx, args, pm, 'setup', "-----> prepare stage", tidx["setup"])
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 268, in prepare_env
#     raise PluginMgrTestFail(
# RTNETLINK answers: File exists
# Command failed -:2
# 
# -----> teardown stage *** Could not execute: "$TC qdisc del dev $DUMMY handle 1: root"
# 
# -----> teardown stage *** Error message: "Error: Invalid handle.
# "
# 
# -----> teardown stage *** Aborting test run.
# 
# 
# <_io.BufferedReader name=6> *** stdout ***
# 
# 
# <_io.BufferedReader name=21> *** stderr ***
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 536, in test_runner
#     res = run_one_test(pm, args, index, tidx)
#           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 465, in run_one_test
#     prepare_env(tidx, args, pm, 'teardown', '-----> teardown stage', tidx['teardown'], procout)
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 268, in prepare_env
#     raise PluginMgrTestFail(
# 
# -----> teardown stage *** Could not execute: "$TC qdisc del dev $DUMMY handle 1: root bfifo"
# 
# -----> teardown stage *** Error message: "Cannot find device "dummy1ida519"
# "
# 
# -----> teardown stage *** Aborting test run.
# 
# 
# <_io.BufferedReader name=6> *** stdout ***
# 
# 
# <_io.BufferedReader name=17> *** stderr ***
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 536, in test_runner
#     res = run_one_test(pm, args, index, tidx)
#           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 465, in run_one_test
#     prepare_env(tidx, args, pm, 'teardown', '-----> teardown stage', tidx['teardown'], procout)
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 268, in prepare_env
#     raise PluginMgrTestFail(
# Error: argument "v0p0id20baid2958" is wrong: "name" not a valid ifname
# 
# -----> teardown stage *** Could not execute: "$TC qdisc del dev $DUMMY handle 1: root"
# 
# -----> teardown stage *** Error message: "setting the network namespace "tcut-2202500200-3305288203" failed: Invalid argument
# "
# 
# -----> teardown stage *** Aborting test run.
# 
# 
# <_io.BufferedReader name=6> *** stdout ***
# 
# 
# <_io.BufferedReader name=15> *** stderr ***
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 536, in test_runner
#     res = run_one_test(pm, args, index, tidx)
#           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 465, in run_one_test
#     prepare_env(tidx, args, pm, 'teardown', '-----> teardown stage', tidx['teardown'], procout)
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 268, in prepare_env
#     raise PluginMgrTestFail(
# 
# -----> prepare stage *** Could not execute: "$TC qdisc add dev $DUMMY handle 1: root choke limit 1000 bandwidth 10000"
# 
# -----> prepare stage *** Error message: "setting the network namespace "tcut-57801789-2243486144" failed: Invalid argument
# "
# 
# -----> prepare stage *** Aborting test run.
# 
# 
# <_io.BufferedReader name=6> *** stdout ***
# 
# 
# <_io.BufferedReader name=21> *** stderr ***
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 536, in test_runner
#     res = run_one_test(pm, args, index, tidx)
#           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 420, in run_one_test
#     prepare_env(tidx, args, pm, 'setup', "-----> prepare stage", tidx["setup"])
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 268, in prepare_env
#     raise PluginMgrTestFail(
# Error: argument "v0p0id30c4id0301" is wrong: "name" not a valid ifname
# RTNETLINK answers: File exists
# Command failed -:38
# Test a519: Add bfifo qdisc with system default parameters on egress
# exit: 1
# exit: 0
# Cannot find device "dummy1ida519"
# 
# returncode 1; expected [0]
# "-----> teardown stage" did not complete successfully
# Exception <class '__main__.PluginMgrTestFail'> ('teardown', 'Cannot find device "dummy1ida519"\n', '"-----> teardown stage" did not complete successfully') (caught in test_runner, running test 2 a519 Add bfifo qdisc with system default parameters on egress stage teardown)
# ---------------
# traceback
# ---------------
# accumulated output for this test:
# Cannot find device "dummy1ida519"
# 
# ---------------
# Test 2385: Create CAKE with besteffort flag
# exception [Errno 32] Broken pipe in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
# Test 0521: Show ingress class
# exception [Errno 32] Broken pipe in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
# Test b190: Create FQ_CODEL with noecn flag
# exception [Errno 32] Broken pipe in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
# Test 5439: Create NETEM with multiple slot setting
# exception [Errno 32] Broken pipe in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
# Test 30a9: Create SFB with increment setting
# exception [Errno 32] Broken pipe in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
# Test 20ba: Add multiq Qdisc to multi-queue device (8 queues)
# returncode 255; expected [0]
# "-----> prepare stage" did not complete successfully
# Exception <class '__main__.PluginMgrTestFail'> ('setup', None, '"-----> prepare stage" did not complete successfully') (caught in test_runner, running test 2 20ba Add multiq Qdisc to multi-queue device (8 queues) stage setup)
# ---------------
# traceback
# ---------------
# ---------------
# Test 2958: Show skbprio class
# exit: 255
# exit: 0
# setting the network namespace "tcut-2202500200-3305288203" failed: Invalid argument
# 
# returncode 255; expected [0]
# "-----> teardown stage" did not complete successfully
# Exception <class '__main__.PluginMgrTestFail'> ('teardown', 'setting the network namespace "tcut-2202500200-3305288203" failed: Invalid argument\n', '"-----> teardown stage" did not complete successfully') (caught in test_runner, running test 2 2958 Show skbprio class stage teardown)
# ---------------
# traceback
# ---------------
# accumulated output for this test:
# setting the network namespace "tcut-2202500200-3305288203" failed: Invalid argument
# 
# ---------------
# Test fe0f: Add prio qdisc on egress with 4 bands and priomap exceeding TC_PRIO_MAX entries
# Test 1f91: Add prio qdisc on egress with 4 bands and priomap's values exceeding bands number
# exit: 255
# exit: 1
# setting the network namespace "tcut-3396639522" failed: Invalid argument
# 
# Test d248: Add prio qdisc on egress with invalid bands value (< 2)
# exit: 1
# exit: 2
# Cannot find device "dummy1idd248"
# 
# Test 1d0e: Add prio qdisc on egress with invalid bands value exceeding TCQ_PRIO_BANDS
# Test 1971: Replace default prio qdisc on egress with 8 bands and new priomap
# exception iproute2 exited with an error code in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
# Test 30c4: Add ETS qdisc with quanta + priomap
# exit: 1
# exit: 0
# Cannot find device "dummy1id30c4"
# 
# returncode 2; expected [0]
# "-----> teardown stage" did not complete successfully
# Exception <class '__main__.PluginMgrTestFail'> ('teardown', 'Cannot find device "dummy1id30c4"\n', '"-----> teardown stage" did not complete successfully') (caught in test_runner, running test 2 30c4 Add ETS qdisc with quanta + priomap stage teardown)
# ---------------
# traceback
# ---------------
# accumulated output for this test:
# Cannot find device "dummy1id30c4"
# 
# ---------------
# Test 0301: Change CHOKE with limit setting
# returncode 255; expected [0]
# "-----> prepare stage" did not complete successfully
# Exception <class '__main__.PluginMgrTestFail'> ('setup', None, '"-----> prepare stage" did not complete successfully') (caught in test_runner, running test 2 0301 Change CHOKE with limit setting stage setup)
# ---------------
# traceback
# ---------------
# ---------------
# multiprocessing.pool.RemoteTraceback: 
# """
# Traceback (most recent call last):
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 142, in call_pre_case
#     pgn_inst.pre_case(caseinfo, test_skip)
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/plugin-lib/nsPlugin.py", line 63, in pre_case
#     self.prepare_test(test)
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/plugin-lib/nsPlugin.py", line 38, in prepare_test
#     self._ipr2_ns_create()
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/plugin-lib/nsPlugin.py", line 187, in _ipr2_ns_create
#     self._exec_cmd_batched('pre', self._ipr2_ns_create_cmds())
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/plugin-lib/nsPlugin.py", line 242, in _exec_cmd_batched
#     self._exec_cmd(stage, cmd)
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/plugin-lib/nsPlugin.py", line 233, in _exec_cmd
#     proc.stdin.flush()
# BrokenPipeError: [Errno 32] Broken pipe
# 
# During handling of the above exception, another exception occurred:
# 
# Traceback (most recent call last):
#   File "/usr/lib64/python3.11/multiprocessing/pool.py", line 125, in worker
#     result = (True, func(*args, **kwds))
#                     ^^^^^^^^^^^^^^^^^^^
#   File "/usr/lib64/python3.11/multiprocessing/pool.py", line 48, in mapstar
#     return list(map(*args))
#            ^^^^^^^^^^^^^^^^
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 602, in __mp_runner
#     (_, tsr) = test_runner(mp_pm, mp_args, tests)
#                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 536, in test_runner
#     res = run_one_test(pm, args, index, tidx)
#           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 419, in run_one_test
#     pm.call_pre_case(tidx)
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 146, in call_pre_case
#     print('test_ordinal is {}'.format(test_ordinal))
#                                       ^^^^^^^^^^^^
# NameError: name 'test_ordinal' is not defined
# """
# 
# The above exception was the direct cause of the following exception:
# 
# Traceback (most recent call last):
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 1028, in <module>
#     main()
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 1022, in main
#     set_operation_mode(pm, parser, args, remaining)
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 964, in set_operation_mode
#     catresults = test_runner_mp(pm, args, alltests)
#                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 624, in test_runner_mp
#     pres = p.map(__mp_runner, batches)
#            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/usr/lib64/python3.11/multiprocessing/pool.py", line 367, in map
#     return self._map_async(func, iterable, mapstar, chunksize).get()
#            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/usr/lib64/python3.11/multiprocessing/pool.py", line 774, in get
#     raise self._value
# NameError: name 'test_ordinal' is not defined
not ok 1 selftests: tc-testing: tdc.sh # exit=1
make[1]: Leaving directory '/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing'
make: Leaving directory '/mnt/share156/jiri/net-next/tools/testing/selftests'
Pedro Tammela Jan. 5, 2024, 11:51 a.m. UTC | #9
On 05/01/2024 08:24, Jiri Pirko wrote:
> Thu, Jan 04, 2024 at 07:22:48PM CET, jhs@mojatatu.com wrote:
>> On Thu, Jan 4, 2024 at 1:03 PM Jiri Pirko <jiri@resnulli.us> wrote:
>>>
>>> Thu, Jan 04, 2024 at 05:10:58PM CET, jhs@mojatatu.com wrote:
>>>> On Thu, Jan 4, 2024 at 7:58 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>>>>
>>>>> From: Jiri Pirko <jiri@nvidia.com>
>>>>>
>>>>> Inserting the device to block xarray in qdisc_create() is not suitable
>>>>> place to do this. As it requires use of tcf_block() callback, it causes
>>>>> multiple issues. It is called for all qdisc types, which is incorrect.
>>>>>
>>>>> So, instead, move it to more suitable place, which is tcf_block_get_ext()
>>>>> and make sure it is only done for qdiscs that use block infrastructure
>>>>> and also only for blocks which are shared.
>>>>>
>>>>> Symmetrically, alter the cleanup path, move the xarray entry removal
>>>>> into tcf_block_put_ext().
>>>>>
>>>>> Fixes: 913b47d3424e ("net/sched: Introduce tc block netdev tracking infra")
>>>>> Reported-by: Ido Schimmel <idosch@nvidia.com>
>>>>> Closes: https://lore.kernel.org/all/ZY1hBb8GFwycfgvd@shredder/
>>>>> Reported-by: Kui-Feng Lee <sinquersw@gmail.com>
>>>>> Closes: https://lore.kernel.org/all/ce8d3e55-b8bc-409c-ace9-5cf1c4f7c88e@gmail.com/
>>>>> Reported-and-tested-by: syzbot+84339b9e7330daae4d66@syzkaller.appspotmail.com
>>>>> Closes: https://lore.kernel.org/all/0000000000007c85f5060dcc3a28@google.com/
>>>>> Reported-and-tested-by: syzbot+806b0572c8d06b66b234@syzkaller.appspotmail.com
>>>>> Closes: https://lore.kernel.org/all/00000000000082f2f2060dcc3a92@google.com/
>>>>> Reported-and-tested-by: syzbot+0039110f932d438130f9@syzkaller.appspotmail.com
>>>>> Closes: https://lore.kernel.org/all/0000000000007fbc8c060dcc3a5c@google.com/
>>>>> Signed-off-by: Jiri Pirko <jiri@nvidia.com>
>>>>
>>>> Did you get a chance to run the tdc tests?
>>>
>>> I ran the TC ones we have in the net/forwarding directory.
>>> I didn't manage to run the tdc. Readme didn't help me much.
>>> How do you run the suite?
>>
>> For next time:
>> make -C tools/testing/selftests TARGETS=tc-testing run_tests
> 
> Unrelated to this patch.
> 
> Running this, I'm getting lots of errors, some seem might be bugs in
> tests. Here's the output:
> 
> make: Entering directory '/mnt/share156/jiri/net-next/tools/testing/selftests'
> make[1]: Entering directory '/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing'
> make[1]: Nothing to be done for 'all'.
> make[1]: Leaving directory '/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing'
> make[1]: Entering directory '/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing'
> TAP version 13
> 1..1
> # timeout set to 900
> # selftests: tc-testing: tdc.sh
> # netdevsim
> # act_bpf
> # act_connmark
> # act_csum
> # act_ct
> # act_ctinfo
> # act_gact
> # act_gate
> # act_mirred
> # act_mpls
> # act_nat
> # act_pedit
> # act_police
> # act_sample
> # act_simple
> # act_skbedit
> # act_skbmod
> # act_tunnel_key
> # act_vlan
> # cls_basic
> # cls_bpf
> # cls_cgroup
> # cls_flow
> # cls_flower
> # cls_fw
> # cls_matchall
> # cls_route
> # cls_u32
> # Module em_canid not found... skipping.
> # em_cmp
> # em_ipset
> # em_ipt
> # em_meta
> # em_nbyte
> # em_text
> # em_u32
> # sch_cake
> # sch_cbs
> # sch_choke
> # sch_codel
> # sch_drr
> # sch_etf
> # sch_ets
> # sch_fq
> # sch_fq_codel
> # sch_fq_pie
> # sch_gred
> # sch_hfsc
> # sch_hhf
> # sch_htb
> # sch_teql
> # considering category actions
> # !!! Consider installing pyroute2 !!!
> #  -- ns/SubPlugin.__init__
> #  -- scapy/SubPlugin.__init__
> # Executing 528 tests in parallel and 15 in serial
> # Using 18 batches and 4 workers
> #

Hi Jiri,

Can you also try running after installing pyroute2?
`pip3 install pyroute2` or via your distro's package manager.

Seems like in your kernel config + (virtual?) machine you are running 
into issues like:
  - create netns
  - try to use netns (still "creating", not visible yet)
  - fail badly

Since pyroute2 is netlink based, these issues are more likely to not happen.

Yes I agree the error messages are cryptic, we have been wondering how 
to improve them.
If you are still running into issues after trying with pyroute2, can you 
also try running tdc with the following diff? It disables the parallel 
testing for kselftests.

Perhaps when pyroute2 is not detected on the system we should disable 
the parallel testing completely. tc/ip commands are too susceptible to 
these issues and parallel testing just amplifies them.

diff --git a/tools/testing/selftests/tc-testing/tdc.sh 
b/tools/testing/selftests/tc-testing/tdc.sh
index c53ede8b730d..89bb123db86b 100755
--- a/tools/testing/selftests/tc-testing/tdc.sh
+++ b/tools/testing/selftests/tc-testing/tdc.sh
@@ -63,5 +63,5 @@ try_modprobe sch_hfsc
  try_modprobe sch_hhf
  try_modprobe sch_htb
  try_modprobe sch_teql
-./tdc.py -J`nproc` -c actions
-./tdc.py -J`nproc` -c qdisc
+./tdc.py -J1 -c actions
+./tdc.py -J1 -c qdisc
Jiri Pirko Jan. 6, 2024, 11:14 a.m. UTC | #10
Fri, Jan 05, 2024 at 12:51:04PM CET, pctammela@mojatatu.com wrote:
>On 05/01/2024 08:24, Jiri Pirko wrote:
>> Thu, Jan 04, 2024 at 07:22:48PM CET, jhs@mojatatu.com wrote:
>> > On Thu, Jan 4, 2024 at 1:03 PM Jiri Pirko <jiri@resnulli.us> wrote:
>> > > 
>> > > Thu, Jan 04, 2024 at 05:10:58PM CET, jhs@mojatatu.com wrote:
>> > > > On Thu, Jan 4, 2024 at 7:58 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> > > > > 
>> > > > > From: Jiri Pirko <jiri@nvidia.com>
>> > > > > 
>> > > > > Inserting the device to block xarray in qdisc_create() is not suitable
>> > > > > place to do this. As it requires use of tcf_block() callback, it causes
>> > > > > multiple issues. It is called for all qdisc types, which is incorrect.
>> > > > > 
>> > > > > So, instead, move it to more suitable place, which is tcf_block_get_ext()
>> > > > > and make sure it is only done for qdiscs that use block infrastructure
>> > > > > and also only for blocks which are shared.
>> > > > > 
>> > > > > Symmetrically, alter the cleanup path, move the xarray entry removal
>> > > > > into tcf_block_put_ext().
>> > > > > 
>> > > > > Fixes: 913b47d3424e ("net/sched: Introduce tc block netdev tracking infra")
>> > > > > Reported-by: Ido Schimmel <idosch@nvidia.com>
>> > > > > Closes: https://lore.kernel.org/all/ZY1hBb8GFwycfgvd@shredder/
>> > > > > Reported-by: Kui-Feng Lee <sinquersw@gmail.com>
>> > > > > Closes: https://lore.kernel.org/all/ce8d3e55-b8bc-409c-ace9-5cf1c4f7c88e@gmail.com/
>> > > > > Reported-and-tested-by: syzbot+84339b9e7330daae4d66@syzkaller.appspotmail.com
>> > > > > Closes: https://lore.kernel.org/all/0000000000007c85f5060dcc3a28@google.com/
>> > > > > Reported-and-tested-by: syzbot+806b0572c8d06b66b234@syzkaller.appspotmail.com
>> > > > > Closes: https://lore.kernel.org/all/00000000000082f2f2060dcc3a92@google.com/
>> > > > > Reported-and-tested-by: syzbot+0039110f932d438130f9@syzkaller.appspotmail.com
>> > > > > Closes: https://lore.kernel.org/all/0000000000007fbc8c060dcc3a5c@google.com/
>> > > > > Signed-off-by: Jiri Pirko <jiri@nvidia.com>
>> > > > 
>> > > > Did you get a chance to run the tdc tests?
>> > > 
>> > > I ran the TC ones we have in the net/forwarding directory.
>> > > I didn't manage to run the tdc. Readme didn't help me much.
>> > > How do you run the suite?
>> > 
>> > For next time:
>> > make -C tools/testing/selftests TARGETS=tc-testing run_tests
>> 
>> Unrelated to this patch.
>> 
>> Running this, I'm getting lots of errors, some seem might be bugs in
>> tests. Here's the output:
>> 
>> make: Entering directory '/mnt/share156/jiri/net-next/tools/testing/selftests'
>> make[1]: Entering directory '/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing'
>> make[1]: Nothing to be done for 'all'.
>> make[1]: Leaving directory '/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing'
>> make[1]: Entering directory '/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing'
>> TAP version 13
>> 1..1
>> # timeout set to 900
>> # selftests: tc-testing: tdc.sh
>> # netdevsim
>> # act_bpf
>> # act_connmark
>> # act_csum
>> # act_ct
>> # act_ctinfo
>> # act_gact
>> # act_gate
>> # act_mirred
>> # act_mpls
>> # act_nat
>> # act_pedit
>> # act_police
>> # act_sample
>> # act_simple
>> # act_skbedit
>> # act_skbmod
>> # act_tunnel_key
>> # act_vlan
>> # cls_basic
>> # cls_bpf
>> # cls_cgroup
>> # cls_flow
>> # cls_flower
>> # cls_fw
>> # cls_matchall
>> # cls_route
>> # cls_u32
>> # Module em_canid not found... skipping.
>> # em_cmp
>> # em_ipset
>> # em_ipt
>> # em_meta
>> # em_nbyte
>> # em_text
>> # em_u32
>> # sch_cake
>> # sch_cbs
>> # sch_choke
>> # sch_codel
>> # sch_drr
>> # sch_etf
>> # sch_ets
>> # sch_fq
>> # sch_fq_codel
>> # sch_fq_pie
>> # sch_gred
>> # sch_hfsc
>> # sch_hhf
>> # sch_htb
>> # sch_teql
>> # considering category actions
>> # !!! Consider installing pyroute2 !!!
>> #  -- ns/SubPlugin.__init__
>> #  -- scapy/SubPlugin.__init__
>> # Executing 528 tests in parallel and 15 in serial
>> # Using 18 batches and 4 workers
>> #
>
>Hi Jiri,
>
>Can you also try running after installing pyroute2?
>`pip3 install pyroute2` or via your distro's package manager.
>
>Seems like in your kernel config + (virtual?) machine you are running into
>issues like:
> - create netns
> - try to use netns (still "creating", not visible yet)
> - fail badly
>
>Since pyroute2 is netlink based, these issues are more likely to not happen.
>
>Yes I agree the error messages are cryptic, we have been wondering how to
>improve them.
>If you are still running into issues after trying with pyroute2, can you also
>try running tdc with the following diff? It disables the parallel testing for
>kselftests.
>
>Perhaps when pyroute2 is not detected on the system we should disable the
>parallel testing completely. tc/ip commands are too susceptible to these
>issues and parallel testing just amplifies them.
>
>diff --git a/tools/testing/selftests/tc-testing/tdc.sh
>b/tools/testing/selftests/tc-testing/tdc.sh
>index c53ede8b730d..89bb123db86b 100755
>--- a/tools/testing/selftests/tc-testing/tdc.sh
>+++ b/tools/testing/selftests/tc-testing/tdc.sh
>@@ -63,5 +63,5 @@ try_modprobe sch_hfsc
> try_modprobe sch_hhf
> try_modprobe sch_htb
> try_modprobe sch_teql
>-./tdc.py -J`nproc` -c actions
>-./tdc.py -J`nproc` -c qdisc
>+./tdc.py -J1 -c actions
>+./tdc.py -J1 -c qdisc
>

I installed pyroute2. It looks some of the error went away, still having
a couple:

make: Entering directory '/mnt/share156/jiri/net-next/tools/testing/selftests'
make[1]: Entering directory '/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing'
make[1]: Entering directory '/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing'
TAP version 13
1..1
# timeout set to 900
# selftests: tc-testing: tdc.sh
# netdevsim
# act_bpf
# act_connmark
# act_csum
# act_ct
# act_ctinfo
# act_gact
# act_gate
# act_mirred
# act_mpls
# act_nat
# act_pedit
# act_police
# act_sample
# act_simple
# act_skbedit
# act_skbmod
# act_tunnel_key
# act_vlan
# cls_basic
# cls_bpf
# cls_cgroup
# cls_flow
# cls_flower
# cls_fw
# cls_matchall
# cls_route
# cls_u32
# Module em_canid not found... skipping.
# em_cmp
# em_ipset
# em_ipt
# em_meta
# em_nbyte
# em_text
# em_u32
# sch_cake
# sch_cbs
# sch_choke
# sch_codel
# sch_drr
# sch_etf
# sch_ets
# sch_fq
# sch_fq_codel
# sch_fq_pie
# sch_gred
# sch_hfsc
# sch_hhf
# sch_htb
# sch_teql
# considering category actions
#  -- ns/SubPlugin.__init__
#  -- scapy/SubPlugin.__init__
# Executing 528 tests in parallel and 15 in serial
# Using 18 batches and 4 workers
# ..
# -----> teardown stage *** Could not execute: "$TC actions flush action xt"
# 
# -----> teardown stage *** Error message: "Error: Cannot flush unknown TC action.
# We have an error flushing
# "
# 
# -----> teardown stage *** Aborting test run.
# 
# 
# <_io.BufferedReader name=6> *** stdout ***
# 
# 
# <_io.BufferedReader name=17> *** stderr ***
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 536, in test_runner
#     res = run_one_test(pm, args, index, tidx)
#           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 465, in run_one_test
#     prepare_env(tidx, args, pm, 'teardown', '-----> teardown stage', tidx['teardown'], procout)
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 268, in prepare_env
#     raise PluginMgrTestFail(
# Test 5232: List ctinfo actions
# Test 7702: Flush ctinfo actions
# Test 3201: Add ctinfo action with duplicate index
# Test 8295: Add ctinfo action with invalid index
# Test 3964: Replace ctinfo action with invalid goto_chain control
# Test 5124: Add mirred mirror to egress action
# Test 6fb4: Add mirred redirect to egress action
# Test ba38: Get mirred actions
# Test d7c0: Add invalid mirred direction
# Test e213: Add invalid mirred action
# Test 2d89: Add mirred action with invalid device
# Test 300b: Add mirred action with duplicate index
# Test 8917: Add mirred mirror action with control pass
# Test 1054: Add mirred mirror action with control pipe
# Test 9887: Add mirred mirror action with control continue
# Test e4aa: Add mirred mirror action with control reclassify
# Test ece9: Add mirred mirror action with control drop
# Test 0031: Add mirred mirror action with control jump
# Test 407c: Add mirred mirror action with cookie
# Test 8b69: Add mirred mirror action with index at 32-bit maximum
# Test 3f66: Add mirred mirror action with index exceeding 32-bit maximum
# Test a70e: Delete mirred mirror action
# Test 3fb3: Delete mirred redirect action
# Test 2a9a: Replace mirred action with invalid goto chain control
# Test 4749: Add batch of 32 mirred redirect egress actions with cookie
# Test 5c69: Delete batch of 32 mirred redirect egress actions
# Test d3c0: Add batch of 32 mirred mirror ingress actions with cookie
# Test e684: Delete batch of 32 mirred mirror ingress actions
# Test 31e3: Add mirred mirror to egress action with no_percpu flag
# Test 6d84: Add csum iph action
# Test 1862: Add csum ip4h action
# Test 15c6: Add csum ipv4h action
# Test bf47: Add csum icmp action
# Test cc1d: Add csum igmp action
# Test bccc: Add csum foobar action
# Test 3bb4: Add csum tcp action
# Test 759c: Add csum udp action
# Test bdb6: Add csum udp xor iph action
# Test c220: Add csum udplite action
# Test 8993: Add csum sctp action
# Test b138: Add csum ip & icmp action
# Test eeda: Add csum ip & sctp action
# Test 0017: Add csum udp or tcp action
# Test b10b: Add all 7 csum actions
# Test ce92: Add csum udp action with cookie
# Test 912f: Add csum icmp action with large cookie
# Test 879b: Add batch of 32 csum tcp actions
# Test b4e9: Delete batch of 32 csum actions
# Test 0015: Add batch of 32 csum tcp actions with large cookies
# Test 989e: Delete batch of 32 csum actions with large cookies
# Test d128: Replace csum action with invalid goto chain control
# Test eaf0: Add csum iph action with no_percpu flag
# Test a933: Add MPLS dec_ttl action with pipe opcode
# Test 08d1: Add mpls dec_ttl action with pass opcode
# Test d786: Add mpls dec_ttl action with drop opcode
# Test f334: Add mpls dec_ttl action with reclassify opcode
# Test 29bd: Add mpls dec_ttl action with continue opcode
# Test 48df: Add mpls dec_ttl action with jump opcode
# Test 62eb: Add mpls dec_ttl action with trap opcode
# Test 09d2: Add mpls dec_ttl action with opcode and cookie
# Test c170: Add mpls dec_ttl action with opcode and cookie of max length
# Test 9118: Add mpls dec_ttl action with invalid opcode
# Test 6ce1: Add mpls dec_ttl action with label (invalid)
# Test 352f: Add mpls dec_ttl action with tc (invalid)
# Test 1c3a: Flush police actions
# Test 7326: Add police action with control continue
# Test 34fa: Add police action with control drop
# Test 8dd5: Add police action with control ok
# Test b9d1: Add police action with control reclassify
# Test c534: Add police action with control pipe
# Test b48b: Add police action with exceed goto chain control action
# Test 689e: Replace police action with invalid goto chain control
# Test cdd7: Add valid police action with packets per second rate limit
# Test f5bc: Add invalid police action with both bps and pps
# Test 7d64: Add police action with skip_hw option
# Test 7d50: Add skbmod action to set destination mac
# Test 9b29: Add skbmod action to set source mac
# Test 1724: Add skbmod action with invalid mac
# Test 3cf1: Add skbmod action with valid etype
# Test a749: Add skbmod action with invalid etype
# Test bfe6: Add skbmod action to swap mac
# Test 839b: Add skbmod action with control pipe
# Test c167: Add skbmod action with control reclassify
# Test 0c2f: Add skbmod action with control drop
# Test d113: Add skbmod action with control continue
# Test 7242: Add skbmod action with control pass
# Test 6046: Add skbmod action with control reclassify and cookie
# Test 58cb: List skbmod actions
# Test 9aa8: Get a single skbmod action from a list
# Test e93a: Delete an skbmod action
# Test 40c2: Flush skbmod actions
# Test b651: Replace skbmod action with invalid goto_chain control
# Test fe09: Add skbmod action to mark ECN bits
# Test 696a: Add simple ct action
# Test e38c: Add simple ct action with cookie
# Test 9f20: Add ct clear action
# Test 0bc1: Add ct clear action with cookie of max length
# Test 5bea: Try ct with zone
# Test d5d6: Try ct with zone, commit
# Test 029f: Try ct with zone, commit, mark
# Test a58d: Try ct with zone, commit, mark, nat
# Test 901b: Try ct with full nat ipv4 range syntax
# Test 072b: Try ct with full nat ipv6 syntax
# Test 3420: Try ct with full nat ipv6 range syntax
# Test 4470: Try ct with full nat ipv6 range syntax + force
# Test 5d88: Try ct with label
# Test 04d4: Try ct with label with mask
# Test 9751: Try ct with mark + mask
# Test 2faa: Try ct with mark + mask and cookie
# Test 3991: Add simple ct action with no_percpu flag
# Test 3992: Add ct action triggering DNAT tuple conflict
# 
# Sent 1 packets.
# 
# Sent 1 packets.
# Test 2029: Add xt action with log-prefix
# exit: 255
# exit: 0
# Warning: Extension LOG revision 0 not supported, missing kernel module?
# Error: Failed to load TC action module.
# We have an error talking to the kernel
# 
# returncode 1; expected [0]
# "-----> teardown stage" did not complete successfully
# Exception <class '__main__.PluginMgrTestFail'> ('teardown', 'Warning: Extension LOG revision 0 not supported, missing kernel module?\nError: Failed to load TC action module.\nWe have an error talking to the kernel\n', '"-----> teardown stage" did not complete successfully') (caught in test_runner, running test 17 2029 Add xt action with log-prefix stage teardown)
# ---------------
# traceback
# ---------------
# accumulated output for this test:
# Warning: Extension LOG revision 0 not supported, missing kernel module?
# Error: Failed to load TC action module.
# We have an error talking to the kernel
# 
# ---------------
# Test a5a7: Add pedit action with LAYERED_OP eth set src
# exception (34, 'Numerical result out of range') in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
# Test 5a31: Add vlan modify action for protocol 802.1AD
# exception (34, 'Numerical result out of range') in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
# Test 8eb5: Create valid ife encode action with tcindex and continue control
# exception (34, 'Numerical result out of range') in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
# Test a874: Add invalid sample action without mandatory arguments
# exception (34, 'Numerical result out of range') in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
# Test ac2a: List actions
# Test 3edf: Flush gact actions
# Test 63ec: Delete pass action
# Test 46be: Delete pipe action
# Test 2e08: Delete reclassify action
# Test 99c4: Delete drop action
# Test fb6b: Delete continue action
# Test 0eb3: Delete non-existent action
# Test f02c: Replace gact action
# Test 525f: Get gact action by index
# Test 1021: Add batch of 32 gact pass actions
# Test da7a: Add batch of 32 gact continue actions with cookie
# Test 8aa3: Delete batch of 32 gact continue actions
# Test 8e47: Add gact action with random determ goto chain control action
# Test ca89: Replace gact action with invalid goto chain control
# Test 95ad: Add gact pass action with no_percpu flag
# Test 7f52: Try to flush action which is referenced by filter
# Test ae1e: Try to flush actions when last one is referenced by filter
# Test 2b11: Add tunnel_key set action with mandatory parameters
# Test dc6b: Add tunnel_key set action with missing mandatory src_ip parameter
# Test 7f25: Add tunnel_key set action with missing mandatory dst_ip parameter
# Test a5e0: Add tunnel_key set action with invalid src_ip parameter
# Test eaa8: Add tunnel_key set action with invalid dst_ip parameter
# Test 3b09: Add tunnel_key set action with invalid id parameter
# Test 9625: Add tunnel_key set action with invalid dst_port parameter
# Test 05af: Add tunnel_key set action with optional dst_port parameter
# Test da80: Add tunnel_key set action with index at 32-bit maximum
# Test d407: Add tunnel_key set action with index exceeding 32-bit maximum
# Test 5cba: Add tunnel_key set action with id value at 32-bit maximum
# Test e84a: Add tunnel_key set action with id value exceeding 32-bit maximum
# Test 9c19: Add tunnel_key set action with dst_port value at 16-bit maximum
# Test 3bd9: Add tunnel_key set action with dst_port value exceeding 16-bit maximum
# Test 68e2: Add tunnel_key unset action
# Test 6192: Add tunnel_key unset continue action
# Test 061d: Add tunnel_key set continue action with cookie
# Test 8acb: Add tunnel_key set continue action with invalid cookie
# Test a07e: Add tunnel_key action with no set/unset command specified
# Test b227: Add tunnel_key action with csum option
# Test 58a7: Add tunnel_key action with nocsum option
# Test 2575: Add tunnel_key action with not-supported parameter
# Test 7a88: Add tunnel_key action with cookie parameter
# Test 4f20: Add tunnel_key action with a single geneve option parameter
# Test e33d: Add tunnel_key action with multiple geneve options parameter
# Test 0778: Add tunnel_key action with invalid class geneve option parameter
# Test 4ae8: Add tunnel_key action with invalid type geneve option parameter
# Test 4039: Add tunnel_key action with short data length geneve option parameter
# Test 26a6: Add tunnel_key action with non-multiple of 4 data length geneve option parameter
# Test f44d: Add tunnel_key action with incomplete geneve options parameter
# Test 7afc: Replace tunnel_key set action with all parameters
# Test 364d: Replace tunnel_key set action with all parameters and cookie
# Test 937c: Fetch all existing tunnel_key actions
# Test 6783: Flush all existing tunnel_key actions
# Test 8242: Replace tunnel_key set action with invalid goto chain
# Test 0cd2: Add tunnel_key set action with no_percpu flag
# Test 3671: Delete tunnel_key set action with valid index
# Test 8597: Delete tunnel_key set action with invalid index
# Test 6bda: Add tunnel_key action with nofrag option
# Test c826: Add ctinfo action with default setting
# Test 0286: Add ctinfo action with dscp
# Test 4938: Add ctinfo action with valid cpmark and zone
# Test 7593: Add ctinfo action with drop control
# Test 2961: Replace ctinfo action zone and action control
# Test e567: Delete ctinfo action with valid index
# Test 6a91: Delete ctinfo action with invalid index
# Test d959: Add cBPF action with valid bytecode
# Test f84a: Add cBPF action with invalid bytecode
# Test e939: Add eBPF action with valid object-file
# Test 282d: Add eBPF action with invalid object-file
# Test d819: Replace cBPF bytecode and action control
# Test 6ae3: Delete cBPF action 
# Test 3e0d: List cBPF actions
# Test 55ce: Flush BPF actions
# Test ccc3: Add cBPF action with duplicate index
# Test 89c7: Add cBPF action with invalid index
# Test 7ab9: Add cBPF action with cookie
# Test b8a1: Replace bpf action with invalid goto_chain control
# Test 2893: Add pedit action with RAW_OP offset u8 quad
# Test 6795: Add pedit action with LAYERED_OP ip6 set payload_len, nexthdr, hoplimit
# Test cfcc: Add pedit action with LAYERED_OP tcp flags set
# Test b078: Add simple action
# Test 4297: Add simple action with change command
# Test 6d4c: Add simple action with duplicate index
# Test 2542: List simple actions
# Test ea67: Delete simple action
# Test 8ff1: Flush simple actions
# Test b776: Replace simple action with invalid goto chain control
# Test 8d07: Verify cleanup of failed actions batch add
# Test a68a: Verify cleanup of failed actions batch change
# Test 49aa: Add valid basic police action
# Test 3abe: Add police action with duplicate index
# Test 49fa: Add valid police action with mtu
# Test 7943: Add valid police action with peakrate
# Test 055e: Add police action with peakrate and no mtu
# Test f057: Add police action with valid overhead
# Test 7ffb: Add police action with ethernet linklayer type
# Test 3dda: Add police action with atm linklayer type
# Test 551b: Add police actions with conform-exceed control continue/drop
# Test 0c70: Add police actions with conform-exceed control pass/reclassify
# Test d946: Add police actions with conform-exceed control pass/pipe
# Test ddd6: Add police action with invalid rate value
# Test f61c: Add police action with invalid burst value
# Test 6aaf: Add police actions with conform-exceed control pass/pipe [with numeric values]
# Test 29b1: Add police actions with conform-exceed control <invalid>/drop
# Test c26f: Add police action with invalid peakrate value
# Test db04: Add police action with invalid mtu value
# Test f3c9: Add police action with cookie
# Test d190: Add police action with maximum index
# Test 336e: Delete police action
# Test 77fa: Get single police action from many actions
# Test aa43: Get single police action without specifying index
# Test 858b: List police actions
# Test fa1c: Add mpls dec_ttl action with ttl (invalid)
# Test 6b79: Add mpls dec_ttl action with bos (invalid)
# Test d4c4: Add mpls pop action with ip proto
# Test 91fb: Add mpls pop action with ip proto and cookie
# Test 92fe: Add mpls pop action with mpls proto
# Test 7e23: Add mpls pop action with no protocol (invalid)
# Test 6182: Add mpls pop action with label (invalid)
# Test 6475: Add mpls pop action with tc (invalid)
# Test 067b: Add mpls pop action with ttl (invalid)
# Test 7316: Add mpls pop action with bos (invalid)
# Test 38cc: Add mpls push action with label
# Test c281: Add mpls push action with mpls_mc protocol
# Test 5db4: Add mpls push action with label, tc and ttl
# Test 7c34: Add mpls push action with label, tc ttl and cookie of max length
# Test 16eb: Add mpls push action with label and bos
# Test d69d: Add mpls push action with no label (invalid)
# Test e8e4: Add mpls push action with ipv4 protocol (invalid)
# Test ecd0: Add mpls push action with out of range label (invalid)
# Test d303: Add mpls push action with out of range tc (invalid)
# Test fd6e: Add mpls push action with ttl of 0 (invalid)
# Test 19e9: Add mpls mod action with mpls label
# Test 1fde: Add mpls mod action with max mpls label
# Test 0c50: Add mpls mod action with mpls label exceeding max (invalid)
# Test 10b6: Add mpls mod action with mpls label of MPLS_LABEL_IMPLNULL (invalid)
# Test 57c9: Add mpls mod action with mpls min tc
# Test 6872: Add mpls mod action with mpls max tc
# Test a70a: Add mpls mod action with mpls tc exceeding max (invalid)
# Test 6ed5: Add mpls mod action with mpls ttl
# Test 77c1: Add mpls mod action with mpls ttl and cookie
# Test b80f: Add mpls mod action with mpls max ttl
# Test 8864: Add mpls mod action with mpls min ttl
# Test 6c06: Add mpls mod action with mpls ttl of 0 (invalid)
# Test b5d8: Add mpls mod action with mpls ttl exceeding max (invalid)
# Test 451f: Add mpls mod action with mpls max bos
# Test a1ed: Add mpls mod action with mpls min bos
# Test 3dcf: Add mpls mod action with mpls bos exceeding max (invalid)
# Test db7c: Add mpls mod action with protocol (invalid)
# Test b070: Replace existing mpls push action with new ID
# Test 95a9: Replace existing mpls push action with new label, tc, ttl and cookie
# Test 6cce: Delete mpls pop action
# Test d138: Flush mpls actions
# Test 319a: Add pedit action that mangles IP TTL
# Test 7e67: Replace pedit action with invalid goto chain
# Test 377e: Add pedit action with RAW_OP offset u32
# Test a0ca: Add pedit action with RAW_OP offset u32 (INVALID)
# Test dd8a: Add pedit action with RAW_OP offset u16 u16
# Test 53db: Add pedit action with RAW_OP offset u16 (INVALID)
# Test 5c7e: Add pedit action with RAW_OP offset u8 add value
# Test 3a07: Add pedit action with RAW_OP offset u8-u16-u8
# Test ab0f: Add pedit action with RAW_OP offset u16-u8-u8
# Test 9d12: Add pedit action with RAW_OP offset u32 set u16 clear u8 invert
# Test ebfa: Add pedit action with RAW_OP offset overflow u32 (INVALID)
# Test f512: Add pedit action with RAW_OP offset u16 at offmask shift set
# Test c2cb: Add pedit action with RAW_OP offset u32 retain value
# Test 1762: Add pedit action with RAW_OP offset u8 clear value
# Test bcee: Add pedit action with RAW_OP offset u8 retain value
# Test e89f: Add pedit action with RAW_OP offset u16 retain value
# Test c282: Add pedit action with RAW_OP offset u32 clear value
# Test c422: Add pedit action with RAW_OP offset u16 invert value
# Test d3d3: Add pedit action with RAW_OP offset u32 invert value
# Test 57e5: Add pedit action with RAW_OP offset u8 preserve value
# Test 99e0: Add pedit action with RAW_OP offset u16 preserve value
# Test 1892: Add pedit action with RAW_OP offset u32 preserve value
# Test 4b60: Add pedit action with RAW_OP negative offset u16/u32 set value
# multiprocessing.pool.RemoteTraceback: 
# """
# Traceback (most recent call last):
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 142, in call_pre_case
#     pgn_inst.pre_case(caseinfo, test_skip)
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/plugin-lib/nsPlugin.py", line 63, in pre_case
#     self.prepare_test(test)
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/plugin-lib/nsPlugin.py", line 36, in prepare_test
#     self._nl_ns_create()
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/plugin-lib/nsPlugin.py", line 130, in _nl_ns_create
#     ip.link('add', ifname=dev1, kind='veth', peer={'ifname': dev0, 'net_ns_fd':'/proc/1/ns/net'})
#   File "/usr/local/lib/python3.11/site-packages/pyroute2/iproute/linux.py", line 1696, in link
#     ret = self.nlm_request(msg, msg_type=msg_type, msg_flags=msg_flags)
#           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/usr/local/lib/python3.11/site-packages/pyroute2/netlink/nlsocket.py", line 870, in nlm_request
#     return tuple(self._genlm_request(*argv, **kwarg))
#            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/usr/local/lib/python3.11/site-packages/pyroute2/netlink/nlsocket.py", line 1214, in nlm_request
#     for msg in self.get(
#                ^^^^^^^^^
#   File "/usr/local/lib/python3.11/site-packages/pyroute2/netlink/nlsocket.py", line 873, in get
#     return tuple(self._genlm_get(*argv, **kwarg))
#            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/usr/local/lib/python3.11/site-packages/pyroute2/netlink/nlsocket.py", line 550, in get
#     raise msg['header']['error']
# pyroute2.netlink.exceptions.NetlinkError: (34, 'Numerical result out of range')
# 
# During handling of the above exception, another exception occurred:
# 
# Traceback (most recent call last):
#   File "/usr/lib64/python3.11/multiprocessing/pool.py", line 125, in worker
#     result = (True, func(*args, **kwds))
#                     ^^^^^^^^^^^^^^^^^^^
#   File "/usr/lib64/python3.11/multiprocessing/pool.py", line 48, in mapstar
#     return list(map(*args))
#            ^^^^^^^^^^^^^^^^
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 602, in __mp_runner
#     (_, tsr) = test_runner(mp_pm, mp_args, tests)
#                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 536, in test_runner
#     res = run_one_test(pm, args, index, tidx)
#           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 419, in run_one_test
#     pm.call_pre_case(tidx)
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 146, in call_pre_case
#     print('test_ordinal is {}'.format(test_ordinal))
#                                       ^^^^^^^^^^^^
# NameError: name 'test_ordinal' is not defined
# """
# 
# The above exception was the direct cause of the following exception:
# 
# Traceback (most recent call last):
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 1028, in <module>
#     main()
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 1022, in main
#     set_operation_mode(pm, parser, args, remaining)
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 964, in set_operation_mode
#     catresults = test_runner_mp(pm, args, alltests)
#                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 624, in test_runner_mp
#     pres = p.map(__mp_runner, batches)
#            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/usr/lib64/python3.11/multiprocessing/pool.py", line 367, in map
#     return self._map_async(func, iterable, mapstar, chunksize).get()
#            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/usr/lib64/python3.11/multiprocessing/pool.py", line 774, in get
#     raise self._value
# NameError: name 'test_ordinal' is not defined
# considering category qdisc
#  -- scapy/SubPlugin.__init__
#  -- ns/SubPlugin.__init__
# Executing 294 tests in parallel and 33 in serial
# Using 11 batches and 4 workers
# 
# -----> prepare stage *** Could not execute: "$IP link add dev $ETH type dummy"
# 
# -----> prepare stage *** Error message: "RTNETLINK answers: File exists
# "
# 
# -----> prepare stage *** Aborting test run.
# 
# 
# <_io.BufferedReader name=6> *** stdout ***
# 
# 
# <_io.BufferedReader name=15> *** stderr ***
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 536, in test_runner
#     res = run_one_test(pm, args, index, tidx)
#           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 420, in run_one_test
#     prepare_env(tidx, args, pm, 'setup', "-----> prepare stage", tidx["setup"])
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 268, in prepare_env
#     raise PluginMgrTestFail(
# ......................Test 30c4: Add ETS qdisc with quanta + priomap
# Test e8ac: Add ETS qdisc with strict + priomap
# Test 5a7e: Add ETS qdisc with quanta + strict + priomap
# Test cb8b: Show ETS class :1
# Test 1b4e: Show ETS class :2
# Test f642: Show ETS class :3
# Test 0a5f: Show ETS strict class
# Test f7c8: Add ETS qdisc with too many quanta
# Test 2389: Add ETS qdisc with too many strict
# Test fe3c: Add ETS qdisc with too many strict + quanta
# Test cb04: Add ETS qdisc with excess priomap elements
# Test c32e: Add ETS qdisc with priomap above bands
# Test 744c: Add ETS qdisc with priomap above quanta
# Test 7b33: Add ETS qdisc with priomap above strict
# Test dbe6: Add ETS qdisc with priomap above strict + quanta
# Test bdb2: Add ETS qdisc with priomap within bands with strict + quanta
# Test 39a3: Add ETS qdisc with priomap above bands with strict + quanta
# Test 557c: Unset priorities default to the last band
# Test a347: Unset priorities default to the last band -- no priomap
# Test 39c4: Add ETS qdisc with too few bands
# Test 930b: Add ETS qdisc with too many bands
# Test 406a: Add ETS qdisc without parameters
# Test e51a: Zero element in quanta
# Test e7f2: Sole zero element in quanta
# Test d6e6: No values after the quanta keyword
# Test 28c6: Change ETS band quantum
# Test 4714: Change ETS band without quantum
# Test 6979: Change quantum of a strict ETS band
# Test 9a7d: Change ETS strict band without quantum
# Test 283e: Create skbprio with default setting
# Test c086: Create skbprio with limit setting
# Test 6733: Change skbprio with limit setting
# Test fe0f: Add prio qdisc on egress with 4 bands and priomap exceeding TC_PRIO_MAX entries
# Test 1f91: Add prio qdisc on egress with 4 bands and priomap's values exceeding bands number
# Test d248: Add prio qdisc on egress with invalid bands value (< 2)
# Test 1d0e: Add prio qdisc on egress with invalid bands value exceeding TCQ_PRIO_BANDS
# Test 1971: Replace default prio qdisc on egress with 8 bands and new priomap
# Test d88a: Add duplicate prio qdisc on egress
# Test 5948: Delete nonexistent prio qdisc
# Test 6c0a: Add prio qdisc on egress with invalid format for handles
# Test 0175: Delete prio qdisc twice
# Test 2410: Show prio class
# Test 0385: Create DRR with default setting
# Test 2375: Delete DRR with handle
# Test 3092: Show DRR class
# Test e90e: Add ETS qdisc using bands
# Test b059: Add ETS qdisc using quanta
# Test e8e7: Add ETS qdisc using strict
# Test 233c: Add ETS qdisc using bands + quanta
# Test 3d35: Add ETS qdisc using bands + strict
# Test 7f3b: Add ETS qdisc using strict + quanta
# Test 4593: Add ETS qdisc using strict 0 + quanta
# Test 8938: Add ETS qdisc using bands + strict + quanta
# Test 0782: Add ETS qdisc with more bands than quanta
# Test 501b: Add ETS qdisc with more bands than strict
# Test 671a: Add ETS qdisc with more bands than strict + quanta
# Test 2a23: Add ETS qdisc with 16 bands
# Test 8daf: Add ETS qdisc with 17 bands
# Test 7f95: Add ETS qdisc with 17 strict
# Test 837a: Add ETS qdisc with 16 quanta
# Test 65b6: Add ETS qdisc with 17 quanta
# Test b9e9: Add ETS qdisc with 16 strict + quanta
# Test 9877: Add ETS qdisc with 17 strict + quanta
# Test c696: Add ETS qdisc with priomap
# Test 20ba: Add multiq Qdisc to multi-queue device (8 queues)
# Test 4301: List multiq Class
# Test 7832: Delete nonexistent multiq Qdisc
# Test 2891: Delete multiq Qdisc twice
# Test 1329: Add multiq Qdisc to single-queue device
# Test 84a0: Create TEQL with default setting
# Test 7734: Create TEQL with multiple device
# returncode 2; expected [0]
# "-----> prepare stage" did not complete successfully
# Exception <class '__main__.PluginMgrTestFail'> ('setup', None, '"-----> prepare stage" did not complete successfully') (caught in test_runner, running test 8 7734 Create TEQL with multiple device stage setup)
# ---------------
# traceback
# ---------------
# ---------------
# Test 2958: Show skbprio class
# exception (34, 'Numerical result out of range') in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
# Test 0301: Change CHOKE with limit setting
# exception (34, 'Numerical result out of range') in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
# Test 2385: Create CAKE with besteffort flag
# exception (34, 'Numerical result out of range') in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
# Test 0521: Show ingress class
# exception (34, 'Numerical result out of range') in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
# Test b190: Create FQ_CODEL with noecn flag
# exception (34, 'Numerical result out of range') in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
# Test 5439: Create NETEM with multiple slot setting
# exception (34, 'Numerical result out of range') in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
# Test 30a9: Create SFB with increment setting
# exception (34, 'Numerical result out of range') in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
# multiprocessing.pool.RemoteTraceback: 
# """
# Traceback (most recent call last):
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 142, in call_pre_case
#     pgn_inst.pre_case(caseinfo, test_skip)
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/plugin-lib/nsPlugin.py", line 63, in pre_case
#     self.prepare_test(test)
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/plugin-lib/nsPlugin.py", line 36, in prepare_test
#     self._nl_ns_create()
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/plugin-lib/nsPlugin.py", line 130, in _nl_ns_create
#     ip.link('add', ifname=dev1, kind='veth', peer={'ifname': dev0, 'net_ns_fd':'/proc/1/ns/net'})
#   File "/usr/local/lib/python3.11/site-packages/pyroute2/iproute/linux.py", line 1696, in link
#     ret = self.nlm_request(msg, msg_type=msg_type, msg_flags=msg_flags)
#           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/usr/local/lib/python3.11/site-packages/pyroute2/netlink/nlsocket.py", line 870, in nlm_request
#     return tuple(self._genlm_request(*argv, **kwarg))
#            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/usr/local/lib/python3.11/site-packages/pyroute2/netlink/nlsocket.py", line 1214, in nlm_request
#     for msg in self.get(
#                ^^^^^^^^^
#   File "/usr/local/lib/python3.11/site-packages/pyroute2/netlink/nlsocket.py", line 873, in get
#     return tuple(self._genlm_get(*argv, **kwarg))
#            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/usr/local/lib/python3.11/site-packages/pyroute2/netlink/nlsocket.py", line 550, in get
#     raise msg['header']['error']
# pyroute2.netlink.exceptions.NetlinkError: (34, 'Numerical result out of range')
# 
# During handling of the above exception, another exception occurred:
# 
# Traceback (most recent call last):
#   File "/usr/lib64/python3.11/multiprocessing/pool.py", line 125, in worker
#     result = (True, func(*args, **kwds))
#                     ^^^^^^^^^^^^^^^^^^^
#   File "/usr/lib64/python3.11/multiprocessing/pool.py", line 48, in mapstar
#     return list(map(*args))
#            ^^^^^^^^^^^^^^^^
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 602, in __mp_runner
#     (_, tsr) = test_runner(mp_pm, mp_args, tests)
#                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 536, in test_runner
#     res = run_one_test(pm, args, index, tidx)
#           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 419, in run_one_test
#     pm.call_pre_case(tidx)
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 146, in call_pre_case
#     print('test_ordinal is {}'.format(test_ordinal))
#                                       ^^^^^^^^^^^^
# NameError: name 'test_ordinal' is not defined
# """
# 
# The above exception was the direct cause of the following exception:
# 
# Traceback (most recent call last):
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 1028, in <module>
#     main()
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 1022, in main
#     set_operation_mode(pm, parser, args, remaining)
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 964, in set_operation_mode
#     catresults = test_runner_mp(pm, args, alltests)
#                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 624, in test_runner_mp
#     pres = p.map(__mp_runner, batches)
#            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/usr/lib64/python3.11/multiprocessing/pool.py", line 367, in map
#     return self._map_async(func, iterable, mapstar, chunksize).get()
#            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/usr/lib64/python3.11/multiprocessing/pool.py", line 774, in get
#     raise self._value
# NameError: name 'test_ordinal' is not defined
not ok 1 selftests: tc-testing: tdc.sh # exit=1
make[1]: Leaving directory '/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing'
make: Leaving directory '/mnt/share156/jiri/net-next/tools/testing/selftests'
Jamal Hadi Salim Jan. 6, 2024, 11:49 a.m. UTC | #11
On Sat, Jan 6, 2024 at 6:14 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Fri, Jan 05, 2024 at 12:51:04PM CET, pctammela@mojatatu.com wrote:
> >On 05/01/2024 08:24, Jiri Pirko wrote:
> >> Thu, Jan 04, 2024 at 07:22:48PM CET, jhs@mojatatu.com wrote:
> >> > On Thu, Jan 4, 2024 at 1:03 PM Jiri Pirko <jiri@resnulli.us> wrote:
> >> > >
> >> > > Thu, Jan 04, 2024 at 05:10:58PM CET, jhs@mojatatu.com wrote:
> >> > > > On Thu, Jan 4, 2024 at 7:58 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> > > > >
> >> > > > > From: Jiri Pirko <jiri@nvidia.com>
> >> > > > >
> >> > > > > Inserting the device to block xarray in qdisc_create() is not suitable
> >> > > > > place to do this. As it requires use of tcf_block() callback, it causes
> >> > > > > multiple issues. It is called for all qdisc types, which is incorrect.
> >> > > > >
> >> > > > > So, instead, move it to more suitable place, which is tcf_block_get_ext()
> >> > > > > and make sure it is only done for qdiscs that use block infrastructure
> >> > > > > and also only for blocks which are shared.
> >> > > > >
> >> > > > > Symmetrically, alter the cleanup path, move the xarray entry removal
> >> > > > > into tcf_block_put_ext().
> >> > > > >
> >> > > > > Fixes: 913b47d3424e ("net/sched: Introduce tc block netdev tracking infra")
> >> > > > > Reported-by: Ido Schimmel <idosch@nvidia.com>
> >> > > > > Closes: https://lore.kernel.org/all/ZY1hBb8GFwycfgvd@shredder/
> >> > > > > Reported-by: Kui-Feng Lee <sinquersw@gmail.com>
> >> > > > > Closes: https://lore.kernel.org/all/ce8d3e55-b8bc-409c-ace9-5cf1c4f7c88e@gmail.com/
> >> > > > > Reported-and-tested-by: syzbot+84339b9e7330daae4d66@syzkaller.appspotmail.com
> >> > > > > Closes: https://lore.kernel.org/all/0000000000007c85f5060dcc3a28@google.com/
> >> > > > > Reported-and-tested-by: syzbot+806b0572c8d06b66b234@syzkaller.appspotmail.com
> >> > > > > Closes: https://lore.kernel.org/all/00000000000082f2f2060dcc3a92@google.com/
> >> > > > > Reported-and-tested-by: syzbot+0039110f932d438130f9@syzkaller.appspotmail.com
> >> > > > > Closes: https://lore.kernel.org/all/0000000000007fbc8c060dcc3a5c@google.com/
> >> > > > > Signed-off-by: Jiri Pirko <jiri@nvidia.com>
> >> > > >
> >> > > > Did you get a chance to run the tdc tests?
> >> > >
> >> > > I ran the TC ones we have in the net/forwarding directory.
> >> > > I didn't manage to run the tdc. Readme didn't help me much.
> >> > > How do you run the suite?
> >> >
> >> > For next time:
> >> > make -C tools/testing/selftests TARGETS=tc-testing run_tests
> >>
> >> Unrelated to this patch.
> >>
> >> Running this, I'm getting lots of errors, some seem might be bugs in
> >> tests. Here's the output:
> >>
> >> make: Entering directory '/mnt/share156/jiri/net-next/tools/testing/selftests'
> >> make[1]: Entering directory '/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing'
> >> make[1]: Nothing to be done for 'all'.
> >> make[1]: Leaving directory '/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing'
> >> make[1]: Entering directory '/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing'
> >> TAP version 13
> >> 1..1
> >> # timeout set to 900
> >> # selftests: tc-testing: tdc.sh
> >> # netdevsim
> >> # act_bpf
> >> # act_connmark
> >> # act_csum
> >> # act_ct
> >> # act_ctinfo
> >> # act_gact
> >> # act_gate
> >> # act_mirred
> >> # act_mpls
> >> # act_nat
> >> # act_pedit
> >> # act_police
> >> # act_sample
> >> # act_simple
> >> # act_skbedit
> >> # act_skbmod
> >> # act_tunnel_key
> >> # act_vlan
> >> # cls_basic
> >> # cls_bpf
> >> # cls_cgroup
> >> # cls_flow
> >> # cls_flower
> >> # cls_fw
> >> # cls_matchall
> >> # cls_route
> >> # cls_u32
> >> # Module em_canid not found... skipping.
> >> # em_cmp
> >> # em_ipset
> >> # em_ipt
> >> # em_meta
> >> # em_nbyte
> >> # em_text
> >> # em_u32
> >> # sch_cake
> >> # sch_cbs
> >> # sch_choke
> >> # sch_codel
> >> # sch_drr
> >> # sch_etf
> >> # sch_ets
> >> # sch_fq
> >> # sch_fq_codel
> >> # sch_fq_pie
> >> # sch_gred
> >> # sch_hfsc
> >> # sch_hhf
> >> # sch_htb
> >> # sch_teql
> >> # considering category actions
> >> # !!! Consider installing pyroute2 !!!
> >> #  -- ns/SubPlugin.__init__
> >> #  -- scapy/SubPlugin.__init__
> >> # Executing 528 tests in parallel and 15 in serial
> >> # Using 18 batches and 4 workers
> >> #
> >
> >Hi Jiri,
> >
> >Can you also try running after installing pyroute2?
> >`pip3 install pyroute2` or via your distro's package manager.
> >
> >Seems like in your kernel config + (virtual?) machine you are running into
> >issues like:
> > - create netns
> > - try to use netns (still "creating", not visible yet)
> > - fail badly
> >
> >Since pyroute2 is netlink based, these issues are more likely to not happen.
> >
> >Yes I agree the error messages are cryptic, we have been wondering how to
> >improve them.
> >If you are still running into issues after trying with pyroute2, can you also
> >try running tdc with the following diff? It disables the parallel testing for
> >kselftests.
> >
> >Perhaps when pyroute2 is not detected on the system we should disable the
> >parallel testing completely. tc/ip commands are too susceptible to these
> >issues and parallel testing just amplifies them.
> >
> >diff --git a/tools/testing/selftests/tc-testing/tdc.sh
> >b/tools/testing/selftests/tc-testing/tdc.sh
> >index c53ede8b730d..89bb123db86b 100755
> >--- a/tools/testing/selftests/tc-testing/tdc.sh
> >+++ b/tools/testing/selftests/tc-testing/tdc.sh
> >@@ -63,5 +63,5 @@ try_modprobe sch_hfsc
> > try_modprobe sch_hhf
> > try_modprobe sch_htb
> > try_modprobe sch_teql
> >-./tdc.py -J`nproc` -c actions
> >-./tdc.py -J`nproc` -c qdisc
> >+./tdc.py -J1 -c actions
> >+./tdc.py -J1 -c qdisc
> >
>
> I installed pyroute2. It looks some of the error went away, still having
> a couple:
>
> make: Entering directory '/mnt/share156/jiri/net-next/tools/testing/selftests'
> make[1]: Entering directory '/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing'
> make[1]: Nothing to be done for 'all'.
> make[1]: Leaving directory '/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing'
> make[1]: Entering directory '/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing'
> TAP version 13
> 1..1
> # timeout set to 900
> # selftests: tc-testing: tdc.sh
> # netdevsim
> # act_bpf
> # act_connmark
> # act_csum
> # act_ct
> # act_ctinfo
> # act_gact
> # act_gate
> # act_mirred
> # act_mpls
> # act_nat
> # act_pedit
> # act_police
> # act_sample
> # act_simple
> # act_skbedit
> # act_skbmod
> # act_tunnel_key
> # act_vlan
> # cls_basic
> # cls_bpf
> # cls_cgroup
> # cls_flow
> # cls_flower
> # cls_fw
> # cls_matchall
> # cls_route
> # cls_u32
> # Module em_canid not found... skipping.
> # em_cmp
> # em_ipset
> # em_ipt
> # em_meta
> # em_nbyte
> # em_text
> # em_u32
> # sch_cake
> # sch_cbs
> # sch_choke
> # sch_codel
> # sch_drr
> # sch_etf
> # sch_ets
> # sch_fq
> # sch_fq_codel
> # sch_fq_pie
> # sch_gred
> # sch_hfsc
> # sch_hhf
> # sch_htb
> # sch_teql
> # considering category actions
> #  -- ns/SubPlugin.__init__
> #  -- scapy/SubPlugin.__init__
> # Executing 528 tests in parallel and 15 in serial
> # Using 18 batches and 4 workers
> # ..
> # -----> teardown stage *** Could not execute: "$TC actions flush action xt"
> #
> # -----> teardown stage *** Error message: "Error: Cannot flush unknown TC action.
> # We have an error flushing
> # "
> #
> # -----> teardown stage *** Aborting test run.
> #
> #
> # <_io.BufferedReader name=6> *** stdout ***
> #
> #
> # <_io.BufferedReader name=17> *** stderr ***
> #   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 536, in test_runner
> #     res = run_one_test(pm, args, index, tidx)
> #           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> #   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 465, in run_one_test
> #     prepare_env(tidx, args, pm, 'teardown', '-----> teardown stage', tidx['teardown'], procout)
> #   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 268, in prepare_env
> #     raise PluginMgrTestFail(
> # Test 5232: List ctinfo actions
> # Test 7702: Flush ctinfo actions
> # Test 3201: Add ctinfo action with duplicate index
> # Test 8295: Add ctinfo action with invalid index
> # Test 3964: Replace ctinfo action with invalid goto_chain control
> # Test 5124: Add mirred mirror to egress action
> # Test 6fb4: Add mirred redirect to egress action
> # Test ba38: Get mirred actions
> # Test d7c0: Add invalid mirred direction
> # Test e213: Add invalid mirred action
> # Test 2d89: Add mirred action with invalid device
> # Test 300b: Add mirred action with duplicate index
> # Test 8917: Add mirred mirror action with control pass
> # Test 1054: Add mirred mirror action with control pipe
> # Test 9887: Add mirred mirror action with control continue
> # Test e4aa: Add mirred mirror action with control reclassify
> # Test ece9: Add mirred mirror action with control drop
> # Test 0031: Add mirred mirror action with control jump
> # Test 407c: Add mirred mirror action with cookie
> # Test 8b69: Add mirred mirror action with index at 32-bit maximum
> # Test 3f66: Add mirred mirror action with index exceeding 32-bit maximum
> # Test a70e: Delete mirred mirror action
> # Test 3fb3: Delete mirred redirect action
> # Test 2a9a: Replace mirred action with invalid goto chain control
> # Test 4749: Add batch of 32 mirred redirect egress actions with cookie
> # Test 5c69: Delete batch of 32 mirred redirect egress actions
> # Test d3c0: Add batch of 32 mirred mirror ingress actions with cookie
> # Test e684: Delete batch of 32 mirred mirror ingress actions
> # Test 31e3: Add mirred mirror to egress action with no_percpu flag
> # Test 6d84: Add csum iph action
> # Test 1862: Add csum ip4h action
> # Test 15c6: Add csum ipv4h action
> # Test bf47: Add csum icmp action
> # Test cc1d: Add csum igmp action
> # Test bccc: Add csum foobar action
> # Test 3bb4: Add csum tcp action
> # Test 759c: Add csum udp action
> # Test bdb6: Add csum udp xor iph action
> # Test c220: Add csum udplite action
> # Test 8993: Add csum sctp action
> # Test b138: Add csum ip & icmp action
> # Test eeda: Add csum ip & sctp action
> # Test 0017: Add csum udp or tcp action
> # Test b10b: Add all 7 csum actions
> # Test ce92: Add csum udp action with cookie
> # Test 912f: Add csum icmp action with large cookie
> # Test 879b: Add batch of 32 csum tcp actions
> # Test b4e9: Delete batch of 32 csum actions
> # Test 0015: Add batch of 32 csum tcp actions with large cookies
> # Test 989e: Delete batch of 32 csum actions with large cookies
> # Test d128: Replace csum action with invalid goto chain control
> # Test eaf0: Add csum iph action with no_percpu flag
> # Test a933: Add MPLS dec_ttl action with pipe opcode
> # Test 08d1: Add mpls dec_ttl action with pass opcode
> # Test d786: Add mpls dec_ttl action with drop opcode
> # Test f334: Add mpls dec_ttl action with reclassify opcode
> # Test 29bd: Add mpls dec_ttl action with continue opcode
> # Test 48df: Add mpls dec_ttl action with jump opcode
> # Test 62eb: Add mpls dec_ttl action with trap opcode
> # Test 09d2: Add mpls dec_ttl action with opcode and cookie
> # Test c170: Add mpls dec_ttl action with opcode and cookie of max length
> # Test 9118: Add mpls dec_ttl action with invalid opcode
> # Test 6ce1: Add mpls dec_ttl action with label (invalid)
> # Test 352f: Add mpls dec_ttl action with tc (invalid)
> # Test 1c3a: Flush police actions
> # Test 7326: Add police action with control continue
> # Test 34fa: Add police action with control drop
> # Test 8dd5: Add police action with control ok
> # Test b9d1: Add police action with control reclassify
> # Test c534: Add police action with control pipe
> # Test b48b: Add police action with exceed goto chain control action
> # Test 689e: Replace police action with invalid goto chain control
> # Test cdd7: Add valid police action with packets per second rate limit
> # Test f5bc: Add invalid police action with both bps and pps
> # Test 7d64: Add police action with skip_hw option
> # Test 7d50: Add skbmod action to set destination mac
> # Test 9b29: Add skbmod action to set source mac
> # Test 1724: Add skbmod action with invalid mac
> # Test 3cf1: Add skbmod action with valid etype
> # Test a749: Add skbmod action with invalid etype
> # Test bfe6: Add skbmod action to swap mac
> # Test 839b: Add skbmod action with control pipe
> # Test c167: Add skbmod action with control reclassify
> # Test 0c2f: Add skbmod action with control drop
> # Test d113: Add skbmod action with control continue
> # Test 7242: Add skbmod action with control pass
> # Test 6046: Add skbmod action with control reclassify and cookie
> # Test 58cb: List skbmod actions
> # Test 9aa8: Get a single skbmod action from a list
> # Test e93a: Delete an skbmod action
> # Test 40c2: Flush skbmod actions
> # Test b651: Replace skbmod action with invalid goto_chain control
> # Test fe09: Add skbmod action to mark ECN bits
> # Test 696a: Add simple ct action
> # Test e38c: Add simple ct action with cookie
> # Test 9f20: Add ct clear action
> # Test 0bc1: Add ct clear action with cookie of max length
> # Test 5bea: Try ct with zone
> # Test d5d6: Try ct with zone, commit
> # Test 029f: Try ct with zone, commit, mark
> # Test a58d: Try ct with zone, commit, mark, nat
> # Test 901b: Try ct with full nat ipv4 range syntax
> # Test 072b: Try ct with full nat ipv6 syntax
> # Test 3420: Try ct with full nat ipv6 range syntax
> # Test 4470: Try ct with full nat ipv6 range syntax + force
> # Test 5d88: Try ct with label
> # Test 04d4: Try ct with label with mask
> # Test 9751: Try ct with mark + mask
> # Test 2faa: Try ct with mark + mask and cookie
> # Test 3991: Add simple ct action with no_percpu flag
> # Test 3992: Add ct action triggering DNAT tuple conflict
> #
> # Sent 1 packets.
> #
> # Sent 1 packets.
> # Test 2029: Add xt action with log-prefix
> # exit: 255
> # exit: 0
> # Warning: Extension LOG revision 0 not supported, missing kernel module?
> # Error: Failed to load TC action module.
> # We have an error talking to the kernel
> #
> # returncode 1; expected [0]
> # "-----> teardown stage" did not complete successfully
> # Exception <class '__main__.PluginMgrTestFail'> ('teardown', 'Warning: Extension LOG revision 0 not supported, missing kernel module?\nError: Failed to load TC action module.\nWe have an error talking to the kernel\n', '"-----> teardown stage" did not complete successfully') (caught in test_runner, running test 17 2029 Add xt action with log-prefix stage teardown)
> # ---------------
> # traceback
> # ---------------
> # accumulated output for this test:
> # Warning: Extension LOG revision 0 not supported, missing kernel module?
> # Error: Failed to load TC action module.
> # We have an error talking to the kernel
> #
> # ---------------
> # Test a5a7: Add pedit action with LAYERED_OP eth set src
> # exception (34, 'Numerical result out of range') in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
> # Test 5a31: Add vlan modify action for protocol 802.1AD
> # exception (34, 'Numerical result out of range') in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
> # Test 8eb5: Create valid ife encode action with tcindex and continue control
> # exception (34, 'Numerical result out of range') in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
> # Test a874: Add invalid sample action without mandatory arguments
> # exception (34, 'Numerical result out of range') in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
> # Test ac2a: List actions
> # Test 3edf: Flush gact actions
> # Test 63ec: Delete pass action
> # Test 46be: Delete pipe action
> # Test 2e08: Delete reclassify action
> # Test 99c4: Delete drop action
> # Test fb6b: Delete continue action
> # Test 0eb3: Delete non-existent action
> # Test f02c: Replace gact action
> # Test 525f: Get gact action by index
> # Test 1021: Add batch of 32 gact pass actions
> # Test da7a: Add batch of 32 gact continue actions with cookie
> # Test 8aa3: Delete batch of 32 gact continue actions
> # Test 8e47: Add gact action with random determ goto chain control action
> # Test ca89: Replace gact action with invalid goto chain control
> # Test 95ad: Add gact pass action with no_percpu flag
> # Test 7f52: Try to flush action which is referenced by filter
> # Test ae1e: Try to flush actions when last one is referenced by filter
> # Test 2b11: Add tunnel_key set action with mandatory parameters
> # Test dc6b: Add tunnel_key set action with missing mandatory src_ip parameter
> # Test 7f25: Add tunnel_key set action with missing mandatory dst_ip parameter
> # Test a5e0: Add tunnel_key set action with invalid src_ip parameter
> # Test eaa8: Add tunnel_key set action with invalid dst_ip parameter
> # Test 3b09: Add tunnel_key set action with invalid id parameter
> # Test 9625: Add tunnel_key set action with invalid dst_port parameter
> # Test 05af: Add tunnel_key set action with optional dst_port parameter
> # Test da80: Add tunnel_key set action with index at 32-bit maximum
> # Test d407: Add tunnel_key set action with index exceeding 32-bit maximum
> # Test 5cba: Add tunnel_key set action with id value at 32-bit maximum
> # Test e84a: Add tunnel_key set action with id value exceeding 32-bit maximum
> # Test 9c19: Add tunnel_key set action with dst_port value at 16-bit maximum
> # Test 3bd9: Add tunnel_key set action with dst_port value exceeding 16-bit maximum
> # Test 68e2: Add tunnel_key unset action
> # Test 6192: Add tunnel_key unset continue action
> # Test 061d: Add tunnel_key set continue action with cookie
> # Test 8acb: Add tunnel_key set continue action with invalid cookie
> # Test a07e: Add tunnel_key action with no set/unset command specified
> # Test b227: Add tunnel_key action with csum option
> # Test 58a7: Add tunnel_key action with nocsum option
> # Test 2575: Add tunnel_key action with not-supported parameter
> # Test 7a88: Add tunnel_key action with cookie parameter
> # Test 4f20: Add tunnel_key action with a single geneve option parameter
> # Test e33d: Add tunnel_key action with multiple geneve options parameter
> # Test 0778: Add tunnel_key action with invalid class geneve option parameter
> # Test 4ae8: Add tunnel_key action with invalid type geneve option parameter
> # Test 4039: Add tunnel_key action with short data length geneve option parameter
> # Test 26a6: Add tunnel_key action with non-multiple of 4 data length geneve option parameter
> # Test f44d: Add tunnel_key action with incomplete geneve options parameter
> # Test 7afc: Replace tunnel_key set action with all parameters
> # Test 364d: Replace tunnel_key set action with all parameters and cookie
> # Test 937c: Fetch all existing tunnel_key actions
> # Test 6783: Flush all existing tunnel_key actions
> # Test 8242: Replace tunnel_key set action with invalid goto chain
> # Test 0cd2: Add tunnel_key set action with no_percpu flag
> # Test 3671: Delete tunnel_key set action with valid index
> # Test 8597: Delete tunnel_key set action with invalid index
> # Test 6bda: Add tunnel_key action with nofrag option
> # Test c826: Add ctinfo action with default setting
> # Test 0286: Add ctinfo action with dscp
> # Test 4938: Add ctinfo action with valid cpmark and zone
> # Test 7593: Add ctinfo action with drop control
> # Test 2961: Replace ctinfo action zone and action control
> # Test e567: Delete ctinfo action with valid index
> # Test 6a91: Delete ctinfo action with invalid index
> # Test d959: Add cBPF action with valid bytecode
> # Test f84a: Add cBPF action with invalid bytecode
> # Test e939: Add eBPF action with valid object-file
> # Test 282d: Add eBPF action with invalid object-file
> # Test d819: Replace cBPF bytecode and action control
> # Test 6ae3: Delete cBPF action
> # Test 3e0d: List cBPF actions
> # Test 55ce: Flush BPF actions
> # Test ccc3: Add cBPF action with duplicate index
> # Test 89c7: Add cBPF action with invalid index
> # Test 7ab9: Add cBPF action with cookie
> # Test b8a1: Replace bpf action with invalid goto_chain control
> # Test 2893: Add pedit action with RAW_OP offset u8 quad
> # Test 6795: Add pedit action with LAYERED_OP ip6 set payload_len, nexthdr, hoplimit
> # Test cfcc: Add pedit action with LAYERED_OP tcp flags set
> # Test b078: Add simple action
> # Test 4297: Add simple action with change command
> # Test 6d4c: Add simple action with duplicate index
> # Test 2542: List simple actions
> # Test ea67: Delete simple action
> # Test 8ff1: Flush simple actions
> # Test b776: Replace simple action with invalid goto chain control
> # Test 8d07: Verify cleanup of failed actions batch add
> # Test a68a: Verify cleanup of failed actions batch change
> # Test 49aa: Add valid basic police action
> # Test 3abe: Add police action with duplicate index
> # Test 49fa: Add valid police action with mtu
> # Test 7943: Add valid police action with peakrate
> # Test 055e: Add police action with peakrate and no mtu
> # Test f057: Add police action with valid overhead
> # Test 7ffb: Add police action with ethernet linklayer type
> # Test 3dda: Add police action with atm linklayer type
> # Test 551b: Add police actions with conform-exceed control continue/drop
> # Test 0c70: Add police actions with conform-exceed control pass/reclassify
> # Test d946: Add police actions with conform-exceed control pass/pipe
> # Test ddd6: Add police action with invalid rate value
> # Test f61c: Add police action with invalid burst value
> # Test 6aaf: Add police actions with conform-exceed control pass/pipe [with numeric values]
> # Test 29b1: Add police actions with conform-exceed control <invalid>/drop
> # Test c26f: Add police action with invalid peakrate value
> # Test db04: Add police action with invalid mtu value
> # Test f3c9: Add police action with cookie
> # Test d190: Add police action with maximum index
> # Test 336e: Delete police action
> # Test 77fa: Get single police action from many actions
> # Test aa43: Get single police action without specifying index
> # Test 858b: List police actions
> # Test fa1c: Add mpls dec_ttl action with ttl (invalid)
> # Test 6b79: Add mpls dec_ttl action with bos (invalid)
> # Test d4c4: Add mpls pop action with ip proto
> # Test 91fb: Add mpls pop action with ip proto and cookie
> # Test 92fe: Add mpls pop action with mpls proto
> # Test 7e23: Add mpls pop action with no protocol (invalid)
> # Test 6182: Add mpls pop action with label (invalid)
> # Test 6475: Add mpls pop action with tc (invalid)
> # Test 067b: Add mpls pop action with ttl (invalid)
> # Test 7316: Add mpls pop action with bos (invalid)
> # Test 38cc: Add mpls push action with label
> # Test c281: Add mpls push action with mpls_mc protocol
> # Test 5db4: Add mpls push action with label, tc and ttl
> # Test 7c34: Add mpls push action with label, tc ttl and cookie of max length
> # Test 16eb: Add mpls push action with label and bos
> # Test d69d: Add mpls push action with no label (invalid)
> # Test e8e4: Add mpls push action with ipv4 protocol (invalid)
> # Test ecd0: Add mpls push action with out of range label (invalid)
> # Test d303: Add mpls push action with out of range tc (invalid)
> # Test fd6e: Add mpls push action with ttl of 0 (invalid)
> # Test 19e9: Add mpls mod action with mpls label
> # Test 1fde: Add mpls mod action with max mpls label
> # Test 0c50: Add mpls mod action with mpls label exceeding max (invalid)
> # Test 10b6: Add mpls mod action with mpls label of MPLS_LABEL_IMPLNULL (invalid)
> # Test 57c9: Add mpls mod action with mpls min tc
> # Test 6872: Add mpls mod action with mpls max tc
> # Test a70a: Add mpls mod action with mpls tc exceeding max (invalid)
> # Test 6ed5: Add mpls mod action with mpls ttl
> # Test 77c1: Add mpls mod action with mpls ttl and cookie
> # Test b80f: Add mpls mod action with mpls max ttl
> # Test 8864: Add mpls mod action with mpls min ttl
> # Test 6c06: Add mpls mod action with mpls ttl of 0 (invalid)
> # Test b5d8: Add mpls mod action with mpls ttl exceeding max (invalid)
> # Test 451f: Add mpls mod action with mpls max bos
> # Test a1ed: Add mpls mod action with mpls min bos
> # Test 3dcf: Add mpls mod action with mpls bos exceeding max (invalid)
> # Test db7c: Add mpls mod action with protocol (invalid)
> # Test b070: Replace existing mpls push action with new ID
> # Test 95a9: Replace existing mpls push action with new label, tc, ttl and cookie
> # Test 6cce: Delete mpls pop action
> # Test d138: Flush mpls actions
> # Test 319a: Add pedit action that mangles IP TTL
> # Test 7e67: Replace pedit action with invalid goto chain
> # Test 377e: Add pedit action with RAW_OP offset u32
> # Test a0ca: Add pedit action with RAW_OP offset u32 (INVALID)
> # Test dd8a: Add pedit action with RAW_OP offset u16 u16
> # Test 53db: Add pedit action with RAW_OP offset u16 (INVALID)
> # Test 5c7e: Add pedit action with RAW_OP offset u8 add value
> # Test 3a07: Add pedit action with RAW_OP offset u8-u16-u8
> # Test ab0f: Add pedit action with RAW_OP offset u16-u8-u8
> # Test 9d12: Add pedit action with RAW_OP offset u32 set u16 clear u8 invert
> # Test ebfa: Add pedit action with RAW_OP offset overflow u32 (INVALID)
> # Test f512: Add pedit action with RAW_OP offset u16 at offmask shift set
> # Test c2cb: Add pedit action with RAW_OP offset u32 retain value
> # Test 1762: Add pedit action with RAW_OP offset u8 clear value
> # Test bcee: Add pedit action with RAW_OP offset u8 retain value
> # Test e89f: Add pedit action with RAW_OP offset u16 retain value
> # Test c282: Add pedit action with RAW_OP offset u32 clear value
> # Test c422: Add pedit action with RAW_OP offset u16 invert value
> # Test d3d3: Add pedit action with RAW_OP offset u32 invert value
> # Test 57e5: Add pedit action with RAW_OP offset u8 preserve value
> # Test 99e0: Add pedit action with RAW_OP offset u16 preserve value
> # Test 1892: Add pedit action with RAW_OP offset u32 preserve value
> # Test 4b60: Add pedit action with RAW_OP negative offset u16/u32 set value
> # multiprocessing.pool.RemoteTraceback:
> # """
> # Traceback (most recent call last):
> #   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 142, in call_pre_case
> #     pgn_inst.pre_case(caseinfo, test_skip)
> #   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/plugin-lib/nsPlugin.py", line 63, in pre_case
> #     self.prepare_test(test)
> #   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/plugin-lib/nsPlugin.py", line 36, in prepare_test
> #     self._nl_ns_create()
> #   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/plugin-lib/nsPlugin.py", line 130, in _nl_ns_create
> #     ip.link('add', ifname=dev1, kind='veth', peer={'ifname': dev0, 'net_ns_fd':'/proc/1/ns/net'})
> #   File "/usr/local/lib/python3.11/site-packages/pyroute2/iproute/linux.py", line 1696, in link
> #     ret = self.nlm_request(msg, msg_type=msg_type, msg_flags=msg_flags)
> #           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> #   File "/usr/local/lib/python3.11/site-packages/pyroute2/netlink/nlsocket.py", line 870, in nlm_request
> #     return tuple(self._genlm_request(*argv, **kwarg))
> #            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> #   File "/usr/local/lib/python3.11/site-packages/pyroute2/netlink/nlsocket.py", line 1214, in nlm_request
> #     for msg in self.get(
> #                ^^^^^^^^^
> #   File "/usr/local/lib/python3.11/site-packages/pyroute2/netlink/nlsocket.py", line 873, in get
> #     return tuple(self._genlm_get(*argv, **kwarg))
> #            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> #   File "/usr/local/lib/python3.11/site-packages/pyroute2/netlink/nlsocket.py", line 550, in get
> #     raise msg['header']['error']
> # pyroute2.netlink.exceptions.NetlinkError: (34, 'Numerical result out of range')
> #
> # During handling of the above exception, another exception occurred:
> #
> # Traceback (most recent call last):
> #   File "/usr/lib64/python3.11/multiprocessing/pool.py", line 125, in worker
> #     result = (True, func(*args, **kwds))
> #                     ^^^^^^^^^^^^^^^^^^^
> #   File "/usr/lib64/python3.11/multiprocessing/pool.py", line 48, in mapstar
> #     return list(map(*args))
> #            ^^^^^^^^^^^^^^^^
> #   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 602, in __mp_runner
> #     (_, tsr) = test_runner(mp_pm, mp_args, tests)
> #                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> #   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 536, in test_runner
> #     res = run_one_test(pm, args, index, tidx)
> #           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> #   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 419, in run_one_test
> #     pm.call_pre_case(tidx)
> #   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 146, in call_pre_case
> #     print('test_ordinal is {}'.format(test_ordinal))
> #                                       ^^^^^^^^^^^^
> # NameError: name 'test_ordinal' is not defined
> # """
> #
> # The above exception was the direct cause of the following exception:
> #
> # Traceback (most recent call last):
> #   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 1028, in <module>
> #     main()
> #   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 1022, in main
> #     set_operation_mode(pm, parser, args, remaining)
> #   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 964, in set_operation_mode
> #     catresults = test_runner_mp(pm, args, alltests)
> #                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> #   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 624, in test_runner_mp
> #     pres = p.map(__mp_runner, batches)
> #            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
> #   File "/usr/lib64/python3.11/multiprocessing/pool.py", line 367, in map
> #     return self._map_async(func, iterable, mapstar, chunksize).get()
> #            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> #   File "/usr/lib64/python3.11/multiprocessing/pool.py", line 774, in get
> #     raise self._value
> # NameError: name 'test_ordinal' is not defined
> # considering category qdisc
> #  -- scapy/SubPlugin.__init__
> #  -- ns/SubPlugin.__init__
> # Executing 294 tests in parallel and 33 in serial
> # Using 11 batches and 4 workers
> #
> # -----> prepare stage *** Could not execute: "$IP link add dev $ETH type dummy"
> #
> # -----> prepare stage *** Error message: "RTNETLINK answers: File exists
> # "
> #
> # -----> prepare stage *** Aborting test run.
> #
> #
> # <_io.BufferedReader name=6> *** stdout ***
> #
> #
> # <_io.BufferedReader name=15> *** stderr ***
> #   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 536, in test_runner
> #     res = run_one_test(pm, args, index, tidx)
> #           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> #   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 420, in run_one_test
> #     prepare_env(tidx, args, pm, 'setup', "-----> prepare stage", tidx["setup"])
> #   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 268, in prepare_env
> #     raise PluginMgrTestFail(
> # ......................Test 30c4: Add ETS qdisc with quanta + priomap
> # Test e8ac: Add ETS qdisc with strict + priomap
> # Test 5a7e: Add ETS qdisc with quanta + strict + priomap
> # Test cb8b: Show ETS class :1
> # Test 1b4e: Show ETS class :2
> # Test f642: Show ETS class :3
> # Test 0a5f: Show ETS strict class
> # Test f7c8: Add ETS qdisc with too many quanta
> # Test 2389: Add ETS qdisc with too many strict
> # Test fe3c: Add ETS qdisc with too many strict + quanta
> # Test cb04: Add ETS qdisc with excess priomap elements
> # Test c32e: Add ETS qdisc with priomap above bands
> # Test 744c: Add ETS qdisc with priomap above quanta
> # Test 7b33: Add ETS qdisc with priomap above strict
> # Test dbe6: Add ETS qdisc with priomap above strict + quanta
> # Test bdb2: Add ETS qdisc with priomap within bands with strict + quanta
> # Test 39a3: Add ETS qdisc with priomap above bands with strict + quanta
> # Test 557c: Unset priorities default to the last band
> # Test a347: Unset priorities default to the last band -- no priomap
> # Test 39c4: Add ETS qdisc with too few bands
> # Test 930b: Add ETS qdisc with too many bands
> # Test 406a: Add ETS qdisc without parameters
> # Test e51a: Zero element in quanta
> # Test e7f2: Sole zero element in quanta
> # Test d6e6: No values after the quanta keyword
> # Test 28c6: Change ETS band quantum
> # Test 4714: Change ETS band without quantum
> # Test 6979: Change quantum of a strict ETS band
> # Test 9a7d: Change ETS strict band without quantum
> # Test 283e: Create skbprio with default setting
> # Test c086: Create skbprio with limit setting
> # Test 6733: Change skbprio with limit setting
> # Test fe0f: Add prio qdisc on egress with 4 bands and priomap exceeding TC_PRIO_MAX entries
> # Test 1f91: Add prio qdisc on egress with 4 bands and priomap's values exceeding bands number
> # Test d248: Add prio qdisc on egress with invalid bands value (< 2)
> # Test 1d0e: Add prio qdisc on egress with invalid bands value exceeding TCQ_PRIO_BANDS
> # Test 1971: Replace default prio qdisc on egress with 8 bands and new priomap
> # Test d88a: Add duplicate prio qdisc on egress
> # Test 5948: Delete nonexistent prio qdisc
> # Test 6c0a: Add prio qdisc on egress with invalid format for handles
> # Test 0175: Delete prio qdisc twice
> # Test 2410: Show prio class
> # Test 0385: Create DRR with default setting
> # Test 2375: Delete DRR with handle
> # Test 3092: Show DRR class
> # Test e90e: Add ETS qdisc using bands
> # Test b059: Add ETS qdisc using quanta
> # Test e8e7: Add ETS qdisc using strict
> # Test 233c: Add ETS qdisc using bands + quanta
> # Test 3d35: Add ETS qdisc using bands + strict
> # Test 7f3b: Add ETS qdisc using strict + quanta
> # Test 4593: Add ETS qdisc using strict 0 + quanta
> # Test 8938: Add ETS qdisc using bands + strict + quanta
> # Test 0782: Add ETS qdisc with more bands than quanta
> # Test 501b: Add ETS qdisc with more bands than strict
> # Test 671a: Add ETS qdisc with more bands than strict + quanta
> # Test 2a23: Add ETS qdisc with 16 bands
> # Test 8daf: Add ETS qdisc with 17 bands
> # Test 7f95: Add ETS qdisc with 17 strict
> # Test 837a: Add ETS qdisc with 16 quanta
> # Test 65b6: Add ETS qdisc with 17 quanta
> # Test b9e9: Add ETS qdisc with 16 strict + quanta
> # Test 9877: Add ETS qdisc with 17 strict + quanta
> # Test c696: Add ETS qdisc with priomap
> # Test 20ba: Add multiq Qdisc to multi-queue device (8 queues)
> # Test 4301: List multiq Class
> # Test 7832: Delete nonexistent multiq Qdisc
> # Test 2891: Delete multiq Qdisc twice
> # Test 1329: Add multiq Qdisc to single-queue device
> # Test 84a0: Create TEQL with default setting
> # Test 7734: Create TEQL with multiple device
> # returncode 2; expected [0]
> # "-----> prepare stage" did not complete successfully
> # Exception <class '__main__.PluginMgrTestFail'> ('setup', None, '"-----> prepare stage" did not complete successfully') (caught in test_runner, running test 8 7734 Create TEQL with multiple device stage setup)
> # ---------------
> # traceback
> # ---------------
> # ---------------
> # Test 2958: Show skbprio class
> # exception (34, 'Numerical result out of range') in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
> # Test 0301: Change CHOKE with limit setting
> # exception (34, 'Numerical result out of range') in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
> # Test 2385: Create CAKE with besteffort flag
> # exception (34, 'Numerical result out of range') in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
> # Test 0521: Show ingress class
> # exception (34, 'Numerical result out of range') in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
> # Test b190: Create FQ_CODEL with noecn flag
> # exception (34, 'Numerical result out of range') in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
> # Test 5439: Create NETEM with multiple slot setting
> # exception (34, 'Numerical result out of range') in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
> # Test 30a9: Create SFB with increment setting
> # exception (34, 'Numerical result out of range') in call to pre_case for <class 'plugin-lib.nsPlugin.SubPlugin'> plugin
> # multiprocessing.pool.RemoteTraceback:
> # """
> # Traceback (most recent call last):
> #   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 142, in call_pre_case
> #     pgn_inst.pre_case(caseinfo, test_skip)
> #   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/plugin-lib/nsPlugin.py", line 63, in pre_case
> #     self.prepare_test(test)
> #   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/plugin-lib/nsPlugin.py", line 36, in prepare_test
> #     self._nl_ns_create()
> #   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/plugin-lib/nsPlugin.py", line 130, in _nl_ns_create
> #     ip.link('add', ifname=dev1, kind='veth', peer={'ifname': dev0, 'net_ns_fd':'/proc/1/ns/net'})
> #   File "/usr/local/lib/python3.11/site-packages/pyroute2/iproute/linux.py", line 1696, in link
> #     ret = self.nlm_request(msg, msg_type=msg_type, msg_flags=msg_flags)
> #           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> #   File "/usr/local/lib/python3.11/site-packages/pyroute2/netlink/nlsocket.py", line 870, in nlm_request
> #     return tuple(self._genlm_request(*argv, **kwarg))
> #            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> #   File "/usr/local/lib/python3.11/site-packages/pyroute2/netlink/nlsocket.py", line 1214, in nlm_request
> #     for msg in self.get(
> #                ^^^^^^^^^
> #   File "/usr/local/lib/python3.11/site-packages/pyroute2/netlink/nlsocket.py", line 873, in get
> #     return tuple(self._genlm_get(*argv, **kwarg))
> #            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> #   File "/usr/local/lib/python3.11/site-packages/pyroute2/netlink/nlsocket.py", line 550, in get
> #     raise msg['header']['error']
> # pyroute2.netlink.exceptions.NetlinkError: (34, 'Numerical result out of range')
> #
> # During handling of the above exception, another exception occurred:
> #
> # Traceback (most recent call last):
> #   File "/usr/lib64/python3.11/multiprocessing/pool.py", line 125, in worker
> #     result = (True, func(*args, **kwds))
> #                     ^^^^^^^^^^^^^^^^^^^
> #   File "/usr/lib64/python3.11/multiprocessing/pool.py", line 48, in mapstar
> #     return list(map(*args))
> #            ^^^^^^^^^^^^^^^^
> #   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 602, in __mp_runner
> #     (_, tsr) = test_runner(mp_pm, mp_args, tests)
> #                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> #   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 536, in test_runner
> #     res = run_one_test(pm, args, index, tidx)
> #           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> #   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 419, in run_one_test
> #     pm.call_pre_case(tidx)
> #   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 146, in call_pre_case
> #     print('test_ordinal is {}'.format(test_ordinal))
> #                                       ^^^^^^^^^^^^
> # NameError: name 'test_ordinal' is not defined
> # """
> #
> # The above exception was the direct cause of the following exception:
> #
> # Traceback (most recent call last):
> #   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 1028, in <module>
> #     main()
> #   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 1022, in main
> #     set_operation_mode(pm, parser, args, remaining)
> #   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 964, in set_operation_mode
> #     catresults = test_runner_mp(pm, args, alltests)
> #                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> #   File "/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing/./tdc.py", line 624, in test_runner_mp
> #     pres = p.map(__mp_runner, batches)
> #            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
> #   File "/usr/lib64/python3.11/multiprocessing/pool.py", line 367, in map
> #     return self._map_async(func, iterable, mapstar, chunksize).get()
> #            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> #   File "/usr/lib64/python3.11/multiprocessing/pool.py", line 774, in get
> #     raise self._value
> # NameError: name 'test_ordinal' is not defined
> not ok 1 selftests: tc-testing: tdc.sh # exit=1
> make[1]: Leaving directory '/mnt/share156/jiri/net-next/tools/testing/selftests/tc-testing'
> make: Leaving directory '/mnt/share156/jiri/net-next/tools/testing/selftests'

Hrm. I may have forgotten to send parts when deleting ipt. Does this
patchlet help?

cheers,
jamal
Ido Schimmel Jan. 10, 2024, 12:09 p.m. UTC | #12
On Thu, Jan 04, 2024 at 01:58:44PM +0100, Jiri Pirko wrote:
> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> index adf5de1ff773..253b26f2eddd 100644
> --- a/net/sched/cls_api.c
> +++ b/net/sched/cls_api.c
> @@ -1428,6 +1428,7 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
>  		      struct tcf_block_ext_info *ei,
>  		      struct netlink_ext_ack *extack)
>  {
> +	struct net_device *dev = qdisc_dev(q);
>  	struct net *net = qdisc_net(q);
>  	struct tcf_block *block = NULL;
>  	int err;
> @@ -1461,9 +1462,18 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
>  	if (err)
>  		goto err_block_offload_bind;
>  
> +	if (tcf_block_shared(block)) {
> +		err = xa_insert(&block->ports, dev->ifindex, dev, GFP_KERNEL);
> +		if (err) {
> +			NL_SET_ERR_MSG(extack, "block dev insert failed");
> +			goto err_dev_insert;
> +		}
> +	}

While this patch fixes the original issue, it creates another one:

# ip link add name swp1 type dummy
# tc qdisc replace dev swp1 root handle 10: prio bands 8 priomap 7 6 5 4 3 2 1
# tc qdisc add dev swp1 parent 10:8 handle 108: red limit 1000000 min 200000 max 200001 probability 1.0 avpkt 8000 burst 38 qevent early_drop block 10
RED: set bandwidth to 10Mbit
# tc qdisc add dev swp1 parent 10:7 handle 107: red limit 1000000 min 500000 max 500001 probability 1.0 avpkt 8000 burst 63 qevent early_drop block 10
RED: set bandwidth to 10Mbit
Error: block dev insert failed.

The reproducer does not fail if I revert this patch and apply Victor's
[1] instead.

[1] https://lore.kernel.org/netdev/20231231172320.245375-1-victor@mojatatu.com/

> +
>  	*p_block = block;
>  	return 0;
>  
> +err_dev_insert:
>  err_block_offload_bind:
>  	tcf_chain0_head_change_cb_del(block, ei);
>  err_chain0_head_change_cb_add:
Jiri Pirko Jan. 10, 2024, 2:08 p.m. UTC | #13
Wed, Jan 10, 2024 at 01:09:55PM CET, idosch@idosch.org wrote:
>On Thu, Jan 04, 2024 at 01:58:44PM +0100, Jiri Pirko wrote:
>> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
>> index adf5de1ff773..253b26f2eddd 100644
>> --- a/net/sched/cls_api.c
>> +++ b/net/sched/cls_api.c
>> @@ -1428,6 +1428,7 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
>>  		      struct tcf_block_ext_info *ei,
>>  		      struct netlink_ext_ack *extack)
>>  {
>> +	struct net_device *dev = qdisc_dev(q);
>>  	struct net *net = qdisc_net(q);
>>  	struct tcf_block *block = NULL;
>>  	int err;
>> @@ -1461,9 +1462,18 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
>>  	if (err)
>>  		goto err_block_offload_bind;
>>  
>> +	if (tcf_block_shared(block)) {
>> +		err = xa_insert(&block->ports, dev->ifindex, dev, GFP_KERNEL);
>> +		if (err) {
>> +			NL_SET_ERR_MSG(extack, "block dev insert failed");
>> +			goto err_dev_insert;
>> +		}
>> +	}
>
>While this patch fixes the original issue, it creates another one:
>
># ip link add name swp1 type dummy
># tc qdisc replace dev swp1 root handle 10: prio bands 8 priomap 7 6 5 4 3 2 1
># tc qdisc add dev swp1 parent 10:8 handle 108: red limit 1000000 min 200000 max 200001 probability 1.0 avpkt 8000 burst 38 qevent early_drop block 10
>RED: set bandwidth to 10Mbit
># tc qdisc add dev swp1 parent 10:7 handle 107: red limit 1000000 min 500000 max 500001 probability 1.0 avpkt 8000 burst 63 qevent early_drop block 10
>RED: set bandwidth to 10Mbit
>Error: block dev insert failed.
>
>The reproducer does not fail if I revert this patch and apply Victor's
>[1] instead.
>
>[1] https://lore.kernel.org/netdev/20231231172320.245375-1-victor@mojatatu.com/

Will fix. Thanks!

>
>> +
>>  	*p_block = block;
>>  	return 0;
>>  
>> +err_dev_insert:
>>  err_block_offload_bind:
>>  	tcf_chain0_head_change_cb_del(block, ei);
>>  err_chain0_head_change_cb_add:
Jiri Pirko Jan. 10, 2024, 4:17 p.m. UTC | #14
Wed, Jan 10, 2024 at 01:09:55PM CET, idosch@idosch.org wrote:
>On Thu, Jan 04, 2024 at 01:58:44PM +0100, Jiri Pirko wrote:
>> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
>> index adf5de1ff773..253b26f2eddd 100644
>> --- a/net/sched/cls_api.c
>> +++ b/net/sched/cls_api.c
>> @@ -1428,6 +1428,7 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
>>  		      struct tcf_block_ext_info *ei,
>>  		      struct netlink_ext_ack *extack)
>>  {
>> +	struct net_device *dev = qdisc_dev(q);
>>  	struct net *net = qdisc_net(q);
>>  	struct tcf_block *block = NULL;
>>  	int err;
>> @@ -1461,9 +1462,18 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
>>  	if (err)
>>  		goto err_block_offload_bind;
>>  
>> +	if (tcf_block_shared(block)) {
>> +		err = xa_insert(&block->ports, dev->ifindex, dev, GFP_KERNEL);
>> +		if (err) {
>> +			NL_SET_ERR_MSG(extack, "block dev insert failed");
>> +			goto err_dev_insert;
>> +		}
>> +	}
>
>While this patch fixes the original issue, it creates another one:
>
># ip link add name swp1 type dummy
># tc qdisc replace dev swp1 root handle 10: prio bands 8 priomap 7 6 5 4 3 2 1
># tc qdisc add dev swp1 parent 10:8 handle 108: red limit 1000000 min 200000 max 200001 probability 1.0 avpkt 8000 burst 38 qevent early_drop block 10
>RED: set bandwidth to 10Mbit
># tc qdisc add dev swp1 parent 10:7 handle 107: red limit 1000000 min 500000 max 500001 probability 1.0 avpkt 8000 burst 63 qevent early_drop block 10
>RED: set bandwidth to 10Mbit
>Error: block dev insert failed.

Feel free to text following patch. Since I believe it is not possible
to send fixes to net-next now, I will send it once net-next merges
into net.

net: sched: track device in tcf_block_get/put_ext() only for clsact binder types

Clsact/ingress qdisc is not the only one using shared block,
red is also using it. The device tracking was originally introduced
by commit 913b47d3424e ("net/sched: Introduce tc block netdev
tracking infra") for clsact/ingress only. Commit 94e2557d086a ("net:
sched: move block device tracking into tcf_block_get/put_ext()")
mistakenly enabled that for red as well.

Fix that by adding a check for the binder type being clsact when adding
device to the block->ports xarray.

Reported-by: Ido Schimmel <idosch@idosch.org>
Closes: https://lore.kernel.org/all/ZZ6JE0odnu1lLPtu@shredder/
Fixes: 94e2557d086a ("net: sched: move block device tracking into tcf_block_get/put_ext()")
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
---
 net/sched/cls_api.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index e3236a3169c3..92a12e3d0fe6 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -1424,6 +1424,14 @@ static void tcf_block_owner_del(struct tcf_block *block,
 	WARN_ON(1);
 }
 
+static bool tcf_block_tracks_dev(struct tcf_block *block,
+				 struct tcf_block_ext_info *ei)
+{
+	return tcf_block_shared(block) &&
+	       (ei->binder_type == FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS ||
+		ei->binder_type == FLOW_BLOCK_BINDER_TYPE_CLSACT_EGRESS);
+}
+
 int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
 		      struct tcf_block_ext_info *ei,
 		      struct netlink_ext_ack *extack)
@@ -1462,7 +1470,7 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
 	if (err)
 		goto err_block_offload_bind;
 
-	if (tcf_block_shared(block)) {
+	if (tcf_block_tracks_dev(block, ei)) {
 		err = xa_insert(&block->ports, dev->ifindex, dev, GFP_KERNEL);
 		if (err) {
 			NL_SET_ERR_MSG(extack, "block dev insert failed");
@@ -1516,7 +1524,7 @@ void tcf_block_put_ext(struct tcf_block *block, struct Qdisc *q,
 
 	if (!block)
 		return;
-	if (tcf_block_shared(block))
+	if (tcf_block_tracks_dev(block, ei))
 		xa_erase(&block->ports, dev->ifindex);
 	tcf_chain0_head_change_cb_del(block, ei);
 	tcf_block_owner_del(block, q, ei->binder_type);
Ido Schimmel Jan. 11, 2024, 8:51 a.m. UTC | #15
On Wed, Jan 10, 2024 at 05:17:50PM +0100, Jiri Pirko wrote:
> Feel free to text following patch. Since I believe it is not possible
> to send fixes to net-next now, I will send it once net-next merges
> into net.

Looks good, thanks. I will run it through regression to make sure no
other issues pop up.
Jamal Hadi Salim Jan. 11, 2024, 3:40 p.m. UTC | #16
On Wed, Jan 10, 2024 at 7:10 AM Ido Schimmel <idosch@idosch.org> wrote:
>
> On Thu, Jan 04, 2024 at 01:58:44PM +0100, Jiri Pirko wrote:
> > diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> > index adf5de1ff773..253b26f2eddd 100644
> > --- a/net/sched/cls_api.c
> > +++ b/net/sched/cls_api.c
> > @@ -1428,6 +1428,7 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
> >                     struct tcf_block_ext_info *ei,
> >                     struct netlink_ext_ack *extack)
> >  {
> > +     struct net_device *dev = qdisc_dev(q);
> >       struct net *net = qdisc_net(q);
> >       struct tcf_block *block = NULL;
> >       int err;
> > @@ -1461,9 +1462,18 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
> >       if (err)
> >               goto err_block_offload_bind;
> >
> > +     if (tcf_block_shared(block)) {
> > +             err = xa_insert(&block->ports, dev->ifindex, dev, GFP_KERNEL);
> > +             if (err) {
> > +                     NL_SET_ERR_MSG(extack, "block dev insert failed");
> > +                     goto err_dev_insert;
> > +             }
> > +     }
>
> While this patch fixes the original issue, it creates another one:
>
> # ip link add name swp1 type dummy
> # tc qdisc replace dev swp1 root handle 10: prio bands 8 priomap 7 6 5 4 3 2 1
> # tc qdisc add dev swp1 parent 10:8 handle 108: red limit 1000000 min 200000 max 200001 probability 1.0 avpkt 8000 burst 38 qevent early_drop block 10
> RED: set bandwidth to 10Mbit
> # tc qdisc add dev swp1 parent 10:7 handle 107: red limit 1000000 min 500000 max 500001 probability 1.0 avpkt 8000 burst 63 qevent early_drop block 10
> RED: set bandwidth to 10Mbit
> Error: block dev insert failed.
>


+cc Petr
We'll add a testcase on tdc - it doesnt seem we have any for qevents.
If you have others that are related let us know.
But how does this work? I see no mention of block on red code and i
see no mention of block on the reproducer above.
Are the qevents exception packets from the hardware? Is there a good
description of what qevents do?

cheers,
jamal


> The reproducer does not fail if I revert this patch and apply Victor's
> [1] instead.
>
> [1] https://lore.kernel.org/netdev/20231231172320.245375-1-victor@mojatatu.com/
>
> > +
> >       *p_block = block;
> >       return 0;
> >
> > +err_dev_insert:
> >  err_block_offload_bind:
> >       tcf_chain0_head_change_cb_del(block, ei);
> >  err_chain0_head_change_cb_add:
Jamal Hadi Salim Jan. 11, 2024, 3:42 p.m. UTC | #17
On Thu, Jan 11, 2024 at 10:40 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Wed, Jan 10, 2024 at 7:10 AM Ido Schimmel <idosch@idosch.org> wrote:
> >
> > On Thu, Jan 04, 2024 at 01:58:44PM +0100, Jiri Pirko wrote:
> > > diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> > > index adf5de1ff773..253b26f2eddd 100644
> > > --- a/net/sched/cls_api.c
> > > +++ b/net/sched/cls_api.c
> > > @@ -1428,6 +1428,7 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
> > >                     struct tcf_block_ext_info *ei,
> > >                     struct netlink_ext_ack *extack)
> > >  {
> > > +     struct net_device *dev = qdisc_dev(q);
> > >       struct net *net = qdisc_net(q);
> > >       struct tcf_block *block = NULL;
> > >       int err;
> > > @@ -1461,9 +1462,18 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
> > >       if (err)
> > >               goto err_block_offload_bind;
> > >
> > > +     if (tcf_block_shared(block)) {
> > > +             err = xa_insert(&block->ports, dev->ifindex, dev, GFP_KERNEL);
> > > +             if (err) {
> > > +                     NL_SET_ERR_MSG(extack, "block dev insert failed");
> > > +                     goto err_dev_insert;
> > > +             }
> > > +     }
> >
> > While this patch fixes the original issue, it creates another one:
> >
> > # ip link add name swp1 type dummy
> > # tc qdisc replace dev swp1 root handle 10: prio bands 8 priomap 7 6 5 4 3 2 1
> > # tc qdisc add dev swp1 parent 10:8 handle 108: red limit 1000000 min 200000 max 200001 probability 1.0 avpkt 8000 burst 38 qevent early_drop block 10
> > RED: set bandwidth to 10Mbit
> > # tc qdisc add dev swp1 parent 10:7 handle 107: red limit 1000000 min 500000 max 500001 probability 1.0 avpkt 8000 burst 63 qevent early_drop block 10
> > RED: set bandwidth to 10Mbit
> > Error: block dev insert failed.
> >
>
>
> +cc Petr
> We'll add a testcase on tdc - it doesnt seem we have any for qevents.
> If you have others that are related let us know.
> But how does this work? I see no mention of block on red code and i
> see no mention of block on the reproducer above.

Context: Yes, i see it on red setup but i dont see any block being setup.
Also: Is it only Red or other qdiscs could behave this way?

cheers,
jamal
> Are the qevents exception packets from the hardware? Is there a good
> description of what qevents do?
>
> cheers,
> jamal
>
>
> > The reproducer does not fail if I revert this patch and apply Victor's
> > [1] instead.
> >
> > [1] https://lore.kernel.org/netdev/20231231172320.245375-1-victor@mojatatu.com/
> >
> > > +
> > >       *p_block = block;
> > >       return 0;
> > >
> > > +err_dev_insert:
> > >  err_block_offload_bind:
> > >       tcf_chain0_head_change_cb_del(block, ei);
> > >  err_chain0_head_change_cb_add:
Petr Machata Jan. 11, 2024, 4:17 p.m. UTC | #18
Jamal Hadi Salim <jhs@mojatatu.com> writes:

> On Thu, Jan 11, 2024 at 10:40 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>>
>> On Wed, Jan 10, 2024 at 7:10 AM Ido Schimmel <idosch@idosch.org> wrote:
>> >
>> > On Thu, Jan 04, 2024 at 01:58:44PM +0100, Jiri Pirko wrote:
>> > > diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
>> > > index adf5de1ff773..253b26f2eddd 100644
>> > > --- a/net/sched/cls_api.c
>> > > +++ b/net/sched/cls_api.c
>> > > @@ -1428,6 +1428,7 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
>> > >                     struct tcf_block_ext_info *ei,
>> > >                     struct netlink_ext_ack *extack)
>> > >  {
>> > > +     struct net_device *dev = qdisc_dev(q);
>> > >       struct net *net = qdisc_net(q);
>> > >       struct tcf_block *block = NULL;
>> > >       int err;
>> > > @@ -1461,9 +1462,18 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
>> > >       if (err)
>> > >               goto err_block_offload_bind;
>> > >
>> > > +     if (tcf_block_shared(block)) {
>> > > +             err = xa_insert(&block->ports, dev->ifindex, dev, GFP_KERNEL);
>> > > +             if (err) {
>> > > +                     NL_SET_ERR_MSG(extack, "block dev insert failed");
>> > > +                     goto err_dev_insert;
>> > > +             }
>> > > +     }
>> >
>> > While this patch fixes the original issue, it creates another one:
>> >
>> > # ip link add name swp1 type dummy
>> > # tc qdisc replace dev swp1 root handle 10: prio bands 8 priomap 7 6 5 4 3 2 1
>> > # tc qdisc add dev swp1 parent 10:8 handle 108: red limit 1000000 min 200000 max 200001 probability 1.0 avpkt 8000 burst 38 qevent early_drop block 10
>> > RED: set bandwidth to 10Mbit
>> > # tc qdisc add dev swp1 parent 10:7 handle 107: red limit 1000000 min 500000 max 500001 probability 1.0 avpkt 8000 burst 63 qevent early_drop block 10
>> > RED: set bandwidth to 10Mbit
>> > Error: block dev insert failed.
>> >
>>
>>
>> +cc Petr
>> We'll add a testcase on tdc - it doesnt seem we have any for qevents.
>> If you have others that are related let us know.
>> But how does this work? I see no mention of block on red code and i

Look for qe_early_drop and qe_mark in sch_red.c.

>> see no mention of block on the reproducer above.
>
> Context: Yes, i see it on red setup but i dont see any block being setup.

qevents are binding locations for blocks, similar in principle to
clsact's ingress_block / egress_block. So the way to create a block is
the same: just mention the block number for the first time.

What qevents there are depends on the qdisc. They are supposed to
reflect events that are somehow interesting, from the point of view of
an skb within a qdisc. Thus RED has two qevents: early_drop for packets
that were chosen to be, well, dropped early, and mark for packets that
are ECN-marked. So when a packet is, say, early-dropped, the RED qdisc
passes it through the TC block bound at that qevent (if any).

> Also: Is it only Red or other qdiscs could behave this way?

Currently only red supports any qevents at all, but in principle the
mechanism is reusable. With my mlxsw hat on, an obvious next candidate
would be tail_drop on FIFO qdisc.
Jiri Pirko Jan. 11, 2024, 4:27 p.m. UTC | #19
Thu, Jan 11, 2024 at 04:42:55PM CET, jhs@mojatatu.com wrote:
>On Thu, Jan 11, 2024 at 10:40 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>>
>> On Wed, Jan 10, 2024 at 7:10 AM Ido Schimmel <idosch@idosch.org> wrote:
>> >
>> > On Thu, Jan 04, 2024 at 01:58:44PM +0100, Jiri Pirko wrote:
>> > > diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
>> > > index adf5de1ff773..253b26f2eddd 100644
>> > > --- a/net/sched/cls_api.c
>> > > +++ b/net/sched/cls_api.c
>> > > @@ -1428,6 +1428,7 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
>> > >                     struct tcf_block_ext_info *ei,
>> > >                     struct netlink_ext_ack *extack)
>> > >  {
>> > > +     struct net_device *dev = qdisc_dev(q);
>> > >       struct net *net = qdisc_net(q);
>> > >       struct tcf_block *block = NULL;
>> > >       int err;
>> > > @@ -1461,9 +1462,18 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
>> > >       if (err)
>> > >               goto err_block_offload_bind;
>> > >
>> > > +     if (tcf_block_shared(block)) {
>> > > +             err = xa_insert(&block->ports, dev->ifindex, dev, GFP_KERNEL);
>> > > +             if (err) {
>> > > +                     NL_SET_ERR_MSG(extack, "block dev insert failed");
>> > > +                     goto err_dev_insert;
>> > > +             }
>> > > +     }
>> >
>> > While this patch fixes the original issue, it creates another one:
>> >
>> > # ip link add name swp1 type dummy
>> > # tc qdisc replace dev swp1 root handle 10: prio bands 8 priomap 7 6 5 4 3 2 1
>> > # tc qdisc add dev swp1 parent 10:8 handle 108: red limit 1000000 min 200000 max 200001 probability 1.0 avpkt 8000 burst 38 qevent early_drop block 10
>> > RED: set bandwidth to 10Mbit
>> > # tc qdisc add dev swp1 parent 10:7 handle 107: red limit 1000000 min 500000 max 500001 probability 1.0 avpkt 8000 burst 63 qevent early_drop block 10
>> > RED: set bandwidth to 10Mbit
>> > Error: block dev insert failed.
>> >
>>
>>
>> +cc Petr
>> We'll add a testcase on tdc - it doesnt seem we have any for qevents.
>> If you have others that are related let us know.
>> But how does this work? I see no mention of block on red code and i
>> see no mention of block on the reproducer above.
>
>Context: Yes, i see it on red setup but i dont see any block being setup.
>Also: Is it only Red or other qdiscs could behave this way?

Just red.

>
>cheers,
>jamal
>> Are the qevents exception packets from the hardware? Is there a good
>> description of what qevents do?
>>
>> cheers,
>> jamal
>>
>>
>> > The reproducer does not fail if I revert this patch and apply Victor's
>> > [1] instead.
>> >
>> > [1] https://lore.kernel.org/netdev/20231231172320.245375-1-victor@mojatatu.com/
>> >
>> > > +
>> > >       *p_block = block;
>> > >       return 0;
>> > >
>> > > +err_dev_insert:
>> > >  err_block_offload_bind:
>> > >       tcf_chain0_head_change_cb_del(block, ei);
>> > >  err_chain0_head_change_cb_add:
Jamal Hadi Salim Jan. 11, 2024, 7:55 p.m. UTC | #20
On Thu, Jan 11, 2024 at 11:55 AM Petr Machata <petrm@nvidia.com> wrote:
>
>
> Jamal Hadi Salim <jhs@mojatatu.com> writes:
>
> > On Thu, Jan 11, 2024 at 10:40 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >>
> >> On Wed, Jan 10, 2024 at 7:10 AM Ido Schimmel <idosch@idosch.org> wrote:
> >> >
> >> > On Thu, Jan 04, 2024 at 01:58:44PM +0100, Jiri Pirko wrote:
> >> > > diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> >> > > index adf5de1ff773..253b26f2eddd 100644
> >> > > --- a/net/sched/cls_api.c
> >> > > +++ b/net/sched/cls_api.c
> >> > > @@ -1428,6 +1428,7 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
> >> > >                     struct tcf_block_ext_info *ei,
> >> > >                     struct netlink_ext_ack *extack)
> >> > >  {
> >> > > +     struct net_device *dev = qdisc_dev(q);
> >> > >       struct net *net = qdisc_net(q);
> >> > >       struct tcf_block *block = NULL;
> >> > >       int err;
> >> > > @@ -1461,9 +1462,18 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
> >> > >       if (err)
> >> > >               goto err_block_offload_bind;
> >> > >
> >> > > +     if (tcf_block_shared(block)) {
> >> > > +             err = xa_insert(&block->ports, dev->ifindex, dev, GFP_KERNEL);
> >> > > +             if (err) {
> >> > > +                     NL_SET_ERR_MSG(extack, "block dev insert failed");
> >> > > +                     goto err_dev_insert;
> >> > > +             }
> >> > > +     }
> >> >
> >> > While this patch fixes the original issue, it creates another one:
> >> >
> >> > # ip link add name swp1 type dummy
> >> > # tc qdisc replace dev swp1 root handle 10: prio bands 8 priomap 7 6 5 4 3 2 1
> >> > # tc qdisc add dev swp1 parent 10:8 handle 108: red limit 1000000 min 200000 max 200001 probability 1.0 avpkt 8000 burst 38 qevent early_drop block 10
> >> > RED: set bandwidth to 10Mbit
> >> > # tc qdisc add dev swp1 parent 10:7 handle 107: red limit 1000000 min 500000 max 500001 probability 1.0 avpkt 8000 burst 63 qevent early_drop block 10
> >> > RED: set bandwidth to 10Mbit
> >> > Error: block dev insert failed.
> >> >
> >>
> >>
> >> +cc Petr
> >> We'll add a testcase on tdc - it doesnt seem we have any for qevents.
> >> If you have others that are related let us know.
> >> But how does this work? I see no mention of block on red code and i
>
> Look for qe_early_drop and qe_mark in sch_red.c.
>

I see it...

> >> see no mention of block on the reproducer above.
> >
> > Context: Yes, i see it on red setup but i dont see any block being setup.
>
> qevents are binding locations for blocks, similar in principle to
> clsact's ingress_block / egress_block. So the way to create a block is
> the same: just mention the block number for the first time.
>
> What qevents there are depends on the qdisc. They are supposed to
> reflect events that are somehow interesting, from the point of view of
> an skb within a qdisc. Thus RED has two qevents: early_drop for packets
> that were chosen to be, well, dropped early, and mark for packets that
> are ECN-marked. So when a packet is, say, early-dropped, the RED qdisc
> passes it through the TC block bound at that qevent (if any).
>

Ok, the confusing part was the missing block command. I am assuming in
addition to Ido's example one would need to create block 10 and then
attach a filter to it?
Something like:
tc qdisc add dev swp1 parent 10:7 handle 107: red limit 1000000 min
500000 max 500001 probability 1.0 avpkt 8000 burst 63 qevent
early_drop block 10
tc filter add block 10 ...

So a packet tagged for early drop will end up being processed in some
filter chain with some specified actions. So in the case of offload,
does it mean early drops will be sent to the kernel and land at the
specific chain? Also trying to understand (in retrospect, not armchair
lawyering): why was a block necessary? feels like the goto chain
action could have worked, no? i.e something like: qevent early_drop
goto chain x.. Is the block perhaps tied to something in the h/w or is
it just some clever metainfo that is used to jump to tc block when the
exceptions happen?

Important thing is we need tests so we can catch these regressions in
the future.  If you can, point me to some (outside of the ones Ido
posted) and we'll put them on tdc.

> > Also: Is it only Red or other qdiscs could behave this way?
>
> Currently only red supports any qevents at all, but in principle the
> mechanism is reusable. With my mlxsw hat on, an obvious next candidate
> would be tail_drop on FIFO qdisc.

Sounds cool. I can see use even for s/w only dpath.

cheers,
jamal
Petr Machata Jan. 11, 2024, 9:44 p.m. UTC | #21
Jamal Hadi Salim <jhs@mojatatu.com> writes:

> On Thu, Jan 11, 2024 at 11:55 AM Petr Machata <petrm@nvidia.com> wrote:
>>
>>
>> Jamal Hadi Salim <jhs@mojatatu.com> writes:
>>
>> > On Thu, Jan 11, 2024 at 10:40 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>> >>
>> >> On Wed, Jan 10, 2024 at 7:10 AM Ido Schimmel <idosch@idosch.org> wrote:
>> >> >
>> >> > On Thu, Jan 04, 2024 at 01:58:44PM +0100, Jiri Pirko wrote:
>> >> > > diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
>> >> > > index adf5de1ff773..253b26f2eddd 100644
>> >> > > --- a/net/sched/cls_api.c
>> >> > > +++ b/net/sched/cls_api.c
>> >> > > @@ -1428,6 +1428,7 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
>> >> > >                     struct tcf_block_ext_info *ei,
>> >> > >                     struct netlink_ext_ack *extack)
>> >> > >  {
>> >> > > +     struct net_device *dev = qdisc_dev(q);
>> >> > >       struct net *net = qdisc_net(q);
>> >> > >       struct tcf_block *block = NULL;
>> >> > >       int err;
>> >> > > @@ -1461,9 +1462,18 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
>> >> > >       if (err)
>> >> > >               goto err_block_offload_bind;
>> >> > >
>> >> > > +     if (tcf_block_shared(block)) {
>> >> > > +             err = xa_insert(&block->ports, dev->ifindex, dev, GFP_KERNEL);
>> >> > > +             if (err) {
>> >> > > +                     NL_SET_ERR_MSG(extack, "block dev insert failed");
>> >> > > +                     goto err_dev_insert;
>> >> > > +             }
>> >> > > +     }
>> >> >
>> >> > While this patch fixes the original issue, it creates another one:
>> >> >
>> >> > # ip link add name swp1 type dummy
>> >> > # tc qdisc replace dev swp1 root handle 10: prio bands 8 priomap 7 6 5 4 3 2 1
>> >> > # tc qdisc add dev swp1 parent 10:8 handle 108: red limit 1000000 min 200000 max 200001 probability 1.0 avpkt 8000 burst 38 qevent early_drop block 10
>> >> > RED: set bandwidth to 10Mbit
>> >> > # tc qdisc add dev swp1 parent 10:7 handle 107: red limit 1000000 min 500000 max 500001 probability 1.0 avpkt 8000 burst 63 qevent early_drop block 10
>> >> > RED: set bandwidth to 10Mbit
>> >> > Error: block dev insert failed.
>> >> >
>> >>
>> >>
>> >> +cc Petr
>> >> We'll add a testcase on tdc - it doesnt seem we have any for qevents.
>> >> If you have others that are related let us know.
>> >> But how does this work? I see no mention of block on red code and i
>>
>> Look for qe_early_drop and qe_mark in sch_red.c.
>>
>
> I see it...
>
>> >> see no mention of block on the reproducer above.
>> >
>> > Context: Yes, i see it on red setup but i dont see any block being setup.
>>
>> qevents are binding locations for blocks, similar in principle to
>> clsact's ingress_block / egress_block. So the way to create a block is
>> the same: just mention the block number for the first time.
>>
>> What qevents there are depends on the qdisc. They are supposed to
>> reflect events that are somehow interesting, from the point of view of
>> an skb within a qdisc. Thus RED has two qevents: early_drop for packets
>> that were chosen to be, well, dropped early, and mark for packets that
>> are ECN-marked. So when a packet is, say, early-dropped, the RED qdisc
>> passes it through the TC block bound at that qevent (if any).
>>
>
> Ok, the confusing part was the missing block command. I am assuming in
> addition to Ido's example one would need to create block 10 and then
> attach a filter to it?
> Something like:
> tc qdisc add dev swp1 parent 10:7 handle 107: red limit 1000000 min
> 500000 max 500001 probability 1.0 avpkt 8000 burst 63 qevent
> early_drop block 10
> tc filter add block 10 ...

Yes, correct.

> So a packet tagged for early drop will end up being processed in some
> filter chain with some specified actions. So in the case of offload,
> does it mean early drops will be sent to the kernel and land at the
> specific chain? Also trying to understand (in retrospect, not armchair

For offload, the idea is that the silicon is configured to do the things
that the user configures at the qevent.

In the particular case of mlxsw, we only permit adding one chain with
one filter, which has to be a matchall, and the action has to be either
a mirred or trap, plus it needs to have hw_stats disabled. Because the
HW is limited like that, it can't do much in that context.

> lawyering): why was a block necessary? feels like the goto chain
> action could have worked, no? i.e something like: qevent early_drop
> goto chain x.. Is the block perhaps tied to something in the h/w or is
> it just some clever metainfo that is used to jump to tc block when the
> exceptions happen?

So yeah, blocks are super fancy compared to what the HW can actually do.
The initial idea was to have a single action, but then what if the HW
can do more than that? And qdiscs still exist as SW entitites obviously,
why limit ourselves to a single action in SW? OK, so maybe we can make
that one action a goto chain, where the real stuff is, but where would
it look the chain up? On ingress? Egress? So say that we know where to
look it up, but then also you'll end up with totally unhinged actions on
an ingress / egress qdisc that are there just as jump targets of some
qevent. Plus the set of actions permissible on ingress / egress can be
arbitrarily different from the set of actions permissible on a qevent,
which makes the validation in an offloading driver difficult. And chain
reuse is really not a thing in Linux TC, so keeping a chain on its own
seems wrong. Plus the goto chain is still unclear in that scenario.

Blocks have no such issues, they are self-contained. They are heavy
compared to what we need, true. But that's not really an issue -- the
driver can bounce invalid configuration just fine. And there's no risk
of making another driver or SW datapath user unhappy because their set
of constrants is something we didn't anticipate. Conceptually it's
cleaner than if we had just one action / one rule / one chain, because
you can point at e.g. ingress_block and say, this, it's the same as
this, except instead of all ingress packets, only those that are
early_dropped are passed through.

BTW, newer Spectrum chips actually allow (some?) ACL matching to run in
the qevent context, so we may end up relaxing the matchall requirement
in the future and do a more complex offload of qevents.

> Important thing is we need tests so we can catch these regressions in
> the future.  If you can, point me to some (outside of the ones Ido
> posted) and we'll put them on tdc.

We just have the followin. Pretty sure that's where Ido's come from:

    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/drivers/net/mlxsw/sch_red_core.sh
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/drivers/net/mlxsw/sch_red_root.sh
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/drivers/net/mlxsw/sch_red_ets.sh

Which is doing this (or s/early_drop/mark/ instead):

    tc qdisc add dev $swp3 parent 1: handle 108: red \
            limit 1000000 min $BACKLOG max $((BACKLOG + 1)) \
            probability 1.0 avpkt 8000 burst 38 qevent early_drop block 10

And then installing a rule like one of these:

    tc filter add block 10 pref 1234 handle 102 matchall skip_sw \
            action mirred egress mirror dev $swp2 hw_stats disabled
    tc filter add block 10 pref 1234 handle 102 matchall skip_sw \
            action trap hw_stats disabled
    tc filter add block 10 pref 1234 handle 102 matchall skip_sw \
            action trap_fwd hw_stats disabled

Then in runs traffic and checks the right amount gets mirrored or trapped.

>> > Also: Is it only Red or other qdiscs could behave this way?
>>
>> Currently only red supports any qevents at all, but in principle the
>> mechanism is reusable. With my mlxsw hat on, an obvious next candidate
>> would be tail_drop on FIFO qdisc.
>
> Sounds cool. I can see use even for s/w only dpath.

FIFO is tricky to extend BTW. I wrote some patches way back before it
got backburner'd, and the only payloads that are currently bounced are
those that are <=3 bytes. Everything else is interpreted to mean
something, extra garbage is ignored, etc. Fun fun.
Jamal Hadi Salim Jan. 12, 2024, 2:47 p.m. UTC | #22
On Thu, Jan 11, 2024 at 6:22 PM Petr Machata <me@pmachata.org> wrote:
>
> Jamal Hadi Salim <jhs@mojatatu.com> writes:
>
> > On Thu, Jan 11, 2024 at 11:55 AM Petr Machata <petrm@nvidia.com> wrote:
> >>
> >>
> >> Jamal Hadi Salim <jhs@mojatatu.com> writes:
> >>
> >> > On Thu, Jan 11, 2024 at 10:40 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >> >>
> >> >> On Wed, Jan 10, 2024 at 7:10 AM Ido Schimmel <idosch@idosch.org> wrote:
> >> >> >
> >> >> > On Thu, Jan 04, 2024 at 01:58:44PM +0100, Jiri Pirko wrote:
> >> >> > > diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> >> >> > > index adf5de1ff773..253b26f2eddd 100644
> >> >> > > --- a/net/sched/cls_api.c
> >> >> > > +++ b/net/sched/cls_api.c
> >> >> > > @@ -1428,6 +1428,7 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
> >> >> > >                     struct tcf_block_ext_info *ei,
> >> >> > >                     struct netlink_ext_ack *extack)
> >> >> > >  {
> >> >> > > +     struct net_device *dev = qdisc_dev(q);
> >> >> > >       struct net *net = qdisc_net(q);
> >> >> > >       struct tcf_block *block = NULL;
> >> >> > >       int err;
> >> >> > > @@ -1461,9 +1462,18 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
> >> >> > >       if (err)
> >> >> > >               goto err_block_offload_bind;
> >> >> > >
> >> >> > > +     if (tcf_block_shared(block)) {
> >> >> > > +             err = xa_insert(&block->ports, dev->ifindex, dev, GFP_KERNEL);
> >> >> > > +             if (err) {
> >> >> > > +                     NL_SET_ERR_MSG(extack, "block dev insert failed");
> >> >> > > +                     goto err_dev_insert;
> >> >> > > +             }
> >> >> > > +     }
> >> >> >
> >> >> > While this patch fixes the original issue, it creates another one:
> >> >> >
> >> >> > # ip link add name swp1 type dummy
> >> >> > # tc qdisc replace dev swp1 root handle 10: prio bands 8 priomap 7 6 5 4 3 2 1
> >> >> > # tc qdisc add dev swp1 parent 10:8 handle 108: red limit 1000000 min 200000 max 200001 probability 1.0 avpkt 8000 burst 38 qevent early_drop block 10
> >> >> > RED: set bandwidth to 10Mbit
> >> >> > # tc qdisc add dev swp1 parent 10:7 handle 107: red limit 1000000 min 500000 max 500001 probability 1.0 avpkt 8000 burst 63 qevent early_drop block 10
> >> >> > RED: set bandwidth to 10Mbit
> >> >> > Error: block dev insert failed.
> >> >> >
> >> >>
> >> >>
> >> >> +cc Petr
> >> >> We'll add a testcase on tdc - it doesnt seem we have any for qevents.
> >> >> If you have others that are related let us know.
> >> >> But how does this work? I see no mention of block on red code and i
> >>
> >> Look for qe_early_drop and qe_mark in sch_red.c.
> >>
> >
> > I see it...
> >
> >> >> see no mention of block on the reproducer above.
> >> >
> >> > Context: Yes, i see it on red setup but i dont see any block being setup.
> >>
> >> qevents are binding locations for blocks, similar in principle to
> >> clsact's ingress_block / egress_block. So the way to create a block is
> >> the same: just mention the block number for the first time.
> >>
> >> What qevents there are depends on the qdisc. They are supposed to
> >> reflect events that are somehow interesting, from the point of view of
> >> an skb within a qdisc. Thus RED has two qevents: early_drop for packets
> >> that were chosen to be, well, dropped early, and mark for packets that
> >> are ECN-marked. So when a packet is, say, early-dropped, the RED qdisc
> >> passes it through the TC block bound at that qevent (if any).
> >>
> >
> > Ok, the confusing part was the missing block command. I am assuming in
> > addition to Ido's example one would need to create block 10 and then
> > attach a filter to it?
> > Something like:
> > tc qdisc add dev swp1 parent 10:7 handle 107: red limit 1000000 min
> > 500000 max 500001 probability 1.0 avpkt 8000 burst 63 qevent
> > early_drop block 10
> > tc filter add block 10 ...
>
> Yes, correct.
>
> > So a packet tagged for early drop will end up being processed in some
> > filter chain with some specified actions. So in the case of offload,
> > does it mean early drops will be sent to the kernel and land at the
> > specific chain? Also trying to understand (in retrospect, not armchair
>
> For offload, the idea is that the silicon is configured to do the things
> that the user configures at the qevent.
>

Ok, so none of these qevents related packets need to escape to s/w then.

> In the particular case of mlxsw, we only permit adding one chain with
> one filter, which has to be a matchall, and the action has to be either
> a mirred or trap, plus it needs to have hw_stats disabled. Because the
> HW is limited like that, it can't do much in that context.
>

Understood. The challenge right now is we depend on a human to know
what the h/w constraints are and on misconfig you reject the bad
config. The other model is you query for the h/w capabilities and then
only allow valid config (in case of skip_sw)

> > lawyering): why was a block necessary? feels like the goto chain
> > action could have worked, no? i.e something like: qevent early_drop
> > goto chain x.. Is the block perhaps tied to something in the h/w or is
> > it just some clever metainfo that is used to jump to tc block when the
> > exceptions happen?
>
> So yeah, blocks are super fancy compared to what the HW can actually do.

My curiosity on blocks was more if the ports added to a block are
related somehow from a grouping perspective and in particular to the
h/w. For example in P4TC, at a hardware level we are looking to use
blocks to group devices that are under one PCI-dev (PF, VFs, etc) -
mostly this is because the underlying ASIC has them related already
(and if for example you said to mirror from one to the other it would
work whether in s/w or h/w)

> The initial idea was to have a single action, but then what if the HW
> can do more than that? And qdiscs still exist as SW entitites obviously,
> why limit ourselves to a single action in SW? OK, so maybe we can make
> that one action a goto chain, where the real stuff is, but where would
> it look the chain up? On ingress? Egress? So say that we know where to
> look it up,

Note: At the NIC offloads i have seen - typically it is the driver
that would maintain such state and would know where to jump to.

>but then also you'll end up with totally unhinged actions on
> an ingress / egress qdisc that are there just as jump targets of some
> qevent. Plus the set of actions permissible on ingress / egress can be
> arbitrarily different from the set of actions permissible on a qevent,
> which makes the validation in an offloading driver difficult. And chain
> reuse is really not a thing in Linux TC, so keeping a chain on its own
> seems wrong. Plus the goto chain is still unclear in that scenario.
>

It is a configuration/policy domain and requires human knowledge. The
NIC guys do it today. There's some default assumptions of "missed"
lookups - which end up as exceptions on chain 0 of the same qdisc
where skip_sw is configured. But you are right on the ingress/egress
issues: the NIC model at the moment assumes an ingress only approach
(at least in the ASICs), whereas you as a switch have to deal with
both. The good news is that TC can be taught to handle both.

> Blocks have no such issues, they are self-contained. They are heavy
> compared to what we need, true. But that's not really an issue -- the
> driver can bounce invalid configuration just fine. And there's no risk
> of making another driver or SW datapath user unhappy because their set
> of constrants is something we didn't anticipate. Conceptually it's
> cleaner than if we had just one action / one rule / one chain, because
> you can point at e.g. ingress_block and say, this, it's the same as
> this, except instead of all ingress packets, only those that are
> early_dropped are passed through.
>

/nod

> BTW, newer Spectrum chips actually allow (some?) ACL matching to run in
> the qevent context, so we may end up relaxing the matchall requirement
> in the future and do a more complex offload of qevents.
>

In P4 a lot of this could be modelled to one's liking really - but
that's a different discussion.

> > Important thing is we need tests so we can catch these regressions in
> > the future.  If you can, point me to some (outside of the ones Ido
> > posted) and we'll put them on tdc.
>
> We just have the followin. Pretty sure that's where Ido's come from:
>
>     https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/drivers/net/mlxsw/sch_red_core.sh
>     https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/drivers/net/mlxsw/sch_red_root.sh
>     https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/drivers/net/mlxsw/sch_red_ets.sh
>
> Which is doing this (or s/early_drop/mark/ instead):
>
>     tc qdisc add dev $swp3 parent 1: handle 108: red \
>             limit 1000000 min $BACKLOG max $((BACKLOG + 1)) \
>             probability 1.0 avpkt 8000 burst 38 qevent early_drop block 10
>
> And then installing a rule like one of these:
>
>     tc filter add block 10 pref 1234 handle 102 matchall skip_sw \
>             action mirred egress mirror dev $swp2 hw_stats disabled
>     tc filter add block 10 pref 1234 handle 102 matchall skip_sw \
>             action trap hw_stats disabled
>     tc filter add block 10 pref 1234 handle 102 matchall skip_sw \
>             action trap_fwd hw_stats disabled
>
> Then in runs traffic and checks the right amount gets mirrored or trapped.
>

Makes sense. Is the Red piece offloaded as well?

> >> > Also: Is it only Red or other qdiscs could behave this way?
> >>
> >> Currently only red supports any qevents at all, but in principle the
> >> mechanism is reusable. With my mlxsw hat on, an obvious next candidate
> >> would be tail_drop on FIFO qdisc.
> >
> > Sounds cool. I can see use even for s/w only dpath.
>
> FIFO is tricky to extend BTW. I wrote some patches way back before it
> got backburner'd, and the only payloads that are currently bounced are
> those that are <=3 bytes. Everything else is interpreted to mean
> something, extra garbage is ignored, etc. Fun fun.

FIFO may have undergone too many changes to be general purpose anymore
(thinking of the attempts to make it lockless in particular) but I
would think that all you need is to share the netlink attributes with
the h/w, no? i.e you need a new attribute to enable qevents on fifo.

cheers,
jamal
Petr Machata Jan. 12, 2024, 3:37 p.m. UTC | #23
Jamal Hadi Salim <jhs@mojatatu.com> writes:

> On Thu, Jan 11, 2024 at 6:22 PM Petr Machata <me@pmachata.org> wrote:
>>
>> Jamal Hadi Salim <jhs@mojatatu.com> writes:
>>
>> > On Thu, Jan 11, 2024 at 11:55 AM Petr Machata <petrm@nvidia.com> wrote:
>> >
>> >> qevents are binding locations for blocks, similar in principle to
>> >> clsact's ingress_block / egress_block. So the way to create a block is
>> >> the same: just mention the block number for the first time.
>> >>
>> >> What qevents there are depends on the qdisc. They are supposed to
>> >> reflect events that are somehow interesting, from the point of view of
>> >> an skb within a qdisc. Thus RED has two qevents: early_drop for packets
>> >> that were chosen to be, well, dropped early, and mark for packets that
>> >> are ECN-marked. So when a packet is, say, early-dropped, the RED qdisc
>> >> passes it through the TC block bound at that qevent (if any).
>> >>
>> >
>> > Ok, the confusing part was the missing block command. I am assuming in
>> > addition to Ido's example one would need to create block 10 and then
>> > attach a filter to it?
>> > Something like:
>> > tc qdisc add dev swp1 parent 10:7 handle 107: red limit 1000000 min
>> > 500000 max 500001 probability 1.0 avpkt 8000 burst 63 qevent
>> > early_drop block 10
>> > tc filter add block 10 ...
>>
>> Yes, correct.
>>
>> > So a packet tagged for early drop will end up being processed in some
>> > filter chain with some specified actions. So in the case of offload,
>> > does it mean early drops will be sent to the kernel and land at the
>> > specific chain? Also trying to understand (in retrospect, not armchair
>>
>> For offload, the idea is that the silicon is configured to do the things
>> that the user configures at the qevent.
>
> Ok, so none of these qevents related packets need to escape to s/w then.

Like yeah, I suppose drops and marks shouldn't be /that/ numerous, so
maybe you can get away with the cookie / schedule from driver that you
talk about below. But you've got dozens of ports each of which is
capable of flooding the PCI on its own. I can see this working for like
connection tracking, or some other sort of miss that gets quickly
installed in HW and future traffic stays there. I'm not sure it makes
sense as a strategy for basically unbounded amounts of drops.

>> In the particular case of mlxsw, we only permit adding one chain with
>> one filter, which has to be a matchall, and the action has to be either
>> a mirred or trap, plus it needs to have hw_stats disabled. Because the
>> HW is limited like that, it can't do much in that context.
>>
>
> Understood. The challenge right now is we depend on a human to know
> what the h/w constraints are and on misconfig you reject the bad
> config. The other model is you query for the h/w capabilities and then
> only allow valid config (in case of skip_sw)

This is not new with qevents though. The whole TC offload is an exercise
in return -EOPNOTSUPP. And half VXLAN, and IPIP, and, and. But I guess
that's your point, that this sucks as a general approach?

>> > lawyering): why was a block necessary? feels like the goto chain
>> > action could have worked, no? i.e something like: qevent early_drop
>> > goto chain x.. Is the block perhaps tied to something in the h/w or is
>> > it just some clever metainfo that is used to jump to tc block when the
>> > exceptions happen?
>>
>> So yeah, blocks are super fancy compared to what the HW can actually do.
>
> My curiosity on blocks was more if the ports added to a block are
> related somehow from a grouping perspective and in particular to the
> h/w.

Not necessarily. You could have two nics from different vendors with
different offloading capabilities, and still share the same block with
both, so long as you use the subset of features common to both.

> For example in P4TC, at a hardware level we are looking to use
> blocks to group devices that are under one PCI-dev (PF, VFs, etc) -
> mostly this is because the underlying ASIC has them related already
> (and if for example you said to mirror from one to the other it would
> work whether in s/w or h/w)

Yeah, I've noticed the "flood to other ports" patches. This is an
interesting use case, but not one that was necessarily part of the
blocks abstraction IMHO. I think originally blocks were just a sharing
mechanism, which was the mindset with which I worked on qevents.

>> The initial idea was to have a single action, but then what if the HW
>> can do more than that? And qdiscs still exist as SW entitites obviously,
>> why limit ourselves to a single action in SW? OK, so maybe we can make
>> that one action a goto chain, where the real stuff is, but where would
>> it look the chain up? On ingress? Egress? So say that we know where to
>> look it up,
>
> Note: At the NIC offloads i have seen - typically it is the driver
> that would maintain such state and would know where to jump to.
>
>> but then also you'll end up with totally unhinged actions on
>> an ingress / egress qdisc that are there just as jump targets of some
>> qevent. Plus the set of actions permissible on ingress / egress can be
>> arbitrarily different from the set of actions permissible on a qevent,
>> which makes the validation in an offloading driver difficult. And chain
>> reuse is really not a thing in Linux TC, so keeping a chain on its own
>> seems wrong. Plus the goto chain is still unclear in that scenario.
>
> It is a configuration/policy domain and requires human knowledge. The
> NIC guys do it today. There's some default assumptions of "missed"
> lookups - which end up as exceptions on chain 0 of the same qdisc
> where skip_sw is configured. But you are right on the ingress/egress
> issues: the NIC model at the moment assumes an ingress only approach
> (at least in the ASICs), whereas you as a switch have to deal with
> both. The good news is that TC can be taught to handle both.

Oh, right. You mean give the packets a cookie to recognize them later,
and do the scheduling by hand in the driver. Instead of injecting to
ingress, directly pass the packet through some chain. But again, you
need a well-defined way to specify which chain should be invoked, and it
needs to work for any qdisc that decides to implement qevents. I know I
sound like a broken record, but just make that reference be a block
number and it's done for you, and everybody will agree on the semantics.

>> Blocks have no such issues, they are self-contained. They are heavy
>> compared to what we need, true. But that's not really an issue -- the
>> driver can bounce invalid configuration just fine. And there's no risk
>> of making another driver or SW datapath user unhappy because their set
>> of constrants is something we didn't anticipate. Conceptually it's
>> cleaner than if we had just one action / one rule / one chain, because
>> you can point at e.g. ingress_block and say, this, it's the same as
>> this, except instead of all ingress packets, only those that are
>> early_dropped are passed through.
>
> /nod
>
>> BTW, newer Spectrum chips actually allow (some?) ACL matching to run in
>> the qevent context, so we may end up relaxing the matchall requirement
>> in the future and do a more complex offload of qevents.
>
> In P4 a lot of this could be modelled to one's liking really - but
> that's a different discussion.

Pretty sure :)

BTW Spectrum's ACLs are really fairly close to what flower is doing.
At least that's the ABI we get to work with. Not really a good target
for u32 at all. So I suspect that P4 would end up looking remarkably
flower-like, no offsets or layouts, most of it just pure symbolics.

>> > Important thing is we need tests so we can catch these regressions in
>> > the future.  If you can, point me to some (outside of the ones Ido
>> > posted) and we'll put them on tdc.
>>
>> We just have the followin. Pretty sure that's where Ido's come from:
>>
>>     https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/drivers/net/mlxsw/sch_red_core.sh
>>     https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/drivers/net/mlxsw/sch_red_root.sh
>>     https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/drivers/net/mlxsw/sch_red_ets.sh
>>
>> Which is doing this (or s/early_drop/mark/ instead):
>>
>>     tc qdisc add dev $swp3 parent 1: handle 108: red \
>>             limit 1000000 min $BACKLOG max $((BACKLOG + 1)) \
>>             probability 1.0 avpkt 8000 burst 38 qevent early_drop block 10
>>
>> And then installing a rule like one of these:
>>
>>     tc filter add block 10 pref 1234 handle 102 matchall skip_sw \
>>             action mirred egress mirror dev $swp2 hw_stats disabled
>>     tc filter add block 10 pref 1234 handle 102 matchall skip_sw \
>>             action trap hw_stats disabled
>>     tc filter add block 10 pref 1234 handle 102 matchall skip_sw \
>>             action trap_fwd hw_stats disabled
>>
>> Then in runs traffic and checks the right amount gets mirrored or trapped.
>>
>
> Makes sense. Is the Red piece offloaded as well?

Yeah, it does offload the early dropping / ECN marking logic, as per the
configured min/max, if that's what you mean.

>> >> > Also: Is it only Red or other qdiscs could behave this way?
>> >>
>> >> Currently only red supports any qevents at all, but in principle the
>> >> mechanism is reusable. With my mlxsw hat on, an obvious next candidate
>> >> would be tail_drop on FIFO qdisc.
>> >
>> > Sounds cool. I can see use even for s/w only dpath.
>>
>> FIFO is tricky to extend BTW. I wrote some patches way back before it
>> got backburner'd, and the only payloads that are currently bounced are
>> those that are <=3 bytes. Everything else is interpreted to mean
>> something, extra garbage is ignored, etc. Fun fun.
>
> FIFO may have undergone too many changes to be general purpose anymore
> (thinking of the attempts to make it lockless in particular) but I
> would think that all you need is to share the netlink attributes with
> the h/w, no? i.e you need a new attribute to enable qevents on fifo.

Like the buffer model of our HW is just wildly incompatible with qdiscs'
ideas. But that's not what I wanted to say. Say you want to add a new
configuration knob to FIFO, e.g. the tail_drop qevent. How do you
configure it?

For reasonable qdiscs it's a simple matter of adding new attributes. But
with FIFO, there _are_ no attributes. It's literally just if no nlattr,
use defaults, otherwise bounce if the payload size < 4, else first 4
bytes are limit and ignore the rest. Done & done. No attributes, and
nowhere to put them in a backward compatible manner. So extending it is
a bit hacky. (Can be done safely I believe, but it's not very pretty.)
Jamal Hadi Salim Jan. 15, 2024, 9:02 p.m. UTC | #24
On Fri, Jan 12, 2024 at 11:55 AM Petr Machata <petrm@nvidia.com> wrote:
>
>
> Jamal Hadi Salim <jhs@mojatatu.com> writes:
>
> > On Thu, Jan 11, 2024 at 6:22 PM Petr Machata <me@pmachata.org> wrote:
> >>
> >> Jamal Hadi Salim <jhs@mojatatu.com> writes:
> >>
> >> > On Thu, Jan 11, 2024 at 11:55 AM Petr Machata <petrm@nvidia.com> wrote:
> >> >
> >> >> qevents are binding locations for blocks, similar in principle to
> >> >> clsact's ingress_block / egress_block. So the way to create a block is
> >> >> the same: just mention the block number for the first time.
> >> >>
> >> >> What qevents there are depends on the qdisc. They are supposed to
> >> >> reflect events that are somehow interesting, from the point of view of
> >> >> an skb within a qdisc. Thus RED has two qevents: early_drop for packets
> >> >> that were chosen to be, well, dropped early, and mark for packets that
> >> >> are ECN-marked. So when a packet is, say, early-dropped, the RED qdisc
> >> >> passes it through the TC block bound at that qevent (if any).
> >> >>
> >> >
> >> > Ok, the confusing part was the missing block command. I am assuming in
> >> > addition to Ido's example one would need to create block 10 and then
> >> > attach a filter to it?
> >> > Something like:
> >> > tc qdisc add dev swp1 parent 10:7 handle 107: red limit 1000000 min
> >> > 500000 max 500001 probability 1.0 avpkt 8000 burst 63 qevent
> >> > early_drop block 10
> >> > tc filter add block 10 ...
> >>
> >> Yes, correct.
> >>
> >> > So a packet tagged for early drop will end up being processed in some
> >> > filter chain with some specified actions. So in the case of offload,
> >> > does it mean early drops will be sent to the kernel and land at the
> >> > specific chain? Also trying to understand (in retrospect, not armchair
> >>
> >> For offload, the idea is that the silicon is configured to do the things
> >> that the user configures at the qevent.
> >
> > Ok, so none of these qevents related packets need to escape to s/w then.
>
> Like yeah, I suppose drops and marks shouldn't be /that/ numerous, so
> maybe you can get away with the cookie / schedule from driver that you
> talk about below. But you've got dozens of ports each of which is
> capable of flooding the PCI on its own. I can see this working for like
> connection tracking, or some other sort of miss that gets quickly
> installed in HW and future traffic stays there. I'm not sure it makes
> sense as a strategy for basically unbounded amounts of drops.

I misunderstood this to be more to trigger the control plane to do
something like adjust buffer sizes etc, etc. But if you are getting a
lot of these packets like you said it will be tantamount to a DoS.
For monitoring purposes (which seems to be the motivation here) I
wouldnt send pkts to the kernel.
Mirroring that you pointed to is sane, possibly even to a remote
machine dedicated for this makes more sense. BTW, is there a way for
the receiving end (where the packet is mirrored to) to know if the
packet they received was due to a RED drop or threshold excess/ecn
mark?

> >> In the particular case of mlxsw, we only permit adding one chain with
> >> one filter, which has to be a matchall, and the action has to be either
> >> a mirred or trap, plus it needs to have hw_stats disabled. Because the
> >> HW is limited like that, it can't do much in that context.
> >>
> >
> > Understood. The challenge right now is we depend on a human to know
> > what the h/w constraints are and on misconfig you reject the bad
> > config. The other model is you query for the h/w capabilities and then
> > only allow valid config (in case of skip_sw)
>
> This is not new with qevents though. The whole TC offload is an exercise
> in return -EOPNOTSUPP. And half VXLAN, and IPIP, and, and. But I guess
> that's your point, that this sucks as a general approach?

Current scheme works fine - and in some cases is unavoidable. In P4 it
is different because resources are reserved; so by the time the
request hits the kernel, the tc layer already knows if such a request
will succeed or not if sent to the driver. This would be hard to do if
you have dynamic clever resourcing where you are continuously
adjusting the hardware resources - either growing or shrinking them
(as i think the mlxsw works); in such a case sending to the driver
makes more sense.

> >> > lawyering): why was a block necessary? feels like the goto chain
> >> > action could have worked, no? i.e something like: qevent early_drop
> >> > goto chain x.. Is the block perhaps tied to something in the h/w or is
> >> > it just some clever metainfo that is used to jump to tc block when the
> >> > exceptions happen?
> >>
> >> So yeah, blocks are super fancy compared to what the HW can actually do.
> >
> > My curiosity on blocks was more if the ports added to a block are
> > related somehow from a grouping perspective and in particular to the
> > h/w.
>
> Not necessarily. You could have two nics from different vendors with
> different offloading capabilities, and still share the same block with
> both, so long as you use the subset of features common to both.
>

True, for the general case. The hard part is dealing with exceptions,
for example, lets say we have two vendors who can both mirror in
hardware:
if you have a rule to mirror from vendor A port 0 to vendor B port 1,
it cant be done from Vendor A's ASIC directly. It would work if you
made it an exception that goes via kernel and some tc chain there
picks it up and sends it to vendor B's port1. You can probably do some
clever things like have direct DMA from vendor A to B, but that would
be very speacilized code.

> > For example in P4TC, at a hardware level we are looking to use
> > blocks to group devices that are under one PCI-dev (PF, VFs, etc) -
> > mostly this is because the underlying ASIC has them related already
> > (and if for example you said to mirror from one to the other it would
> > work whether in s/w or h/w)
>
> Yeah, I've noticed the "flood to other ports" patches. This is an
> interesting use case, but not one that was necessarily part of the
> blocks abstraction IMHO. I think originally blocks were just a sharing
> mechanism, which was the mindset with which I worked on qevents.
>

Basically now if you attach a mirror to a block all netdevs on that
block will receive a copy, note sure if that will have any effect on
what you are doing. Note:
Sharing of tables, actions, etc (or as P4 calls them
programs/pipelines) for all the ports in a block is still in play.
Challenge: assume vendor A from earlier has N ports which are part of
block 11: There is no need for vendor A's driver to register its tc
callback everytime one of the N ports get added into the block, one
should be enough (at least that is the thought process in the
discussions for offloads with P4TC).
Vendor A and B both register once, then you can replay the same rule
request to both. For actions that dont require cross-ASIC activity
(example a "drop")  it works fine because it will be localized.

> >> The initial idea was to have a single action, but then what if the HW
> >> can do more than that? And qdiscs still exist as SW entitites obviously,
> >> why limit ourselves to a single action in SW? OK, so maybe we can make
> >> that one action a goto chain, where the real stuff is, but where would
> >> it look the chain up? On ingress? Egress? So say that we know where to
> >> look it up,
> >
> > Note: At the NIC offloads i have seen - typically it is the driver
> > that would maintain such state and would know where to jump to.
> >
> >> but then also you'll end up with totally unhinged actions on
> >> an ingress / egress qdisc that are there just as jump targets of some
> >> qevent. Plus the set of actions permissible on ingress / egress can be
> >> arbitrarily different from the set of actions permissible on a qevent,
> >> which makes the validation in an offloading driver difficult. And chain
> >> reuse is really not a thing in Linux TC, so keeping a chain on its own
> >> seems wrong. Plus the goto chain is still unclear in that scenario.
> >
> > It is a configuration/policy domain and requires human knowledge. The
> > NIC guys do it today. There's some default assumptions of "missed"
> > lookups - which end up as exceptions on chain 0 of the same qdisc
> > where skip_sw is configured. But you are right on the ingress/egress
> > issues: the NIC model at the moment assumes an ingress only approach
> > (at least in the ASICs), whereas you as a switch have to deal with
> > both. The good news is that TC can be taught to handle both.
>
> Oh, right. You mean give the packets a cookie to recognize them later,
> and do the scheduling by hand in the driver. Instead of injecting to
> ingress, directly pass the packet through some chain. But again, you
> need a well-defined way to specify which chain should be invoked, and it
> needs to work for any qdisc that decides to implement qevents. I know I
> sound like a broken record, but just make that reference be a block
> number and it's done for you, and everybody will agree on the semantics.
>

The block approach is simpler but i think the cookie continuation
approach is more autonomous.

> >> Blocks have no such issues, they are self-contained. They are heavy
> >> compared to what we need, true. But that's not really an issue -- the
> >> driver can bounce invalid configuration just fine. And there's no risk
> >> of making another driver or SW datapath user unhappy because their set
> >> of constrants is something we didn't anticipate. Conceptually it's
> >> cleaner than if we had just one action / one rule / one chain, because
> >> you can point at e.g. ingress_block and say, this, it's the same as
> >> this, except instead of all ingress packets, only those that are
> >> early_dropped are passed through.
> >
> > /nod
> >
> >> BTW, newer Spectrum chips actually allow (some?) ACL matching to run in
> >> the qevent context, so we may end up relaxing the matchall requirement
> >> in the future and do a more complex offload of qevents.
> >
> > In P4 a lot of this could be modelled to one's liking really - but
> > that's a different discussion.
>
> Pretty sure :)
>
> BTW Spectrum's ACLs are really fairly close to what flower is doing.
> At least that's the ABI we get to work with. Not really a good target
> for u32 at all. So I suspect that P4 would end up looking remarkably
> flower-like, no offsets or layouts, most of it just pure symbolics.
>

Its a "fixed" ASIC, so it is expected. But: One should be able to
express the Spectrum's ACLs or even the whole datapath as a P4 program
and i dont see why it wouldnt work with P4TC. Matty has at least once
in the past, if i am not mistaken, pitched such an idea.

> >> > Important thing is we need tests so we can catch these regressions in
> >> > the future.  If you can, point me to some (outside of the ones Ido
> >> > posted) and we'll put them on tdc.
> >>
> >> We just have the followin. Pretty sure that's where Ido's come from:
> >>
> >>     https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/drivers/net/mlxsw/sch_red_core.sh
> >>     https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/drivers/net/mlxsw/sch_red_root.sh
> >>     https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/drivers/net/mlxsw/sch_red_ets.sh
> >>
> >> Which is doing this (or s/early_drop/mark/ instead):
> >>
> >>     tc qdisc add dev $swp3 parent 1: handle 108: red \
> >>             limit 1000000 min $BACKLOG max $((BACKLOG + 1)) \
> >>             probability 1.0 avpkt 8000 burst 38 qevent early_drop block 10
> >>
> >> And then installing a rule like one of these:
> >>
> >>     tc filter add block 10 pref 1234 handle 102 matchall skip_sw \
> >>             action mirred egress mirror dev $swp2 hw_stats disabled
> >>     tc filter add block 10 pref 1234 handle 102 matchall skip_sw \
> >>             action trap hw_stats disabled
> >>     tc filter add block 10 pref 1234 handle 102 matchall skip_sw \
> >>             action trap_fwd hw_stats disabled
> >>
> >> Then in runs traffic and checks the right amount gets mirrored or trapped.
> >>
> >
> > Makes sense. Is the Red piece offloaded as well?
>
> Yeah, it does offload the early dropping / ECN marking logic, as per the
> configured min/max, if that's what you mean.
>
> >> >> > Also: Is it only Red or other qdiscs could behave this way?
> >> >>
> >> >> Currently only red supports any qevents at all, but in principle the
> >> >> mechanism is reusable. With my mlxsw hat on, an obvious next candidate
> >> >> would be tail_drop on FIFO qdisc.
> >> >
> >> > Sounds cool. I can see use even for s/w only dpath.
> >>
> >> FIFO is tricky to extend BTW. I wrote some patches way back before it
> >> got backburner'd, and the only payloads that are currently bounced are
> >> those that are <=3 bytes. Everything else is interpreted to mean
> >> something, extra garbage is ignored, etc. Fun fun.
> >
> > FIFO may have undergone too many changes to be general purpose anymore
> > (thinking of the attempts to make it lockless in particular) but I
> > would think that all you need is to share the netlink attributes with
> > the h/w, no? i.e you need a new attribute to enable qevents on fifo.
>
> Like the buffer model of our HW is just wildly incompatible with qdiscs'
> ideas. But that's not what I wanted to say. Say you want to add a new
> configuration knob to FIFO, e.g. the tail_drop qevent. How do you
> configure it?
>
> For reasonable qdiscs it's a simple matter of adding new attributes. But
> with FIFO, there _are_ no attributes. It's literally just if no nlattr,
> use defaults, otherwise bounce if the payload size < 4, else first 4
> bytes are limit and ignore the rest. Done & done. No attributes, and
> nowhere to put them in a backward compatible manner. So extending it is
> a bit hacky. (Can be done safely I believe, but it's not very pretty.)

I believe you but will have to look to make sense. There's at least
one attribute you mention above carried in some data structure in a
TLV (if i am not mistaken it is queue size either in packet or bytes,
depending on which fifo mode you are running). You are saying you cant
add another one or a flag at least?

cheers,
jamal
Petr Machata Jan. 16, 2024, 10:15 a.m. UTC | #25
Jamal Hadi Salim <jhs@mojatatu.com> writes:

> On Fri, Jan 12, 2024 at 11:55 AM Petr Machata <petrm@nvidia.com> wrote:
>>
>>
>> Jamal Hadi Salim <jhs@mojatatu.com> writes:
>>
>> > On Thu, Jan 11, 2024 at 6:22 PM Petr Machata <me@pmachata.org> wrote:
>> >>
>> >> Jamal Hadi Salim <jhs@mojatatu.com> writes:
>> >> > So a packet tagged for early drop will end up being processed in some
>> >> > filter chain with some specified actions. So in the case of offload,
>> >> > does it mean early drops will be sent to the kernel and land at the
>> >> > specific chain? Also trying to understand (in retrospect, not armchair
>> >>
>> >> For offload, the idea is that the silicon is configured to do the things
>> >> that the user configures at the qevent.
>> >
>> > Ok, so none of these qevents related packets need to escape to s/w then.
>>
>> Like yeah, I suppose drops and marks shouldn't be /that/ numerous, so
>> maybe you can get away with the cookie / schedule from driver that you
>> talk about below. But you've got dozens of ports each of which is
>> capable of flooding the PCI on its own. I can see this working for like
>> connection tracking, or some other sort of miss that gets quickly
>> installed in HW and future traffic stays there. I'm not sure it makes
>> sense as a strategy for basically unbounded amounts of drops.
>
> I misunderstood this to be more to trigger the control plane to do
> something like adjust buffer sizes etc, etc. But if you are getting a
> lot of these packets like you said it will be tantamount to a DoS.
> For monitoring purposes (which seems to be the motivation here) I
> wouldnt send pkts to the kernel.
> Mirroring that you pointed to is sane, possibly even to a remote
> machine dedicated for this makes more sense. BTW, is there a way for
> the receiving end (where the packet is mirrored to) to know if the
> packet they received was due to a RED drop or threshold excess/ecn
> mark?

Spectrum supports these mirror headers. They are sandwiched between the
encapsulation headers of the mirrored packets and the actual packet, and
tell you the RX and TX ports, what was the mirror reason, some
timestamps, etc. Without this, you need to encode the information into
like parts of IPv6 or what not.

But it's a proprietary format. It could be expressed as a netdevice, or
maybe a special version in the ERSPAN netdevice. But there's no
standard, or public documentation, no open source tools that would
consume it. AFAIK. So adding code to ERSPAN to support it looks like a
non-starter.

I was also toying with approaches around some push_header action, but it
all seemed too clumsy. Also unclear how to express things like
timestamps, port labels, mirror reasons, latency, queue depths... It's
very, very HW-oriented header.

I suppose with P4 there might be a way for the user to describe the
encapsulation and for the driver to recognize it as the mirror header
and offload properly. With some squinting: frankly I don't see anybody
encoding things like queue depths, or anybody writing the driver code to
recognize that indeed that's what it is.

>> >> In the particular case of mlxsw, we only permit adding one chain with
>> >> one filter, which has to be a matchall, and the action has to be either
>> >> a mirred or trap, plus it needs to have hw_stats disabled. Because the
>> >> HW is limited like that, it can't do much in that context.
>> >>
>> >
>> > Understood. The challenge right now is we depend on a human to know
>> > what the h/w constraints are and on misconfig you reject the bad
>> > config. The other model is you query for the h/w capabilities and then
>> > only allow valid config (in case of skip_sw)
>>
>> This is not new with qevents though. The whole TC offload is an exercise
>> in return -EOPNOTSUPP. And half VXLAN, and IPIP, and, and. But I guess
>> that's your point, that this sucks as a general approach?
>
> Current scheme works fine - and in some cases is unavoidable. In P4 it
> is different because resources are reserved; so by the time the
> request hits the kernel, the tc layer already knows if such a request
> will succeed or not if sent to the driver. This would be hard to do if
> you have dynamic clever resourcing where you are continuously
> adjusting the hardware resources - either growing or shrinking them
> (as i think the mlxsw works); in such a case sending to the driver
> makes more sense.

The HW does some of that, in the sense that there are shared tables used
by various stages of the pipeline for different things, so more of one
means less of another. The driver is fairly simplistic though, most of
the time anyway, and just configures what the user tells it to. The only
exception might be ACLs, but even there it's nothing like noticing, oh,
none of my rules so far uses header XYZ, so I can use this denser way of
encoding the keys, to satisfy this request to cram in more stuff. If as
a user you want this effect, you are supposed to use chain templates to
give the driver a better idea of the shape of things to come.

>> >> > lawyering): why was a block necessary? feels like the goto chain
>> >> > action could have worked, no? i.e something like: qevent early_drop
>> >> > goto chain x.. Is the block perhaps tied to something in the h/w or is
>> >> > it just some clever metainfo that is used to jump to tc block when the
>> >> > exceptions happen?
>> >>
>> >> So yeah, blocks are super fancy compared to what the HW can actually do.
>> >
>> > My curiosity on blocks was more if the ports added to a block are
>> > related somehow from a grouping perspective and in particular to the
>> > h/w.
>>
>> Not necessarily. You could have two nics from different vendors with
>> different offloading capabilities, and still share the same block with
>> both, so long as you use the subset of features common to both.
>
> True, for the general case. The hard part is dealing with exceptions,
> for example, lets say we have two vendors who can both mirror in
> hardware:
> if you have a rule to mirror from vendor A port 0 to vendor B port 1,
> it cant be done from Vendor A's ASIC directly. It would work if you
> made it an exception that goes via kernel and some tc chain there
> picks it up and sends it to vendor B's port1. You can probably do some
> clever things like have direct DMA from vendor A to B, but that would
> be very speacilized code.

Yep.

>> > For example in P4TC, at a hardware level we are looking to use
>> > blocks to group devices that are under one PCI-dev (PF, VFs, etc) -
>> > mostly this is because the underlying ASIC has them related already
>> > (and if for example you said to mirror from one to the other it would
>> > work whether in s/w or h/w)
>>
>> Yeah, I've noticed the "flood to other ports" patches. This is an
>> interesting use case, but not one that was necessarily part of the
>> blocks abstraction IMHO. I think originally blocks were just a sharing
>> mechanism, which was the mindset with which I worked on qevents.
>
> Basically now if you attach a mirror to a block all netdevs on that
> block will receive a copy, note sure if that will have any effect on
> what you are doing.

I'm not sure what semantics of mirroring to a qevent block are, but
beyond that, it shouldn't impact us. Such rule would be punted from HW
datapath, because there's no netdevice, and we demand a netdevice (plus
conditions on what the netdevice is allowed be).

> Note:
> Sharing of tables, actions, etc (or as P4 calls them
> programs/pipelines) for all the ports in a block is still in play.
> Challenge: assume vendor A from earlier has N ports which are part of
> block 11: There is no need for vendor A's driver to register its tc
> callback everytime one of the N ports get added into the block, one
> should be enough (at least that is the thought process in the
> discussions for offloads with P4TC).
> Vendor A and B both register once, then you can replay the same rule
> request to both. For actions that dont require cross-ASIC activity
> (example a "drop")  it works fine because it will be localized.

Sounds reasonable so far, but reads like something that misses a
"however, consider" or "now the challenge is to" :)

>> >> Blocks have no such issues, they are self-contained. They are heavy
>> >> compared to what we need, true. But that's not really an issue -- the
>> >> driver can bounce invalid configuration just fine. And there's no risk
>> >> of making another driver or SW datapath user unhappy because their set
>> >> of constrants is something we didn't anticipate. Conceptually it's
>> >> cleaner than if we had just one action / one rule / one chain, because
>> >> you can point at e.g. ingress_block and say, this, it's the same as
>> >> this, except instead of all ingress packets, only those that are
>> >> early_dropped are passed through.
>> >
>> > /nod
>> >
>> >> BTW, newer Spectrum chips actually allow (some?) ACL matching to run in
>> >> the qevent context, so we may end up relaxing the matchall requirement
>> >> in the future and do a more complex offload of qevents.
>> >
>> > In P4 a lot of this could be modelled to one's liking really - but
>> > that's a different discussion.
>>
>> Pretty sure :)
>>
>> BTW Spectrum's ACLs are really fairly close to what flower is doing.
>> At least that's the ABI we get to work with. Not really a good target
>> for u32 at all. So I suspect that P4 would end up looking remarkably
>> flower-like, no offsets or layouts, most of it just pure symbolics.
>>
>
> Its a "fixed" ASIC, so it is expected. But: One should be able to
> express the Spectrum's ACLs or even the whole datapath as a P4 program

Yeah, I think so, too.

> and i dont see why it wouldnt work with P4TC. Matty has at least once
> in the past, if i am not mistaken, pitched such an idea.

I don't see why it wouldn't work. What I'm saying is that at least the
ACL bits will just look fairly close to flower, because that's the HW we
are working with. And then the benefits of P4 are not as clear, because
flower is already here and also looks like flower.

On the upside, we would get more flexibility with different matching
approaches. Mixing different matchers is awkward (say flower + basic
might occasionally be useful), so there's this tendency to cram
everything into flower.

I mentioned the mirror headers above, that's one area where TC just
doesn't have the tools we need.

>> >> >> > Also: Is it only Red or other qdiscs could behave this way?
>> >> >>
>> >> >> Currently only red supports any qevents at all, but in principle the
>> >> >> mechanism is reusable. With my mlxsw hat on, an obvious next candidate
>> >> >> would be tail_drop on FIFO qdisc.
>> >> >
>> >> > Sounds cool. I can see use even for s/w only dpath.
>> >>
>> >> FIFO is tricky to extend BTW. I wrote some patches way back before it
>> >> got backburner'd, and the only payloads that are currently bounced are
>> >> those that are <=3 bytes. Everything else is interpreted to mean
>> >> something, extra garbage is ignored, etc. Fun fun.
>> >
>> > FIFO may have undergone too many changes to be general purpose anymore
>> > (thinking of the attempts to make it lockless in particular) but I
>> > would think that all you need is to share the netlink attributes with
>> > the h/w, no? i.e you need a new attribute to enable qevents on fifo.
>>
>> Like the buffer model of our HW is just wildly incompatible with qdiscs'
>> ideas. But that's not what I wanted to say. Say you want to add a new
>> configuration knob to FIFO, e.g. the tail_drop qevent. How do you
>> configure it?
>>
>> For reasonable qdiscs it's a simple matter of adding new attributes. But
>> with FIFO, there _are_ no attributes. It's literally just if no nlattr,
>> use defaults, otherwise bounce if the payload size < 4, else first 4
>> bytes are limit and ignore the rest. Done & done. No attributes, and
>> nowhere to put them in a backward compatible manner. So extending it is
>> a bit hacky. (Can be done safely I believe, but it's not very pretty.)
>
> I believe you but will have to look to make sense. There's at least
> one attribute you mention above carried in some data structure in a
> TLV (if i am not mistaken it is queue size either in packet or bytes,
> depending on which fifo mode you are running). You are saying you cant
> add another one or a flag at least?

Not backward-compatibly. The sole attribute's payload is interpreted as
follows:

- length<4? Bounce.
- length>=4? First u32 is queue limit, the rest is ignored garbage.

So you have to stash any new stuff into the now-ignored garbage, thus
changing behavior. The I-think-safe approach that I mention above is
passing limit=0 and serializing the real attribute tree into the garbage
area. So if limit=0 and the garbage parses as an atrribute tree, use
that, otherwise it's really just limit of 0 and some garbage.

I think this is safe, because the combination of limit=0 (already an
improbable configuration) and parseable garbage is unlikely to arise by
mistake. But it's kludgy.

Or maybe the flag could be in the message header, but that seems wrong
given we are talking about extending one qdisc kind.
Jamal Hadi Salim Jan. 17, 2024, 8:44 p.m. UTC | #26
Hi Petr,

On Tue, Jan 16, 2024 at 7:03 AM Petr Machata <petrm@nvidia.com> wrote:
>
> Jamal Hadi Salim <jhs@mojatatu.com> writes:
>
> > On Fri, Jan 12, 2024 at 11:55 AM Petr Machata <petrm@nvidia.com> wrote:
> >>
> >>
> >> Jamal Hadi Salim <jhs@mojatatu.com> writes:
> >>
> >> > On Thu, Jan 11, 2024 at 6:22 PM Petr Machata <me@pmachata.org> wrote:
> >> >>
> >> >> Jamal Hadi Salim <jhs@mojatatu.com> writes:
> >> >> > So a packet tagged for early drop will end up being processed in some
> >> >> > filter chain with some specified actions. So in the case of offload,
> >> >> > does it mean early drops will be sent to the kernel and land at the
> >> >> > specific chain? Also trying to understand (in retrospect, not armchair
> >> >>
> >> >> For offload, the idea is that the silicon is configured to do the things
> >> >> that the user configures at the qevent.
> >> >
> >> > Ok, so none of these qevents related packets need to escape to s/w then.
> >>
> >> Like yeah, I suppose drops and marks shouldn't be /that/ numerous, so
> >> maybe you can get away with the cookie / schedule from driver that you
> >> talk about below. But you've got dozens of ports each of which is
> >> capable of flooding the PCI on its own. I can see this working for like
> >> connection tracking, or some other sort of miss that gets quickly
> >> installed in HW and future traffic stays there. I'm not sure it makes
> >> sense as a strategy for basically unbounded amounts of drops.
> >
> > I misunderstood this to be more to trigger the control plane to do
> > something like adjust buffer sizes etc, etc. But if you are getting a
> > lot of these packets like you said it will be tantamount to a DoS.
> > For monitoring purposes (which seems to be the motivation here) I
> > wouldnt send pkts to the kernel.
> > Mirroring that you pointed to is sane, possibly even to a remote
> > machine dedicated for this makes more sense. BTW, is there a way for
> > the receiving end (where the packet is mirrored to) to know if the
> > packet they received was due to a RED drop or threshold excess/ecn
> > mark?
>
> Spectrum supports these mirror headers. They are sandwiched between the
> encapsulation headers of the mirrored packets and the actual packet, and
> tell you the RX and TX ports, what was the mirror reason, some
> timestamps, etc. Without this, you need to encode the information into
> like parts of IPv6 or what not.
>

Makes sense given the available options.

> But it's a proprietary format. It could be expressed as a netdevice, or
> maybe a special version in the ERSPAN netdevice. But there's no
> standard, or public documentation, no open source tools that would
> consume it. AFAIK. So adding code to ERSPAN to support it looks like a
> non-starter.
>

This is the kind of problem that P4 is well suited to solve. We are
stuck with current kernel implementations and current standards.
Spectrum can do so much more than the current ERSPAN standard
provides. It can do so much more than the current kernel code can
express. I am sure there are at least 5 people who want this feature
we are talking about but status quo says the only choice to make this
acceptable is to convince the masses (meaning likely years of chasing)
and do the ERSPAN standard mods alongside kernel and user space
implementation. And then in standards bodies like IEEE, IETF, etc you
politik to death trying to get people to sign onto your proposal with
+1s (sorry for the rant, but it is one reason i stopped going there).
Alternatively, just have those 5 people write a P4 program in a very
short time and not bother anybody else...

> I was also toying with approaches around some push_header action, but it
> all seemed too clumsy. Also unclear how to express things like
> timestamps, port labels, mirror reasons, latency, queue depths... It's
> very, very HW-oriented header.
>

So at one point Yotam Gigi was trying to use IFE for i think this,
which makes sense since you can add arbitrary metadata after the
ethernet header and transport it to a remote machine on the wire or
terminate on local cpu. You may want to look at that again because he
seemed to think it works closely to the h/w approach. Of course it
suffers from the same "fixed implementation" so both the producer and
consumer would have to be taught what those metadatum mean i.e the
kernel and iproute2 code updates will be required. IIRC, the Cumulus
people pushed back and converted this into traps coming out of the
spectrum in order to make use of standardized tooling to avoid
retraining (I think it was sflow as the consumer).

> I suppose with P4 there might be a way for the user to describe the
> encapsulation and for the driver to recognize it as the mirror header
> and offload properly. With some squinting: frankly I don't see anybody
> encoding things like queue depths, or anybody writing the driver code to
> recognize that indeed that's what it is.
>

You can define whatever metadata you want and make your datapath
generate it. As long as you also have a P4 program on the consumer to
decode it. My fingers cant resist, so let me show extract from a
simple test program we have for P4TC which transports skb->mark
between machines:

// define the headers
header customl2_t {
    bit<32> skbmark;
    bit<16> anothermetadatum
    bit<16> etherType;
}
..
struct headers_t {
    ethernet_t   ethernet;
    customl2_t   customl2;
    ipv4_t       ip;
    //rest is considered payload
}
...

//the parser
 ....
    state parse_ethernet {
        pkt.extract(hdr.ethernet);
        transition select(hdr.ethernet.etherType) {
            ETHERTYPE_IPV4: parse_ipv4;
            ETHERTYPE_CUSTOML2: parse_customl2;
            default: reject;
        }
    }
    state parse_customl2 {
        pkt.extract(hdr.customl2);
        transition select(hdr.customl2.etherType) {
            ETHERTYPE_IPV4: parse_ipv4;
            default: reject;
        }
    }
    state parse_ipv4 {
        pkt.extract(hdr.ip);
        transition accept;
    }

...

Then you have a match action table definition which in our case
matches on dst ip, but you can make it match whatever you want. Add a
couple of actions; one to push and another to pop the headers. Here's
one that pushes (not showing the other part that pops and sets
skb->mark):

action push_custom(@tc_type("macaddr") bit<48> dmac, @tc_type("dev")
PortId_t port)
         hdr.ethernet.dstAddr = dmac;
         hdr.customl2.etherType = hdr.ethernet.etherType;
         hdr.ethernet.etherType = ETHERTYPE_CUSTOML2;
         hdr.customl2.skbmark = meta.mark;
         hdr.customl2.setValid();
         send_to_port(port);
}

And at runtime:
tc p4ctrl create mycustoml2/table/mytable dstAddr 10.99.0.1/32 \
action push_custom param dmac 66:33:34:35:46:01 param port eno1

And when hw supports it, you could just say skip_sw above..
That's it - no need to argue about tomatos or tomateos on the mailing
list for the next 6 months or longer.

> >> >> In the particular case of mlxsw, we only permit adding one chain with
> >> >> one filter, which has to be a matchall, and the action has to be either
> >> >> a mirred or trap, plus it needs to have hw_stats disabled. Because the
> >> >> HW is limited like that, it can't do much in that context.
> >> >>
> >> >
> >> > Understood. The challenge right now is we depend on a human to know
> >> > what the h/w constraints are and on misconfig you reject the bad
> >> > config. The other model is you query for the h/w capabilities and then
> >> > only allow valid config (in case of skip_sw)
> >>
> >> This is not new with qevents though. The whole TC offload is an exercise
> >> in return -EOPNOTSUPP. And half VXLAN, and IPIP, and, and. But I guess
> >> that's your point, that this sucks as a general approach?
> >
> > Current scheme works fine - and in some cases is unavoidable. In P4 it
> > is different because resources are reserved; so by the time the
> > request hits the kernel, the tc layer already knows if such a request
> > will succeed or not if sent to the driver. This would be hard to do if
> > you have dynamic clever resourcing where you are continuously
> > adjusting the hardware resources - either growing or shrinking them
> > (as i think the mlxsw works); in such a case sending to the driver
> > makes more sense.
>
> The HW does some of that, in the sense that there are shared tables used
> by various stages of the pipeline for different things, so more of one
> means less of another.

Makes sense. But assumedly in this case (for P4), you will know how
much to reserve for one thing or other. And let your compiler backend
decide what it means to allow 1024 table entries... IOW, it is not an
opportunistic/runtime-optimizing/best effort but rather
access-controlled approach.

>The driver is fairly simplistic though, most of
> the time anyway, and just configures what the user tells it to. The only
> exception might be ACLs, but even there it's nothing like noticing, oh,
> none of my rules so far uses header XYZ, so I can use this denser way of
> encoding the keys, to satisfy this request to cram in more stuff. If as
> a user you want this effect, you are supposed to use chain templates to
> give the driver a better idea of the shape of things to come.

Yes, that kind of optimization is one approach (but not the one taken
by the P4 spec).

> >> >> > lawyering): why was a block necessary? feels like the goto chain
> >> >> > action could have worked, no? i.e something like: qevent early_drop
> >> >> > goto chain x.. Is the block perhaps tied to something in the h/w or is
> >> >> > it just some clever metainfo that is used to jump to tc block when the
> >> >> > exceptions happen?
> >> >>
> >> >> So yeah, blocks are super fancy compared to what the HW can actually do.
> >> >
> >> > My curiosity on blocks was more if the ports added to a block are
> >> > related somehow from a grouping perspective and in particular to the
> >> > h/w.
> >>
> >> Not necessarily. You could have two nics from different vendors with
> >> different offloading capabilities, and still share the same block with
> >> both, so long as you use the subset of features common to both.
> >
> > True, for the general case. The hard part is dealing with exceptions,
> > for example, lets say we have two vendors who can both mirror in
> > hardware:
> > if you have a rule to mirror from vendor A port 0 to vendor B port 1,
> > it cant be done from Vendor A's ASIC directly. It would work if you
> > made it an exception that goes via kernel and some tc chain there
> > picks it up and sends it to vendor B's port1. You can probably do some
> > clever things like have direct DMA from vendor A to B, but that would
> > be very speacilized code.
>
> Yep.
>
> >> > For example in P4TC, at a hardware level we are looking to use
> >> > blocks to group devices that are under one PCI-dev (PF, VFs, etc) -
> >> > mostly this is because the underlying ASIC has them related already
> >> > (and if for example you said to mirror from one to the other it would
> >> > work whether in s/w or h/w)
> >>
> >> Yeah, I've noticed the "flood to other ports" patches. This is an
> >> interesting use case, but not one that was necessarily part of the
> >> blocks abstraction IMHO. I think originally blocks were just a sharing
> >> mechanism, which was the mindset with which I worked on qevents.
> >
> > Basically now if you attach a mirror to a block all netdevs on that
> > block will receive a copy, note sure if that will have any effect on
> > what you are doing.
>
> I'm not sure what semantics of mirroring to a qevent block are, but
> beyond that, it shouldn't impact us. Such rule would be punted from HW
> datapath, because there's no netdevice, and we demand a netdevice (plus
> conditions on what the netdevice is allowed to be).

Ok, for the hardware path i guess it's however you abstract it. But if
someone did this in sw as such:
--
tc qdisc add dev ens10 egress_block 10 clsact
tc qdisc add dev ens9 egress_block 10 clsact
tc qdisc add dev ens8 egress_block 10 clsact
tc filter add block 22 protocol ip pref 25 \
  matchall action mirred egress mirror blockid 10
tc qdisc add dev eth0 parent 10:7 handle 107: red limit 1000000 min
500000 max 500001 probability 1.0 avpkt 8000 burst 63 qevent
early_drop block 22
---
Then all of ens8-10 will get a copy of the packet on the qevent.

> > Note:
> > Sharing of tables, actions, etc (or as P4 calls them
> > programs/pipelines) for all the ports in a block is still in play.
> > Challenge: assume vendor A from earlier has N ports which are part of
> > block 11: There is no need for vendor A's driver to register its tc
> > callback everytime one of the N ports get added into the block, one
> > should be enough (at least that is the thought process in the
> > discussions for offloads with P4TC).
> > Vendor A and B both register once, then you can replay the same rule
> > request to both. For actions that dont require cross-ASIC activity
> > (example a "drop")  it works fine because it will be localized.
>
> Sounds reasonable so far, but reads like something that misses a
> "however, consider" or "now the challenge is to" :)
>

Yeah, lots of discussions are still going on there in our biweekly meetings.

> >> >> Blocks have no such issues, they are self-contained. They are heavy
> >> >> compared to what we need, true. But that's not really an issue -- the
> >> >> driver can bounce invalid configuration just fine. And there's no risk
> >> >> of making another driver or SW datapath user unhappy because their set
> >> >> of constrants is something we didn't anticipate. Conceptually it's
> >> >> cleaner than if we had just one action / one rule / one chain, because
> >> >> you can point at e.g. ingress_block and say, this, it's the same as
> >> >> this, except instead of all ingress packets, only those that are
> >> >> early_dropped are passed through.
> >> >
> >> > /nod
> >> >
> >> >> BTW, newer Spectrum chips actually allow (some?) ACL matching to run in
> >> >> the qevent context, so we may end up relaxing the matchall requirement
> >> >> in the future and do a more complex offload of qevents.
> >> >
> >> > In P4 a lot of this could be modelled to one's liking really - but
> >> > that's a different discussion.
> >>
> >> Pretty sure :)
> >>
> >> BTW Spectrum's ACLs are really fairly close to what flower is doing.
> >> At least that's the ABI we get to work with. Not really a good target
> >> for u32 at all. So I suspect that P4 would end up looking remarkably
> >> flower-like, no offsets or layouts, most of it just pure symbolics.
> >>
> >
> > Its a "fixed" ASIC, so it is expected. But: One should be able to
> > express the Spectrum's ACLs or even the whole datapath as a P4 program
>
> Yeah, I think so, too.
>
> > and i dont see why it wouldnt work with P4TC. Matty has at least once
> > in the past, if i am not mistaken, pitched such an idea.
>
> I don't see why it wouldn't work. What I'm saying is that at least the
> ACL bits will just look fairly close to flower, because that's the HW we
> are working with. And then the benefits of P4 are not as clear, because
> flower is already here and also looks like flower.
>

That's fine mostly because the ASIC doesnt change once it is cast. If
however you want to expose a new field that the ASIC already supports,
the problem with flower is it is a "fixed datapath". Adding another
header requires changing the kernel (tc and flowdisector, driver) and
iproute2..

> On the upside, we would get more flexibility with different matching
> approaches. Mixing different matchers is awkward (say flower + basic
> might occasionally be useful), so there's this tendency to cram
> everything into flower.
>

It's a good hammer and if a lot of things can be imagined to be nails,
it works great ;->

> I mentioned the mirror headers above, that's one area where TC just
> doesn't have the tools we need.
>

Agreed - but with P4 you have a way out IMO.

> >> >> >> > Also: Is it only Red or other qdiscs could behave this way?
> >> >> >>
> >> >> >> Currently only red supports any qevents at all, but in principle the
> >> >> >> mechanism is reusable. With my mlxsw hat on, an obvious next candidate
> >> >> >> would be tail_drop on FIFO qdisc.
> >> >> >
> >> >> > Sounds cool. I can see use even for s/w only dpath.
> >> >>
> >> >> FIFO is tricky to extend BTW. I wrote some patches way back before it
> >> >> got backburner'd, and the only payloads that are currently bounced are
> >> >> those that are <=3 bytes. Everything else is interpreted to mean
> >> >> something, extra garbage is ignored, etc. Fun fun.
> >> >
> >> > FIFO may have undergone too many changes to be general purpose anymore
> >> > (thinking of the attempts to make it lockless in particular) but I
> >> > would think that all you need is to share the netlink attributes with

> >> > the h/w, no? i.e you need a new attribute to enable qevents on fifo.
> >>
> >> Like the buffer model of our HW is just wildly incompatible with qdiscs'
> >> ideas. But that's not what I wanted to say. Say you want to add a new
> >> configuration knob to FIFO, e.g. the tail_drop qevent. How do you
> >> configure it?
> >>
> >> For reasonable qdiscs it's a simple matter of adding new attributes. But
> >> with FIFO, there _are_ no attributes. It's literally just if no nlattr,
> >> use defaults, otherwise bounce if the payload size < 4, else first 4
> >> bytes are limit and ignore the rest. Done & done. No attributes, and
> >> nowhere to put them in a backward compatible manner. So extending it is
> >> a bit hacky. (Can be done safely I believe, but it's not very pretty.)
> >
> > I believe you but will have to look to make sense. There's at least
> > one attribute you mention above carried in some data structure in a
> > TLV (if i am not mistaken it is queue size either in packet or bytes,
> > depending on which fifo mode you are running). You are saying you cant
> > add another one or a flag at least?
>
> Not backward-compatibly. The sole attribute's payload is interpreted as
> follows:
>

Ok, just took a quick look..
> - length<4? Bounce.
> - length>=4? First u32 is queue limit, the rest is ignored garbage.

I think you mean this part:
                struct tc_fifo_qopt *ctl = nla_data(opt);

                if (nla_len(opt) < sizeof(*ctl))
                        return -EINVAL;

                sch->limit = ctl->limit;

Yeah, cant stash a new attribute there unfortunately. TCA_OPTIONS only
has tc_fifo_qopt. Would have been easier if TCA_OPTIONS was nested
(like other qdiscs such as TBF) - then you could add a new TLV.

> So you have to stash any new stuff into the now-ignored garbage, thus
> changing behavior. The I-think-safe approach that I mention above is
> passing limit=0 and serializing the real attribute tree into the garbage
> area. So if limit=0 and the garbage parses as an atrribute tree, use
> that, otherwise it's really just limit of 0 and some garbage.
>
> I think this is safe, because the combination of limit=0 (already an
> improbable configuration) and parseable garbage is unlikely to arise by
> mistake. But it's kludgy.
>
> Or maybe the flag could be in the message header, but that seems wrong
> given we are talking about extending one qdisc kind.

I can see the dilema - and if i understood correctly what you are
suggesting, something like:

if (nla_len(opt) == sizeof(*ctl))
    do the old thing here
else if (nla_len(opt) == sizeof(*mynewstruct)
   do the new thing here
else
      invalid..


cheers,
jamal
Petr Machata Jan. 19, 2024, 4:28 p.m. UTC | #27
Jamal Hadi Salim <jhs@mojatatu.com> writes:

> On Tue, Jan 16, 2024 at 7:03 AM Petr Machata <petrm@nvidia.com> wrote:
>>
>> Spectrum supports these mirror headers. They are sandwiched between the
>> encapsulation headers of the mirrored packets and the actual packet, and
>> tell you the RX and TX ports, what was the mirror reason, some
>> timestamps, etc. Without this, you need to encode the information into
>> like parts of IPv6 or what not.
>>
>> But it's a proprietary format. It could be expressed as a netdevice, or
>> maybe a special version in the ERSPAN netdevice. But there's no
>> standard, or public documentation, no open source tools that would
>> consume it. AFAIK. So adding code to ERSPAN to support it looks like a
>> non-starter.
>
> This is the kind of problem that P4 is well suited to solve. We are
> stuck with current kernel implementations and current standards.
> Spectrum can do so much more than the current ERSPAN standard
> provides. It can do so much more than the current kernel code can
> express. I am sure there are at least 5 people who want this feature
> we are talking about but status quo says the only choice to make this
> acceptable is to convince the masses (meaning likely years of chasing)
> and do the ERSPAN standard mods alongside kernel and user space
> implementation. And then in standards bodies like IEEE, IETF, etc you
> politik to death trying to get people to sign onto your proposal with
> +1s (sorry for the rant, but it is one reason i stopped going there).
> Alternatively, just have those 5 people write a P4 program in a very
> short time and not bother anybody else...
>
>> I was also toying with approaches around some push_header action, but it
>> all seemed too clumsy. Also unclear how to express things like
>> timestamps, port labels, mirror reasons, latency, queue depths... It's
>> very, very HW-oriented header.
>>
>
> So at one point Yotam Gigi was trying to use IFE for i think this,
> which makes sense since you can add arbitrary metadata after the
> ethernet header and transport it to a remote machine on the wire or
> terminate on local cpu. You may want to look at that again because he

I did look at IFE way back, but it looks like I should do that again.

> seemed to think it works closely to the h/w approach. Of course it
> suffers from the same "fixed implementation" so both the producer and
> consumer would have to be taught what those metadatum mean i.e the
> kernel and iproute2 code updates will be required. IIRC, the Cumulus
> people pushed back and converted this into traps coming out of the
> spectrum in order to make use of standardized tooling to avoid
> retraining (I think it was sflow as the consumer).
>
>> I suppose with P4 there might be a way for the user to describe the
>> encapsulation and for the driver to recognize it as the mirror header
>> and offload properly. With some squinting: frankly I don't see anybody
>> encoding things like queue depths, or anybody writing the driver code to
>> recognize that indeed that's what it is.
>
> You can define whatever metadata you want and make your datapath
> generate it. As long as you also have a P4 program on the consumer to
> decode it. My fingers cant resist, so let me show extract from a
> simple test program we have for P4TC which transports skb->mark
> between machines:
>
> // define the headers
> header customl2_t {
>     bit<32> skbmark;
>     bit<16> anothermetadatum
>     bit<16> etherType;
> }
> ..
> struct headers_t {
>     ethernet_t   ethernet;
>     customl2_t   customl2;
>     ipv4_t       ip;
>     //rest is considered payload
> }
> ...

So let's talk about the queue depth in particular. How would the driver
recognize the field? Does it need to "parse the parser" to figure it
out? Or use magic field names? Use EtherType to just know what is meant
by the header? I don't see how to express the idea in the abstract, for
the driver to recognize it and say, yeah, we can in fact offload this
program, because it the abstract description matches what the HW is
doing for mirror headers version XYZ.

> //the parser
>  ....
>     state parse_ethernet {
>         pkt.extract(hdr.ethernet);
>         transition select(hdr.ethernet.etherType) {
>             ETHERTYPE_IPV4: parse_ipv4;
>             ETHERTYPE_CUSTOML2: parse_customl2;
>             default: reject;
>         }
>     }
>     state parse_customl2 {
>         pkt.extract(hdr.customl2);
>         transition select(hdr.customl2.etherType) {
>             ETHERTYPE_IPV4: parse_ipv4;
>             default: reject;
>         }
>     }
>     state parse_ipv4 {
>         pkt.extract(hdr.ip);
>         transition accept;
>     }
>
> ...
>
> Then you have a match action table definition which in our case
> matches on dst ip, but you can make it match whatever you want. Add a
> couple of actions; one to push and another to pop the headers. Here's
> one that pushes (not showing the other part that pops and sets
> skb->mark):
>
> action push_custom(@tc_type("macaddr") bit<48> dmac, @tc_type("dev")
> PortId_t port)
>          hdr.ethernet.dstAddr = dmac;
>          hdr.customl2.etherType = hdr.ethernet.etherType;
>          hdr.ethernet.etherType = ETHERTYPE_CUSTOML2;
>          hdr.customl2.skbmark = meta.mark;
>          hdr.customl2.setValid();
>          send_to_port(port);
> }
>
> And at runtime:
> tc p4ctrl create mycustoml2/table/mytable dstAddr 10.99.0.1/32 \
> action push_custom param dmac 66:33:34:35:46:01 param port eno1
>
> And when hw supports it, you could just say skip_sw above..
> That's it - no need to argue about tomatos or tomateos on the mailing
> list for the next 6 months or longer.

OK, thanks for the primer, I'll try to carve out some time to look at it
more closely.

>> I'm not sure what semantics of mirroring to a qevent block are, but
>> beyond that, it shouldn't impact us. Such rule would be punted from HW
>> datapath, because there's no netdevice, and we demand a netdevice (plus
>> conditions on what the netdevice is allowed to be).
>
> Ok, for the hardware path i guess it's however you abstract it. But if
> someone did this in sw as such:
> --
> tc qdisc add dev ens10 egress_block 10 clsact
> tc qdisc add dev ens9 egress_block 10 clsact
> tc qdisc add dev ens8 egress_block 10 clsact
> tc filter add block 22 protocol ip pref 25 \
>   matchall action mirred egress mirror blockid 10
> tc qdisc add dev eth0 parent 10:7 handle 107: red limit 1000000 min
> 500000 max 500001 probability 1.0 avpkt 8000 burst 63 qevent
> early_drop block 22
> ---
> Then all of ens8-10 will get a copy of the packet on the qevent.

I meant the other way around. Say someone mirrors to blockid 22.
Does the packet go to eth0?

> > > Its a "fixed" ASIC, so it is expected. But: One should be able to
> > > express the Spectrum's ACLs or even the whole datapath as a P4 program
> >
> > Yeah, I think so, too.
> >
> > > and i dont see why it wouldnt work with P4TC. Matty has at least once
> > > in the past, if i am not mistaken, pitched such an idea.
> >
> > I don't see why it wouldn't work. What I'm saying is that at least the
> > ACL bits will just look fairly close to flower, because that's the HW we
> > are working with. And then the benefits of P4 are not as clear, because
> > flower is already here and also looks like flower.
>
> That's fine mostly because the ASIC doesnt change once it is cast. If
> however you want to expose a new field that the ASIC already supports,
> the problem with flower is it is a "fixed datapath". Adding another
> header requires changing the kernel (tc and flowdisector, driver) and
> iproute2..
>
> > On the upside, we would get more flexibility with different matching
> > approaches. Mixing different matchers is awkward (say flower + basic
> > might occasionally be useful), so there's this tendency to cram
> > everything into flower.
> >
> 
> It's a good hammer and if a lot of things can be imagined to be nails,
> it works great ;->
> 
> > I mentioned the mirror headers above, that's one area where TC just
> > doesn't have the tools we need.
> >
> 
> Agreed - but with P4 you have a way out IMO.

Yeah, that's why I'm mentioning it. Those mirror headers are the closest
to where we would want... P4 or something P4-ish, to have the sort of
flexibility we need. Because the alternatives are non-starters.
(Though I'll check out IFE once more.)

> > > I believe you but will have to look to make sense. There's at least
> > > one attribute you mention above carried in some data structure in a
> > > TLV (if i am not mistaken it is queue size either in packet or bytes,
> > > depending on which fifo mode you are running). You are saying you cant
> > > add another one or a flag at least?
> >
> > Not backward-compatibly. The sole attribute's payload is interpreted as
> > follows:
> >
> 
> Ok, just took a quick look..
> > - length<4? Bounce.
> > - length>=4? First u32 is queue limit, the rest is ignored garbage.
> 
> I think you mean this part:
>                 struct tc_fifo_qopt *ctl = nla_data(opt);
> 
>                 if (nla_len(opt) < sizeof(*ctl))
>                         return -EINVAL;
> 
>                 sch->limit = ctl->limit;
> 
> Yeah, cant stash a new attribute there unfortunately. TCA_OPTIONS only
> has tc_fifo_qopt. Would have been easier if TCA_OPTIONS was nested
> (like other qdiscs such as TBF) - then you could add a new TLV.

Yep.

> > So you have to stash any new stuff into the now-ignored garbage, thus
> > changing behavior. The I-think-safe approach that I mention above is
> > passing limit=0 and serializing the real attribute tree into the garbage
> > area. So if limit=0 and the garbage parses as an atrribute tree, use
> > that, otherwise it's really just limit of 0 and some garbage.
> >
> > I think this is safe, because the combination of limit=0 (already an
> > improbable configuration) and parseable garbage is unlikely to arise by
> > mistake. But it's kludgy.
> >
> > Or maybe the flag could be in the message header, but that seems wrong
> > given we are talking about extending one qdisc kind.
> 
> I can see the dilema - and if i understood correctly what you are
> suggesting, something like:
> 
> if (nla_len(opt) == sizeof(*ctl))
>     do the old thing here
> else if (nla_len(opt) == sizeof(*mynewstruct))
>     do the new thing here
> else
>     invalid..

Basically this, but handle the case that a broken userspace is sending
payload such that (nla_len(opt) == sizeof(*mynewstruct)), but only
provides the first four bytes with the queue limit, and the rest is
garbage.
Jamal Hadi Salim Jan. 21, 2024, 6:32 p.m. UTC | #28
On Fri, Jan 19, 2024 at 11:51 AM Petr Machata <petrm@nvidia.com> wrote:
>
>
> Jamal Hadi Salim <jhs@mojatatu.com> writes:
>
> > On Tue, Jan 16, 2024 at 7:03 AM Petr Machata <petrm@nvidia.com> wrote:
> >>
> >> Spectrum supports these mirror headers. They are sandwiched between the
> >> encapsulation headers of the mirrored packets and the actual packet, and
> >> tell you the RX and TX ports, what was the mirror reason, some
> >> timestamps, etc. Without this, you need to encode the information into
> >> like parts of IPv6 or what not.
> >>
> >> But it's a proprietary format. It could be expressed as a netdevice, or
> >> maybe a special version in the ERSPAN netdevice. But there's no
> >> standard, or public documentation, no open source tools that would
> >> consume it. AFAIK. So adding code to ERSPAN to support it looks like a
> >> non-starter.
> >
> > This is the kind of problem that P4 is well suited to solve. We are
> > stuck with current kernel implementations and current standards.
> > Spectrum can do so much more than the current ERSPAN standard
> > provides. It can do so much more than the current kernel code can
> > express. I am sure there are at least 5 people who want this feature
> > we are talking about but status quo says the only choice to make this
> > acceptable is to convince the masses (meaning likely years of chasing)
> > and do the ERSPAN standard mods alongside kernel and user space
> > implementation. And then in standards bodies like IEEE, IETF, etc you
> > politik to death trying to get people to sign onto your proposal with
> > +1s (sorry for the rant, but it is one reason i stopped going there).
> > Alternatively, just have those 5 people write a P4 program in a very
> > short time and not bother anybody else...
> >
> >> I was also toying with approaches around some push_header action, but it
> >> all seemed too clumsy. Also unclear how to express things like
> >> timestamps, port labels, mirror reasons, latency, queue depths... It's
> >> very, very HW-oriented header.
> >>
> >
> > So at one point Yotam Gigi was trying to use IFE for i think this,
> > which makes sense since you can add arbitrary metadata after the
> > ethernet header and transport it to a remote machine on the wire or
> > terminate on local cpu. You may want to look at that again because he
>
> I did look at IFE way back, but it looks like I should do that again.
>

Ping me if you need details.

> > seemed to think it works closely to the h/w approach. Of course it
> > suffers from the same "fixed implementation" so both the producer and
> > consumer would have to be taught what those metadatum mean i.e the
> > kernel and iproute2 code updates will be required. IIRC, the Cumulus
> > people pushed back and converted this into traps coming out of the
> > spectrum in order to make use of standardized tooling to avoid
> > retraining (I think it was sflow as the consumer).
> >
> >> I suppose with P4 there might be a way for the user to describe the
> >> encapsulation and for the driver to recognize it as the mirror header
> >> and offload properly. With some squinting: frankly I don't see anybody
> >> encoding things like queue depths, or anybody writing the driver code to
> >> recognize that indeed that's what it is.
> >
> > You can define whatever metadata you want and make your datapath
> > generate it. As long as you also have a P4 program on the consumer to
> > decode it. My fingers cant resist, so let me show extract from a
> > simple test program we have for P4TC which transports skb->mark
> > between machines:
> >
> > // define the headers
> > header customl2_t {
> >     bit<32> skbmark;
> >     bit<16> anothermetadatum
> >     bit<16> etherType;
> > }
> > ..
> > struct headers_t {
> >     ethernet_t   ethernet;
> >     customl2_t   customl2;
> >     ipv4_t       ip;
> >     //rest is considered payload
> > }
> > ...
>
> So let's talk about the queue depth in particular. How would the driver
> recognize the field?

Your driver doesnt need to recognize anything for either P4TC or IFE.
Does your driver need to know this detail for something it does maybe?
Post offload, on ingress into the kernel:
For P4TC, it will be taken care of by either the XDP code or tc code
(which is compiler generated) i.e no pre-existing kernel code
required.
For IFE, because it is a standard(its an RFC) you will have to write
kernel code to recognize what metadatum called "queudepth" means
(could be a kernel module) to conform to the IFE standard encoding
where each metadata you add is encapped in a TLV. You wouldnt need a
TLV if you prescribe it with P4.

> Does it need to "parse the parser" to figure it
> out? Or use magic field names? Use EtherType to just know what is meant
> by the header?

The example I showed used an ethertype to indicate that "a customer
header follows". But really you can do whatever you want.

> I don't see how to express the idea in the abstract, for
> the driver to recognize it and say, yeah, we can in fact offload this
> program, because it the abstract description matches what the HW is
> doing for mirror headers version XYZ.
>


For the policy offload side, it's mostly the tc ndo.
If you showed me your custom format, I am sure it can be expressed in
P4. Write P4 for mirror header version XYZ and tommorow you can write
a different one for version ABC. Then you generate the code and attach
it to a tc block which knows how to interpret the headers and maybe do
something with them. The trivial example i showed would strip off
customhdr->skbmark and write it to skb->mark.

> > //the parser
> >  ....
> >     state parse_ethernet {
> >         pkt.extract(hdr.ethernet);
> >         transition select(hdr.ethernet.etherType) {
> >             ETHERTYPE_IPV4: parse_ipv4;
> >             ETHERTYPE_CUSTOML2: parse_customl2;
> >             default: reject;
> >         }
> >     }
> >     state parse_customl2 {
> >         pkt.extract(hdr.customl2);
> >         transition select(hdr.customl2.etherType) {
> >             ETHERTYPE_IPV4: parse_ipv4;
> >             default: reject;
> >         }
> >     }
> >     state parse_ipv4 {
> >         pkt.extract(hdr.ip);
> >         transition accept;
> >     }
> >
> > ...
> >
> > Then you have a match action table definition which in our case
> > matches on dst ip, but you can make it match whatever you want. Add a
> > couple of actions; one to push and another to pop the headers. Here's
> > one that pushes (not showing the other part that pops and sets
> > skb->mark):
> >
> > action push_custom(@tc_type("macaddr") bit<48> dmac, @tc_type("dev")
> > PortId_t port)
> >          hdr.ethernet.dstAddr = dmac;
> >          hdr.customl2.etherType = hdr.ethernet.etherType;
> >          hdr.ethernet.etherType = ETHERTYPE_CUSTOML2;
> >          hdr.customl2.skbmark = meta.mark;
> >          hdr.customl2.setValid();
> >          send_to_port(port);
> > }
> >
> > And at runtime:
> > tc p4ctrl create mycustoml2/table/mytable dstAddr 10.99.0.1/32 \
> > action push_custom param dmac 66:33:34:35:46:01 param port eno1
> >
> > And when hw supports it, you could just say skip_sw above..
> > That's it - no need to argue about tomatos or tomateos on the mailing
> > list for the next 6 months or longer.
>
> OK, thanks for the primer, I'll try to carve out some time to look at it
> more closely.
>
> >> I'm not sure what semantics of mirroring to a qevent block are, but
> >> beyond that, it shouldn't impact us. Such rule would be punted from HW
> >> datapath, because there's no netdevice, and we demand a netdevice (plus
> >> conditions on what the netdevice is allowed to be).
> >
> > Ok, for the hardware path i guess it's however you abstract it. But if
> > someone did this in sw as such:
> > --
> > tc qdisc add dev ens10 egress_block 10 clsact
> > tc qdisc add dev ens9 egress_block 10 clsact
> > tc qdisc add dev ens8 egress_block 10 clsact
> > tc filter add block 22 protocol ip pref 25 \
> >   matchall action mirred egress mirror blockid 10
> > tc qdisc add dev eth0 parent 10:7 handle 107: red limit 1000000 min
> > 500000 max 500001 probability 1.0 avpkt 8000 burst 63 qevent
> > early_drop block 22
> > ---
> > Then all of ens8-10 will get a copy of the packet on the qevent.
>
> I meant the other way around. Say someone mirrors to blockid 22.
> Does the packet go to eth0?
>

No, it shouldnt. One of the potentials for loops is if you say mirror
from ens10 to block 22, so we do have rules to avoid the port equal to
ingressing port.

> > > > Its a "fixed" ASIC, so it is expected. But: One should be able to
> > > > express the Spectrum's ACLs or even the whole datapath as a P4 program
> > >
> > > Yeah, I think so, too.
> > >
> > > > and i dont see why it wouldnt work with P4TC. Matty has at least once
> > > > in the past, if i am not mistaken, pitched such an idea.
> > >
> > > I don't see why it wouldn't work. What I'm saying is that at least the
> > > ACL bits will just look fairly close to flower, because that's the HW we
> > > are working with. And then the benefits of P4 are not as clear, because
> > > flower is already here and also looks like flower.
> >
> > That's fine mostly because the ASIC doesnt change once it is cast. If
> > however you want to expose a new field that the ASIC already supports,
> > the problem with flower is it is a "fixed datapath". Adding another
> > header requires changing the kernel (tc and flowdisector, driver) and
> > iproute2..
> >
> > > On the upside, we would get more flexibility with different matching
> > > approaches. Mixing different matchers is awkward (say flower + basic
> > > might occasionally be useful), so there's this tendency to cram
> > > everything into flower.
> > >
> >
> > It's a good hammer and if a lot of things can be imagined to be nails,
> > it works great ;->
> >
> > > I mentioned the mirror headers above, that's one area where TC just
> > > doesn't have the tools we need.
> > >
> >
> > Agreed - but with P4 you have a way out IMO.
>
> Yeah, that's why I'm mentioning it. Those mirror headers are the closest
> to where we would want... P4 or something P4-ish, to have the sort of
> flexibility we need. Because the alternatives are non-starters.
> (Though I'll check out IFE once more.)
>

Both should work. With P4 advantage is: you dont have to upstream
anything. You can just publish the P4 program.

> > > > I believe you but will have to look to make sense. There's at least
> > > > one attribute you mention above carried in some data structure in a
> > > > TLV (if i am not mistaken it is queue size either in packet or bytes,
> > > > depending on which fifo mode you are running). You are saying you cant
> > > > add another one or a flag at least?
> > >
> > > Not backward-compatibly. The sole attribute's payload is interpreted as
> > > follows:
> > >
> >
> > Ok, just took a quick look..
> > > - length<4? Bounce.
> > > - length>=4? First u32 is queue limit, the rest is ignored garbage.
> >
> > I think you mean this part:
> >                 struct tc_fifo_qopt *ctl = nla_data(opt);
> >
> >                 if (nla_len(opt) < sizeof(*ctl))
> >                         return -EINVAL;
> >
> >                 sch->limit = ctl->limit;
> >
> > Yeah, cant stash a new attribute there unfortunately. TCA_OPTIONS only
> > has tc_fifo_qopt. Would have been easier if TCA_OPTIONS was nested
> > (like other qdiscs such as TBF) - then you could add a new TLV.
>
> Yep.
>
> > > So you have to stash any new stuff into the now-ignored garbage, thus
> > > changing behavior. The I-think-safe approach that I mention above is
> > > passing limit=0 and serializing the real attribute tree into the garbage
> > > area. So if limit=0 and the garbage parses as an atrribute tree, use
> > > that, otherwise it's really just limit of 0 and some garbage.
> > >
> > > I think this is safe, because the combination of limit=0 (already an
> > > improbable configuration) and parseable garbage is unlikely to arise by
> > > mistake. But it's kludgy.
> > >
> > > Or maybe the flag could be in the message header, but that seems wrong
> > > given we are talking about extending one qdisc kind.
> >
> > I can see the dilema - and if i understood correctly what you are
> > suggesting, something like:
> >
> > if (nla_len(opt) == sizeof(*ctl))
> >     do the old thing here
> > else if (nla_len(opt) == sizeof(*mynewstruct))
> >     do the new thing here
> > else
> >     invalid..
>
> Basically this, but handle the case that a broken userspace is sending
> payload such that (nla_len(opt) == sizeof(*mynewstruct)), but only
> provides the first four bytes with the queue limit, and the rest is
> garbage.

That would be standard checks though, no? i.e
if (mynewstruct->foo doesnt make sense)
      set extack appropriately and EINVAL back

cheers,
jamal
diff mbox series

Patch

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index adf5de1ff773..253b26f2eddd 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -1428,6 +1428,7 @@  int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
 		      struct tcf_block_ext_info *ei,
 		      struct netlink_ext_ack *extack)
 {
+	struct net_device *dev = qdisc_dev(q);
 	struct net *net = qdisc_net(q);
 	struct tcf_block *block = NULL;
 	int err;
@@ -1461,9 +1462,18 @@  int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
 	if (err)
 		goto err_block_offload_bind;
 
+	if (tcf_block_shared(block)) {
+		err = xa_insert(&block->ports, dev->ifindex, dev, GFP_KERNEL);
+		if (err) {
+			NL_SET_ERR_MSG(extack, "block dev insert failed");
+			goto err_dev_insert;
+		}
+	}
+
 	*p_block = block;
 	return 0;
 
+err_dev_insert:
 err_block_offload_bind:
 	tcf_chain0_head_change_cb_del(block, ei);
 err_chain0_head_change_cb_add:
@@ -1502,8 +1512,12 @@  EXPORT_SYMBOL(tcf_block_get);
 void tcf_block_put_ext(struct tcf_block *block, struct Qdisc *q,
 		       struct tcf_block_ext_info *ei)
 {
+	struct net_device *dev = qdisc_dev(q);
+
 	if (!block)
 		return;
+	if (tcf_block_shared(block))
+		xa_erase(&block->ports, dev->ifindex);
 	tcf_chain0_head_change_cb_del(block, ei);
 	tcf_block_owner_del(block, q, ei->binder_type);
 
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 2a2a48838eb9..36b025cc4fd2 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1209,43 +1209,6 @@  static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 	return 0;
 }
 
-static int qdisc_block_add_dev(struct Qdisc *sch, struct net_device *dev,
-			       struct netlink_ext_ack *extack)
-{
-	const struct Qdisc_class_ops *cl_ops = sch->ops->cl_ops;
-	struct tcf_block *block;
-	int err;
-
-	block = cl_ops->tcf_block(sch, TC_H_MIN_INGRESS, NULL);
-	if (block) {
-		err = xa_insert(&block->ports, dev->ifindex, dev, GFP_KERNEL);
-		if (err) {
-			NL_SET_ERR_MSG(extack,
-				       "ingress block dev insert failed");
-			return err;
-		}
-	}
-
-	block = cl_ops->tcf_block(sch, TC_H_MIN_EGRESS, NULL);
-	if (block) {
-		err = xa_insert(&block->ports, dev->ifindex, dev, GFP_KERNEL);
-		if (err) {
-			NL_SET_ERR_MSG(extack,
-				       "Egress block dev insert failed");
-			goto err_out;
-		}
-	}
-
-	return 0;
-
-err_out:
-	block = cl_ops->tcf_block(sch, TC_H_MIN_INGRESS, NULL);
-	if (block)
-		xa_erase(&block->ports, dev->ifindex);
-
-	return err;
-}
-
 static int qdisc_block_indexes_set(struct Qdisc *sch, struct nlattr **tca,
 				   struct netlink_ext_ack *extack)
 {
@@ -1416,10 +1379,6 @@  static struct Qdisc *qdisc_create(struct net_device *dev,
 	qdisc_hash_add(sch, false);
 	trace_qdisc_create(ops, dev, parent);
 
-	err = qdisc_block_add_dev(sch, dev, extack);
-	if (err)
-		goto err_out4;
-
 	return sch;
 
 err_out4:
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index e33568df97a5..9b3e9262040b 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1052,8 +1052,6 @@  static void __qdisc_destroy(struct Qdisc *qdisc)
 {
 	const struct Qdisc_ops  *ops = qdisc->ops;
 	struct net_device *dev = qdisc_dev(qdisc);
-	const struct Qdisc_class_ops *cops;
-	struct tcf_block *block;
 
 #ifdef CONFIG_NET_SCHED
 	qdisc_hash_del(qdisc);
@@ -1064,18 +1062,6 @@  static void __qdisc_destroy(struct Qdisc *qdisc)
 
 	qdisc_reset(qdisc);
 
-	cops = ops->cl_ops;
-	if (ops->ingress_block_get) {
-		block = cops->tcf_block(qdisc, TC_H_MIN_INGRESS, NULL);
-		if (block)
-			xa_erase(&block->ports, dev->ifindex);
-	}
-
-	if (ops->egress_block_get) {
-		block = cops->tcf_block(qdisc, TC_H_MIN_EGRESS, NULL);
-		if (block)
-			xa_erase(&block->ports, dev->ifindex);
-	}
 
 	if (ops->destroy)
 		ops->destroy(qdisc);