
[bpf-next,v6,2/8] bpf: Add fd-based tcx multi-prog infra with link support

Message ID 20230719140858.13224-3-daniel@iogearbox.net (mailing list archive)
State Accepted
Commit e420bed025071a623d2720a92bc2245c84757ecb
Delegated to: BPF
Series BPF link support for tc BPF programs

Checks

Context Check Description
bpf/vmtest-bpf-next-VM_Test-2 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-3 success Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-9 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-19 success Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-20 success Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-21 success Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-22 success Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-24 success Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-25 success Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-27 success Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-28 success Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-29 success Logs for veristat
bpf/vmtest-bpf-next-VM_Test-7 success Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-11 success Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-13 success Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-14 success Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-15 success Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-18 success Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-26 success Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-12 success Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-16 success Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-8 success Logs for test_maps on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-4 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-5 success Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-6 success Logs for set-matrix
bpf/vmtest-bpf-next-PR fail merge-conflict
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/apply success Patch already applied to bpf-next

Commit Message

Daniel Borkmann July 19, 2023, 2:08 p.m. UTC
This work refactors and adds a lightweight extension ("tcx") to the tc BPF
ingress and egress data path side for allowing BPF program management based
on fds via bpf() syscall through the newly added generic multi-prog API.
The main goal behind this work, which we presented at LPC [0] last year and
updated at LSF/MM/BPF this year [3], is to support the long-awaited BPF link
functionality for tc BPF programs, which allows for a model of safe ownership
and program detachment.

Given the rise in tc BPF users in cloud native environments, this becomes
necessary to avoid hard-to-debug incidents either through stale leftover
programs or 3rd party applications accidentally stepping on each other's toes.
As a recap, a BPF link represents the attachment of a BPF program to a BPF
hook point. The BPF link holds a single reference to keep the BPF program alive.
Moreover, hook points do not reference a BPF link, only the application's
fd or pinning does. A BPF link holds meta-data specific to attachment and
implements operations for link creation, (atomic) BPF program update,
detachment and introspection. The motivation for BPF links for tc BPF programs
is multi-fold, for example:

  - From Meta: "It's especially important for applications that are deployed
    fleet-wide and that don't "control" hosts they are deployed to. If such
    application crashes and no one notices and does anything about that, BPF
    program will keep running draining resources or even just, say, dropping
    packets. We at FB had outages due to such permanent BPF attachment
    semantics. With fd-based BPF link we are getting a framework, which allows
    safe, auto-detachable behavior by default, unless application explicitly
    opts in by pinning the BPF link." [1]

  - From the Cilium side: the tc BPF programs we attach to host-facing veth
    devices and phys devices build the core datapath for Kubernetes Pods, and
    they implement forwarding, load-balancing, policy, EDT-management, etc.,
    within BPF. Currently there is no concept of 'safe' ownership, e.g. we've
    recently experienced hard-to-debug issues in a user's staging environment
    where another Kubernetes application using tc BPF attached to the same
    prio/handle of cls_bpf, accidentally wiping all Cilium-based BPF programs
    from underneath it. The goal is to establish a clear/safe ownership model
    via links which cannot accidentally be overridden. [0,2]

BPF links for tc can co-exist with non-link attachments, and the semantics are
also in line with XDP links: BPF links cannot replace other BPF links, BPF
links cannot replace non-link attachments, non-link attachments cannot replace
BPF links, and lastly only non-link attachments can replace non-link
attachments. In the case of Cilium, this would solve the mentioned issue of a
safe ownership model, as 3rd party applications would not be able to
accidentally wipe Cilium programs, even if they are not BPF link aware.
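
The four replacement rules above can be sketched as a small truth table. This
is only a userspace model under the assumption that each attachment is either
link-owned or not; may_replace() and check_replace_rules() are hypothetical
names for illustration and are not part of the kernel or of libbpf.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical userspace model of the replacement rules described above.
 * may_replace() is an illustrative name, not a kernel or libbpf function.
 */
static bool may_replace(bool existing_is_link, bool incoming_is_link)
{
	/* Only a non-link attachment may replace another non-link one. */
	return !existing_is_link && !incoming_is_link;
}

/* Self-check that walks the four cases from the paragraph above. */
static void check_replace_rules(void)
{
	assert(!may_replace(true, true));   /* link over link */
	assert(!may_replace(false, true));  /* link over non-link */
	assert(!may_replace(true, false));  /* non-link over link */
	assert(may_replace(false, false));  /* non-link over non-link */
}
```

In this model a third-party loader that is not link aware can only ever hit
the last case, so it cannot displace a link-owned program.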

Earlier attempts [4] have tried to integrate BPF links into core tc machinery
to solve this for cls_bpf, which has been intrusive to the generic tc kernel
API with extensions only specific to cls_bpf and suboptimal/complex since
cls_bpf could be wiped from the qdisc also. Locking a tc BPF program in place
this way is getting into layering hacks given the two object models are vastly
different.

We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF
attach API, so that the BPF link implementation blends in naturally similar to
other link types which are fd-based and without the need for changing core tc
internal APIs. BPF programs for tc can then be successively migrated from classic
cls_bpf to the new tc BPF link without needing to change the program's source
code; only the BPF loader mechanics for attaching need to change.

For the current tc framework, there is no change in behavior with this change
and neither does this change touch on tc core kernel APIs. The gist of this
patch is that the ingress and egress hook have a lightweight, qdisc-less
extension for BPF to attach its tc BPF programs, in other words, a minimal
entry point for tc BPF. The name tcx has been suggested from discussion of
earlier revisions of this work as a good fit, and to more easily differentiate
between the classic cls_bpf attachment and the fd-based one.

For the ingress and egress tcx points, the device holds a cache-friendly array
with program pointers which is separated from control plane (slow-path) data.
Earlier versions of this work used priority to determine ordering and expression
of dependencies, similar to classic tc, but it was challenged that something
more future-proof with a better user experience is required. Hence this
resulted in the design and development of the generic attach/detach/query API
for multi-progs. See prior patch with its discussion on the API design. tcx is
the first user and later we plan to integrate also others, for example, one
candidate is multi-prog support for XDP which would benefit and have the same
'look and feel' from API perspective.

The goal with tcx is to have maximum compatibility with existing tc BPF programs,
so they don't need to be rewritten specifically. Compatibility to call into
classic tcf_classify() is also provided in order to allow successive migration
or both to cleanly co-exist where needed, given it's all one logical tc layer and
the tcx plus classic tc cls/act build one logical overall processing pipeline.

tcx supports the simplified return codes TCX_NEXT, which is non-terminating (go
to next program), and the terminating ones TCX_PASS, TCX_DROP and TCX_REDIRECT.
The fd-based API is behind a static key, so that when unused the code is also
not entered. The struct tcx_entry's program array is currently static, but
could be made dynamic if necessary at some point in the future. The a/b pair
swap design has been chosen so that for detachment there are no allocations
which otherwise could fail.
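
As a rough illustration of the return code semantics, the sketch below models
a program array in plain userspace C. The enum values are copied from enum
tcx_action_base in this patch, and model_action_code() mirrors the mapping in
tcx_action_code() minus the tc_index update. Everything else is an assumption
of the sketch: prog_fn, model_run() and the sample programs are hypothetical
names, and the loop shape (stop at the first result that is not TCX_NEXT, then
map the code) is how this model chooses to run the array, not a copy of the
kernel's data path.

```c
#include <assert.h>
#include <stddef.h>

/* Values copied from enum tcx_action_base in this patch. */
enum tcx_action_base {
	TCX_NEXT     = -1,
	TCX_PASS     = 0,
	TCX_DROP     = 2,
	TCX_REDIRECT = 7,
};

/* Hypothetical stand-in for a tc BPF program: returns a verdict. */
typedef int (*prog_fn)(void *pkt);

/* Known terminating codes pass through; anything else becomes TCX_NEXT. */
static int model_action_code(int code)
{
	switch (code) {
	case TCX_PASS:
	case TCX_DROP:
	case TCX_REDIRECT:
		return code;
	case TCX_NEXT:
	default:
		return TCX_NEXT;
	}
}

/* Run programs in order until one returns something other than TCX_NEXT. */
static int model_run(const prog_fn *progs, size_t n, void *pkt)
{
	int ret = TCX_NEXT;

	assert(progs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		ret = progs[i](pkt);
		if (ret != TCX_NEXT)
			break;
	}
	return model_action_code(ret);
}

/* Hypothetical sample programs used to exercise the model. */
static int prog_next(void *pkt)    { (void)pkt; return TCX_NEXT; }
static int prog_drop(void *pkt)    { (void)pkt; return TCX_DROP; }
static int prog_pass(void *pkt)    { (void)pkt; return TCX_PASS; }
static int prog_unknown(void *pkt) { (void)pkt; return 5; }
```

With the array {prog_next, prog_drop, prog_pass}, model_run() stops at the
second program and yields TCX_DROP, so prog_pass never runs. An array of only
prog_next entries, or an empty array, yields TCX_NEXT.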

The work has been tested with the tc-testing selftest suite, which passes in
full, as well as the tc BPF tests from the BPF CI, and also with Cilium's L4LB.

Thanks also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews
of this work.

  [0] https://lpc.events/event/16/contributions/1353/
  [1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com
  [2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog
  [3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
  [4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <kuba@kernel.org>
---
 MAINTAINERS                    |   4 +-
 include/linux/bpf_mprog.h      |   9 +
 include/linux/netdevice.h      |  15 +-
 include/linux/skbuff.h         |   4 +-
 include/net/sch_generic.h      |   2 +-
 include/net/tcx.h              | 206 +++++++++++++++++++
 include/uapi/linux/bpf.h       |  34 +++-
 kernel/bpf/Kconfig             |   1 +
 kernel/bpf/Makefile            |   1 +
 kernel/bpf/syscall.c           |  82 ++++++--
 kernel/bpf/tcx.c               | 348 +++++++++++++++++++++++++++++++++
 net/Kconfig                    |   5 +
 net/core/dev.c                 | 265 +++++++++++++++----------
 net/core/filter.c              |   4 +-
 net/sched/Kconfig              |   4 +-
 net/sched/sch_ingress.c        |  61 +++++-
 tools/include/uapi/linux/bpf.h |  34 +++-
 17 files changed, 935 insertions(+), 144 deletions(-)
 create mode 100644 include/net/tcx.h
 create mode 100644 kernel/bpf/tcx.c

Comments

Yafang Shao July 20, 2023, 2:13 a.m. UTC | #1
On Wed, Jul 19, 2023 at 10:11 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 678bef9f60b4..990e3fce753c 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3778,13 +3778,15 @@ L:      netdev@vger.kernel.org
>  S:     Maintained
>  F:     kernel/bpf/bpf_struct*
>
> -BPF [NETWORKING] (tc BPF, sock_addr)
> +BPF [NETWORKING] (tcx & tc BPF, sock_addr)
>  M:     Martin KaFai Lau <martin.lau@linux.dev>
>  M:     Daniel Borkmann <daniel@iogearbox.net>
>  R:     John Fastabend <john.fastabend@gmail.com>
>  L:     bpf@vger.kernel.org
>  L:     netdev@vger.kernel.org
>  S:     Maintained
> +F:     include/net/tcx.h
> +F:     kernel/bpf/tcx.c
>  F:     net/core/filter.c
>  F:     net/sched/act_bpf.c
>  F:     net/sched/cls_bpf.c
> diff --git a/include/linux/bpf_mprog.h b/include/linux/bpf_mprog.h
> index 6feefec43422..2b429488f840 100644
> --- a/include/linux/bpf_mprog.h
> +++ b/include/linux/bpf_mprog.h
> @@ -315,4 +315,13 @@ int bpf_mprog_detach(struct bpf_mprog_entry *entry,
>  int bpf_mprog_query(const union bpf_attr *attr, union bpf_attr __user *uattr,
>                     struct bpf_mprog_entry *entry);
>
> +static inline bool bpf_mprog_supported(enum bpf_prog_type type)
> +{
> +       switch (type) {
> +       case BPF_PROG_TYPE_SCHED_CLS:
> +               return true;
> +       default:
> +               return false;
> +       }
> +}
>  #endif /* __BPF_MPROG_H */
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index b828c7a75be2..024314c68bc8 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1930,8 +1930,7 @@ enum netdev_ml_priv_type {
>   *
>   *     @rx_handler:            handler for received packets
>   *     @rx_handler_data:       XXX: need comments on this one
> - *     @miniq_ingress:         ingress/clsact qdisc specific data for
> - *                             ingress processing
> + *     @tcx_ingress:           BPF & clsact qdisc specific data for ingress processing
>   *     @ingress_queue:         XXX: need comments on this one
>   *     @nf_hooks_ingress:      netfilter hooks executed for ingress packets
>   *     @broadcast:             hw bcast address
> @@ -1952,8 +1951,7 @@ enum netdev_ml_priv_type {
>   *     @xps_maps:              all CPUs/RXQs maps for XPS device
>   *
>   *     @xps_maps:      XXX: need comments on this one
> - *     @miniq_egress:          clsact qdisc specific data for
> - *                             egress processing
> + *     @tcx_egress:            BPF & clsact qdisc specific data for egress processing
>   *     @nf_hooks_egress:       netfilter hooks executed for egress packets
>   *     @qdisc_hash:            qdisc hash table
>   *     @watchdog_timeo:        Represents the timeout that is used by
> @@ -2252,9 +2250,8 @@ struct net_device {
>         unsigned int            gro_ipv4_max_size;
>         rx_handler_func_t __rcu *rx_handler;
>         void __rcu              *rx_handler_data;
> -
> -#ifdef CONFIG_NET_CLS_ACT
> -       struct mini_Qdisc __rcu *miniq_ingress;
> +#ifdef CONFIG_NET_XGRESS
> +       struct bpf_mprog_entry __rcu *tcx_ingress;
>  #endif
>         struct netdev_queue __rcu *ingress_queue;
>  #ifdef CONFIG_NETFILTER_INGRESS
> @@ -2282,8 +2279,8 @@ struct net_device {
>  #ifdef CONFIG_XPS
>         struct xps_dev_maps __rcu *xps_maps[XPS_MAPS_MAX];
>  #endif
> -#ifdef CONFIG_NET_CLS_ACT
> -       struct mini_Qdisc __rcu *miniq_egress;
> +#ifdef CONFIG_NET_XGRESS
> +       struct bpf_mprog_entry __rcu *tcx_egress;
>  #endif
>  #ifdef CONFIG_NETFILTER_EGRESS
>         struct nf_hook_entries __rcu *nf_hooks_egress;
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 91ed66952580..ed83f1c5fc1f 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -944,7 +944,7 @@ struct sk_buff {
>         __u8                    __mono_tc_offset[0];
>         /* public: */
>         __u8                    mono_delivery_time:1;   /* See SKB_MONO_DELIVERY_TIME_MASK */
> -#ifdef CONFIG_NET_CLS_ACT
> +#ifdef CONFIG_NET_XGRESS
>         __u8                    tc_at_ingress:1;        /* See TC_AT_INGRESS_MASK */
>         __u8                    tc_skip_classify:1;
>  #endif
> @@ -993,7 +993,7 @@ struct sk_buff {
>         __u8                    csum_not_inet:1;
>  #endif
>
> -#ifdef CONFIG_NET_SCHED
> +#if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
>         __u16                   tc_index;       /* traffic control index */
>  #endif
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index e92f73bb3198..15be2d96b06d 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -703,7 +703,7 @@ int skb_do_redirect(struct sk_buff *);
>
>  static inline bool skb_at_tc_ingress(const struct sk_buff *skb)
>  {
> -#ifdef CONFIG_NET_CLS_ACT
> +#ifdef CONFIG_NET_XGRESS
>         return skb->tc_at_ingress;
>  #else
>         return false;
> diff --git a/include/net/tcx.h b/include/net/tcx.h
> new file mode 100644
> index 000000000000..264f147953ba
> --- /dev/null
> +++ b/include/net/tcx.h
> @@ -0,0 +1,206 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (c) 2023 Isovalent */
> +#ifndef __NET_TCX_H
> +#define __NET_TCX_H
> +
> +#include <linux/bpf.h>
> +#include <linux/bpf_mprog.h>
> +
> +#include <net/sch_generic.h>
> +
> +struct mini_Qdisc;
> +
> +struct tcx_entry {
> +       struct mini_Qdisc __rcu *miniq;
> +       struct bpf_mprog_bundle bundle;
> +       bool miniq_active;
> +       struct rcu_head rcu;
> +};
> +
> +struct tcx_link {
> +       struct bpf_link link;
> +       struct net_device *dev;
> +       u32 location;
> +};
> +
> +static inline void tcx_set_ingress(struct sk_buff *skb, bool ingress)
> +{
> +#ifdef CONFIG_NET_XGRESS
> +       skb->tc_at_ingress = ingress;
> +#endif
> +}
> +
> +#ifdef CONFIG_NET_XGRESS
> +static inline struct tcx_entry *tcx_entry(struct bpf_mprog_entry *entry)
> +{
> +       struct bpf_mprog_bundle *bundle = entry->parent;
> +
> +       return container_of(bundle, struct tcx_entry, bundle);
> +}
> +
> +static inline struct tcx_link *tcx_link(struct bpf_link *link)
> +{
> +       return container_of(link, struct tcx_link, link);
> +}
> +
> +static inline const struct tcx_link *tcx_link_const(const struct bpf_link *link)
> +{
> +       return tcx_link((struct bpf_link *)link);
> +}
> +
> +void tcx_inc(void);
> +void tcx_dec(void);
> +
> +static inline void tcx_entry_sync(void)
> +{
> +       /* bpf_mprog_entry got a/b swapped, therefore ensure that
> +        * there are no inflight users on the old one anymore.
> +        */
> +       synchronize_rcu();
> +}
> +
> +static inline void
> +tcx_entry_update(struct net_device *dev, struct bpf_mprog_entry *entry,
> +                bool ingress)
> +{
> +       ASSERT_RTNL();
> +       if (ingress)
> +               rcu_assign_pointer(dev->tcx_ingress, entry);
> +       else
> +               rcu_assign_pointer(dev->tcx_egress, entry);
> +}
> +
> +static inline struct bpf_mprog_entry *
> +tcx_entry_fetch(struct net_device *dev, bool ingress)
> +{
> +       ASSERT_RTNL();
> +       if (ingress)
> +               return rcu_dereference_rtnl(dev->tcx_ingress);
> +       else
> +               return rcu_dereference_rtnl(dev->tcx_egress);
> +}
> +
> +static inline struct bpf_mprog_entry *tcx_entry_create(void)
> +{
> +       struct tcx_entry *tcx = kzalloc(sizeof(*tcx), GFP_KERNEL);
> +
> +       if (tcx) {
> +               bpf_mprog_bundle_init(&tcx->bundle);
> +               return &tcx->bundle.a;
> +       }
> +       return NULL;
> +}
> +
> +static inline void tcx_entry_free(struct bpf_mprog_entry *entry)
> +{
> +       kfree_rcu(tcx_entry(entry), rcu);
> +}
> +
> +static inline struct bpf_mprog_entry *
> +tcx_entry_fetch_or_create(struct net_device *dev, bool ingress, bool *created)
> +{
> +       struct bpf_mprog_entry *entry = tcx_entry_fetch(dev, ingress);
> +
> +       *created = false;
> +       if (!entry) {
> +               entry = tcx_entry_create();
> +               if (!entry)
> +                       return NULL;
> +               *created = true;
> +       }
> +       return entry;
> +}
> +
> +static inline void tcx_skeys_inc(bool ingress)
> +{
> +       tcx_inc();
> +       if (ingress)
> +               net_inc_ingress_queue();
> +       else
> +               net_inc_egress_queue();
> +}
> +
> +static inline void tcx_skeys_dec(bool ingress)
> +{
> +       if (ingress)
> +               net_dec_ingress_queue();
> +       else
> +               net_dec_egress_queue();
> +       tcx_dec();
> +}
> +
> +static inline void tcx_miniq_set_active(struct bpf_mprog_entry *entry,
> +                                       const bool active)
> +{
> +       ASSERT_RTNL();
> +       tcx_entry(entry)->miniq_active = active;
> +}
> +
> +static inline bool tcx_entry_is_active(struct bpf_mprog_entry *entry)
> +{
> +       ASSERT_RTNL();
> +       return bpf_mprog_total(entry) || tcx_entry(entry)->miniq_active;
> +}
> +
> +static inline enum tcx_action_base tcx_action_code(struct sk_buff *skb,
> +                                                  int code)
> +{
> +       switch (code) {
> +       case TCX_PASS:
> +               skb->tc_index = qdisc_skb_cb(skb)->tc_classid;
> +               fallthrough;
> +       case TCX_DROP:
> +       case TCX_REDIRECT:
> +               return code;
> +       case TCX_NEXT:
> +       default:
> +               return TCX_NEXT;
> +       }
> +}
> +#endif /* CONFIG_NET_XGRESS */
> +
> +#if defined(CONFIG_NET_XGRESS) && defined(CONFIG_BPF_SYSCALL)
> +int tcx_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog);
> +int tcx_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
> +int tcx_prog_detach(const union bpf_attr *attr, struct bpf_prog *prog);
> +void tcx_uninstall(struct net_device *dev, bool ingress);
> +
> +int tcx_prog_query(const union bpf_attr *attr,
> +                  union bpf_attr __user *uattr);
> +
> +static inline void dev_tcx_uninstall(struct net_device *dev)
> +{
> +       ASSERT_RTNL();
> +       tcx_uninstall(dev, true);
> +       tcx_uninstall(dev, false);
> +}
> +#else
> +static inline int tcx_prog_attach(const union bpf_attr *attr,
> +                                 struct bpf_prog *prog)
> +{
> +       return -EINVAL;
> +}
> +
> +static inline int tcx_link_attach(const union bpf_attr *attr,
> +                                 struct bpf_prog *prog)
> +{
> +       return -EINVAL;
> +}
> +
> +static inline int tcx_prog_detach(const union bpf_attr *attr,
> +                                 struct bpf_prog *prog)
> +{
> +       return -EINVAL;
> +}
> +
> +static inline int tcx_prog_query(const union bpf_attr *attr,
> +                                union bpf_attr __user *uattr)
> +{
> +       return -EINVAL;
> +}
> +
> +static inline void dev_tcx_uninstall(struct net_device *dev)
> +{
> +}
> +#endif /* CONFIG_NET_XGRESS && CONFIG_BPF_SYSCALL */
> +#endif /* __NET_TCX_H */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index d4c07e435336..739c15906a65 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1036,6 +1036,8 @@ enum bpf_attach_type {
>         BPF_LSM_CGROUP,
>         BPF_STRUCT_OPS,
>         BPF_NETFILTER,
> +       BPF_TCX_INGRESS,
> +       BPF_TCX_EGRESS,
>         __MAX_BPF_ATTACH_TYPE
>  };
>
> @@ -1053,7 +1055,7 @@ enum bpf_link_type {
>         BPF_LINK_TYPE_KPROBE_MULTI = 8,
>         BPF_LINK_TYPE_STRUCT_OPS = 9,
>         BPF_LINK_TYPE_NETFILTER = 10,
> -
> +       BPF_LINK_TYPE_TCX = 11,
>         MAX_BPF_LINK_TYPE,
>  };
>
> @@ -1569,13 +1571,13 @@ union bpf_attr {
>                         __u32           map_fd;         /* struct_ops to attach */
>                 };
>                 union {
> -                       __u32           target_fd;      /* object to attach to */
> -                       __u32           target_ifindex; /* target ifindex */
> +                       __u32   target_fd;      /* target object to attach to or ... */
> +                       __u32   target_ifindex; /* target ifindex */
>                 };
>                 __u32           attach_type;    /* attach type */
>                 __u32           flags;          /* extra flags */
>                 union {
> -                       __u32           target_btf_id;  /* btf_id of target to attach to */
> +                       __u32   target_btf_id;  /* btf_id of target to attach to */
>                         struct {
>                                 __aligned_u64   iter_info;      /* extra bpf_iter_link_info */
>                                 __u32           iter_info_len;  /* iter_info length */
> @@ -1609,6 +1611,13 @@ union bpf_attr {
>                                 __s32           priority;
>                                 __u32           flags;
>                         } netfilter;
> +                       struct {
> +                               union {
> +                                       __u32   relative_fd;
> +                                       __u32   relative_id;
> +                               };
> +                               __u64           expected_revision;
> +                       } tcx;
>                 };
>         } link_create;
>
> @@ -6217,6 +6226,19 @@ struct bpf_sock_tuple {
>         };
>  };
>
> +/* (Simplified) user return codes for tcx prog type.
> + * A valid tcx program must return one of these defined values. All other
> + * return codes are reserved for future use. Must remain compatible with
> + * their TC_ACT_* counter-parts. For compatibility in behavior, unknown
> + * return codes are mapped to TCX_NEXT.
> + */
> +enum tcx_action_base {
> +       TCX_NEXT        = -1,
> +       TCX_PASS        = 0,
> +       TCX_DROP        = 2,
> +       TCX_REDIRECT    = 7,
> +};
> +
>  struct bpf_xdp_sock {
>         __u32 queue_id;
>  };
> @@ -6499,6 +6521,10 @@ struct bpf_link_info {
>                                 } event; /* BPF_PERF_EVENT_EVENT */
>                         };
>                 } perf_event;
> +               struct {
> +                       __u32 ifindex;
> +                       __u32 attach_type;
> +               } tcx;
>         };
>  } __attribute__((aligned(8)));
>
> diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
> index 2dfe1079f772..6a906ff93006 100644
> --- a/kernel/bpf/Kconfig
> +++ b/kernel/bpf/Kconfig
> @@ -31,6 +31,7 @@ config BPF_SYSCALL
>         select TASKS_TRACE_RCU
>         select BINARY_PRINTF
>         select NET_SOCK_MSG if NET
> +       select NET_XGRESS if NET
>         select PAGE_POOL if NET
>         default n
>         help
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 1bea2eb912cd..f526b7573e97 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -21,6 +21,7 @@ obj-$(CONFIG_BPF_SYSCALL) += devmap.o
>  obj-$(CONFIG_BPF_SYSCALL) += cpumap.o
>  obj-$(CONFIG_BPF_SYSCALL) += offload.o
>  obj-$(CONFIG_BPF_SYSCALL) += net_namespace.o
> +obj-$(CONFIG_BPF_SYSCALL) += tcx.o
>  endif
>  ifeq ($(CONFIG_PERF_EVENTS),y)
>  obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index ee8cb1a174aa..7f4e8c357a6a 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -37,6 +37,8 @@
>  #include <linux/trace_events.h>
>  #include <net/netfilter/nf_bpf_link.h>
>
> +#include <net/tcx.h>
> +
>  #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
>                           (map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \
>                           (map)->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS)
> @@ -3740,31 +3742,45 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
>                 return BPF_PROG_TYPE_XDP;
>         case BPF_LSM_CGROUP:
>                 return BPF_PROG_TYPE_LSM;
> +       case BPF_TCX_INGRESS:
> +       case BPF_TCX_EGRESS:
> +               return BPF_PROG_TYPE_SCHED_CLS;
>         default:
>                 return BPF_PROG_TYPE_UNSPEC;
>         }
>  }
>
> -#define BPF_PROG_ATTACH_LAST_FIELD replace_bpf_fd
> +#define BPF_PROG_ATTACH_LAST_FIELD expected_revision
> +
> +#define BPF_F_ATTACH_MASK_BASE \
> +       (BPF_F_ALLOW_OVERRIDE | \
> +        BPF_F_ALLOW_MULTI |    \
> +        BPF_F_REPLACE)
>
> -#define BPF_F_ATTACH_MASK \
> -       (BPF_F_ALLOW_OVERRIDE | BPF_F_ALLOW_MULTI | BPF_F_REPLACE)
> +#define BPF_F_ATTACH_MASK_MPROG        \
> +       (BPF_F_REPLACE |        \
> +        BPF_F_BEFORE |         \
> +        BPF_F_AFTER |          \
> +        BPF_F_ID |             \
> +        BPF_F_LINK)
>
>  static int bpf_prog_attach(const union bpf_attr *attr)
>  {
>         enum bpf_prog_type ptype;
>         struct bpf_prog *prog;
> +       u32 mask;
>         int ret;
>
>         if (CHECK_ATTR(BPF_PROG_ATTACH))
>                 return -EINVAL;
>
> -       if (attr->attach_flags & ~BPF_F_ATTACH_MASK)
> -               return -EINVAL;
> -
>         ptype = attach_type_to_prog_type(attr->attach_type);
>         if (ptype == BPF_PROG_TYPE_UNSPEC)
>                 return -EINVAL;
> +       mask = bpf_mprog_supported(ptype) ?
> +              BPF_F_ATTACH_MASK_MPROG : BPF_F_ATTACH_MASK_BASE;
> +       if (attr->attach_flags & ~mask)
> +               return -EINVAL;
>
>         prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype);
>         if (IS_ERR(prog))
> @@ -3800,6 +3816,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>                 else
>                         ret = cgroup_bpf_prog_attach(attr, ptype, prog);
>                 break;
> +       case BPF_PROG_TYPE_SCHED_CLS:
> +               ret = tcx_prog_attach(attr, prog);
> +               break;
>         default:
>                 ret = -EINVAL;
>         }
> @@ -3809,25 +3828,41 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>         return ret;
>  }
>
> -#define BPF_PROG_DETACH_LAST_FIELD attach_type
> +#define BPF_PROG_DETACH_LAST_FIELD expected_revision
>
>  static int bpf_prog_detach(const union bpf_attr *attr)
>  {
> +       struct bpf_prog *prog = NULL;
>         enum bpf_prog_type ptype;
> +       int ret;
>
>         if (CHECK_ATTR(BPF_PROG_DETACH))
>                 return -EINVAL;
>
>         ptype = attach_type_to_prog_type(attr->attach_type);
> +       if (bpf_mprog_supported(ptype)) {
> +               if (ptype == BPF_PROG_TYPE_UNSPEC)
> +                       return -EINVAL;
> +               if (attr->attach_flags & ~BPF_F_ATTACH_MASK_MPROG)
> +                       return -EINVAL;
> +               if (attr->attach_bpf_fd) {
> +                       prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype);
> +                       if (IS_ERR(prog))
> +                               return PTR_ERR(prog);
> +               }
> +       }
>
>         switch (ptype) {
>         case BPF_PROG_TYPE_SK_MSG:
>         case BPF_PROG_TYPE_SK_SKB:
> -               return sock_map_prog_detach(attr, ptype);
> +               ret = sock_map_prog_detach(attr, ptype);
> +               break;
>         case BPF_PROG_TYPE_LIRC_MODE2:
> -               return lirc_prog_detach(attr);
> +               ret = lirc_prog_detach(attr);
> +               break;
>         case BPF_PROG_TYPE_FLOW_DISSECTOR:
> -               return netns_bpf_prog_detach(attr, ptype);
> +               ret = netns_bpf_prog_detach(attr, ptype);
> +               break;
>         case BPF_PROG_TYPE_CGROUP_DEVICE:
>         case BPF_PROG_TYPE_CGROUP_SKB:
>         case BPF_PROG_TYPE_CGROUP_SOCK:
> @@ -3836,13 +3871,21 @@ static int bpf_prog_detach(const union bpf_attr *attr)
>         case BPF_PROG_TYPE_CGROUP_SYSCTL:
>         case BPF_PROG_TYPE_SOCK_OPS:
>         case BPF_PROG_TYPE_LSM:
> -               return cgroup_bpf_prog_detach(attr, ptype);
> +               ret = cgroup_bpf_prog_detach(attr, ptype);
> +               break;
> +       case BPF_PROG_TYPE_SCHED_CLS:
> +               ret = tcx_prog_detach(attr, prog);
> +               break;
>         default:
> -               return -EINVAL;
> +               ret = -EINVAL;
>         }
> +
> +       if (prog)
> +               bpf_prog_put(prog);
> +       return ret;
>  }
>
> -#define BPF_PROG_QUERY_LAST_FIELD query.prog_attach_flags
> +#define BPF_PROG_QUERY_LAST_FIELD query.link_attach_flags
>
>  static int bpf_prog_query(const union bpf_attr *attr,
>                           union bpf_attr __user *uattr)
> @@ -3890,6 +3933,9 @@ static int bpf_prog_query(const union bpf_attr *attr,
>         case BPF_SK_MSG_VERDICT:
>         case BPF_SK_SKB_VERDICT:
>                 return sock_map_bpf_prog_query(attr, uattr);
> +       case BPF_TCX_INGRESS:
> +       case BPF_TCX_EGRESS:
> +               return tcx_prog_query(attr, uattr);
>         default:
>                 return -EINVAL;
>         }
> @@ -4852,6 +4898,13 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
>                         goto out;
>                 }
>                 break;
> +       case BPF_PROG_TYPE_SCHED_CLS:
> +               if (attr->link_create.attach_type != BPF_TCX_INGRESS &&
> +                   attr->link_create.attach_type != BPF_TCX_EGRESS) {
> +                       ret = -EINVAL;
> +                       goto out;
> +               }
> +               break;
>         default:
>                 ptype = attach_type_to_prog_type(attr->link_create.attach_type);
>                 if (ptype == BPF_PROG_TYPE_UNSPEC || ptype != prog->type) {
> @@ -4903,6 +4956,9 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
>         case BPF_PROG_TYPE_XDP:
>                 ret = bpf_xdp_link_attach(attr, prog);
>                 break;
> +       case BPF_PROG_TYPE_SCHED_CLS:
> +               ret = tcx_link_attach(attr, prog);
> +               break;
>         case BPF_PROG_TYPE_NETFILTER:
>                 ret = bpf_nf_link_attach(attr, prog);
>                 break;
> diff --git a/kernel/bpf/tcx.c b/kernel/bpf/tcx.c
> new file mode 100644
> index 000000000000..69a272712b29
> --- /dev/null
> +++ b/kernel/bpf/tcx.c
> @@ -0,0 +1,348 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2023 Isovalent */
> +
> +#include <linux/bpf.h>
> +#include <linux/bpf_mprog.h>
> +#include <linux/netdevice.h>
> +
> +#include <net/tcx.h>
> +
> +int tcx_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog)
> +{
> +       bool created, ingress = attr->attach_type == BPF_TCX_INGRESS;
> +       struct net *net = current->nsproxy->net_ns;
> +       struct bpf_mprog_entry *entry, *entry_new;
> +       struct bpf_prog *replace_prog = NULL;
> +       struct net_device *dev;
> +       int ret;
> +
> +       rtnl_lock();
> +       dev = __dev_get_by_index(net, attr->target_ifindex);
> +       if (!dev) {
> +               ret = -ENODEV;
> +               goto out;
> +       }
> +       if (attr->attach_flags & BPF_F_REPLACE) {
> +               replace_prog = bpf_prog_get_type(attr->replace_bpf_fd,
> +                                                prog->type);
> +               if (IS_ERR(replace_prog)) {
> +                       ret = PTR_ERR(replace_prog);
> +                       replace_prog = NULL;
> +                       goto out;
> +               }
> +       }
> +       entry = tcx_entry_fetch_or_create(dev, ingress, &created);
> +       if (!entry) {
> +               ret = -ENOMEM;
> +               goto out;
> +       }
> +       ret = bpf_mprog_attach(entry, &entry_new, prog, NULL, replace_prog,
> +                              attr->attach_flags, attr->relative_fd,
> +                              attr->expected_revision);
> +       if (!ret) {
> +               if (entry != entry_new) {
> +                       tcx_entry_update(dev, entry_new, ingress);
> +                       tcx_entry_sync();
> +                       tcx_skeys_inc(ingress);
> +               }
> +               bpf_mprog_commit(entry);
> +       } else if (created) {
> +               tcx_entry_free(entry);
> +       }
> +out:
> +       if (replace_prog)
> +               bpf_prog_put(replace_prog);
> +       rtnl_unlock();
> +       return ret;
> +}
> +
> +int tcx_prog_detach(const union bpf_attr *attr, struct bpf_prog *prog)
> +{
> +       bool ingress = attr->attach_type == BPF_TCX_INGRESS;
> +       struct net *net = current->nsproxy->net_ns;
> +       struct bpf_mprog_entry *entry, *entry_new;
> +       struct net_device *dev;
> +       int ret;
> +
> +       rtnl_lock();
> +       dev = __dev_get_by_index(net, attr->target_ifindex);
> +       if (!dev) {
> +               ret = -ENODEV;
> +               goto out;
> +       }
> +       entry = tcx_entry_fetch(dev, ingress);
> +       if (!entry) {
> +               ret = -ENOENT;
> +               goto out;
> +       }
> +       ret = bpf_mprog_detach(entry, &entry_new, prog, NULL, attr->attach_flags,
> +                              attr->relative_fd, attr->expected_revision);
> +       if (!ret) {
> +               if (!tcx_entry_is_active(entry_new))
> +                       entry_new = NULL;
> +               tcx_entry_update(dev, entry_new, ingress);
> +               tcx_entry_sync();
> +               tcx_skeys_dec(ingress);
> +               bpf_mprog_commit(entry);
> +               if (!entry_new)
> +                       tcx_entry_free(entry);
> +       }
> +out:
> +       rtnl_unlock();
> +       return ret;
> +}
> +
> +void tcx_uninstall(struct net_device *dev, bool ingress)
> +{
> +       struct bpf_tuple tuple = {};
> +       struct bpf_mprog_entry *entry;
> +       struct bpf_mprog_fp *fp;
> +       struct bpf_mprog_cp *cp;
> +
> +       entry = tcx_entry_fetch(dev, ingress);
> +       if (!entry)
> +               return;
> +       tcx_entry_update(dev, NULL, ingress);
> +       tcx_entry_sync();
> +       bpf_mprog_foreach_tuple(entry, fp, cp, tuple) {
> +               if (tuple.link)
> +                       tcx_link(tuple.link)->dev = NULL;
> +               else
> +                       bpf_prog_put(tuple.prog);
> +               tcx_skeys_dec(ingress);
> +       }
> +       WARN_ON_ONCE(tcx_entry(entry)->miniq_active);
> +       tcx_entry_free(entry);
> +}
> +
> +int tcx_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr)
> +{
> +       bool ingress = attr->query.attach_type == BPF_TCX_INGRESS;
> +       struct net *net = current->nsproxy->net_ns;
> +       struct bpf_mprog_entry *entry;
> +       struct net_device *dev;
> +       int ret;
> +
> +       rtnl_lock();
> +       dev = __dev_get_by_index(net, attr->query.target_ifindex);
> +       if (!dev) {
> +               ret = -ENODEV;
> +               goto out;
> +       }
> +       entry = tcx_entry_fetch(dev, ingress);
> +       if (!entry) {
> +               ret = -ENOENT;
> +               goto out;
> +       }
> +       ret = bpf_mprog_query(attr, uattr, entry);
> +out:
> +       rtnl_unlock();
> +       return ret;
> +}
> +
> +static int tcx_link_prog_attach(struct bpf_link *link, u32 flags, u32 id_or_fd,
> +                               u64 revision)
> +{
> +       struct tcx_link *tcx = tcx_link(link);
> +       bool created, ingress = tcx->location == BPF_TCX_INGRESS;
> +       struct bpf_mprog_entry *entry, *entry_new;
> +       struct net_device *dev = tcx->dev;
> +       int ret;
> +
> +       ASSERT_RTNL();
> +       entry = tcx_entry_fetch_or_create(dev, ingress, &created);
> +       if (!entry)
> +               return -ENOMEM;
> +       ret = bpf_mprog_attach(entry, &entry_new, link->prog, link, NULL, flags,
> +                              id_or_fd, revision);
> +       if (!ret) {
> +               if (entry != entry_new) {
> +                       tcx_entry_update(dev, entry_new, ingress);
> +                       tcx_entry_sync();
> +                       tcx_skeys_inc(ingress);
> +               }
> +               bpf_mprog_commit(entry);
> +       } else if (created) {
> +               tcx_entry_free(entry);
> +       }
> +       return ret;
> +}
> +
> +static void tcx_link_release(struct bpf_link *link)
> +{
> +       struct tcx_link *tcx = tcx_link(link);
> +       bool ingress = tcx->location == BPF_TCX_INGRESS;
> +       struct bpf_mprog_entry *entry, *entry_new;
> +       struct net_device *dev;
> +       int ret = 0;
> +
> +       rtnl_lock();
> +       dev = tcx->dev;
> +       if (!dev)
> +               goto out;
> +       entry = tcx_entry_fetch(dev, ingress);
> +       if (!entry) {
> +               ret = -ENOENT;
> +               goto out;
> +       }
> +       ret = bpf_mprog_detach(entry, &entry_new, link->prog, link, 0, 0, 0);
> +       if (!ret) {
> +               if (!tcx_entry_is_active(entry_new))
> +                       entry_new = NULL;
> +               tcx_entry_update(dev, entry_new, ingress);
> +               tcx_entry_sync();
> +               tcx_skeys_dec(ingress);
> +               bpf_mprog_commit(entry);
> +               if (!entry_new)
> +                       tcx_entry_free(entry);
> +               tcx->dev = NULL;
> +       }
> +out:
> +       WARN_ON_ONCE(ret);
> +       rtnl_unlock();
> +}
> +
> +static int tcx_link_update(struct bpf_link *link, struct bpf_prog *nprog,
> +                          struct bpf_prog *oprog)
> +{
> +       struct tcx_link *tcx = tcx_link(link);
> +       bool ingress = tcx->location == BPF_TCX_INGRESS;
> +       struct bpf_mprog_entry *entry, *entry_new;
> +       struct net_device *dev;
> +       int ret = 0;
> +
> +       rtnl_lock();
> +       dev = tcx->dev;
> +       if (!dev) {
> +               ret = -ENOLINK;
> +               goto out;
> +       }
> +       if (oprog && link->prog != oprog) {
> +               ret = -EPERM;
> +               goto out;
> +       }
> +       oprog = link->prog;
> +       if (oprog == nprog) {
> +               bpf_prog_put(nprog);
> +               goto out;
> +       }
> +       entry = tcx_entry_fetch(dev, ingress);
> +       if (!entry) {
> +               ret = -ENOENT;
> +               goto out;
> +       }
> +       ret = bpf_mprog_attach(entry, &entry_new, nprog, link, oprog,
> +                              BPF_F_REPLACE | BPF_F_ID,
> +                              link->prog->aux->id, 0);
> +       if (!ret) {
> +               WARN_ON_ONCE(entry != entry_new);
> +               oprog = xchg(&link->prog, nprog);
> +               bpf_prog_put(oprog);
> +               bpf_mprog_commit(entry);
> +       }
> +out:
> +       rtnl_unlock();
> +       return ret;
> +}
> +
> +static void tcx_link_dealloc(struct bpf_link *link)
> +{
> +       kfree(tcx_link(link));
> +}
> +
> +static void tcx_link_fdinfo(const struct bpf_link *link, struct seq_file *seq)
> +{
> +       const struct tcx_link *tcx = tcx_link_const(link);
> +       u32 ifindex = 0;
> +
> +       rtnl_lock();
> +       if (tcx->dev)
> +               ifindex = tcx->dev->ifindex;
> +       rtnl_unlock();
> +
> +       seq_printf(seq, "ifindex:\t%u\n", ifindex);
> +       seq_printf(seq, "attach_type:\t%u (%s)\n",
> +                  tcx->location,
> +                  tcx->location == BPF_TCX_INGRESS ? "ingress" : "egress");
> +}
> +
> +static int tcx_link_fill_info(const struct bpf_link *link,
> +                             struct bpf_link_info *info)
> +{
> +       const struct tcx_link *tcx = tcx_link_const(link);
> +       u32 ifindex = 0;
> +
> +       rtnl_lock();
> +       if (tcx->dev)
> +               ifindex = tcx->dev->ifindex;
> +       rtnl_unlock();
> +
> +       info->tcx.ifindex = ifindex;
> +       info->tcx.attach_type = tcx->location;
> +       return 0;
> +}
> +
> +static int tcx_link_detach(struct bpf_link *link)
> +{
> +       tcx_link_release(link);
> +       return 0;
> +}
> +
> +static const struct bpf_link_ops tcx_link_lops = {
> +       .release        = tcx_link_release,
> +       .detach         = tcx_link_detach,
> +       .dealloc        = tcx_link_dealloc,
> +       .update_prog    = tcx_link_update,
> +       .show_fdinfo    = tcx_link_fdinfo,
> +       .fill_link_info = tcx_link_fill_info,

Should `bpftool link show` display the tc link info as well? I believe
it is the appropriate command for showing comprehensive information
about all links.
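
To make the question concrete, here is a rough sketch of how one line of
such output could be built from the two fields that tcx_link_fill_info()
sets (ifindex and attach_type). This is not bpftool code. The demo_*
names, the output layout, and the enum values are placeholders I made up
for illustration; the real attach type values come from
enum bpf_attach_type in the uapi header.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for BPF_TCX_INGRESS / BPF_TCX_EGRESS. The
 * numeric values here are placeholders, not the uapi values.
 */
enum demo_attach_type {
	DEMO_TCX_INGRESS = 1,
	DEMO_TCX_EGRESS  = 2,
};

/* Mirrors the tcx member this patch adds to struct bpf_link_info. */
struct demo_tcx_link_info {
	uint32_t ifindex;
	uint32_t attach_type;
};

/* Format one plain-text line describing a tcx link into buf.
 * Returns the snprintf() result, i.e. the length that was (or would
 * have been) written, excluding the trailing NUL.
 */
static int demo_show_tcx_link(char *buf, size_t len,
			      const struct demo_tcx_link_info *info)
{
	const char *dir;

	assert(buf && info);
	/* Same ingress/egress split that tcx_link_fdinfo() prints. */
	dir = info->attach_type == DEMO_TCX_INGRESS ? "ingress" : "egress";
	return snprintf(buf, len, "ifindex %u attach_type %s",
			info->ifindex, dir);
}
```

A caller would fill struct demo_tcx_link_info from whatever the kernel
reports for the link and pass it to demo_show_tcx_link(). A detached
link reports ifindex 0, so the output would probably need to special-case
that value.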

> +};
> +
> +static int tcx_link_init(struct tcx_link *tcx,
> +                        struct bpf_link_primer *link_primer,
> +                        const union bpf_attr *attr,
> +                        struct net_device *dev,
> +                        struct bpf_prog *prog)
> +{
> +       bpf_link_init(&tcx->link, BPF_LINK_TYPE_TCX, &tcx_link_lops, prog);
> +       tcx->location = attr->link_create.attach_type;
> +       tcx->dev = dev;
> +       return bpf_link_prime(&tcx->link, link_primer);
> +}
> +
> +int tcx_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
> +{
> +       struct net *net = current->nsproxy->net_ns;
> +       struct bpf_link_primer link_primer;
> +       struct net_device *dev;
> +       struct tcx_link *tcx;
> +       int ret;
> +
> +       rtnl_lock();
> +       dev = __dev_get_by_index(net, attr->link_create.target_ifindex);
> +       if (!dev) {
> +               ret = -ENODEV;
> +               goto out;
> +       }
> +       tcx = kzalloc(sizeof(*tcx), GFP_USER);
> +       if (!tcx) {
> +               ret = -ENOMEM;
> +               goto out;
> +       }
> +       ret = tcx_link_init(tcx, &link_primer, attr, dev, prog);
> +       if (ret) {
> +               kfree(tcx);
> +               goto out;
> +       }
> +       ret = tcx_link_prog_attach(&tcx->link, attr->link_create.flags,
> +                                  attr->link_create.tcx.relative_fd,
> +                                  attr->link_create.tcx.expected_revision);
> +       if (ret) {
> +               tcx->dev = NULL;
> +               bpf_link_cleanup(&link_primer);
> +               goto out;
> +       }
> +       ret = bpf_link_settle(&link_primer);
> +out:
> +       rtnl_unlock();
> +       return ret;
> +}
> diff --git a/net/Kconfig b/net/Kconfig
> index 2fb25b534df5..d532ec33f1fe 100644
> --- a/net/Kconfig
> +++ b/net/Kconfig
> @@ -52,6 +52,11 @@ config NET_INGRESS
>  config NET_EGRESS
>         bool
>
> +config NET_XGRESS
> +       select NET_INGRESS
> +       select NET_EGRESS
> +       bool
> +
>  config NET_REDIRECT
>         bool
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index d6e1b786c5c5..c4b826024978 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -107,6 +107,7 @@
>  #include <net/pkt_cls.h>
>  #include <net/checksum.h>
>  #include <net/xfrm.h>
> +#include <net/tcx.h>
>  #include <linux/highmem.h>
>  #include <linux/init.h>
>  #include <linux/module.h>
> @@ -154,7 +155,6 @@
>  #include "dev.h"
>  #include "net-sysfs.h"
>
> -
>  static DEFINE_SPINLOCK(ptype_lock);
>  struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
>  struct list_head ptype_all __read_mostly;      /* Taps */
> @@ -3882,69 +3882,198 @@ int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
>  EXPORT_SYMBOL(dev_loopback_xmit);
>
>  #ifdef CONFIG_NET_EGRESS
> -static struct sk_buff *
> -sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
> +static struct netdev_queue *
> +netdev_tx_queue_mapping(struct net_device *dev, struct sk_buff *skb)
> +{
> +       int qm = skb_get_queue_mapping(skb);
> +
> +       return netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, qm));
> +}
> +
> +static bool netdev_xmit_txqueue_skipped(void)
>  {
> +       return __this_cpu_read(softnet_data.xmit.skip_txqueue);
> +}
> +
> +void netdev_xmit_skip_txqueue(bool skip)
> +{
> +       __this_cpu_write(softnet_data.xmit.skip_txqueue, skip);
> +}
> +EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue);
> +#endif /* CONFIG_NET_EGRESS */
> +
> +#ifdef CONFIG_NET_XGRESS
> +static int tc_run(struct tcx_entry *entry, struct sk_buff *skb)
> +{
> +       int ret = TC_ACT_UNSPEC;
>  #ifdef CONFIG_NET_CLS_ACT
> -       struct mini_Qdisc *miniq = rcu_dereference_bh(dev->miniq_egress);
> -       struct tcf_result cl_res;
> +       struct mini_Qdisc *miniq = rcu_dereference_bh(entry->miniq);
> +       struct tcf_result res;
>
>         if (!miniq)
> -               return skb;
> +               return ret;
>
> -       /* qdisc_skb_cb(skb)->pkt_len was already set by the caller. */
>         tc_skb_cb(skb)->mru = 0;
>         tc_skb_cb(skb)->post_ct = false;
> -       mini_qdisc_bstats_cpu_update(miniq, skb);
>
> -       switch (tcf_classify(skb, miniq->block, miniq->filter_list, &cl_res, false)) {
> +       mini_qdisc_bstats_cpu_update(miniq, skb);
> +       ret = tcf_classify(skb, miniq->block, miniq->filter_list, &res, false);
> +       /* Only tcf related quirks below. */
> +       switch (ret) {
> +       case TC_ACT_SHOT:
> +               mini_qdisc_qstats_cpu_drop(miniq);
> +               break;
>         case TC_ACT_OK:
>         case TC_ACT_RECLASSIFY:
> -               skb->tc_index = TC_H_MIN(cl_res.classid);
> +               skb->tc_index = TC_H_MIN(res.classid);
>                 break;
> +       }
> +#endif /* CONFIG_NET_CLS_ACT */
> +       return ret;
> +}
> +
> +static DEFINE_STATIC_KEY_FALSE(tcx_needed_key);
> +
> +void tcx_inc(void)
> +{
> +       static_branch_inc(&tcx_needed_key);
> +}
> +
> +void tcx_dec(void)
> +{
> +       static_branch_dec(&tcx_needed_key);
> +}
> +
> +static __always_inline enum tcx_action_base
> +tcx_run(const struct bpf_mprog_entry *entry, struct sk_buff *skb,
> +       const bool needs_mac)
> +{
> +       const struct bpf_mprog_fp *fp;
> +       const struct bpf_prog *prog;
> +       int ret = TCX_NEXT;
> +
> +       if (needs_mac)
> +               __skb_push(skb, skb->mac_len);
> +       bpf_mprog_foreach_prog(entry, fp, prog) {
> +               bpf_compute_data_pointers(skb);
> +               ret = bpf_prog_run(prog, skb);
> +               if (ret != TCX_NEXT)
> +                       break;
> +       }
> +       if (needs_mac)
> +               __skb_pull(skb, skb->mac_len);
> +       return tcx_action_code(skb, ret);
> +}
> +
> +static __always_inline struct sk_buff *
> +sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
> +                  struct net_device *orig_dev, bool *another)
> +{
> +       struct bpf_mprog_entry *entry = rcu_dereference_bh(skb->dev->tcx_ingress);
> +       int sch_ret;
> +
> +       if (!entry)
> +               return skb;
> +       if (*pt_prev) {
> +               *ret = deliver_skb(skb, *pt_prev, orig_dev);
> +               *pt_prev = NULL;
> +       }
> +
> +       qdisc_skb_cb(skb)->pkt_len = skb->len;
> +       tcx_set_ingress(skb, true);
> +
> +       if (static_branch_unlikely(&tcx_needed_key)) {
> +               sch_ret = tcx_run(entry, skb, true);
> +               if (sch_ret != TC_ACT_UNSPEC)
> +                       goto ingress_verdict;
> +       }
> +       sch_ret = tc_run(tcx_entry(entry), skb);
> +ingress_verdict:
> +       switch (sch_ret) {
> +       case TC_ACT_REDIRECT:
> +               /* skb_mac_header check was done by BPF, so we can safely
> +                * push the L2 header back before redirecting to another
> +                * netdev.
> +                */
> +               __skb_push(skb, skb->mac_len);
> +               if (skb_do_redirect(skb) == -EAGAIN) {
> +                       __skb_pull(skb, skb->mac_len);
> +                       *another = true;
> +                       break;
> +               }
> +               *ret = NET_RX_SUCCESS;
> +               return NULL;
>         case TC_ACT_SHOT:
> -               mini_qdisc_qstats_cpu_drop(miniq);
> -               *ret = NET_XMIT_DROP;
> -               kfree_skb_reason(skb, SKB_DROP_REASON_TC_EGRESS);
> +               kfree_skb_reason(skb, SKB_DROP_REASON_TC_INGRESS);
> +               *ret = NET_RX_DROP;
>                 return NULL;
> +       /* used by tc_run */
>         case TC_ACT_STOLEN:
>         case TC_ACT_QUEUED:
>         case TC_ACT_TRAP:
> -               *ret = NET_XMIT_SUCCESS;
>                 consume_skb(skb);
> +               fallthrough;
> +       case TC_ACT_CONSUMED:
> +               *ret = NET_RX_SUCCESS;
>                 return NULL;
> +       }
> +
> +       return skb;
> +}
> +
> +static __always_inline struct sk_buff *
> +sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
> +{
> +       struct bpf_mprog_entry *entry = rcu_dereference_bh(dev->tcx_egress);
> +       int sch_ret;
> +
> +       if (!entry)
> +               return skb;
> +
> +       /* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
> +        * already set by the caller.
> +        */
> +       if (static_branch_unlikely(&tcx_needed_key)) {
> +               sch_ret = tcx_run(entry, skb, false);
> +               if (sch_ret != TC_ACT_UNSPEC)
> +                       goto egress_verdict;
> +       }
> +       sch_ret = tc_run(tcx_entry(entry), skb);
> +egress_verdict:
> +       switch (sch_ret) {
>         case TC_ACT_REDIRECT:
>                 /* No need to push/pop skb's mac_header here on egress! */
>                 skb_do_redirect(skb);
>                 *ret = NET_XMIT_SUCCESS;
>                 return NULL;
> -       default:
> -               break;
> +       case TC_ACT_SHOT:
> +               kfree_skb_reason(skb, SKB_DROP_REASON_TC_EGRESS);
> +               *ret = NET_XMIT_DROP;
> +               return NULL;
> +       /* used by tc_run */
> +       case TC_ACT_STOLEN:
> +       case TC_ACT_QUEUED:
> +       case TC_ACT_TRAP:
> +               *ret = NET_XMIT_SUCCESS;
> +               return NULL;
>         }
> -#endif /* CONFIG_NET_CLS_ACT */
>
>         return skb;
>  }
> -
> -static struct netdev_queue *
> -netdev_tx_queue_mapping(struct net_device *dev, struct sk_buff *skb)
> -{
> -       int qm = skb_get_queue_mapping(skb);
> -
> -       return netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, qm));
> -}
> -
> -static bool netdev_xmit_txqueue_skipped(void)
> +#else
> +static __always_inline struct sk_buff *
> +sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
> +                  struct net_device *orig_dev, bool *another)
>  {
> -       return __this_cpu_read(softnet_data.xmit.skip_txqueue);
> +       return skb;
>  }
>
> -void netdev_xmit_skip_txqueue(bool skip)
> +static __always_inline struct sk_buff *
> +sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
>  {
> -       __this_cpu_write(softnet_data.xmit.skip_txqueue, skip);
> +       return skb;
>  }
> -EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue);
> -#endif /* CONFIG_NET_EGRESS */
> +#endif /* CONFIG_NET_XGRESS */
>
>  #ifdef CONFIG_XPS
>  static int __get_xps_queue_idx(struct net_device *dev, struct sk_buff *skb,
> @@ -4128,9 +4257,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
>         skb_update_prio(skb);
>
>         qdisc_pkt_len_init(skb);
> -#ifdef CONFIG_NET_CLS_ACT
> -       skb->tc_at_ingress = 0;
> -#endif
> +       tcx_set_ingress(skb, false);
>  #ifdef CONFIG_NET_EGRESS
>         if (static_branch_unlikely(&egress_needed_key)) {
>                 if (nf_hook_egress_active()) {
> @@ -5064,72 +5191,6 @@ int (*br_fdb_test_addr_hook)(struct net_device *dev,
>  EXPORT_SYMBOL_GPL(br_fdb_test_addr_hook);
>  #endif
>
> -static inline struct sk_buff *
> -sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
> -                  struct net_device *orig_dev, bool *another)
> -{
> -#ifdef CONFIG_NET_CLS_ACT
> -       struct mini_Qdisc *miniq = rcu_dereference_bh(skb->dev->miniq_ingress);
> -       struct tcf_result cl_res;
> -
> -       /* If there's at least one ingress present somewhere (so
> -        * we get here via enabled static key), remaining devices
> -        * that are not configured with an ingress qdisc will bail
> -        * out here.
> -        */
> -       if (!miniq)
> -               return skb;
> -
> -       if (*pt_prev) {
> -               *ret = deliver_skb(skb, *pt_prev, orig_dev);
> -               *pt_prev = NULL;
> -       }
> -
> -       qdisc_skb_cb(skb)->pkt_len = skb->len;
> -       tc_skb_cb(skb)->mru = 0;
> -       tc_skb_cb(skb)->post_ct = false;
> -       skb->tc_at_ingress = 1;
> -       mini_qdisc_bstats_cpu_update(miniq, skb);
> -
> -       switch (tcf_classify(skb, miniq->block, miniq->filter_list, &cl_res, false)) {
> -       case TC_ACT_OK:
> -       case TC_ACT_RECLASSIFY:
> -               skb->tc_index = TC_H_MIN(cl_res.classid);
> -               break;
> -       case TC_ACT_SHOT:
> -               mini_qdisc_qstats_cpu_drop(miniq);
> -               kfree_skb_reason(skb, SKB_DROP_REASON_TC_INGRESS);
> -               *ret = NET_RX_DROP;
> -               return NULL;
> -       case TC_ACT_STOLEN:
> -       case TC_ACT_QUEUED:
> -       case TC_ACT_TRAP:
> -               consume_skb(skb);
> -               *ret = NET_RX_SUCCESS;
> -               return NULL;
> -       case TC_ACT_REDIRECT:
> -               /* skb_mac_header check was done by cls/act_bpf, so
> -                * we can safely push the L2 header back before
> -                * redirecting to another netdev
> -                */
> -               __skb_push(skb, skb->mac_len);
> -               if (skb_do_redirect(skb) == -EAGAIN) {
> -                       __skb_pull(skb, skb->mac_len);
> -                       *another = true;
> -                       break;
> -               }
> -               *ret = NET_RX_SUCCESS;
> -               return NULL;
> -       case TC_ACT_CONSUMED:
> -               *ret = NET_RX_SUCCESS;
> -               return NULL;
> -       default:
> -               break;
> -       }
> -#endif /* CONFIG_NET_CLS_ACT */
> -       return skb;
> -}
> -
>  /**
>   *     netdev_is_rx_handler_busy - check if receive handler is registered
>   *     @dev: device to check
> @@ -10834,7 +10895,7 @@ void unregister_netdevice_many_notify(struct list_head *head,
>
>                 /* Shutdown queueing discipline. */
>                 dev_shutdown(dev);
> -
> +               dev_tcx_uninstall(dev);
>                 dev_xdp_uninstall(dev);
>                 bpf_dev_bound_netdev_unregister(dev);
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 06ba0e56e369..e39a8a20dd10 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -9312,7 +9312,7 @@ static struct bpf_insn *bpf_convert_tstamp_read(const struct bpf_prog *prog,
>         __u8 value_reg = si->dst_reg;
>         __u8 skb_reg = si->src_reg;
>
> -#ifdef CONFIG_NET_CLS_ACT
> +#ifdef CONFIG_NET_XGRESS
>         /* If the tstamp_type is read,
>          * the bpf prog is aware the tstamp could have delivery time.
>          * Thus, read skb->tstamp as is if tstamp_type_access is true.
> @@ -9346,7 +9346,7 @@ static struct bpf_insn *bpf_convert_tstamp_write(const struct bpf_prog *prog,
>         __u8 value_reg = si->src_reg;
>         __u8 skb_reg = si->dst_reg;
>
> -#ifdef CONFIG_NET_CLS_ACT
> +#ifdef CONFIG_NET_XGRESS
>         /* If the tstamp_type is read,
>          * the bpf prog is aware the tstamp could have delivery time.
>          * Thus, write skb->tstamp as is if tstamp_type_access is true.
> diff --git a/net/sched/Kconfig b/net/sched/Kconfig
> index 4b95cb1ac435..470c70deffe2 100644
> --- a/net/sched/Kconfig
> +++ b/net/sched/Kconfig
> @@ -347,8 +347,7 @@ config NET_SCH_FQ_PIE
>  config NET_SCH_INGRESS
>         tristate "Ingress/classifier-action Qdisc"
>         depends on NET_CLS_ACT
> -       select NET_INGRESS
> -       select NET_EGRESS
> +       select NET_XGRESS
>         help
>           Say Y here if you want to use classifiers for incoming and/or outgoing
>           packets. This qdisc doesn't do anything else besides running classifiers,
> @@ -679,6 +678,7 @@ config NET_EMATCH_IPT
>  config NET_CLS_ACT
>         bool "Actions"
>         select NET_CLS
> +       select NET_XGRESS
>         help
>           Say Y here if you want to use traffic control actions. Actions
>           get attached to classifiers and are invoked after a successful
> diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
> index e43a45499372..04e886f6cee4 100644
> --- a/net/sched/sch_ingress.c
> +++ b/net/sched/sch_ingress.c
> @@ -13,6 +13,7 @@
>  #include <net/netlink.h>
>  #include <net/pkt_sched.h>
>  #include <net/pkt_cls.h>
> +#include <net/tcx.h>
>
>  struct ingress_sched_data {
>         struct tcf_block *block;
> @@ -78,6 +79,8 @@ static int ingress_init(struct Qdisc *sch, struct nlattr *opt,
>  {
>         struct ingress_sched_data *q = qdisc_priv(sch);
>         struct net_device *dev = qdisc_dev(sch);
> +       struct bpf_mprog_entry *entry;
> +       bool created;
>         int err;
>
>         if (sch->parent != TC_H_INGRESS)
> @@ -85,7 +88,13 @@ static int ingress_init(struct Qdisc *sch, struct nlattr *opt,
>
>         net_inc_ingress_queue();
>
> -       mini_qdisc_pair_init(&q->miniqp, sch, &dev->miniq_ingress);
> +       entry = tcx_entry_fetch_or_create(dev, true, &created);
> +       if (!entry)
> +               return -ENOMEM;
> +       tcx_miniq_set_active(entry, true);
> +       mini_qdisc_pair_init(&q->miniqp, sch, &tcx_entry(entry)->miniq);
> +       if (created)
> +               tcx_entry_update(dev, entry, true);
>
>         q->block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS;
>         q->block_info.chain_head_change = clsact_chain_head_change;
> @@ -103,11 +112,22 @@ static int ingress_init(struct Qdisc *sch, struct nlattr *opt,
>  static void ingress_destroy(struct Qdisc *sch)
>  {
>         struct ingress_sched_data *q = qdisc_priv(sch);
> +       struct net_device *dev = qdisc_dev(sch);
> +       struct bpf_mprog_entry *entry = rtnl_dereference(dev->tcx_ingress);
>
>         if (sch->parent != TC_H_INGRESS)
>                 return;
>
>         tcf_block_put_ext(q->block, sch, &q->block_info);
> +
> +       if (entry) {
> +               tcx_miniq_set_active(entry, false);
> +               if (!tcx_entry_is_active(entry)) {
> +                       tcx_entry_update(dev, NULL, false);
> +                       tcx_entry_free(entry);
> +               }
> +       }
> +
>         net_dec_ingress_queue();
>  }
>
> @@ -223,6 +243,8 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
>  {
>         struct clsact_sched_data *q = qdisc_priv(sch);
>         struct net_device *dev = qdisc_dev(sch);
> +       struct bpf_mprog_entry *entry;
> +       bool created;
>         int err;
>
>         if (sch->parent != TC_H_CLSACT)
> @@ -231,7 +253,13 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
>         net_inc_ingress_queue();
>         net_inc_egress_queue();
>
> -       mini_qdisc_pair_init(&q->miniqp_ingress, sch, &dev->miniq_ingress);
> +       entry = tcx_entry_fetch_or_create(dev, true, &created);
> +       if (!entry)
> +               return -ENOMEM;
> +       tcx_miniq_set_active(entry, true);
> +       mini_qdisc_pair_init(&q->miniqp_ingress, sch, &tcx_entry(entry)->miniq);
> +       if (created)
> +               tcx_entry_update(dev, entry, true);
>
>         q->ingress_block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS;
>         q->ingress_block_info.chain_head_change = clsact_chain_head_change;
> @@ -244,7 +272,13 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
>
>         mini_qdisc_pair_block_init(&q->miniqp_ingress, q->ingress_block);
>
> -       mini_qdisc_pair_init(&q->miniqp_egress, sch, &dev->miniq_egress);
> +       entry = tcx_entry_fetch_or_create(dev, false, &created);
> +       if (!entry)
> +               return -ENOMEM;
> +       tcx_miniq_set_active(entry, true);
> +       mini_qdisc_pair_init(&q->miniqp_egress, sch, &tcx_entry(entry)->miniq);
> +       if (created)
> +               tcx_entry_update(dev, entry, false);
>
>         q->egress_block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_EGRESS;
>         q->egress_block_info.chain_head_change = clsact_chain_head_change;
> @@ -256,12 +290,31 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
>  static void clsact_destroy(struct Qdisc *sch)
>  {
>         struct clsact_sched_data *q = qdisc_priv(sch);
> +       struct net_device *dev = qdisc_dev(sch);
> +       struct bpf_mprog_entry *ingress_entry = rtnl_dereference(dev->tcx_ingress);
> +       struct bpf_mprog_entry *egress_entry = rtnl_dereference(dev->tcx_egress);
>
>         if (sch->parent != TC_H_CLSACT)
>                 return;
>
> -       tcf_block_put_ext(q->egress_block, sch, &q->egress_block_info);
>         tcf_block_put_ext(q->ingress_block, sch, &q->ingress_block_info);
> +       tcf_block_put_ext(q->egress_block, sch, &q->egress_block_info);
> +
> +       if (ingress_entry) {
> +               tcx_miniq_set_active(ingress_entry, false);
> +               if (!tcx_entry_is_active(ingress_entry)) {
> +                       tcx_entry_update(dev, NULL, true);
> +                       tcx_entry_free(ingress_entry);
> +               }
> +       }
> +
> +       if (egress_entry) {
> +               tcx_miniq_set_active(egress_entry, false);
> +               if (!tcx_entry_is_active(egress_entry)) {
> +                       tcx_entry_update(dev, NULL, false);
> +                       tcx_entry_free(egress_entry);
> +               }
> +       }
>
>         net_dec_ingress_queue();
>         net_dec_egress_queue();
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 1c166870cdf3..47b76925189f 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -1036,6 +1036,8 @@ enum bpf_attach_type {
>         BPF_LSM_CGROUP,
>         BPF_STRUCT_OPS,
>         BPF_NETFILTER,
> +       BPF_TCX_INGRESS,
> +       BPF_TCX_EGRESS,
>         __MAX_BPF_ATTACH_TYPE
>  };
>
> @@ -1053,7 +1055,7 @@ enum bpf_link_type {
>         BPF_LINK_TYPE_KPROBE_MULTI = 8,
>         BPF_LINK_TYPE_STRUCT_OPS = 9,
>         BPF_LINK_TYPE_NETFILTER = 10,
> -
> +       BPF_LINK_TYPE_TCX = 11,
>         MAX_BPF_LINK_TYPE,
>  };
>
> @@ -1569,13 +1571,13 @@ union bpf_attr {
>                         __u32           map_fd;         /* struct_ops to attach */
>                 };
>                 union {
> -                       __u32           target_fd;      /* object to attach to */
> -                       __u32           target_ifindex; /* target ifindex */
> +                       __u32   target_fd;      /* target object to attach to or ... */
> +                       __u32   target_ifindex; /* target ifindex */
>                 };
>                 __u32           attach_type;    /* attach type */
>                 __u32           flags;          /* extra flags */
>                 union {
> -                       __u32           target_btf_id;  /* btf_id of target to attach to */
> +                       __u32   target_btf_id;  /* btf_id of target to attach to */
>                         struct {
>                                 __aligned_u64   iter_info;      /* extra bpf_iter_link_info */
>                                 __u32           iter_info_len;  /* iter_info length */
> @@ -1609,6 +1611,13 @@ union bpf_attr {
>                                 __s32           priority;
>                                 __u32           flags;
>                         } netfilter;
> +                       struct {
> +                               union {
> +                                       __u32   relative_fd;
> +                                       __u32   relative_id;
> +                               };
> +                               __u64           expected_revision;
> +                       } tcx;
>                 };
>         } link_create;
>
> @@ -6217,6 +6226,19 @@ struct bpf_sock_tuple {
>         };
>  };
>
> +/* (Simplified) user return codes for tcx prog type.
> + * A valid tcx program must return one of these defined values. All other
> + * return codes are reserved for future use. Must remain compatible with
> + * their TC_ACT_* counter-parts. For compatibility in behavior, unknown
> + * return codes are mapped to TCX_NEXT.
> + */
> +enum tcx_action_base {
> +       TCX_NEXT        = -1,
> +       TCX_PASS        = 0,
> +       TCX_DROP        = 2,
> +       TCX_REDIRECT    = 7,
> +};
> +
>  struct bpf_xdp_sock {
>         __u32 queue_id;
>  };
> @@ -6499,6 +6521,10 @@ struct bpf_link_info {
>                                 } event; /* BPF_PERF_EVENT_EVENT */
>                         };
>                 } perf_event;
> +               struct {
> +                       __u32 ifindex;
> +                       __u32 attach_type;
> +               } tcx;
>         };
>  } __attribute__((aligned(8)));
>
> --
> 2.34.1
>
>
Daniel Borkmann July 21, 2023, 12:53 p.m. UTC | #2
On 7/20/23 4:13 AM, Yafang Shao wrote:
> On Wed, Jul 19, 2023 at 10:11 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
[...]
>> +static const struct bpf_link_ops tcx_link_lops = {
>> +       .release        = tcx_link_release,
>> +       .detach         = tcx_link_detach,
>> +       .dealloc        = tcx_link_dealloc,
>> +       .update_prog    = tcx_link_update,
>> +       .show_fdinfo    = tcx_link_fdinfo,
>> +       .fill_link_info = tcx_link_fill_info,
> 
> Should we show the tc link info in `bpftool link show` as well? I
> believe that `bpftool link show` is the appropriate command to display
> comprehensive information about all links.

Yep, good idea. I'll add this to my todo list to tackle once I'm back
from travel.

Thanks,
Daniel
Petr Machata July 21, 2023, 2:57 p.m. UTC | #3
As of this patch (commit e420bed02507), TC qdisc installation and/or
removal cause memory access issues in the system.

A semi-minimal reproducer is:

    bash-5.2# ip l a name v1 type veth peer name v2
    bash-5.2# ip l s dev v1 up
    bash-5.2# ip l s dev v2 up
    bash-5.2# tc q a dev v1 ingress
    bash-5.2# tc q d dev v1 ingress
    bash-5.2# tc q a dev v1 ingress
    bash-5.2# tc q d dev v1 ingress

It's a bit finicky, but only a little. For me, the first two "tc q"
operations never triggered a splat. Then it could take a few "tc q a"
"tc q d" iterations to get it to splat. So it looks like maybe the first
"tc q d" is the problematic bit? And then there's some likelihood of
failing on any following "tc q" operation. The above in particular
produced three warning splats for me (attached as decoded.txt,
decoded2.txt and decoded3.txt). Probing further:

    bash-5.2# tc q a dev v1 ingress

Produced two more splats from KASAN (decoded4.txt and decoded5.txt),
which look more serious.

Further attempts to prod the system deadlock it, I guess because RTNL
was left locked.

Reverting e420bed02507, together with fe20ce3a5126 and 55cc3768473e, which
fail to build without it, makes net-next/main work again.
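
Since it can take a few add/delete iterations to hit a splat, the
reproducer above could be sketched as a small loop. This is only a
sketch: the ITERATIONS and RUN variables and the repro_cmds helper are
made up for illustration and are not part of the original report. The
abbreviated commands are spelled out in full (ip link add/set, tc qdisc
add/del). By default the script only prints the commands; running them
needs root and iproute2.

```shell
#!/bin/sh
# Sketch of the reproducer as a loop. Assumptions (not from the
# original report): the ITERATIONS and RUN variables and the
# repro_cmds helper name.

ITERATIONS="${ITERATIONS:-3}"

# Print the command sequence: veth setup once, then add/delete of the
# ingress qdisc repeated ITERATIONS times.
repro_cmds() {
    echo "ip link add name v1 type veth peer name v2"
    echo "ip link set dev v1 up"
    echo "ip link set dev v2 up"
    i=0
    while [ "$i" -lt "$ITERATIONS" ]; do
        echo "tc qdisc add dev v1 ingress"
        echo "tc qdisc del dev v1 ingress"
        i=$((i + 1))
    done
}

# By default only print the commands; set RUN=1 to execute them.
if [ "${RUN:-0}" = "1" ]; then
    repro_cmds | while read -r cmd; do
        echo "+ $cmd"
        $cmd || break
    done
else
    repro_cmds
fi
```

Running it with RUN=1 ITERATIONS=5 on a debug kernel (debugobjects and
KASAN enabled) and watching dmesg should be enough to tell whether a
given tree is affected.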
[  337.885866] ------------[ cut here ]------------
[  337.886351] ODEBUG: activate active (active state 1) object: ffff888008a7a000 object type: rcu_head hint: 0x0
[  337.887126] WARNING: CPU: 0 PID: 171 at lib/debugobjects.c:514 debug_print_object (/home/petr/src/linux_mlxsw/lib/debugobjects.c:514 (discriminator 2)) 
[  337.887813] Modules linked in: sch_ingress veth
[  337.888504] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
[  337.888996] RIP: 0010:debug_print_object (/home/petr/src/linux_mlxsw/lib/debugobjects.c:514 (discriminator 2)) 
[ 337.889324] Code: 00 fc ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 75 49 41 56 48 8b 14 dd 80 e3 61 83 4c 89 e6 48 c7 c7 e0 d6 61 83 e8 52 8b 2e ff <0f> 0b 58 83 05 9c b1 58 02 01 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e
All code
========
   0:	00 fc                	add    %bh,%ah
   2:	ff                   	(bad)
   3:	df 48 89             	fisttps -0x77(%rax)
   6:	fa                   	cli
   7:	48 c1 ea 03          	shr    $0x3,%rdx
   b:	80 3c 02 00          	cmpb   $0x0,(%rdx,%rax,1)
   f:	75 49                	jne    0x5a
  11:	41 56                	push   %r14
  13:	48 8b 14 dd 80 e3 61 	mov    -0x7c9e1c80(,%rbx,8),%rdx
  1a:	83 
  1b:	4c 89 e6             	mov    %r12,%rsi
  1e:	48 c7 c7 e0 d6 61 83 	mov    $0xffffffff8361d6e0,%rdi
  25:	e8 52 8b 2e ff       	call   0xffffffffff2e8b7c
  2a:*	0f 0b                	ud2		<-- trapping instruction
  2c:	58                   	pop    %rax
  2d:	83 05 9c b1 58 02 01 	addl   $0x1,0x258b19c(%rip)        # 0x258b1d0
  34:	48 83 c4 18          	add    $0x18,%rsp
  38:	5b                   	pop    %rbx
  39:	5d                   	pop    %rbp
  3a:	41 5c                	pop    %r12
  3c:	41 5d                	pop    %r13
  3e:	41 5e                	pop    %r14

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2
   2:	58                   	pop    %rax
   3:	83 05 9c b1 58 02 01 	addl   $0x1,0x258b19c(%rip)        # 0x258b1a6
   a:	48 83 c4 18          	add    $0x18,%rsp
   e:	5b                   	pop    %rbx
   f:	5d                   	pop    %rbp
  10:	41 5c                	pop    %r12
  12:	41 5d                	pop    %r13
  14:	41 5e                	pop    %r14
[  337.890435] RSP: 0018:ffffc9000009f1c8 EFLAGS: 00010286
[  337.890798] RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000
[  337.891213] RDX: ffff88800c8c9fc0 RSI: ffffffff813f13cb RDI: 0000000000000001
[  337.891639] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000
[  337.892038] R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff8361dd40
[  337.892460] R13: ffffffff834daf80 R14: 0000000000000000 R15: ffff8880093eee90
[  337.892865] FS:  00007f0089130740(0000) GS:ffff888036000000(0000) knlGS:0000000000000000
[  337.893339] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  337.893673] CR2: 000055958cefedc0 CR3: 000000000c0af001 CR4: 0000000000370ef0
[  337.894071] Call Trace:
[  337.894244]  <TASK>
[  337.894380] ? __warn (/home/petr/src/linux_mlxsw/kernel/panic.c:673) 
[  337.894585] ? debug_print_object (/home/petr/src/linux_mlxsw/lib/debugobjects.c:514 (discriminator 2)) 
[  337.894865] ? report_bug (/home/petr/src/linux_mlxsw/lib/bug.c:180 /home/petr/src/linux_mlxsw/lib/bug.c:219) 
[  337.895154] ? handle_bug (/home/petr/src/linux_mlxsw/arch/x86/kernel/traps.c:324 (discriminator 1)) 
[  337.895390] ? exc_invalid_op (/home/petr/src/linux_mlxsw/arch/x86/kernel/traps.c:345 (discriminator 1)) 
[  337.895628] ? asm_exc_invalid_op (/home/petr/src/linux_mlxsw/./arch/x86/include/asm/idtentry.h:568) 
[  337.895890] ? __warn_printk (/home/petr/src/linux_mlxsw/kernel/panic.c:712) 
[  337.896124] ? debug_print_object (/home/petr/src/linux_mlxsw/lib/debugobjects.c:514 (discriminator 2)) 
[  337.896414] ? debug_print_object (/home/petr/src/linux_mlxsw/lib/debugobjects.c:514 (discriminator 2)) 
[  337.896680] ? _raw_spin_unlock_irqrestore (/home/petr/src/linux_mlxsw/./arch/x86/include/asm/irqflags.h:42 /home/petr/src/linux_mlxsw/./arch/x86/include/asm/irqflags.h:77 /home/petr/src/linux_mlxsw/./arch/x86/include/asm/irqflags.h:135 /home/petr/src/linux_mlxsw/./include/linux/spinlock_api_smp.h:151 /home/petr/src/linux_mlxsw/kernel/locking/spinlock.c:194) 
[  337.896981] debug_object_activate (/home/petr/src/linux_mlxsw/lib/debugobjects.c:734) 
[  337.897274] ? debug_object_destroy (/home/petr/src/linux_mlxsw/lib/debugobjects.c:702) 
[  337.897559] ? mark_held_locks (/home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:4281 (discriminator 1)) 
[  337.897811] ? kvfree_call_rcu (/home/petr/src/linux_mlxsw/kernel/rcu/rcu.h:227 /home/petr/src/linux_mlxsw/kernel/rcu/tree.c:3359) 
[  337.898049] kvfree_call_rcu (/home/petr/src/linux_mlxsw/kernel/rcu/rcu.h:227 /home/petr/src/linux_mlxsw/kernel/rcu/tree.c:3359) 
[  337.898300] ? __tcf_block_put (/home/petr/src/linux_mlxsw/net/sched/cls_api.c:535 /home/petr/src/linux_mlxsw/net/sched/cls_api.c:530 /home/petr/src/linux_mlxsw/net/sched/cls_api.c:1301) 
[  337.898569] ingress_destroy (/home/petr/src/linux_mlxsw/net/sched/sch_ingress.c:131) sch_ingress
[  337.898877] ? clsact_init (/home/petr/src/linux_mlxsw/net/sched/sch_ingress.c:113) sch_ingress
[  337.899192] __qdisc_destroy (/home/petr/src/linux_mlxsw/net/sched/sch_generic.c:1065) 
[  337.899443] qdisc_destroy (/home/petr/src/linux_mlxsw/net/sched/sch_generic.c:1079) 
[  337.899669] qdisc_graft (/home/petr/src/linux_mlxsw/net/sched/sch_api.c:1134) 
[  337.899939] ? tc_dump_tclass (/home/petr/src/linux_mlxsw/net/sched/sch_api.c:1076) 
[  337.900248] tc_get_qdisc (/home/petr/src/linux_mlxsw/net/sched/sch_api.c:1541) 
[  337.900479] ? tc_ctl_tclass (/home/petr/src/linux_mlxsw/net/sched/sch_api.c:1475) 
[  337.900713] ? rtnetlink_rcv_msg (/home/petr/src/linux_mlxsw/net/core/rtnetlink.c:6421) 
[  337.901034] ? cap_capable (/home/petr/src/linux_mlxsw/security/commoncap.c:102) 
[  337.901384] ? lock_is_held_type (/home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:467 (discriminator 4) /home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:5833 (discriminator 4)) 
[  337.901641] ? tc_ctl_tclass (/home/petr/src/linux_mlxsw/net/sched/sch_api.c:1475) 
[  337.901902] rtnetlink_rcv_msg (/home/petr/src/linux_mlxsw/net/core/rtnetlink.c:6423) 
[  337.902196] ? rtnl_dump_ifinfo (/home/petr/src/linux_mlxsw/net/core/rtnetlink.c:6319) 
[  337.902478] ? lockdep_hardirqs_on_prepare (/home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:5000) 
[  337.902797] ? lockdep_hardirqs_on_prepare (/home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:5000) 
[  337.903101] ? find_held_lock (/home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:5251 (discriminator 1)) 
[  337.903378] netlink_rcv_skb (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:2547) 
[  337.903626] ? rtnl_dump_ifinfo (/home/petr/src/linux_mlxsw/net/core/rtnetlink.c:6319) 
[  337.903902] ? netlink_ack (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:2523) 
[  337.904139] ? lock_sync (/home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:5729) 
[  337.904409] ? netlink_deliver_tap (/home/petr/src/linux_mlxsw/./include/linux/rcupdate.h:308 /home/petr/src/linux_mlxsw/./include/linux/rcupdate.h:782 /home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:340) 
[  337.904679] ? is_vmalloc_addr (/home/petr/src/linux_mlxsw/mm/vmalloc.c:83) 
[  337.904933] netlink_unicast (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1340 /home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1365) 
[  337.905183] ? netlink_attachskb (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1350) 
[  337.905462] ? __sanitizer_cov_trace_switch (/home/petr/src/linux_mlxsw/kernel/kcov.c:340 (discriminator 1)) 
[  337.905778] ? __check_object_size (/home/petr/src/linux_mlxsw/mm/usercopy.c:113 /home/petr/src/linux_mlxsw/mm/usercopy.c:145 /home/petr/src/linux_mlxsw/mm/usercopy.c:254 /home/petr/src/linux_mlxsw/mm/usercopy.c:213) 
[  337.906050] netlink_sendmsg (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1911) 
[  337.906319] ? netlink_unicast (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1830) 
[  337.906615] ? netlink_unicast (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1830) 
[  337.906916] ____sys_sendmsg (/home/petr/src/linux_mlxsw/net/socket.c:728 (discriminator 1) /home/petr/src/linux_mlxsw/net/socket.c:748 (discriminator 1) /home/petr/src/linux_mlxsw/net/socket.c:2494 (discriminator 1)) 
[  337.907170] ? copy_msghdr_from_user (/home/petr/src/linux_mlxsw/net/socket.c:2420) 
[  337.907481] ? sock_read_iter (/home/petr/src/linux_mlxsw/net/socket.c:2440) 
[  337.907739] ? __lock_acquire (/home/petr/src/linux_mlxsw/./arch/x86/include/asm/bitops.h:228 /home/petr/src/linux_mlxsw/./arch/x86/include/asm/bitops.h:240 /home/petr/src/linux_mlxsw/./include/asm-generic/bitops/instrumented-non-atomic.h:142 /home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:228 /home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:3788 /home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:3844 /home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:5144) 
[  337.908007] ___sys_sendmsg (/home/petr/src/linux_mlxsw/net/socket.c:2550) 
[  337.908275] ? do_recvmmsg (/home/petr/src/linux_mlxsw/net/socket.c:2537) 
[  337.908547] ? local_clock_noinstr (/home/petr/src/linux_mlxsw/kernel/sched/clock.c:301 (discriminator 1)) 
[  337.908810] ? __fget_light (/home/petr/src/linux_mlxsw/fs/file.c:1027) 
[  337.909080] __sys_sendmsg (/home/petr/src/linux_mlxsw/net/socket.c:2579) 
[  337.909343] ? __sys_sendmsg_sock (/home/petr/src/linux_mlxsw/net/socket.c:2565) 
[  337.909607] ? xfd_validate_state (/home/petr/src/linux_mlxsw/arch/x86/kernel/fpu/xstate.c:1411 /home/petr/src/linux_mlxsw/arch/x86/kernel/fpu/xstate.c:1455) 
[  337.909899] ? syscall_enter_from_user_mode (/home/petr/src/linux_mlxsw/./arch/x86/include/asm/irqflags.h:42 /home/petr/src/linux_mlxsw/./arch/x86/include/asm/irqflags.h:77 /home/petr/src/linux_mlxsw/kernel/entry/common.c:111) 
[  337.910245] do_syscall_64 (/home/petr/src/linux_mlxsw/arch/x86/entry/common.c:50 (discriminator 1) /home/petr/src/linux_mlxsw/arch/x86/entry/common.c:80 (discriminator 1)) 
[  337.910480] entry_SYSCALL_64_after_hwframe (/home/petr/src/linux_mlxsw/arch/x86/entry/entry_64.S:120) 
[  337.910813] RIP: 0033:0x7f008946a8b4
[ 337.911091] Code: 15 59 f5 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b5 0f 1f 00 f3 0f 1e fa 80 3d 2d 7d 0c 00 00 74 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 4c c3 0f 1f 00 55 48 89 e5 48 83 ec 20 89 55
All code
========
   0:	15 59 f5 0b 00       	adc    $0xbf559,%eax
   5:	f7 d8                	neg    %eax
   7:	64 89 02             	mov    %eax,%fs:(%rdx)
   a:	48 c7 c0 ff ff ff ff 	mov    $0xffffffffffffffff,%rax
  11:	eb b5                	jmp    0xffffffffffffffc8
  13:	0f 1f 00             	nopl   (%rax)
  16:	f3 0f 1e fa          	endbr64
  1a:	80 3d 2d 7d 0c 00 00 	cmpb   $0x0,0xc7d2d(%rip)        # 0xc7d4e
  21:	74 13                	je     0x36
  23:	b8 2e 00 00 00       	mov    $0x2e,%eax
  28:	0f 05                	syscall
  2a:*	48 3d 00 f0 ff ff    	cmp    $0xfffffffffffff000,%rax		<-- trapping instruction
  30:	77 4c                	ja     0x7e
  32:	c3                   	ret
  33:	0f 1f 00             	nopl   (%rax)
  36:	55                   	push   %rbp
  37:	48 89 e5             	mov    %rsp,%rbp
  3a:	48 83 ec 20          	sub    $0x20,%rsp
  3e:	89                   	.byte 0x89
  3f:	55                   	push   %rbp

Code starting with the faulting instruction
===========================================
   0:	48 3d 00 f0 ff ff    	cmp    $0xfffffffffffff000,%rax
   6:	77 4c                	ja     0x54
   8:	c3                   	ret
   9:	0f 1f 00             	nopl   (%rax)
   c:	55                   	push   %rbp
   d:	48 89 e5             	mov    %rsp,%rbp
  10:	48 83 ec 20          	sub    $0x20,%rsp
  14:	89                   	.byte 0x89
  15:	55                   	push   %rbp
[  337.912135] RSP: 002b:00007ffd615b18c8 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
[  337.912598] RAX: ffffffffffffffda RBX: 000055958cf26f80 RCX: 00007f008946a8b4
[  337.913006] RDX: 0000000000000000 RSI: 00007ffd615b1940 RDI: 0000000000000003
[  337.913435] RBP: 00007ffd615b19b0 R08: 0000000064bab34a R09: 0000000000000001
[  337.913846] R10: 0000000000000001 R11: 0000000000000202 R12: 00007ffd615b1a30
[  337.914302] R13: 0000000064bab34b R14: 000055958cf26f80 R15: 0000000000000000
[  337.914793]  </TASK>
[  337.914929] irq event stamp: 167013
[  337.915150] hardirqs last enabled at (167021): __up_console_sem (/home/petr/src/linux_mlxsw/./arch/x86/include/asm/irqflags.h:42 /home/petr/src/linux_mlxsw/./arch/x86/include/asm/irqflags.h:77 /home/petr/src/linux_mlxsw/./arch/x86/include/asm/irqflags.h:135 /home/petr/src/linux_mlxsw/kernel/printk/printk.c:347 /home/petr/src/linux_mlxsw/kernel/printk/printk.c:339) 
[  337.915684] hardirqs last disabled at (167030): __up_console_sem (/home/petr/src/linux_mlxsw/kernel/printk/printk.c:345 (discriminator 3)) 
[  337.916182] softirqs last enabled at (166282): irq_exit_rcu (/home/petr/src/linux_mlxsw/kernel/softirq.c:427 /home/petr/src/linux_mlxsw/kernel/softirq.c:632 /home/petr/src/linux_mlxsw/kernel/softirq.c:644) 
[  337.916803] softirqs last disabled at (166245): irq_exit_rcu (/home/petr/src/linux_mlxsw/kernel/softirq.c:427 /home/petr/src/linux_mlxsw/kernel/softirq.c:632 /home/petr/src/linux_mlxsw/kernel/softirq.c:644) 
[  337.917302] ---[ end trace 0000000000000000 ]---
[  337.918159] ------------[ cut here ]------------
[  337.918626] ODEBUG: active_state active (active state 1) object: ffff888008a7a000 object type: rcu_head hint: 0x0
[  337.920604] WARNING: CPU: 0 PID: 171 at lib/debugobjects.c:514 debug_print_object (/home/petr/src/linux_mlxsw/lib/debugobjects.c:514 (discriminator 2)) 
[  337.921684] Modules linked in: sch_ingress veth
[  337.923119] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
[  337.924543] RIP: 0010:debug_print_object (/home/petr/src/linux_mlxsw/lib/debugobjects.c:514 (discriminator 2)) 
[ 337.925136] Code: 00 fc ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 75 49 41 56 48 8b 14 dd 80 e3 61 83 4c 89 e6 48 c7 c7 e0 d6 61 83 e8 52 8b 2e ff <0f> 0b 58 83 05 9c b1 58 02 01 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e
All code
========
   0:	00 fc                	add    %bh,%ah
   2:	ff                   	(bad)
   3:	df 48 89             	fisttps -0x77(%rax)
   6:	fa                   	cli
   7:	48 c1 ea 03          	shr    $0x3,%rdx
   b:	80 3c 02 00          	cmpb   $0x0,(%rdx,%rax,1)
   f:	75 49                	jne    0x5a
  11:	41 56                	push   %r14
  13:	48 8b 14 dd 80 e3 61 	mov    -0x7c9e1c80(,%rbx,8),%rdx
  1a:	83 
  1b:	4c 89 e6             	mov    %r12,%rsi
  1e:	48 c7 c7 e0 d6 61 83 	mov    $0xffffffff8361d6e0,%rdi
  25:	e8 52 8b 2e ff       	call   0xffffffffff2e8b7c
  2a:*	0f 0b                	ud2		<-- trapping instruction
  2c:	58                   	pop    %rax
  2d:	83 05 9c b1 58 02 01 	addl   $0x1,0x258b19c(%rip)        # 0x258b1d0
  34:	48 83 c4 18          	add    $0x18,%rsp
  38:	5b                   	pop    %rbx
  39:	5d                   	pop    %rbp
  3a:	41 5c                	pop    %r12
  3c:	41 5d                	pop    %r13
  3e:	41 5e                	pop    %r14

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2
   2:	58                   	pop    %rax
   3:	83 05 9c b1 58 02 01 	addl   $0x1,0x258b19c(%rip)        # 0x258b1a6
   a:	48 83 c4 18          	add    $0x18,%rsp
   e:	5b                   	pop    %rbx
   f:	5d                   	pop    %rbp
  10:	41 5c                	pop    %r12
  12:	41 5d                	pop    %r13
  14:	41 5e                	pop    %r14
[  337.926728] RSP: 0000:ffffc9000009f1c8 EFLAGS: 00010286
[  337.927123] RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000
[  337.927704] RDX: ffff88800c8c9fc0 RSI: ffffffff813f13cb RDI: 0000000000000001
[  337.928362] RBP: 0000000000000002 R08: 0000000000000001 R09: 0000000000000000
[  337.928930] R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff8361db20
[  337.929541] R13: ffffffff834daf80 R14: 0000000000000000 R15: ffff8880093eee90
[  337.929977] FS:  00007f0089130740(0000) GS:ffff888036000000(0000) knlGS:0000000000000000
[  337.930495] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  337.930842] CR2: 00007fd15cc3d000 CR3: 000000000c0af001 CR4: 0000000000370ef0
[  337.931284] Call Trace:
[  337.931443]  <TASK>
[  337.931579] ? __warn (/home/petr/src/linux_mlxsw/kernel/panic.c:673) 
[  337.931801] ? debug_print_object (/home/petr/src/linux_mlxsw/lib/debugobjects.c:514 (discriminator 2)) 
[  337.932083] ? report_bug (/home/petr/src/linux_mlxsw/lib/bug.c:180 /home/petr/src/linux_mlxsw/lib/bug.c:219) 
[  337.932349] ? handle_bug (/home/petr/src/linux_mlxsw/arch/x86/kernel/traps.c:324 (discriminator 1)) 
[  337.932576] ? exc_invalid_op (/home/petr/src/linux_mlxsw/arch/x86/kernel/traps.c:345 (discriminator 1)) 
[  337.932816] ? asm_exc_invalid_op (/home/petr/src/linux_mlxsw/./arch/x86/include/asm/idtentry.h:568) 
[  337.933086] ? __warn_printk (/home/petr/src/linux_mlxsw/kernel/panic.c:712) 
[  337.933347] ? debug_print_object (/home/petr/src/linux_mlxsw/lib/debugobjects.c:514 (discriminator 2)) 
[  337.933616] ? debug_print_object (/home/petr/src/linux_mlxsw/lib/debugobjects.c:514 (discriminator 2)) 
[  337.933878] ? _raw_spin_unlock_irqrestore (/home/petr/src/linux_mlxsw/./arch/x86/include/asm/irqflags.h:42 /home/petr/src/linux_mlxsw/./arch/x86/include/asm/irqflags.h:77 /home/petr/src/linux_mlxsw/./arch/x86/include/asm/irqflags.h:135 /home/petr/src/linux_mlxsw/./include/linux/spinlock_api_smp.h:151 /home/petr/src/linux_mlxsw/kernel/locking/spinlock.c:194) 
[  337.934192] debug_object_active_state (/home/petr/src/linux_mlxsw/lib/debugobjects.c:993 /home/petr/src/linux_mlxsw/lib/debugobjects.c:954) 
[  337.934500] ? debug_stats_show (/home/petr/src/linux_mlxsw/lib/debugobjects.c:956) 
[  337.934763] ? mark_held_locks (/home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:4281 (discriminator 1)) 
[  337.935017] kvfree_call_rcu (/home/petr/src/linux_mlxsw/kernel/rcu/tree.c:3359 (discriminator 1)) 
[  337.935271] ? __tcf_block_put (/home/petr/src/linux_mlxsw/net/sched/cls_api.c:535 /home/petr/src/linux_mlxsw/net/sched/cls_api.c:530 /home/petr/src/linux_mlxsw/net/sched/cls_api.c:1301) 
[  337.935537] ingress_destroy (/home/petr/src/linux_mlxsw/net/sched/sch_ingress.c:131) sch_ingress
[  337.935852] ? clsact_init (/home/petr/src/linux_mlxsw/net/sched/sch_ingress.c:113) sch_ingress
[  337.936188] __qdisc_destroy (/home/petr/src/linux_mlxsw/net/sched/sch_generic.c:1065) 
[  337.936461] qdisc_destroy (/home/petr/src/linux_mlxsw/net/sched/sch_generic.c:1079) 
[  337.936694] qdisc_graft (/home/petr/src/linux_mlxsw/net/sched/sch_api.c:1134) 
[  337.936939] ? tc_dump_tclass (/home/petr/src/linux_mlxsw/net/sched/sch_api.c:1076) 
[  337.937287] tc_get_qdisc (/home/petr/src/linux_mlxsw/net/sched/sch_api.c:1541) 
[  337.937545] ? tc_ctl_tclass (/home/petr/src/linux_mlxsw/net/sched/sch_api.c:1475) 
[  337.937809] ? rtnetlink_rcv_msg (/home/petr/src/linux_mlxsw/net/core/rtnetlink.c:6421) 
[  337.938095] ? cap_capable (/home/petr/src/linux_mlxsw/security/commoncap.c:102) 
[  337.938360] ? lock_is_held_type (/home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:467 (discriminator 4) /home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:5833 (discriminator 4)) 
[  337.938620] ? tc_ctl_tclass (/home/petr/src/linux_mlxsw/net/sched/sch_api.c:1475) 
[  337.938883] rtnetlink_rcv_msg (/home/petr/src/linux_mlxsw/net/core/rtnetlink.c:6423) 
[  337.939320] ? rtnl_dump_ifinfo (/home/petr/src/linux_mlxsw/net/core/rtnetlink.c:6319) 
[  337.939726] ? lockdep_hardirqs_on_prepare (/home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:5000) 
[  337.940217] ? lockdep_hardirqs_on_prepare (/home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:5000) 
[  337.940749] ? find_held_lock (/home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:5251 (discriminator 1)) 
[  337.941119] netlink_rcv_skb (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:2547) 
[  337.941659] ? rtnl_dump_ifinfo (/home/petr/src/linux_mlxsw/net/core/rtnetlink.c:6319) 
[  337.942060] ? netlink_ack (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:2523) 
[  337.942478] ? lock_sync (/home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:5729) 
[  337.942825] ? netlink_deliver_tap (/home/petr/src/linux_mlxsw/./include/linux/rcupdate.h:308 /home/petr/src/linux_mlxsw/./include/linux/rcupdate.h:782 /home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:340) 
[  337.943338] ? is_vmalloc_addr (/home/petr/src/linux_mlxsw/mm/vmalloc.c:83) 
[  337.943667] netlink_unicast (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1340 /home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1365) 
[  337.943947] ? netlink_attachskb (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1350) 
[  337.944312] ? __sanitizer_cov_trace_switch (/home/petr/src/linux_mlxsw/kernel/kcov.c:340 (discriminator 1)) 
[  337.944706] ? __check_object_size (/home/petr/src/linux_mlxsw/mm/usercopy.c:113 /home/petr/src/linux_mlxsw/mm/usercopy.c:145 /home/petr/src/linux_mlxsw/mm/usercopy.c:254 /home/petr/src/linux_mlxsw/mm/usercopy.c:213) 
[  337.945042] netlink_sendmsg (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1911) 
[  337.945468] ? netlink_unicast (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1830) 
[  337.945748] ? netlink_unicast (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1830) 
[  337.945997] ____sys_sendmsg (/home/petr/src/linux_mlxsw/net/socket.c:728 (discriminator 1) /home/petr/src/linux_mlxsw/net/socket.c:748 (discriminator 1) /home/petr/src/linux_mlxsw/net/socket.c:2494 (discriminator 1)) 
[  337.946273] ? copy_msghdr_from_user (/home/petr/src/linux_mlxsw/net/socket.c:2420) 
[  337.946560] ? sock_read_iter (/home/petr/src/linux_mlxsw/net/socket.c:2440) 
[  337.946819] ? __lock_acquire (/home/petr/src/linux_mlxsw/./arch/x86/include/asm/bitops.h:228 /home/petr/src/linux_mlxsw/./arch/x86/include/asm/bitops.h:240 /home/petr/src/linux_mlxsw/./include/asm-generic/bitops/instrumented-non-atomic.h:142 /home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:228 /home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:3788 /home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:3844 /home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:5144) 
[  337.947097] ___sys_sendmsg (/home/petr/src/linux_mlxsw/net/socket.c:2550) 
[  337.947372] ? do_recvmmsg (/home/petr/src/linux_mlxsw/net/socket.c:2537) 
[  337.947653] ? local_clock_noinstr (/home/petr/src/linux_mlxsw/kernel/sched/clock.c:301 (discriminator 1)) 
[  337.947909] ? __fget_light (/home/petr/src/linux_mlxsw/fs/file.c:1027) 
[  337.948164] __sys_sendmsg (/home/petr/src/linux_mlxsw/net/socket.c:2579) 
[  337.948417] ? __sys_sendmsg_sock (/home/petr/src/linux_mlxsw/net/socket.c:2565) 
[  337.948689] ? xfd_validate_state (/home/petr/src/linux_mlxsw/arch/x86/kernel/fpu/xstate.c:1411 /home/petr/src/linux_mlxsw/arch/x86/kernel/fpu/xstate.c:1455) 
[  337.948970] ? syscall_enter_from_user_mode (/home/petr/src/linux_mlxsw/./arch/x86/include/asm/irqflags.h:42 /home/petr/src/linux_mlxsw/./arch/x86/include/asm/irqflags.h:77 /home/petr/src/linux_mlxsw/kernel/entry/common.c:111) 
[  337.949300] do_syscall_64 (/home/petr/src/linux_mlxsw/arch/x86/entry/common.c:50 (discriminator 1) /home/petr/src/linux_mlxsw/arch/x86/entry/common.c:80 (discriminator 1)) 
[  337.949530] entry_SYSCALL_64_after_hwframe (/home/petr/src/linux_mlxsw/arch/x86/entry/entry_64.S:120) 
[  337.949836] RIP: 0033:0x7f008946a8b4
[ 337.950059] Code: 15 59 f5 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b5 0f 1f 00 f3 0f 1e fa 80 3d 2d 7d 0c 00 00 74 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 4c c3 0f 1f 00 55 48 89 e5 48 83 ec 20 89 55
All code
========
   0:	15 59 f5 0b 00       	adc    $0xbf559,%eax
   5:	f7 d8                	neg    %eax
   7:	64 89 02             	mov    %eax,%fs:(%rdx)
   a:	48 c7 c0 ff ff ff ff 	mov    $0xffffffffffffffff,%rax
  11:	eb b5                	jmp    0xffffffffffffffc8
  13:	0f 1f 00             	nopl   (%rax)
  16:	f3 0f 1e fa          	endbr64
  1a:	80 3d 2d 7d 0c 00 00 	cmpb   $0x0,0xc7d2d(%rip)        # 0xc7d4e
  21:	74 13                	je     0x36
  23:	b8 2e 00 00 00       	mov    $0x2e,%eax
  28:	0f 05                	syscall
  2a:*	48 3d 00 f0 ff ff    	cmp    $0xfffffffffffff000,%rax		<-- trapping instruction
  30:	77 4c                	ja     0x7e
  32:	c3                   	ret
  33:	0f 1f 00             	nopl   (%rax)
  36:	55                   	push   %rbp
  37:	48 89 e5             	mov    %rsp,%rbp
  3a:	48 83 ec 20          	sub    $0x20,%rsp
  3e:	89                   	.byte 0x89
  3f:	55                   	push   %rbp

Code starting with the faulting instruction
===========================================
   0:	48 3d 00 f0 ff ff    	cmp    $0xfffffffffffff000,%rax
   6:	77 4c                	ja     0x54
   8:	c3                   	ret
   9:	0f 1f 00             	nopl   (%rax)
   c:	55                   	push   %rbp
   d:	48 89 e5             	mov    %rsp,%rbp
  10:	48 83 ec 20          	sub    $0x20,%rsp
  14:	89                   	.byte 0x89
  15:	55                   	push   %rbp
[  337.951115] RSP: 002b:00007ffd615b18c8 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
[  337.951576] RAX: ffffffffffffffda RBX: 000055958cf26f80 RCX: 00007f008946a8b4
[  337.951983] RDX: 0000000000000000 RSI: 00007ffd615b1940 RDI: 0000000000000003
[  337.952415] RBP: 00007ffd615b19b0 R08: 0000000064bab34a R09: 0000000000000001
[  337.952825] R10: 0000000000000001 R11: 0000000000000202 R12: 00007ffd615b1a30
[  337.953247] R13: 0000000064bab34b R14: 000055958cf26f80 R15: 0000000000000000
[  337.953670]  </TASK>
[  337.953809] irq event stamp: 167935
[  337.954012] hardirqs last enabled at (167943): __up_console_sem (/home/petr/src/linux_mlxsw/./arch/x86/include/asm/irqflags.h:42 /home/petr/src/linux_mlxsw/./arch/x86/include/asm/irqflags.h:77 /home/petr/src/linux_mlxsw/./arch/x86/include/asm/irqflags.h:135 /home/petr/src/linux_mlxsw/kernel/printk/printk.c:347 /home/petr/src/linux_mlxsw/kernel/printk/printk.c:339) 
[  337.954517] hardirqs last disabled at (167952): __up_console_sem (/home/petr/src/linux_mlxsw/kernel/printk/printk.c:345 (discriminator 3)) 
[  337.955007] softirqs last enabled at (167216): irq_exit_rcu (/home/petr/src/linux_mlxsw/kernel/softirq.c:427 /home/petr/src/linux_mlxsw/kernel/softirq.c:632 /home/petr/src/linux_mlxsw/kernel/softirq.c:644) 
[  337.955509] softirqs last disabled at (167199): irq_exit_rcu (/home/petr/src/linux_mlxsw/kernel/softirq.c:427 /home/petr/src/linux_mlxsw/kernel/softirq.c:632 /home/petr/src/linux_mlxsw/kernel/softirq.c:644) 
[  337.955999] ---[ end trace 0000000000000000 ]---
[  337.957029] ------------[ cut here ]------------
[  337.957696] kvfree_call_rcu(): Double-freed call. rcu_head ffff888008a7a638
[  337.961920] WARNING: CPU: 0 PID: 171 at kernel/rcu/tree.c:3361 kvfree_call_rcu (/home/petr/src/linux_mlxsw/kernel/rcu/tree.c:3361 (discriminator 1)) 
[  337.963725] Modules linked in: sch_ingress veth
[  337.964797] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
[  337.965437] RIP: 0010:kvfree_call_rcu (/home/petr/src/linux_mlxsw/kernel/rcu/tree.c:3361 (discriminator 1)) 
[ 337.965764] Code: 00 00 00 e8 2a 4c e8 ff e9 35 01 00 00 4c 89 e2 48 c7 c6 a0 07 4e 83 48 c7 c7 40 df 4d 83 c6 05 29 35 07 03 01 e8 28 7c e0 ff <0f> 0b e9 db fa ff ff 48 b8 00 00 00 00 00 fc ff df 49 8d 7c 24 08
All code
========
   0:	00 00                	add    %al,(%rax)
   2:	00 e8                	add    %ch,%al
   4:	2a 4c e8 ff          	sub    -0x1(%rax,%rbp,8),%cl
   8:	e9 35 01 00 00       	jmp    0x142
   d:	4c 89 e2             	mov    %r12,%rdx
  10:	48 c7 c6 a0 07 4e 83 	mov    $0xffffffff834e07a0,%rsi
  17:	48 c7 c7 40 df 4d 83 	mov    $0xffffffff834ddf40,%rdi
  1e:	c6 05 29 35 07 03 01 	movb   $0x1,0x3073529(%rip)        # 0x307354e
  25:	e8 28 7c e0 ff       	call   0xffffffffffe07c52
  2a:*	0f 0b                	ud2		<-- trapping instruction
  2c:	e9 db fa ff ff       	jmp    0xfffffffffffffb0c
  31:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
  38:	fc ff df 
  3b:	49 8d 7c 24 08       	lea    0x8(%r12),%rdi

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2
   2:	e9 db fa ff ff       	jmp    0xfffffffffffffae2
   7:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
   e:	fc ff df 
  11:	49 8d 7c 24 08       	lea    0x8(%r12),%rdi
[  337.966903] RSP: 0000:ffffc9000009f328 EFLAGS: 00010286
[  337.967220] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[  337.967645] RDX: ffff88800c8c9fc0 RSI: ffffffff813f13cb RDI: 0000000000000001
[  337.968049] RBP: ffff888008a7a000 R08: 0000000000000001 R09: 0000000000000000
[  337.968474] R10: 0000000000000001 R11: 0000000000000001 R12: ffff888008a7a638
[  337.968879] R13: ffff888008a7a208 R14: ffff888008a7a008 R15: 0000000000000002
[  337.969302] FS:  00007f0089130740(0000) GS:ffff888036000000(0000) knlGS:0000000000000000
[  337.969759] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  337.970091] CR2: 00007fd15cc3d000 CR3: 000000000c0af001 CR4: 0000000000370ef0
[  337.970515] Call Trace:
[  337.970670]  <TASK>
[  337.970806] ? __warn (/home/petr/src/linux_mlxsw/kernel/panic.c:673) 
[  337.971008] ? kvfree_call_rcu (/home/petr/src/linux_mlxsw/kernel/rcu/tree.c:3361 (discriminator 1)) 
[  337.971279] ? report_bug (/home/petr/src/linux_mlxsw/lib/bug.c:180 /home/petr/src/linux_mlxsw/lib/bug.c:219) 
[  337.971519] ? handle_bug (/home/petr/src/linux_mlxsw/arch/x86/kernel/traps.c:324 (discriminator 1)) 
[  337.971738] ? exc_invalid_op (/home/petr/src/linux_mlxsw/arch/x86/kernel/traps.c:345 (discriminator 1)) 
[  337.971973] ? asm_exc_invalid_op (/home/petr/src/linux_mlxsw/./arch/x86/include/asm/idtentry.h:568) 
[  337.972261] ? __warn_printk (/home/petr/src/linux_mlxsw/kernel/panic.c:712) 
[  337.972501] ? kvfree_call_rcu (/home/petr/src/linux_mlxsw/kernel/rcu/tree.c:3361 (discriminator 1)) 
[  337.972750] ? kvfree_call_rcu (/home/petr/src/linux_mlxsw/kernel/rcu/tree.c:3361 (discriminator 1)) 
[  337.972997] ? __tcf_block_put (/home/petr/src/linux_mlxsw/net/sched/cls_api.c:535 /home/petr/src/linux_mlxsw/net/sched/cls_api.c:530 /home/petr/src/linux_mlxsw/net/sched/cls_api.c:1301) 
[  337.973276] ingress_destroy (/home/petr/src/linux_mlxsw/net/sched/sch_ingress.c:131) sch_ingress
[  337.973591] ? clsact_init (/home/petr/src/linux_mlxsw/net/sched/sch_ingress.c:113) sch_ingress
[  337.973892] __qdisc_destroy (/home/petr/src/linux_mlxsw/net/sched/sch_generic.c:1065) 
[  337.974128] qdisc_destroy (/home/petr/src/linux_mlxsw/net/sched/sch_generic.c:1079) 
[  337.974371] qdisc_graft (/home/petr/src/linux_mlxsw/net/sched/sch_api.c:1134) 
[  337.974612] ? tc_dump_tclass (/home/petr/src/linux_mlxsw/net/sched/sch_api.c:1076) 
[  337.974873] tc_get_qdisc (/home/petr/src/linux_mlxsw/net/sched/sch_api.c:1541) 
[  337.975106] ? tc_ctl_tclass (/home/petr/src/linux_mlxsw/net/sched/sch_api.c:1475) 
[  337.975363] ? rtnetlink_rcv_msg (/home/petr/src/linux_mlxsw/net/core/rtnetlink.c:6421) 
[  337.975645] ? cap_capable (/home/petr/src/linux_mlxsw/security/commoncap.c:102) 
[  337.975883] ? lock_is_held_type (/home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:467 (discriminator 4) /home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:5833 (discriminator 4)) 
[  337.976137] ? tc_ctl_tclass (/home/petr/src/linux_mlxsw/net/sched/sch_api.c:1475) 
[  337.976398] rtnetlink_rcv_msg (/home/petr/src/linux_mlxsw/net/core/rtnetlink.c:6423) 
[  337.976653] ? rtnl_dump_ifinfo (/home/petr/src/linux_mlxsw/net/core/rtnetlink.c:6319) 
[  337.976915] ? lockdep_hardirqs_on_prepare (/home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:5000) 
[  337.977243] ? lockdep_hardirqs_on_prepare (/home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:5000) 
[  337.977548] ? find_held_lock (/home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:5251 (discriminator 1)) 
[  337.977799] netlink_rcv_skb (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:2547) 
[  337.978041] ? rtnl_dump_ifinfo (/home/petr/src/linux_mlxsw/net/core/rtnetlink.c:6319) 
[  337.978331] ? netlink_ack (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:2523) 
[  337.978583] ? lock_sync (/home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:5729) 
[  337.978833] ? netlink_deliver_tap (/home/petr/src/linux_mlxsw/./include/linux/rcupdate.h:308 /home/petr/src/linux_mlxsw/./include/linux/rcupdate.h:782 /home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:340) 
[  337.979103] ? is_vmalloc_addr (/home/petr/src/linux_mlxsw/mm/vmalloc.c:83) 
[  337.979376] netlink_unicast (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1340 /home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1365) 
[  337.979668] ? netlink_attachskb (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1350) 
[  337.979933] ? __sanitizer_cov_trace_switch (/home/petr/src/linux_mlxsw/kernel/kcov.c:340 (discriminator 1)) 
[  337.980307] ? __check_object_size (/home/petr/src/linux_mlxsw/mm/usercopy.c:113 /home/petr/src/linux_mlxsw/mm/usercopy.c:145 /home/petr/src/linux_mlxsw/mm/usercopy.c:254 /home/petr/src/linux_mlxsw/mm/usercopy.c:213) 
[  337.980608] netlink_sendmsg (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1911) 
[  337.980861] ? netlink_unicast (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1830) 
[  337.981124] ? netlink_unicast (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1830) 
[  337.981406] ____sys_sendmsg (/home/petr/src/linux_mlxsw/net/socket.c:728 (discriminator 1) /home/petr/src/linux_mlxsw/net/socket.c:748 (discriminator 1) /home/petr/src/linux_mlxsw/net/socket.c:2494 (discriminator 1)) 
[  337.981654] ? copy_msghdr_from_user (/home/petr/src/linux_mlxsw/net/socket.c:2420) 
[  337.981944] ? sock_read_iter (/home/petr/src/linux_mlxsw/net/socket.c:2440) 
[  337.982202] ? __lock_acquire (/home/petr/src/linux_mlxsw/./arch/x86/include/asm/bitops.h:228 /home/petr/src/linux_mlxsw/./arch/x86/include/asm/bitops.h:240 /home/petr/src/linux_mlxsw/./include/asm-generic/bitops/instrumented-non-atomic.h:142 /home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:228 /home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:3788 /home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:3844 /home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:5144) 
[  337.982486] ___sys_sendmsg (/home/petr/src/linux_mlxsw/net/socket.c:2550) 
[  337.982734] ? do_recvmmsg (/home/petr/src/linux_mlxsw/net/socket.c:2537) 
[  337.983010] ? local_clock_noinstr (/home/petr/src/linux_mlxsw/kernel/sched/clock.c:301 (discriminator 1)) 
[  337.983299] ? __fget_light (/home/petr/src/linux_mlxsw/fs/file.c:1027) 
[  337.983549] __sys_sendmsg (/home/petr/src/linux_mlxsw/net/socket.c:2579) 
[  337.983782] ? __sys_sendmsg_sock (/home/petr/src/linux_mlxsw/net/socket.c:2565) 
[  337.984043] ? xfd_validate_state (/home/petr/src/linux_mlxsw/arch/x86/kernel/fpu/xstate.c:1411 /home/petr/src/linux_mlxsw/arch/x86/kernel/fpu/xstate.c:1455) 
[  337.984352] ? syscall_enter_from_user_mode (/home/petr/src/linux_mlxsw/./arch/x86/include/asm/irqflags.h:42 /home/petr/src/linux_mlxsw/./arch/x86/include/asm/irqflags.h:77 /home/petr/src/linux_mlxsw/kernel/entry/common.c:111) 
[  337.984667] do_syscall_64 (/home/petr/src/linux_mlxsw/arch/x86/entry/common.c:50 (discriminator 1) /home/petr/src/linux_mlxsw/arch/x86/entry/common.c:80 (discriminator 1)) 
[  337.984894] entry_SYSCALL_64_after_hwframe (/home/petr/src/linux_mlxsw/arch/x86/entry/entry_64.S:120) 
[  337.985202] RIP: 0033:0x7f008946a8b4
[ 337.985437] Code: 15 59 f5 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b5 0f 1f 00 f3 0f 1e fa 80 3d 2d 7d 0c 00 00 74 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 4c c3 0f 1f 00 55 48 89 e5 48 83 ec 20 89 55
All code
========
   0:	15 59 f5 0b 00       	adc    $0xbf559,%eax
   5:	f7 d8                	neg    %eax
   7:	64 89 02             	mov    %eax,%fs:(%rdx)
   a:	48 c7 c0 ff ff ff ff 	mov    $0xffffffffffffffff,%rax
  11:	eb b5                	jmp    0xffffffffffffffc8
  13:	0f 1f 00             	nopl   (%rax)
  16:	f3 0f 1e fa          	endbr64
  1a:	80 3d 2d 7d 0c 00 00 	cmpb   $0x0,0xc7d2d(%rip)        # 0xc7d4e
  21:	74 13                	je     0x36
  23:	b8 2e 00 00 00       	mov    $0x2e,%eax
  28:	0f 05                	syscall
  2a:*	48 3d 00 f0 ff ff    	cmp    $0xfffffffffffff000,%rax		<-- trapping instruction
  30:	77 4c                	ja     0x7e
  32:	c3                   	ret
  33:	0f 1f 00             	nopl   (%rax)
  36:	55                   	push   %rbp
  37:	48 89 e5             	mov    %rsp,%rbp
  3a:	48 83 ec 20          	sub    $0x20,%rsp
  3e:	89                   	.byte 0x89
  3f:	55                   	push   %rbp

Code starting with the faulting instruction
===========================================
   0:	48 3d 00 f0 ff ff    	cmp    $0xfffffffffffff000,%rax
   6:	77 4c                	ja     0x54
   8:	c3                   	ret
   9:	0f 1f 00             	nopl   (%rax)
   c:	55                   	push   %rbp
   d:	48 89 e5             	mov    %rsp,%rbp
  10:	48 83 ec 20          	sub    $0x20,%rsp
  14:	89                   	.byte 0x89
  15:	55                   	push   %rbp
[  337.986472] RSP: 002b:00007ffd615b18c8 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
[  337.986905] RAX: ffffffffffffffda RBX: 000055958cf26f80 RCX: 00007f008946a8b4
[  337.987324] RDX: 0000000000000000 RSI: 00007ffd615b1940 RDI: 0000000000000003
[  337.987730] RBP: 00007ffd615b19b0 R08: 0000000064bab34a R09: 0000000000000001
[  337.988146] R10: 0000000000000001 R11: 0000000000000202 R12: 00007ffd615b1a30
[  337.988585] R13: 0000000064bab34b R14: 000055958cf26f80 R15: 0000000000000000
[  337.989023]  </TASK>
[  337.989171] irq event stamp: 168885
[  337.989399] hardirqs last enabled at (168895): __up_console_sem (/home/petr/src/linux_mlxsw/./arch/x86/include/asm/irqflags.h:42 /home/petr/src/linux_mlxsw/./arch/x86/include/asm/irqflags.h:77 /home/petr/src/linux_mlxsw/./arch/x86/include/asm/irqflags.h:135 /home/petr/src/linux_mlxsw/kernel/printk/printk.c:347 /home/petr/src/linux_mlxsw/kernel/printk/printk.c:339) 
[  337.989908] hardirqs last disabled at (168902): __up_console_sem (/home/petr/src/linux_mlxsw/kernel/printk/printk.c:345 (discriminator 3)) 
[  337.990424] softirqs last enabled at (168216): irq_exit_rcu (/home/petr/src/linux_mlxsw/kernel/softirq.c:427 /home/petr/src/linux_mlxsw/kernel/softirq.c:632 /home/petr/src/linux_mlxsw/kernel/softirq.c:644) 
[  337.990951] softirqs last disabled at (168207): irq_exit_rcu (/home/petr/src/linux_mlxsw/kernel/softirq.c:427 /home/petr/src/linux_mlxsw/kernel/softirq.c:632 /home/petr/src/linux_mlxsw/kernel/softirq.c:644) 
[  337.991830] ---[ end trace 0000000000000000 ]---
[  835.734697] ==================================================================
[  835.735303] BUG: KASAN: slab-use-after-free in ingress_init (/home/petr/src/linux_mlxsw/./include/net/tcx.h:36 /home/petr/src/linux_mlxsw/./include/net/tcx.h:136 /home/petr/src/linux_mlxsw/net/sched/sch_ingress.c:94) sch_ingress
[  835.735840] Read of size 8 at addr ffff888008a7a208 by task tc/303
[  835.736187]
[  835.736761] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
[  835.737244] Call Trace:
[  835.737394]  <TASK>
[  835.737524] dump_stack_lvl (/home/petr/src/linux_mlxsw/lib/dump_stack.c:107) 
[  835.737749] print_report (/home/petr/src/linux_mlxsw/mm/kasan/report.c:365 /home/petr/src/linux_mlxsw/mm/kasan/report.c:475) 
[  835.738015] ? __virt_addr_valid (/home/petr/src/linux_mlxsw/arch/x86/mm/physaddr.c:66) 
[  835.738265] kasan_report (/home/petr/src/linux_mlxsw/mm/kasan/report.c:590) 
[  835.738485] ? ingress_init (/home/petr/src/linux_mlxsw/./include/net/tcx.h:36 /home/petr/src/linux_mlxsw/./include/net/tcx.h:136 /home/petr/src/linux_mlxsw/net/sched/sch_ingress.c:94) sch_ingress
[  835.738783] ? ingress_init (/home/petr/src/linux_mlxsw/./include/net/tcx.h:36 /home/petr/src/linux_mlxsw/./include/net/tcx.h:136 /home/petr/src/linux_mlxsw/net/sched/sch_ingress.c:94) sch_ingress
[  835.739086] ingress_init (/home/petr/src/linux_mlxsw/./include/net/tcx.h:36 /home/petr/src/linux_mlxsw/./include/net/tcx.h:136 /home/petr/src/linux_mlxsw/net/sched/sch_ingress.c:94) sch_ingress
[  835.739393] ? ingress_dump (/home/petr/src/linux_mlxsw/net/sched/sch_ingress.c:79) sch_ingress
[  835.739703] qdisc_create (/home/petr/src/linux_mlxsw/net/sched/sch_api.c:1327) 
[  835.739929] ? tc_get_qdisc (/home/petr/src/linux_mlxsw/net/sched/sch_api.c:1228) 
[  835.740158] ? lock_is_held_type (/home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:467 (discriminator 4) /home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:5833 (discriminator 4)) 
[  835.740409] tc_modify_qdisc (/home/petr/src/linux_mlxsw/net/sched/sch_api.c:1703 (discriminator 1)) 
[  835.740651] ? qdisc_create (/home/petr/src/linux_mlxsw/net/sched/sch_api.c:1556) 
[  835.740886] ? rtnetlink_rcv_msg (/home/petr/src/linux_mlxsw/net/core/rtnetlink.c:6421) 
[  835.741144] ? cap_capable (/home/petr/src/linux_mlxsw/security/commoncap.c:102) 
[  835.741372] ? lock_is_held_type (/home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:467 (discriminator 4) /home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:5833 (discriminator 4)) 
[  835.741664] ? qdisc_create (/home/petr/src/linux_mlxsw/net/sched/sch_api.c:1556) 
[  835.741900] rtnetlink_rcv_msg (/home/petr/src/linux_mlxsw/net/core/rtnetlink.c:6423) 
[  835.742142] ? rtnl_dump_ifinfo (/home/petr/src/linux_mlxsw/net/core/rtnetlink.c:6319) 
[  835.742402] ? lockdep_hardirqs_on_prepare (/home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:5000) 
[  835.742702] ? lockdep_hardirqs_on_prepare (/home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:5000) 
[  835.742998] ? find_held_lock (/home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:5251 (discriminator 1)) 
[  835.743233] netlink_rcv_skb (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:2547) 
[  835.743481] ? rtnl_dump_ifinfo (/home/petr/src/linux_mlxsw/net/core/rtnetlink.c:6319) 
[  835.743732] ? netlink_ack (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:2523) 
[  835.743955] ? lock_sync (/home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:5729) 
[  835.744170] ? netlink_deliver_tap (/home/petr/src/linux_mlxsw/./include/linux/rcupdate.h:308 /home/petr/src/linux_mlxsw/./include/linux/rcupdate.h:782 /home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:340) 
[  835.744463] ? is_vmalloc_addr (/home/petr/src/linux_mlxsw/mm/vmalloc.c:83) 
[  835.744686] netlink_unicast (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1340 /home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1365) 
[  835.744912] ? netlink_attachskb (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1350) 
[  835.745156] ? __sanitizer_cov_trace_switch (/home/petr/src/linux_mlxsw/kernel/kcov.c:340 (discriminator 1)) 
[  835.745482] ? __check_object_size (/home/petr/src/linux_mlxsw/mm/usercopy.c:113 /home/petr/src/linux_mlxsw/mm/usercopy.c:145 /home/petr/src/linux_mlxsw/mm/usercopy.c:254 /home/petr/src/linux_mlxsw/mm/usercopy.c:213) 
[  835.745736] netlink_sendmsg (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1911) 
[  835.745967] ? netlink_unicast (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1830) 
[  835.746204] ? netlink_unicast (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1830) 
[  835.746481] ____sys_sendmsg (/home/petr/src/linux_mlxsw/net/socket.c:728 (discriminator 1) /home/petr/src/linux_mlxsw/net/socket.c:748 (discriminator 1) /home/petr/src/linux_mlxsw/net/socket.c:2494 (discriminator 1)) 
[  835.746705] ? copy_msghdr_from_user (/home/petr/src/linux_mlxsw/net/socket.c:2420) 
[  835.746987] ? sock_read_iter (/home/petr/src/linux_mlxsw/net/socket.c:2440) 
[  835.747225] ? __lock_acquire (/home/petr/src/linux_mlxsw/./arch/x86/include/asm/bitops.h:228 /home/petr/src/linux_mlxsw/./arch/x86/include/asm/bitops.h:240 /home/petr/src/linux_mlxsw/./include/asm-generic/bitops/instrumented-non-atomic.h:142 /home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:228 /home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:3788 /home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:3844 /home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:5144) 
[  835.747495] ___sys_sendmsg (/home/petr/src/linux_mlxsw/net/socket.c:2550) 
[  835.747718] ? do_recvmmsg (/home/petr/src/linux_mlxsw/net/socket.c:2537) 
[  835.747958] ? local_clock_noinstr (/home/petr/src/linux_mlxsw/kernel/sched/clock.c:301 (discriminator 1)) 
[  835.748235] ? __fget_light (/home/petr/src/linux_mlxsw/fs/file.c:1027) 
[  835.748523] __sys_sendmsg (/home/petr/src/linux_mlxsw/net/socket.c:2579) 
[  835.748756] ? __sys_sendmsg_sock (/home/petr/src/linux_mlxsw/net/socket.c:2565) 
[  835.749004] ? __up_read (/home/petr/src/linux_mlxsw/./arch/x86/include/asm/preempt.h:104 (discriminator 1) /home/petr/src/linux_mlxsw/kernel/locking/rwsem.c:1354 (discriminator 1)) 
[  835.749229] ? syscall_enter_from_user_mode (/home/petr/src/linux_mlxsw/./arch/x86/include/asm/irqflags.h:42 /home/petr/src/linux_mlxsw/./arch/x86/include/asm/irqflags.h:77 /home/petr/src/linux_mlxsw/kernel/entry/common.c:111) 
[  835.749531] do_syscall_64 (/home/petr/src/linux_mlxsw/arch/x86/entry/common.c:50 (discriminator 1) /home/petr/src/linux_mlxsw/arch/x86/entry/common.c:80 (discriminator 1)) 
[  835.749755] entry_SYSCALL_64_after_hwframe (/home/petr/src/linux_mlxsw/arch/x86/entry/entry_64.S:120) 
[  835.750059] RIP: 0033:0x7f4a861c38b4
[ 835.750279] Code: 15 59 f5 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b5 0f 1f 00 f3 0f 1e fa 80 3d 2d 7d 0c 00 00 74 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 4c c3 0f 1f 00 55 48 89 e5 48 83 ec 20 89 55
All code
========
   0:	15 59 f5 0b 00       	adc    $0xbf559,%eax
   5:	f7 d8                	neg    %eax
   7:	64 89 02             	mov    %eax,%fs:(%rdx)
   a:	48 c7 c0 ff ff ff ff 	mov    $0xffffffffffffffff,%rax
  11:	eb b5                	jmp    0xffffffffffffffc8
  13:	0f 1f 00             	nopl   (%rax)
  16:	f3 0f 1e fa          	endbr64
  1a:	80 3d 2d 7d 0c 00 00 	cmpb   $0x0,0xc7d2d(%rip)        # 0xc7d4e
  21:	74 13                	je     0x36
  23:	b8 2e 00 00 00       	mov    $0x2e,%eax
  28:	0f 05                	syscall
  2a:*	48 3d 00 f0 ff ff    	cmp    $0xfffffffffffff000,%rax		<-- trapping instruction
  30:	77 4c                	ja     0x7e
  32:	c3                   	ret
  33:	0f 1f 00             	nopl   (%rax)
  36:	55                   	push   %rbp
  37:	48 89 e5             	mov    %rsp,%rbp
  3a:	48 83 ec 20          	sub    $0x20,%rsp
  3e:	89                   	.byte 0x89
  3f:	55                   	push   %rbp

Code starting with the faulting instruction
===========================================
   0:	48 3d 00 f0 ff ff    	cmp    $0xfffffffffffff000,%rax
   6:	77 4c                	ja     0x54
   8:	c3                   	ret
   9:	0f 1f 00             	nopl   (%rax)
   c:	55                   	push   %rbp
   d:	48 89 e5             	mov    %rsp,%rbp
  10:	48 83 ec 20          	sub    $0x20,%rsp
  14:	89                   	.byte 0x89
  15:	55                   	push   %rbp
[  835.751300] RSP: 002b:00007fff3b43db58 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
[  835.751739] RAX: ffffffffffffffda RBX: 000055dc998edf80 RCX: 00007f4a861c38b4
[  835.752143] RDX: 0000000000000000 RSI: 00007fff3b43dbd0 RDI: 0000000000000003
[  835.752553] RBP: 00007fff3b43dc40 R08: 0000000064bab53c R09: 0000000000000001
[  835.752962] R10: 0000000000000001 R11: 0000000000000202 R12: 00007fff3b43dcc0
[  835.753367] R13: 0000000064bab53d R14: 000055dc998edf80 R15: 0000000000000000
[  835.753784]  </TASK>
[  835.753923]
[  835.754017] Allocated by task 165:
[  835.754220] kasan_save_stack (/home/petr/src/linux_mlxsw/mm/kasan/common.c:46) 
[  835.754466] kasan_set_track (/home/petr/src/linux_mlxsw/mm/kasan/common.c:52 (discriminator 1)) 
[  835.754705] __kasan_kmalloc (/home/petr/src/linux_mlxsw/mm/kasan/common.c:374 /home/petr/src/linux_mlxsw/mm/kasan/common.c:383) 
[  835.754937] ingress_init (/home/petr/src/linux_mlxsw/./include/linux/slab.h:582 /home/petr/src/linux_mlxsw/./include/linux/slab.h:703 /home/petr/src/linux_mlxsw/./include/net/tcx.h:85 /home/petr/src/linux_mlxsw/./include/net/tcx.h:106 /home/petr/src/linux_mlxsw/./include/net/tcx.h:100 /home/petr/src/linux_mlxsw/net/sched/sch_ingress.c:91) sch_ingress
[  835.755240] qdisc_create (/home/petr/src/linux_mlxsw/net/sched/sch_api.c:1327) 
[  835.755481] tc_modify_qdisc (/home/petr/src/linux_mlxsw/net/sched/sch_api.c:1703 (discriminator 1)) 
[  835.755721] rtnetlink_rcv_msg (/home/petr/src/linux_mlxsw/net/core/rtnetlink.c:6423) 
[  835.755964] netlink_rcv_skb (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:2547) 
[  835.756197] netlink_unicast (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1340 /home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1365) 
[  835.756445] netlink_sendmsg (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1911) 
[  835.756675] ____sys_sendmsg (/home/petr/src/linux_mlxsw/net/socket.c:728 (discriminator 1) /home/petr/src/linux_mlxsw/net/socket.c:748 (discriminator 1) /home/petr/src/linux_mlxsw/net/socket.c:2494 (discriminator 1)) 
[  835.756906] ___sys_sendmsg (/home/petr/src/linux_mlxsw/net/socket.c:2550) 
[  835.757133] __sys_sendmsg (/home/petr/src/linux_mlxsw/net/socket.c:2579) 
[  835.757360] do_syscall_64 (/home/petr/src/linux_mlxsw/arch/x86/entry/common.c:50 (discriminator 1) /home/petr/src/linux_mlxsw/arch/x86/entry/common.c:80 (discriminator 1)) 
[  835.757574] entry_SYSCALL_64_after_hwframe (/home/petr/src/linux_mlxsw/arch/x86/entry/entry_64.S:120) 
[  835.757866]
[  835.757964] Last potentially related work creation:
[  835.758236] kasan_save_stack (/home/petr/src/linux_mlxsw/mm/kasan/common.c:46) 
[  835.758473] __kasan_record_aux_stack (/home/petr/src/linux_mlxsw/mm/kasan/generic.c:492 (discriminator 1)) 
[  835.758752] kvfree_call_rcu (/home/petr/src/linux_mlxsw/./arch/x86/include/asm/irqflags.h:26 /home/petr/src/linux_mlxsw/./arch/x86/include/asm/irqflags.h:67 /home/petr/src/linux_mlxsw/./arch/x86/include/asm/irqflags.h:103 /home/petr/src/linux_mlxsw/kernel/rcu/tree.c:2883 /home/petr/src/linux_mlxsw/kernel/rcu/tree.c:3284 /home/petr/src/linux_mlxsw/kernel/rcu/tree.c:3369) 
[  835.758994] ingress_destroy (/home/petr/src/linux_mlxsw/net/sched/sch_ingress.c:131) sch_ingress
[  835.759321] __qdisc_destroy (/home/petr/src/linux_mlxsw/net/sched/sch_generic.c:1065) 
[  835.759551] qdisc_destroy (/home/petr/src/linux_mlxsw/net/sched/sch_generic.c:1079) 
[  835.759769] qdisc_graft (/home/petr/src/linux_mlxsw/net/sched/sch_api.c:1134) 
[  835.759994] tc_get_qdisc (/home/petr/src/linux_mlxsw/net/sched/sch_api.c:1541) 
[  835.760219] rtnetlink_rcv_msg (/home/petr/src/linux_mlxsw/net/core/rtnetlink.c:6423) 
[  835.760477] netlink_rcv_skb (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:2547) 
[  835.760710] netlink_unicast (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1340 /home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1365) 
[  835.760941] netlink_sendmsg (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1911) 
[  835.761170] ____sys_sendmsg (/home/petr/src/linux_mlxsw/net/socket.c:728 (discriminator 1) /home/petr/src/linux_mlxsw/net/socket.c:748 (discriminator 1) /home/petr/src/linux_mlxsw/net/socket.c:2494 (discriminator 1)) 
[  835.761423] ___sys_sendmsg (/home/petr/src/linux_mlxsw/net/socket.c:2550) 
[  835.761646] __sys_sendmsg (/home/petr/src/linux_mlxsw/net/socket.c:2579) 
[  835.761864] do_syscall_64 (/home/petr/src/linux_mlxsw/arch/x86/entry/common.c:50 (discriminator 1) /home/petr/src/linux_mlxsw/arch/x86/entry/common.c:80 (discriminator 1)) 
[  835.762072] entry_SYSCALL_64_after_hwframe (/home/petr/src/linux_mlxsw/arch/x86/entry/entry_64.S:120) 
[  835.762398]
[  835.762490] Second to last potentially related work creation:
[  835.762802] kasan_save_stack (/home/petr/src/linux_mlxsw/mm/kasan/common.c:46) 
[  835.763067] __kasan_record_aux_stack (/home/petr/src/linux_mlxsw/mm/kasan/generic.c:492 (discriminator 1)) 
[  835.763340] __call_rcu_common.constprop.0 (/home/petr/src/linux_mlxsw/./arch/x86/include/asm/irqflags.h:26 /home/petr/src/linux_mlxsw/./arch/x86/include/asm/irqflags.h:67 /home/petr/src/linux_mlxsw/./arch/x86/include/asm/irqflags.h:103 /home/petr/src/linux_mlxsw/kernel/rcu/tree.c:2650) 
[  835.763631] netlink_release (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:829) 
[  835.763865] __sock_release (/home/petr/src/linux_mlxsw/net/socket.c:655) 
[  835.764085] sock_close (/home/petr/src/linux_mlxsw/net/socket.c:1388) 
[  835.764282] __fput (/home/petr/src/linux_mlxsw/fs/file_table.c:385) 
[  835.764493] task_work_run (/home/petr/src/linux_mlxsw/kernel/task_work.c:181) 
[  835.764715] do_exit (/home/petr/src/linux_mlxsw/kernel/exit.c:875) 
[  835.764915] do_group_exit (/home/petr/src/linux_mlxsw/kernel/exit.c:1005) 
[  835.765132] __x64_sys_exit_group (/home/petr/src/linux_mlxsw/kernel/exit.c:1033) 
[  835.765382] do_syscall_64 (/home/petr/src/linux_mlxsw/arch/x86/entry/common.c:50 (discriminator 1) /home/petr/src/linux_mlxsw/arch/x86/entry/common.c:80 (discriminator 1)) 
[  835.765596] entry_SYSCALL_64_after_hwframe (/home/petr/src/linux_mlxsw/arch/x86/entry/entry_64.S:120) 
[  835.765891]
[  835.765988] The buggy address belongs to the object at ffff888008a7a000
[  835.765988]  which belongs to the cache kmalloc-2k of size 2048
[  835.766668] The buggy address is located 520 bytes inside of
[  835.766668]  freed 2048-byte region [ffff888008a7a000, ffff888008a7a800)
[  835.767340]
[  835.767438] The buggy address belongs to the physical page:
[  835.767750] page:ffffea0000229e00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888008a7a000 pfn:0x8a78
[  835.768391] head:ffffea0000229e00 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[  835.768837] flags: 0x100000000010200(slab|head|node=0|zone=1)
[  835.769170] page_type: 0xffffffff()
[  835.769385] raw: 0100000000010200 ffff888006842340 ffffea0000241a10 ffffea000022a010
[  835.769861] raw: ffff888008a7a000 0000000000050001 00000001ffffffff 0000000000000000
[  835.770622] page dumped because: kasan: bad access detected
[  835.771278]
[  835.771540] Memory state around the buggy address:
[  835.772029]  ffff888008a7a100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  835.772980]  ffff888008a7a180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  835.773813] >ffff888008a7a200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  835.774506]                       ^
[  835.774844]  ffff888008a7a280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  835.775524]  ffff888008a7a300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  835.776185] ==================================================================
[  835.780645] general protection fault, probably for non-canonical address 0xed6d696d6d6d6e32: 0000 [#1] PREEMPT SMP KASAN
[  835.781337] KASAN: maybe wild-memory-access in range [0x6b6b6b6b6b6b7190-0x6b6b6b6b6b6b7197]
[  835.782241] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
[  835.782741] RIP: 0010:ingress_init (/home/petr/src/linux_mlxsw/./include/net/tcx.h:136 (discriminator 1) /home/petr/src/linux_mlxsw/net/sched/sch_ingress.c:94 (discriminator 1)) sch_ingress
[ 835.783089] Code: 03 80 3c 02 00 0f 85 91 04 00 00 4c 8b ad 00 02 00 00 48 b8 00 00 00 00 00 fc ff df 49 8d bd 28 06 00 00 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 06 0f 8e 75 03 00 00 41 c6 85 28 06 00 00 01
All code
========
   0:	03 80 3c 02 00 0f    	add    0xf00023c(%rax),%eax
   6:	85 91 04 00 00 4c    	test   %edx,0x4c000004(%rcx)
   c:	8b ad 00 02 00 00    	mov    0x200(%rbp),%ebp
  12:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
  19:	fc ff df 
  1c:	49 8d bd 28 06 00 00 	lea    0x628(%r13),%rdi
  23:	48 89 fa             	mov    %rdi,%rdx
  26:	48 c1 ea 03          	shr    $0x3,%rdx
  2a:*	0f b6 04 02          	movzbl (%rdx,%rax,1),%eax		<-- trapping instruction
  2e:	84 c0                	test   %al,%al
  30:	74 06                	je     0x38
  32:	0f 8e 75 03 00 00    	jle    0x3ad
  38:	41 c6 85 28 06 00 00 	movb   $0x1,0x628(%r13)
  3f:	01 

Code starting with the faulting instruction
===========================================
   0:	0f b6 04 02          	movzbl (%rdx,%rax,1),%eax
   4:	84 c0                	test   %al,%al
   6:	74 06                	je     0xe
   8:	0f 8e 75 03 00 00    	jle    0x383
   e:	41 c6 85 28 06 00 00 	movb   $0x1,0x628(%r13)
  15:	01 
[  835.784122] RSP: 0018:ffffc90000d17400 EFLAGS: 00010202
[  835.784429] RAX: dffffc0000000000 RBX: ffff88800c841000 RCX: 0000000000000001
[  835.784824] RDX: 0d6d6d6d6d6d6e32 RSI: ffffffff81c2398e RDI: 6b6b6b6b6b6b7193
[  835.785218] RBP: ffff888008a7a008 R08: 0000000000000007 R09: 0000000000000000
[  835.785620] R10: 0000000000000000 R11: 0000000000000001 R12: ffffc90000d17818
[  835.786017] R13: 6b6b6b6b6b6b6b6b R14: 0000000000000000 R15: ffff88800d731000
[  835.786437] FS:  00007f4a85e89740(0000) GS:ffff888036000000(0000) knlGS:0000000000000000
[  835.786907] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  835.787245] CR2: 000055dc998c5dc0 CR3: 000000000bbc8005 CR4: 0000000000370ef0
[  835.787664] Call Trace:
[  835.787818]  <TASK>
[  835.787958] ? die_addr (/home/petr/src/linux_mlxsw/arch/x86/kernel/dumpstack.c:421 /home/petr/src/linux_mlxsw/arch/x86/kernel/dumpstack.c:460) 
[  835.788173] ? exc_general_protection (/home/petr/src/linux_mlxsw/arch/x86/kernel/traps.c:786 /home/petr/src/linux_mlxsw/arch/x86/kernel/traps.c:728) 
[  835.788468] ? asm_exc_general_protection (/home/petr/src/linux_mlxsw/./arch/x86/include/asm/idtentry.h:564) 
[  835.788760] ? end_report (/home/petr/src/linux_mlxsw/./arch/x86/include/asm/current.h:41 (discriminator 1) /home/petr/src/linux_mlxsw/mm/kasan/report.c:239 (discriminator 1)) 
[  835.788984] ? ingress_init (/home/petr/src/linux_mlxsw/./include/net/tcx.h:136 (discriminator 1) /home/petr/src/linux_mlxsw/net/sched/sch_ingress.c:94 (discriminator 1)) sch_ingress
[  835.789431] ? ingress_dump (/home/petr/src/linux_mlxsw/net/sched/sch_ingress.c:79) sch_ingress
[  835.789870] qdisc_create (/home/petr/src/linux_mlxsw/net/sched/sch_api.c:1327) 
[  835.790234] ? tc_get_qdisc (/home/petr/src/linux_mlxsw/net/sched/sch_api.c:1228) 
[  835.790636] ? lock_is_held_type (/home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:467 (discriminator 4) /home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:5833 (discriminator 4)) 
[  835.791033] tc_modify_qdisc (/home/petr/src/linux_mlxsw/net/sched/sch_api.c:1703 (discriminator 1)) 
[  835.791530] ? qdisc_create (/home/petr/src/linux_mlxsw/net/sched/sch_api.c:1556) 
[  835.792092] ? rtnetlink_rcv_msg (/home/petr/src/linux_mlxsw/net/core/rtnetlink.c:6421) 
[  835.792543] ? cap_capable (/home/petr/src/linux_mlxsw/security/commoncap.c:102) 
[  835.792906] ? lock_is_held_type (/home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:467 (discriminator 4) /home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:5833 (discriminator 4)) 
[  835.793269] ? qdisc_create (/home/petr/src/linux_mlxsw/net/sched/sch_api.c:1556) 
[  835.793723] rtnetlink_rcv_msg (/home/petr/src/linux_mlxsw/net/core/rtnetlink.c:6423) 
[  835.794040] ? rtnl_dump_ifinfo (/home/petr/src/linux_mlxsw/net/core/rtnetlink.c:6319) 
[  835.794412] ? lockdep_hardirqs_on_prepare (/home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:5000) 
[  835.794892] ? lockdep_hardirqs_on_prepare (/home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:5000) 
[  835.795404] ? find_held_lock (/home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:5251 (discriminator 1)) 
[  835.795774] netlink_rcv_skb (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:2547) 
[  835.796011] ? rtnl_dump_ifinfo (/home/petr/src/linux_mlxsw/net/core/rtnetlink.c:6319) 
[  835.796272] ? netlink_ack (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:2523) 
[  835.796517] ? lock_sync (/home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:5729) 
[  835.796758] ? netlink_deliver_tap (/home/petr/src/linux_mlxsw/./include/linux/rcupdate.h:308 /home/petr/src/linux_mlxsw/./include/linux/rcupdate.h:782 /home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:340) 
[  835.797034] ? is_vmalloc_addr (/home/petr/src/linux_mlxsw/mm/vmalloc.c:83) 
[  835.797286] netlink_unicast (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1340 /home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1365) 
[  835.797547] ? netlink_attachskb (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1350) 
[  835.797809] ? __sanitizer_cov_trace_switch (/home/petr/src/linux_mlxsw/kernel/kcov.c:340 (discriminator 1)) 
[  835.798141] ? __check_object_size (/home/petr/src/linux_mlxsw/mm/usercopy.c:113 /home/petr/src/linux_mlxsw/mm/usercopy.c:145 /home/petr/src/linux_mlxsw/mm/usercopy.c:254 /home/petr/src/linux_mlxsw/mm/usercopy.c:213) 
[  835.798429] netlink_sendmsg (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1911) 
[  835.798670] ? netlink_unicast (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1830) 
[  835.798927] ? netlink_unicast (/home/petr/src/linux_mlxsw/net/netlink/af_netlink.c:1830) 
[  835.799186] ____sys_sendmsg (/home/petr/src/linux_mlxsw/net/socket.c:728 (discriminator 1) /home/petr/src/linux_mlxsw/net/socket.c:748 (discriminator 1) /home/petr/src/linux_mlxsw/net/socket.c:2494 (discriminator 1)) 
[  835.799440] ? copy_msghdr_from_user (/home/petr/src/linux_mlxsw/net/socket.c:2420) 
[  835.799723] ? sock_read_iter (/home/petr/src/linux_mlxsw/net/socket.c:2440) 
[  835.799968] ? __lock_acquire (/home/petr/src/linux_mlxsw/./arch/x86/include/asm/bitops.h:228 /home/petr/src/linux_mlxsw/./arch/x86/include/asm/bitops.h:240 /home/petr/src/linux_mlxsw/./include/asm-generic/bitops/instrumented-non-atomic.h:142 /home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:228 /home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:3788 /home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:3844 /home/petr/src/linux_mlxsw/kernel/locking/lockdep.c:5144) 
[  835.800227] ___sys_sendmsg (/home/petr/src/linux_mlxsw/net/socket.c:2550) 
[  835.800480] ? do_recvmmsg (/home/petr/src/linux_mlxsw/net/socket.c:2537) 
[  835.800723] ? local_clock_noinstr (/home/petr/src/linux_mlxsw/kernel/sched/clock.c:301 (discriminator 1)) 
[  835.800971] ? __fget_light (/home/petr/src/linux_mlxsw/fs/file.c:1027) 
[  835.801222] __sys_sendmsg (/home/petr/src/linux_mlxsw/net/socket.c:2579) 
[  835.801460] ? __sys_sendmsg_sock (/home/petr/src/linux_mlxsw/net/socket.c:2565) 
[  835.801708] ? __up_read (/home/petr/src/linux_mlxsw/./arch/x86/include/asm/preempt.h:104 (discriminator 1) /home/petr/src/linux_mlxsw/kernel/locking/rwsem.c:1354 (discriminator 1)) 
[  835.801933] ? syscall_enter_from_user_mode (/home/petr/src/linux_mlxsw/./arch/x86/include/asm/irqflags.h:42 /home/petr/src/linux_mlxsw/./arch/x86/include/asm/irqflags.h:77 /home/petr/src/linux_mlxsw/kernel/entry/common.c:111) 
[  835.802228] do_syscall_64 (/home/petr/src/linux_mlxsw/arch/x86/entry/common.c:50 (discriminator 1) /home/petr/src/linux_mlxsw/arch/x86/entry/common.c:80 (discriminator 1)) 
[  835.802455] entry_SYSCALL_64_after_hwframe (/home/petr/src/linux_mlxsw/arch/x86/entry/entry_64.S:120) 
[  835.802758] RIP: 0033:0x7f4a861c38b4
[ 835.802983] Code: 15 59 f5 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b5 0f 1f 00 f3 0f 1e fa 80 3d 2d 7d 0c 00 00 74 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 4c c3 0f 1f 00 55 48 89 e5 48 83 ec 20 89 55
All code
========
   0:	15 59 f5 0b 00       	adc    $0xbf559,%eax
   5:	f7 d8                	neg    %eax
   7:	64 89 02             	mov    %eax,%fs:(%rdx)
   a:	48 c7 c0 ff ff ff ff 	mov    $0xffffffffffffffff,%rax
  11:	eb b5                	jmp    0xffffffffffffffc8
  13:	0f 1f 00             	nopl   (%rax)
  16:	f3 0f 1e fa          	endbr64
  1a:	80 3d 2d 7d 0c 00 00 	cmpb   $0x0,0xc7d2d(%rip)        # 0xc7d4e
  21:	74 13                	je     0x36
  23:	b8 2e 00 00 00       	mov    $0x2e,%eax
  28:	0f 05                	syscall
  2a:*	48 3d 00 f0 ff ff    	cmp    $0xfffffffffffff000,%rax		<-- trapping instruction
  30:	77 4c                	ja     0x7e
  32:	c3                   	ret
  33:	0f 1f 00             	nopl   (%rax)
  36:	55                   	push   %rbp
  37:	48 89 e5             	mov    %rsp,%rbp
  3a:	48 83 ec 20          	sub    $0x20,%rsp
  3e:	89                   	.byte 0x89
  3f:	55                   	push   %rbp

Code starting with the faulting instruction
===========================================
   0:	48 3d 00 f0 ff ff    	cmp    $0xfffffffffffff000,%rax
   6:	77 4c                	ja     0x54
   8:	c3                   	ret
   9:	0f 1f 00             	nopl   (%rax)
   c:	55                   	push   %rbp
   d:	48 89 e5             	mov    %rsp,%rbp
  10:	48 83 ec 20          	sub    $0x20,%rsp
  14:	89                   	.byte 0x89
  15:	55                   	push   %rbp
[  835.803998] RSP: 002b:00007fff3b43db58 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
[  835.804428] RAX: ffffffffffffffda RBX: 000055dc998edf80 RCX: 00007f4a861c38b4
[  835.804824] RDX: 0000000000000000 RSI: 00007fff3b43dbd0 RDI: 0000000000000003
[  835.805222] RBP: 00007fff3b43dc40 R08: 0000000064bab53c R09: 0000000000000001
[  835.805622] R10: 0000000000000001 R11: 0000000000000202 R12: 00007fff3b43dcc0
[  835.806035] R13: 0000000064bab53d R14: 000055dc998edf80 R15: 0000000000000000
[  835.806466]  </TASK>
[  835.806606] Modules linked in: sch_ingress veth
[  835.807662] ---[ end trace 0000000000000000 ]---
[  835.808497] RIP: 0010:ingress_init (/home/petr/src/linux_mlxsw/./include/net/tcx.h:136 (discriminator 1) /home/petr/src/linux_mlxsw/net/sched/sch_ingress.c:94 (discriminator 1)) sch_ingress
[ 835.812394] Code: 03 80 3c 02 00 0f 85 91 04 00 00 4c 8b ad 00 02 00 00 48 b8 00 00 00 00 00 fc ff df 49 8d bd 28 06 00 00 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 06 0f 8e 75 03 00 00 41 c6 85 28 06 00 00 01
All code
========
   0:	03 80 3c 02 00 0f    	add    0xf00023c(%rax),%eax
   6:	85 91 04 00 00 4c    	test   %edx,0x4c000004(%rcx)
   c:	8b ad 00 02 00 00    	mov    0x200(%rbp),%ebp
  12:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
  19:	fc ff df 
  1c:	49 8d bd 28 06 00 00 	lea    0x628(%r13),%rdi
  23:	48 89 fa             	mov    %rdi,%rdx
  26:	48 c1 ea 03          	shr    $0x3,%rdx
  2a:*	0f b6 04 02          	movzbl (%rdx,%rax,1),%eax		<-- trapping instruction
  2e:	84 c0                	test   %al,%al
  30:	74 06                	je     0x38
  32:	0f 8e 75 03 00 00    	jle    0x3ad
  38:	41 c6 85 28 06 00 00 	movb   $0x1,0x628(%r13)
  3f:	01 

Code starting with the faulting instruction
===========================================
   0:	0f b6 04 02          	movzbl (%rdx,%rax,1),%eax
   4:	84 c0                	test   %al,%al
   6:	74 06                	je     0xe
   8:	0f 8e 75 03 00 00    	jle    0x383
   e:	41 c6 85 28 06 00 00 	movb   $0x1,0x628(%r13)
  15:	01 
[  835.814250] RSP: 0018:ffffc90000d17400 EFLAGS: 00010202
[  835.814569] RAX: dffffc0000000000 RBX: ffff88800c841000 RCX: 0000000000000001
[  835.815017] RDX: 0d6d6d6d6d6d6e32 RSI: ffffffff81c2398e RDI: 6b6b6b6b6b6b7193
[  835.815451] RBP: ffff888008a7a008 R08: 0000000000000007 R09: 0000000000000000
[  835.815857] R10: 0000000000000000 R11: 0000000000000001 R12: ffffc90000d17818
[  835.816270] R13: 6b6b6b6b6b6b6b6b R14: 0000000000000000 R15: ffff88800d731000
[  835.816683] FS:  00007f4a85e89740(0000) GS:ffff888036000000(0000) knlGS:0000000000000000
[  835.817133] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  835.820478] CR2: 000055dc998c5dc0 CR3: 000000000bbc8005 CR4: 0000000000370ef0
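
A note on reading the general protection fault above, as a minimal sketch rather than an authoritative analysis. The trapping instruction is the KASAN shadow-memory check that precedes the store `movb $0x1,0x628(%r13)`, which decode attributes to include/net/tcx.h:136 (the `miniq_active` assignment in `tcx_miniq_set_active()`). R13 holds 0x6b6b6b6b6b6b6b6b, which is consistent with the slab free-poison byte 0x6b having been read back from freed memory. The load `0x200(%rbp)` (bytes `4c 8b ad 00 02 00 00`, shown misaligned in the decode) reads from 0xffff888008a7a208, which is 520 bytes into the freed kmalloc-2k object reported earlier. The constants below are read off the splat, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Values read off the splat above, not taken from kernel headers. */
#define SHADOW_OFFSET 0xdffffc0000000000ULL /* movabs immediate loaded into RAX */
#define SHADOW_SHIFT  3                     /* shr $0x3,%rdx */

/* Generic KASAN: one shadow byte covers 8 bytes of memory. */
uint64_t kasan_shadow_addr(uint64_t addr)
{
	return (addr >> SHADOW_SHIFT) + SHADOW_OFFSET;
}

/* Byte offset of addr inside the object that starts at base. */
uint64_t offset_in_object(uint64_t addr, uint64_t base)
{
	return addr - base;
}

/* Hypothetical helper: replays the register chain from the splat. */
void replay_splat(void)
{
	uint64_t object = 0xffff888008a7a000ULL; /* freed kmalloc-2k object */
	uint64_t rbp    = 0xffff888008a7a008ULL; /* RBP in the register dump */
	uint64_t r13    = 0x6b6b6b6b6b6b6b6bULL; /* value loaded from 0x200(%rbp) */
	uint64_t rdi    = r13 + 0x628;           /* lea 0x628(%r13),%rdi */
	uint64_t fault  = kasan_shadow_addr(rdi);

	/* Each step must match what the log printed. */
	assert(offset_in_object(rbp + 0x200, object) == 520);
	assert(rdi == 0x6b6b6b6b6b6b7193ULL);            /* RDI in the dump */
	assert((rdi >> SHADOW_SHIFT) == 0x0d6d6d6d6d6d6e32ULL); /* RDX in the dump */
	assert(fault == 0xed6d696d6d6d6e32ULL);          /* reported GPF address */

	printf("load offset in freed object: %llu\n",
	       (unsigned long long)offset_in_object(rbp + 0x200, object));
	printf("store target (RDI):          0x%016llx\n", (unsigned long long)rdi);
	printf("shadow address (faulting):   0x%016llx\n", (unsigned long long)fault);
}
```

Calling `replay_splat()` from a small `main()` reproduces the chain R13 -> RDI -> RDX -> faulting address. The resulting shadow address is non-canonical, which is why the check itself raises a general protection fault instead of producing a regular KASAN report.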
Daniel Borkmann July 21, 2023, 11:43 p.m. UTC | #4
On 7/21/23 4:57 PM, Petr Machata wrote:
> As of this patch (commit e420bed02507), TC qdisc installation and/or
> removal cause memory access issues in the system.
> 
> A semi-minimal reproducer is:
> 
>      bash-5.2# ip l a name v1 type veth peer name v2
>      bash-5.2# ip l s dev v1 up
>      bash-5.2# ip l s dev v2 up
>      bash-5.2# tc q a dev v1 ingress
>      bash-5.2# tc q d dev v1 ingress
>      bash-5.2# tc q a dev v1 ingress
>      bash-5.2# tc q d dev v1 ingress
> 
> It's a bit finicky, but only a little. For me, the first two "tc q"
> operations never triggered a splat. Then it could take a few "tc q a"
> "tc q d" iterations to get it to splat. So it looks like maybe the first
> "tc q d" is the problematic bit? And then there's some likelihood of
> failing on any following "tc q" operation. The above in particular
> produced three warning splats for me (attached as decoded.txt,
> decoded2.txt and decoded3.txt). Probing further:
> 
>      bash-5.2# tc q a dev v1 ingress
> 
> Produced two more splats from KASAN (decoded4.txt and decoded5.txt),
> which look more serious.
> 
> Further attempts to prod the system deadlock it, I guess because RTNL
> was left locked.
> 
> Reverting e420bed02507, and fe20ce3a5126 + 55cc3768473e that fail to
> build without it, makes net-next/main work again.

Sorry about that, fix should be here:
https://lore.kernel.org/netdev/20230721233330.5678-1-daniel@iogearbox.net/

Thanks,
Daniel
Patch

diff --git a/MAINTAINERS b/MAINTAINERS
index 678bef9f60b4..990e3fce753c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3778,13 +3778,15 @@  L:	netdev@vger.kernel.org
 S:	Maintained
 F:	kernel/bpf/bpf_struct*
 
-BPF [NETWORKING] (tc BPF, sock_addr)
+BPF [NETWORKING] (tcx & tc BPF, sock_addr)
 M:	Martin KaFai Lau <martin.lau@linux.dev>
 M:	Daniel Borkmann <daniel@iogearbox.net>
 R:	John Fastabend <john.fastabend@gmail.com>
 L:	bpf@vger.kernel.org
 L:	netdev@vger.kernel.org
 S:	Maintained
+F:	include/net/tcx.h
+F:	kernel/bpf/tcx.c
 F:	net/core/filter.c
 F:	net/sched/act_bpf.c
 F:	net/sched/cls_bpf.c
diff --git a/include/linux/bpf_mprog.h b/include/linux/bpf_mprog.h
index 6feefec43422..2b429488f840 100644
--- a/include/linux/bpf_mprog.h
+++ b/include/linux/bpf_mprog.h
@@ -315,4 +315,13 @@  int bpf_mprog_detach(struct bpf_mprog_entry *entry,
 int bpf_mprog_query(const union bpf_attr *attr, union bpf_attr __user *uattr,
 		    struct bpf_mprog_entry *entry);
 
+static inline bool bpf_mprog_supported(enum bpf_prog_type type)
+{
+	switch (type) {
+	case BPF_PROG_TYPE_SCHED_CLS:
+		return true;
+	default:
+		return false;
+	}
+}
 #endif /* __BPF_MPROG_H */
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b828c7a75be2..024314c68bc8 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1930,8 +1930,7 @@  enum netdev_ml_priv_type {
  *
  *	@rx_handler:		handler for received packets
  *	@rx_handler_data: 	XXX: need comments on this one
- *	@miniq_ingress:		ingress/clsact qdisc specific data for
- *				ingress processing
+ *	@tcx_ingress:		BPF & clsact qdisc specific data for ingress processing
  *	@ingress_queue:		XXX: need comments on this one
  *	@nf_hooks_ingress:	netfilter hooks executed for ingress packets
  *	@broadcast:		hw bcast address
@@ -1952,8 +1951,7 @@  enum netdev_ml_priv_type {
  *	@xps_maps:		all CPUs/RXQs maps for XPS device
  *
  *	@xps_maps:	XXX: need comments on this one
- *	@miniq_egress:		clsact qdisc specific data for
- *				egress processing
+ *	@tcx_egress:		BPF & clsact qdisc specific data for egress processing
  *	@nf_hooks_egress:	netfilter hooks executed for egress packets
  *	@qdisc_hash:		qdisc hash table
  *	@watchdog_timeo:	Represents the timeout that is used by
@@ -2252,9 +2250,8 @@  struct net_device {
 	unsigned int		gro_ipv4_max_size;
 	rx_handler_func_t __rcu	*rx_handler;
 	void __rcu		*rx_handler_data;
-
-#ifdef CONFIG_NET_CLS_ACT
-	struct mini_Qdisc __rcu	*miniq_ingress;
+#ifdef CONFIG_NET_XGRESS
+	struct bpf_mprog_entry __rcu *tcx_ingress;
 #endif
 	struct netdev_queue __rcu *ingress_queue;
 #ifdef CONFIG_NETFILTER_INGRESS
@@ -2282,8 +2279,8 @@  struct net_device {
 #ifdef CONFIG_XPS
 	struct xps_dev_maps __rcu *xps_maps[XPS_MAPS_MAX];
 #endif
-#ifdef CONFIG_NET_CLS_ACT
-	struct mini_Qdisc __rcu	*miniq_egress;
+#ifdef CONFIG_NET_XGRESS
+	struct bpf_mprog_entry __rcu *tcx_egress;
 #endif
 #ifdef CONFIG_NETFILTER_EGRESS
 	struct nf_hook_entries __rcu *nf_hooks_egress;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 91ed66952580..ed83f1c5fc1f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -944,7 +944,7 @@  struct sk_buff {
 	__u8			__mono_tc_offset[0];
 	/* public: */
 	__u8			mono_delivery_time:1;	/* See SKB_MONO_DELIVERY_TIME_MASK */
-#ifdef CONFIG_NET_CLS_ACT
+#ifdef CONFIG_NET_XGRESS
 	__u8			tc_at_ingress:1;	/* See TC_AT_INGRESS_MASK */
 	__u8			tc_skip_classify:1;
 #endif
@@ -993,7 +993,7 @@  struct sk_buff {
 	__u8			csum_not_inet:1;
 #endif
 
-#ifdef CONFIG_NET_SCHED
+#if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
 	__u16			tc_index;	/* traffic control index */
 #endif
 
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index e92f73bb3198..15be2d96b06d 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -703,7 +703,7 @@  int skb_do_redirect(struct sk_buff *);
 
 static inline bool skb_at_tc_ingress(const struct sk_buff *skb)
 {
-#ifdef CONFIG_NET_CLS_ACT
+#ifdef CONFIG_NET_XGRESS
 	return skb->tc_at_ingress;
 #else
 	return false;
diff --git a/include/net/tcx.h b/include/net/tcx.h
new file mode 100644
index 000000000000..264f147953ba
--- /dev/null
+++ b/include/net/tcx.h
@@ -0,0 +1,206 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2023 Isovalent */
+#ifndef __NET_TCX_H
+#define __NET_TCX_H
+
+#include <linux/bpf.h>
+#include <linux/bpf_mprog.h>
+
+#include <net/sch_generic.h>
+
+struct mini_Qdisc;
+
+struct tcx_entry {
+	struct mini_Qdisc __rcu *miniq;
+	struct bpf_mprog_bundle bundle;
+	bool miniq_active;
+	struct rcu_head rcu;
+};
+
+struct tcx_link {
+	struct bpf_link link;
+	struct net_device *dev;
+	u32 location;
+};
+
+static inline void tcx_set_ingress(struct sk_buff *skb, bool ingress)
+{
+#ifdef CONFIG_NET_XGRESS
+	skb->tc_at_ingress = ingress;
+#endif
+}
+
+#ifdef CONFIG_NET_XGRESS
+static inline struct tcx_entry *tcx_entry(struct bpf_mprog_entry *entry)
+{
+	struct bpf_mprog_bundle *bundle = entry->parent;
+
+	return container_of(bundle, struct tcx_entry, bundle);
+}
+
+static inline struct tcx_link *tcx_link(struct bpf_link *link)
+{
+	return container_of(link, struct tcx_link, link);
+}
+
+static inline const struct tcx_link *tcx_link_const(const struct bpf_link *link)
+{
+	return tcx_link((struct bpf_link *)link);
+}
+
+void tcx_inc(void);
+void tcx_dec(void);
+
+static inline void tcx_entry_sync(void)
+{
+	/* bpf_mprog_entry got a/b swapped, therefore ensure that
+	 * there are no inflight users on the old one anymore.
+	 */
+	synchronize_rcu();
+}
+
+static inline void
+tcx_entry_update(struct net_device *dev, struct bpf_mprog_entry *entry,
+		 bool ingress)
+{
+	ASSERT_RTNL();
+	if (ingress)
+		rcu_assign_pointer(dev->tcx_ingress, entry);
+	else
+		rcu_assign_pointer(dev->tcx_egress, entry);
+}
+
+static inline struct bpf_mprog_entry *
+tcx_entry_fetch(struct net_device *dev, bool ingress)
+{
+	ASSERT_RTNL();
+	if (ingress)
+		return rcu_dereference_rtnl(dev->tcx_ingress);
+	else
+		return rcu_dereference_rtnl(dev->tcx_egress);
+}
+
+static inline struct bpf_mprog_entry *tcx_entry_create(void)
+{
+	struct tcx_entry *tcx = kzalloc(sizeof(*tcx), GFP_KERNEL);
+
+	if (tcx) {
+		bpf_mprog_bundle_init(&tcx->bundle);
+		return &tcx->bundle.a;
+	}
+	return NULL;
+}
+
+static inline void tcx_entry_free(struct bpf_mprog_entry *entry)
+{
+	kfree_rcu(tcx_entry(entry), rcu);
+}
+
+static inline struct bpf_mprog_entry *
+tcx_entry_fetch_or_create(struct net_device *dev, bool ingress, bool *created)
+{
+	struct bpf_mprog_entry *entry = tcx_entry_fetch(dev, ingress);
+
+	*created = false;
+	if (!entry) {
+		entry = tcx_entry_create();
+		if (!entry)
+			return NULL;
+		*created = true;
+	}
+	return entry;
+}
+
+static inline void tcx_skeys_inc(bool ingress)
+{
+	tcx_inc();
+	if (ingress)
+		net_inc_ingress_queue();
+	else
+		net_inc_egress_queue();
+}
+
+static inline void tcx_skeys_dec(bool ingress)
+{
+	if (ingress)
+		net_dec_ingress_queue();
+	else
+		net_dec_egress_queue();
+	tcx_dec();
+}
+
+static inline void tcx_miniq_set_active(struct bpf_mprog_entry *entry,
+					const bool active)
+{
+	ASSERT_RTNL();
+	tcx_entry(entry)->miniq_active = active;
+}
+
+static inline bool tcx_entry_is_active(struct bpf_mprog_entry *entry)
+{
+	ASSERT_RTNL();
+	return bpf_mprog_total(entry) || tcx_entry(entry)->miniq_active;
+}
+
+static inline enum tcx_action_base tcx_action_code(struct sk_buff *skb,
+						   int code)
+{
+	switch (code) {
+	case TCX_PASS:
+		skb->tc_index = qdisc_skb_cb(skb)->tc_classid;
+		fallthrough;
+	case TCX_DROP:
+	case TCX_REDIRECT:
+		return code;
+	case TCX_NEXT:
+	default:
+		return TCX_NEXT;
+	}
+}
+#endif /* CONFIG_NET_XGRESS */
+
+#if defined(CONFIG_NET_XGRESS) && defined(CONFIG_BPF_SYSCALL)
+int tcx_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog);
+int tcx_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
+int tcx_prog_detach(const union bpf_attr *attr, struct bpf_prog *prog);
+void tcx_uninstall(struct net_device *dev, bool ingress);
+
+int tcx_prog_query(const union bpf_attr *attr,
+		   union bpf_attr __user *uattr);
+
+static inline void dev_tcx_uninstall(struct net_device *dev)
+{
+	ASSERT_RTNL();
+	tcx_uninstall(dev, true);
+	tcx_uninstall(dev, false);
+}
+#else
+static inline int tcx_prog_attach(const union bpf_attr *attr,
+				  struct bpf_prog *prog)
+{
+	return -EINVAL;
+}
+
+static inline int tcx_link_attach(const union bpf_attr *attr,
+				  struct bpf_prog *prog)
+{
+	return -EINVAL;
+}
+
+static inline int tcx_prog_detach(const union bpf_attr *attr,
+				  struct bpf_prog *prog)
+{
+	return -EINVAL;
+}
+
+static inline int tcx_prog_query(const union bpf_attr *attr,
+				 union bpf_attr __user *uattr)
+{
+	return -EINVAL;
+}
+
+static inline void dev_tcx_uninstall(struct net_device *dev)
+{
+}
+#endif /* CONFIG_NET_XGRESS && CONFIG_BPF_SYSCALL */
+#endif /* __NET_TCX_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d4c07e435336..739c15906a65 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1036,6 +1036,8 @@  enum bpf_attach_type {
 	BPF_LSM_CGROUP,
 	BPF_STRUCT_OPS,
 	BPF_NETFILTER,
+	BPF_TCX_INGRESS,
+	BPF_TCX_EGRESS,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -1053,7 +1055,7 @@  enum bpf_link_type {
 	BPF_LINK_TYPE_KPROBE_MULTI = 8,
 	BPF_LINK_TYPE_STRUCT_OPS = 9,
 	BPF_LINK_TYPE_NETFILTER = 10,
-
+	BPF_LINK_TYPE_TCX = 11,
 	MAX_BPF_LINK_TYPE,
 };
 
@@ -1569,13 +1571,13 @@  union bpf_attr {
 			__u32		map_fd;		/* struct_ops to attach */
 		};
 		union {
-			__u32		target_fd;	/* object to attach to */
-			__u32		target_ifindex; /* target ifindex */
+			__u32	target_fd;	/* target object to attach to or ... */
+			__u32	target_ifindex; /* target ifindex */
 		};
 		__u32		attach_type;	/* attach type */
 		__u32		flags;		/* extra flags */
 		union {
-			__u32		target_btf_id;	/* btf_id of target to attach to */
+			__u32	target_btf_id;	/* btf_id of target to attach to */
 			struct {
 				__aligned_u64	iter_info;	/* extra bpf_iter_link_info */
 				__u32		iter_info_len;	/* iter_info length */
@@ -1609,6 +1611,13 @@  union bpf_attr {
 				__s32		priority;
 				__u32		flags;
 			} netfilter;
+			struct {
+				union {
+					__u32	relative_fd;
+					__u32	relative_id;
+				};
+				__u64		expected_revision;
+			} tcx;
 		};
 	} link_create;
 
@@ -6217,6 +6226,19 @@  struct bpf_sock_tuple {
 	};
 };
 
+/* (Simplified) user return codes for tcx prog type.
+ * A valid tcx program must return one of these defined values. All other
+ * return codes are reserved for future use. Must remain compatible with
+ * their TC_ACT_* counter-parts. For compatibility in behavior, unknown
+ * return codes are mapped to TCX_NEXT.
+ */
+enum tcx_action_base {
+	TCX_NEXT	= -1,
+	TCX_PASS	= 0,
+	TCX_DROP	= 2,
+	TCX_REDIRECT	= 7,
+};
+
 struct bpf_xdp_sock {
 	__u32 queue_id;
 };
@@ -6499,6 +6521,10 @@  struct bpf_link_info {
 				} event; /* BPF_PERF_EVENT_EVENT */
 			};
 		} perf_event;
+		struct {
+			__u32 ifindex;
+			__u32 attach_type;
+		} tcx;
 	};
 } __attribute__((aligned(8)));
 
diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
index 2dfe1079f772..6a906ff93006 100644
--- a/kernel/bpf/Kconfig
+++ b/kernel/bpf/Kconfig
@@ -31,6 +31,7 @@  config BPF_SYSCALL
 	select TASKS_TRACE_RCU
 	select BINARY_PRINTF
 	select NET_SOCK_MSG if NET
+	select NET_XGRESS if NET
 	select PAGE_POOL if NET
 	default n
 	help
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 1bea2eb912cd..f526b7573e97 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -21,6 +21,7 @@  obj-$(CONFIG_BPF_SYSCALL) += devmap.o
 obj-$(CONFIG_BPF_SYSCALL) += cpumap.o
 obj-$(CONFIG_BPF_SYSCALL) += offload.o
 obj-$(CONFIG_BPF_SYSCALL) += net_namespace.o
+obj-$(CONFIG_BPF_SYSCALL) += tcx.o
 endif
 ifeq ($(CONFIG_PERF_EVENTS),y)
 obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index ee8cb1a174aa..7f4e8c357a6a 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -37,6 +37,8 @@ 
 #include <linux/trace_events.h>
 #include <net/netfilter/nf_bpf_link.h>
 
+#include <net/tcx.h>
+
 #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
 			  (map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \
 			  (map)->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS)
@@ -3740,31 +3742,45 @@  attach_type_to_prog_type(enum bpf_attach_type attach_type)
 		return BPF_PROG_TYPE_XDP;
 	case BPF_LSM_CGROUP:
 		return BPF_PROG_TYPE_LSM;
+	case BPF_TCX_INGRESS:
+	case BPF_TCX_EGRESS:
+		return BPF_PROG_TYPE_SCHED_CLS;
 	default:
 		return BPF_PROG_TYPE_UNSPEC;
 	}
 }
 
-#define BPF_PROG_ATTACH_LAST_FIELD replace_bpf_fd
+#define BPF_PROG_ATTACH_LAST_FIELD expected_revision
+
+#define BPF_F_ATTACH_MASK_BASE	\
+	(BPF_F_ALLOW_OVERRIDE |	\
+	 BPF_F_ALLOW_MULTI |	\
+	 BPF_F_REPLACE)
 
-#define BPF_F_ATTACH_MASK \
-	(BPF_F_ALLOW_OVERRIDE | BPF_F_ALLOW_MULTI | BPF_F_REPLACE)
+#define BPF_F_ATTACH_MASK_MPROG	\
+	(BPF_F_REPLACE |	\
+	 BPF_F_BEFORE |		\
+	 BPF_F_AFTER |		\
+	 BPF_F_ID |		\
+	 BPF_F_LINK)
 
 static int bpf_prog_attach(const union bpf_attr *attr)
 {
 	enum bpf_prog_type ptype;
 	struct bpf_prog *prog;
+	u32 mask;
 	int ret;
 
 	if (CHECK_ATTR(BPF_PROG_ATTACH))
 		return -EINVAL;
 
-	if (attr->attach_flags & ~BPF_F_ATTACH_MASK)
-		return -EINVAL;
-
 	ptype = attach_type_to_prog_type(attr->attach_type);
 	if (ptype == BPF_PROG_TYPE_UNSPEC)
 		return -EINVAL;
+	mask = bpf_mprog_supported(ptype) ?
+	       BPF_F_ATTACH_MASK_MPROG : BPF_F_ATTACH_MASK_BASE;
+	if (attr->attach_flags & ~mask)
+		return -EINVAL;
 
 	prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype);
 	if (IS_ERR(prog))
@@ -3800,6 +3816,9 @@  static int bpf_prog_attach(const union bpf_attr *attr)
 		else
 			ret = cgroup_bpf_prog_attach(attr, ptype, prog);
 		break;
+	case BPF_PROG_TYPE_SCHED_CLS:
+		ret = tcx_prog_attach(attr, prog);
+		break;
 	default:
 		ret = -EINVAL;
 	}
@@ -3809,25 +3828,41 @@  static int bpf_prog_attach(const union bpf_attr *attr)
 	return ret;
 }
 
-#define BPF_PROG_DETACH_LAST_FIELD attach_type
+#define BPF_PROG_DETACH_LAST_FIELD expected_revision
 
 static int bpf_prog_detach(const union bpf_attr *attr)
 {
+	struct bpf_prog *prog = NULL;
 	enum bpf_prog_type ptype;
+	int ret;
 
 	if (CHECK_ATTR(BPF_PROG_DETACH))
 		return -EINVAL;
 
 	ptype = attach_type_to_prog_type(attr->attach_type);
+	if (bpf_mprog_supported(ptype)) {
+		if (ptype == BPF_PROG_TYPE_UNSPEC)
+			return -EINVAL;
+		if (attr->attach_flags & ~BPF_F_ATTACH_MASK_MPROG)
+			return -EINVAL;
+		if (attr->attach_bpf_fd) {
+			prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype);
+			if (IS_ERR(prog))
+				return PTR_ERR(prog);
+		}
+	}
 
 	switch (ptype) {
 	case BPF_PROG_TYPE_SK_MSG:
 	case BPF_PROG_TYPE_SK_SKB:
-		return sock_map_prog_detach(attr, ptype);
+		ret = sock_map_prog_detach(attr, ptype);
+		break;
 	case BPF_PROG_TYPE_LIRC_MODE2:
-		return lirc_prog_detach(attr);
+		ret = lirc_prog_detach(attr);
+		break;
 	case BPF_PROG_TYPE_FLOW_DISSECTOR:
-		return netns_bpf_prog_detach(attr, ptype);
+		ret = netns_bpf_prog_detach(attr, ptype);
+		break;
 	case BPF_PROG_TYPE_CGROUP_DEVICE:
 	case BPF_PROG_TYPE_CGROUP_SKB:
 	case BPF_PROG_TYPE_CGROUP_SOCK:
@@ -3836,13 +3871,21 @@  static int bpf_prog_detach(const union bpf_attr *attr)
 	case BPF_PROG_TYPE_CGROUP_SYSCTL:
 	case BPF_PROG_TYPE_SOCK_OPS:
 	case BPF_PROG_TYPE_LSM:
-		return cgroup_bpf_prog_detach(attr, ptype);
+		ret = cgroup_bpf_prog_detach(attr, ptype);
+		break;
+	case BPF_PROG_TYPE_SCHED_CLS:
+		ret = tcx_prog_detach(attr, prog);
+		break;
 	default:
-		return -EINVAL;
+		ret = -EINVAL;
 	}
+
+	if (prog)
+		bpf_prog_put(prog);
+	return ret;
 }
 
-#define BPF_PROG_QUERY_LAST_FIELD query.prog_attach_flags
+#define BPF_PROG_QUERY_LAST_FIELD query.link_attach_flags
 
 static int bpf_prog_query(const union bpf_attr *attr,
 			  union bpf_attr __user *uattr)
@@ -3890,6 +3933,9 @@  static int bpf_prog_query(const union bpf_attr *attr,
 	case BPF_SK_MSG_VERDICT:
 	case BPF_SK_SKB_VERDICT:
 		return sock_map_bpf_prog_query(attr, uattr);
+	case BPF_TCX_INGRESS:
+	case BPF_TCX_EGRESS:
+		return tcx_prog_query(attr, uattr);
 	default:
 		return -EINVAL;
 	}
@@ -4852,6 +4898,13 @@  static int link_create(union bpf_attr *attr, bpfptr_t uattr)
 			goto out;
 		}
 		break;
+	case BPF_PROG_TYPE_SCHED_CLS:
+		if (attr->link_create.attach_type != BPF_TCX_INGRESS &&
+		    attr->link_create.attach_type != BPF_TCX_EGRESS) {
+			ret = -EINVAL;
+			goto out;
+		}
+		break;
 	default:
 		ptype = attach_type_to_prog_type(attr->link_create.attach_type);
 		if (ptype == BPF_PROG_TYPE_UNSPEC || ptype != prog->type) {
@@ -4903,6 +4956,9 @@  static int link_create(union bpf_attr *attr, bpfptr_t uattr)
 	case BPF_PROG_TYPE_XDP:
 		ret = bpf_xdp_link_attach(attr, prog);
 		break;
+	case BPF_PROG_TYPE_SCHED_CLS:
+		ret = tcx_link_attach(attr, prog);
+		break;
 	case BPF_PROG_TYPE_NETFILTER:
 		ret = bpf_nf_link_attach(attr, prog);
 		break;
diff --git a/kernel/bpf/tcx.c b/kernel/bpf/tcx.c
new file mode 100644
index 000000000000..69a272712b29
--- /dev/null
+++ b/kernel/bpf/tcx.c
@@ -0,0 +1,348 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2023 Isovalent */
+
+#include <linux/bpf.h>
+#include <linux/bpf_mprog.h>
+#include <linux/netdevice.h>
+
+#include <net/tcx.h>
+
+int tcx_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog)
+{
+	bool created, ingress = attr->attach_type == BPF_TCX_INGRESS;
+	struct net *net = current->nsproxy->net_ns;
+	struct bpf_mprog_entry *entry, *entry_new;
+	struct bpf_prog *replace_prog = NULL;
+	struct net_device *dev;
+	int ret;
+
+	rtnl_lock();
+	dev = __dev_get_by_index(net, attr->target_ifindex);
+	if (!dev) {
+		ret = -ENODEV;
+		goto out;
+	}
+	if (attr->attach_flags & BPF_F_REPLACE) {
+		replace_prog = bpf_prog_get_type(attr->replace_bpf_fd,
+						 prog->type);
+		if (IS_ERR(replace_prog)) {
+			ret = PTR_ERR(replace_prog);
+			replace_prog = NULL;
+			goto out;
+		}
+	}
+	entry = tcx_entry_fetch_or_create(dev, ingress, &created);
+	if (!entry) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	ret = bpf_mprog_attach(entry, &entry_new, prog, NULL, replace_prog,
+			       attr->attach_flags, attr->relative_fd,
+			       attr->expected_revision);
+	if (!ret) {
+		if (entry != entry_new) {
+			tcx_entry_update(dev, entry_new, ingress);
+			tcx_entry_sync();
+			tcx_skeys_inc(ingress);
+		}
+		bpf_mprog_commit(entry);
+	} else if (created) {
+		tcx_entry_free(entry);
+	}
+out:
+	if (replace_prog)
+		bpf_prog_put(replace_prog);
+	rtnl_unlock();
+	return ret;
+}
+
+int tcx_prog_detach(const union bpf_attr *attr, struct bpf_prog *prog)
+{
+	bool ingress = attr->attach_type == BPF_TCX_INGRESS;
+	struct net *net = current->nsproxy->net_ns;
+	struct bpf_mprog_entry *entry, *entry_new;
+	struct net_device *dev;
+	int ret;
+
+	rtnl_lock();
+	dev = __dev_get_by_index(net, attr->target_ifindex);
+	if (!dev) {
+		ret = -ENODEV;
+		goto out;
+	}
+	entry = tcx_entry_fetch(dev, ingress);
+	if (!entry) {
+		ret = -ENOENT;
+		goto out;
+	}
+	ret = bpf_mprog_detach(entry, &entry_new, prog, NULL, attr->attach_flags,
+			       attr->relative_fd, attr->expected_revision);
+	if (!ret) {
+		if (!tcx_entry_is_active(entry_new))
+			entry_new = NULL;
+		tcx_entry_update(dev, entry_new, ingress);
+		tcx_entry_sync();
+		tcx_skeys_dec(ingress);
+		bpf_mprog_commit(entry);
+		if (!entry_new)
+			tcx_entry_free(entry);
+	}
+out:
+	rtnl_unlock();
+	return ret;
+}
+
+void tcx_uninstall(struct net_device *dev, bool ingress)
+{
+	struct bpf_tuple tuple = {};
+	struct bpf_mprog_entry *entry;
+	struct bpf_mprog_fp *fp;
+	struct bpf_mprog_cp *cp;
+
+	entry = tcx_entry_fetch(dev, ingress);
+	if (!entry)
+		return;
+	tcx_entry_update(dev, NULL, ingress);
+	tcx_entry_sync();
+	bpf_mprog_foreach_tuple(entry, fp, cp, tuple) {
+		if (tuple.link)
+			tcx_link(tuple.link)->dev = NULL;
+		else
+			bpf_prog_put(tuple.prog);
+		tcx_skeys_dec(ingress);
+	}
+	WARN_ON_ONCE(tcx_entry(entry)->miniq_active);
+	tcx_entry_free(entry);
+}
+
+int tcx_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr)
+{
+	bool ingress = attr->query.attach_type == BPF_TCX_INGRESS;
+	struct net *net = current->nsproxy->net_ns;
+	struct bpf_mprog_entry *entry;
+	struct net_device *dev;
+	int ret;
+
+	rtnl_lock();
+	dev = __dev_get_by_index(net, attr->query.target_ifindex);
+	if (!dev) {
+		ret = -ENODEV;
+		goto out;
+	}
+	entry = tcx_entry_fetch(dev, ingress);
+	if (!entry) {
+		ret = -ENOENT;
+		goto out;
+	}
+	ret = bpf_mprog_query(attr, uattr, entry);
+out:
+	rtnl_unlock();
+	return ret;
+}
+
+static int tcx_link_prog_attach(struct bpf_link *link, u32 flags, u32 id_or_fd,
+				u64 revision)
+{
+	struct tcx_link *tcx = tcx_link(link);
+	bool created, ingress = tcx->location == BPF_TCX_INGRESS;
+	struct bpf_mprog_entry *entry, *entry_new;
+	struct net_device *dev = tcx->dev;
+	int ret;
+
+	ASSERT_RTNL();
+	entry = tcx_entry_fetch_or_create(dev, ingress, &created);
+	if (!entry)
+		return -ENOMEM;
+	ret = bpf_mprog_attach(entry, &entry_new, link->prog, link, NULL, flags,
+			       id_or_fd, revision);
+	if (!ret) {
+		if (entry != entry_new) {
+			tcx_entry_update(dev, entry_new, ingress);
+			tcx_entry_sync();
+			tcx_skeys_inc(ingress);
+		}
+		bpf_mprog_commit(entry);
+	} else if (created) {
+		tcx_entry_free(entry);
+	}
+	return ret;
+}
+
+static void tcx_link_release(struct bpf_link *link)
+{
+	struct tcx_link *tcx = tcx_link(link);
+	bool ingress = tcx->location == BPF_TCX_INGRESS;
+	struct bpf_mprog_entry *entry, *entry_new;
+	struct net_device *dev;
+	int ret = 0;
+
+	rtnl_lock();
+	dev = tcx->dev;
+	if (!dev)
+		goto out;
+	entry = tcx_entry_fetch(dev, ingress);
+	if (!entry) {
+		ret = -ENOENT;
+		goto out;
+	}
+	ret = bpf_mprog_detach(entry, &entry_new, link->prog, link, 0, 0, 0);
+	if (!ret) {
+		if (!tcx_entry_is_active(entry_new))
+			entry_new = NULL;
+		tcx_entry_update(dev, entry_new, ingress);
+		tcx_entry_sync();
+		tcx_skeys_dec(ingress);
+		bpf_mprog_commit(entry);
+		if (!entry_new)
+			tcx_entry_free(entry);
+		tcx->dev = NULL;
+	}
+out:
+	WARN_ON_ONCE(ret);
+	rtnl_unlock();
+}
+
+static int tcx_link_update(struct bpf_link *link, struct bpf_prog *nprog,
+			   struct bpf_prog *oprog)
+{
+	struct tcx_link *tcx = tcx_link(link);
+	bool ingress = tcx->location == BPF_TCX_INGRESS;
+	struct bpf_mprog_entry *entry, *entry_new;
+	struct net_device *dev;
+	int ret = 0;
+
+	rtnl_lock();
+	dev = tcx->dev;
+	if (!dev) {
+		ret = -ENOLINK;
+		goto out;
+	}
+	if (oprog && link->prog != oprog) {
+		ret = -EPERM;
+		goto out;
+	}
+	oprog = link->prog;
+	if (oprog == nprog) {
+		bpf_prog_put(nprog);
+		goto out;
+	}
+	entry = tcx_entry_fetch(dev, ingress);
+	if (!entry) {
+		ret = -ENOENT;
+		goto out;
+	}
+	ret = bpf_mprog_attach(entry, &entry_new, nprog, link, oprog,
+			       BPF_F_REPLACE | BPF_F_ID,
+			       link->prog->aux->id, 0);
+	if (!ret) {
+		WARN_ON_ONCE(entry != entry_new);
+		oprog = xchg(&link->prog, nprog);
+		bpf_prog_put(oprog);
+		bpf_mprog_commit(entry);
+	}
+out:
+	rtnl_unlock();
+	return ret;
+}
+
+static void tcx_link_dealloc(struct bpf_link *link)
+{
+	kfree(tcx_link(link));
+}
+
+static void tcx_link_fdinfo(const struct bpf_link *link, struct seq_file *seq)
+{
+	const struct tcx_link *tcx = tcx_link_const(link);
+	u32 ifindex = 0;
+
+	rtnl_lock();
+	if (tcx->dev)
+		ifindex = tcx->dev->ifindex;
+	rtnl_unlock();
+
+	seq_printf(seq, "ifindex:\t%u\n", ifindex);
+	seq_printf(seq, "attach_type:\t%u (%s)\n",
+		   tcx->location,
+		   tcx->location == BPF_TCX_INGRESS ? "ingress" : "egress");
+}
+
+static int tcx_link_fill_info(const struct bpf_link *link,
+			      struct bpf_link_info *info)
+{
+	const struct tcx_link *tcx = tcx_link_const(link);
+	u32 ifindex = 0;
+
+	rtnl_lock();
+	if (tcx->dev)
+		ifindex = tcx->dev->ifindex;
+	rtnl_unlock();
+
+	info->tcx.ifindex = ifindex;
+	info->tcx.attach_type = tcx->location;
+	return 0;
+}
+
+static int tcx_link_detach(struct bpf_link *link)
+{
+	tcx_link_release(link);
+	return 0;
+}
+
+static const struct bpf_link_ops tcx_link_lops = {
+	.release	= tcx_link_release,
+	.detach		= tcx_link_detach,
+	.dealloc	= tcx_link_dealloc,
+	.update_prog	= tcx_link_update,
+	.show_fdinfo	= tcx_link_fdinfo,
+	.fill_link_info	= tcx_link_fill_info,
+};
+
+static int tcx_link_init(struct tcx_link *tcx,
+			 struct bpf_link_primer *link_primer,
+			 const union bpf_attr *attr,
+			 struct net_device *dev,
+			 struct bpf_prog *prog)
+{
+	bpf_link_init(&tcx->link, BPF_LINK_TYPE_TCX, &tcx_link_lops, prog);
+	tcx->location = attr->link_create.attach_type;
+	tcx->dev = dev;
+	return bpf_link_prime(&tcx->link, link_primer);
+}
+
+int tcx_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
+{
+	struct net *net = current->nsproxy->net_ns;
+	struct bpf_link_primer link_primer;
+	struct net_device *dev;
+	struct tcx_link *tcx;
+	int ret;
+
+	rtnl_lock();
+	dev = __dev_get_by_index(net, attr->link_create.target_ifindex);
+	if (!dev) {
+		ret = -ENODEV;
+		goto out;
+	}
+	tcx = kzalloc(sizeof(*tcx), GFP_USER);
+	if (!tcx) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	ret = tcx_link_init(tcx, &link_primer, attr, dev, prog);
+	if (ret) {
+		kfree(tcx);
+		goto out;
+	}
+	ret = tcx_link_prog_attach(&tcx->link, attr->link_create.flags,
+				   attr->link_create.tcx.relative_fd,
+				   attr->link_create.tcx.expected_revision);
+	if (ret) {
+		tcx->dev = NULL;
+		bpf_link_cleanup(&link_primer);
+		goto out;
+	}
+	ret = bpf_link_settle(&link_primer);
+out:
+	rtnl_unlock();
+	return ret;
+}
diff --git a/net/Kconfig b/net/Kconfig
index 2fb25b534df5..d532ec33f1fe 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -52,6 +52,11 @@  config NET_INGRESS
 config NET_EGRESS
 	bool
 
+config NET_XGRESS
+	select NET_INGRESS
+	select NET_EGRESS
+	bool
+
 config NET_REDIRECT
 	bool
 
diff --git a/net/core/dev.c b/net/core/dev.c
index d6e1b786c5c5..c4b826024978 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -107,6 +107,7 @@ 
 #include <net/pkt_cls.h>
 #include <net/checksum.h>
 #include <net/xfrm.h>
+#include <net/tcx.h>
 #include <linux/highmem.h>
 #include <linux/init.h>
 #include <linux/module.h>
@@ -154,7 +155,6 @@ 
 #include "dev.h"
 #include "net-sysfs.h"
 
-
 static DEFINE_SPINLOCK(ptype_lock);
 struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
 struct list_head ptype_all __read_mostly;	/* Taps */
@@ -3882,69 +3882,198 @@  int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
 EXPORT_SYMBOL(dev_loopback_xmit);
 
 #ifdef CONFIG_NET_EGRESS
-static struct sk_buff *
-sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
+static struct netdev_queue *
+netdev_tx_queue_mapping(struct net_device *dev, struct sk_buff *skb)
+{
+	int qm = skb_get_queue_mapping(skb);
+
+	return netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, qm));
+}
+
+static bool netdev_xmit_txqueue_skipped(void)
 {
+	return __this_cpu_read(softnet_data.xmit.skip_txqueue);
+}
+
+void netdev_xmit_skip_txqueue(bool skip)
+{
+	__this_cpu_write(softnet_data.xmit.skip_txqueue, skip);
+}
+EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue);
+#endif /* CONFIG_NET_EGRESS */
+
+#ifdef CONFIG_NET_XGRESS
+static int tc_run(struct tcx_entry *entry, struct sk_buff *skb)
+{
+	int ret = TC_ACT_UNSPEC;
 #ifdef CONFIG_NET_CLS_ACT
-	struct mini_Qdisc *miniq = rcu_dereference_bh(dev->miniq_egress);
-	struct tcf_result cl_res;
+	struct mini_Qdisc *miniq = rcu_dereference_bh(entry->miniq);
+	struct tcf_result res;
 
 	if (!miniq)
-		return skb;
+		return ret;
 
-	/* qdisc_skb_cb(skb)->pkt_len was already set by the caller. */
 	tc_skb_cb(skb)->mru = 0;
 	tc_skb_cb(skb)->post_ct = false;
-	mini_qdisc_bstats_cpu_update(miniq, skb);
 
-	switch (tcf_classify(skb, miniq->block, miniq->filter_list, &cl_res, false)) {
+	mini_qdisc_bstats_cpu_update(miniq, skb);
+	ret = tcf_classify(skb, miniq->block, miniq->filter_list, &res, false);
+	/* Only tcf related quirks below. */
+	switch (ret) {
+	case TC_ACT_SHOT:
+		mini_qdisc_qstats_cpu_drop(miniq);
+		break;
 	case TC_ACT_OK:
 	case TC_ACT_RECLASSIFY:
-		skb->tc_index = TC_H_MIN(cl_res.classid);
+		skb->tc_index = TC_H_MIN(res.classid);
 		break;
+	}
+#endif /* CONFIG_NET_CLS_ACT */
+	return ret;
+}
+
+static DEFINE_STATIC_KEY_FALSE(tcx_needed_key);
+
+void tcx_inc(void)
+{
+	static_branch_inc(&tcx_needed_key);
+}
+
+void tcx_dec(void)
+{
+	static_branch_dec(&tcx_needed_key);
+}
+
+static __always_inline enum tcx_action_base
+tcx_run(const struct bpf_mprog_entry *entry, struct sk_buff *skb,
+	const bool needs_mac)
+{
+	const struct bpf_mprog_fp *fp;
+	const struct bpf_prog *prog;
+	int ret = TCX_NEXT;
+
+	if (needs_mac)
+		__skb_push(skb, skb->mac_len);
+	bpf_mprog_foreach_prog(entry, fp, prog) {
+		bpf_compute_data_pointers(skb);
+		ret = bpf_prog_run(prog, skb);
+		if (ret != TCX_NEXT)
+			break;
+	}
+	if (needs_mac)
+		__skb_pull(skb, skb->mac_len);
+	return tcx_action_code(skb, ret);
+}
+
+static __always_inline struct sk_buff *
+sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
+		   struct net_device *orig_dev, bool *another)
+{
+	struct bpf_mprog_entry *entry = rcu_dereference_bh(skb->dev->tcx_ingress);
+	int sch_ret;
+
+	if (!entry)
+		return skb;
+	if (*pt_prev) {
+		*ret = deliver_skb(skb, *pt_prev, orig_dev);
+		*pt_prev = NULL;
+	}
+
+	qdisc_skb_cb(skb)->pkt_len = skb->len;
+	tcx_set_ingress(skb, true);
+
+	if (static_branch_unlikely(&tcx_needed_key)) {
+		sch_ret = tcx_run(entry, skb, true);
+		if (sch_ret != TC_ACT_UNSPEC)
+			goto ingress_verdict;
+	}
+	sch_ret = tc_run(tcx_entry(entry), skb);
+ingress_verdict:
+	switch (sch_ret) {
+	case TC_ACT_REDIRECT:
+		/* skb_mac_header check was done by BPF, so we can safely
+		 * push the L2 header back before redirecting to another
+		 * netdev.
+		 */
+		__skb_push(skb, skb->mac_len);
+		if (skb_do_redirect(skb) == -EAGAIN) {
+			__skb_pull(skb, skb->mac_len);
+			*another = true;
+			break;
+		}
+		*ret = NET_RX_SUCCESS;
+		return NULL;
 	case TC_ACT_SHOT:
-		mini_qdisc_qstats_cpu_drop(miniq);
-		*ret = NET_XMIT_DROP;
-		kfree_skb_reason(skb, SKB_DROP_REASON_TC_EGRESS);
+		kfree_skb_reason(skb, SKB_DROP_REASON_TC_INGRESS);
+		*ret = NET_RX_DROP;
 		return NULL;
+	/* used by tc_run */
 	case TC_ACT_STOLEN:
 	case TC_ACT_QUEUED:
 	case TC_ACT_TRAP:
-		*ret = NET_XMIT_SUCCESS;
 		consume_skb(skb);
+		fallthrough;
+	case TC_ACT_CONSUMED:
+		*ret = NET_RX_SUCCESS;
 		return NULL;
+	}
+
+	return skb;
+}
+
+static __always_inline struct sk_buff *
+sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
+{
+	struct bpf_mprog_entry *entry = rcu_dereference_bh(dev->tcx_egress);
+	int sch_ret;
+
+	if (!entry)
+		return skb;
+
+	/* The caller has already set qdisc_skb_cb(skb)->pkt_len and
+	 * called tcx_set_ingress().
+	 */
+	if (static_branch_unlikely(&tcx_needed_key)) {
+		sch_ret = tcx_run(entry, skb, false);
+		if (sch_ret != TC_ACT_UNSPEC)
+			goto egress_verdict;
+	}
+	sch_ret = tc_run(tcx_entry(entry), skb);
+egress_verdict:
+	switch (sch_ret) {
 	case TC_ACT_REDIRECT:
 		/* No need to push/pop skb's mac_header here on egress! */
 		skb_do_redirect(skb);
 		*ret = NET_XMIT_SUCCESS;
 		return NULL;
-	default:
-		break;
+	case TC_ACT_SHOT:
+		kfree_skb_reason(skb, SKB_DROP_REASON_TC_EGRESS);
+		*ret = NET_XMIT_DROP;
+		return NULL;
+	/* used by tc_run */
+	case TC_ACT_STOLEN:
+	case TC_ACT_QUEUED:
+	case TC_ACT_TRAP:
+		*ret = NET_XMIT_SUCCESS;
+		return NULL;
 	}
-#endif /* CONFIG_NET_CLS_ACT */
 
 	return skb;
 }
-
-static struct netdev_queue *
-netdev_tx_queue_mapping(struct net_device *dev, struct sk_buff *skb)
-{
-	int qm = skb_get_queue_mapping(skb);
-
-	return netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, qm));
-}
-
-static bool netdev_xmit_txqueue_skipped(void)
+#else
+static __always_inline struct sk_buff *
+sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
+		   struct net_device *orig_dev, bool *another)
 {
-	return __this_cpu_read(softnet_data.xmit.skip_txqueue);
+	return skb;
 }
 
-void netdev_xmit_skip_txqueue(bool skip)
+static __always_inline struct sk_buff *
+sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 {
-	__this_cpu_write(softnet_data.xmit.skip_txqueue, skip);
+	return skb;
 }
-EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue);
-#endif /* CONFIG_NET_EGRESS */
+#endif /* CONFIG_NET_XGRESS */
 
 #ifdef CONFIG_XPS
 static int __get_xps_queue_idx(struct net_device *dev, struct sk_buff *skb,
@@ -4128,9 +4257,7 @@  int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 	skb_update_prio(skb);
 
 	qdisc_pkt_len_init(skb);
-#ifdef CONFIG_NET_CLS_ACT
-	skb->tc_at_ingress = 0;
-#endif
+	tcx_set_ingress(skb, false);
 #ifdef CONFIG_NET_EGRESS
 	if (static_branch_unlikely(&egress_needed_key)) {
 		if (nf_hook_egress_active()) {
@@ -5064,72 +5191,6 @@  int (*br_fdb_test_addr_hook)(struct net_device *dev,
 EXPORT_SYMBOL_GPL(br_fdb_test_addr_hook);
 #endif
 
-static inline struct sk_buff *
-sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
-		   struct net_device *orig_dev, bool *another)
-{
-#ifdef CONFIG_NET_CLS_ACT
-	struct mini_Qdisc *miniq = rcu_dereference_bh(skb->dev->miniq_ingress);
-	struct tcf_result cl_res;
-
-	/* If there's at least one ingress present somewhere (so
-	 * we get here via enabled static key), remaining devices
-	 * that are not configured with an ingress qdisc will bail
-	 * out here.
-	 */
-	if (!miniq)
-		return skb;
-
-	if (*pt_prev) {
-		*ret = deliver_skb(skb, *pt_prev, orig_dev);
-		*pt_prev = NULL;
-	}
-
-	qdisc_skb_cb(skb)->pkt_len = skb->len;
-	tc_skb_cb(skb)->mru = 0;
-	tc_skb_cb(skb)->post_ct = false;
-	skb->tc_at_ingress = 1;
-	mini_qdisc_bstats_cpu_update(miniq, skb);
-
-	switch (tcf_classify(skb, miniq->block, miniq->filter_list, &cl_res, false)) {
-	case TC_ACT_OK:
-	case TC_ACT_RECLASSIFY:
-		skb->tc_index = TC_H_MIN(cl_res.classid);
-		break;
-	case TC_ACT_SHOT:
-		mini_qdisc_qstats_cpu_drop(miniq);
-		kfree_skb_reason(skb, SKB_DROP_REASON_TC_INGRESS);
-		*ret = NET_RX_DROP;
-		return NULL;
-	case TC_ACT_STOLEN:
-	case TC_ACT_QUEUED:
-	case TC_ACT_TRAP:
-		consume_skb(skb);
-		*ret = NET_RX_SUCCESS;
-		return NULL;
-	case TC_ACT_REDIRECT:
-		/* skb_mac_header check was done by cls/act_bpf, so
-		 * we can safely push the L2 header back before
-		 * redirecting to another netdev
-		 */
-		__skb_push(skb, skb->mac_len);
-		if (skb_do_redirect(skb) == -EAGAIN) {
-			__skb_pull(skb, skb->mac_len);
-			*another = true;
-			break;
-		}
-		*ret = NET_RX_SUCCESS;
-		return NULL;
-	case TC_ACT_CONSUMED:
-		*ret = NET_RX_SUCCESS;
-		return NULL;
-	default:
-		break;
-	}
-#endif /* CONFIG_NET_CLS_ACT */
-	return skb;
-}
-
 /**
  *	netdev_is_rx_handler_busy - check if receive handler is registered
  *	@dev: device to check
@@ -10834,7 +10895,7 @@  void unregister_netdevice_many_notify(struct list_head *head,
 
 		/* Shutdown queueing discipline. */
 		dev_shutdown(dev);
-
+		dev_tcx_uninstall(dev);
 		dev_xdp_uninstall(dev);
 		bpf_dev_bound_netdev_unregister(dev);
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 06ba0e56e369..e39a8a20dd10 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -9312,7 +9312,7 @@  static struct bpf_insn *bpf_convert_tstamp_read(const struct bpf_prog *prog,
 	__u8 value_reg = si->dst_reg;
 	__u8 skb_reg = si->src_reg;
 
-#ifdef CONFIG_NET_CLS_ACT
+#ifdef CONFIG_NET_XGRESS
 	/* If the tstamp_type is read,
 	 * the bpf prog is aware the tstamp could have delivery time.
 	 * Thus, read skb->tstamp as is if tstamp_type_access is true.
@@ -9346,7 +9346,7 @@  static struct bpf_insn *bpf_convert_tstamp_write(const struct bpf_prog *prog,
 	__u8 value_reg = si->src_reg;
 	__u8 skb_reg = si->dst_reg;
 
-#ifdef CONFIG_NET_CLS_ACT
+#ifdef CONFIG_NET_XGRESS
 	/* If the tstamp_type is read,
 	 * the bpf prog is aware the tstamp could have delivery time.
 	 * Thus, write skb->tstamp as is if tstamp_type_access is true.
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 4b95cb1ac435..470c70deffe2 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -347,8 +347,7 @@  config NET_SCH_FQ_PIE
 config NET_SCH_INGRESS
 	tristate "Ingress/classifier-action Qdisc"
 	depends on NET_CLS_ACT
-	select NET_INGRESS
-	select NET_EGRESS
+	select NET_XGRESS
 	help
 	  Say Y here if you want to use classifiers for incoming and/or outgoing
 	  packets. This qdisc doesn't do anything else besides running classifiers,
@@ -679,6 +678,7 @@  config NET_EMATCH_IPT
 config NET_CLS_ACT
 	bool "Actions"
 	select NET_CLS
+	select NET_XGRESS
 	help
 	  Say Y here if you want to use traffic control actions. Actions
 	  get attached to classifiers and are invoked after a successful
diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
index e43a45499372..04e886f6cee4 100644
--- a/net/sched/sch_ingress.c
+++ b/net/sched/sch_ingress.c
@@ -13,6 +13,7 @@ 
 #include <net/netlink.h>
 #include <net/pkt_sched.h>
 #include <net/pkt_cls.h>
+#include <net/tcx.h>
 
 struct ingress_sched_data {
 	struct tcf_block *block;
@@ -78,6 +79,8 @@  static int ingress_init(struct Qdisc *sch, struct nlattr *opt,
 {
 	struct ingress_sched_data *q = qdisc_priv(sch);
 	struct net_device *dev = qdisc_dev(sch);
+	struct bpf_mprog_entry *entry;
+	bool created;
 	int err;
 
 	if (sch->parent != TC_H_INGRESS)
@@ -85,7 +88,13 @@  static int ingress_init(struct Qdisc *sch, struct nlattr *opt,
 
 	net_inc_ingress_queue();
 
-	mini_qdisc_pair_init(&q->miniqp, sch, &dev->miniq_ingress);
+	entry = tcx_entry_fetch_or_create(dev, true, &created);
+	if (!entry)
+		return -ENOMEM;
+	tcx_miniq_set_active(entry, true);
+	mini_qdisc_pair_init(&q->miniqp, sch, &tcx_entry(entry)->miniq);
+	if (created)
+		tcx_entry_update(dev, entry, true);
 
 	q->block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS;
 	q->block_info.chain_head_change = clsact_chain_head_change;
@@ -103,11 +112,22 @@  static int ingress_init(struct Qdisc *sch, struct nlattr *opt,
 static void ingress_destroy(struct Qdisc *sch)
 {
 	struct ingress_sched_data *q = qdisc_priv(sch);
+	struct net_device *dev = qdisc_dev(sch);
+	struct bpf_mprog_entry *entry = rtnl_dereference(dev->tcx_ingress);
 
 	if (sch->parent != TC_H_INGRESS)
 		return;
 
 	tcf_block_put_ext(q->block, sch, &q->block_info);
+
+	if (entry) {
+		tcx_miniq_set_active(entry, false);
+		if (!tcx_entry_is_active(entry)) {
+			tcx_entry_update(dev, NULL, true);
+			tcx_entry_free(entry);
+		}
+	}
+
 	net_dec_ingress_queue();
 }
 
@@ -223,6 +243,8 @@  static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
 {
 	struct clsact_sched_data *q = qdisc_priv(sch);
 	struct net_device *dev = qdisc_dev(sch);
+	struct bpf_mprog_entry *entry;
+	bool created;
 	int err;
 
 	if (sch->parent != TC_H_CLSACT)
@@ -231,7 +253,13 @@  static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
 	net_inc_ingress_queue();
 	net_inc_egress_queue();
 
-	mini_qdisc_pair_init(&q->miniqp_ingress, sch, &dev->miniq_ingress);
+	entry = tcx_entry_fetch_or_create(dev, true, &created);
+	if (!entry)
+		return -ENOMEM;
+	tcx_miniq_set_active(entry, true);
+	mini_qdisc_pair_init(&q->miniqp_ingress, sch, &tcx_entry(entry)->miniq);
+	if (created)
+		tcx_entry_update(dev, entry, true);
 
 	q->ingress_block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS;
 	q->ingress_block_info.chain_head_change = clsact_chain_head_change;
@@ -244,7 +272,13 @@  static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
 
 	mini_qdisc_pair_block_init(&q->miniqp_ingress, q->ingress_block);
 
-	mini_qdisc_pair_init(&q->miniqp_egress, sch, &dev->miniq_egress);
+	entry = tcx_entry_fetch_or_create(dev, false, &created);
+	if (!entry)
+		return -ENOMEM;
+	tcx_miniq_set_active(entry, true);
+	mini_qdisc_pair_init(&q->miniqp_egress, sch, &tcx_entry(entry)->miniq);
+	if (created)
+		tcx_entry_update(dev, entry, false);
 
 	q->egress_block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_EGRESS;
 	q->egress_block_info.chain_head_change = clsact_chain_head_change;
@@ -256,12 +290,31 @@  static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
 static void clsact_destroy(struct Qdisc *sch)
 {
 	struct clsact_sched_data *q = qdisc_priv(sch);
+	struct net_device *dev = qdisc_dev(sch);
+	struct bpf_mprog_entry *ingress_entry = rtnl_dereference(dev->tcx_ingress);
+	struct bpf_mprog_entry *egress_entry = rtnl_dereference(dev->tcx_egress);
 
 	if (sch->parent != TC_H_CLSACT)
 		return;
 
-	tcf_block_put_ext(q->egress_block, sch, &q->egress_block_info);
 	tcf_block_put_ext(q->ingress_block, sch, &q->ingress_block_info);
+	tcf_block_put_ext(q->egress_block, sch, &q->egress_block_info);
+
+	if (ingress_entry) {
+		tcx_miniq_set_active(ingress_entry, false);
+		if (!tcx_entry_is_active(ingress_entry)) {
+			tcx_entry_update(dev, NULL, true);
+			tcx_entry_free(ingress_entry);
+		}
+	}
+
+	if (egress_entry) {
+		tcx_miniq_set_active(egress_entry, false);
+		if (!tcx_entry_is_active(egress_entry)) {
+			tcx_entry_update(dev, NULL, false);
+			tcx_entry_free(egress_entry);
+		}
+	}
 
 	net_dec_ingress_queue();
 	net_dec_egress_queue();
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 1c166870cdf3..47b76925189f 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1036,6 +1036,8 @@  enum bpf_attach_type {
 	BPF_LSM_CGROUP,
 	BPF_STRUCT_OPS,
 	BPF_NETFILTER,
+	BPF_TCX_INGRESS,
+	BPF_TCX_EGRESS,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -1053,7 +1055,7 @@  enum bpf_link_type {
 	BPF_LINK_TYPE_KPROBE_MULTI = 8,
 	BPF_LINK_TYPE_STRUCT_OPS = 9,
 	BPF_LINK_TYPE_NETFILTER = 10,
-
+	BPF_LINK_TYPE_TCX = 11,
 	MAX_BPF_LINK_TYPE,
 };
 
@@ -1569,13 +1571,13 @@  union bpf_attr {
 			__u32		map_fd;		/* struct_ops to attach */
 		};
 		union {
-			__u32		target_fd;	/* object to attach to */
-			__u32		target_ifindex; /* target ifindex */
+			__u32	target_fd;	/* target object to attach to or ... */
+			__u32	target_ifindex; /* target ifindex */
 		};
 		__u32		attach_type;	/* attach type */
 		__u32		flags;		/* extra flags */
 		union {
-			__u32		target_btf_id;	/* btf_id of target to attach to */
+			__u32	target_btf_id;	/* btf_id of target to attach to */
 			struct {
 				__aligned_u64	iter_info;	/* extra bpf_iter_link_info */
 				__u32		iter_info_len;	/* iter_info length */
@@ -1609,6 +1611,13 @@  union bpf_attr {
 				__s32		priority;
 				__u32		flags;
 			} netfilter;
+			struct {
+				union {
+					__u32	relative_fd;
+					__u32	relative_id;
+				};
+				__u64		expected_revision;
+			} tcx;
 		};
 	} link_create;
 
@@ -6217,6 +6226,19 @@  struct bpf_sock_tuple {
 	};
 };
 
+/* (Simplified) user return codes for tcx prog type.
+ * A valid tcx program must return one of these defined values. All other
+ * return codes are reserved for future use. Must remain compatible with
+ * their TC_ACT_* counterparts. For compatibility in behavior, unknown
+ * return codes are mapped to TCX_NEXT.
+ */
+enum tcx_action_base {
+	TCX_NEXT	= -1,
+	TCX_PASS	= 0,
+	TCX_DROP	= 2,
+	TCX_REDIRECT	= 7,
+};
+
 struct bpf_xdp_sock {
 	__u32 queue_id;
 };
@@ -6499,6 +6521,10 @@  struct bpf_link_info {
 				} event; /* BPF_PERF_EVENT_EVENT */
 			};
 		} perf_event;
+		struct {
+			__u32 ifindex;
+			__u32 attach_type;
+		} tcx;
 	};
 } __attribute__((aligned(8)));