Message ID: 20210528195946.2375109-1-memxor@gmail.com
Series: Add bpf_link based TC-BPF API
On Fri, May 28, 2021 at 1:00 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote: > > This is the first RFC version. > > This adds a bpf_link path to create TC filters tied to cls_bpf classifier, and > introduces fd based ownership for such TC filters. Netlink cannot delete or > replace such filters, but the bpf_link is severed on indirect destruction of the > filter (backing qdisc being deleted, or chain being flushed, etc.). To ensure > that filters remain attached beyond process lifetime, the usual bpf_link fd > pinning approach can be used. > > The individual patches contain more details and comments, but the overall kernel > API and libbpf helper mirrors the semantics of the netlink based TC-BPF API > merged recently. This means that we start by always setting direct action mode, > protocol to ETH_P_ALL, chain_index as 0, etc. If there is a need for more > options in the future, they can be easily exposed through the bpf_link API in > the future. > > Patch 1 refactors cls_bpf change function to extract two helpers that will be > reused in bpf_link creation. > > Patch 2 exports some bpf_link management functions to modules. This is needed > because our bpf_link object is tied to the cls_bpf_prog object. Tying it to > tcf_proto would be weird, because the update path has to replace offloaded bpf > prog, which happens using internal cls_bpf helpers, and would in general be more > code to abstract over an operation that is unlikely to be implemented for other > filter types. > > Patch 3 adds the main bpf_link API. A function in cls_api takes care of > obtaining block reference, creating the filter object, and then calls the > bpf_link_change tcf_proto op (only supported by cls_bpf) that returns a fd after > setting up the internal structures. An optimization is made to not keep around > resources for extended actions, which is explained in a code comment as it wasn't > immediately obvious. > > Patch 4 adds an update path for bpf_link. Since bpf_link_update only supports > replacing the bpf_prog, we can skip tc filter's change path by reusing the > filter object but swapping its bpf_prog. This takes care of replacing the > offloaded prog as well (if that fails, update is aborted). So far however, > tcf_classify could do normal load (possibly torn) as the cls_bpf_prog->filter > would never be modified concurrently. This is no longer true, and to not > penalize the classify hot path, we also cannot impose serialization around > its load. Hence the load is changed to READ_ONCE, so that the pointer value is > always consistent. Due to invocation in a RCU critical section, the lifetime of > the prog is guaranteed for the duration of the call. > > Patch 5, 6 take care of updating the userspace bits and add a bpf_link returning > function to libbpf. > > Patch 7 adds a selftest that exercises all possible problematic interactions > that I could think of. > > Design: > > This is where in the object hierarchy our bpf_link object is attached. > > ┌─────┐ > │ │ > │ BPF │ > program > │ │ > └──▲──┘ > ┌───────┐ │ > │ │ ┌──────┴───────┐ > │ mod ├─────────► cls_bpf_prog │ > ┌────────────────┐ │cls_bpf│ └────┬───▲─────┘ > │ tcf_block │ │ │ │ │ > └────────┬───────┘ └───▲───┘ │ │ > │ ┌─────────────┐ │ ┌─▼───┴──┐ > └──────────► tcf_chain │ │ │bpf_link│ > └───────┬─────┘ │ └────────┘ > │ ┌─────────────┐ │ > └──────────► tcf_proto ├────┘ > └─────────────┘ > > The bpf_link is detached on destruction of the cls_bpf_prog. 
Doing it this way > allows us to implement update in a lightweight manner without having to recreate > a new filter, where we can just replace the BPF prog attached to cls_bpf_prog. > > The other way to do it would be to link the bpf_link to tcf_proto, there are > numerous downsides to this: > > 1. All filters have to embed the pointer even though they won't be using it when > cls_bpf is compiled in. > 2. This probably won't make sense to be extended to other filter types anyway. > 3. We aren't able to optimize the update case without adding another bpf_link > specific update operation to tcf_proto ops. > > The downside with tying this to the module is having to export bpf_link > management functions and introducing a tcf_proto op. Hopefully the cost of > another operation func pointer is not big enough (as there is only one ops > struct per module). > > This first version is to collect feedback on the approach and get ideas if there > is a better way to do this. Bpf_link-based TC API is a long time coming, so it's great to see someone finally working on this. Thanks! I briefly skimmed through the patch set, noticed a few generic bpf_link problems. But I think main feedback will come from Cilium folks and others that heavily rely on TC APIs. I wonder if there is an opportunity to simplify the API further given we have a new opportunity here. I don't think we are constrained to follow legacy TC API exactly. The problem is that your patch set was marked as spam by Google, so I suspect a bunch of folks haven't gotten it. I suggest re-sending it again but trimming down the CC list, leaving only bpf@vger, netdev@vger, and BPF maintainers CC'ed directly. > > Kumar Kartikeya Dwivedi (7): > net: sched: refactor cls_bpf creation code > bpf: export bpf_link functions for modules > net: sched: add bpf_link API for bpf classifier > net: sched: add lightweight update path for cls_bpf > tools: bpf.h: sync with kernel sources > libbpf: add bpf_link based TC-BPF management API > libbpf: add selftest for bpf_link based TC-BPF management API > > include/linux/bpf_types.h | 3 + > include/net/pkt_cls.h | 13 + > include/net/sch_generic.h | 6 +- > include/uapi/linux/bpf.h | 15 + > kernel/bpf/syscall.c | 14 +- > net/sched/cls_api.c | 138 ++++++- > net/sched/cls_bpf.c | 386 ++++++++++++++++-- > tools/include/uapi/linux/bpf.h | 15 + > tools/lib/bpf/bpf.c | 5 + > tools/lib/bpf/bpf.h | 8 +- > tools/lib/bpf/libbpf.c | 59 ++- > tools/lib/bpf/libbpf.h | 17 + > tools/lib/bpf/libbpf.map | 1 + > tools/lib/bpf/netlink.c | 5 +- > tools/lib/bpf/netlink.h | 8 + > .../selftests/bpf/prog_tests/tc_bpf_link.c | 285 +++++++++++++ > 16 files changed, 934 insertions(+), 44 deletions(-) > create mode 100644 tools/lib/bpf/netlink.h > create mode 100644 tools/testing/selftests/bpf/prog_tests/tc_bpf_link.c > > -- > 2.31.1 >
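Sketched concretely, the READ_ONCE change described in patch 4 of the cover
letter amounts to something like this in cls_bpf's classify hot path (an
illustrative sketch only; names and verdict handling are simplified relative
to the real cls_bpf code):

    /* Inside cls_bpf_classify(), running under an RCU read section */
    list_for_each_entry_rcu(prog, &head->plist, link) {
            /* BPF_LINK_UPDATE can now swap prog->filter concurrently, so
             * a plain load could be torn. READ_ONCE yields a consistent
             * pointer value, and the RCU read section keeps the prog
             * alive for the duration of the call.
             */
            struct bpf_prog *filter = READ_ONCE(prog->filter);
            int ret = bpf_prog_run(filter, skb);

            if (ret == TC_ACT_UNSPEC)
                    continue;       /* try the next filter in the list */
            return ret;
    }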
On Thu, Jun 03, 2021 at 02:39:15AM IST, Andrii Nakryiko wrote: > On Fri, May 28, 2021 at 1:00 PM Kumar Kartikeya Dwivedi > <memxor@gmail.com> wrote: > > > > This is the first RFC version. > > > > This adds a bpf_link path to create TC filters tied to cls_bpf classifier, and > > introduces fd based ownership for such TC filters. Netlink cannot delete or > > replace such filters, but the bpf_link is severed on indirect destruction of the > > filter (backing qdisc being deleted, or chain being flushed, etc.). To ensure > > that filters remain attached beyond process lifetime, the usual bpf_link fd > > pinning approach can be used. > > > > The individual patches contain more details and comments, but the overall kernel > > API and libbpf helper mirrors the semantics of the netlink based TC-BPF API > > merged recently. This means that we start by always setting direct action mode, > > protocol to ETH_P_ALL, chain_index as 0, etc. If there is a need for more > > options in the future, they can be easily exposed through the bpf_link API in > > the future. > > > > Patch 1 refactors cls_bpf change function to extract two helpers that will be > > reused in bpf_link creation. > > > > Patch 2 exports some bpf_link management functions to modules. This is needed > > because our bpf_link object is tied to the cls_bpf_prog object. Tying it to > > tcf_proto would be weird, because the update path has to replace offloaded bpf > > prog, which happens using internal cls_bpf helpers, and would in general be more > > code to abstract over an operation that is unlikely to be implemented for other > > filter types. > > > > Patch 3 adds the main bpf_link API. A function in cls_api takes care of > > obtaining block reference, creating the filter object, and then calls the > > bpf_link_change tcf_proto op (only supported by cls_bpf) that returns a fd after > > setting up the internal structures. An optimization is made to not keep around > > resources for extended actions, which is explained in a code comment as it wasn't > > immediately obvious. > > > > Patch 4 adds an update path for bpf_link. Since bpf_link_update only supports > > replacing the bpf_prog, we can skip tc filter's change path by reusing the > > filter object but swapping its bpf_prog. This takes care of replacing the > > offloaded prog as well (if that fails, update is aborted). So far however, > > tcf_classify could do normal load (possibly torn) as the cls_bpf_prog->filter > > would never be modified concurrently. This is no longer true, and to not > > penalize the classify hot path, we also cannot impose serialization around > > its load. Hence the load is changed to READ_ONCE, so that the pointer value is > > always consistent. Due to invocation in a RCU critical section, the lifetime of > > the prog is guaranteed for the duration of the call. > > > > Patch 5, 6 take care of updating the userspace bits and add a bpf_link returning > > function to libbpf. > > > > Patch 7 adds a selftest that exercises all possible problematic interactions > > that I could think of. > > > > Design: > > > > This is where in the object hierarchy our bpf_link object is attached. 
> > > > ┌─────┐ > > │ │ > > │ BPF │ > > program > > │ │ > > └──▲──┘ > > ┌───────┐ │ > > │ │ ┌──────┴───────┐ > > │ mod ├─────────► cls_bpf_prog │ > > ┌────────────────┐ │cls_bpf│ └────┬───▲─────┘ > > │ tcf_block │ │ │ │ │ > > └────────┬───────┘ └───▲───┘ │ │ > > │ ┌─────────────┐ │ ┌─▼───┴──┐ > > └──────────► tcf_chain │ │ │bpf_link│ > > └───────┬─────┘ │ └────────┘ > > │ ┌─────────────┐ │ > > └──────────► tcf_proto ├────┘ > > └─────────────┘ > > > > The bpf_link is detached on destruction of the cls_bpf_prog. Doing it this way > > allows us to implement update in a lightweight manner without having to recreate > > a new filter, where we can just replace the BPF prog attached to cls_bpf_prog. > > > > The other way to do it would be to link the bpf_link to tcf_proto, there are > > numerous downsides to this: > > > > 1. All filters have to embed the pointer even though they won't be using it when > > cls_bpf is compiled in. > > 2. This probably won't make sense to be extended to other filter types anyway. > > 3. We aren't able to optimize the update case without adding another bpf_link > > specific update operation to tcf_proto ops. > > > > The downside with tying this to the module is having to export bpf_link > > management functions and introducing a tcf_proto op. Hopefully the cost of > > another operation func pointer is not big enough (as there is only one ops > > struct per module). > > > > This first version is to collect feedback on the approach and get ideas if there > > is a better way to do this. > > Bpf_link-based TC API is a long time coming, so it's great to see > someone finally working on this. Thanks! > > I briefly skimmed through the patch set, noticed a few generic > bpf_link problems. But I think main feedback will come from Cilium Thanks for the review. I'll fix both of these in the resend (also have a couple of private reports from the kernel test robot). > folks and others that heavily rely on TC APIs. I wonder if there is an > opportunity to simplify the API further given we have a new > opportunity here. I don't think we are constrained to follow legacy TC > API exactly. > I tried to keep it simple by going for the defaults we agreed upon for the new netlink based libbpf API, and always setting direct action mode, and it's still in a position to be extended in the future to allow full TC filter setup like netlink does, if someone ever happens to need that. As for the implementation, I did notice that there has been discussion around this (though I could only find [0]) but I think doing it the way this patch does is more flexible as you can attach the bpf filter to an aribitrary parent/class, not just ingress and egress, and it can coexist with a conventional TC setup. [0]: https://facebookmicrosites.github.io/bpf/blog/2018/08/31/object-lifetime.html ("Note that there is ongoing work ...") > The problem is that your patch set was marked as spam by Google, so I > suspect a bunch of folks haven't gotten it. I suggest re-sending it > again but trimming down the CC list, leaving only bpf@vger, > netdev@vger, and BPF maintainers CC'ed directly. > Thanks for the heads up, I'll resend tomorrow. -- Kartikeya
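With the defaults described above (direct action, ETH_P_ALL, chain 0),
attachment from userspace could be nearly a one-liner. A sketch of how use of
the RFC's libbpf helper might look; the helper name bpf_program__attach_tc and
its argument list are placeholders, not the final API:

    struct bpf_object *obj;
    struct bpf_program *prog;
    struct bpf_link *link;

    obj = bpf_object__open_file("cls.o", NULL);
    bpf_object__load(obj);
    prog = bpf_object__find_program_by_name(obj, "classify");

    /* Hypothetical helper from patch 6: creates a cls_bpf filter
     * (direct-action, ETH_P_ALL, chain 0 by default) on the clsact
     * ingress hook and returns a bpf_link owning that filter.
     */
    link = bpf_program__attach_tc(prog, if_nametoindex("eth0"),
                                  TC_H_MAKE(TC_H_CLSACT, TC_H_MIN_INGRESS));

    /* Pinning keeps the filter attached beyond process lifetime. */
    bpf_link__pin(link, "/sys/fs/bpf/my_tc_link");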
On Thu, Jun 03, 2021 at 03:15:13AM +0530, Kumar Kartikeya Dwivedi wrote: > > > The problem is that your patch set was marked as spam by Google, so I > > suspect a bunch of folks haven't gotten it. I suggest re-sending it > > again but trimming down the CC list, leaving only bpf@vger, > > netdev@vger, and BPF maintainers CC'ed directly. > > > > Thanks for the heads up, I'll resend tomorrow. fyi I see this thread in my inbox, but, sadly, not the patches. So guessing based on cover letter and hoping that the following is true: link_fd is returned by BPF_LINK_CREATE command. If anything is missing in struct link_create the patches are adding it there. target_ifindex, flags are reused. attach_type indicates ingress vs egress.
On Thu, Jun 03, 2021 at 05:20:58AM IST, Alexei Starovoitov wrote: > On Thu, Jun 03, 2021 at 03:15:13AM +0530, Kumar Kartikeya Dwivedi wrote: > > > > > The problem is that your patch set was marked as spam by Google, so I > > > suspect a bunch of folks haven't gotten it. I suggest re-sending it > > > again but trimming down the CC list, leaving only bpf@vger, > > > netdev@vger, and BPF maintainers CC'ed directly. > > > > > > > Thanks for the heads up, I'll resend tomorrow. > > fyi I see this thread in my inbox, but, sadly, not the patches. > So guessing based on cover letter and hoping that the following is true: > link_fd is returned by BPF_LINK_CREATE command. > If anything is missing in struct link_create the patches are adding it there. > target_ifindex, flags are reused. attach_type indicates ingress vs egress. Everything is true except the attach_type part. I don't hook directly into sch_handle_{ingress,egress}. It's a normal TC filter, and if one wants to hook into ingress, egress, they attach it to clsact qdisc. The lifetime however is decided by the link fd. The new version is here: https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com -- Kartikeya
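Put in code, the exchange above suggests a create step roughly like the
following at the syscall level (a guess at the UAPI, mirroring Alexei's
reading of the cover letter; the attach type name and the note about parent
fields are assumptions, not confirmed against the patches):

    union bpf_attr attr = {};
    int link_fd;

    attr.link_create.prog_fd = prog_fd;          /* SCHED_CLS program */
    attr.link_create.target_ifindex = ifindex;   /* reused, as guessed */
    attr.link_create.attach_type = BPF_TC;       /* placeholder name; per the
                                                  * reply it does not encode
                                                  * ingress vs egress - the
                                                  * qdisc parent (e.g. the
                                                  * clsact ingress class)
                                                  * travels in new fields
                                                  * added by the patches */

    link_fd = syscall(__NR_bpf, BPF_LINK_CREATE, &attr, sizeof(attr));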
On Fri, May 28, 2021 at 1:00 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> This is the first RFC version.
>
> This adds a bpf_link path to create TC filters tied to cls_bpf classifier, and
> introduces fd based ownership for such TC filters. Netlink cannot delete or
> replace such filters, but the bpf_link is severed on indirect destruction of the
> filter (backing qdisc being deleted, or chain being flushed, etc.). To ensure
> that filters remain attached beyond process lifetime, the usual bpf_link fd
> pinning approach can be used.

I have some trouble understanding this. So... why is the TC filter so special
here that it deserves such special treatment?

The reason why I ask is that none of the other bpf links actually create any
object; they merely attach a bpf program to an existing object. For example, a
netns bpf_link does not create a netns, and a cgroup bpf_link does not create
a cgroup either. So why is the TC filter so lucky as to be the first one that
requires creating an object?

Is it because there is no fd associated with any TC object? TC objects, like
all other netlink-based things, are not fs based, hence do not have an fd. Or
maybe you don't need an fd at all? At least the xdp bpf_link is associated
with a netdev, which does not have an fd either.

>
> The individual patches contain more details and comments, but the overall kernel
> API and libbpf helper mirrors the semantics of the netlink based TC-BPF API
> merged recently. This means that we start by always setting direct action mode,
> protocol to ETH_P_ALL, chain_index as 0, etc. If there is a need for more
> options in the future, they can be easily exposed through the bpf_link API in
> the future.

As you already see, this fits really oddly into the TC infrastructure, because
TC qdisc/filter/action is a whole subsystem, and here you are trying to punch
a hole in the middle. ;) This usually indicates that we are going in the wrong
direction; maybe your case is an exception, but I can't find anything to
justify it in your cover letter.

Even if you really want to go down this path (I still doubt it), you probably
want to explore whether there is any generic way to associate a TC object with
an fd, because we have a TC bpf action and we will have a TC bpf qdisc too; I
don't see why cls_bpf is more special than they are.

Thanks.
On Mon, Jun 07, 2021 at 05:07:28AM IST, Cong Wang wrote: > On Fri, May 28, 2021 at 1:00 PM Kumar Kartikeya Dwivedi > <memxor@gmail.com> wrote: > > > > This is the first RFC version. > > > > This adds a bpf_link path to create TC filters tied to cls_bpf classifier, and > > introduces fd based ownership for such TC filters. Netlink cannot delete or > > replace such filters, but the bpf_link is severed on indirect destruction of the > > filter (backing qdisc being deleted, or chain being flushed, etc.). To ensure > > that filters remain attached beyond process lifetime, the usual bpf_link fd > > pinning approach can be used. > > I have some troubles understanding this. So... why TC filter is so special > here that it deserves such a special treatment? > So the motivation behind this was automatic cleanup of filters installed by some program. Usually from the userspace side you need a bunch of things (handle, priority, protocol, chain_index, etc.) to be able to delete a filter without stepping on others' toes. Also, there is no gurantee that filter hasn't been replaced, so you need to check some other way (either tag or prog_id, but these are also not perfect). bpf_link provides isolation from netlink and fd-based lifetime management. As for why it needs special treatment (by which I guess you mean why it _creates_ an object instead of simply attaching to one, see below): > The reason why I ask is that none of other bpf links actually create any > object, they merely attach bpf program to an existing object. For example, > netns bpf_link does not create an netns, cgroup bpf_link does not create > a cgroup either. So, why TC filter is so lucky to be the first one requires > creating an object? > They are created behind the scenes, but are also fairly isolated from netlink (i.e. can only be introspected, so not hidden in that sense, but are effectively locked for replace/delete). The problem would be (of not creating a filter during attach) is that a typical 'attach point' for TC exists in form of tcf_proto. If we take priority (protocol is fixed) out of the equation, it becomes possible to attach just a single BPF prog, but that seems like a needless limitation when TC already supports list of filters at each 'attach point'. My point is that the created object (the tcf_proto under the 'chain' object) is the attach point, and since there can be so many, keeping them around at all times doesn't make sense, so the refcounted attach locations are created as needed. Both netlink and bpf_link owned filters can be attached under the same location, with different ownership story in userspace. > Is it because there is no fd associated with any TC object? Or what? > TC object, like all other netlink stuffs, is not fs based, hence does not > have an fd. Or maybe you don't need an fd at all? Since at least xdp > bpf_link is associated with a netdev which does not have an fd either. > > > > > The individual patches contain more details and comments, but the overall kernel > > API and libbpf helper mirrors the semantics of the netlink based TC-BPF API > > merged recently. This means that we start by always setting direct action mode, > > protocol to ETH_P_ALL, chain_index as 0, etc. If there is a need for more > > options in the future, they can be easily exposed through the bpf_link API in > > the future. > > As you already see, this fits really oddly into TC infrastructure, because > TC qdisc/filter/action are a whole subsystem, here you are trying to punch > a hole in the middle. 
;) This usually indicates that we are going in a wrong > direction, maybe your case is an exception, but I can't find anything to justify > it in your cover letter. > I don't see why I'm punching a hole. The qdisc, chain, protocol, priority is the 'attach location', handle is just an ID, maybe we can skip all this and just create a static hook for attaching single BPF program that doesn't require creating a filter, but someday someone will have to reimplement chaining of programs again (like libxdp does). > Even if you really want to go down this path (I still double), you probably > want to explore whether there is any generic way to associate a TC object > with an fd, because we have TC bpf action and we will have TC bpf qdisc > too, I don't see any bpf_cls is more special than them. > I think TC bpf actions are not relevant going forward (due to cls_bpf's direct action mode), but I could be wrong. I say so because even a proposed API to attach these from libbpf was dropped because arguably cls_bpf does it better, and people shouldn't be using integrated actions going forward. TC bpf qdisc might be, but that can be a different attach type (say BPF_SCHED), which if exposed through bpf_link will again have its attach point to be the target_ifindex, not some fd, and it would still be possible to use this API to attach to a eBPF qdisc. What do you suggest? I am open to reworking this in a different way if there are any better ideas. > Thanks. -- Kartikeya
On Sun, Jun 6, 2021 at 8:38 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote: > > On Mon, Jun 07, 2021 at 05:07:28AM IST, Cong Wang wrote: > > On Fri, May 28, 2021 at 1:00 PM Kumar Kartikeya Dwivedi > > <memxor@gmail.com> wrote: > > > > > > This is the first RFC version. > > > > > > This adds a bpf_link path to create TC filters tied to cls_bpf classifier, and > > > introduces fd based ownership for such TC filters. Netlink cannot delete or > > > replace such filters, but the bpf_link is severed on indirect destruction of the > > > filter (backing qdisc being deleted, or chain being flushed, etc.). To ensure > > > that filters remain attached beyond process lifetime, the usual bpf_link fd > > > pinning approach can be used. > > > > I have some troubles understanding this. So... why TC filter is so special > > here that it deserves such a special treatment? > > > > So the motivation behind this was automatic cleanup of filters installed by some > program. Usually from the userspace side you need a bunch of things (handle, > priority, protocol, chain_index, etc.) to be able to delete a filter without > stepping on others' toes. Also, there is no gurantee that filter hasn't been > replaced, so you need to check some other way (either tag or prog_id, but these > are also not perfect). > > bpf_link provides isolation from netlink and fd-based lifetime management. As > for why it needs special treatment (by which I guess you mean why it _creates_ > an object instead of simply attaching to one, see below): Are you saying TC filter is not independent? IOW, it has to rely on TC qdisc to exist. This is true, and is of course different with netns/cgroup. This is perhaps not hard to solve, because TC actions are already independent, we can perhaps convert TC filters too (the biggest blocker is compatibility). Or do you just need an ephemeral representation of a TC filter which only exists for a process? If so, see below. > > > The reason why I ask is that none of other bpf links actually create any > > object, they merely attach bpf program to an existing object. For example, > > netns bpf_link does not create an netns, cgroup bpf_link does not create > > a cgroup either. So, why TC filter is so lucky to be the first one requires > > creating an object? > > > > They are created behind the scenes, but are also fairly isolated from netlink > (i.e. can only be introspected, so not hidden in that sense, but are > effectively locked for replace/delete). > > The problem would be (of not creating a filter during attach) is that a typical > 'attach point' for TC exists in form of tcf_proto. If we take priority (protocol > is fixed) out of the equation, it becomes possible to attach just a single BPF > prog, but that seems like a needless limitation when TC already supports list of > filters at each 'attach point'. > > My point is that the created object (the tcf_proto under the 'chain' object) is > the attach point, and since there can be so many, keeping them around at all > times doesn't make sense, so the refcounted attach locations are created as > needed. Both netlink and bpf_link owned filters can be attached under the same > location, with different ownership story in userspace. I do not understand "created behind the scenes". These are all created independent of bpf_link, right? For example, we create a cgroup with mkdir, then open it and pass the fd to bpf_link. Clearly, cgroup is not created by bpf_link or any bpf syscall. 
The only thing different is fd, or more accurately, an identifier to locate these objects. For example, ifindex can also be used to locate a netdev. We can certainly locate a TC filter with (prio,proto,handle) but this has to be passed via netlink. So if you need some locator, I think we can introduce a kernel API which takes all necessary parameters to locate a TC filter and return it to you. For a quick example, like this: struct tcf_proto *tcf_get_proto(struct net *net, int ifindex, int parent, char* kind, int handle...); (Note, it can grab a refcnt in case of being deleted by others.) With this, you can simply call it in bpf_link, and attach bpf prog to tcf_proto (of course, only cls_bpf succeeds here). Thanks.
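With such a locator, the bpf_link attach path could collapse to something like
the following (a sketch of Cong's proposal; the bpf_link_attach tcf_proto op
and the tcf_put_proto counterpart are hypothetical, and only cls_bpf would
implement the op):

    struct tcf_proto *tp;
    int err;

    /* Grabs a reference so the filter cannot go away underneath us. */
    tp = tcf_get_proto(net, ifindex, parent, "bpf", handle);
    if (IS_ERR(tp))
            return PTR_ERR(tp);

    /* Classifiers other than cls_bpf would return -EOPNOTSUPP here. */
    err = tp->ops->bpf_link_attach(tp, link, prog);
    if (err)
            tcf_put_proto(tp);      /* hypothetical counterpart to the getter */
    return err;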
On Mon, Jun 07, 2021 at 10:48:04AM IST, Cong Wang wrote: > On Sun, Jun 6, 2021 at 8:38 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote: > > > > On Mon, Jun 07, 2021 at 05:07:28AM IST, Cong Wang wrote: > > > On Fri, May 28, 2021 at 1:00 PM Kumar Kartikeya Dwivedi > > > <memxor@gmail.com> wrote: > > > > > > > > This is the first RFC version. > > > > > > > > This adds a bpf_link path to create TC filters tied to cls_bpf classifier, and > > > > introduces fd based ownership for such TC filters. Netlink cannot delete or > > > > replace such filters, but the bpf_link is severed on indirect destruction of the > > > > filter (backing qdisc being deleted, or chain being flushed, etc.). To ensure > > > > that filters remain attached beyond process lifetime, the usual bpf_link fd > > > > pinning approach can be used. > > > > > > I have some troubles understanding this. So... why TC filter is so special > > > here that it deserves such a special treatment? > > > > > > > So the motivation behind this was automatic cleanup of filters installed by some > > program. Usually from the userspace side you need a bunch of things (handle, > > priority, protocol, chain_index, etc.) to be able to delete a filter without > > stepping on others' toes. Also, there is no gurantee that filter hasn't been > > replaced, so you need to check some other way (either tag or prog_id, but these > > are also not perfect). > > > > bpf_link provides isolation from netlink and fd-based lifetime management. As > > for why it needs special treatment (by which I guess you mean why it _creates_ > > an object instead of simply attaching to one, see below): > > Are you saying TC filter is not independent? IOW, it has to rely on TC qdisc > to exist. This is true, and is of course different with netns/cgroup. > This is perhaps > not hard to solve, because TC actions are already independent, we can perhaps > convert TC filters too (the biggest blocker is compatibility). > True, but that would mean you need some way to create a detached TC filter, correct? Can you give some ideas on how the setup would look like from userspace side? IIUC you mean RTM_NEWTFILTER (with kind == bpf) parent == SOME_MAGIC_DETACHED ifindex == INVALID then bpf_link comes in and creates the binding to the qdisc, parent, prio, chain, handle ... ? > Or do you just need an ephemeral representation of a TC filter which only exists > for a process? If so, see below. > > > > > > The reason why I ask is that none of other bpf links actually create any > > > object, they merely attach bpf program to an existing object. For example, > > > netns bpf_link does not create an netns, cgroup bpf_link does not create > > > a cgroup either. So, why TC filter is so lucky to be the first one requires > > > creating an object? > > > > > > > They are created behind the scenes, but are also fairly isolated from netlink > > (i.e. can only be introspected, so not hidden in that sense, but are > > effectively locked for replace/delete). > > > > The problem would be (of not creating a filter during attach) is that a typical > > 'attach point' for TC exists in form of tcf_proto. If we take priority (protocol > > is fixed) out of the equation, it becomes possible to attach just a single BPF > > prog, but that seems like a needless limitation when TC already supports list of > > filters at each 'attach point'. 
> > > > My point is that the created object (the tcf_proto under the 'chain' object) is > > the attach point, and since there can be so many, keeping them around at all > > times doesn't make sense, so the refcounted attach locations are created as > > needed. Both netlink and bpf_link owned filters can be attached under the same > > location, with different ownership story in userspace. > > I do not understand "created behind the scenes". These are all created > independent of bpf_link, right? For example, we create a cgroup with > mkdir, then open it and pass the fd to bpf_link. Clearly, cgroup is not > created by bpf_link or any bpf syscall. Sorry, that must be confusing. I was only referring to what this patch does. Indeed, as far as implementation is concerned this has no precedence. > > The only thing different is fd, or more accurately, an identifier to locate > these objects. For example, ifindex can also be used to locate a netdev. > We can certainly locate a TC filter with (prio,proto,handle) but this has to > be passed via netlink. So if you need some locator, I think we can > introduce a kernel API which takes all necessary parameters to locate > a TC filter and return it to you. For a quick example, like this: > > struct tcf_proto *tcf_get_proto(struct net *net, int ifindex, > int parent, char* kind, int handle...); > I think this already exists in some way, i.e. you can just ignore if filter handle from tp->ops->get doesn't exist (reusing the exsiting code) that walks from qdisc/block -> chain -> tcf_proto during creation. > (Note, it can grab a refcnt in case of being deleted by others.) > > With this, you can simply call it in bpf_link, and attach bpf prog to tcf_proto > (of course, only cls_bpf succeeds here). > So IIUC, you are proposing to first create a filter normally using netlink, then attach it using bpf_link to the proper parent? I.e. your main contention point is to not create filter from bpf_link, instead take a filter and attach it to a parent with bpf_link representing this attachment? But then the created filter stays with refcount of 1 until RTM_DELTFILTER, i.e. the lifetime of the attachment may be managed by bpf_link (in that we can detach the filter from parent) but the filter itself will not be cleaned up. One of the goals of tying TC filter to fd was to bind lifetime of filter itself, along with attachment. Separating both doesn't seem to buy anything here. You always create a filter to attach somewhere. With actions, things are different, you may create one action but bind it to multiple filters, so actions existing as their own thing makes sense. A single action can serve multiple filters, and save on memory. You could argue that even with filters this is true, as you may want to attach the same filter to multiple qdiscs, but we already have a facility to do that (shared tcf_block with block->q == NULL). However that is not as flexible as what you are proposing. It may be odd from the kernel side but to userspace a parent, prio, handle (we don't let user choose anything else for now) is itself the attach point, how bpf_link manages the attachment internally isn't really that interesting. It does so now by way of creating an object that represents a certain hook, then binding the BPF prog to it. I consider this mostly an implementation detail. What you are really attaching to is the qdisc/block, which is the resource analogous to cgroup fd, netns fd, and ifindex, and 'where' is described by other attributes. > Thanks. -- Kartikeya
On Sun, Jun 6, 2021 at 11:08 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote: > > On Mon, Jun 07, 2021 at 10:48:04AM IST, Cong Wang wrote: > > On Sun, Jun 6, 2021 at 8:38 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote: > > > > > > On Mon, Jun 07, 2021 at 05:07:28AM IST, Cong Wang wrote: > > > > On Fri, May 28, 2021 at 1:00 PM Kumar Kartikeya Dwivedi > > > > <memxor@gmail.com> wrote: > > > > > > > > > > This is the first RFC version. > > > > > > > > > > This adds a bpf_link path to create TC filters tied to cls_bpf classifier, and > > > > > introduces fd based ownership for such TC filters. Netlink cannot delete or > > > > > replace such filters, but the bpf_link is severed on indirect destruction of the > > > > > filter (backing qdisc being deleted, or chain being flushed, etc.). To ensure > > > > > that filters remain attached beyond process lifetime, the usual bpf_link fd > > > > > pinning approach can be used. > > > > > > > > I have some troubles understanding this. So... why TC filter is so special > > > > here that it deserves such a special treatment? > > > > > > > > > > So the motivation behind this was automatic cleanup of filters installed by some > > > program. Usually from the userspace side you need a bunch of things (handle, > > > priority, protocol, chain_index, etc.) to be able to delete a filter without > > > stepping on others' toes. Also, there is no gurantee that filter hasn't been > > > replaced, so you need to check some other way (either tag or prog_id, but these > > > are also not perfect). > > > > > > bpf_link provides isolation from netlink and fd-based lifetime management. As > > > for why it needs special treatment (by which I guess you mean why it _creates_ > > > an object instead of simply attaching to one, see below): > > > > Are you saying TC filter is not independent? IOW, it has to rely on TC qdisc > > to exist. This is true, and is of course different with netns/cgroup. > > This is perhaps > > not hard to solve, because TC actions are already independent, we can perhaps > > convert TC filters too (the biggest blocker is compatibility). > > > > True, but that would mean you need some way to create a detached TC filter, correct? > Can you give some ideas on how the setup would look like from userspace side? > > IIUC you mean > > RTM_NEWTFILTER (with kind == bpf) parent == SOME_MAGIC_DETACHED ifindex == INVALID > > then bpf_link comes in and creates the binding to the qdisc, parent, prio, > chain, handle ... ? Roughly yes, except creation is still done by netlink, not bpf_link. It is pretty much similar to those unbound TC actions. > > > Or do you just need an ephemeral representation of a TC filter which only exists > > for a process? If so, see below. > > > > > > > > > The reason why I ask is that none of other bpf links actually create any > > > > object, they merely attach bpf program to an existing object. For example, > > > > netns bpf_link does not create an netns, cgroup bpf_link does not create > > > > a cgroup either. So, why TC filter is so lucky to be the first one requires > > > > creating an object? > > > > > > > > > > They are created behind the scenes, but are also fairly isolated from netlink > > > (i.e. can only be introspected, so not hidden in that sense, but are > > > effectively locked for replace/delete). > > > > > > The problem would be (of not creating a filter during attach) is that a typical > > > 'attach point' for TC exists in form of tcf_proto. 
If we take priority (protocol > > > is fixed) out of the equation, it becomes possible to attach just a single BPF > > > prog, but that seems like a needless limitation when TC already supports list of > > > filters at each 'attach point'. > > > > > > My point is that the created object (the tcf_proto under the 'chain' object) is > > > the attach point, and since there can be so many, keeping them around at all > > > times doesn't make sense, so the refcounted attach locations are created as > > > needed. Both netlink and bpf_link owned filters can be attached under the same > > > location, with different ownership story in userspace. > > > > I do not understand "created behind the scenes". These are all created > > independent of bpf_link, right? For example, we create a cgroup with > > mkdir, then open it and pass the fd to bpf_link. Clearly, cgroup is not > > created by bpf_link or any bpf syscall. > > Sorry, that must be confusing. I was only referring to what this patch does. > Indeed, as far as implementation is concerned this has no precedence. > > > > > The only thing different is fd, or more accurately, an identifier to locate > > these objects. For example, ifindex can also be used to locate a netdev. > > We can certainly locate a TC filter with (prio,proto,handle) but this has to > > be passed via netlink. So if you need some locator, I think we can > > introduce a kernel API which takes all necessary parameters to locate > > a TC filter and return it to you. For a quick example, like this: > > > > struct tcf_proto *tcf_get_proto(struct net *net, int ifindex, > > int parent, char* kind, int handle...); > > > > I think this already exists in some way, i.e. you can just ignore if filter > handle from tp->ops->get doesn't exist (reusing the exsiting code) that walks > from qdisc/block -> chain -> tcf_proto during creation. Right, except currently it requires a few API's to reach TC filters (first netdev,, then qdisc, finally filters). So, I think providing one API could at least address your "stepping on others toes" concern? > > > (Note, it can grab a refcnt in case of being deleted by others.) > > > > With this, you can simply call it in bpf_link, and attach bpf prog to tcf_proto > > (of course, only cls_bpf succeeds here). > > > > So IIUC, you are proposing to first create a filter normally using netlink, then > attach it using bpf_link to the proper parent? I.e. your main contention point > is to not create filter from bpf_link, instead take a filter and attach it to a > parent with bpf_link representing this attachment? Yes, to me I don't see a reason we want to create it from bpf_link. > > But then the created filter stays with refcount of 1 until RTM_DELTFILTER, i.e. > the lifetime of the attachment may be managed by bpf_link (in that we can detach > the filter from parent) but the filter itself will not be cleaned up. One of the > goals of tying TC filter to fd was to bind lifetime of filter itself, along with > attachment. Separating both doesn't seem to buy anything here. You always create > a filter to attach somewhere. This is really odd, for two reasons: 1) Why netdev does not have such problem? bpf_xdp_link_attach() uses ifindex to locate a netdev, without creating it or cleaning it either. So, why do we never want to bind a netdev to an fd? IOW, what makes TC filters' lifetime so different from netdev? 2) All existing bpf_link targets, except netdev, are fs based, hence an fd makes sense for them naturally. 
TC filters, or any other netlink based things, are not even related to fs, hence fd does not make sense here, like we never bind a netdev to a fd. > > With actions, things are different, you may create one action but bind it to > multiple filters, so actions existing as their own thing makes sense. A single > action can serve multiple filters, and save on memory. > > You could argue that even with filters this is true, as you may want to attach > the same filter to multiple qdiscs, but we already have a facility to do that > (shared tcf_block with block->q == NULL). However that is not as flexible as > what you are proposing. True. I think making TC filters as standalone as TC actions is a right direction, if it helps you too. > > It may be odd from the kernel side but to userspace a parent, prio, handle (we > don't let user choose anything else for now) is itself the attach point, how > bpf_link manages the attachment internally isn't really that interesting. It > does so now by way of creating an object that represents a certain hook, then > binding the BPF prog to it. I consider this mostly an implementation detail. > What you are really attaching to is the qdisc/block, which is the resource > analogous to cgroup fd, netns fd, and ifindex, and 'where' is described by other > attributes. How do you establish the analogy here? cgroup and netns are fs based, having an fd is natural. ifindex is not an fd, it is a locator for netdev. Plus, current bpf_link code does not create any of them. Thanks.
On Tue, Jun 08, 2021 at 07:30:40AM IST, Cong Wang wrote: > On Sun, Jun 6, 2021 at 11:08 PM Kumar Kartikeya Dwivedi > <memxor@gmail.com> wrote: > > > > On Mon, Jun 07, 2021 at 10:48:04AM IST, Cong Wang wrote: > > > On Sun, Jun 6, 2021 at 8:38 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote: > > > > > > > > On Mon, Jun 07, 2021 at 05:07:28AM IST, Cong Wang wrote: > > > > > On Fri, May 28, 2021 at 1:00 PM Kumar Kartikeya Dwivedi > > > > > <memxor@gmail.com> wrote: > > > > > > > > > > > > This is the first RFC version. > > > > > > > > > > > > This adds a bpf_link path to create TC filters tied to cls_bpf classifier, and > > > > > > introduces fd based ownership for such TC filters. Netlink cannot delete or > > > > > > replace such filters, but the bpf_link is severed on indirect destruction of the > > > > > > filter (backing qdisc being deleted, or chain being flushed, etc.). To ensure > > > > > > that filters remain attached beyond process lifetime, the usual bpf_link fd > > > > > > pinning approach can be used. > > > > > > > > > > I have some troubles understanding this. So... why TC filter is so special > > > > > here that it deserves such a special treatment? > > > > > > > > > > > > > So the motivation behind this was automatic cleanup of filters installed by some > > > > program. Usually from the userspace side you need a bunch of things (handle, > > > > priority, protocol, chain_index, etc.) to be able to delete a filter without > > > > stepping on others' toes. Also, there is no gurantee that filter hasn't been > > > > replaced, so you need to check some other way (either tag or prog_id, but these > > > > are also not perfect). > > > > > > > > bpf_link provides isolation from netlink and fd-based lifetime management. As > > > > for why it needs special treatment (by which I guess you mean why it _creates_ > > > > an object instead of simply attaching to one, see below): > > > > > > Are you saying TC filter is not independent? IOW, it has to rely on TC qdisc > > > to exist. This is true, and is of course different with netns/cgroup. > > > This is perhaps > > > not hard to solve, because TC actions are already independent, we can perhaps > > > convert TC filters too (the biggest blocker is compatibility). > > > > > > > True, but that would mean you need some way to create a detached TC filter, correct? > > Can you give some ideas on how the setup would look like from userspace side? > > > > IIUC you mean > > > > RTM_NEWTFILTER (with kind == bpf) parent == SOME_MAGIC_DETACHED ifindex == INVALID > > > > then bpf_link comes in and creates the binding to the qdisc, parent, prio, > > chain, handle ... ? > > Roughly yes, except creation is still done by netlink, not bpf_link. It is > pretty much similar to those unbound TC actions. > Right, thanks for explaining. I will try to work on this and see if it works out. > > > > > Or do you just need an ephemeral representation of a TC filter which only exists > > > for a process? If so, see below. > > > > > > > > > > > > The reason why I ask is that none of other bpf links actually create any > > > > > object, they merely attach bpf program to an existing object. For example, > > > > > netns bpf_link does not create an netns, cgroup bpf_link does not create > > > > > a cgroup either. So, why TC filter is so lucky to be the first one requires > > > > > creating an object? > > > > > > > > > > > > > They are created behind the scenes, but are also fairly isolated from netlink > > > > (i.e. 
can only be introspected, so not hidden in that sense, but are > > > > effectively locked for replace/delete). > > > > > > > > The problem would be (of not creating a filter during attach) is that a typical > > > > 'attach point' for TC exists in form of tcf_proto. If we take priority (protocol > > > > is fixed) out of the equation, it becomes possible to attach just a single BPF > > > > prog, but that seems like a needless limitation when TC already supports list of > > > > filters at each 'attach point'. > > > > > > > > My point is that the created object (the tcf_proto under the 'chain' object) is > > > > the attach point, and since there can be so many, keeping them around at all > > > > times doesn't make sense, so the refcounted attach locations are created as > > > > needed. Both netlink and bpf_link owned filters can be attached under the same > > > > location, with different ownership story in userspace. > > > > > > I do not understand "created behind the scenes". These are all created > > > independent of bpf_link, right? For example, we create a cgroup with > > > mkdir, then open it and pass the fd to bpf_link. Clearly, cgroup is not > > > created by bpf_link or any bpf syscall. > > > > Sorry, that must be confusing. I was only referring to what this patch does. > > Indeed, as far as implementation is concerned this has no precedence. > > > > > > > > The only thing different is fd, or more accurately, an identifier to locate > > > these objects. For example, ifindex can also be used to locate a netdev. > > > We can certainly locate a TC filter with (prio,proto,handle) but this has to > > > be passed via netlink. So if you need some locator, I think we can > > > introduce a kernel API which takes all necessary parameters to locate > > > a TC filter and return it to you. For a quick example, like this: > > > > > > struct tcf_proto *tcf_get_proto(struct net *net, int ifindex, > > > int parent, char* kind, int handle...); > > > > > > > I think this already exists in some way, i.e. you can just ignore if filter > > handle from tp->ops->get doesn't exist (reusing the exsiting code) that walks > > from qdisc/block -> chain -> tcf_proto during creation. > > Right, except currently it requires a few API's to reach TC filters > (first netdev,, > then qdisc, finally filters). So, I think providing one API could at > least address > your "stepping on others toes" concern? > > > > > > (Note, it can grab a refcnt in case of being deleted by others.) > > > > > > With this, you can simply call it in bpf_link, and attach bpf prog to tcf_proto > > > (of course, only cls_bpf succeeds here). > > > > > > > So IIUC, you are proposing to first create a filter normally using netlink, then > > attach it using bpf_link to the proper parent? I.e. your main contention point > > is to not create filter from bpf_link, instead take a filter and attach it to a > > parent with bpf_link representing this attachment? > > Yes, to me I don't see a reason we want to create it from bpf_link. > > > > > But then the created filter stays with refcount of 1 until RTM_DELTFILTER, i.e. > > the lifetime of the attachment may be managed by bpf_link (in that we can detach > > the filter from parent) but the filter itself will not be cleaned up. One of the > > goals of tying TC filter to fd was to bind lifetime of filter itself, along with > > attachment. Separating both doesn't seem to buy anything here. You always create > > a filter to attach somewhere. 
> > This is really odd, for two reasons: > > 1) Why netdev does not have such problem? bpf_xdp_link_attach() uses > ifindex to locate a netdev, without creating it or cleaning it either. > So, why do we > never want to bind a netdev to an fd? IOW, what makes TC filters' lifetime so > different from netdev? > I think I tried to explain the difference, but I may have failed. netdev does not have this problem because netdev is to XDP prog what qdisc is to a SCHED_CLS prog. The filter is merely a way to hook into the qdisc. So we bind the attachment's lifetime to the filter's lifetime, which in turn is controlled by the bpf_link fd. When the filter is gone, the attachment to the qdisc is gone. So we're not really creating a qdisc here, we're just tying the filter (which in the current semantics exists only while attached) to the bpf_link. The filter is the attachment, so tying its lifetime to bpf_link makes sense. When you destroy the bpf_link, the filter goes away too, which means classification at that hook (parent/class) in the qdisc stops working. This is why creating the filter from the bpf_link made sense to me. I hope you can see where I was going with this now. Introducing a new kind of method to attach to qdisc didn't seem wise to me, given all the infrastructure already exists. > 2) All existing bpf_link targets, except netdev, are fs based, hence an fd makes > sense for them naturally. TC filters, or any other netlink based > things, are not even > related to fs, hence fd does not make sense here, like we never bind a netdev > to a fd. > Yes, none of them create any objects. It is only a side effect of current semantics that you are able to control the filter's lifetime using the bpf_link as filter creation is also accompanied with its attachment to the qdisc. Your unbound filter idea just separates the two. One will still end up creating a cls_bpf_prog object internally in the kernel, just that it will now be refcounted and be linked into multiple tcf_proto (based on how many bpf_link's are attached). Another additional responsibility of the user space is to now clean up these unbound filters when it is done using them (either right after making a bpf_link attachment so that it is removed on bpf_link destruction, or later), because they don't sit under any chain etc. so a full flush of filters won't remove them. > > > > With actions, things are different, you may create one action but bind it to > > multiple filters, so actions existing as their own thing makes sense. A single > > action can serve multiple filters, and save on memory. > > > > You could argue that even with filters this is true, as you may want to attach > > the same filter to multiple qdiscs, but we already have a facility to do that > > (shared tcf_block with block->q == NULL). However that is not as flexible as > > what you are proposing. > > True. I think making TC filters as standalone as TC actions is a right > direction, > if it helps you too. > > > > > It may be odd from the kernel side but to userspace a parent, prio, handle (we > > don't let user choose anything else for now) is itself the attach point, how > > bpf_link manages the attachment internally isn't really that interesting. It > > does so now by way of creating an object that represents a certain hook, then > > binding the BPF prog to it. I consider this mostly an implementation detail. 
> > What you are really attaching to is the qdisc/block, which is the resource > > analogous to cgroup fd, netns fd, and ifindex, and 'where' is described by other > > attributes. > > How do you establish the analogy here? cgroup and netns are fs based, > having an fd is natural. ifindex is not an fd, it is a locator for netdev. Plus, > current bpf_link code does not create any of them. > > Thanks. -- Kartikeya
On Tue, Jun 8, 2021 at 12:20 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote: > > > 2) All existing bpf_link targets, except netdev, are fs based, hence an fd makes > > sense for them naturally. TC filters, or any other netlink based fs analogy is not applicable. bpf_link-s for tracing and xdp have nothing to do with file systems. > > things, are not even > > related to fs, hence fd does not make sense here, like we never bind a netdev > > to a fd. > > > > Yes, none of them create any objects. It is only a side effect of current > semantics that you are able to control the filter's lifetime using the bpf_link > as filter creation is also accompanied with its attachment to the qdisc. I think it makes sense to create these objects as part of establishing bpf_link. ingress qdisc is a fake qdisc anyway. If we could go back in time I would argue that its existence doesn't need to be shown in iproute2. It's an object that serves no purpose other than attaching filters to it. It doesn't do any queuing unlike real qdiscs. It's an artifact of old choices. Old doesn't mean good. The kernel is full of such quirks and oddities. New api-s shouldn't blindly follow them. tc qdisc add dev eth0 clsact is a useless command with nop effect.
On Tue, Jun 8, 2021 at 12:20 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> So we're not really creating a qdisc here, we're just tying the filter (which in
> the current semantics exists only while attached) to the bpf_link. The filter is
> the attachment, so tying its lifetime to bpf_link makes sense. When you destroy
> the bpf_link, the filter goes away too, which means classification at that
> hook (parent/class) in the qdisc stops working. This is why creating the filter
> from the bpf_link made sense to me.

I see why you are creating TC filters now: you are trying to force the
lifetime of a bpf target to align with the bpf program itself. The deeper
reason seems to be that a cls_bpf filter looks so small that it appears to you
that it has nothing but a bpf_prog, right?

I offer two different views here:

1. If you view a TC filter as an instance, like a netdev/qdisc/action, they
are no different from this perspective. Maybe the fact that a TC filter
resides in a qdisc makes a slight difference here, but like I mentioned, it
actually makes sense to let TC filters be standalone; qdiscs just have to bind
with them, like how we bind TC filters with standalone TC actions. These are
all updated independently, despite some of them residing in another. There
should not be an exceptional TC filter which cannot be updated via the
`tc filter` command.

2. For cls_bpf specifically, it is also an instance, like all other TC
filters. You can update it in the same way: tc filter change [...] The only
difference is that a bpf program can attach to such an instance, so you can
view the bpf program attached to cls_bpf as a property of it. From this point
of view, there is no difference from XDP and netdev, where an XDP program
attached to a netdev is also a property of the netdev. A netdev can still
function without XDP; same for cls_bpf, which can just be a nop that returns
TC_ACT_SHOT (or whatever) if no bpf program is attached. Thus, the lifetime of
a bpf program can be separated from the target it attaches to, like all other
bpf_link targets. bpf_link is just a supplement to `tc filter change cls_bpf`,
not a replacement for it.

This is actually simpler: you do not need to worry about whether the netdev is
destroyed when you detach the XDP bpf_link anyway, and the same holds for
cls_bpf filters. Likewise, TC filters don't need to worry about any associated
bpf_links.

Thanks.
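Under this view, the classify path of a standalone cls_bpf instance would
degrade gracefully when nothing is attached, roughly like this (a sketch only;
the bpf_prog member on tcf_proto is invented for illustration, it is not an
existing field):

    static int cls_bpf_classify(struct sk_buff *skb,
                                const struct tcf_proto *tp,
                                struct tcf_result *res)
    {
            /* Hypothetical field: the attached prog as a detachable
             * property of the filter instance.
             */
            struct bpf_prog *filter = rcu_dereference_bh(tp->bpf_prog);

            /* The filter keeps functioning, as a nop verdict, without
             * any bpf program attached.
             */
            if (!filter)
                    return TC_ACT_SHOT;  /* or TC_ACT_UNSPEC; a policy choice */

            return bpf_prog_run(filter, skb);
    }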
On Tue, Jun 8, 2021 at 8:39 AM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > I think it makes sense to create these objects as part of establishing bpf_link. > ingress qdisc is a fake qdisc anyway. > If we could go back in time I would argue that its existence doesn't > need to be shown in iproute2. It's an object that serves no purpose > other than attaching filters to it. It doesn't do any queuing unlike > real qdiscs. > It's an artifact of old choices. Old doesn't mean good. > The kernel is full of such quirks and oddities. New api-s shouldn't > blindly follow them. > tc qdisc add dev eth0 clsact > is a useless command with nop effect. Sounds like you just need a new bpf attach point outside of TC, probably inside __dev_queue_xmit(). You don't need to create any object, probably just need to attach it to a netdev. Thanks.
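Such an attach point might amount to little more than this in the xmit path
(purely a sketch of the suggestion; the bpf_egress_prog netdev field is
invented for illustration):

    /* Early in __dev_queue_xmit(), before entering the qdisc layer */
    struct bpf_prog *egress = rcu_dereference_bh(dev->bpf_egress_prog);

    if (egress) {
            switch (bpf_prog_run(egress, skb)) {
            case TC_ACT_OK:
                    break;                  /* continue down the stack */
            case TC_ACT_SHOT:
            default:
                    kfree_skb(skb);
                    return NET_XMIT_DROP;
            }
    }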
On Fri, Jun 11, 2021 at 07:30:49AM IST, Cong Wang wrote: > On Tue, Jun 8, 2021 at 12:20 AM Kumar Kartikeya Dwivedi > <memxor@gmail.com> wrote: > > > > So we're not really creating a qdisc here, we're just tying the filter (which in > > the current semantics exists only while attached) to the bpf_link. The filter is > > the attachment, so tying its lifetime to bpf_link makes sense. When you destroy > > the bpf_link, the filter goes away too, which means classification at that > > hook (parent/class) in the qdisc stops working. This is why creating the filter > > from the bpf_link made sense to me. > > I see why you are creating TC filters now, because you are trying to > force the lifetime of a bpf target to align with the bpf program itself. > The deeper reason seems to be that a cls_bpf filter looks so small > that it appears to you that it has nothing but a bpf_prog, right? > Yes, pretty much. > I offer two different views here: > > 1. If you view a TC filter as an instance as a netdev/qdisc/action, they > are no different from this perspective. Maybe the fact that a TC filter > resides in a qdisc makes a slight difference here, but like I mentioned, it > actually makes sense to let TC filters be standalone, qdisc's just have to > bind with them, like how we bind TC filters with standalone TC actions. You propose something different below IIUC, but I explained why I'm wary of these unbound filters. They seem to add a step to classifier setup for no real benefit to the user (except keeping track of one more object and cleaning it up with the link when done). I understand that the filter is very much an object of its own and why keeping them unbound makes sense, but for the user there is no real benefit of this scheme (some things like classid attribute are contextual in that they make sense to be set based on what parent we're attaching to). > These are all updated independently, despite some of them residing in > another. There should not be an exceptional TC filter which can not > be updated via `tc filter` command. I see, but I'm mirroring what was done for XDP bpf_link. Besides, flush still works, it's only that manipulating a filter managed by bpf_link is not allowed, which sounds reasonable to me, given we're bringing new ownership semantics here which didn't exist before with netlink, so it doesn't make sense to allow netlink to simply invalidate the filter installed by some other program. You wouldn't do something like that for a cooperating setup, we're just enforcing that using -EPERM (bpf_link is not allowed to replace netlink installed filters either, so it goes both ways). > > 2. For cls_bpf specifically, it is also an instance, like all other TC filters. > You can update it in the same way: tc filter change [...] The only difference > is a bpf program can attach to such an instance. So you can view the bpf > program attached to cls_bpf as a property of it. From this point of view, > there is no difference with XDP to netdev, where an XDP program > attached to a netdev is also a property of netdev. A netdev can still > function without XDP. Same for cls_bpf, it can be just a nop returns > TC_ACT_SHOT (or whatever) if no ppf program is attached. Thus, > the lifetime of a bpf program can be separated from the target it > attaches too, like all other bpf_link targets. bpf_link is just a > supplement to `tc filter change cls_bpf`, not to replace it. 
> So this is different now, as in the filter is attached as usual but bpf_link represents attachment of bpf prog to the filter itself, not the filter to the qdisc. To me it seems apart from not having to create the filter, this would pretty much be equivalent to where I hook the bpf_link right now? TBF, this split doesn't really seem to be bringing anything to the table (except maybe preserving netlink as the only way to manipulate filter properties) and keeping filters as separate objects. I can understand your position but for the user it's just more and more objects to keep track of with no proper ownership/cleanup semantics. Though considering it for cls_bpf in particular, there are mainly four things you would want to tc filter change: * Integrated actions These are not allowed anyway, we force enable direct action mode, and I don't plan on opening up actions for this if it gets accepted. Anything missing we'll try to make it work in eBPF (act_ct etc.) * classid cls_bpf has a good alternative of instead manipulating __sk_buff::tc_classid * skip_hw/skip_sw Not supported for now, but can be done using flags in BPF_LINK_UPDATE * BPF program Already works using BPF_LINK_UPDATE So bpf_link isn't really prohibitive in any way. Doing it your way also complicates cleanup of the filter (in case we don't want to leave it attached), because it is hard to know who closes the link_fd last. Closing it earlier would break the link for existing users, not doing it would leave around an unused object (which can accumulate if we use auto allocation of filter priority). Counting existing links is racy. This is better done in the kernel than worked around in userspace, as part of attachment. > This is actually simpler, you do not need to worry about whether > netdev is destroyed when you detach the XDP bpf_link anyway, > same for cls_bpf filters. Likewise, TC filters don't need to worry > about bpf_links associated. > > Thanks. -- Kartikeya
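As a sketch of the classid point above: a minimal direct-action cls_bpf program that picks the class itself via __sk_buff::tc_classid instead of the filter's netlink-configured classid attribute (the class 1:10 here is an arbitrary example):

  #include <linux/bpf.h>
  #include <linux/pkt_cls.h>
  #include <linux/pkt_sched.h>
  #include <bpf/bpf_helpers.h>

  /* Sketch: in direct-action mode the program can set the class
   * itself by writing skb->tc_classid before returning a verdict.
   */
  SEC("classifier")
  int set_class(struct __sk_buff *skb)
  {
          skb->tc_classid = TC_H_MAKE(1 << 16, 0x10); /* class 1:10 */
          return TC_ACT_OK;
  }

  char LICENSE[] SEC("license") = "GPL";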
On Fri, Jun 11, 2021 at 07:30:49AM IST, Cong Wang wrote: > I see why you are creating TC filters now, because you are trying to > force the lifetime of a bpf target to align with the bpf program itself. > The deeper reason seems to be that a cls_bpf filter looks so small > that it appears to you that it has nothing but a bpf_prog, right? > Just to clarify on this further, the BPF program still has its own lifetime: the link takes a reference, and the filter also takes a reference on it (since it assumes ownership, so it was easier that way). When releasing the bpf_link, if the prog pointer is set, we also detach the TC filter (which releases its reference on the prog). The link on destruction releases its own reference. So the rest of the refcounting will depend on userspace holding/pinning the fd or not. -- Kartikeya
Hi, Sorry - but i havent kept up with some of the discussion, so it is possible I may be misunderstanding some things you mention in passing below (example that you only support da mode or the classid being able to be handled differently etc). XDP may not be the best model to follow since some things that exist in the tc architecture (example ability to have multi-programs) seem to be plumbed in later (mostly because the original design intent for XDP was to make it simple and then deployment follow and more features get added) Integrating tc into libbpf is a definite bonus that allows for a unified programmatic interface and a singular loading mechanism - but it wasnt clear why we lose some features that tc provides; we have them today with current tc based loading scheme. I certainly use the non-da scheme because over time it became clear that complex programs (not necessarily large code size) are a challenge with ebpf and using existing tc actions is valuable. Also, multiple priorities are important for the same reason - you can work around them in your singular ebpf program but sooner than later you will run out of "tricks". We do have this monthly tc meetup every second monday of the month. Unfortunately it is short notice since the next one is monday 12pm eastern time. Maybe you can show up and a high bandwidth discussion (aka voice) would help? cheers, jamal On 2021-06-12 10:53 p.m., Kumar Kartikeya Dwivedi wrote: > On Fri, Jun 11, 2021 at 07:30:49AM IST, Cong Wang wrote: >> On Tue, Jun 8, 2021 at 12:20 AM Kumar Kartikeya Dwivedi >> <memxor@gmail.com> wrote: >>> >>> So we're not really creating a qdisc here, we're just tying the filter (which in >>> the current semantics exists only while attached) to the bpf_link. The filter is >>> the attachment, so tying its lifetime to bpf_link makes sense. When you destroy >>> the bpf_link, the filter goes away too, which means classification at that >>> hook (parent/class) in the qdisc stops working. This is why creating the filter >>> from the bpf_link made sense to me. >> >> I see why you are creating TC filters now, because you are trying to >> force the lifetime of a bpf target to align with the bpf program itself. >> The deeper reason seems to be that a cls_bpf filter looks so small >> that it appears to you that it has nothing but a bpf_prog, right? >> > > Yes, pretty much. > >> I offer two different views here: >> >> 1. If you view a TC filter as an instance as a netdev/qdisc/action, they >> are no different from this perspective. Maybe the fact that a TC filter >> resides in a qdisc makes a slight difference here, but like I mentioned, it >> actually makes sense to let TC filters be standalone, qdisc's just have to >> bind with them, like how we bind TC filters with standalone TC actions. > > You propose something different below IIUC, but I explained why I'm wary of > these unbound filters. They seem to add a step to classifier setup for no real > benefit to the user (except keeping track of one more object and cleaning it > up with the link when done). > > I understand that the filter is very much an object of its own and why keeping > them unbound makes sense, but for the user there is no real benefit of this > scheme (some things like classid attribute are contextual in that they make > sense to be set based on what parent we're attaching to). > >> These are all updated independently, despite some of them residing in >> another. There should not be an exceptional TC filter which can not >> be updated via `tc filter` command.
> > I see, but I'm mirroring what was done for XDP bpf_link. > > Besides, flush still works, it's only that manipulating a filter managed by > bpf_link is not allowed, which sounds reasonable to me, given we're bringing > new ownership semantics here which didn't exist before with netlink, so it > doesn't make sense to allow netlink to simply invalidate the filter installed by > some other program. > > You wouldn't do something like that for a cooperating setup, we're just > enforcing that using -EPERM (bpf_link is not allowed to replace netlink > installed filters either, so it goes both ways). > >> >> 2. For cls_bpf specifically, it is also an instance, like all other TC filters. >> You can update it in the same way: tc filter change [...] The only difference >> is a bpf program can attach to such an instance. So you can view the bpf >> program attached to cls_bpf as a property of it. From this point of view, >> there is no difference with XDP to netdev, where an XDP program >> attached to a netdev is also a property of netdev. A netdev can still >> function without XDP. Same for cls_bpf, it can be just a nop that returns >> TC_ACT_SHOT (or whatever) if no bpf program is attached. Thus, >> the lifetime of a bpf program can be separated from the target it >> attaches to, like all other bpf_link targets. bpf_link is just a >> supplement to `tc filter change cls_bpf`, not to replace it. >> > > So this is different now, as in the filter is attached as usual but bpf_link > represents attachment of bpf prog to the filter itself, not the filter to the > qdisc. > > To me it seems apart from not having to create the filter, this would pretty much be > equivalent to where I hook the bpf_link right now? > > TBF, this split doesn't really seem to be bringing anything to the table (except > maybe preserving netlink as the only way to manipulate filter properties) and > keeping filters as separate objects. I can understand your position but for the > user it's just more and more objects to keep track of with no proper > ownership/cleanup semantics. > > Though considering it for cls_bpf in particular, there are mainly four things > you would want to tc filter change: > > * Integrated actions > These are not allowed anyway, we force enable direct action mode, and I don't > plan on opening up actions for this if it gets accepted. Anything missing > we'll try to make it work in eBPF (act_ct etc.) > > * classid > cls_bpf has a good alternative of instead manipulating __sk_buff::tc_classid > > * skip_hw/skip_sw > Not supported for now, but can be done using flags in BPF_LINK_UPDATE > > * BPF program > Already works using BPF_LINK_UPDATE > > So bpf_link isn't really prohibitive in any way. > > Doing it your way also complicates cleanup of the filter (in case we don't want > to leave it attached), because it is hard to know who closes the link_fd last. > Closing it earlier would break the link for existing users, not doing it would > leave around an unused object (which can accumulate if we use auto allocation of > filter priority). Counting existing links is racy. > > This is better done in the kernel than worked around in userspace, as part of > attachment. > >> This is actually simpler, you do not need to worry about whether >> netdev is destroyed when you detach the XDP bpf_link anyway, >> same for cls_bpf filters. Likewise, TC filters don't need to worry >> about bpf_links associated. >> >> Thanks. > > -- > Kartikeya >
On Mon, Jun 14, 2021 at 01:57:16AM IST, Jamal Hadi Salim wrote: > Hi, > > Sorry - but i havent kept up with some of the discussion, so it is possible I may be misunderstanding some things you mention > in passing below (example that you only support da mode or the classid being > able to be handled differently etc). > XDP may not be the best model to follow since some things that exist > in the tc architecture (example ability to have multi-programs) > seem to be plumbed in later (mostly because the original design intent > for XDP was to make it simple and then deployment follow and more > features get added) > > Integrating tc into libbpf is a definite bonus that allows for a > unified programmatic interface and a singular loading mechanism - but > it wasnt clear why we lose some features that tc provides; we have > them today with current tc based loading scheme. I certainly use the > non-da scheme because over time it became clear that complex > programs (not necessarily large code size) are a challenge with ebpf > and using existing tc actions is valuable. > Also, multiple priorities are important for the same reason - you > can work around them in your singular ebpf program but sooner than > later you will run out of "tricks". > Right, also I'm just posting so that the use cases I care about are clear, and why they are not being fulfilled in some other way. How to do it is of course up to TC and BPF maintainers, which is why I'm still waiting on feedback from you, Cong and others before posting the next version. > We do have this monthly tc meetup every second monday of the month. > Unfortunately it is short notice since the next one is monday 12pm > eastern time. Maybe you can show up and a high bandwidth discussion > (aka voice) would help? > That would be best, please let me know how to join tomorrow. There are a few other things I was working on that I also want to discuss along with this. > cheers, > jamal >
On 2021-06-13 4:34 p.m., Kumar Kartikeya Dwivedi wrote: > On Mon, Jun 14, 2021 at 01:57:16AM IST, Jamal Hadi Salim wrote: > > Right, also I'm just posting so that the use cases I care about are clear, and > why they are not being fulfilled in some other way. How to do it is of course up > to TC and BPF maintainers, which is why I'm still waiting on feedback from you, > Cong and others before posting the next version. > I look at it from the perspective that if i can run something with existing tc loading mechanism then i should be able to do the same with the new (libbpf) scheme. >> We do have this monthly tc meetup every second monday of the month. >> Unfortunately it is short notice since the next one is monday 12pm >> eastern time. Maybe you can show up and a high bandwidth discussion >> (aka voice) would help? >> > > That would be best, please let me know how to join tomorrow. There are a few > other things I was working on that I also want to discuss along with this. > That would be great - thanks for your understanding. +Cc Marcelo (who is the keeper of the meetup) in case the link may change. cheers, jamal
On Sun, Jun 13, 2021 at 05:10:14PM -0400, Jamal Hadi Salim wrote: > On 2021-06-13 4:34 p.m., Kumar Kartikeya Dwivedi wrote: > > On Mon, Jun 14, 2021 at 01:57:16AM IST, Jamal Hadi Salim wrote: > > > We do have this monthly tc meetup every second monday of the month. > > > Unfortunately it is short notice since the next one is monday 12pm > > > eastern time. Maybe you can show up and a high bandwidth discussion > > > (aka voice) would help? > > > > > > > That would be best, please let me know how to join tomorrow. There are a few > > other things I was working on that I also want to discuss along with this. > > > > That would be great - thanks for your understanding. > +Cc Marcelo (who is the keeper of the meetup) > in case the link may change. We have 2 URLs for today. The official one [1] and a test one [2]. We will be testing a new video conferencing system today and depending on how it goes, we will be on one or the other. I'll try to always be present in the official one [1] to point people towards the testing one [2] in case we're there. Also, we have an agenda doc [3]. I can't openly share it with the public but if you send a request for access, I'll grant it. 1. https://meet.kernel.social/tc-meetup 2. https://www.airmeet.com/e/2494c770-cc8c-11eb-830b-e787c099d9c3 3. https://docs.google.com/document/d/1uUm_o7lR9jCAH0bqZ1dyscXZbIF4GN3mh1FwwIuePcM/edit# Marcelo
On Sat, Jun 12, 2021 at 7:54 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote: > > On Fri, Jun 11, 2021 at 07:30:49AM IST, Cong Wang wrote: > > On Tue, Jun 8, 2021 at 12:20 AM Kumar Kartikeya Dwivedi > > <memxor@gmail.com> wrote: > > > > > > So we're not really creating a qdisc here, we're just tying the filter (which in > > > the current semantics exists only while attached) to the bpf_link. The filter is > > > the attachment, so tying its lifetime to bpf_link makes sense. When you destroy > > > the bpf_link, the filter goes away too, which means classification at that > > > hook (parent/class) in the qdisc stops working. This is why creating the filter > > > from the bpf_link made sense to me. > > > > I see why you are creating TC filters now, because you are trying to > > force the lifetime of a bpf target to align with the bpf program itself. > > The deeper reason seems to be that a cls_bpf filter looks so small > > that it appears to you that it has nothing but a bpf_prog, right? > > > > Yes, pretty much. OK. Just in case of any misunderstanding: a cls_bpf filter has more than just a bpf prog, it inherits all other generic attributes, e.g. TC proto/prio, too from TC infra. If you can agree on this, then it is no different from netdev/cgroup/netns bpf_links. > > > I offer two different views here: > > > > 1. If you view a TC filter as an instance as a netdev/qdisc/action, they > > are no different from this perspective. Maybe the fact that a TC filter > > resides in a qdisc makes a slight difference here, but like I mentioned, it > > actually makes sense to let TC filters be standalone, qdisc's just have to > > bind with them, like how we bind TC filters with standalone TC actions. > > You propose something different below IIUC, but I explained why I'm wary of > these unbound filters. They seem to add a step to classifier setup for no real > benefit to the user (except keeping track of one more object and cleaning it > up with the link when done). I am not even sure if unbound filters help your case at all, making them unbound merely changes their residence, not ownership. You are trying to pass the ownership from TC to bpf_link, which is what I am against. > > I understand that the filter is very much an object of its own and why keeping > them unbound makes sense, but for the user there is no real benefit of this > scheme (some things like classid attribute are contextual in that they make > sense to be set based on what parent we're attaching to). > > > These are all updated independently, despite some of them residing in > > another. There should not be an exceptional TC filter which can not > > be updated via `tc filter` command. > > I see, but I'm mirroring what was done for XDP bpf_link. Really? Does XDP bpf_link create a netdev or remove it? I see none. It merely looks up netdev by attr->link_create.target_ifindex in bpf_xdp_link_attach(). Where does the "mirroring" come from? > > Besides, flush still works, it's only that manipulating a filter managed by > bpf_link is not allowed, which sounds reasonable to me, given we're bringing > new ownership semantics here which didn't exist before with netlink, so it > doesn't make sense to allow netlink to simply invalidate the filter installed by > some other program. > > You wouldn't do something like that for a cooperating setup, we're just > enforcing that using -EPERM (bpf_link is not allowed to replace netlink > installed filters either, so it goes both ways).
I think our argument is never who manages it, our argument is who owns it. By creating a TC filter from bpf_link and managed by bpf_link exclusively, the ownership pretty much goes to bpf_link. > > > > > 2. For cls_bpf specifically, it is also an instance, like all other TC filters. > > You can update it in the same way: tc filter change [...] The only difference > > is a bpf program can attach to such an instance. So you can view the bpf > > program attached to cls_bpf as a property of it. From this point of view, > > there is no difference with XDP to netdev, where an XDP program > > attached to a netdev is also a property of netdev. A netdev can still > > function without XDP. Same for cls_bpf, it can be just a nop that returns > > TC_ACT_SHOT (or whatever) if no bpf program is attached. Thus, > > the lifetime of a bpf program can be separated from the target it > > attaches to, like all other bpf_link targets. bpf_link is just a > > supplement to `tc filter change cls_bpf`, not to replace it. > > So this is different now, as in the filter is attached as usual but bpf_link > represents attachment of bpf prog to the filter itself, not the filter to the > qdisc. Yes, I think this is the right view of cls_bpf. It contains more than just a bpf prog, its generic part (struct tcf_proto) contains other attributes of this filter inherited from TC infra. And of course, TC actions can be inherited too (for non-DA). > > To me it seems apart from not having to create the filter, this would pretty much be > equivalent to where I hook the bpf_link right now? > > TBF, this split doesn't really seem to be bringing anything to the table (except > maybe preserving netlink as the only way to manipulate filter properties) and > keeping filters as separate objects. I can understand your position but for the > user it's just more and more objects to keep track of with no proper > ownership/cleanup semantics. > > Though considering it for cls_bpf in particular, there are mainly four things > you would want to tc filter change: > > * Integrated actions > These are not allowed anyway, we force enable direct action mode, and I don't > plan on opening up actions for this if it gets accepted. Anything missing > we'll try to make it work in eBPF (act_ct etc.) > > * classid > cls_bpf has a good alternative of instead manipulating __sk_buff::tc_classid > > * skip_hw/skip_sw > Not supported for now, but can be done using flags in BPF_LINK_UPDATE > > * BPF program > Already works using BPF_LINK_UPDATE Our argument is never which pieces of cls_bpf should be updated by TC or bpf_link. It is always ownership. TC should own TC filters, even its name tells so. You are trying to create TC filters with bpf_link which are not even owned by TC. And more importantly, you are not doing the same for other bpf_link targets, by singling out TC filters for no valid reason. > > So bpf_link isn't really prohibitive in any way. > > Doing it your way also complicates cleanup of the filter (in case we don't want > to leave it attached), because it is hard to know who closes the link_fd last. > Closing it earlier would break the link for existing users, not doing it would > leave around an unused object (which can accumulate if we use auto allocation of > filter priority). Counting existing links is racy. > > This is better done in the kernel than worked around in userspace, as part of > attachment. I am not proposing anything for your case, I am only explaining why creating TC filters exclusively via bpf_link does not make sense to me. Thanks.
Cong Wang <xiyou.wangcong@gmail.com> writes: >> > I offer two different views here: >> > >> > 1. If you view a TC filter as an instance as a netdev/qdisc/action, they >> > are no different from this perspective. Maybe the fact that a TC filter >> > resides in a qdisc makes a slight difference here, but like I mentioned, it >> > actually makes sense to let TC filters be standalone, qdisc's just have to >> > bind with them, like how we bind TC filters with standalone TC actions. >> >> You propose something different below IIUC, but I explained why I'm wary of >> these unbound filters. They seem to add a step to classifier setup for no real >> benefit to the user (except keeping track of one more object and cleaning it >> up with the link when done). > > I am not even sure if unbound filters help your case at all, making > them unbound merely changes their residence, not ownership. > You are trying to pass the ownership from TC to bpf_link, which > is what I am against. So what do you propose instead? bpf_link is solving a specific problem: ensuring automatic cleanup of kernel resources held by a userspace application with a BPF component. Not all applications work this way, but for the ones that do it's very useful. But if the TC filter stays around after bpf_link detaches, that kinda defeats the point of the automatic cleanup. So I don't really see any way around transferring ownership somehow. Unless you have some other idea that I'm missing? -Toke
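To make the cleanup semantics concrete, a user-space sketch: bpf_tc_link_attach() below is a made-up stand-in for whatever link-based attach API this series ends up with, while bpf_link__pin() and bpf_link__destroy() are existing libbpf calls:

  #include <stdbool.h>
  #include <bpf/libbpf.h>

  /* Hypothetical API from this series, declared here as a placeholder. */
  struct bpf_link *bpf_tc_link_attach(struct bpf_program *prog, int ifindex);

  void attach_and_maybe_pin(struct bpf_program *prog, int ifindex, bool pin)
  {
          struct bpf_link *link;

          link = bpf_tc_link_attach(prog, ifindex); /* hypothetical */
          if (!link)
                  return;

          if (pin)
                  /* pinned links survive process exit; the filter stays
                   * attached until the pin is removed
                   */
                  bpf_link__pin(link, "/sys/fs/bpf/tc_ingress_link");

          /* without a pin, this (or process exit) severs the link and
           * the kernel removes the filter automatically
           */
          bpf_link__destroy(link);
  }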
On 6/13/21 11:10 PM, Jamal Hadi Salim wrote: > On 2021-06-13 4:34 p.m., Kumar Kartikeya Dwivedi wrote: >> On Mon, Jun 14, 2021 at 01:57:16AM IST, Jamal Hadi Salim wrote: [...] >> Right, also I'm just posting so that the use cases I care about are clear, and >> why they are not being fulifilled in some other way. How to do it is ofcourse up >> to TC and BPF maintainers, which is why I'm still waiting on feedback from you, >> Cong and others before posting the next version. > > I look at it from the perspective that if i can run something with > existing tc loading mechanism then i should be able to do the same > with the new (libbpf) scheme. The intention is not to provide a full-blown tc library (that could be subject to a libtc or such), but rather to only have libbpf abstract the tc related API that is most /relevant/ for BPF program development and /efficient/ in terms of execution in fast-path while at the same time providing a good user experience from the API itself. That is, simple to use and straight forward to explain to folks with otherwise zero experience of tc. The current implementation does all that, and from experience with large BPF programs managed via cls_bpf that is all that is actually needed from tc layer perspective. The ability to have multi programs (incl. priorities) is in the existing libbpf API as well. Best, Daniel
On 6/15/21 1:54 PM, Toke Høiland-Jørgensen wrote: > Cong Wang <xiyou.wangcong@gmail.com> writes: > [...] >>>> I offer two different views here: >>>> >>>> 1. If you view a TC filter as an instance as a netdev/qdisc/action, they >>>> are no different from this perspective. Maybe the fact that a TC filter >>>> resides in a qdisc makes a slight difference here, but like I mentioned, it >>>> actually makes sense to let TC filters be standalone, qdisc's just have to >>>> bind with them, like how we bind TC filters with standalone TC actions. >>> >>> You propose something different below IIUC, but I explained why I'm wary of >>> these unbound filters. They seem to add a step to classifier setup for no real >>> benefit to the user (except keeping track of one more object and cleaning it >>> up with the link when done). >> >> I am not even sure if unbound filters help your case at all, making >> them unbound merely changes their residence, not ownership. >> You are trying to pass the ownership from TC to bpf_link, which >> is what I am against. > > So what do you propose instead? > > bpf_link is solving a specific problem: ensuring automatic cleanup of > kernel resources held by a userspace application with a BPF component. > Not all applications work this way, but for the ones that do it's very > useful. But if the TC filter stays around after bpf_link detaches, that > kinda defeats the point of the automatic cleanup. > > So I don't really see any way around transferring ownership somehow. > Unless you have some other idea that I'm missing? Just to keep on brainstorming here, I wanted to bring back Alexei's earlier quote: > I think it makes sense to create these objects as part of establishing bpf_link. > ingress qdisc is a fake qdisc anyway. > If we could go back in time I would argue that its existence doesn't > need to be shown in iproute2. It's an object that serves no purpose > other than attaching filters to it. It doesn't do any queuing unlike > real qdiscs. > It's an artifact of old choices. Old doesn't mean good. > The kernel is full of such quirks and oddities. New api-s shouldn't > blindly follow them. > tc qdisc add dev eth0 clsact > is a useless command with nop effect. The whole bpf_link in this context feels somewhat awkward because both are two different worlds, one accessible via netlink with its own lifetime etc, the other one tied to fds and bpf syscall. Back in the days we did the cls_bpf integration since it felt the most natural at that time and it had support for both the ingress and egress side, along with the direct action support which was added later to have a proper fast path for BPF. One thing that I personally never liked is that later on tc sadly became a complex, quirky dumping ground for all the nic hw offloads (I guess mainly driven from ovs side) for which I have a hard time convincing myself that this is used at scale in production. Stuff like af699626ee26 just to pick one which annoyingly also adds to the fast path given distros will just compile in most of these things (like NET_TC_SKB_EXT)... what if such a bpf_link object is not tied at all to cls_bpf or the clsact qdisc, and instead would implement tcf_classify_{egress,ingress}() as-is in that sense, similar to the bpf_lsm hooks.
These tc BPF programs would be managed only from bpf() via tc bpf_link api, and are otherwise not bothering the classic tc command (though they could be dumped there as well for the sake of visibility; bpftool would be fitting too). However, if there is something attached from classic tc side, it would also go into the old style tcf_classify_ingress() implementation and walk whatever is there so that nothing existing breaks (same as when no bpf_link would be present so that there is no extra overhead). This would also allow for a migration path of multi prog from cls_bpf to this new implementation. Details still tbd, but I would much rather like such an approach than the currently discussed one, and it would also fit better given we don't run into this current mismatch of both worlds. Thanks, Daniel
Daniel Borkmann <daniel@iogearbox.net> writes: > On 6/15/21 1:54 PM, Toke Høiland-Jørgensen wrote: >> Cong Wang <xiyou.wangcong@gmail.com> writes: > [...] >>>>> I offer two different views here: >>>>> >>>>> 1. If you view a TC filter as an instance as a netdev/qdisc/action, they >>>>> are no different from this perspective. Maybe the fact that a TC filter >>>>> resides in a qdisc makes a slight difference here, but like I mentioned, it >>>>> actually makes sense to let TC filters be standalone, qdisc's just have to >>>>> bind with them, like how we bind TC filters with standalone TC actions. >>>> >>>> You propose something different below IIUC, but I explained why I'm wary of >>>> these unbound filters. They seem to add a step to classifier setup for no real >>>> benefit to the user (except keeping track of one more object and cleaning it >>>> up with the link when done). >>> >>> I am not even sure if unbound filters help your case at all, making >>> them unbound merely changes their residence, not ownership. >>> You are trying to pass the ownership from TC to bpf_link, which >>> is what I am against. >> >> So what do you propose instead? >> >> bpf_link is solving a specific problem: ensuring automatic cleanup of >> kernel resources held by a userspace application with a BPF component. >> Not all applications work this way, but for the ones that do it's very >> useful. But if the TC filter stays around after bpf_link detaches, that >> kinda defeats the point of the automatic cleanup. >> >> So I don't really see any way around transferring ownership somehow. >> Unless you have some other idea that I'm missing? > > Just to keep on brainstorming here, I wanted to bring back Alexei's earlier quote: > > > I think it makes sense to create these objects as part of establishing bpf_link. > > ingress qdisc is a fake qdisc anyway. > > If we could go back in time I would argue that its existence doesn't > > need to be shown in iproute2. It's an object that serves no purpose > > other than attaching filters to it. It doesn't do any queuing unlike > > real qdiscs. > > It's an artifact of old choices. Old doesn't mean good. > > The kernel is full of such quirks and oddities. New api-s shouldn't > > blindly follow them. > > tc qdisc add dev eth0 clsact > > is a useless command with nop effect. > > The whole bpf_link in this context feels somewhat awkward because both are two > different worlds, one accessible via netlink with its own lifetime etc, the other > one tied to fds and bpf syscall. Back in the days we did the cls_bpf integration > since it felt the most natural at that time and it had support for both the ingress > and egress side, along with the direct action support which was added later to have > a proper fast path for BPF. One thing that I personally never liked is that later > on tc sadly became a complex, quirky dumping ground for all the nic hw offloads (I > guess mainly driven from ovs side) for which I have a hard time convincing myself > that this is used at scale in production. Stuff like af699626ee26 just to pick one > which annoyingly also adds to the fast path given distros will just compile in most > of these things (like NET_TC_SKB_EXT)... what if such a bpf_link object is not tied > at all to cls_bpf or the clsact qdisc, and instead would implement > tcf_classify_{egress,ingress}() as-is in that sense, similar to the bpf_lsm hooks.
Meaning, > you could run existing tc BPF prog without any modifications and without additional > extra overhead (no need to walk the clsact qdisc and then again into the cls_bpf > one). These tc BPF programs would be managed only from bpf() via tc bpf_link api, > and are otherwise not bothering the classic tc command (though they could be dumped > there as well for the sake of visibility; bpftool would be fitting too). However, > if there is something attached from classic tc side, it would also go into the old > style tcf_classify_ingress() implementation and walk whatever is there so that nothing > existing breaks (same as when no bpf_link would be present so that there is no extra > overhead). This would also allow for a migration path of multi prog from cls_bpf to > this new implementation. Details still tbd, but I would much rather like such an > approach than the currently discussed one, and it would also fit better given we don't > run into this current mismatch of both worlds. So this would entail adding a separate list of BPF programs and running through those at the start of sch_handle_{egress,ingress}() I suppose? And that list of filters would only contain bpf_link-attached BPF programs, sorted by priority like TC filters? And return codes of TC_ACT_OK or TC_ACT_RECLASSIFY would continue through to tcf_classify_{egress,ingress}()? I suppose that could work; we could even stick the second filter list in struct mini_Qdisc and have clsact and bpf_link cooperate on managing that, no? That way it would also be easy to dump the BPF filters via netlink: I do think that will be the least surprising thing to do (so people can at least see there's something there with existing tools). The overhead would be a single extra branch when only one of clsact or bpf_link is in use (to check if the other list of filters is set); that's probably acceptable at this level... -Toke
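A rough sketch of that shape (illustrative pseudocode only, not actual kernel code; the per-netdev list and the helper names are made up):

  /* One extra branch when only one attachment style is in use: a
   * hypothetical bpf_link-managed program list is consulted before
   * the classic clsact walk. bpf_link_progs_run(), the field
   * dev->bpf_link_ingress and sch_handle_ingress_classic() are all
   * invented for illustration.
   */
  static struct sk_buff *sch_handle_ingress(struct sk_buff *skb)
  {
          struct bpf_link_prog_list *progs;

          progs = rcu_dereference_bh(skb->dev->bpf_link_ingress);
          if (progs) {                       /* the single extra branch */
                  switch (bpf_link_progs_run(progs, skb)) {
                  case TC_ACT_SHOT:
                          kfree_skb(skb);    /* verdict handled here */
                          return NULL;
                  case TC_ACT_OK:
                  default:
                          break;             /* continue to classic tc */
                  }
          }

          /* existing path: walk the clsact miniq, tcf_classify_ingress() */
          return sch_handle_ingress_classic(skb);
  }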
On 2021-06-15 7:07 p.m., Daniel Borkmann wrote: > On 6/13/21 11:10 PM, Jamal Hadi Salim wrote: [..] >> >> I look at it from the perspective that if i can run something with >> existing tc loading mechanism then i should be able to do the same >> with the new (libbpf) scheme. > > The intention is not to provide a full-blown tc library (that could be > subject to a > libtc or such), but rather to only have libbpf abstract the tc related > API that is > most /relevant/ for BPF program development and /efficient/ in terms of > execution in > fast-path while at the same time providing a good user experience from > the API itself. > > That is, simple to use and straight forward to explain to folks with > otherwise zero > experience of tc. The current implementation does all that, and from > experience with > large BPF programs managed via cls_bpf that is all that is actually > needed from tc > layer perspective. The ability to have multi programs (incl. priorities) > is in the > existing libbpf API as well. > Which is a fair statement, but if you take away things that work fine with current iproute2 loading I have no motivation to migrate at all. Its like that saying of "throwing out the baby with the bathwater". I want my baby. In particular, here's a list from Kartikeya's implementation: 1) Direct action mode only 2) Protocol ETH_P_ALL only 3) Only at chain 0 4) No block support I think he said priority is supported but was also originally on that list. When we discussed at the meetup it didnt seem these cost anything in terms of code complexity or usability of the API. 1) We use non-DA mode, so i cant live without that (and frankly ebpf has challenges adding complex code blocks). 2) We also use different protocols when i need to (yes, you can do the filtering in the bpf code - but why impose that if the cost of adding it is simple? and of course it is cheaper to do the check outside of ebpf) 3) We use chains outside of zero 4) So far we dont use block support but certainly my recent experiences in a deployment shows that we need to group netdevices more often than i thought was necessary. So if i could express one map shared by multiple netdevices it should cut down the user space complexity. cheers, jamal
On Wed, Jun 16, 2021 at 08:10:55PM IST, Jamal Hadi Salim wrote: > On 2021-06-15 7:07 p.m., Daniel Borkmann wrote: > > On 6/13/21 11:10 PM, Jamal Hadi Salim wrote: > > [..] > > > > > > > I look at it from the perspective that if i can run something with > > > existing tc loading mechanism then i should be able to do the same > > > with the new (libbpf) scheme. > > > > The intention is not to provide a full-blown tc library (that could be > > subject to a > > libtc or such), but rather to only have libbpf abstract the tc related > > API that is > > most /relevant/ for BPF program development and /efficient/ in terms of > > execution in > > fast-path while at the same time providing a good user experience from > > the API itself. > > > > That is, simple to use and straight forward to explain to folks with > > otherwise zero > > experience of tc. The current implementation does all that, and from > > experience with > > large BPF programs managed via cls_bpf that is all that is actually > > needed from tc > > layer perspective. The ability to have multi programs (incl. priorities) > > is in the > > existing libbpf API as well. > > > > Which is a fair statement, but if you take away things that work fine > with current iproute2 loading I have no motivation to migrate at all. > Its like that saying of "throwing out the baby with the bathwater". > I want my baby. > > In particular, here's a list from Kartikeya's implementation: > > 1) Direct action mode only > 2) Protocol ETH_P_ALL only > 3) Only at chain 0 > 4) No block support > Block is supported, you just need to set TCM_IFINDEX_MAGIC_BLOCK as ifindex and parent as the block index. There isn't anything more to it than that from libbpf side (just specify BPF_TC_CUSTOM enum). What I meant was that hook_create doesn't support specifying the ingress/egress block when creating clsact, but that typically isn't a problem because qdiscs for shared blocks would be set up together prior to the attachment anyway. > I think he said priority is supported but was also originally on that > list. > When we discussed at the meetup it didnt seem these cost anything > in terms of code complexity or usability of the API. > > 1) We use non-DA mode, so i cant live without that (and frankly ebpf > has challenges adding complex code blocks). > > 2) We also use different protocols when i need to > (yes, you can do the filtering in the bpf code - but why impose that > if the cost of adding it is simple? and of course it is cheaper to do > the check outside of ebpf) > 3) We use chains outside of zero > > 4) So far we dont use block support but certainly my recent experiences > in a deployment shows that we need to group netdevices more often than > i thought was necessary. So if i could express one map shared by > multiple netdevices it should cut down the user space complexity. > > cheers, > jamal -- Kartikeya
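In libbpf terms that could look roughly as follows (a sketch using the bpf_tc_* API from the recent libbpf patches; block index 22 and the prog fd are placeholders, and the shared block is assumed to have been set up beforehand, e.g. with `tc qdisc add dev eth0 ingress_block 22 clsact`):

  #include <linux/rtnetlink.h>     /* TCM_IFINDEX_MAGIC_BLOCK */
  #include <bpf/libbpf.h>

  /* Sketch: attach to a shared block rather than a device, per the
   * explanation above (magic block ifindex, block index as parent).
   */
  int attach_to_block(int prog_fd)
  {
          DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook,
                              .ifindex = (int)TCM_IFINDEX_MAGIC_BLOCK,
                              .attach_point = BPF_TC_CUSTOM,
                              .parent = 22 /* block index */);
          DECLARE_LIBBPF_OPTS(bpf_tc_opts, opts,
                              .prog_fd = prog_fd);

          return bpf_tc_attach(&hook, &opts);
  }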
On 2021-06-15 7:44 p.m., Daniel Borkmann wrote: > On 6/15/21 1:54 PM, Toke Høiland-Jørgensen wrote: >> Cong Wang <xiyou.wangcong@gmail.com> writes: > [...] > > Just to keep on brainstorming here, I wanted to bring back Alexei's > earlier quote: > > > I think it makes sense to create these objects as part of > establishing bpf_link. > > ingress qdisc is a fake qdisc anyway. > > If we could go back in time I would argue that its existence doesn't > > need to be shown in iproute2. It's an object that serves no purpose > > other than attaching filters to it. It doesn't do any queuing unlike > > real qdiscs. > > It's an artifact of old choices. Old doesn't mean good. > > The kernel is full of such quirks and oddities. New api-s shouldn't > > blindly follow them. > > tc qdisc add dev eth0 clsact > > is a useless command with nop effect. > I am not sure what Alexei's statement about old vs good was getting at. You have to have hooks/locations to stick things. Does it matter what you call that hook? > The whole bpf_link in this context feels somewhat awkward because both > are two > different worlds, one accessible via netlink with its own lifetime etc, > the other > one tied to fds and bpf syscall. Back in the days we did the cls_bpf > integration > since it felt the most natural at that time and it had support for both > the ingress > and egress side, along with the direct action support which was added > later to have > a proper fast path for BPF. One thing that I personally never liked is > that later > on tc sadly became a complex, quirky dumping ground for all the nic hw > offloads (I > guess mainly driven from ovs side) for which I have a hard time > convincing myself > that this is used at scale in production. Stuff like af699626ee26 just > to pick one > which annoyingly also adds to the fast path given distros will just > compile in most > of these things (like NET_TC_SKB_EXT)... what if such a bpf_link object is > not tied > at all to cls_bpf or the clsact qdisc, and instead would implement > tcf_classify_{egress,ingress}() as-is in that sense, similar to the bpf_lsm hooks. The choice is between generic architecture and appliance only-what-you-need code (via ebpf). Dont disagree that at times patches go in at the expense of the kernel datapath complexity or cost. Unfortunately sometimes this is because theres no sufficient review time - but thats a different topic. We try to impose a rule which states that any hardware offload has to have a kernel/software twin. Often that helps contain things. > Meaning, > you could run existing tc BPF prog without any modifications and without > additional > extra overhead (no need to walk the clsact qdisc and then again into the > cls_bpf > one). These tc BPF programs would be managed only from bpf() via tc > bpf_link api, > and are otherwise not bothering the classic tc command (though they could > be dumped > there as well for the sake of visibility; bpftool would be fitting > too). However, > if there is something attached from classic tc side, it would also go > into the old > style tcf_classify_ingress() implementation and walk whatever is there > so that nothing > existing breaks (same as when no bpf_link would be present so that there > is no extra > overhead). This would also allow for a migration path of multi prog from > cls_bpf to > this new implementation.
Details still tbd, but I would much rather like > such an > approach than the currently discussed one, and it would also fit better > given we don't > run into this current mismatch of both worlds. > The danger is totally divorcing from tc when you have special cases just for ebpf/tc, i.e. this is no different from the hardware offloads making you unhappy. The ability to use existing tools (user space tc in this case) to inter-work on both is very useful. From the discussion on the control aspect with Kartikeya i understood that we need some "transient state" which needs to get created and stored somewhere before being applied to tc (example creating the filters first and all necessary artifacts then calling internally to cls api). Seems to me that the "transient state" belongs to bpf. And i understood from Kartikeya this was his design intent as well (which seems sane to me). cheers, jamal
On 6/16/21 5:32 PM, Kumar Kartikeya Dwivedi wrote: > On Wed, Jun 16, 2021 at 08:10:55PM IST, Jamal Hadi Salim wrote: >> On 2021-06-15 7:07 p.m., Daniel Borkmann wrote: >>> On 6/13/21 11:10 PM, Jamal Hadi Salim wrote: >> >> [..] >> >>>> I look at it from the perspective that if i can run something with >>>> existing tc loading mechanism then i should be able to do the same >>>> with the new (libbpf) scheme. >>> >>> The intention is not to provide a full-blown tc library (that could be >>> subject to a >>> libtc or such), but rather to only have libbpf abstract the tc related >>> API that is >>> most /relevant/ for BPF program development and /efficient/ in terms of >>> execution in >>> fast-path while at the same time providing a good user experience from >>> the API itself. >>> >>> That is, simple to use and straight forward to explain to folks with >>> otherwise zero >>> experience of tc. The current implementation does all that, and from >>> experience with >>> large BPF programs managed via cls_bpf that is all that is actually >>> needed from tc >>> layer perspective. The ability to have multi programs (incl. priorities) >>> is in the >>> existing libbpf API as well. >> >> Which is a fair statement, but if you take away things that work fine >> with current iproute2 loading I have no motivation to migrate at all. >> Its like that saying of "throwing out the baby with the bathwater". >> I want my baby. >> >> In particular, here's a list from Kartikeya's implementation: >> >> 1) Direct action mode only (More below.) >> 2) Protocol ETH_P_ALL only The issue I see with this one is that it's not very valuable or useful from a BPF point of view. Meaning, this kind of check can and typically is implemented from BPF program anyway. For example, when you have direct packet access initially parsing the eth header anyway (and from there having logic for the various eth protos). That protocol option is maybe more useful when you have classic tc with cls+act style pipeline where you want a quick skip of classifiers to avoid reparsing the packet. Given you can do everything inside the BPF program already it adds more confusion than value for a simple libbpf [tc/BPF] API. >> 3) Only at chain 0 >> 4) No block support > > Block is supported, you just need to set TCM_IFINDEX_MAGIC_BLOCK as ifindex and > parent as the block index. There isn't anything more to it than that from libbpf > side (just specify BPF_TC_CUSTOM enum). > > What I meant was that hook_create doesn't support specifying the ingress/egress > block when creating clsact, but that typically isn't a problem because qdiscs > for shared blocks would be set up together prior to the attachment anyway. > >> I think he said priority is supported but was also originally on that >> list. >> When we discussed at the meetup it didnt seem these cost anything >> in terms of code complexity or usability of the API. >> >> 1) We use non-DA mode, so i cant live without that (and frankly ebpf >> has challenges adding complex code blocks). Could you elaborate on that or provide code examples? Since introduction of the direct action mode I've never used anything else again, and we do have complex BPF code blocks that we need to handle as well. Would be good if you could provide more details on things you ran into, maybe they can be solved? >> 2) We also use different protocols when i need to >> (yes, you can do the filtering in the bpf code - but why impose that >> if the cost of adding it is simple? 
and of course it is cheaper to do >> the check outside of ebpf) >> 3) We use chains outside of zero >> >> 4) So far we dont use block support but certainly my recent experiences >> in a deployment shows that we need to group netdevices more often than >> i thought was necessary. So if i could express one map shared by >> multiple netdevices it should cut down the user space complexity. Thanks, Daniel
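For reference, a minimal sketch of the parse-it-in-BPF argument above: the check that `protocol ip` would do at the filter level, done with direct packet access in a direct-action program instead:

  #include <linux/bpf.h>
  #include <linux/if_ether.h>
  #include <linux/pkt_cls.h>
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_endian.h>

  /* Sketch: the v4-only case boils down to one bounds check plus one
   * load-and-compare on eth->h_proto before the real processing.
   */
  SEC("classifier")
  int handle_ingress(struct __sk_buff *skb)
  {
          void *data = (void *)(long)skb->data;
          void *data_end = (void *)(long)skb->data_end;
          struct ethhdr *eth = data;

          if ((void *)(eth + 1) > data_end)
                  return TC_ACT_OK;
          if (eth->h_proto != bpf_htons(ETH_P_IP))
                  return TC_ACT_OK;        /* not v4: pass, don't parse */

          /* ... v4-only processing goes here ... */
          return TC_ACT_OK;
  }

  char LICENSE[] SEC("license") = "GPL";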
On 2021-06-16 12:00 p.m., Daniel Borkmann wrote: > On 6/16/21 5:32 PM, Kumar Kartikeya Dwivedi wrote: >> On Wed, Jun 16, 2021 at 08:10:55PM IST, Jamal Hadi Salim wrote: >>> On 2021-06-15 7:07 p.m., Daniel Borkmann wrote: >>>> On 6/13/21 11:10 PM, Jamal Hadi Salim wrote: [..] >>> >>> In particular, here's a list from Kartikeya's implementation: >>> >>> 1) Direct action mode only > > (More below.) > >>> 2) Protocol ETH_P_ALL only > > The issue I see with this one is that it's not very valuable or useful > from a BPF > point of view. Meaning, this kind of check can and typically is > implemented from > BPF program anyway. For example, when you have direct packet access > initially > parsing the eth header anyway (and from there having logic for the > various eth > protos). In that case make it optional to specify proto and default it to ETH_P_ALL. As far as i can see this flexibility doesnt complicate usability or add code complexity to the interfaces. > > That protocol option is maybe more useful when you have classic tc with > cls+act > style pipeline where you want a quick skip of classifiers to avoid > reparsing the > packet. Given you can do everything inside the BPF program already it > adds more > confusion than value for a simple libbpf [tc/BPF] API. > There's no point in repeating an operation of identifying the protocol type which can be/has already been identified by the calling (into ebpf) code. If all i am interested in is IPv4, then my ebpf parser can be simplified if i am sure i can assume it is an IPv4 packet. [..] >>> 1) We use non-DA mode, so i cant live without that (and frankly ebpf >>> has challenges adding complex code blocks). > > Could you elaborate on that or provide code examples? Since introduction > of the > direct action mode I've never used anything else again, and we do have > complex > BPF code blocks that we need to handle as well. Would be good if you > could provide > more details on things you ran into, maybe they can be solved? > Main issue is code complexity in ebpf and not so much instruction count (which is complicated once you have bounded loops). Earlier, I tried to post on the ebpf list but i got no response. I moved on since. I would like to engage you at some point - and you are right there may be some clever tricks to achieve the goals we had. The challenge is in keeping up with the bag of tricks to make the verifier happy. Being able to run non-da mode and for example attach an action such as the policer (and others) has pragmatic uses. It would be quite complex to implement the policer within an all-in-one-appliance da-mode ebpf code. One approach is to add more helpers to invoke such code directly from ebpf - but we have some restrictions; the deployment is RHEL8.3 based and we have to live with the kernel features supported there. i.e. kernel upgrade is a no-no. Given all these TC features have existed (and been stable) for 100 years it makes a lot of sense to use them. We are going to present some of the challenges we faced in a subset of our work in an approach to replace iptables at netdev 0x15 (hopefully we get accepted). cheers, jamal
On Fri, Jun 18, 2021 at 4:40 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > We are going to present some of the challenges we faced in a subset > of our work in an approach to replace iptables at netdev 0x15 > (hopefully we get accepted). Jamal, please stop using netdev@vger mailing list to promote a conference that does NOT represent the netdev kernel community. Slides shown at that conference is a non-event as far as this discussion goes.
On 2021-06-18 10:38 a.m., Alexei Starovoitov wrote: > On Fri, Jun 18, 2021 at 4:40 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: >> >> We are going to present some of the challenges we faced in a subset >> of our work in an approach to replace iptables at netdev 0x15 >> (hopefully we get accepted). > > Jamal, > please stop using netdev@vger mailing list to promote a conference > that does NOT represent the netdev kernel community. > > Slides shown at that conference is a non-event as far as this discussion goes. Alexei, Tame the aggression, would you please? You have no right to make claims as to who represents the community. Absolutely none. So get off that high horse. I only mentioned the slides because, when done, they will be a good spot which captures the issues. As i mentioned, i actually did send some email (some Cced to you) but got no response. I dont mind having a discussion but you have to be willing to listen as well. cheers, jamal
On Fri, Jun 18, 2021 at 7:50 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > On 2021-06-18 10:38 a.m., Alexei Starovoitov wrote: > > On Fri, Jun 18, 2021 at 4:40 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > >> > >> We are going to present some of the challenges we faced in a subset > >> of our work in an approach to replace iptables at netdev 0x15 > >> (hopefully we get accepted). > > > > Jamal, > > please stop using netdev@vger mailing list to promote a conference > > that does NOT represent the netdev kernel community. > > > > Slides shown at that conference is a non-event as far as this discussion goes. > > Alexei, > Tame the aggression, would you please? > You have no right to make claims as to who represents the community. > Absolutely none. So get off that high horse. > > I only mentioned the slides because it will be a good spot when > done which captures the issues. As i mentioned in i actually did > send some email (some Cced to you) but got no response. > I dont mind having a discussion but you have to be willing to > listen as well. You've side tracked technical discussion to promote your own conference. That's not acceptable. Please use other forums for marketing. This mailing list is for technical discussions.
On 2021-06-18 12:23 p.m., Alexei Starovoitov wrote: > On Fri, Jun 18, 2021 at 7:50 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > [..] >> Alexei, >> Tame the aggression, would you please? >> You have no right to make claims as to who represents the community. >> Absolutely none. So get off that high horse. >> >> I only mentioned the slides because it will be a good spot when >> done which captures the issues. As i mentioned in i actually did >> send some email (some Cced to you) but got no response. >> I dont mind having a discussion but you have to be willing to >> listen as well. > > You've side tracked technical discussion to promote your own conference. > That's not acceptable. Please use other forums for marketing. > > This mailing list is for technical discussions. I just made a statement in passing and you took it to a tangent. If you are so righteous, why didnt you just stick to making technical comments? Stop making bold statements and then playing the victim. cheers, jamal
On 6/18/21 1:40 PM, Jamal Hadi Salim wrote: > On 2021-06-16 12:00 p.m., Daniel Borkmann wrote: >> On 6/16/21 5:32 PM, Kumar Kartikeya Dwivedi wrote: >>> On Wed, Jun 16, 2021 at 08:10:55PM IST, Jamal Hadi Salim wrote: >>>> On 2021-06-15 7:07 p.m., Daniel Borkmann wrote: >>>>> On 6/13/21 11:10 PM, Jamal Hadi Salim wrote: > > [..] > >>>> In particular, here's a list from Kartikeya's implementation: >>>> >>>> 1) Direct action mode only >> >> (More below.) >> >>>> 2) Protocol ETH_P_ALL only >> >> The issue I see with this one is that it's not very valuable or useful from a BPF >> point of view. Meaning, this kind of check can and typically is implemented from >> BPF program anyway. For example, when you have direct packet access initially >> parsing the eth header anyway (and from there having logic for the various eth >> protos). > > In that case make it optional to specify proto and default it to > ETH_P_ALL. As far as i can see this flexibility doesnt > complicate usability or add code complexity to the interfaces. From a user interface PoV it's odd since you need to go and parse that anyway, at least the programs typically start out with a switch/case on either reading the skb->protocol or getting it via eth->h_proto. But then once you extend that same program to also cover IPv6, with ETH_P_ALL you don't need to do anything in the loader application, whereas otherwise you'd additionally need to remember to downgrade ETH_P_IP to ETH_P_ALL and rebuild the loader to get v6 traffic. But even if you were to split things in the main/entry program to separate v4/v6 processing into two different ones, I expect this to be faster via tail calls (given direct absolute jump) instead of walking a list of tcf_proto objects, comparing the tp->protocol and going into a different cls_bpf instance. [...] >> Could you elaborate on that or provide code examples? Since introduction of the >> direct action mode I've never used anything else again, and we do have complex >> BPF code blocks that we need to handle as well. Would be good if you could provide >> more details on things you ran into, maybe they can be solved? > > Main issue is code complexity in ebpf and not so much instruction > count (which is complicated once you have bounded loops). > Earlier, I tried to post on the ebpf list but i got no response. > I moved on since. I would like to engage you at some point - and > you are right there may be some clever tricks to achieve the goals > we had. The challenge is in keeping up with the bag of tricks to make > the verifier happy. > Being able to run non-da mode and for example attach an action such > as the policer (and others) has pragmatic uses. It would be quite complex to implement the policer within an all-in-one-appliance > da-mode ebpf code. It may be more tricky but not impossible either, in recent years some (imho) very interesting and exciting use cases have been implemented and talked about e.g. [0-2], and with the recent linker work there could also be a [e.g. in-kernel] collection with library code that can be pulled in by others aside from using them as BPF selftests as one option. The gain you have with the flexibility [as you know] is that it allows easy integration/orchestration into user space applications and thus suitable for more dynamic envs than with old-style actions.
The issue I have with the latter is that they're not scalable enough from a SW datapath / tc fast-path perspective given you then need to fallback to old-style list processing of cls+act combinations which is also not covered / in scope for the libbpf API in terms of their setup, and additionally not all of the BPF features can be used this way either, so it'll be very hard for users to debug why their BPF programs don't work as they're expected to. But also aside from those blockers, the case with this clean slate tc BPF API is that we have a unique chance to overcome the cmdline usability struggles, and make it as straight forward as possible for new generation of users. [0] https://linuxplumbersconf.org/event/7/contributions/677/ [1] https://linuxplumbersconf.org/event/2/contributions/121/ [2] https://netdevconf.info/0x14/session.html?talk-replacing-HTB-with-EDT-and-BPF Thanks, Daniel
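[Editorial aside: to make the ETH_P_ALL argument above concrete, here is a
minimal sketch - not taken from the patch series, with an invented program
name - of a direct-action cls_bpf program that does its own protocol
dispatch via direct packet access, which is why a tp->protocol restriction
at the netlink level adds little for BPF users:

/* Sketch: direct-action tc classifier doing its own protocol check,
 * making a netlink-level protocol filter redundant. Illustrative only.
 */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("classifier")
int tc_dispatch(struct __sk_buff *skb)
{
	void *data = (void *)(long)skb->data;
	void *data_end = (void *)(long)skb->data_end;
	struct ethhdr *eth = data;

	/* verifier requires the bounds check before reading eth */
	if ((void *)(eth + 1) > data_end)
		return TC_ACT_OK;

	switch (bpf_ntohs(eth->h_proto)) {
	case ETH_P_IP:
		/* v4-specific logic would go here */
		return TC_ACT_OK;
	case ETH_P_IPV6:
		/* v6-specific logic would go here */
		return TC_ACT_OK;
	default:
		/* anything else passes through untouched */
		return TC_ACT_OK;
	}
}

char _license[] SEC("license") = "GPL";

Attaching this with ETH_P_IP at the netlink level would only duplicate the
eth->h_proto check the program already performs; with ETH_P_ALL the same
object later covers v6 as well without touching the loader.]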
On 2021-06-18 6:42 p.m., Daniel Borkmann wrote:
> On 6/18/21 1:40 PM, Jamal Hadi Salim wrote:

[..]

> From a user interface PoV it's odd since you need to go and parse that
> anyway; at the least, the programs typically start out with a switch/case
> on either reading skb->protocol or getting it via eth->h_proto. But then
> once you extend that same program to also cover IPv6, you don't need to
> do anything with ETH_P_ALL from the loader application, whereas otherwise
> you'd additionally need to remember to downgrade ETH_P_IP to ETH_P_ALL
> and rebuild the loader to get v6 traffic. But even if you were to split
> things in the main/entry program to separate v4/v6 processing into two
> different ones, I expect this to be faster via tail calls (given the
> direct absolute jump) instead of walking a list of tcf_proto objects,
> comparing the tp->protocol and going into a different cls_bpf instance.

Good point on being more future proof with ETH_P_ALL.
Note: in our case we were only interested in ipv4, and I don't see that
changing for the specific prog we have. From a compute perspective, all I
am saving by not using ETH_P_ALL is one if statement (checking if the
proto is ipv4). If you feel strongly about it we can change our code. My
worry now is that if we used this approach, likely someone else in the
wild has done something similar.
I think it boils down again to: if it doesn't confuse the API or add
extra complexity, why not allow it and default to ETH_P_ALL?

On your comment that a bpf-based proto comparison is faster - the issue is
that the tp proto check always happens regardless, and ebpf, depending on
your program, may not fit all your code. For example, I may actually
decide to have separate programs for v6 and v4 with the current mechanism
- at different tc ruleset prios - just so as to work around
code/complexity issues.
BTW: the tail call limit of 32 provides an upper bound which affects the
depth of (generic) parsing. Does it make sense to allow increasing that
size (maybe on a per-boot basis)? The fact things run on the stack may be
restricting.

> It may be more tricky but not impossible either; in recent years some
> (imho) very interesting and exciting use cases have been implemented and
> talked about, e.g. [0-2], and with the recent linker work there could
> also be a [e.g. in-kernel] collection with library code that can be
> pulled in by others, aside from using them as BPF selftests, as one
> option. The gain you have with the flexibility [as you know] is that it
> allows easy integration/orchestration into user space applications and
> is thus suitable for more dynamic envs, as with old-style actions. The
> issue I have with the latter is that they're not scalable enough from a
> SW datapath / tc fast-path perspective, given you then need to fall back
> to old-style list processing of cls+act combinations, which is also not
> covered / in scope for the libbpf API in terms of their setup;
> additionally, not all of the BPF features can be used this way either,
> so it'll be very hard for users to debug why their BPF programs don't
> work as expected.
>
> But also aside from those blockers, the case with this clean slate tc
> BPF API is that we have a unique chance to overcome the cmdline
> usability struggles and make it as straightforward as possible for a new
> generation of users.
>
> [0] https://linuxplumbersconf.org/event/7/contributions/677/
> [1] https://linuxplumbersconf.org/event/2/contributions/121/
> [2] https://netdevconf.info/0x14/session.html?talk-replacing-HTB-with-EDT-and-BPF

I took a quick glance at the refs. IIUC, your message is "do more with
less", i.e. restrict choices now so we can focus on optimizing for speed.
Here's my experience. We have two pragmatic challenges:

1) In a deployment, like some enterprise class data centers, we are often
limited by the kernel and often even the distro we are on. You can't just
upgrade to the latest and greatest without risking voiding the distro
vendor's support contract. Big shops with a lot of geniuses like FB and
Google don't have these problems of course - but the majority out there
do. So even our little program must use supported interfaces to be
accepted (ex: you can't expect support on RH8.3 for an XDP issue without
using the supplied XDP lib). So building in support to use existing infra
is useful.

2) Challenges with ebpf code space and code complexity: depending on its
complexity, even a program with fewer than 4K instructions may be rejected
by the verifier. IOW, I just can't add all the features I need _even if I
wanted to_. For this reason, working cooperatively with other existing
kernel and user infra makes sense (ref [2] is doing that, for example).
You don't want to rewrite the kernel using ebpf; extending the kernel with
ebpf makes sense. And of course I don't want to lose performance, but
there may sometimes be a trade-off where a little loss in performance is
justified by the gain of a feature (the non-da example applies). Perhaps
adding more helpers to interface with the actions and classifiers is one
way forward.

cheers,
jamal

PS: I didn't understand the kernel linker point with BPF selftests.
Pointer?
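[Editorial aside: on the tail call comparison running through this
subthread, here is a rough sketch - again not from the series, with
invented slot indices and program names - of splitting v4/v6 handling into
separate programs behind one entry point using a standard
BPF_MAP_TYPE_PROG_ARRAY:

/* Sketch: one cls_bpf entry point tail-calling into per-protocol
 * programs, so the verifier's complexity budget is paid per program
 * rather than for one monolithic blob. Illustrative only.
 */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define SLOT_IPV4 0
#define SLOT_IPV6 1

struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__uint(max_entries, 2);
	__type(key, __u32);
	__type(value, __u32);
} jmp_table SEC(".maps");

SEC("classifier/ipv4")
int handle_ipv4(struct __sk_buff *skb)
{
	/* v4-only logic lives here */
	return TC_ACT_OK;
}

SEC("classifier/ipv6")
int handle_ipv6(struct __sk_buff *skb)
{
	/* v6-only logic lives here */
	return TC_ACT_OK;
}

SEC("classifier")
int entry(struct __sk_buff *skb)
{
	/* skb->protocol is the link-layer proto in network byte order */
	switch (bpf_ntohs(skb->protocol)) {
	case ETH_P_IP:
		bpf_tail_call(skb, &jmp_table, SLOT_IPV4);
		break;
	case ETH_P_IPV6:
		bpf_tail_call(skb, &jmp_table, SLOT_IPV6);
		break;
	}
	/* reached only if the slot is empty or the proto is unhandled */
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";

The loader is expected to populate jmp_table with the prog fds (e.g. via
bpf_map_update_elem()). The limit of 32 tail calls Jamal refers to bounds
how deep such chains can nest per invocation; whether that limit could be
made configurable is a separate question from this two-level split.]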