Message ID: 20210528195946.2375109-1-memxor@gmail.com
Series: Add bpf_link based TC-BPF API
On Fri, May 28, 2021 at 1:00 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote: > > This is the first RFC version. > > This adds a bpf_link path to create TC filters tied to cls_bpf classifier, and > introduces fd based ownership for such TC filters. Netlink cannot delete or > replace such filters, but the bpf_link is severed on indirect destruction of the > filter (backing qdisc being deleted, or chain being flushed, etc.). To ensure > that filters remain attached beyond process lifetime, the usual bpf_link fd > pinning approach can be used. > > The individual patches contain more details and comments, but the overall kernel > API and libbpf helper mirrors the semantics of the netlink based TC-BPF API > merged recently. This means that we start by always setting direct action mode, > protocol to ETH_P_ALL, chain_index as 0, etc. If there is a need for more > options in the future, they can be easily exposed through the bpf_link API in > the future. > > Patch 1 refactors cls_bpf change function to extract two helpers that will be > reused in bpf_link creation. > > Patch 2 exports some bpf_link management functions to modules. This is needed > because our bpf_link object is tied to the cls_bpf_prog object. Tying it to > tcf_proto would be weird, because the update path has to replace offloaded bpf > prog, which happens using internal cls_bpf helpers, and would in general be more > code to abstract over an operation that is unlikely to be implemented for other > filter types. > > Patch 3 adds the main bpf_link API. A function in cls_api takes care of > obtaining block reference, creating the filter object, and then calls the > bpf_link_change tcf_proto op (only supported by cls_bpf) that returns a fd after > setting up the internal structures. An optimization is made to not keep around > resources for extended actions, which is explained in a code comment as it wasn't > immediately obvious. > > Patch 4 adds an update path for bpf_link. Since bpf_link_update only supports > replacing the bpf_prog, we can skip tc filter's change path by reusing the > filter object but swapping its bpf_prog. This takes care of replacing the > offloaded prog as well (if that fails, update is aborted). So far however, > tcf_classify could do normal load (possibly torn) as the cls_bpf_prog->filter > would never be modified concurrently. This is no longer true, and to not > penalize the classify hot path, we also cannot impose serialization around > its load. Hence the load is changed to READ_ONCE, so that the pointer value is > always consistent. Due to invocation in a RCU critical section, the lifetime of > the prog is guaranteed for the duration of the call. > > Patch 5, 6 take care of updating the userspace bits and add a bpf_link returning > function to libbpf. > > Patch 7 adds a selftest that exercises all possible problematic interactions > that I could think of. > > Design: > > This is where in the object hierarchy our bpf_link object is attached. > > ┌─────┐ > │ │ > │ BPF │ > program > │ │ > └──▲──┘ > ┌───────┐ │ > │ │ ┌──────┴───────┐ > │ mod ├─────────► cls_bpf_prog │ > ┌────────────────┐ │cls_bpf│ └────┬───▲─────┘ > │ tcf_block │ │ │ │ │ > └────────┬───────┘ └───▲───┘ │ │ > │ ┌─────────────┐ │ ┌─▼───┴──┐ > └──────────► tcf_chain │ │ │bpf_link│ > └───────┬─────┘ │ └────────┘ > │ ┌─────────────┐ │ > └──────────► tcf_proto ├────┘ > └─────────────┘ > > The bpf_link is detached on destruction of the cls_bpf_prog. 
Doing it this way > allows us to implement update in a lightweight manner without having to recreate > a new filter, where we can just replace the BPF prog attached to cls_bpf_prog. > > The other way to do it would be to link the bpf_link to tcf_proto, there are > numerous downsides to this: > > 1. All filters have to embed the pointer even though they won't be using it when > cls_bpf is compiled in. > 2. This probably won't make sense to be extended to other filter types anyway. > 3. We aren't able to optimize the update case without adding another bpf_link > specific update operation to tcf_proto ops. > > The downside with tying this to the module is having to export bpf_link > management functions and introducing a tcf_proto op. Hopefully the cost of > another operation func pointer is not big enough (as there is only one ops > struct per module). > > This first version is to collect feedback on the approach and get ideas if there > is a better way to do this. Bpf_link-based TC API is a long time coming, so it's great to see someone finally working on this. Thanks! I briefly skimmed through the patch set, noticed a few generic bpf_link problems. But I think main feedback will come from Cilium folks and others that heavily rely on TC APIs. I wonder if there is an opportunity to simplify the API further given we have a new opportunity here. I don't think we are constrained to follow legacy TC API exactly. The problem is that your patch set was marked as spam by Google, so I suspect a bunch of folks haven't gotten it. I suggest re-sending it again but trimming down the CC list, leaving only bpf@vger, netdev@vger, and BPF maintainers CC'ed directly. > > Kumar Kartikeya Dwivedi (7): > net: sched: refactor cls_bpf creation code > bpf: export bpf_link functions for modules > net: sched: add bpf_link API for bpf classifier > net: sched: add lightweight update path for cls_bpf > tools: bpf.h: sync with kernel sources > libbpf: add bpf_link based TC-BPF management API > libbpf: add selftest for bpf_link based TC-BPF management API > > include/linux/bpf_types.h | 3 + > include/net/pkt_cls.h | 13 + > include/net/sch_generic.h | 6 +- > include/uapi/linux/bpf.h | 15 + > kernel/bpf/syscall.c | 14 +- > net/sched/cls_api.c | 138 ++++++- > net/sched/cls_bpf.c | 386 ++++++++++++++++-- > tools/include/uapi/linux/bpf.h | 15 + > tools/lib/bpf/bpf.c | 5 + > tools/lib/bpf/bpf.h | 8 +- > tools/lib/bpf/libbpf.c | 59 ++- > tools/lib/bpf/libbpf.h | 17 + > tools/lib/bpf/libbpf.map | 1 + > tools/lib/bpf/netlink.c | 5 +- > tools/lib/bpf/netlink.h | 8 + > .../selftests/bpf/prog_tests/tc_bpf_link.c | 285 +++++++++++++ > 16 files changed, 934 insertions(+), 44 deletions(-) > create mode 100644 tools/lib/bpf/netlink.h > create mode 100644 tools/testing/selftests/bpf/prog_tests/tc_bpf_link.c > > -- > 2.31.1 >
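Sketched concretely, the READ_ONCE change described in patch 4 of the cover
letter amounts to something like this in cls_bpf's classify hot path (an
illustrative sketch only; names and verdict handling are simplified relative
to the real cls_bpf code):

    /* Inside cls_bpf_classify(), running under an RCU read section */
    list_for_each_entry_rcu(prog, &head->plist, link) {
            /* BPF_LINK_UPDATE can now swap prog->filter concurrently, so
             * a plain load could be torn. READ_ONCE yields a consistent
             * pointer value, and the RCU read section keeps the prog
             * alive for the duration of the call.
             */
            struct bpf_prog *filter = READ_ONCE(prog->filter);
            int ret = bpf_prog_run(filter, skb);

            if (ret == TC_ACT_UNSPEC)
                    continue;       /* try the next filter in the list */
            return ret;
    }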
On Thu, Jun 03, 2021 at 02:39:15AM IST, Andrii Nakryiko wrote: > On Fri, May 28, 2021 at 1:00 PM Kumar Kartikeya Dwivedi > <memxor@gmail.com> wrote: > > > > This is the first RFC version. > > > > This adds a bpf_link path to create TC filters tied to cls_bpf classifier, and > > introduces fd based ownership for such TC filters. Netlink cannot delete or > > replace such filters, but the bpf_link is severed on indirect destruction of the > > filter (backing qdisc being deleted, or chain being flushed, etc.). To ensure > > that filters remain attached beyond process lifetime, the usual bpf_link fd > > pinning approach can be used. > > > > The individual patches contain more details and comments, but the overall kernel > > API and libbpf helper mirrors the semantics of the netlink based TC-BPF API > > merged recently. This means that we start by always setting direct action mode, > > protocol to ETH_P_ALL, chain_index as 0, etc. If there is a need for more > > options in the future, they can be easily exposed through the bpf_link API in > > the future. > > > > Patch 1 refactors cls_bpf change function to extract two helpers that will be > > reused in bpf_link creation. > > > > Patch 2 exports some bpf_link management functions to modules. This is needed > > because our bpf_link object is tied to the cls_bpf_prog object. Tying it to > > tcf_proto would be weird, because the update path has to replace offloaded bpf > > prog, which happens using internal cls_bpf helpers, and would in general be more > > code to abstract over an operation that is unlikely to be implemented for other > > filter types. > > > > Patch 3 adds the main bpf_link API. A function in cls_api takes care of > > obtaining block reference, creating the filter object, and then calls the > > bpf_link_change tcf_proto op (only supported by cls_bpf) that returns a fd after > > setting up the internal structures. An optimization is made to not keep around > > resources for extended actions, which is explained in a code comment as it wasn't > > immediately obvious. > > > > Patch 4 adds an update path for bpf_link. Since bpf_link_update only supports > > replacing the bpf_prog, we can skip tc filter's change path by reusing the > > filter object but swapping its bpf_prog. This takes care of replacing the > > offloaded prog as well (if that fails, update is aborted). So far however, > > tcf_classify could do normal load (possibly torn) as the cls_bpf_prog->filter > > would never be modified concurrently. This is no longer true, and to not > > penalize the classify hot path, we also cannot impose serialization around > > its load. Hence the load is changed to READ_ONCE, so that the pointer value is > > always consistent. Due to invocation in a RCU critical section, the lifetime of > > the prog is guaranteed for the duration of the call. > > > > Patch 5, 6 take care of updating the userspace bits and add a bpf_link returning > > function to libbpf. > > > > Patch 7 adds a selftest that exercises all possible problematic interactions > > that I could think of. > > > > Design: > > > > This is where in the object hierarchy our bpf_link object is attached. 
> > > > ┌─────┐ > > │ │ > > │ BPF │ > > program > > │ │ > > └──▲──┘ > > ┌───────┐ │ > > │ │ ┌──────┴───────┐ > > │ mod ├─────────► cls_bpf_prog │ > > ┌────────────────┐ │cls_bpf│ └────┬───▲─────┘ > > │ tcf_block │ │ │ │ │ > > └────────┬───────┘ └───▲───┘ │ │ > > │ ┌─────────────┐ │ ┌─▼───┴──┐ > > └──────────► tcf_chain │ │ │bpf_link│ > > └───────┬─────┘ │ └────────┘ > > │ ┌─────────────┐ │ > > └──────────► tcf_proto ├────┘ > > └─────────────┘ > > > > The bpf_link is detached on destruction of the cls_bpf_prog. Doing it this way > > allows us to implement update in a lightweight manner without having to recreate > > a new filter, where we can just replace the BPF prog attached to cls_bpf_prog. > > > > The other way to do it would be to link the bpf_link to tcf_proto, there are > > numerous downsides to this: > > > > 1. All filters have to embed the pointer even though they won't be using it when > > cls_bpf is compiled in. > > 2. This probably won't make sense to be extended to other filter types anyway. > > 3. We aren't able to optimize the update case without adding another bpf_link > > specific update operation to tcf_proto ops. > > > > The downside with tying this to the module is having to export bpf_link > > management functions and introducing a tcf_proto op. Hopefully the cost of > > another operation func pointer is not big enough (as there is only one ops > > struct per module). > > > > This first version is to collect feedback on the approach and get ideas if there > > is a better way to do this. > > Bpf_link-based TC API is a long time coming, so it's great to see > someone finally working on this. Thanks! > > I briefly skimmed through the patch set, noticed a few generic > bpf_link problems. But I think main feedback will come from Cilium Thanks for the review. I'll fix both of these in the resend (also have a couple of private reports from the kernel test robot). > folks and others that heavily rely on TC APIs. I wonder if there is an > opportunity to simplify the API further given we have a new > opportunity here. I don't think we are constrained to follow legacy TC > API exactly. > I tried to keep it simple by going for the defaults we agreed upon for the new netlink based libbpf API, and always setting direct action mode, and it's still in a position to be extended in the future to allow full TC filter setup like netlink does, if someone ever happens to need that. As for the implementation, I did notice that there has been discussion around this (though I could only find [0]) but I think doing it the way this patch does is more flexible as you can attach the bpf filter to an aribitrary parent/class, not just ingress and egress, and it can coexist with a conventional TC setup. [0]: https://facebookmicrosites.github.io/bpf/blog/2018/08/31/object-lifetime.html ("Note that there is ongoing work ...") > The problem is that your patch set was marked as spam by Google, so I > suspect a bunch of folks haven't gotten it. I suggest re-sending it > again but trimming down the CC list, leaving only bpf@vger, > netdev@vger, and BPF maintainers CC'ed directly. > Thanks for the heads up, I'll resend tomorrow. -- Kartikeya
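With the defaults described above (direct action, ETH_P_ALL, chain 0),
attachment from userspace could be nearly a one-liner. A sketch of how use of
the RFC's libbpf helper might look; the helper name bpf_program__attach_tc and
its argument list are placeholders, not the final API:

    struct bpf_object *obj;
    struct bpf_program *prog;
    struct bpf_link *link;

    obj = bpf_object__open_file("cls.o", NULL);
    bpf_object__load(obj);
    prog = bpf_object__find_program_by_name(obj, "classify");

    /* Hypothetical helper from patch 6: creates a cls_bpf filter
     * (direct-action, ETH_P_ALL, chain 0 by default) on the clsact
     * ingress hook and returns a bpf_link owning that filter.
     */
    link = bpf_program__attach_tc(prog, if_nametoindex("eth0"),
                                  TC_H_MAKE(TC_H_CLSACT, TC_H_MIN_INGRESS));

    /* Pinning keeps the filter attached beyond process lifetime. */
    bpf_link__pin(link, "/sys/fs/bpf/my_tc_link");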
On Thu, Jun 03, 2021 at 03:15:13AM +0530, Kumar Kartikeya Dwivedi wrote: > > > The problem is that your patch set was marked as spam by Google, so I > > suspect a bunch of folks haven't gotten it. I suggest re-sending it > > again but trimming down the CC list, leaving only bpf@vger, > > netdev@vger, and BPF maintainers CC'ed directly. > > > > Thanks for the heads up, I'll resend tomorrow. fyi I see this thread in my inbox, but, sadly, not the patches. So guessing based on cover letter and hoping that the following is true: link_fd is returned by BPF_LINK_CREATE command. If anything is missing in struct link_create the patches are adding it there. target_ifindex, flags are reused. attach_type indicates ingress vs egress.
On Thu, Jun 03, 2021 at 05:20:58AM IST, Alexei Starovoitov wrote: > On Thu, Jun 03, 2021 at 03:15:13AM +0530, Kumar Kartikeya Dwivedi wrote: > > > > > The problem is that your patch set was marked as spam by Google, so I > > > suspect a bunch of folks haven't gotten it. I suggest re-sending it > > > again but trimming down the CC list, leaving only bpf@vger, > > > netdev@vger, and BPF maintainers CC'ed directly. > > > > > > > Thanks for the heads up, I'll resend tomorrow. > > fyi I see this thread in my inbox, but, sadly, not the patches. > So guessing based on cover letter and hoping that the following is true: > link_fd is returned by BPF_LINK_CREATE command. > If anything is missing in struct link_create the patches are adding it there. > target_ifindex, flags are reused. attach_type indicates ingress vs egress. Everything is true except the attach_type part. I don't hook directly into sch_handle_{ingress,egress}. It's a normal TC filter, and if one wants to hook into ingress, egress, they attach it to clsact qdisc. The lifetime however is decided by the link fd. The new version is here: https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com -- Kartikeya
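Put in code, the exchange above suggests a create step roughly like the
following at the syscall level (a guess at the UAPI, mirroring Alexei's
reading of the cover letter; the attach type name and the note about parent
fields are assumptions, not confirmed against the patches):

    union bpf_attr attr = {};
    int link_fd;

    attr.link_create.prog_fd = prog_fd;          /* SCHED_CLS program */
    attr.link_create.target_ifindex = ifindex;   /* reused, as guessed */
    attr.link_create.attach_type = BPF_TC;       /* placeholder name; per the
                                                  * reply it does not encode
                                                  * ingress vs egress - the
                                                  * qdisc parent (e.g. the
                                                  * clsact ingress class)
                                                  * travels in new fields
                                                  * added by the patches */

    link_fd = syscall(__NR_bpf, BPF_LINK_CREATE, &attr, sizeof(attr));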
On Fri, May 28, 2021 at 1:00 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> This is the first RFC version.
>
> This adds a bpf_link path to create TC filters tied to cls_bpf classifier, and
> introduces fd based ownership for such TC filters. Netlink cannot delete or
> replace such filters, but the bpf_link is severed on indirect destruction of the
> filter (backing qdisc being deleted, or chain being flushed, etc.). To ensure
> that filters remain attached beyond process lifetime, the usual bpf_link fd
> pinning approach can be used.

I have some trouble understanding this. So... why is the TC filter so special
here that it deserves such special treatment?

The reason why I ask is that none of the other bpf links actually create any
object; they merely attach a bpf program to an existing object. For example, a
netns bpf_link does not create a netns, and a cgroup bpf_link does not create
a cgroup either. So why is the TC filter so lucky as to be the first one that
requires creating an object?

Is it because there is no fd associated with any TC object? TC objects, like
all other netlink-based things, are not fs based, hence do not have an fd. Or
maybe you don't need an fd at all? At least the xdp bpf_link is associated
with a netdev, which does not have an fd either.

>
> The individual patches contain more details and comments, but the overall kernel
> API and libbpf helper mirrors the semantics of the netlink based TC-BPF API
> merged recently. This means that we start by always setting direct action mode,
> protocol to ETH_P_ALL, chain_index as 0, etc. If there is a need for more
> options in the future, they can be easily exposed through the bpf_link API in
> the future.

As you already see, this fits really oddly into the TC infrastructure, because
TC qdisc/filter/action is a whole subsystem, and here you are trying to punch
a hole in the middle. ;) This usually indicates that we are going in the wrong
direction; maybe your case is an exception, but I can't find anything to
justify it in your cover letter.

Even if you really want to go down this path (I still doubt it), you probably
want to explore whether there is any generic way to associate a TC object with
an fd, because we have a TC bpf action and we will have a TC bpf qdisc too; I
don't see why cls_bpf is more special than they are.

Thanks.
On Mon, Jun 07, 2021 at 05:07:28AM IST, Cong Wang wrote: > On Fri, May 28, 2021 at 1:00 PM Kumar Kartikeya Dwivedi > <memxor@gmail.com> wrote: > > > > This is the first RFC version. > > > > This adds a bpf_link path to create TC filters tied to cls_bpf classifier, and > > introduces fd based ownership for such TC filters. Netlink cannot delete or > > replace such filters, but the bpf_link is severed on indirect destruction of the > > filter (backing qdisc being deleted, or chain being flushed, etc.). To ensure > > that filters remain attached beyond process lifetime, the usual bpf_link fd > > pinning approach can be used. > > I have some troubles understanding this. So... why TC filter is so special > here that it deserves such a special treatment? > So the motivation behind this was automatic cleanup of filters installed by some program. Usually from the userspace side you need a bunch of things (handle, priority, protocol, chain_index, etc.) to be able to delete a filter without stepping on others' toes. Also, there is no gurantee that filter hasn't been replaced, so you need to check some other way (either tag or prog_id, but these are also not perfect). bpf_link provides isolation from netlink and fd-based lifetime management. As for why it needs special treatment (by which I guess you mean why it _creates_ an object instead of simply attaching to one, see below): > The reason why I ask is that none of other bpf links actually create any > object, they merely attach bpf program to an existing object. For example, > netns bpf_link does not create an netns, cgroup bpf_link does not create > a cgroup either. So, why TC filter is so lucky to be the first one requires > creating an object? > They are created behind the scenes, but are also fairly isolated from netlink (i.e. can only be introspected, so not hidden in that sense, but are effectively locked for replace/delete). The problem would be (of not creating a filter during attach) is that a typical 'attach point' for TC exists in form of tcf_proto. If we take priority (protocol is fixed) out of the equation, it becomes possible to attach just a single BPF prog, but that seems like a needless limitation when TC already supports list of filters at each 'attach point'. My point is that the created object (the tcf_proto under the 'chain' object) is the attach point, and since there can be so many, keeping them around at all times doesn't make sense, so the refcounted attach locations are created as needed. Both netlink and bpf_link owned filters can be attached under the same location, with different ownership story in userspace. > Is it because there is no fd associated with any TC object? Or what? > TC object, like all other netlink stuffs, is not fs based, hence does not > have an fd. Or maybe you don't need an fd at all? Since at least xdp > bpf_link is associated with a netdev which does not have an fd either. > > > > > The individual patches contain more details and comments, but the overall kernel > > API and libbpf helper mirrors the semantics of the netlink based TC-BPF API > > merged recently. This means that we start by always setting direct action mode, > > protocol to ETH_P_ALL, chain_index as 0, etc. If there is a need for more > > options in the future, they can be easily exposed through the bpf_link API in > > the future. > > As you already see, this fits really oddly into TC infrastructure, because > TC qdisc/filter/action are a whole subsystem, here you are trying to punch > a hole in the middle. 
;) This usually indicates that we are going in a wrong > direction, maybe your case is an exception, but I can't find anything to justify > it in your cover letter. > I don't see why I'm punching a hole. The qdisc, chain, protocol, priority is the 'attach location', handle is just an ID, maybe we can skip all this and just create a static hook for attaching single BPF program that doesn't require creating a filter, but someday someone will have to reimplement chaining of programs again (like libxdp does). > Even if you really want to go down this path (I still double), you probably > want to explore whether there is any generic way to associate a TC object > with an fd, because we have TC bpf action and we will have TC bpf qdisc > too, I don't see any bpf_cls is more special than them. > I think TC bpf actions are not relevant going forward (due to cls_bpf's direct action mode), but I could be wrong. I say so because even a proposed API to attach these from libbpf was dropped because arguably cls_bpf does it better, and people shouldn't be using integrated actions going forward. TC bpf qdisc might be, but that can be a different attach type (say BPF_SCHED), which if exposed through bpf_link will again have its attach point to be the target_ifindex, not some fd, and it would still be possible to use this API to attach to a eBPF qdisc. What do you suggest? I am open to reworking this in a different way if there are any better ideas. > Thanks. -- Kartikeya
On Sun, Jun 6, 2021 at 8:38 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote: > > On Mon, Jun 07, 2021 at 05:07:28AM IST, Cong Wang wrote: > > On Fri, May 28, 2021 at 1:00 PM Kumar Kartikeya Dwivedi > > <memxor@gmail.com> wrote: > > > > > > This is the first RFC version. > > > > > > This adds a bpf_link path to create TC filters tied to cls_bpf classifier, and > > > introduces fd based ownership for such TC filters. Netlink cannot delete or > > > replace such filters, but the bpf_link is severed on indirect destruction of the > > > filter (backing qdisc being deleted, or chain being flushed, etc.). To ensure > > > that filters remain attached beyond process lifetime, the usual bpf_link fd > > > pinning approach can be used. > > > > I have some troubles understanding this. So... why TC filter is so special > > here that it deserves such a special treatment? > > > > So the motivation behind this was automatic cleanup of filters installed by some > program. Usually from the userspace side you need a bunch of things (handle, > priority, protocol, chain_index, etc.) to be able to delete a filter without > stepping on others' toes. Also, there is no gurantee that filter hasn't been > replaced, so you need to check some other way (either tag or prog_id, but these > are also not perfect). > > bpf_link provides isolation from netlink and fd-based lifetime management. As > for why it needs special treatment (by which I guess you mean why it _creates_ > an object instead of simply attaching to one, see below): Are you saying TC filter is not independent? IOW, it has to rely on TC qdisc to exist. This is true, and is of course different with netns/cgroup. This is perhaps not hard to solve, because TC actions are already independent, we can perhaps convert TC filters too (the biggest blocker is compatibility). Or do you just need an ephemeral representation of a TC filter which only exists for a process? If so, see below. > > > The reason why I ask is that none of other bpf links actually create any > > object, they merely attach bpf program to an existing object. For example, > > netns bpf_link does not create an netns, cgroup bpf_link does not create > > a cgroup either. So, why TC filter is so lucky to be the first one requires > > creating an object? > > > > They are created behind the scenes, but are also fairly isolated from netlink > (i.e. can only be introspected, so not hidden in that sense, but are > effectively locked for replace/delete). > > The problem would be (of not creating a filter during attach) is that a typical > 'attach point' for TC exists in form of tcf_proto. If we take priority (protocol > is fixed) out of the equation, it becomes possible to attach just a single BPF > prog, but that seems like a needless limitation when TC already supports list of > filters at each 'attach point'. > > My point is that the created object (the tcf_proto under the 'chain' object) is > the attach point, and since there can be so many, keeping them around at all > times doesn't make sense, so the refcounted attach locations are created as > needed. Both netlink and bpf_link owned filters can be attached under the same > location, with different ownership story in userspace. I do not understand "created behind the scenes". These are all created independent of bpf_link, right? For example, we create a cgroup with mkdir, then open it and pass the fd to bpf_link. Clearly, cgroup is not created by bpf_link or any bpf syscall. 
The only thing different is fd, or more accurately, an identifier to locate these objects. For example, ifindex can also be used to locate a netdev. We can certainly locate a TC filter with (prio,proto,handle) but this has to be passed via netlink. So if you need some locator, I think we can introduce a kernel API which takes all necessary parameters to locate a TC filter and return it to you. For a quick example, like this: struct tcf_proto *tcf_get_proto(struct net *net, int ifindex, int parent, char* kind, int handle...); (Note, it can grab a refcnt in case of being deleted by others.) With this, you can simply call it in bpf_link, and attach bpf prog to tcf_proto (of course, only cls_bpf succeeds here). Thanks.
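With such a locator, the bpf_link attach path could collapse to something like
the following (a sketch of Cong's proposal; the bpf_link_attach tcf_proto op
and the tcf_put_proto counterpart are hypothetical, and only cls_bpf would
implement the op):

    struct tcf_proto *tp;
    int err;

    /* Grabs a reference so the filter cannot go away underneath us. */
    tp = tcf_get_proto(net, ifindex, parent, "bpf", handle);
    if (IS_ERR(tp))
            return PTR_ERR(tp);

    /* Classifiers other than cls_bpf would return -EOPNOTSUPP here. */
    err = tp->ops->bpf_link_attach(tp, link, prog);
    if (err)
            tcf_put_proto(tp);      /* hypothetical counterpart to the getter */
    return err;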
On Mon, Jun 07, 2021 at 10:48:04AM IST, Cong Wang wrote: > On Sun, Jun 6, 2021 at 8:38 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote: > > > > On Mon, Jun 07, 2021 at 05:07:28AM IST, Cong Wang wrote: > > > On Fri, May 28, 2021 at 1:00 PM Kumar Kartikeya Dwivedi > > > <memxor@gmail.com> wrote: > > > > > > > > This is the first RFC version. > > > > > > > > This adds a bpf_link path to create TC filters tied to cls_bpf classifier, and > > > > introduces fd based ownership for such TC filters. Netlink cannot delete or > > > > replace such filters, but the bpf_link is severed on indirect destruction of the > > > > filter (backing qdisc being deleted, or chain being flushed, etc.). To ensure > > > > that filters remain attached beyond process lifetime, the usual bpf_link fd > > > > pinning approach can be used. > > > > > > I have some troubles understanding this. So... why TC filter is so special > > > here that it deserves such a special treatment? > > > > > > > So the motivation behind this was automatic cleanup of filters installed by some > > program. Usually from the userspace side you need a bunch of things (handle, > > priority, protocol, chain_index, etc.) to be able to delete a filter without > > stepping on others' toes. Also, there is no gurantee that filter hasn't been > > replaced, so you need to check some other way (either tag or prog_id, but these > > are also not perfect). > > > > bpf_link provides isolation from netlink and fd-based lifetime management. As > > for why it needs special treatment (by which I guess you mean why it _creates_ > > an object instead of simply attaching to one, see below): > > Are you saying TC filter is not independent? IOW, it has to rely on TC qdisc > to exist. This is true, and is of course different with netns/cgroup. > This is perhaps > not hard to solve, because TC actions are already independent, we can perhaps > convert TC filters too (the biggest blocker is compatibility). > True, but that would mean you need some way to create a detached TC filter, correct? Can you give some ideas on how the setup would look like from userspace side? IIUC you mean RTM_NEWTFILTER (with kind == bpf) parent == SOME_MAGIC_DETACHED ifindex == INVALID then bpf_link comes in and creates the binding to the qdisc, parent, prio, chain, handle ... ? > Or do you just need an ephemeral representation of a TC filter which only exists > for a process? If so, see below. > > > > > > The reason why I ask is that none of other bpf links actually create any > > > object, they merely attach bpf program to an existing object. For example, > > > netns bpf_link does not create an netns, cgroup bpf_link does not create > > > a cgroup either. So, why TC filter is so lucky to be the first one requires > > > creating an object? > > > > > > > They are created behind the scenes, but are also fairly isolated from netlink > > (i.e. can only be introspected, so not hidden in that sense, but are > > effectively locked for replace/delete). > > > > The problem would be (of not creating a filter during attach) is that a typical > > 'attach point' for TC exists in form of tcf_proto. If we take priority (protocol > > is fixed) out of the equation, it becomes possible to attach just a single BPF > > prog, but that seems like a needless limitation when TC already supports list of > > filters at each 'attach point'. 
> > > > My point is that the created object (the tcf_proto under the 'chain' object) is > > the attach point, and since there can be so many, keeping them around at all > > times doesn't make sense, so the refcounted attach locations are created as > > needed. Both netlink and bpf_link owned filters can be attached under the same > > location, with different ownership story in userspace. > > I do not understand "created behind the scenes". These are all created > independent of bpf_link, right? For example, we create a cgroup with > mkdir, then open it and pass the fd to bpf_link. Clearly, cgroup is not > created by bpf_link or any bpf syscall. Sorry, that must be confusing. I was only referring to what this patch does. Indeed, as far as implementation is concerned this has no precedence. > > The only thing different is fd, or more accurately, an identifier to locate > these objects. For example, ifindex can also be used to locate a netdev. > We can certainly locate a TC filter with (prio,proto,handle) but this has to > be passed via netlink. So if you need some locator, I think we can > introduce a kernel API which takes all necessary parameters to locate > a TC filter and return it to you. For a quick example, like this: > > struct tcf_proto *tcf_get_proto(struct net *net, int ifindex, > int parent, char* kind, int handle...); > I think this already exists in some way, i.e. you can just ignore if filter handle from tp->ops->get doesn't exist (reusing the exsiting code) that walks from qdisc/block -> chain -> tcf_proto during creation. > (Note, it can grab a refcnt in case of being deleted by others.) > > With this, you can simply call it in bpf_link, and attach bpf prog to tcf_proto > (of course, only cls_bpf succeeds here). > So IIUC, you are proposing to first create a filter normally using netlink, then attach it using bpf_link to the proper parent? I.e. your main contention point is to not create filter from bpf_link, instead take a filter and attach it to a parent with bpf_link representing this attachment? But then the created filter stays with refcount of 1 until RTM_DELTFILTER, i.e. the lifetime of the attachment may be managed by bpf_link (in that we can detach the filter from parent) but the filter itself will not be cleaned up. One of the goals of tying TC filter to fd was to bind lifetime of filter itself, along with attachment. Separating both doesn't seem to buy anything here. You always create a filter to attach somewhere. With actions, things are different, you may create one action but bind it to multiple filters, so actions existing as their own thing makes sense. A single action can serve multiple filters, and save on memory. You could argue that even with filters this is true, as you may want to attach the same filter to multiple qdiscs, but we already have a facility to do that (shared tcf_block with block->q == NULL). However that is not as flexible as what you are proposing. It may be odd from the kernel side but to userspace a parent, prio, handle (we don't let user choose anything else for now) is itself the attach point, how bpf_link manages the attachment internally isn't really that interesting. It does so now by way of creating an object that represents a certain hook, then binding the BPF prog to it. I consider this mostly an implementation detail. What you are really attaching to is the qdisc/block, which is the resource analogous to cgroup fd, netns fd, and ifindex, and 'where' is described by other attributes. > Thanks. -- Kartikeya
On Sun, Jun 6, 2021 at 11:08 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote: > > On Mon, Jun 07, 2021 at 10:48:04AM IST, Cong Wang wrote: > > On Sun, Jun 6, 2021 at 8:38 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote: > > > > > > On Mon, Jun 07, 2021 at 05:07:28AM IST, Cong Wang wrote: > > > > On Fri, May 28, 2021 at 1:00 PM Kumar Kartikeya Dwivedi > > > > <memxor@gmail.com> wrote: > > > > > > > > > > This is the first RFC version. > > > > > > > > > > This adds a bpf_link path to create TC filters tied to cls_bpf classifier, and > > > > > introduces fd based ownership for such TC filters. Netlink cannot delete or > > > > > replace such filters, but the bpf_link is severed on indirect destruction of the > > > > > filter (backing qdisc being deleted, or chain being flushed, etc.). To ensure > > > > > that filters remain attached beyond process lifetime, the usual bpf_link fd > > > > > pinning approach can be used. > > > > > > > > I have some troubles understanding this. So... why TC filter is so special > > > > here that it deserves such a special treatment? > > > > > > > > > > So the motivation behind this was automatic cleanup of filters installed by some > > > program. Usually from the userspace side you need a bunch of things (handle, > > > priority, protocol, chain_index, etc.) to be able to delete a filter without > > > stepping on others' toes. Also, there is no gurantee that filter hasn't been > > > replaced, so you need to check some other way (either tag or prog_id, but these > > > are also not perfect). > > > > > > bpf_link provides isolation from netlink and fd-based lifetime management. As > > > for why it needs special treatment (by which I guess you mean why it _creates_ > > > an object instead of simply attaching to one, see below): > > > > Are you saying TC filter is not independent? IOW, it has to rely on TC qdisc > > to exist. This is true, and is of course different with netns/cgroup. > > This is perhaps > > not hard to solve, because TC actions are already independent, we can perhaps > > convert TC filters too (the biggest blocker is compatibility). > > > > True, but that would mean you need some way to create a detached TC filter, correct? > Can you give some ideas on how the setup would look like from userspace side? > > IIUC you mean > > RTM_NEWTFILTER (with kind == bpf) parent == SOME_MAGIC_DETACHED ifindex == INVALID > > then bpf_link comes in and creates the binding to the qdisc, parent, prio, > chain, handle ... ? Roughly yes, except creation is still done by netlink, not bpf_link. It is pretty much similar to those unbound TC actions. > > > Or do you just need an ephemeral representation of a TC filter which only exists > > for a process? If so, see below. > > > > > > > > > The reason why I ask is that none of other bpf links actually create any > > > > object, they merely attach bpf program to an existing object. For example, > > > > netns bpf_link does not create an netns, cgroup bpf_link does not create > > > > a cgroup either. So, why TC filter is so lucky to be the first one requires > > > > creating an object? > > > > > > > > > > They are created behind the scenes, but are also fairly isolated from netlink > > > (i.e. can only be introspected, so not hidden in that sense, but are > > > effectively locked for replace/delete). > > > > > > The problem would be (of not creating a filter during attach) is that a typical > > > 'attach point' for TC exists in form of tcf_proto. 
If we take priority (protocol > > > is fixed) out of the equation, it becomes possible to attach just a single BPF > > > prog, but that seems like a needless limitation when TC already supports list of > > > filters at each 'attach point'. > > > > > > My point is that the created object (the tcf_proto under the 'chain' object) is > > > the attach point, and since there can be so many, keeping them around at all > > > times doesn't make sense, so the refcounted attach locations are created as > > > needed. Both netlink and bpf_link owned filters can be attached under the same > > > location, with different ownership story in userspace. > > > > I do not understand "created behind the scenes". These are all created > > independent of bpf_link, right? For example, we create a cgroup with > > mkdir, then open it and pass the fd to bpf_link. Clearly, cgroup is not > > created by bpf_link or any bpf syscall. > > Sorry, that must be confusing. I was only referring to what this patch does. > Indeed, as far as implementation is concerned this has no precedence. > > > > > The only thing different is fd, or more accurately, an identifier to locate > > these objects. For example, ifindex can also be used to locate a netdev. > > We can certainly locate a TC filter with (prio,proto,handle) but this has to > > be passed via netlink. So if you need some locator, I think we can > > introduce a kernel API which takes all necessary parameters to locate > > a TC filter and return it to you. For a quick example, like this: > > > > struct tcf_proto *tcf_get_proto(struct net *net, int ifindex, > > int parent, char* kind, int handle...); > > > > I think this already exists in some way, i.e. you can just ignore if filter > handle from tp->ops->get doesn't exist (reusing the exsiting code) that walks > from qdisc/block -> chain -> tcf_proto during creation. Right, except currently it requires a few API's to reach TC filters (first netdev,, then qdisc, finally filters). So, I think providing one API could at least address your "stepping on others toes" concern? > > > (Note, it can grab a refcnt in case of being deleted by others.) > > > > With this, you can simply call it in bpf_link, and attach bpf prog to tcf_proto > > (of course, only cls_bpf succeeds here). > > > > So IIUC, you are proposing to first create a filter normally using netlink, then > attach it using bpf_link to the proper parent? I.e. your main contention point > is to not create filter from bpf_link, instead take a filter and attach it to a > parent with bpf_link representing this attachment? Yes, to me I don't see a reason we want to create it from bpf_link. > > But then the created filter stays with refcount of 1 until RTM_DELTFILTER, i.e. > the lifetime of the attachment may be managed by bpf_link (in that we can detach > the filter from parent) but the filter itself will not be cleaned up. One of the > goals of tying TC filter to fd was to bind lifetime of filter itself, along with > attachment. Separating both doesn't seem to buy anything here. You always create > a filter to attach somewhere. This is really odd, for two reasons: 1) Why netdev does not have such problem? bpf_xdp_link_attach() uses ifindex to locate a netdev, without creating it or cleaning it either. So, why do we never want to bind a netdev to an fd? IOW, what makes TC filters' lifetime so different from netdev? 2) All existing bpf_link targets, except netdev, are fs based, hence an fd makes sense for them naturally. 
TC filters, or any other netlink based things, are not even related to fs, hence fd does not make sense here, like we never bind a netdev to a fd. > > With actions, things are different, you may create one action but bind it to > multiple filters, so actions existing as their own thing makes sense. A single > action can serve multiple filters, and save on memory. > > You could argue that even with filters this is true, as you may want to attach > the same filter to multiple qdiscs, but we already have a facility to do that > (shared tcf_block with block->q == NULL). However that is not as flexible as > what you are proposing. True. I think making TC filters as standalone as TC actions is a right direction, if it helps you too. > > It may be odd from the kernel side but to userspace a parent, prio, handle (we > don't let user choose anything else for now) is itself the attach point, how > bpf_link manages the attachment internally isn't really that interesting. It > does so now by way of creating an object that represents a certain hook, then > binding the BPF prog to it. I consider this mostly an implementation detail. > What you are really attaching to is the qdisc/block, which is the resource > analogous to cgroup fd, netns fd, and ifindex, and 'where' is described by other > attributes. How do you establish the analogy here? cgroup and netns are fs based, having an fd is natural. ifindex is not an fd, it is a locator for netdev. Plus, current bpf_link code does not create any of them. Thanks.
On Tue, Jun 08, 2021 at 07:30:40AM IST, Cong Wang wrote: > On Sun, Jun 6, 2021 at 11:08 PM Kumar Kartikeya Dwivedi > <memxor@gmail.com> wrote: > > > > On Mon, Jun 07, 2021 at 10:48:04AM IST, Cong Wang wrote: > > > On Sun, Jun 6, 2021 at 8:38 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote: > > > > > > > > On Mon, Jun 07, 2021 at 05:07:28AM IST, Cong Wang wrote: > > > > > On Fri, May 28, 2021 at 1:00 PM Kumar Kartikeya Dwivedi > > > > > <memxor@gmail.com> wrote: > > > > > > > > > > > > This is the first RFC version. > > > > > > > > > > > > This adds a bpf_link path to create TC filters tied to cls_bpf classifier, and > > > > > > introduces fd based ownership for such TC filters. Netlink cannot delete or > > > > > > replace such filters, but the bpf_link is severed on indirect destruction of the > > > > > > filter (backing qdisc being deleted, or chain being flushed, etc.). To ensure > > > > > > that filters remain attached beyond process lifetime, the usual bpf_link fd > > > > > > pinning approach can be used. > > > > > > > > > > I have some troubles understanding this. So... why TC filter is so special > > > > > here that it deserves such a special treatment? > > > > > > > > > > > > > So the motivation behind this was automatic cleanup of filters installed by some > > > > program. Usually from the userspace side you need a bunch of things (handle, > > > > priority, protocol, chain_index, etc.) to be able to delete a filter without > > > > stepping on others' toes. Also, there is no gurantee that filter hasn't been > > > > replaced, so you need to check some other way (either tag or prog_id, but these > > > > are also not perfect). > > > > > > > > bpf_link provides isolation from netlink and fd-based lifetime management. As > > > > for why it needs special treatment (by which I guess you mean why it _creates_ > > > > an object instead of simply attaching to one, see below): > > > > > > Are you saying TC filter is not independent? IOW, it has to rely on TC qdisc > > > to exist. This is true, and is of course different with netns/cgroup. > > > This is perhaps > > > not hard to solve, because TC actions are already independent, we can perhaps > > > convert TC filters too (the biggest blocker is compatibility). > > > > > > > True, but that would mean you need some way to create a detached TC filter, correct? > > Can you give some ideas on how the setup would look like from userspace side? > > > > IIUC you mean > > > > RTM_NEWTFILTER (with kind == bpf) parent == SOME_MAGIC_DETACHED ifindex == INVALID > > > > then bpf_link comes in and creates the binding to the qdisc, parent, prio, > > chain, handle ... ? > > Roughly yes, except creation is still done by netlink, not bpf_link. It is > pretty much similar to those unbound TC actions. > Right, thanks for explaining. I will try to work on this and see if it works out. > > > > > Or do you just need an ephemeral representation of a TC filter which only exists > > > for a process? If so, see below. > > > > > > > > > > > > The reason why I ask is that none of other bpf links actually create any > > > > > object, they merely attach bpf program to an existing object. For example, > > > > > netns bpf_link does not create an netns, cgroup bpf_link does not create > > > > > a cgroup either. So, why TC filter is so lucky to be the first one requires > > > > > creating an object? > > > > > > > > > > > > > They are created behind the scenes, but are also fairly isolated from netlink > > > > (i.e. 
can only be introspected, so not hidden in that sense, but are > > > > effectively locked for replace/delete). > > > > > > > > The problem would be (of not creating a filter during attach) is that a typical > > > > 'attach point' for TC exists in form of tcf_proto. If we take priority (protocol > > > > is fixed) out of the equation, it becomes possible to attach just a single BPF > > > > prog, but that seems like a needless limitation when TC already supports list of > > > > filters at each 'attach point'. > > > > > > > > My point is that the created object (the tcf_proto under the 'chain' object) is > > > > the attach point, and since there can be so many, keeping them around at all > > > > times doesn't make sense, so the refcounted attach locations are created as > > > > needed. Both netlink and bpf_link owned filters can be attached under the same > > > > location, with different ownership story in userspace. > > > > > > I do not understand "created behind the scenes". These are all created > > > independent of bpf_link, right? For example, we create a cgroup with > > > mkdir, then open it and pass the fd to bpf_link. Clearly, cgroup is not > > > created by bpf_link or any bpf syscall. > > > > Sorry, that must be confusing. I was only referring to what this patch does. > > Indeed, as far as implementation is concerned this has no precedence. > > > > > > > > The only thing different is fd, or more accurately, an identifier to locate > > > these objects. For example, ifindex can also be used to locate a netdev. > > > We can certainly locate a TC filter with (prio,proto,handle) but this has to > > > be passed via netlink. So if you need some locator, I think we can > > > introduce a kernel API which takes all necessary parameters to locate > > > a TC filter and return it to you. For a quick example, like this: > > > > > > struct tcf_proto *tcf_get_proto(struct net *net, int ifindex, > > > int parent, char* kind, int handle...); > > > > > > > I think this already exists in some way, i.e. you can just ignore if filter > > handle from tp->ops->get doesn't exist (reusing the exsiting code) that walks > > from qdisc/block -> chain -> tcf_proto during creation. > > Right, except currently it requires a few API's to reach TC filters > (first netdev,, > then qdisc, finally filters). So, I think providing one API could at > least address > your "stepping on others toes" concern? > > > > > > (Note, it can grab a refcnt in case of being deleted by others.) > > > > > > With this, you can simply call it in bpf_link, and attach bpf prog to tcf_proto > > > (of course, only cls_bpf succeeds here). > > > > > > > So IIUC, you are proposing to first create a filter normally using netlink, then > > attach it using bpf_link to the proper parent? I.e. your main contention point > > is to not create filter from bpf_link, instead take a filter and attach it to a > > parent with bpf_link representing this attachment? > > Yes, to me I don't see a reason we want to create it from bpf_link. > > > > > But then the created filter stays with refcount of 1 until RTM_DELTFILTER, i.e. > > the lifetime of the attachment may be managed by bpf_link (in that we can detach > > the filter from parent) but the filter itself will not be cleaned up. One of the > > goals of tying TC filter to fd was to bind lifetime of filter itself, along with > > attachment. Separating both doesn't seem to buy anything here. You always create > > a filter to attach somewhere. 
> > This is really odd, for two reasons: > > 1) Why netdev does not have such problem? bpf_xdp_link_attach() uses > ifindex to locate a netdev, without creating it or cleaning it either. > So, why do we > never want to bind a netdev to an fd? IOW, what makes TC filters' lifetime so > different from netdev? > I think I tried to explain the difference, but I may have failed. netdev does not have this problem because netdev is to XDP prog what qdisc is to a SCHED_CLS prog. The filter is merely a way to hook into the qdisc. So we bind the attachment's lifetime to the filter's lifetime, which in turn is controlled by the bpf_link fd. When the filter is gone, the attachment to the qdisc is gone. So we're not really creating a qdisc here, we're just tying the filter (which in the current semantics exists only while attached) to the bpf_link. The filter is the attachment, so tying its lifetime to bpf_link makes sense. When you destroy the bpf_link, the filter goes away too, which means classification at that hook (parent/class) in the qdisc stops working. This is why creating the filter from the bpf_link made sense to me. I hope you can see where I was going with this now. Introducing a new kind of method to attach to qdisc didn't seem wise to me, given all the infrastructure already exists. > 2) All existing bpf_link targets, except netdev, are fs based, hence an fd makes > sense for them naturally. TC filters, or any other netlink based > things, are not even > related to fs, hence fd does not make sense here, like we never bind a netdev > to a fd. > Yes, none of them create any objects. It is only a side effect of current semantics that you are able to control the filter's lifetime using the bpf_link as filter creation is also accompanied with its attachment to the qdisc. Your unbound filter idea just separates the two. One will still end up creating a cls_bpf_prog object internally in the kernel, just that it will now be refcounted and be linked into multiple tcf_proto (based on how many bpf_link's are attached). Another additional responsibility of the user space is to now clean up these unbound filters when it is done using them (either right after making a bpf_link attachment so that it is removed on bpf_link destruction, or later), because they don't sit under any chain etc. so a full flush of filters won't remove them. > > > > With actions, things are different, you may create one action but bind it to > > multiple filters, so actions existing as their own thing makes sense. A single > > action can serve multiple filters, and save on memory. > > > > You could argue that even with filters this is true, as you may want to attach > > the same filter to multiple qdiscs, but we already have a facility to do that > > (shared tcf_block with block->q == NULL). However that is not as flexible as > > what you are proposing. > > True. I think making TC filters as standalone as TC actions is a right > direction, > if it helps you too. > > > > > It may be odd from the kernel side but to userspace a parent, prio, handle (we > > don't let user choose anything else for now) is itself the attach point, how > > bpf_link manages the attachment internally isn't really that interesting. It > > does so now by way of creating an object that represents a certain hook, then > > binding the BPF prog to it. I consider this mostly an implementation detail. 
> > What you are really attaching to is the qdisc/block, which is the resource > > analogous to cgroup fd, netns fd, and ifindex, and 'where' is described by other > > attributes. > > How do you establish the analogy here? cgroup and netns are fs based, > having an fd is natural. ifindex is not an fd, it is a locator for netdev. Plus, > current bpf_link code does not create any of them. > > Thanks. -- Kartikeya
On Tue, Jun 8, 2021 at 12:20 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote: > > > 2) All existing bpf_link targets, except netdev, are fs based, hence an fd makes > > sense for them naturally. TC filters, or any other netlink based fs analogy is not applicable. bpf_link-s for tracing and xdp have nothing to do with file systems. > > things, are not even > > related to fs, hence fd does not make sense here, like we never bind a netdev > > to a fd. > > > > Yes, none of them create any objects. It is only a side effect of current > semantics that you are able to control the filter's lifetime using the bpf_link > as filter creation is also accompanied with its attachment to the qdisc. I think it makes sense to create these objects as part of establishing bpf_link. ingress qdisc is a fake qdisc anyway. If we could go back in time I would argue that its existence doesn't need to be shown in iproute2. It's an object that serves no purpose other than attaching filters to it. It doesn't do any queuing unlike real qdiscs. It's an artifact of old choices. Old doesn't mean good. The kernel is full of such quirks and oddities. New api-s shouldn't blindly follow them. tc qdisc add dev eth0 clsact is a useless command with nop effect.
On Tue, Jun 8, 2021 at 12:20 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> So we're not really creating a qdisc here, we're just tying the filter (which in
> the current semantics exists only while attached) to the bpf_link. The filter is
> the attachment, so tying its lifetime to bpf_link makes sense. When you destroy
> the bpf_link, the filter goes away too, which means classification at that
> hook (parent/class) in the qdisc stops working. This is why creating the filter
> from the bpf_link made sense to me.

I see why you are creating TC filters now: you are trying to force the
lifetime of a bpf target to align with the bpf program itself. The deeper
reason seems to be that a cls_bpf filter looks so small that it appears to you
that it has nothing but a bpf_prog, right?

I offer two different views here:

1. If you view a TC filter as an instance, like a netdev/qdisc/action, they
are no different from this perspective. Maybe the fact that a TC filter
resides in a qdisc makes a slight difference here, but like I mentioned, it
actually makes sense to let TC filters be standalone; qdiscs just have to bind
with them, like how we bind TC filters with standalone TC actions. These are
all updated independently, despite some of them residing in another. There
should not be an exceptional TC filter which cannot be updated via the
`tc filter` command.

2. For cls_bpf specifically, it is also an instance, like all other TC
filters. You can update it in the same way: tc filter change [...] The only
difference is that a bpf program can attach to such an instance, so you can
view the bpf program attached to cls_bpf as a property of it. From this point
of view, there is no difference from XDP and netdev, where an XDP program
attached to a netdev is also a property of the netdev. A netdev can still
function without XDP; same for cls_bpf, which can just be a nop that returns
TC_ACT_SHOT (or whatever) if no bpf program is attached. Thus, the lifetime of
a bpf program can be separated from the target it attaches to, like all other
bpf_link targets. bpf_link is just a supplement to `tc filter change cls_bpf`,
not a replacement for it.

This is actually simpler: you do not need to worry about whether the netdev is
destroyed when you detach the XDP bpf_link anyway, and the same holds for
cls_bpf filters. Likewise, TC filters don't need to worry about any associated
bpf_links.

Thanks.
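Under this view, the classify path of a standalone cls_bpf instance would
degrade gracefully when nothing is attached, roughly like this (a sketch only;
the bpf_prog member on tcf_proto is invented for illustration, it is not an
existing field):

    static int cls_bpf_classify(struct sk_buff *skb,
                                const struct tcf_proto *tp,
                                struct tcf_result *res)
    {
            /* Hypothetical field: the attached prog as a detachable
             * property of the filter instance.
             */
            struct bpf_prog *filter = rcu_dereference_bh(tp->bpf_prog);

            /* The filter keeps functioning, as a nop verdict, without
             * any bpf program attached.
             */
            if (!filter)
                    return TC_ACT_SHOT;  /* or TC_ACT_UNSPEC; a policy choice */

            return bpf_prog_run(filter, skb);
    }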
On Tue, Jun 8, 2021 at 8:39 AM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > I think it makes sense to create these objects as part of establishing bpf_link. > ingress qdisc is a fake qdisc anyway. > If we could go back in time I would argue that its existence doesn't > need to be shown in iproute2. It's an object that serves no purpose > other than attaching filters to it. It doesn't do any queuing unlike > real qdiscs. > It's an artifact of old choices. Old doesn't mean good. > The kernel is full of such quirks and oddities. New api-s shouldn't > blindly follow them. > tc qdisc add dev eth0 clsact > is a useless command with nop effect. Sounds like you just need a new bpf attach point outside of TC, probably inside __dev_queue_xmit(). You don't need to create any object, probably just need to attach it to a netdev. Thanks.
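Such an attach point might amount to little more than this in the xmit path
(purely a sketch of the suggestion; the bpf_egress_prog netdev field is
invented for illustration):

    /* Early in __dev_queue_xmit(), before entering the qdisc layer */
    struct bpf_prog *egress = rcu_dereference_bh(dev->bpf_egress_prog);

    if (egress) {
            switch (bpf_prog_run(egress, skb)) {
            case TC_ACT_OK:
                    break;                  /* continue down the stack */
            case TC_ACT_SHOT:
            default:
                    kfree_skb(skb);
                    return NET_XMIT_DROP;
            }
    }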
On Fri, Jun 11, 2021 at 07:30:49AM IST, Cong Wang wrote: > On Tue, Jun 8, 2021 at 12:20 AM Kumar Kartikeya Dwivedi > <memxor@gmail.com> wrote: > > > > So we're not really creating a qdisc here, we're just tying the filter (which in > > the current semantics exists only while attached) to the bpf_link. The filter is > > the attachment, so tying its lifetime to bpf_link makes sense. When you destroy > > the bpf_link, the filter goes away too, which means classification at that > > hook (parent/class) in the qdisc stops working. This is why creating the filter > > from the bpf_link made sense to me. > > I see why you are creating TC filters now, because you are trying to > force the lifetime of a bpf target to align with the bpf program itself. > The deeper reason seems to be that a cls_bpf filter looks so small > that it appears to you that it has nothing but a bpf_prog, right? > Yes, pretty much. > I offer two different views here: > > 1. If you view a TC filter as an instance as a netdev/qdisc/action, they > are no different from this perspective. Maybe the fact that a TC filter > resides in a qdisc makes a slight difference here, but like I mentioned, it > actually makes sense to let TC filters be standalone, qdisc's just have to > bind with them, like how we bind TC filters with standalone TC actions. You propose something different below IIUC, but I explained why I'm wary of these unbound filters. They seem to add a step to classifier setup for no real benefit to the user (except keeping track of one more object and cleaning it up with the link when done). I understand that the filter is very much an object of its own and why keeping them unbound makes sense, but for the user there is no real benefit of this scheme (some things like classid attribute are contextual in that they make sense to be set based on what parent we're attaching to). > These are all updated independently, despite some of them residing in > another. There should not be an exceptional TC filter which can not > be updated via `tc filter` command. I see, but I'm mirroring what was done for XDP bpf_link. Besides, flush still works, it's only that manipulating a filter managed by bpf_link is not allowed, which sounds reasonable to me, given we're bringing new ownership semantics here which didn't exist before with netlink, so it doesn't make sense to allow netlink to simply invalidate the filter installed by some other program. You wouldn't do something like that for a cooperating setup, we're just enforcing that using -EPERM (bpf_link is not allowed to replace netlink installed filters either, so it goes both ways). > > 2. For cls_bpf specifically, it is also an instance, like all other TC filters. > You can update it in the same way: tc filter change [...] The only difference > is a bpf program can attach to such an instance. So you can view the bpf > program attached to cls_bpf as a property of it. From this point of view, > there is no difference with XDP to netdev, where an XDP program > attached to a netdev is also a property of netdev. A netdev can still > function without XDP. Same for cls_bpf, it can be just a nop returns > TC_ACT_SHOT (or whatever) if no ppf program is attached. Thus, > the lifetime of a bpf program can be separated from the target it > attaches too, like all other bpf_link targets. bpf_link is just a > supplement to `tc filter change cls_bpf`, not to replace it. 
> So this is different now, as in the filter is attached as usual but bpf_link represents attachment of bpf prog to the filter itself, not the filter to the qdisc. To me it seems apart from not having to create the filter, this would pretty much be equivalent to where I hook the bpf_link right now? TBF, this split doesn't really seem to be bringing anything to the table (except maybe preserving netlink as the only way to manipulate filter properties) and keeping filters as separate objects. I can understand your position but for the user it's just more and more objects to keep track of with no proper ownership/cleanup semantics. Though considering it for cls_bpf in particular, there are mainly four things you would want to tc filter change: * Integrated actions These are not allowed anyway, we force enable direct action mode, and I don't plan on opening up actions for this if it gets accepted. Anything missing we'll try to make it work in eBPF (act_ct etc.) * classid cls_bpf has a good alternative of instead manipulating __sk_buff::tc_classid * skip_hw/skip_sw Not supported for now, but can be done using flags in BPF_LINK_UPDATE * BPF program Already works using BPF_LINK_UPDATE So bpf_link isn't really prohibitive in any way. Doing it your way also complicates cleanup of the filter (in case we don't want to leave it attached), because it is hard to know who closes the link_fd last. Closing it earlier would break the link for existing users, not doing it would leave around an unused object (which can accumulate if we use auto allocation of filter priority). Counting existing links is racy. This is better done in the kernel than worked around in userspace, as part of attachment. > This is actually simpler, you do not need to worry about whether > netdev is destroyed when you detach the XDP bpf_link anyway, > same for cls_bpf filters. Likewise, TC filters don't need to worry > about bpf_links associated. > > Thanks. -- Kartikeya
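As a sketch of the classid point above: a minimal direct-action cls_bpf program that picks the class itself via __sk_buff::tc_classid instead of the filter's netlink-configured classid attribute (the class 1:10 here is an arbitrary example):

  #include <linux/bpf.h>
  #include <linux/pkt_cls.h>
  #include <linux/pkt_sched.h>
  #include <bpf/bpf_helpers.h>

  /* Sketch: in direct-action mode the program can set the class
   * itself by writing skb->tc_classid before returning a verdict.
   */
  SEC("classifier")
  int set_class(struct __sk_buff *skb)
  {
          skb->tc_classid = TC_H_MAKE(1 << 16, 0x10); /* class 1:10 */
          return TC_ACT_OK;
  }

  char LICENSE[] SEC("license") = "GPL";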
On Fri, Jun 11, 2021 at 07:30:49AM IST, Cong Wang wrote: > I see why you are creating TC filters now, because you are trying to > force the lifetime of a bpf target to align with the bpf program itself. > The deeper reason seems to be that a cls_bpf filter looks so small > that it appears to you that it has nothing but a bpf_prog, right? > Just to clarify on this further, the BPF program still has its own lifetime: the link takes a reference, and the filter also takes a reference on it (since it assumes ownership, so it was easier that way). When releasing the bpf_link, if the prog pointer is set, we also detach the TC filter (which releases its reference on the prog). The link on destruction releases its own reference. So the rest of the refcounting will depend on userspace holding/pinning the fd or not. -- Kartikeya
Hi, Sorry - but i havent kept up with some of the discussion, so it is possible I may be misunderstanding some things you mention in passing below (example that you only support da mode or the classid being able to be handled differently etc). XDP may not be the best model to follow since some things that exist in the tc architecture (example ability to have multi-programs) seem to be plumbed in later (mostly because the original design intent for XDP was to make it simple and then deployment follow and more features get added) Integrating tc into libbpf is a definite bonus that allows for a unified programmatic interface and a singular loading mechanism - but it wasnt clear why we lose some features that tc provides; we have them today with current tc based loading scheme. I certainly use the non-da scheme because over time it became clear that complex programs (not necessarily large code size) are a challenge with ebpf and using existing tc actions is valuable. Also, multiple priorities are important for the same reason - you can work around them in your singular ebpf program but sooner than later you will run out of "tricks". We do have this monthly tc meetup every second monday of the month. Unfortunately it is short notice since the next one is monday 12pm eastern time. Maybe you can show up and a high bandwidth discussion (aka voice) would help? cheers, jamal On 2021-06-12 10:53 p.m., Kumar Kartikeya Dwivedi wrote: > On Fri, Jun 11, 2021 at 07:30:49AM IST, Cong Wang wrote: >> On Tue, Jun 8, 2021 at 12:20 AM Kumar Kartikeya Dwivedi >> <memxor@gmail.com> wrote: >>> >>> So we're not really creating a qdisc here, we're just tying the filter (which in >>> the current semantics exists only while attached) to the bpf_link. The filter is >>> the attachment, so tying its lifetime to bpf_link makes sense. When you destroy >>> the bpf_link, the filter goes away too, which means classification at that >>> hook (parent/class) in the qdisc stops working. This is why creating the filter >>> from the bpf_link made sense to me. >> >> I see why you are creating TC filters now, because you are trying to >> force the lifetime of a bpf target to align with the bpf program itself. >> The deeper reason seems to be that a cls_bpf filter looks so small >> that it appears to you that it has nothing but a bpf_prog, right? >> > > Yes, pretty much. > >> I offer two different views here: >> >> 1. If you view a TC filter as an instance as a netdev/qdisc/action, they >> are no different from this perspective. Maybe the fact that a TC filter >> resides in a qdisc makes a slight difference here, but like I mentioned, it >> actually makes sense to let TC filters be standalone, qdisc's just have to >> bind with them, like how we bind TC filters with standalone TC actions. > > You propose something different below IIUC, but I explained why I'm wary of > these unbound filters. They seem to add a step to classifier setup for no real > benefit to the user (except keeping track of one more object and cleaning it > up with the link when done). > > I understand that the filter is very much an object of its own and why keeping > them unbound makes sense, but for the user there is no real benefit of this > scheme (some things like classid attribute are contextual in that they make > sense to be set based on what parent we're attaching to). > >> These are all updated independently, despite some of them residing in >> another. There should not be an exceptional TC filter which can not >> be updated via `tc filter` command.
> > I see, but I'm mirroring what was done for XDP bpf_link. > > Besides, flush still works, it's only that manipulating a filter managed by > bpf_link is not allowed, which sounds reasonable to me, given we're bringing > new ownership semantics here which didn't exist before with netlink, so it > doesn't make sense to allow netlink to simply invalidate the filter installed by > some other program. > > You wouldn't do something like that for a cooperating setup, we're just > enforcing that using -EPERM (bpf_link is not allowed to replace netlink > installed filters either, so it goes both ways). > >> >> 2. For cls_bpf specifically, it is also an instance, like all other TC filters. >> You can update it in the same way: tc filter change [...] The only difference >> is a bpf program can attach to such an instance. So you can view the bpf >> program attached to cls_bpf as a property of it. From this point of view, >> there is no difference with XDP to netdev, where an XDP program >> attached to a netdev is also a property of netdev. A netdev can still >> function without XDP. Same for cls_bpf, it can be just a nop that returns >> TC_ACT_SHOT (or whatever) if no bpf program is attached. Thus, >> the lifetime of a bpf program can be separated from the target it >> attaches to, like all other bpf_link targets. bpf_link is just a >> supplement to `tc filter change cls_bpf`, not to replace it. >> > > So this is different now, as in the filter is attached as usual but bpf_link > represents attachment of bpf prog to the filter itself, not the filter to the > qdisc. > > To me it seems apart from not having to create the filter, this would pretty much be > equivalent to where I hook the bpf_link right now? > > TBF, this split doesn't really seem to be bringing anything to the table (except > maybe preserving netlink as the only way to manipulate filter properties) and > keeping filters as separate objects. I can understand your position but for the > user it's just more and more objects to keep track of with no proper > ownership/cleanup semantics. > > Though considering it for cls_bpf in particular, there are mainly four things > you would want to tc filter change: > > * Integrated actions > These are not allowed anyway, we force enable direct action mode, and I don't > plan on opening up actions for this if it gets accepted. Anything missing > we'll try to make it work in eBPF (act_ct etc.) > > * classid > cls_bpf has a good alternative of instead manipulating __sk_buff::tc_classid > > * skip_hw/skip_sw > Not supported for now, but can be done using flags in BPF_LINK_UPDATE > > * BPF program > Already works using BPF_LINK_UPDATE > > So bpf_link isn't really prohibitive in any way. > > Doing it your way also complicates cleanup of the filter (in case we don't want > to leave it attached), because it is hard to know who closes the link_fd last. > Closing it earlier would break the link for existing users, not doing it would > leave around an unused object (which can accumulate if we use auto allocation of > filter priority). Counting existing links is racy. > > This is better done in the kernel than worked around in userspace, as part of > attachment. > >> This is actually simpler, you do not need to worry about whether >> netdev is destroyed when you detach the XDP bpf_link anyway, >> same for cls_bpf filters. Likewise, TC filters don't need to worry >> about bpf_links associated. >> >> Thanks. > > -- > Kartikeya >
On Mon, Jun 14, 2021 at 01:57:16AM IST, Jamal Hadi Salim wrote: > Hi, > > Sorry - but i havent kept up with some of the discussion, so it is possible I may be misunderstanding some things you mention > in passing below (example that you only support da mode or the classid being > able to be handled differently etc). > XDP may not be the best model to follow since some things that exist > in the tc architecture (example ability to have multi-programs) > seem to be plumbed in later (mostly because the original design intent > for XDP was to make it simple and then deployment follow and more > features get added) > > Integrating tc into libbpf is a definite bonus that allows for a > unified programmatic interface and a singular loading mechanism - but > it wasnt clear why we lose some features that tc provides; we have > them today with current tc based loading scheme. I certainly use the > non-da scheme because over time it became clear that complex > programs (not necessarily large code size) are a challenge with ebpf > and using existing tc actions is valuable. > Also, multiple priorities are important for the same reason - you > can work around them in your singular ebpf program but sooner than > later you will run out of "tricks". > Right, also I'm just posting so that the use cases I care about are clear, and why they are not being fulfilled in some other way. How to do it is of course up to TC and BPF maintainers, which is why I'm still waiting on feedback from you, Cong and others before posting the next version. > We do have this monthly tc meetup every second monday of the month. > Unfortunately it is short notice since the next one is monday 12pm > eastern time. Maybe you can show up and a high bandwidth discussion > (aka voice) would help? > That would be best, please let me know how to join tomorrow. There are a few other things I was working on that I also want to discuss along with this. > cheers, > jamal >
On 2021-06-13 4:34 p.m., Kumar Kartikeya Dwivedi wrote: > On Mon, Jun 14, 2021 at 01:57:16AM IST, Jamal Hadi Salim wrote: > > Right, also I'm just posting so that the use cases I care about are clear, and > why they are not being fulfilled in some other way. How to do it is of course up > to TC and BPF maintainers, which is why I'm still waiting on feedback from you, > Cong and others before posting the next version. > I look at it from the perspective that if i can run something with existing tc loading mechanism then i should be able to do the same with the new (libbpf) scheme. >> We do have this monthly tc meetup every second monday of the month. >> Unfortunately it is short notice since the next one is monday 12pm >> eastern time. Maybe you can show up and a high bandwidth discussion >> (aka voice) would help? >> > > That would be best, please let me know how to join tomorrow. There are a few > other things I was working on that I also want to discuss along with this. > That would be great - thanks for your understanding. +Cc Marcelo (who is the keeper of the meetup) in case the link may change. cheers, jamal
On Sun, Jun 13, 2021 at 05:10:14PM -0400, Jamal Hadi Salim wrote: > On 2021-06-13 4:34 p.m., Kumar Kartikeya Dwivedi wrote: > > On Mon, Jun 14, 2021 at 01:57:16AM IST, Jamal Hadi Salim wrote: > > > We do have this monthly tc meetup every second monday of the month. > > > Unfortunately it is short notice since the next one is monday 12pm > > > eastern time. Maybe you can show up and a high bandwidth discussion > > > (aka voice) would help? > > > > > > > That would be best, please let me know how to join tomorrow. There are a few > > other things I was working on that I also want to discuss along with this. > > > > That would be great - thanks for your understanding. > +Cc Marcelo (who is the keeper of the meetup) > in case the link may change. We have 2 URLs for today. The official one [1] and a test one [2]. We will be testing a new video conferencing system today and depending on how it goes, we will be on one or the other. I'll try to always be present in the official one [1] to point people towards the testing one [2] in case we're there. Also, we have an agenda doc [3]. I can't openly share it with the public but if you send a request for access, I'll grant it. 1. https://meet.kernel.social/tc-meetup 2. https://www.airmeet.com/e/2494c770-cc8c-11eb-830b-e787c099d9c3 3. https://docs.google.com/document/d/1uUm_o7lR9jCAH0bqZ1dyscXZbIF4GN3mh1FwwIuePcM/edit# Marcelo
On Sat, Jun 12, 2021 at 7:54 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote: > > On Fri, Jun 11, 2021 at 07:30:49AM IST, Cong Wang wrote: > > On Tue, Jun 8, 2021 at 12:20 AM Kumar Kartikeya Dwivedi > > <memxor@gmail.com> wrote: > > > > > > So we're not really creating a qdisc here, we're just tying the filter (which in > > > the current semantics exists only while attached) to the bpf_link. The filter is > > > the attachment, so tying its lifetime to bpf_link makes sense. When you destroy > > > the bpf_link, the filter goes away too, which means classification at that > > > hook (parent/class) in the qdisc stops working. This is why creating the filter > > > from the bpf_link made sense to me. > > > > I see why you are creating TC filters now, because you are trying to > > force the lifetime of a bpf target to align with the bpf program itself. > > The deeper reason seems to be that a cls_bpf filter looks so small > > that it appears to you that it has nothing but a bpf_prog, right? > > > > Yes, pretty much. OK. Just in case of any misunderstanding: a cls_bpf filter has more than just a bpf prog, it inherits all other generic attributes, e.g. TC proto/prio, too from TC infra. If you can agree on this, then it is no different from netdev/cgroup/netns bpf_links. > > > I offer two different views here: > > > > 1. If you view a TC filter as an instance as a netdev/qdisc/action, they > > are no different from this perspective. Maybe the fact that a TC filter > > resides in a qdisc makes a slight difference here, but like I mentioned, it > > actually makes sense to let TC filters be standalone, qdisc's just have to > > bind with them, like how we bind TC filters with standalone TC actions. > > You propose something different below IIUC, but I explained why I'm wary of > these unbound filters. They seem to add a step to classifier setup for no real > benefit to the user (except keeping track of one more object and cleaning it > up with the link when done). I am not even sure if unbound filters help your case at all, making them unbound merely changes their residence, not ownership. You are trying to pass the ownership from TC to bpf_link, which is what I am against. > > I understand that the filter is very much an object of its own and why keeping > them unbound makes sense, but for the user there is no real benefit of this > scheme (some things like classid attribute are contextual in that they make > sense to be set based on what parent we're attaching to). > > > These are all updated independently, despite some of them residing in > > another. There should not be an exceptional TC filter which can not > > be updated via `tc filter` command. > > I see, but I'm mirroring what was done for XDP bpf_link. Really? Does XDP bpf_link create a netdev or remove it? I see none. It merely looks up netdev by attr->link_create.target_ifindex in bpf_xdp_link_attach(). Where does the "mirroring" come from? > > Besides, flush still works, it's only that manipulating a filter managed by > bpf_link is not allowed, which sounds reasonable to me, given we're bringing > new ownership semantics here which didn't exist before with netlink, so it > doesn't make sense to allow netlink to simply invalidate the filter installed by > some other program. > > You wouldn't do something like that for a cooperating setup, we're just > enforcing that using -EPERM (bpf_link is not allowed to replace netlink > installed filters either, so it goes both ways).
I think our argument is never who manages it, our argument is who owns it. By creating a TC filter from bpf_link and managed by bpf_link exclusively, the ownership pretty much goes to bpf_link. > > > > > 2. For cls_bpf specifically, it is also an instance, like all other TC filters. > > You can update it in the same way: tc filter change [...] The only difference > > is a bpf program can attach to such an instance. So you can view the bpf > > program attached to cls_bpf as a property of it. From this point of view, > > there is no difference with XDP to netdev, where an XDP program > > attached to a netdev is also a property of netdev. A netdev can still > > function without XDP. Same for cls_bpf, it can be just a nop that returns > > TC_ACT_SHOT (or whatever) if no bpf program is attached. Thus, > > the lifetime of a bpf program can be separated from the target it > > attaches to, like all other bpf_link targets. bpf_link is just a > > supplement to `tc filter change cls_bpf`, not to replace it. > > So this is different now, as in the filter is attached as usual but bpf_link > represents attachment of bpf prog to the filter itself, not the filter to the > qdisc. Yes, I think this is the right view of cls_bpf. It contains more than just a bpf prog, its generic part (struct tcf_proto) contains other attributes of this filter inherited from TC infra. And of course, TC actions can be inherited too (for non-DA). > > To me it seems apart from not having to create the filter, this would pretty much be > equivalent to where I hook the bpf_link right now? > > TBF, this split doesn't really seem to be bringing anything to the table (except > maybe preserving netlink as the only way to manipulate filter properties) and > keeping filters as separate objects. I can understand your position but for the > user it's just more and more objects to keep track of with no proper > ownership/cleanup semantics. > > Though considering it for cls_bpf in particular, there are mainly four things > you would want to tc filter change: > > * Integrated actions > These are not allowed anyway, we force enable direct action mode, and I don't > plan on opening up actions for this if it gets accepted. Anything missing > we'll try to make it work in eBPF (act_ct etc.) > > * classid > cls_bpf has a good alternative of instead manipulating __sk_buff::tc_classid > > * skip_hw/skip_sw > Not supported for now, but can be done using flags in BPF_LINK_UPDATE > > * BPF program > Already works using BPF_LINK_UPDATE Our argument is never which pieces of cls_bpf should be updated by TC or bpf_link. It is always ownership. TC should own TC filters, even its name tells so. You are trying to create TC filters with bpf_link which are not even owned by TC. And more importantly, you are not doing the same for other bpf_link targets, by singling out TC filters for no valid reason. > > So bpf_link isn't really prohibitive in any way. > > Doing it your way also complicates cleanup of the filter (in case we don't want > to leave it attached), because it is hard to know who closes the link_fd last. > Closing it earlier would break the link for existing users, not doing it would > leave around an unused object (which can accumulate if we use auto allocation of > filter priority). Counting existing links is racy. > > This is better done in the kernel than worked around in userspace, as part of > attachment. I am not proposing anything for your case, I am only explaining why creating TC filters exclusively via bpf_link does not make sense to me. Thanks.
Cong Wang <xiyou.wangcong@gmail.com> writes: >> > I offer two different views here: >> > >> > 1. If you view a TC filter as an instance as a netdev/qdisc/action, they >> > are no different from this perspective. Maybe the fact that a TC filter >> > resides in a qdisc makes a slight difference here, but like I mentioned, it >> > actually makes sense to let TC filters be standalone, qdisc's just have to >> > bind with them, like how we bind TC filters with standalone TC actions. >> >> You propose something different below IIUC, but I explained why I'm wary of >> these unbound filters. They seem to add a step to classifier setup for no real >> benefit to the user (except keeping track of one more object and cleaning it >> up with the link when done). > > I am not even sure if unbound filters help your case at all, making > them unbound merely changes their residence, not ownership. > You are trying to pass the ownership from TC to bpf_link, which > is what I am against. So what do you propose instead? bpf_link is solving a specific problem: ensuring automatic cleanup of kernel resources held by a userspace application with a BPF component. Not all applications work this way, but for the ones that do it's very useful. But if the TC filter stays around after bpf_link detaches, that kinda defeats the point of the automatic cleanup. So I don't really see any way around transferring ownership somehow. Unless you have some other idea that I'm missing? -Toke
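To make the cleanup semantics concrete, a user-space sketch: bpf_tc_link_attach() below is a made-up stand-in for whatever link-based attach API this series ends up with, while bpf_link__pin() and bpf_link__destroy() are existing libbpf calls:

  #include <stdbool.h>
  #include <bpf/libbpf.h>

  /* Hypothetical API from this series, declared here as a placeholder. */
  struct bpf_link *bpf_tc_link_attach(struct bpf_program *prog, int ifindex);

  void attach_and_maybe_pin(struct bpf_program *prog, int ifindex, bool pin)
  {
          struct bpf_link *link;

          link = bpf_tc_link_attach(prog, ifindex); /* hypothetical */
          if (!link)
                  return;

          if (pin)
                  /* pinned links survive process exit; the filter stays
                   * attached until the pin is removed
                   */
                  bpf_link__pin(link, "/sys/fs/bpf/tc_ingress_link");

          /* without a pin, this (or process exit) severs the link and
           * the kernel removes the filter automatically
           */
          bpf_link__destroy(link);
  }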
On 6/13/21 11:10 PM, Jamal Hadi Salim wrote: > On 2021-06-13 4:34 p.m., Kumar Kartikeya Dwivedi wrote: >> On Mon, Jun 14, 2021 at 01:57:16AM IST, Jamal Hadi Salim wrote: [...] >> Right, also I'm just posting so that the use cases I care about are clear, and >> why they are not being fulifilled in some other way. How to do it is ofcourse up >> to TC and BPF maintainers, which is why I'm still waiting on feedback from you, >> Cong and others before posting the next version. > > I look at it from the perspective that if i can run something with > existing tc loading mechanism then i should be able to do the same > with the new (libbpf) scheme. The intention is not to provide a full-blown tc library (that could be subject to a libtc or such), but rather to only have libbpf abstract the tc related API that is most /relevant/ for BPF program development and /efficient/ in terms of execution in fast-path while at the same time providing a good user experience from the API itself. That is, simple to use and straight forward to explain to folks with otherwise zero experience of tc. The current implementation does all that, and from experience with large BPF programs managed via cls_bpf that is all that is actually needed from tc layer perspective. The ability to have multi programs (incl. priorities) is in the existing libbpf API as well. Best, Daniel
On 6/15/21 1:54 PM, Toke Høiland-Jørgensen wrote: > Cong Wang <xiyou.wangcong@gmail.com> writes: > [...] >>>> I offer two different views here: >>>> >>>> 1. If you view a TC filter as an instance as a netdev/qdisc/action, they >>>> are no different from this perspective. Maybe the fact that a TC filter >>>> resides in a qdisc makes a slight difference here, but like I mentioned, it >>>> actually makes sense to let TC filters be standalone, qdisc's just have to >>>> bind with them, like how we bind TC filters with standalone TC actions. >>> >>> You propose something different below IIUC, but I explained why I'm wary of >>> these unbound filters. They seem to add a step to classifier setup for no real >>> benefit to the user (except keeping track of one more object and cleaning it >>> up with the link when done). >> >> I am not even sure if unbound filters help your case at all, making >> them unbound merely changes their residence, not ownership. >> You are trying to pass the ownership from TC to bpf_link, which >> is what I am against. > > So what do you propose instead? > > bpf_link is solving a specific problem: ensuring automatic cleanup of > kernel resources held by a userspace application with a BPF component. > Not all applications work this way, but for the ones that do it's very > useful. But if the TC filter stays around after bpf_link detaches, that > kinda defeats the point of the automatic cleanup. > > So I don't really see any way around transferring ownership somehow. > Unless you have some other idea that I'm missing? Just to keep on brainstorming here, I wanted to bring back Alexei's earlier quote: > I think it makes sense to create these objects as part of establishing bpf_link. > ingress qdisc is a fake qdisc anyway. > If we could go back in time I would argue that its existence doesn't > need to be shown in iproute2. It's an object that serves no purpose > other than attaching filters to it. It doesn't do any queuing unlike > real qdiscs. > It's an artifact of old choices. Old doesn't mean good. > The kernel is full of such quirks and oddities. New api-s shouldn't > blindly follow them. > tc qdisc add dev eth0 clsact > is a useless command with nop effect. The whole bpf_link in this context feels somewhat awkward because both are two different worlds, one accessible via netlink with its own lifetime etc, the other one tied to fds and bpf syscall. Back in the days we did the cls_bpf integration since it felt the most natural at that time and it had support for both the ingress and egress side, along with the direct action support which was added later to have a proper fast path for BPF. One thing that I personally never liked is that later on tc sadly became a complex, quirky dumping ground for all the nic hw offloads (I guess mainly driven from ovs side) for which I have a hard time convincing myself that this is used at scale in production. Stuff like af699626ee26 just to pick one which annoyingly also adds to the fast path given distros will just compile in most of these things (like NET_TC_SKB_EXT)... what if such a bpf_link object is not tied at all to cls_bpf or the clsact qdisc, and instead would implement tcf_classify_{egress,ingress}() as-is in that sense, similar to the bpf_lsm hooks.
These tc BPF programs would be managed only from bpf() via tc bpf_link api, and are otherwise not bothering the classic tc command (though they could be dumped there as well for the sake of visibility; bpftool would be fitting too). However, if there is something attached from classic tc side, it would also go into the old style tcf_classify_ingress() implementation and walk whatever is there so that nothing existing breaks (same as when no bpf_link would be present so that there is no extra overhead). This would also allow for a migration path of multi prog from cls_bpf to this new implementation. Details still tbd, but I would much rather like such an approach than the currently discussed one, and it would also fit better given we don't run into this current mismatch of both worlds. Thanks, Daniel
Daniel Borkmann <daniel@iogearbox.net> writes: > On 6/15/21 1:54 PM, Toke Høiland-Jørgensen wrote: >> Cong Wang <xiyou.wangcong@gmail.com> writes: > [...] >>>>> I offer two different views here: >>>>> >>>>> 1. If you view a TC filter as an instance as a netdev/qdisc/action, they >>>>> are no different from this perspective. Maybe the fact that a TC filter >>>>> resides in a qdisc makes a slight difference here, but like I mentioned, it >>>>> actually makes sense to let TC filters be standalone, qdisc's just have to >>>>> bind with them, like how we bind TC filters with standalone TC actions. >>>> >>>> You propose something different below IIUC, but I explained why I'm wary of >>>> these unbound filters. They seem to add a step to classifier setup for no real >>>> benefit to the user (except keeping track of one more object and cleaning it >>>> up with the link when done). >>> >>> I am not even sure if unbound filters help your case at all, making >>> them unbound merely changes their residence, not ownership. >>> You are trying to pass the ownership from TC to bpf_link, which >>> is what I am against. >> >> So what do you propose instead? >> >> bpf_link is solving a specific problem: ensuring automatic cleanup of >> kernel resources held by a userspace application with a BPF component. >> Not all applications work this way, but for the ones that do it's very >> useful. But if the TC filter stays around after bpf_link detaches, that >> kinda defeats the point of the automatic cleanup. >> >> So I don't really see any way around transferring ownership somehow. >> Unless you have some other idea that I'm missing? > > Just to keep on brainstorming here, I wanted to bring back Alexei's earlier quote: > > > I think it makes sense to create these objects as part of establishing bpf_link. > > ingress qdisc is a fake qdisc anyway. > > If we could go back in time I would argue that its existence doesn't > > need to be shown in iproute2. It's an object that serves no purpose > > other than attaching filters to it. It doesn't do any queuing unlike > > real qdiscs. > > It's an artifact of old choices. Old doesn't mean good. > > The kernel is full of such quirks and oddities. New api-s shouldn't > > blindly follow them. > > tc qdisc add dev eth0 clsact > > is a useless command with nop effect. > > The whole bpf_link in this context feels somewhat awkward because both are two > different worlds, one accessible via netlink with its own lifetime etc, the other > one tied to fds and bpf syscall. Back in the days we did the cls_bpf integration > since it felt the most natural at that time and it had support for both the ingress > and egress side, along with the direct action support which was added later to have > a proper fast path for BPF. One thing that I personally never liked is that later > on tc sadly became a complex, quirky dumping ground for all the nic hw offloads (I > guess mainly driven from ovs side) for which I have a hard time convincing myself > that this is used at scale in production. Stuff like af699626ee26 just to pick one > which annoyingly also adds to the fast path given distros will just compile in most > of these things (like NET_TC_SKB_EXT)... what if such a bpf_link object is not tied > at all to cls_bpf or the clsact qdisc, and instead would implement > tcf_classify_{egress,ingress}() as-is in that sense, similar to the bpf_lsm hooks.
Meaning, > you could run existing tc BPF prog without any modifications and without additional > extra overhead (no need to walk the clsact qdisc and then again into the cls_bpf > one). These tc BPF programs would be managed only from bpf() via tc bpf_link api, > and are otherwise not bothering the classic tc command (though they could be dumped > there as well for the sake of visibility; bpftool would be fitting too). However, > if there is something attached from classic tc side, it would also go into the old > style tcf_classify_ingress() implementation and walk whatever is there so that nothing > existing breaks (same as when no bpf_link would be present so that there is no extra > overhead). This would also allow for a migration path of multi prog from cls_bpf to > this new implementation. Details still tbd, but I would much rather like such an > approach than the currently discussed one, and it would also fit better given we don't > run into this current mismatch of both worlds. So this would entail adding a separate list of BPF programs and running through those at the start of sch_handle_{egress,ingress}() I suppose? And that list of filters would only contain bpf_link-attached BPF programs, sorted by priority like TC filters? And return codes of TC_ACT_OK or TC_ACT_RECLASSIFY would continue through to tcf_classify_{egress,ingress}()? I suppose that could work; we could even stick the second filter list in struct mini_Qdisc and have clsact and bpf_link cooperate on managing that, no? That way it would also be easy to dump the BPF filters via netlink: I do think that will be the least surprising thing to do (so people can at least see there's something there with existing tools). The overhead would be a single extra branch when only one of clsact or bpf_link is in use (to check if the other list of filters is set); that's probably acceptable at this level... -Toke
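A rough sketch of that shape (illustrative pseudocode only, not actual kernel code; the per-netdev list and the helper names are made up):

  /* One extra branch when only one attachment style is in use: a
   * hypothetical bpf_link-managed program list is consulted before
   * the classic clsact walk. bpf_link_progs_run(), the field
   * dev->bpf_link_ingress and sch_handle_ingress_classic() are all
   * invented for illustration.
   */
  static struct sk_buff *sch_handle_ingress(struct sk_buff *skb)
  {
          struct bpf_link_prog_list *progs;

          progs = rcu_dereference_bh(skb->dev->bpf_link_ingress);
          if (progs) {                       /* the single extra branch */
                  switch (bpf_link_progs_run(progs, skb)) {
                  case TC_ACT_SHOT:
                          kfree_skb(skb);    /* verdict handled here */
                          return NULL;
                  case TC_ACT_OK:
                  default:
                          break;             /* continue to classic tc */
                  }
          }

          /* existing path: walk the clsact miniq, tcf_classify_ingress() */
          return sch_handle_ingress_classic(skb);
  }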
On 2021-06-15 7:07 p.m., Daniel Borkmann wrote: > On 6/13/21 11:10 PM, Jamal Hadi Salim wrote: [..] >> >> I look at it from the perspective that if i can run something with >> existing tc loading mechanism then i should be able to do the same >> with the new (libbpf) scheme. > > The intention is not to provide a full-blown tc library (that could be > subject to a > libtc or such), but rather to only have libbpf abstract the tc related > API that is > most /relevant/ for BPF program development and /efficient/ in terms of > execution in > fast-path while at the same time providing a good user experience from > the API itself. > > That is, simple to use and straight forward to explain to folks with > otherwise zero > experience of tc. The current implementation does all that, and from > experience with > large BPF programs managed via cls_bpf that is all that is actually > needed from tc > layer perspective. The ability to have multi programs (incl. priorities) > is in the > existing libbpf API as well. > Which is a fair statement, but if you take away things that work fine with current iproute2 loading I have no motivation to migrate at all. Its like that saying of "throwing out the baby with the bathwater". I want my baby. In particular, here's a list from Kartikeya's implementation: 1) Direct action mode only 2) Protocol ETH_P_ALL only 3) Only at chain 0 4) No block support I think he said priority is supported but was also originally on that list. When we discussed at the meetup it didnt seem these cost anything in terms of code complexity or usability of the API. 1) We use non-DA mode, so i cant live without that (and frankly ebpf has challenges adding complex code blocks). 2) We also use different protocols when i need to (yes, you can do the filtering in the bpf code - but why impose that if the cost of adding it is simple? and of course it is cheaper to do the check outside of ebpf) 3) We use chains outside of zero 4) So far we dont use block support but certainly my recent experiences in a deployment shows that we need to group netdevices more often than i thought was necessary. So if i could express one map shared by multiple netdevices it should cut down the user space complexity. cheers, jamal
On Wed, Jun 16, 2021 at 08:10:55PM IST, Jamal Hadi Salim wrote: > On 2021-06-15 7:07 p.m., Daniel Borkmann wrote: > > On 6/13/21 11:10 PM, Jamal Hadi Salim wrote: > > [..] > > > > > > > I look at it from the perspective that if i can run something with > > > existing tc loading mechanism then i should be able to do the same > > > with the new (libbpf) scheme. > > > > The intention is not to provide a full-blown tc library (that could be > > subject to a > > libtc or such), but rather to only have libbpf abstract the tc related > > API that is > > most /relevant/ for BPF program development and /efficient/ in terms of > > execution in > > fast-path while at the same time providing a good user experience from > > the API itself. > > > > That is, simple to use and straight forward to explain to folks with > > otherwise zero > > experience of tc. The current implementation does all that, and from > > experience with > > large BPF programs managed via cls_bpf that is all that is actually > > needed from tc > > layer perspective. The ability to have multi programs (incl. priorities) > > is in the > > existing libbpf API as well. > > > > Which is a fair statement, but if you take away things that work fine > with current iproute2 loading I have no motivation to migrate at all. > Its like that saying of "throwing out the baby with the bathwater". > I want my baby. > > In particular, here's a list from Kartikeya's implementation: > > 1) Direct action mode only > 2) Protocol ETH_P_ALL only > 3) Only at chain 0 > 4) No block support > Block is supported, you just need to set TCM_IFINDEX_MAGIC_BLOCK as ifindex and parent as the block index. There isn't anything more to it than that from libbpf side (just specify BPF_TC_CUSTOM enum). What I meant was that hook_create doesn't support specifying the ingress/egress block when creating clsact, but that typically isn't a problem because qdiscs for shared blocks would be set up together prior to the attachment anyway. > I think he said priority is supported but was also originally on that > list. > When we discussed at the meetup it didnt seem these cost anything > in terms of code complexity or usability of the API. > > 1) We use non-DA mode, so i cant live without that (and frankly ebpf > has challenges adding complex code blocks). > > 2) We also use different protocols when i need to > (yes, you can do the filtering in the bpf code - but why impose that > if the cost of adding it is simple? and of course it is cheaper to do > the check outside of ebpf) > 3) We use chains outside of zero > > 4) So far we dont use block support but certainly my recent experiences > in a deployment shows that we need to group netdevices more often than > i thought was necessary. So if i could express one map shared by > multiple netdevices it should cut down the user space complexity. > > cheers, > jamal -- Kartikeya
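In libbpf terms that could look roughly as follows (a sketch using the bpf_tc_* API from the recent libbpf patches; block index 22 and the prog fd are placeholders, and the shared block is assumed to have been set up beforehand, e.g. with `tc qdisc add dev eth0 ingress_block 22 clsact`):

  #include <linux/rtnetlink.h>     /* TCM_IFINDEX_MAGIC_BLOCK */
  #include <bpf/libbpf.h>

  /* Sketch: attach to a shared block rather than a device, per the
   * explanation above (magic block ifindex, block index as parent).
   */
  int attach_to_block(int prog_fd)
  {
          DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook,
                              .ifindex = (int)TCM_IFINDEX_MAGIC_BLOCK,
                              .attach_point = BPF_TC_CUSTOM,
                              .parent = 22 /* block index */);
          DECLARE_LIBBPF_OPTS(bpf_tc_opts, opts,
                              .prog_fd = prog_fd);

          return bpf_tc_attach(&hook, &opts);
  }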
On 2021-06-15 7:44 p.m., Daniel Borkmann wrote: > On 6/15/21 1:54 PM, Toke Høiland-Jørgensen wrote: >> Cong Wang <xiyou.wangcong@gmail.com> writes: > [...] > > Just to keep on brainstorming here, I wanted to bring back Alexei's > earlier quote: > > > I think it makes sense to create these objects as part of > establishing bpf_link. > > ingress qdisc is a fake qdisc anyway. > > If we could go back in time I would argue that its existence doesn't > > need to be shown in iproute2. It's an object that serves no purpose > > other than attaching filters to it. It doesn't do any queuing unlike > > real qdiscs. > > It's an artifact of old choices. Old doesn't mean good. > > The kernel is full of such quirks and oddities. New api-s shouldn't > > blindly follow them. > > tc qdisc add dev eth0 clsact > > is a useless command with nop effect. > I am not sure what Alexei's statement about old vs good was getting at. You have to have hooks/locations to stick things. Does it matter what you call that hook? > The whole bpf_link in this context feels somewhat awkward because both > are two > different worlds, one accessible via netlink with its own lifetime etc, > the other > one tied to fds and bpf syscall. Back in the days we did the cls_bpf > integration > since it felt the most natural at that time and it had support for both > the ingress > and egress side, along with the direct action support which was added > later to have > a proper fast path for BPF. One thing that I personally never liked is > that later > on tc sadly became a complex, quirky dumping ground for all the nic hw > offloads (I > guess mainly driven from ovs side) for which I have a hard time > convincing myself > that this is used at scale in production. Stuff like af699626ee26 just > to pick one > which annoyingly also adds to the fast path given distros will just > compile in most > of these things (like NET_TC_SKB_EXT)... what if such a bpf_link object is > not tied > at all to cls_bpf or the clsact qdisc, and instead would implement > tcf_classify_{egress,ingress}() as-is in that sense, similar to the bpf_lsm hooks. The choice is between generic architecture and appliance only-what-you-need code (via ebpf). Dont disagree that at times patches go in at the expense of the kernel datapath complexity or cost. Unfortunately sometimes this is because theres no sufficient review time - but thats a different topic. We try to impose a rule which states that any hardware offload has to have a kernel/software twin. Often that helps contain things. > Meaning, > you could run existing tc BPF prog without any modifications and without > additional > extra overhead (no need to walk the clsact qdisc and then again into the > cls_bpf > one). These tc BPF programs would be managed only from bpf() via tc > bpf_link api, > and are otherwise not bothering the classic tc command (though they could > be dumped > there as well for the sake of visibility; bpftool would be fitting > too). However, > if there is something attached from classic tc side, it would also go > into the old > style tcf_classify_ingress() implementation and walk whatever is there > so that nothing > existing breaks (same as when no bpf_link would be present so that there > is no extra > overhead). This would also allow for a migration path of multi prog from > cls_bpf to > this new implementation.
Details still tbd, but I would much rather like > such an > approach than the currently discussed one, and it would also fit better > given we don't > run into this current mismatch of both worlds. > The danger is totally divorcing from tc when you have special cases just for ebpf/tc, i.e. this is no different from the hardware offloads making you unhappy. The ability to use existing tools (user space tc in this case) to inter-work on both is very useful. From the discussion on the control aspect with Kartikeya i understood that we need some "transient state" which needs to get created and stored somewhere before being applied to tc (example creating the filters first and all necessary artifacts then calling internally to cls api). Seems to me that the "transient state" belongs to bpf. And i understood from Kartikeya this was his design intent as well (which seems sane to me). cheers, jamal
On 6/16/21 5:32 PM, Kumar Kartikeya Dwivedi wrote: > On Wed, Jun 16, 2021 at 08:10:55PM IST, Jamal Hadi Salim wrote: >> On 2021-06-15 7:07 p.m., Daniel Borkmann wrote: >>> On 6/13/21 11:10 PM, Jamal Hadi Salim wrote: >> >> [..] >> >>>> I look at it from the perspective that if i can run something with >>>> existing tc loading mechanism then i should be able to do the same >>>> with the new (libbpf) scheme. >>> >>> The intention is not to provide a full-blown tc library (that could be >>> subject to a >>> libtc or such), but rather to only have libbpf abstract the tc related >>> API that is >>> most /relevant/ for BPF program development and /efficient/ in terms of >>> execution in >>> fast-path while at the same time providing a good user experience from >>> the API itself. >>> >>> That is, simple to use and straight forward to explain to folks with >>> otherwise zero >>> experience of tc. The current implementation does all that, and from >>> experience with >>> large BPF programs managed via cls_bpf that is all that is actually >>> needed from tc >>> layer perspective. The ability to have multi programs (incl. priorities) >>> is in the >>> existing libbpf API as well. >> >> Which is a fair statement, but if you take away things that work fine >> with current iproute2 loading I have no motivation to migrate at all. >> Its like that saying of "throwing out the baby with the bathwater". >> I want my baby. >> >> In particular, here's a list from Kartikeya's implementation: >> >> 1) Direct action mode only (More below.) >> 2) Protocol ETH_P_ALL only The issue I see with this one is that it's not very valuable or useful from a BPF point of view. Meaning, this kind of check can and typically is implemented from BPF program anyway. For example, when you have direct packet access initially parsing the eth header anyway (and from there having logic for the various eth protos). That protocol option is maybe more useful when you have classic tc with cls+act style pipeline where you want a quick skip of classifiers to avoid reparsing the packet. Given you can do everything inside the BPF program already it adds more confusion than value for a simple libbpf [tc/BPF] API. >> 3) Only at chain 0 >> 4) No block support > > Block is supported, you just need to set TCM_IFINDEX_MAGIC_BLOCK as ifindex and > parent as the block index. There isn't anything more to it than that from libbpf > side (just specify BPF_TC_CUSTOM enum). > > What I meant was that hook_create doesn't support specifying the ingress/egress > block when creating clsact, but that typically isn't a problem because qdiscs > for shared blocks would be set up together prior to the attachment anyway. > >> I think he said priority is supported but was also originally on that >> list. >> When we discussed at the meetup it didnt seem these cost anything >> in terms of code complexity or usability of the API. >> >> 1) We use non-DA mode, so i cant live without that (and frankly ebpf >> has challenges adding complex code blocks). Could you elaborate on that or provide code examples? Since introduction of the direct action mode I've never used anything else again, and we do have complex BPF code blocks that we need to handle as well. Would be good if you could provide more details on things you ran into, maybe they can be solved? >> 2) We also use different protocols when i need to >> (yes, you can do the filtering in the bpf code - but why impose that >> if the cost of adding it is simple? 
and of course it is cheaper to do >> the check outside of ebpf) >> 3) We use chains outside of zero >> >> 4) So far we dont use block support but certainly my recent experiences >> in a deployment shows that we need to group netdevices more often than >> i thought was necessary. So if i could express one map shared by >> multiple netdevices it should cut down the user space complexity. Thanks, Daniel
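For reference, a minimal sketch of the parse-it-in-BPF argument above: the check that `protocol ip` would do at the filter level, done with direct packet access in a direct-action program instead:

  #include <linux/bpf.h>
  #include <linux/if_ether.h>
  #include <linux/pkt_cls.h>
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_endian.h>

  /* Sketch: the v4-only case boils down to one bounds check plus one
   * load-and-compare on eth->h_proto before the real processing.
   */
  SEC("classifier")
  int handle_ingress(struct __sk_buff *skb)
  {
          void *data = (void *)(long)skb->data;
          void *data_end = (void *)(long)skb->data_end;
          struct ethhdr *eth = data;

          if ((void *)(eth + 1) > data_end)
                  return TC_ACT_OK;
          if (eth->h_proto != bpf_htons(ETH_P_IP))
                  return TC_ACT_OK;        /* not v4: pass, don't parse */

          /* ... v4-only processing goes here ... */
          return TC_ACT_OK;
  }

  char LICENSE[] SEC("license") = "GPL";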
On 2021-06-16 12:00 p.m., Daniel Borkmann wrote: > On 6/16/21 5:32 PM, Kumar Kartikeya Dwivedi wrote: >> On Wed, Jun 16, 2021 at 08:10:55PM IST, Jamal Hadi Salim wrote: >>> On 2021-06-15 7:07 p.m., Daniel Borkmann wrote: >>>> On 6/13/21 11:10 PM, Jamal Hadi Salim wrote: [..] >>> >>> In particular, here's a list from Kartikeya's implementation: >>> >>> 1) Direct action mode only > > (More below.) > >>> 2) Protocol ETH_P_ALL only > > The issue I see with this one is that it's not very valuable or useful > from a BPF > point of view. Meaning, this kind of check can and typically is > implemented from > BPF program anyway. For example, when you have direct packet access > initially > parsing the eth header anyway (and from there having logic for the > various eth > protos). In that case make it optional to specify proto and default it to ETH_P_ALL. As far as i can see this flexibility doesnt complicate usability or add code complexity to the interfaces. > > That protocol option is maybe more useful when you have classic tc with > cls+act > style pipeline where you want a quick skip of classifiers to avoid > reparsing the > packet. Given you can do everything inside the BPF program already it > adds more > confusion than value for a simple libbpf [tc/BPF] API. > There's no point in repeating an operation of identifying the protocol type which can be/has already been identified by the calling (into ebpf) code. If all i am interested in is IPv4, then my ebpf parser can be simplified if i am sure i can assume it is an IPv4 packet. [..] >>> 1) We use non-DA mode, so i cant live without that (and frankly ebpf >>> has challenges adding complex code blocks). > > Could you elaborate on that or provide code examples? Since introduction > of the > direct action mode I've never used anything else again, and we do have > complex > BPF code blocks that we need to handle as well. Would be good if you > could provide > more details on things you ran into, maybe they can be solved? > Main issue is code complexity in ebpf and not so much instruction count (which is complicated once you have bounded loops). Earlier, I tried to post on the ebpf list but i got no response. I moved on since. I would like to engage you at some point - and you are right there may be some clever tricks to achieve the goals we had. The challenge is in keeping up with the bag of tricks to make the verifier happy. Being able to run non-da mode and for example attach an action such as the policer (and others) has pragmatic uses. It would be quite complex to implement the policer within an all-in-one-appliance da-mode ebpf code. One approach is to add more helpers to invoke such code directly from ebpf - but we have some restrictions; the deployment is RHEL8.3 based and we have to live with the kernel features supported there. i.e. kernel upgrade is a no-no. Given all these TC features have existed (and been stable) for 100 years it makes a lot of sense to use them. We are going to present some of the challenges we faced in a subset of our work in an approach to replace iptables at netdev 0x15 (hopefully we get accepted). cheers, jamal
On Fri, Jun 18, 2021 at 4:40 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > We are going to present some of the challenges we faced in a subset > of our work in an approach to replace iptables at netdev 0x15 > (hopefully we get accepted). Jamal, please stop using netdev@vger mailing list to promote a conference that does NOT represent the netdev kernel community. Slides shown at that conference is a non-event as far as this discussion goes.
On 2021-06-18 10:38 a.m., Alexei Starovoitov wrote: > On Fri, Jun 18, 2021 at 4:40 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: >> >> We are going to present some of the challenges we faced in a subset >> of our work in an approach to replace iptables at netdev 0x15 >> (hopefully we get accepted). > > Jamal, > please stop using netdev@vger mailing list to promote a conference > that does NOT represent the netdev kernel community. > > Slides shown at that conference is a non-event as far as this discussion goes. Alexei, Tame the aggression, would you please? You have no right to make claims as to who represents the community. Absolutely none. So get off that high horse. I only mentioned the slides because, when done, they will be a good spot which captures the issues. As i mentioned, i actually did send some email (some Cced to you) but got no response. I dont mind having a discussion but you have to be willing to listen as well. cheers, jamal
On Fri, Jun 18, 2021 at 7:50 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > On 2021-06-18 10:38 a.m., Alexei Starovoitov wrote: > > On Fri, Jun 18, 2021 at 4:40 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > >> > >> We are going to present some of the challenges we faced in a subset > >> of our work in an approach to replace iptables at netdev 0x15 > >> (hopefully we get accepted). > > > > Jamal, > > please stop using netdev@vger mailing list to promote a conference > > that does NOT represent the netdev kernel community. > > > > Slides shown at that conference is a non-event as far as this discussion goes. > > Alexei, > Tame the aggression, would you please? > You have no right to make claims as to who represents the community. > Absolutely none. So get off that high horse. > > I only mentioned the slides because it will be a good spot when > done which captures the issues. As i mentioned in i actually did > send some email (some Cced to you) but got no response. > I dont mind having a discussion but you have to be willing to > listen as well. You've side tracked technical discussion to promote your own conference. That's not acceptable. Please use other forums for marketing. This mailing list is for technical discussions.
On 2021-06-18 12:23 p.m., Alexei Starovoitov wrote: > On Fri, Jun 18, 2021 at 7:50 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > [..] >> Alexei, >> Tame the aggression, would you please? >> You have no right to make claims as to who represents the community. >> Absolutely none. So get off that high horse. >> >> I only mentioned the slides because it will be a good spot when >> done which captures the issues. As i mentioned in i actually did >> send some email (some Cced to you) but got no response. >> I dont mind having a discussion but you have to be willing to >> listen as well. > > You've side tracked technical discussion to promote your own conference. > That's not acceptable. Please use other forums for marketing. > > This mailing list is for technical discussions. I just made a statement in passing and you took it to a tangent. If you are so righteous, why didnt you just stick to making technical comments? Stop making bold statements and then playing the victim. cheers, jamal
On 6/18/21 1:40 PM, Jamal Hadi Salim wrote: > On 2021-06-16 12:00 p.m., Daniel Borkmann wrote: >> On 6/16/21 5:32 PM, Kumar Kartikeya Dwivedi wrote: >>> On Wed, Jun 16, 2021 at 08:10:55PM IST, Jamal Hadi Salim wrote: >>>> On 2021-06-15 7:07 p.m., Daniel Borkmann wrote: >>>>> On 6/13/21 11:10 PM, Jamal Hadi Salim wrote: > > [..] > >>>> In particular, here's a list from Kartikeya's implementation: >>>> >>>> 1) Direct action mode only >> >> (More below.) >> >>>> 2) Protocol ETH_P_ALL only >> >> The issue I see with this one is that it's not very valuable or useful from a BPF >> point of view. Meaning, this kind of check can and typically is implemented from >> BPF program anyway. For example, when you have direct packet access initially >> parsing the eth header anyway (and from there having logic for the various eth >> protos). > > In that case make it optional to specify proto and default it to > ETH_P_ALL. As far as i can see this flexibility doesnt > complicate usability or add code complexity to the interfaces. From a user interface PoV it's odd since you need to go and parse that anyway, at least the programs typically start out with a switch/case on either reading the skb->protocol or getting it via eth->h_proto. But then once you extend that same program to also cover IPv6, with ETH_P_ALL you don't need to do anything in the loader application, whereas otherwise you'd additionally need to remember to downgrade ETH_P_IP to ETH_P_ALL and rebuild the loader to get v6 traffic. But even if you were to split things in the main/entry program to separate v4/v6 processing into two different ones, I expect this to be faster via tail calls (given direct absolute jump) instead of walking a list of tcf_proto objects, comparing the tp->protocol and going into a different cls_bpf instance. [...] >> Could you elaborate on that or provide code examples? Since introduction of the >> direct action mode I've never used anything else again, and we do have complex >> BPF code blocks that we need to handle as well. Would be good if you could provide >> more details on things you ran into, maybe they can be solved? > > Main issue is code complexity in ebpf and not so much instruction > count (which is complicated once you have bounded loops). > Earlier, I tried to post on the ebpf list but i got no response. > I moved on since. I would like to engage you at some point - and > you are right there may be some clever tricks to achieve the goals > we had. The challenge is in keeping up with the bag of tricks to make > the verifier happy. > Being able to run non-da mode and for example attach an action such > as the policer (and others) has pragmatic uses. It would be quite complex to implement the policer within an all-in-one-appliance > da-mode ebpf code. It may be more tricky but not impossible either, in recent years some (imho) very interesting and exciting use cases have been implemented and talked about e.g. [0-2], and with the recent linker work there could also be a [e.g. in-kernel] collection with library code that can be pulled in by others aside from using them as BPF selftests as one option. The gain you have with the flexibility [as you know] is that it allows easy integration/orchestration into user space applications and thus suitable for more dynamic envs than with old-style actions.
The issue I have with the latter is that they're not scalable enough from a SW datapath / tc fast-path perspective given you then need to fallback to old-style list processing of cls+act combinations which is also not covered / in scope for the libbpf API in terms of their setup, and additionally not all of the BPF features can be used this way either, so it'll be very hard for users to debug why their BPF programs don't work as they're expected to. But also aside from those blockers, the case with this clean slate tc BPF API is that we have a unique chance to overcome the cmdline usability struggles, and make it as straight forward as possible for new generation of users. [0] https://linuxplumbersconf.org/event/7/contributions/677/ [1] https://linuxplumbersconf.org/event/2/contributions/121/ [2] https://netdevconf.info/0x14/session.html?talk-replacing-HTB-with-EDT-and-BPF Thanks, Daniel
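[Editorial aside: to make the ETH_P_ALL argument above concrete, here is a
minimal sketch - not taken from the patch series, with an invented program
name - of a direct-action cls_bpf program that does its own protocol
dispatch via direct packet access, which is why a tp->protocol restriction
at the netlink level adds little for BPF users:

/* Sketch: direct-action tc classifier doing its own protocol check,
 * making a netlink-level protocol filter redundant. Illustrative only.
 */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("classifier")
int tc_dispatch(struct __sk_buff *skb)
{
	void *data = (void *)(long)skb->data;
	void *data_end = (void *)(long)skb->data_end;
	struct ethhdr *eth = data;

	/* verifier requires the bounds check before reading eth */
	if ((void *)(eth + 1) > data_end)
		return TC_ACT_OK;

	switch (bpf_ntohs(eth->h_proto)) {
	case ETH_P_IP:
		/* v4-specific logic would go here */
		return TC_ACT_OK;
	case ETH_P_IPV6:
		/* v6-specific logic would go here */
		return TC_ACT_OK;
	default:
		/* anything else passes through untouched */
		return TC_ACT_OK;
	}
}

char _license[] SEC("license") = "GPL";

Attaching this with ETH_P_IP at the netlink level would only duplicate the
eth->h_proto check the program already performs; with ETH_P_ALL the same
object later covers v6 as well without touching the loader.]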
On 2021-06-18 6:42 p.m., Daniel Borkmann wrote:
> On 6/18/21 1:40 PM, Jamal Hadi Salim wrote:

[..]

> From a user interface PoV it's odd since you need to go and parse that
> anyway; at the least, the programs typically start out with a switch/case
> on either reading skb->protocol or getting it via eth->h_proto. But then
> once you extend that same program to also cover IPv6, you don't need to
> do anything with ETH_P_ALL from the loader application, whereas otherwise
> you'd additionally need to remember to downgrade ETH_P_IP to ETH_P_ALL
> and rebuild the loader to get v6 traffic. But even if you were to split
> things in the main/entry program to separate v4/v6 processing into two
> different ones, I expect this to be faster via tail calls (given the
> direct absolute jump) instead of walking a list of tcf_proto objects,
> comparing the tp->protocol and going into a different cls_bpf instance.

Good point on being more future proof with ETH_P_ALL.
Note: in our case we were only interested in ipv4, and I don't see that
changing for the specific prog we have. From a compute perspective, all I
am saving by not using ETH_P_ALL is one if statement (checking if the
proto is ipv4). If you feel strongly about it we can change our code. My
worry now is that if we used this approach, likely someone else in the
wild has done something similar.
I think it boils down again to: if it doesn't confuse the API or add
extra complexity, why not allow it and default to ETH_P_ALL?

On your comment that a bpf-based proto comparison is faster - the issue is
that the tp proto check always happens regardless, and ebpf, depending on
your program, may not fit all your code. For example, I may actually
decide to have separate programs for v6 and v4 with the current mechanism
- at different tc ruleset prios - just so as to work around
code/complexity issues.
BTW: the tail call limit of 32 provides an upper bound which affects the
depth of (generic) parsing. Does it make sense to allow increasing that
size (maybe on a per-boot basis)? The fact things run on the stack may be
restricting.

> It may be more tricky but not impossible either; in recent years some
> (imho) very interesting and exciting use cases have been implemented and
> talked about, e.g. [0-2], and with the recent linker work there could
> also be a [e.g. in-kernel] collection with library code that can be
> pulled in by others, aside from using them as BPF selftests, as one
> option. The gain you have with the flexibility [as you know] is that it
> allows easy integration/orchestration into user space applications and
> is thus suitable for more dynamic envs, as with old-style actions. The
> issue I have with the latter is that they're not scalable enough from a
> SW datapath / tc fast-path perspective, given you then need to fall back
> to old-style list processing of cls+act combinations, which is also not
> covered / in scope for the libbpf API in terms of their setup;
> additionally, not all of the BPF features can be used this way either,
> so it'll be very hard for users to debug why their BPF programs don't
> work as expected.
>
> But also aside from those blockers, the case with this clean slate tc
> BPF API is that we have a unique chance to overcome the cmdline
> usability struggles and make it as straightforward as possible for a new
> generation of users.
>
> [0] https://linuxplumbersconf.org/event/7/contributions/677/
> [1] https://linuxplumbersconf.org/event/2/contributions/121/
> [2] https://netdevconf.info/0x14/session.html?talk-replacing-HTB-with-EDT-and-BPF

I took a quick glance at the refs. IIUC, your message is "do more with
less", i.e. restrict choices now so we can focus on optimizing for speed.
Here's my experience. We have two pragmatic challenges:

1) In a deployment, like some enterprise class data centers, we are often
limited by the kernel and often even the distro we are on. You can't just
upgrade to the latest and greatest without risking voiding the distro
vendor's support contract. Big shops with a lot of geniuses like FB and
Google don't have these problems of course - but the majority out there
do. So even our little program must use supported interfaces to be
accepted (ex: you can't expect support on RH8.3 for an XDP issue without
using the supplied XDP lib). So building in support to use existing infra
is useful.

2) Challenges with ebpf code space and code complexity: depending on its
complexity, even a program with fewer than 4K instructions may be rejected
by the verifier. IOW, I just can't add all the features I need _even if I
wanted to_. For this reason, working cooperatively with other existing
kernel and user infra makes sense (ref [2] is doing that, for example).
You don't want to rewrite the kernel using ebpf; extending the kernel with
ebpf makes sense. And of course I don't want to lose performance, but
there may sometimes be a trade-off where a little loss in performance is
justified by the gain of a feature (the non-da example applies). Perhaps
adding more helpers to interface with the actions and classifiers is one
way forward.

cheers,
jamal

PS: I didn't understand the kernel linker point with BPF selftests.
Pointer?
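[Editorial aside: on the tail call comparison running through this
subthread, here is a rough sketch - again not from the series, with
invented slot indices and program names - of splitting v4/v6 handling into
separate programs behind one entry point using a standard
BPF_MAP_TYPE_PROG_ARRAY:

/* Sketch: one cls_bpf entry point tail-calling into per-protocol
 * programs, so the verifier's complexity budget is paid per program
 * rather than for one monolithic blob. Illustrative only.
 */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define SLOT_IPV4 0
#define SLOT_IPV6 1

struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__uint(max_entries, 2);
	__type(key, __u32);
	__type(value, __u32);
} jmp_table SEC(".maps");

SEC("classifier/ipv4")
int handle_ipv4(struct __sk_buff *skb)
{
	/* v4-only logic lives here */
	return TC_ACT_OK;
}

SEC("classifier/ipv6")
int handle_ipv6(struct __sk_buff *skb)
{
	/* v6-only logic lives here */
	return TC_ACT_OK;
}

SEC("classifier")
int entry(struct __sk_buff *skb)
{
	/* skb->protocol is the link-layer proto in network byte order */
	switch (bpf_ntohs(skb->protocol)) {
	case ETH_P_IP:
		bpf_tail_call(skb, &jmp_table, SLOT_IPV4);
		break;
	case ETH_P_IPV6:
		bpf_tail_call(skb, &jmp_table, SLOT_IPV6);
		break;
	}
	/* reached only if the slot is empty or the proto is unhandled */
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";

The loader is expected to populate jmp_table with the prog fds (e.g. via
bpf_map_update_elem()). The limit of 32 tail calls Jamal refers to bounds
how deep such chains can nest per invocation; whether that limit could be
made configurable is a separate question from this two-level split.]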