Message ID | 20221005171709.150520-1-xiyou.wangcong@gmail.com (mailing list archive)
---|---
Series | net_sched: introduce eBPF based Qdisc
On 10/05, Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>
>
> This *incomplete* patchset introduces a programmable Qdisc with eBPF.
> There are a few use cases:
>
> 1. Allow customizing Qdisc's in an easier way, so that people don't
>    have to write a complete Qdisc kernel module just to experiment
>    with some new queuing theory.
>
> 2. Solve EDT's problem. EDT calculates the "tokens" in clsact, which
>    runs before enqueue, so it is impossible to adjust those "tokens"
>    after packets get dropped in enqueue. With an eBPF Qdisc, this is
>    easily solved with a map shared between clsact and sch_bpf.
>
> 3. Replace qevents, as the user now gains much more control over the
>    skb and queues.
>
> 4. Provide a new way to reuse TC filters. Currently TC relies on filter
>    chains and blocks to reuse TC filters, but they are too complicated
>    to understand. With the eBPF helper bpf_skb_tc_classify(), we can
>    invoke TC filters on _any_ Qdisc (even on a different netdev) to do
>    the classification.
>
> 5. Potentially pave a way for ingress to queue packets, although the
>    current implementation is still egress-only.
>
> 6. Possibly pave a way for handling the TCP protocol in TC, as rbtree
>    itself is already used by TCP to handle retransmission.
>
> 7. Potentially pave a way for implementing an eBPF-based CPU scheduler,
>    because we already have task kptr and CFS is clearly based on rbtree
>    too.
>
> The goal here is to make this Qdisc as programmable as possible, that
> is, to replace as many existing Qdisc's as we can, whether in tree or
> out of tree. This is why I gave up on PIFO, which has serious
> limitations on programmability. More importantly, with rbtree it is
> easy to implement the PIFO logic too.
>
> Here is a summary of the design decisions I made:

[..]

> 1. Avoid eBPF struct_ops, as it would be really hard to program a
>    Qdisc with this approach; literally all of struct Qdisc_ops and
>    struct Qdisc_class_ops would need to be implemented. This is almost
>    as hard as programming a Qdisc kernel module.

Can you expand more on what exactly is 'hard'? We need to override
enqueue/dequeue mostly and that's it? Does it make sense to at least do,
say, sch_struct_ops? Maybe we can have some higher-level struct_ops with
enqueue/dequeue only for bpf to override? (See the sketch at the end of
this mail for what I have in mind.)

> 2. Introduce a generic rbtree map, which supports kptr, so that skb's
>    can be stored inside.

There have been several rbtree proposals. Do we really need an rbtree
here (ignoring performance)? Can we demonstrate these new qdisc
capabilities with a hashtable?
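To make the discussion concrete, here is roughly the kind of program I
assume this series is aiming to enable. Everything below is my guess at
the API from reading the cover letter, not names taken from the patches
(the map type, section names, kfuncs, context struct and return codes
are all invented for illustration; the RFC ships no samples yet):

/* Illustrative only: BPF_MAP_TYPE_RBTREE, struct bpf_qdisc_ctx,
 * bpf_skb_map_push()/bpf_skb_map_pop() and the SCH_BPF_* return codes
 * are guesses at the API, not names from this series.
 */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_RBTREE);	/* proposed rbtree map of skb kptrs */
	__uint(max_entries, 1024);
} edt_queue SEC(".maps");

SEC("qdisc/enqueue")
int edt_enqueue(struct bpf_qdisc_ctx *ctx)
{
	/* Rank packets by the EDT timestamp that clsact computed earlier,
	 * so dequeue always pops the earliest departure time.
	 */
	__u64 rank = ctx->skb->tstamp;

	if (bpf_skb_map_push(&edt_queue, ctx->skb, rank))
		return SCH_BPF_DROP;		/* queue full */
	return SCH_BPF_QUEUED;
}

SEC("qdisc/dequeue")
int edt_dequeue(struct bpf_qdisc_ctx *ctx)
{
	struct sk_buff *skb = bpf_skb_map_pop(&edt_queue);

	if (!skb)
		return SCH_BPF_NONE;		/* queue empty */
	ctx->skb = skb;				/* hand the skb kptr back to sch_bpf */
	return SCH_BPF_DEQUEUED;
}

char _license[] SEC("license") = "GPL";

If ordered extraction by such a rank is the main thing the rbtree buys
over a hashtable, it would help to spell that out (and show it in a
sample) in the next revision.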
> a) As eBPF maps are not directly visible to the kernel, we have to
>    dump the stats via the eBPF map API's instead of netlink.
> b) The user-space is not allowed to read entire packets; only the
>    __sk_buff itself is readable, because we don't have such a use case
>    yet and it would require a different API to read the data, as map
>    values have fixed length.

[..]

> c) Multi-queue support is implemented via map-in-map, in a similar
>    push/pop fashion based on rbtree too.

I took a brief look at sch_bpf.c and I don't see how that would work.
Can you expand more on that (+locking)?

Also, in sch_bpf.c, you care about classid, why? Isn't the whole purpose
of the series to expand tc capabilities? Why are we still dragging this
classid along?

> d) Use the netdevice notifier to reset the packets inside the skb map
>    upon the NETDEV_DOWN event. (TODO: need to recognize kptr type)

[..]

> 3. Integrate with existing TC infra. For example, if the user doesn't
>    want to implement her own filters (e.g. a flow dissector), she
>    should be able to re-use the existing TC filters. Another helper,
>    bpf_skb_tc_classify(), is introduced for this purpose.

Supposedly, the idea is that bpf can implement all existing policies +
more, so why are we trying to re-use existing filters?

> Any high-level feedback is welcome. Please kindly ignore any coding
> details until the RFC tag is removed.
>
> TODO:
> 1. actually test it
> 2. write a document for this Qdisc

[..]

> 3. add test cases and sample code

I'd start with that; otherwise, it's not clear what the current
capabilities are.

> Cc: Toke Høiland-Jørgensen <toke@redhat.com>
> Cc: Jamal Hadi Salim <jhs@mojatatu.com>
> Cc: Jiri Pirko <jiri@resnulli.us>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> ---
> v6: switch to kptr based approach
> v5: mv kernel/bpf/skb_map.c net/core/skb_map.c
>     implement flow map as map-in-map
>     rename bpf_skb_tc_classify() and move it to net/sched/cls_api.c
>     clean up eBPF qdisc program context
> v4: get rid of PIFO, use rbtree directly
> v3: move priority queue from sch_bpf to skb map
>     introduce skb map and its helpers
>     introduce bpf_skb_classify()
>     use netdevice notifier to reset skb's
>     Rebase on latest bpf-next
> v2: Rebase on latest net-next
>     Make the code more complete (but still incomplete)
>
> Cong Wang (5):
>   bpf: Introduce rbtree map
>   bpf: Add map in map support to rbtree
>   net_sched: introduce eBPF based Qdisc
>   net_sched: Add kfuncs for storing skb
>   net_sched: Introduce helper bpf_skb_tc_classify()
>
>  include/linux/bpf.h            |   4 +
>  include/linux/bpf_types.h      |   4 +
>  include/uapi/linux/bpf.h       |  19 ++
>  include/uapi/linux/pkt_sched.h |  17 +
>  kernel/bpf/Makefile            |   2 +-
>  kernel/bpf/rbtree.c            | 603 +++++++++++++++++++++++++++++++++
>  kernel/bpf/syscall.c           |   7 +
>  net/core/filter.c              |  27 ++
>  net/sched/Kconfig              |  15 +
>  net/sched/Makefile             |   1 +
>  net/sched/cls_api.c            |  69 ++++
>  net/sched/sch_bpf.c            | 564 ++++++++++++++++++++++++++++++
>  12 files changed, 1331 insertions(+), 1 deletion(-)
>  create mode 100644 kernel/bpf/rbtree.c
>  create mode 100644 net/sched/sch_bpf.c
>
> --
> 2.34.1
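Coming back to my struct_ops question near the top: what I have in mind
is not exposing all of Qdisc_ops/Qdisc_class_ops to BPF, but a
deliberately tiny ops surface that only covers queueing policy, with the
kernel side of sch_bpf still providing the rest of the Qdisc
boilerplate. A purely hypothetical sketch (none of these names exist
today):

/* Purely hypothetical: a minimal struct_ops-style surface for BPF to
 * override, instead of the full struct Qdisc_ops / Qdisc_class_ops.
 */
struct sch_bpf_ops {
	/* return 0 if the skb was queued, a negative errno to drop it */
	int		 (*enqueue)(struct sk_buff *skb);
	/* return the next skb to transmit, or NULL if nothing is queued */
	struct sk_buff	*(*dequeue)(void);
	char		 name[16];
};

Something of that size would still let people experiment with queuing
policies from BPF without having to make every classful-qdisc callback
implementable there.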