Message ID: 20241220195619.2022866-1-amery.hung@gmail.com (mailing list archive)
Series: bpf qdisc
Amery Hung <ameryhung@gmail.com> writes:

> Hi all,
>
> This patchset aims to support implementing qdisc using bpf struct_ops.
> This version takes a step back and only implements the minimum support
> for bpf qdisc; 1) support for adding skb to bpf_list and bpf_rbtree
> directly and 2) classful qdisc are deferred to future patchsets.
>
> * Overview *
>
> This series supports implementing qdisc using bpf struct_ops. bpf qdisc
> aims to be a flexible and easy-to-use infrastructure that allows users
> to quickly experiment with different scheduling algorithms/policies. It
> only requires users to implement the core qdisc logic in bpf and
> implements the mundane parts for them. In addition, the ability to
> easily communicate between the qdisc and other components will also
> bring new opportunities for new applications and optimizations.

This is very cool, and I'm thrilled to see this work getting closer to
being merged! :)

> * struct_ops changes *
>
> To make struct_ops work better with bpf qdisc, two new changes are
> introduced to bpf specifically for struct_ops programs. First, we
> introduce the "__ref" postfix for arguments in stub functions in
> patches 1-2. It allows Qdisc_ops->enqueue to acquire a unique
> referenced kptr to the skb argument. Through the reference object
> tracking mechanism in the verifier, we can make sure that the acquired
> skb will be either enqueued or dropped. Besides, no duplicate
> references can be acquired. Then, we allow a referenced kptr to be
> returned from struct_ops programs so that we can return an skb
> naturally. This is done and tested in patches 3 and 4.
>
> * Performance of bpf qdisc *
>
> We tested several bpf qdiscs included in the selftests and their
> in-tree counterparts to give you a sense of the performance of qdiscs
> implemented in bpf.
>
> The implementation of bpf_fq is fairly complex and slightly different
> from fq, so below we only compare the two fifo qdiscs. bpf_fq
> implements the same fair queueing algorithm as fq, but without flow
> hash collision avoidance and garbage collection of inactive flows.
> bpf_fifo uses a single bpf_list as a queue instead of three queues for
> different priorities as in pfifo_fast. The time complexity of the two
> fifos should nevertheless be similar since the queue selection time is
> negligible.
>
> Test setup:
>
> client -> qdisc -------------> server
>           ~~~~~~~~~~~~~~~      ~~~~~~
>           nested VM1 @ DC1     VM2 @ DC2
>
> Throughput: iperf3 -t 600, 5 times
>
>   Qdisc        Average (GBits/sec)
>   ----------   -------------------
>   pfifo_fast        12.52 ± 0.26
>   bpf_fifo          11.72 ± 0.32
>   fq                10.24 ± 0.13
>   bpf_fq            11.92 ± 0.64
>
> Latency: sockperf pp --tcp -t 600, 5 times
>
>   Qdisc        Average (usec)
>   ----------   --------------
>   pfifo_fast       244.58 ± 7.93
>   bpf_fifo         244.92 ± 15.22
>   fq               234.30 ± 19.25
>   bpf_fq           221.34 ± 10.76
>
> Looking at the two fifo qdiscs, the 6.4% drop in throughput in the bpf
> implementation is consistent with the previous observation (v8
> throughput test on a loopback device). This should be possible to
> mitigate by supporting adding an skb to a bpf_list or bpf_rbtree
> directly in the future.

This looks pretty decent!

> * Clean up skb in bpf qdisc during reset *
>
> The current implementation relies on bpf qdisc implementors to
> correctly release skbs in queues (bpf graphs or maps) in .reset, which
> might not be a safe thing to do. The solution, as Martin has suggested,
> would be supporting private data in struct_ops. This can also help
> simplify the implementation of qdiscs that work with mq. For example,
> qdiscs in the selftests mostly use global data. Therefore, even if a
> user adds multiple qdisc instances under mq, they would still share
> the same queue.

So is the plan to fix this before merging this series? Seems like a bit
of a footgun, otherwise?

-Toke
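[Editor's note: to make the "__ref" argument and kptr-return mechanics from the cover letter concrete, here is a rough sketch of what a minimal fifo built on these primitives might look like. This is illustrative only: the section conventions, the bpf_qdisc_skb_drop() kfunc, the bpf_sk_buff_ptr type, and the private() macro are assumptions based on the cover letter and existing bpf selftest idioms, not code copied from the series. Includes (vmlinux.h, bpf_helpers.h) are omitted.]

```c
/* Hypothetical sketch of a minimal bpf fifo qdisc; names are assumed. */

/* skb nodes linked into a bpf_list acting as the fifo queue. */
struct skb_node {
	struct sk_buff __kptr *skb;	/* holds the referenced skb */
	struct bpf_list_node node;
};

/* private() as commonly defined in the bpf selftests. */
#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))

private(FIFO) struct bpf_spin_lock q_lock;
private(FIFO) struct bpf_list_head q __contains(skb_node, node);

/* .enqueue: skb arrives as a referenced kptr (the "__ref" stub tag);
 * the verifier requires it to be either stored or dropped before the
 * program returns, and forbids duplicating the reference. */
SEC("struct_ops")
int BPF_PROG(fifo_enqueue, struct sk_buff *skb, struct Qdisc *sch,
	     struct bpf_sk_buff_ptr *to_free)
{
	struct skb_node *n = bpf_obj_new(typeof(*n));
	struct sk_buff *old;

	if (!n) {
		/* Cannot queue: the reference must still be released. */
		bpf_qdisc_skb_drop(skb, to_free);	/* assumed kfunc */
		return NET_XMIT_DROP;
	}
	old = bpf_kptr_xchg(&n->skb, skb);	/* store the ref; old is NULL */
	if (old)
		bpf_qdisc_skb_drop(old, to_free);
	bpf_spin_lock(&q_lock);
	bpf_list_push_back(&q, &n->node);
	bpf_spin_unlock(&q_lock);
	return NET_XMIT_SUCCESS;
}

/* .dequeue: returning a referenced kptr (patches 3-4) lets the program
 * hand the skb back to the kernel naturally. */
SEC("struct_ops")
struct sk_buff *BPF_PROG(fifo_dequeue, struct Qdisc *sch)
{
	struct bpf_list_node *ln;
	struct skb_node *n;
	struct sk_buff *skb;

	bpf_spin_lock(&q_lock);
	ln = bpf_list_pop_front(&q);
	bpf_spin_unlock(&q_lock);
	if (!ln)
		return NULL;
	n = container_of(ln, struct skb_node, node);
	skb = bpf_kptr_xchg(&n->skb, NULL);	/* take the reference back */
	bpf_obj_drop(n);
	return skb;
}
```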
On 12/20/24 11:55 AM, Amery Hung wrote:
> The implementation of bpf_fq is fairly complex and slightly different
> from fq, so below we only compare the two fifo qdiscs. bpf_fq
> implements the same fair queueing algorithm as fq, but without flow
> hash collision avoidance and garbage collection of inactive flows.
> bpf_fifo uses a single

For hash collision, I think you meant >1 tcp_socks having the same hash
in patch 14? This probably could be detected by adding the sk pointer
value to the bpf-map key? Not asking to change patch 14 though.

For garbage collection, I think patch 14 has it, but yes, it is
iterating the bpf map, so it is not as quick as doing gc while searching
for the sk in the rbtree. I think the only missing piece is being able
to iterate the bpf_rb_root, i.e. being able to directly search left and
right of a bpf_rb_node.

> bpf_list as a queue instead of three queues for different priorities
> as in pfifo_fast. The time complexity of the two fifos should
> nevertheless be similar since the queue selection time is negligible.
>
> Test setup:
>
> client -> qdisc -------------> server
>           ~~~~~~~~~~~~~~~      ~~~~~~
>           nested VM1 @ DC1     VM2 @ DC2
>
> Throughput: iperf3 -t 600, 5 times
>
>   Qdisc        Average (GBits/sec)
>   ----------   -------------------
>   pfifo_fast        12.52 ± 0.26
>   bpf_fifo          11.72 ± 0.32
>   fq                10.24 ± 0.13
>   bpf_fq            11.92 ± 0.64
>
> Latency: sockperf pp --tcp -t 600, 5 times
>
>   Qdisc        Average (usec)
>   ----------   --------------
>   pfifo_fast       244.58 ± 7.93
>   bpf_fifo         244.92 ± 15.22
>   fq               234.30 ± 19.25
>   bpf_fq           221.34 ± 10.76
>
> Looking at the two fifo qdiscs, the 6.4% drop in throughput in the bpf
> implementation is consistent with the previous observation (v8
> throughput test on a loopback device). This should be possible to
> mitigate by supporting adding an skb to a bpf_list or bpf_rbtree
> directly in the future.
>
> * Clean up skb in bpf qdisc during reset *
>
> The current implementation relies on bpf qdisc implementors to
> correctly release skbs in queues (bpf graphs or maps) in .reset, which
> might not be a safe thing to do. The solution, as Martin has suggested,
> would be supporting private data in struct_ops. This can also help
> simplify the implementation of qdiscs that work with mq. For example,
> qdiscs in the selftests mostly use global data. Therefore, even if a
> user adds multiple qdisc instances under mq, they would still share
> the same queue.

Although not as nice as priv_data, I think an mq setup with a dedicated
queue per qdisc instance can be done with bpf map-in-map. For the
cleanup part, it is similar to how a bpf kptr is cleaned up: either the
bpf program frees it, or the bpf infra will eventually clean it up
during bpf map destruction.

For priv_data, I think it could be a useful addition to bpf_struct_ops,
meaning it should also work for struct_ops other than Qdisc_ops. Then
all destruction and freeing could be done more automatically and
seamlessly.

imo, the above improvements can be iterated later on top of the core
pieces of this set.