Message ID | 20221125175207.473866-1-pctammela@mojatatu.com (mailing list archive) |
---|---|
Series | net/sched: retpoline wrappers for tc |
You forgot to add the RFC tag.
Also add my reviewed-by:

cheers,
jamal

On Fri, Nov 25, 2022 at 12:52 PM Pedro Tammela <pctammela@gmail.com> wrote:
>
> In tc all qdiscs, classifiers and actions can be compiled as modules.
> This results today in indirect calls in all transitions in the tc hierarchy.
> Due to CONFIG_RETPOLINE, CPUs with mitigations=on might pay an extra cost on
> indirect calls. For newer Intel CPUs with IBRS the extra cost is
> nonexistent, but AMD Zen CPUs and older x86 CPUs still go through the
> retpoline thunk.
>
> Known built-in symbols can be optimized into direct calls, thus
> avoiding the retpoline thunk. So far, tc has not been leveraging this
> build information and has been leaving out a performance optimization for
> some CPUs. In this series we wire up 'tcf_classify()' and 'tcf_action_exec()'
> with direct calls when known modules are compiled as built-in as an
> opt-in optimization.
>
> We measured these changes on one AMD Zen 3 CPU (Retpoline), one Intel 10th
> Gen CPU (IBRS), one Intel 3rd Gen CPU (Retpoline) and one Intel Xeon CPU (IBRS)
> using pktgen with 64b udp packets. Our test setup is a dummy device with
> clsact and matchall in a kernel compiled with every tc module as built-in.
> We observed a 6-10% speed up on the retpoline CPUs, when going through 1
> tc filter, and a 60-100% speed up when going through 100 filters.
> For the IBRS CPUs we observed a 1-2% degradation in both scenarios. We believe
> the extra branch checks introduced a small overhead, therefore we added
> a Kconfig option to make these changes opt-in even in CONFIG_RETPOLINE kernels.
>
> We are continuing to test on other hardware variants as we find them:
>
> 1 filter:
> CPU        | before (pps) | after (pps) | diff
> R9 5950X   | 4237838      | 4412241     | +4.1%
> R9 5950X   | 4265287      | 4413757     | +3.4%   [*]
> i5-3337U   | 1580565      | 1682406     | +6.4%
> i5-10210U  | 3006074      | 3006857     | +0.0%
> i5-10210U  | 3160245      | 3179945     | +0.6%   [*]
> Xeon 6230R | 3196906      | 3197059     | +0.0%
> Xeon 6230R | 3190392      | 3196153     | +0.01%  [*]
>
> 100 filters:
> CPU        | before (pps) | after (pps) | diff
> R9 5950X   | 313469       | 633303      | +102.03%
> R9 5950X   | 313797       | 633150      | +101.77% [*]
> i5-3337U   | 127454       | 211210      | +65.71%
> i5-10210U  | 389259       | 381765      | -1.9%
> i5-10210U  | 408812       | 412730      | +0.9%    [*]
> Xeon 6230R | 415420       | 406612      | -2.1%
> Xeon 6230R | 416705       | 405869      | -2.6%    [*]
>
> [*] In these tests we ran pktgen with clone set to 1000.
>
> Pedro Tammela (3):
>   net/sched: add retpoline wrapper for tc
>   net/sched: avoid indirect act functions on retpoline kernels
>   net/sched: avoid indirect classify functions on retpoline kernels
>
>  include/net/tc_wrapper.h   | 274 +++++++++++++++++++++++++++++++++++++
>  net/sched/Kconfig          |  13 ++
>  net/sched/act_api.c        |   3 +-
>  net/sched/act_bpf.c        |   6 +-
>  net/sched/act_connmark.c   |   6 +-
>  net/sched/act_csum.c       |   6 +-
>  net/sched/act_ct.c         |   4 +-
>  net/sched/act_ctinfo.c     |   6 +-
>  net/sched/act_gact.c       |   6 +-
>  net/sched/act_gate.c       |   6 +-
>  net/sched/act_ife.c        |   6 +-
>  net/sched/act_ipt.c        |   6 +-
>  net/sched/act_mirred.c     |   6 +-
>  net/sched/act_mpls.c       |   6 +-
>  net/sched/act_nat.c        |   7 +-
>  net/sched/act_pedit.c      |   6 +-
>  net/sched/act_police.c     |   6 +-
>  net/sched/act_sample.c     |   6 +-
>  net/sched/act_simple.c     |   6 +-
>  net/sched/act_skbedit.c    |   6 +-
>  net/sched/act_skbmod.c     |   6 +-
>  net/sched/act_tunnel_key.c |   6 +-
>  net/sched/act_vlan.c       |   6 +-
>  net/sched/cls_api.c        |   3 +-
>  net/sched/cls_basic.c      |   6 +-
>  net/sched/cls_bpf.c        |   6 +-
>  net/sched/cls_cgroup.c     |   6 +-
>  net/sched/cls_flow.c       |   6 +-
>  net/sched/cls_flower.c     |   6 +-
>  net/sched/cls_fw.c         |   6 +-
>  net/sched/cls_matchall.c   |   6 +-
>  net/sched/cls_route.c      |   6 +-
>  net/sched/cls_rsvp.c       |   2 +
>  net/sched/cls_rsvp.h       |   7 +-
>  net/sched/cls_rsvp6.c      |   2 +
>  net/sched/cls_tcindex.c    |   7 +-
>  net/sched/cls_u32.c        |   6 +-
>  37 files changed, 417 insertions(+), 67 deletions(-)
>  create mode 100644 include/net/tc_wrapper.h
>
> --
> 2.34.1
>
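
For readers following the thread, the direct-call idea from the cover letter can be sketched in user-space C roughly as below. This is only an illustration under stated assumptions, not the code in include/net/tc_wrapper.h: struct pkt, act_fn, act_drop(), act_pass(), act_module(), act_exec() and the two counters are all hypothetical names made up for this example. The point it tries to show is that comparing the function pointer against symbols known at link time lets the compiler emit a direct call, which on a retpoline kernel would skip the thunk, while anything unknown still falls back to the indirect call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a packet; the kernel code works on struct sk_buff. */
struct pkt { int len; };

/* Hypothetical verdict values, chosen only for this sketch. */
enum { VERDICT_PASS = 0, VERDICT_DROP = 2 };

typedef int (*act_fn)(struct pkt *p);

/* Two "built-in" actions: their addresses are known when the image is linked. */
int act_drop(struct pkt *p) { (void)p; return VERDICT_DROP; }
int act_pass(struct pkt *p) { (void)p; return VERDICT_PASS; }

/* An action reached only through a pointer, as if it came from a module. */
int act_module(struct pkt *p) { return p->len; }

/* Counters that record which path was taken, purely for illustration. */
static int direct_calls;
static int indirect_calls;

/*
 * Dispatch helper: test the pointer against the known built-in symbols and
 * call them by name (a direct call). Only unknown pointers take the
 * indirect call, which is the one a retpoline build would route via a thunk.
 */
int act_exec(act_fn fn, struct pkt *p)
{
	assert(fn != NULL);

	if (fn == act_drop) {
		direct_calls++;
		return act_drop(p);
	}
	if (fn == act_pass) {
		direct_calls++;
		return act_pass(p);
	}

	indirect_calls++;
	return fn(p);
}
```

Each pointer comparison is an extra branch on every call, which is consistent with the small degradation reported above for the IBRS CPUs and with making the optimization opt-in through Kconfig.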