
[net-next,v2,0/3] net/sched: retpoline wrappers for tc

Message ID 20221128154456.689326-1-pctammela@mojatatu.com (mailing list archive)

Message

Pedro Tammela Nov. 28, 2022, 3:44 p.m. UTC
In tc, all qdiscs, classifiers and actions can be compiled as modules.
Today this results in indirect calls for all transitions in the tc hierarchy.
Due to CONFIG_RETPOLINE, CPUs running with mitigations=on might pay an extra
cost on indirect calls. For newer Intel CPUs with IBRS the extra cost is
nonexistent, but AMD Zen CPUs and older x86 CPUs still go through the
retpoline thunk.

Known built-in symbols can be optimized into direct calls, thus
avoiding the retpoline thunk. So far, tc has not been leveraging this
build information, leaving a performance optimization on the table for some
CPUs. In this series we wire up 'tcf_classify()' and 'tcf_action_exec()'
to use direct calls when the known modules are compiled as built-in, as an
opt-in optimization.

We measured these changes on one AMD Zen 3 CPU (retpoline), one Intel 10th
Gen CPU (IBRS), one Intel 3rd Gen CPU (retpoline) and one Intel Xeon CPU (IBRS),
using pktgen with 64B UDP packets. Our test setup is a dummy device with
clsact and matchall, in a kernel compiled with every tc module built-in.
We observed a 3-6% speed-up on the retpoline CPUs when going through 1
tc filter, and a 60-100% speed-up when going through 100 filters.
For the IBRS CPUs we observed a 1-2% degradation in both scenarios; we believe
the extra branch checks introduced a small overhead, so we added
a Kconfig option to make these changes opt-in even on CONFIG_RETPOLINE kernels.
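The setup described above can be outlined as follows. This is a hedged reproduction sketch, not the authors' exact scripts: the device name, the filter action, and the pointer to the pktgen sample scripts are assumptions.

```shell
# Configuration sketch of the test bed: a dummy device with clsact
# and a matchall ingress filter (device name and action are assumptions).
ip link add dev dummy0 type dummy
ip link set dev dummy0 up
tc qdisc add dev dummy0 clsact
tc filter add dev dummy0 ingress matchall action pass

# pktgen is then pointed at dummy0 with 64-byte UDP packets; the
# ready-made scripts under samples/pktgen/ in the kernel tree are a
# convenient starting point.
```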

We are continuing to test on other hardware variants as we find them:

1 filter:
CPU        | before (pps) | after (pps) | diff
R9 5950X   | 4237838      | 4412241     | +4.1%
R9 5950X   | 4265287      | 4413757     | +3.4%   [*]
i5-3337U   | 1580565      | 1682406     | +6.4%
i5-10210U  | 3006074      | 3006857     | +0.0%
i5-10210U  | 3160245      | 3179945     | +0.6%   [*]
Xeon 6230R | 3196906      | 3197059     | +0.0%
Xeon 6230R | 3190392      | 3196153     | +0.2%   [*]

100 filters:
CPU        | before (pps) | after (pps) | diff
R9 5950X   | 313469       | 633303      | +102.03%
R9 5950X   | 313797       | 633150      | +101.77% [*]
i5-3337U   | 127454       | 211210      | +65.71%
i5-10210U  | 389259       | 381765      | -1.9%
i5-10210U  | 408812       | 412730      | +0.9%    [*]
Xeon 6230R | 415420       | 406612      | -2.1%
Xeon 6230R | 416705       | 405869      | -2.6%    [*]

[*] In these tests we ran pktgen with clone set to 1000.
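The percentage columns above can be recomputed directly from the pps figures; a short sketch (the two sample rows are taken from the R9 5950X lines in the tables):

```python
# Recompute the reported deltas: percentage change from before-pps
# to after-pps.
def pct_diff(before: int, after: int) -> float:
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100

# Two rows from the tables above (R9 5950X).
print(f"1 filter:    {pct_diff(4237838, 4412241):+.1f}%")
print(f"100 filters: {pct_diff(313469, 633303):+.2f}%")
```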

v1->v2:
- Fix build errors found by the bots
- Address Kuniyuki Iwashima suggestions

Pedro Tammela (3):
  net/sched: add retpoline wrapper for tc
  net/sched: avoid indirect act functions on retpoline kernels
  net/sched: avoid indirect classify functions on retpoline kernels

 include/net/tc_wrapper.h   | 232 +++++++++++++++++++++++++++++++++++++
 net/sched/Kconfig          |  13 +++
 net/sched/act_api.c        |   3 +-
 net/sched/act_bpf.c        |   6 +-
 net/sched/act_connmark.c   |   6 +-
 net/sched/act_csum.c       |   6 +-
 net/sched/act_ct.c         |   5 +-
 net/sched/act_ctinfo.c     |   6 +-
 net/sched/act_gact.c       |   6 +-
 net/sched/act_gate.c       |   6 +-
 net/sched/act_ife.c        |   6 +-
 net/sched/act_ipt.c        |   6 +-
 net/sched/act_mirred.c     |   6 +-
 net/sched/act_mpls.c       |   6 +-
 net/sched/act_nat.c        |   7 +-
 net/sched/act_pedit.c      |   6 +-
 net/sched/act_police.c     |   6 +-
 net/sched/act_sample.c     |   6 +-
 net/sched/act_simple.c     |   6 +-
 net/sched/act_skbedit.c    |   6 +-
 net/sched/act_skbmod.c     |   6 +-
 net/sched/act_tunnel_key.c |   6 +-
 net/sched/act_vlan.c       |   6 +-
 net/sched/cls_api.c        |   3 +-
 net/sched/cls_basic.c      |   6 +-
 net/sched/cls_bpf.c        |   6 +-
 net/sched/cls_cgroup.c     |   6 +-
 net/sched/cls_flow.c       |   6 +-
 net/sched/cls_flower.c     |   6 +-
 net/sched/cls_fw.c         |   6 +-
 net/sched/cls_matchall.c   |   6 +-
 net/sched/cls_route.c      |   6 +-
 net/sched/cls_rsvp.c       |   2 +
 net/sched/cls_rsvp.h       |   6 +-
 net/sched/cls_rsvp6.c      |   2 +
 net/sched/cls_tcindex.c    |   7 +-
 net/sched/cls_u32.c        |   6 +-
 37 files changed, 375 insertions(+), 67 deletions(-)
 create mode 100644 include/net/tc_wrapper.h

Comments

Paolo Abeni Dec. 1, 2022, 11:05 a.m. UTC | #1
On Mon, 2022-11-28 at 12:44 -0300, Pedro Tammela wrote:
> In tc all qdiscs, classifiers and actions can be compiled as modules.
> This results today in indirect calls in all transitions in the tc hierarchy.
> Due to CONFIG_RETPOLINE, CPUs with mitigations=on might pay an extra cost on
> indirect calls. For newer Intel cpus with IBRS the extra cost is
> nonexistent, but AMD Zen cpus and older x86 cpus still go through the
> retpoline thunk.
> 
> Known built-in symbols can be optimized into direct calls, thus
> avoiding the retpoline thunk. So far, tc has not been leveraging this
> build information and leaving out a performance optimization for some
> CPUs. In this series we wire up 'tcf_classify()' and 'tcf_action_exec()'
> with direct calls when known modules are compiled as built-in as an
> opt-in optimization.
> 
> We measured these changes in one AMD Zen 3 cpu (Retpoline), one Intel 10th
> Gen CPU (IBRS), one Intel 3rd Gen cpu (Retpoline) and one Intel Xeon CPU (IBRS)
> using pktgen with 64b udp packets. Our test setup is a dummy device with
> clsact and matchall in a kernel compiled with every tc module as built-in.
> We observed a 3-6% speed up on the retpoline CPUs, when going through 1
> tc filter, 

Do you have all the existing filters enabled at build time in your test
kernel? The reported figures are quite a bit higher than expected, considering
there are 7 new unlikely branches in between.

Also it would be nice to have some figures for the last filter in the if
chain. I fear we could have some regressions there even for 'retpoline'
CPUs - given the long if chain - and u32 is AFAIK (admittedly not based on
much data) still quite used.

Finally, it looks like the filter order in patch 1/3 is quite relevant,
and it looks like you used lexicographic order. It would probably
be better to sort them by 'relevance', if someone could provide a
reasonable 'relevance' order. I personally would move ife, ipt and
simple towards the bottom.

Thanks,

Paolo
Jamal Hadi Salim Dec. 1, 2022, 12:34 p.m. UTC | #2
On Thu, Dec 1, 2022 at 6:05 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Mon, 2022-11-28 at 12:44 -0300, Pedro Tammela wrote:

[..]

> > We observed a 3-6% speed up on the retpoline CPUs, when going through 1
> > tc filter,
>
> Do you have all the existing filters enabled at build time in your test
> kernel? The reported figures are quite a bit higher than expected, considering
> there are 7 new unlikely branches in between.
>

That can be validated with a test that compiles one kernel with the filter under
test listed first, then another kernel with the same filter listed last.

Also, given these tests used 64B pkts to achieve the highest pps, perhaps
using MTU-sized pkts with pktgen would give more realistic results?

In addition to the tests for 1 and 100 filters...

> Also it would be nice to have some figure for the last filter in the if
> chain. I fear we could have some regressions there even for 'retpoline'
> CPUs - given the long if chain - and u32 is AFAIK (not much actually)
> still quite used.
>

I would say flower and bpf + u32 are probably the most used,
but given there is no available data on usage, beauty is in the eye of
the beholder. I hope it doesn't become a real estate battle like the one we
have over which subsystem gets to see packets first or last ;->

> Finally, it looks like the filter order in patch 1/3 is quite relevant,
> and it looks like you used the lexicographic order, I guess it should
> be better to sort them by 'relevance', if someone could provide a
> reasonable 'relevance' order. I personally would move ife, ipt and
> simple towards the bottom.

I think we can come up with some reasonable order.

cheers,
jamal
Pedro Tammela Dec. 4, 2022, 11:13 p.m. UTC | #3
On 01/12/2022 09:34, Jamal Hadi Salim wrote:
> On Thu, Dec 1, 2022 at 6:05 AM Paolo Abeni <pabeni@redhat.com> wrote:
>>
>> On Mon, 2022-11-28 at 12:44 -0300, Pedro Tammela wrote:
> 
> [..]
> 
>>> We observed a 3-6% speed up on the retpoline CPUs, when going through 1
>>> tc filter,
>>
>> Do you have all the existing filters enabled at build time in your test
>> kernel? The reported figures are quite a bit higher than expected, considering
>> there are 7 new unlikely branches in between.
>>
> 
> That can be validated with a test that compiles a kernel with a filter under
> test listed first then another kernel with the same filter last.
> 
> Also given these tests were using 64B pkts to achieve the highest pps, perhaps
> using MTU sized pkts with pktgen would give more realistic results?
> 
> In addition to the tests for 1 and 100 filters...
> 
>> Also it would be nice to have some figure for the last filter in the if
>> chain. I fear we could have some regressions there even for 'retpoline'
>> CPUs - given the long if chain - and u32 is AFAIK (not much actually)
>> still quite used.
>>
> 
> I would say flower and bpf + u32 are probably the highest used,
> but given no available data on this usage beauty is in the eye of
> the beholder. I hope it doesnt become a real estate battle like we
> have in which subsystem gets to see packets first or last ;->
> 
>> Finally, it looks like the filter order in patch 1/3 is quite relevant,
>> and it looks like you used the lexicographic order, I guess it should
>> be better to sort them by 'relevance', if someone could provide a
>> reasonable 'relevance' order. I personally would move ife, ipt and
>> simple towards the bottom.
> 
> I think we can come up with some reasonable order.
> 
> cheers,
> jamal

We got a new system with a 7950X and I had some free time today to test 
the classifier order with v3, which I will post soon.

64b pps:
baseline - 5914980
first - 6397116 (+8.15%)
last - 6362476 (+7.5%)

1500b pps:
baseline - 6367965
first - 6754578 (+6.07%)
last - 6745576 (+5.9%)

The difference between first and last is minimal, but it exists.
DDR5 seems to give a nice pps boost in this test compared to the
5950X, which makes sense since the workload is quite heavy on the
memory allocator.