mbox series

[net-next,v6,0/4] net/sched: retpoline wrappers for tc

Message ID 20221206135513.1904815-1-pctammela@mojatatu.com (mailing list archive)
Headers show
Series net/sched: retpoline wrappers for tc | expand

Message

Pedro Tammela Dec. 6, 2022, 1:55 p.m. UTC
In tc all qdics, classifiers and actions can be compiled as modules.
This results today in indirect calls in all transitions in the tc hierarchy.
Due to CONFIG_RETPOLINE, CPUs with mitigations=on might pay an extra cost on
indirect calls. For newer Intel cpus with IBRS the extra cost is
nonexistent, but AMD Zen cpus and older x86 cpus still go through the
retpoline thunk.

Known built-in symbols can be optimized into direct calls, thus
avoiding the retpoline thunk. So far, tc has not been leveraging this
build information and leaving out a performance optimization for some
CPUs. In this series we wire up 'tcf_classify()' and 'tcf_action_exec()'
with direct calls when known modules are compiled as built-in as an
opt-in optimization.

We measured these changes in one AMD Zen 4 cpu (Retpoline), one AMD Zen 3 cpu (Retpoline),
one Intel 10th Gen CPU (IBRS), one Intel 3rd Gen cpu (Retpoline) and one
Intel Xeon CPU (IBRS) using pktgen with 64b udp packets. Our test setup is a
dummy device with clsact and matchall in a kernel compiled with every
tc module as built-in.  We observed a 3-8% speed up on the retpoline CPUs,
when going through 1 tc filter, and a 60-100% speed up when going through 100 filters.
For the IBRS cpus we observed a 1-2% degradation in both scenarios, we believe
the extra branches check introduced a small overhead therefore we added
a static key that bypasses the wrapper on kernels not using the retpoline mitigation,
but compiled with CONFIG_RETPOLINE.

1 filter:
CPU        | before (pps) | after (pps) | diff
R9 7950X   | 5914980      | 6380227     | +7.8%
R9 5950X   | 4237838      | 4412241     | +4.1%
R9 5950X   | 4265287      | 4413757     | +3.4%   [*]
i5-3337U   | 1580565      | 1682406     | +6.4%
i5-10210U  | 3006074      | 3006857     | +0.0%
i5-10210U  | 3160245      | 3179945     | +0.6%   [*]
Xeon 6230R | 3196906      | 3197059     | +0.0%
Xeon 6230R | 3190392      | 3196153     | +0.01%  [*]

100 filters:
CPU        | before (pps) | after (pps) | diff
R9 7950X   | 373598       | 820396      | +119.59%
R9 5950X   | 313469       | 633303      | +102.03%
R9 5950X   | 313797       | 633150      | +101.77% [*]
i5-3337U   | 127454       | 211210      | +65.71%
i5-10210U  | 389259       | 381765      | -1.9%
i5-10210U  | 408812       | 412730      | +0.9%    [*]
Xeon 6230R | 415420       | 406612      | -2.1%
Xeon 6230R | 416705       | 405869      | -2.6%    [*]

[*] In these tests we ran pktgen with clone set to 1000.

On the 7950x system we also tested the impact of filters if iteration order
placement varied, first by compiling a kernel with the filter under test being
the first one in the static iteration and then repeating it with being last (of 15 classifiers existing today).
We saw a difference of +0.5-1% in pps between being the first in the iteration vs being the last.
Therefore we order the classifiers and actions according to relevance per our current thinking.

v5->v6:
- Address Eric Dumazet suggestions

v4->v5:
- Rebase

v3->v4:
- Address Eric Dumazet suggestions

v2->v3:
- Address suggestions by Jakub, Paolo and Eric
- Dropped RFC tag (I forgot to add it on v2)

v1->v2:
- Fix build errors found by the bots
- Address Kuniyuki Iwashima suggestions

Pedro Tammela (4):
  net/sched: move struct action_ops definition out of ifdef
  net/sched: add retpoline wrapper for tc
  net/sched: avoid indirect act functions on retpoline kernels
  net/sched: avoid indirect classify functions on retpoline kernels

 include/net/act_api.h      |  10 +-
 include/net/tc_wrapper.h   | 251 +++++++++++++++++++++++++++++++++++++
 net/sched/act_api.c        |   3 +-
 net/sched/act_bpf.c        |   6 +-
 net/sched/act_connmark.c   |   6 +-
 net/sched/act_csum.c       |   6 +-
 net/sched/act_ct.c         |   5 +-
 net/sched/act_ctinfo.c     |   6 +-
 net/sched/act_gact.c       |   6 +-
 net/sched/act_gate.c       |   6 +-
 net/sched/act_ife.c        |   6 +-
 net/sched/act_ipt.c        |   6 +-
 net/sched/act_mirred.c     |   6 +-
 net/sched/act_mpls.c       |   6 +-
 net/sched/act_nat.c        |   7 +-
 net/sched/act_pedit.c      |   6 +-
 net/sched/act_police.c     |   6 +-
 net/sched/act_sample.c     |   6 +-
 net/sched/act_simple.c     |   6 +-
 net/sched/act_skbedit.c    |   6 +-
 net/sched/act_skbmod.c     |   6 +-
 net/sched/act_tunnel_key.c |   6 +-
 net/sched/act_vlan.c       |   6 +-
 net/sched/cls_api.c        |   3 +-
 net/sched/cls_basic.c      |   6 +-
 net/sched/cls_bpf.c        |   6 +-
 net/sched/cls_cgroup.c     |   6 +-
 net/sched/cls_flow.c       |   6 +-
 net/sched/cls_flower.c     |   6 +-
 net/sched/cls_fw.c         |   6 +-
 net/sched/cls_matchall.c   |   6 +-
 net/sched/cls_route.c      |   6 +-
 net/sched/cls_rsvp.c       |   2 +
 net/sched/cls_rsvp.h       |   6 +-
 net/sched/cls_rsvp6.c      |   2 +
 net/sched/cls_tcindex.c    |   7 +-
 net/sched/cls_u32.c        |   6 +-
 net/sched/sch_api.c        |   5 +
 38 files changed, 391 insertions(+), 72 deletions(-)
 create mode 100644 include/net/tc_wrapper.h

Comments

patchwork-bot+netdevbpf@kernel.org Dec. 9, 2022, 9:40 a.m. UTC | #1
Hello:

This series was applied to netdev/net-next.git (master)
by David S. Miller <davem@davemloft.net>:

On Tue,  6 Dec 2022 10:55:09 -0300 you wrote:
> In tc all qdics, classifiers and actions can be compiled as modules.
> This results today in indirect calls in all transitions in the tc hierarchy.
> Due to CONFIG_RETPOLINE, CPUs with mitigations=on might pay an extra cost on
> indirect calls. For newer Intel cpus with IBRS the extra cost is
> nonexistent, but AMD Zen cpus and older x86 cpus still go through the
> retpoline thunk.
> 
> [...]

Here is the summary with links:
  - [net-next,v6,1/4] net/sched: move struct action_ops definition out of ifdef
    https://git.kernel.org/netdev/net-next/c/2a7d228f1ae7
  - [net-next,v6,2/4] net/sched: add retpoline wrapper for tc
    https://git.kernel.org/netdev/net-next/c/7f0e810220e2
  - [net-next,v6,3/4] net/sched: avoid indirect act functions on retpoline kernels
    https://git.kernel.org/netdev/net-next/c/871cf386dd16
  - [net-next,v6,4/4] net/sched: avoid indirect classify functions on retpoline kernels
    https://git.kernel.org/netdev/net-next/c/9f3101dca3a7

You are awesome, thank you!
Rudi Heitbaum Dec. 27, 2022, 8:33 a.m. UTC | #2
Hi Pedro,

Compiling kernel 6.2-rc1 fails on x86_64 when CONFIG_NET_CLS or
CONFIG_NET_CLS_ACT is not set, when CONFIG_RETPOLINE=y is set.

Does tc_wrapper RETPOLINE need a dependency on NET_CLS/NET_CLS_ACT
to be added? Or a default?

net/sched/sch_api.c: In function 'pktsched_init':
net/sched/sch_api.c:2306:9: error: implicit declaration of function
'tc_wrapper_init' [-Werror=implicit-function-declaration]
 2306 |         tc_wrapper_init();
      |         ^~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
make[3]: *** [scripts/Makefile.build:252: net/sched/sch_api.o] Error 1
make[2]: *** [scripts/Makefile.build:504: net/sched] Error 2
make[1]: *** [scripts/Makefile.build:504: net] Error 2

below is the relevent lines from the .config file.

$ grep -e RETPOLINE -e NET_CLS projects/Generic/linux/linux.x86_64.conf 
CONFIG_RETPOLINE=y
# CONFIG_NET_CLS_BASIC is not set
# CONFIG_NET_CLS_TCINDEX is not set
# CONFIG_NET_CLS_ROUTE4 is not set
# CONFIG_NET_CLS_FW is not set
# CONFIG_NET_CLS_U32 is not set
# CONFIG_NET_CLS_RSVP is not set
# CONFIG_NET_CLS_RSVP6 is not set
# CONFIG_NET_CLS_FLOW is not set
# CONFIG_NET_CLS_CGROUP is not set
# CONFIG_NET_CLS_BPF is not set
# CONFIG_NET_CLS_FLOWER is not set
# CONFIG_NET_CLS_MATCHALL is not set
# CONFIG_NET_CLS_ACT is not set

On Tue, Dec 06, 2022 at 10:55:09AM -0300, Pedro Tammela wrote:
> In tc all qdics, classifiers and actions can be compiled as modules.
> This results today in indirect calls in all transitions in the tc hierarchy.
> Due to CONFIG_RETPOLINE, CPUs with mitigations=on might pay an extra cost on
> indirect calls. For newer Intel cpus with IBRS the extra cost is
> nonexistent, but AMD Zen cpus and older x86 cpus still go through the
> retpoline thunk.
> 
> Known built-in symbols can be optimized into direct calls, thus
> avoiding the retpoline thunk. So far, tc has not been leveraging this
> build information and leaving out a performance optimization for some
> CPUs. In this series we wire up 'tcf_classify()' and 'tcf_action_exec()'
> with direct calls when known modules are compiled as built-in as an
> opt-in optimization.
> 
> We measured these changes in one AMD Zen 4 cpu (Retpoline), one AMD Zen 3 cpu (Retpoline),
> one Intel 10th Gen CPU (IBRS), one Intel 3rd Gen cpu (Retpoline) and one
> Intel Xeon CPU (IBRS) using pktgen with 64b udp packets. Our test setup is a
> dummy device with clsact and matchall in a kernel compiled with every
> tc module as built-in.  We observed a 3-8% speed up on the retpoline CPUs,
> when going through 1 tc filter, and a 60-100% speed up when going through 100 filters.
> For the IBRS cpus we observed a 1-2% degradation in both scenarios, we believe
> the extra branches check introduced a small overhead therefore we added
> a static key that bypasses the wrapper on kernels not using the retpoline mitigation,
> but compiled with CONFIG_RETPOLINE.

...
Pedro Tammela Dec. 27, 2022, 12:24 p.m. UTC | #3
On 27/12/2022 05:33, Rudi Heitbaum wrote:
> Hi Pedro,
> 
> Compiling kernel 6.2-rc1 fails on x86_64 when CONFIG_NET_CLS or
> CONFIG_NET_CLS_ACT is not set, when CONFIG_RETPOLINE=y is set.
> 
> Does tc_wrapper RETPOLINE need a dependency on NET_CLS/NET_CLS_ACT
> to be added? Or a default?
> 
> net/sched/sch_api.c: In function 'pktsched_init':
> net/sched/sch_api.c:2306:9: error: implicit declaration of function
> 'tc_wrapper_init' [-Werror=implicit-function-declaration]
>   2306 |         tc_wrapper_init();
>        |         ^~~~~~~~~~~~~~~
> cc1: some warnings being treated as errors
> make[3]: *** [scripts/Makefile.build:252: net/sched/sch_api.o] Error 1
> make[2]: *** [scripts/Makefile.build:504: net/sched] Error 2
> make[1]: *** [scripts/Makefile.build:504: net] Error 2
> 
> below is the relevent lines from the .config file.
> 
> $ grep -e RETPOLINE -e NET_CLS projects/Generic/linux/linux.x86_64.conf
> CONFIG_RETPOLINE=y
> # CONFIG_NET_CLS_BASIC is not set
> # CONFIG_NET_CLS_TCINDEX is not set
> # CONFIG_NET_CLS_ROUTE4 is not set
> # CONFIG_NET_CLS_FW is not set
> # CONFIG_NET_CLS_U32 is not set
> # CONFIG_NET_CLS_RSVP is not set
> # CONFIG_NET_CLS_RSVP6 is not set
> # CONFIG_NET_CLS_FLOW is not set
> # CONFIG_NET_CLS_CGROUP is not set
> # CONFIG_NET_CLS_BPF is not set
> # CONFIG_NET_CLS_FLOWER is not set
> # CONFIG_NET_CLS_MATCHALL is not set
> # CONFIG_NET_CLS_ACT is not set
> 

Hi Rudi,

Thanks for the report.
Could you try the following patch?

diff --git a/include/net/tc_wrapper.h b/include/net/tc_wrapper.h
index ceed2fc089ff..d323fffb839a 100644
--- a/include/net/tc_wrapper.h
+++ b/include/net/tc_wrapper.h
@@ -216,6 +216,8 @@ static inline int tc_classify(struct sk_buff *skb, 
const struct tcf_proto *tp,
         return tp->classify(skb, tp, res);
  }

+#endif /* CONFIG_NET_CLS */
+
  static inline void tc_wrapper_init(void)
  {
  #ifdef CONFIG_X86
@@ -224,8 +226,6 @@ static inline void tc_wrapper_init(void)
  #endif
  }

-#endif /* CONFIG_NET_CLS */
-
  #else

  #define TC_INDIRECT_SCOPE static
Rudi Heitbaum Dec. 27, 2022, 1:47 p.m. UTC | #4
On Tue, Dec 27, 2022 at 09:24:22AM -0300, Pedro Tammela wrote:
> On 27/12/2022 05:33, Rudi Heitbaum wrote:
> > Hi Pedro,
> > 
> > Compiling kernel 6.2-rc1 fails on x86_64 when CONFIG_NET_CLS or
> > CONFIG_NET_CLS_ACT is not set, when CONFIG_RETPOLINE=y is set.

...

> Hi Rudi,
> 
> Thanks for the report.
> Could you try the following patch?
> 
> diff --git a/include/net/tc_wrapper.h b/include/net/tc_wrapper.h
> index ceed2fc089ff..d323fffb839a 100644
> --- a/include/net/tc_wrapper.h
> +++ b/include/net/tc_wrapper.h
> @@ -216,6 +216,8 @@ static inline int tc_classify(struct sk_buff *skb, const
> struct tcf_proto *tp,
>         return tp->classify(skb, tp, res);
>  }
> 
> +#endif /* CONFIG_NET_CLS */
> +
>  static inline void tc_wrapper_init(void)
>  {
>  #ifdef CONFIG_X86
> @@ -224,8 +226,6 @@ static inline void tc_wrapper_init(void)
>  #endif
>  }
> 
> -#endif /* CONFIG_NET_CLS */
> -
>  #else
> 
>  #define TC_INDIRECT_SCOPE static

Hi Pedro,

I had to adjust it slightly to apply cleanly.
But works correctly with my .config now.

Thanks
Rudi

Tested-by: Rudi Heitbaum <rudi@heitbaum.com>

diff --git a/include/net/tc_wrapper.h b/include/net/tc_wrapper.h
index ceed2fc089ff..d323fffb839a 100644
--- a/include/net/tc_wrapper.h
+++ b/include/net/tc_wrapper.h
@@ -216,6 +216,8 @@
 	return tp->classify(skb, tp, res);
 }
 
+#endif /* CONFIG_NET_CLS */
+
 static inline void tc_wrapper_init(void)
 {
 #ifdef CONFIG_X86
@@ -224,8 +226,6 @@
 #endif
 }
 
-#endif /* CONFIG_NET_CLS */
-
 #else
 
 #define TC_INDIRECT_SCOPE static