[net-next,3/3] net: sched: make skip_sw actually skip software

Message ID 20240215160458.1727237-4-ast@fiberby.net (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Series: make skip_sw actually skip software

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1043 this patch: 1043
netdev/build_tools success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers warning 3 maintainers not CCed: pabeni@redhat.com kuba@kernel.org edumazet@google.com
netdev/build_clang success Errors and warnings before: 1007 this patch: 1007
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1087 this patch: 1087
netdev/checkpatch warning WARNING: line length of 89 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-02-15--21-00 (tests: 882)

Commit Message

Asbjørn Sloth Tønnesen Feb. 15, 2024, 4:04 p.m. UTC
TC filters come in 3 variants:
- no flag (no opinion, process wherever possible)
- skip_hw (do not process filter by hardware)
- skip_sw (do not process filter by software)

However, skip_sw is implemented such that the skip_sw
flag can only be checked after the filter has been matched.

IMHO it's common, when using skip_sw, to use it on all rules.

So if all filters in a block are skip_sw filters, then
we can bail out early, and thus avoid having to match
the filters just to check for the skip_sw flag.

 +----------------------------+--------+--------+--------+
 | Test description           | Pre    | Post   | Rel.   |
 |                            | kpps   | kpps   | chg.   |
 +----------------------------+--------+--------+--------+
 | basic forwarding + notrack | 1264.9 | 1277.7 |  1.01x |
 | switch to eswitch mode     | 1067.1 | 1071.0 |  1.00x |
 | add ingress qdisc          | 1056.0 | 1059.1 |  1.00x |
 +----------------------------+--------+--------+--------+
 | 1 non-matching rule        |  927.9 | 1057.1 |  1.14x |
 | 10 non-matching rules      |  495.8 | 1055.6 |  2.13x |
 | 25 non-matching rules      |  280.6 | 1053.5 |  3.75x |
 | 50 non-matching rules      |  162.0 | 1055.7 |  6.52x |
 | 100 non-matching rules     |   87.7 | 1019.0 | 11.62x |
 +----------------------------+--------+--------+--------+

perf top (100 n-m skip_sw rules - pre patch):
  25.57%  [kernel]  [k] __skb_flow_dissect
  20.77%  [kernel]  [k] rhashtable_jhash2
  14.26%  [kernel]  [k] fl_classify
  13.28%  [kernel]  [k] fl_mask_lookup
   6.38%  [kernel]  [k] memset_orig
   3.22%  [kernel]  [k] tcf_classify

perf top (100 n-m skip_sw rules - post patch):
   4.28%  [kernel]  [k] __dev_queue_xmit
   3.80%  [kernel]  [k] check_preemption_disabled
   3.68%  [kernel]  [k] nft_do_chain
   3.08%  [kernel]  [k] __netif_receive_skb_core.constprop.0
   2.59%  [kernel]  [k] mlx5e_xmit
   2.48%  [kernel]  [k] mlx5e_skb_from_cqe_mpwrq_nonlinear

Test setup:
 DUT: Intel Xeon D-1518 (2.20GHz) w/ Nvidia/Mellanox ConnectX-6 Dx 2x100G
 Data rate measured on switch (Extreme X690), and DUT connected as
 a router on a stick, with pktgen and pktsink as VLANs.
 Pktgen was in range 12.79 - 12.95 Mpps across all tests.

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
---
 include/net/pkt_cls.h | 5 +++++
 net/core/dev.c        | 3 +++
 2 files changed, 8 insertions(+)
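
The new tcf_block_has_skip_sw_only() helper (see the patch below) compares
two per-block counters, filtercnt and skipswcnt, which are introduced by the
earlier patches in this series (not shown on this page). A minimal sketch of
the bookkeeping those counters imply - the helper name, signature and struct
layout here are assumptions, not the exact series code:

  /* Sketch only: counters behind tcf_block_has_skip_sw_only().
   * Field names follow the helper; everything else is assumed.
   */
  #include <linux/atomic.h>

  struct tcf_block_counters {
          atomic_t filtercnt; /* all filters attached to the block */
          atomic_t skipswcnt; /* filters carrying TCA_CLS_FLAGS_SKIP_SW */
  };

  /* Called on every filter insert/remove (hypothetical name). */
  static void tcf_block_filter_cnt_update(struct tcf_block_counters *b,
                                          bool add, bool skip_sw)
  {
          if (add) {
                  atomic_inc(&b->filtercnt);
                  if (skip_sw)
                          atomic_inc(&b->skipswcnt);
          } else {
                  if (skip_sw)
                          atomic_dec(&b->skipswcnt);
                  atomic_dec(&b->filtercnt);
          }
  }

With that invariant maintained, a single comparison on the datapath is
enough to tell whether any filter in the block still wants software
processing.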

Comments

Jamal Hadi Salim Feb. 15, 2024, 5:49 p.m. UTC | #1
On Thu, Feb 15, 2024 at 11:06 AM Asbjørn Sloth Tønnesen <ast@fiberby.net> wrote:
>
> TC filters come in 3 variants:
> - no flag (no opinion, process wherever possible)
> - skip_hw (do not process filter by hardware)
> - skip_sw (do not process filter by software)
>
> However, skip_sw is implemented such that the skip_sw
> flag can only be checked after the filter has been matched.
>
> IMHO it's common, when using skip_sw, to use it on all rules.
>
> So if all filters in a block are skip_sw filters, then
> we can bail out early, and thus avoid having to match
> the filters just to check for the skip_sw flag.
>
>  +----------------------------+--------+--------+--------+
>  | Test description           | Pre    | Post   | Rel.   |
>  |                            | kpps   | kpps   | chg.   |
>  +----------------------------+--------+--------+--------+
>  | basic forwarding + notrack | 1264.9 | 1277.7 |  1.01x |
>  | switch to eswitch mode     | 1067.1 | 1071.0 |  1.00x |
>  | add ingress qdisc          | 1056.0 | 1059.1 |  1.00x |
>  +----------------------------+--------+--------+--------+
>  | 1 non-matching rule        |  927.9 | 1057.1 |  1.14x |
>  | 10 non-matching rules      |  495.8 | 1055.6 |  2.13x |
>  | 25 non-matching rules      |  280.6 | 1053.5 |  3.75x |
>  | 50 non-matching rules      |  162.0 | 1055.7 |  6.52x |
>  | 100 non-matching rules     |   87.7 | 1019.0 | 11.62x |
>  +----------------------------+--------+--------+--------+
>
> perf top (100 n-m skip_sw rules - pre patch):
>   25.57%  [kernel]  [k] __skb_flow_dissect
>   20.77%  [kernel]  [k] rhashtable_jhash2
>   14.26%  [kernel]  [k] fl_classify
>   13.28%  [kernel]  [k] fl_mask_lookup
>    6.38%  [kernel]  [k] memset_orig
>    3.22%  [kernel]  [k] tcf_classify
>
> perf top (100 n-m skip_sw rules - post patch):
>    4.28%  [kernel]  [k] __dev_queue_xmit
>    3.80%  [kernel]  [k] check_preemption_disabled
>    3.68%  [kernel]  [k] nft_do_chain
>    3.08%  [kernel]  [k] __netif_receive_skb_core.constprop.0
>    2.59%  [kernel]  [k] mlx5e_xmit
>    2.48%  [kernel]  [k] mlx5e_skb_from_cqe_mpwrq_nonlinear
>

The concept makes sense - but I am wondering, when you have a mix of
skip_sw and skip_hw, if it makes more sense to just avoid looking up
skip_sw at all in the s/w datapath? Potentially by separating the
hashes for skip_sw/hw. I know it's a deeper surgery - but it would be
more general purpose... unless I am missing something
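
As an illustration of the split being suggested here - purely a sketch
with made-up names, not cls_flower's actual layout - the classifier
could route each filter into one of two lookup structures at insert
time, so the software datapath only ever walks the sw-visible one:

  /* Sketch: keep skip_sw entries out of the datapath lookup structure.
   * All struct and function names here are hypothetical.
   */
  #include <linux/list.h>
  #include <linux/rculist.h>

  struct demo_filter {
          struct list_head sw_node;  /* linked only if sw-visible */
          struct list_head all_node; /* always linked, for the control path */
          bool skip_sw;
  };

  struct demo_head {
          struct list_head sw_filters;  /* walked by the classify path */
          struct list_head all_filters; /* walked by dump/delete */
  };

  static void demo_insert(struct demo_head *head, struct demo_filter *f)
  {
          list_add_tail_rcu(&f->all_node, &head->all_filters);
          if (!f->skip_sw)
                  list_add_tail_rcu(&f->sw_node, &head->sw_filters);
  }

The classify path would then iterate only sw_filters, so skip_sw rules
cost nothing per packet no matter how many of them the block holds.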

> Test setup:
>  DUT: Intel Xeon D-1518 (2.20GHz) w/ Nvidia/Mellanox ConnectX-6 Dx 2x100G
>  Data rate measured on switch (Extreme X690), and DUT connected as
>  a router on a stick, with pktgen and pktsink as VLANs.
>  Pktgen was in range 12.79 - 12.95 Mpps across all tests.
>

Hrm. Those are "tiny" numbers (25G @64B is about 3x that). What are
the packet sizes?
Perhaps the traffic generator is a limitation here?
Also feels like you are doing exact matches? A sample flower rule
would have helped.

cheers,
jamal
> Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
> ---
>  include/net/pkt_cls.h | 5 +++++
>  net/core/dev.c        | 3 +++
>  2 files changed, 8 insertions(+)
>
> diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
> index a4ee43f493bb..a065da4df7ff 100644
> --- a/include/net/pkt_cls.h
> +++ b/include/net/pkt_cls.h
> @@ -74,6 +74,11 @@ static inline bool tcf_block_non_null_shared(struct tcf_block *block)
>         return block && block->index;
>  }
>
> +static inline bool tcf_block_has_skip_sw_only(struct tcf_block *block)
> +{
> +       return block && atomic_read(&block->filtercnt) == atomic_read(&block->skipswcnt);
> +}
> +
>  static inline struct Qdisc *tcf_block_q(struct tcf_block *block)
>  {
>         WARN_ON(tcf_block_shared(block));
> diff --git a/net/core/dev.c b/net/core/dev.c
> index d8dd293a7a27..7cd014e5066e 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3910,6 +3910,9 @@ static int tc_run(struct tcx_entry *entry, struct sk_buff *skb,
>         if (!miniq)
>                 return ret;
>
> +       if (tcf_block_has_skip_sw_only(miniq->block))
> +               return ret;
> +
>         tc_skb_cb(skb)->mru = 0;
>         tc_skb_cb(skb)->post_ct = false;
>         tcf_set_drop_reason(skb, *drop_reason);
> --
> 2.43.0
>
Vlad Buslov Feb. 16, 2024, 8:47 a.m. UTC | #2
On Thu 15 Feb 2024 at 16:04, Asbjørn Sloth Tønnesen <ast@fiberby.net> wrote:
> TC filters come in 3 variants:
> - no flag (no opinion, process wherever possible)
> - skip_hw (do not process filter by hardware)
> - skip_sw (do not process filter by software)
>
> However, skip_sw is implemented such that the skip_sw
> flag can only be checked after the filter has been matched.
>
> IMHO it's common, when using skip_sw, to use it on all rules.
>
> So if all filters in a block are skip_sw filters, then
> we can bail out early, and thus avoid having to match
> the filters just to check for the skip_sw flag.
>
>  +----------------------------+--------+--------+--------+
>  | Test description           | Pre    | Post   | Rel.   |
>  |                            | kpps   | kpps   | chg.   |
>  +----------------------------+--------+--------+--------+
>  | basic forwarding + notrack | 1264.9 | 1277.7 |  1.01x |
>  | switch to eswitch mode     | 1067.1 | 1071.0 |  1.00x |
>  | add ingress qdisc          | 1056.0 | 1059.1 |  1.00x |
>  +----------------------------+--------+--------+--------+
>  | 1 non-matching rule        |  927.9 | 1057.1 |  1.14x |
>  | 10 non-matching rules      |  495.8 | 1055.6 |  2.13x |
>  | 25 non-matching rules      |  280.6 | 1053.5 |  3.75x |
>  | 50 non-matching rules      |  162.0 | 1055.7 |  6.52x |
>  | 100 non-matching rules     |   87.7 | 1019.0 | 11.62x |
>  +----------------------------+--------+--------+--------+
>
> perf top (100 n-m skip_sw rules - pre patch):
>   25.57%  [kernel]  [k] __skb_flow_dissect
>   20.77%  [kernel]  [k] rhashtable_jhash2
>   14.26%  [kernel]  [k] fl_classify
>   13.28%  [kernel]  [k] fl_mask_lookup
>    6.38%  [kernel]  [k] memset_orig
>    3.22%  [kernel]  [k] tcf_classify
>
> perf top (100 n-m skip_sw rules - post patch):
>    4.28%  [kernel]  [k] __dev_queue_xmit
>    3.80%  [kernel]  [k] check_preemption_disabled
>    3.68%  [kernel]  [k] nft_do_chain
>    3.08%  [kernel]  [k] __netif_receive_skb_core.constprop.0
>    2.59%  [kernel]  [k] mlx5e_xmit
>    2.48%  [kernel]  [k] mlx5e_skb_from_cqe_mpwrq_nonlinear
>
> Test setup:
>  DUT: Intel Xeon D-1518 (2.20GHz) w/ Nvidia/Mellanox ConnectX-6 Dx 2x100G
>  Data rate measured on switch (Extreme X690), and DUT connected as
>  a router on a stick, with pktgen and pktsink as VLANs.
>  Pktgen was in range 12.79 - 12.95 Mpps across all tests.
>
> Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
> ---
>  include/net/pkt_cls.h | 5 +++++
>  net/core/dev.c        | 3 +++
>  2 files changed, 8 insertions(+)
>
> diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
> index a4ee43f493bb..a065da4df7ff 100644
> --- a/include/net/pkt_cls.h
> +++ b/include/net/pkt_cls.h
> @@ -74,6 +74,11 @@ static inline bool tcf_block_non_null_shared(struct tcf_block *block)
>  	return block && block->index;
>  }
>  
> +static inline bool tcf_block_has_skip_sw_only(struct tcf_block *block)
> +{
> +	return block && atomic_read(&block->filtercnt) == atomic_read(&block->skipswcnt);
> +}

Note that this introduces a read from a heavily contended cache line on
the data path for all classifiers, including the ones that don't support
offloads. Wonder if this is a concern for users running purely software tc.

> +
>  static inline struct Qdisc *tcf_block_q(struct tcf_block *block)
>  {
>  	WARN_ON(tcf_block_shared(block));
> diff --git a/net/core/dev.c b/net/core/dev.c
> index d8dd293a7a27..7cd014e5066e 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3910,6 +3910,9 @@ static int tc_run(struct tcx_entry *entry, struct sk_buff *skb,
>  	if (!miniq)
>  		return ret;
>  
> +	if (tcf_block_has_skip_sw_only(miniq->block))
> +		return ret;
> +
>  	tc_skb_cb(skb)->mru = 0;
>  	tc_skb_cb(skb)->post_ct = false;
>  	tcf_set_drop_reason(skb, *drop_reason);
Jiri Pirko Feb. 16, 2024, 12:57 p.m. UTC | #3
Thu, Feb 15, 2024 at 06:49:05PM CET, jhs@mojatatu.com wrote:
>On Thu, Feb 15, 2024 at 11:06 AM Asbjørn Sloth Tønnesen <ast@fiberby.net> wrote:
>>
>> TC filters come in 3 variants:
>> - no flag (no opinion, process wherever possible)
>> - skip_hw (do not process filter by hardware)
>> - skip_sw (do not process filter by software)
>>
>> However, skip_sw is implemented such that the skip_sw
>> flag can only be checked after the filter has been matched.
>>
>> IMHO it's common, when using skip_sw, to use it on all rules.
>>
>> So if all filters in a block are skip_sw filters, then
>> we can bail out early, and thus avoid having to match
>> the filters just to check for the skip_sw flag.
>>
>>  +----------------------------+--------+--------+--------+
>>  | Test description           | Pre    | Post   | Rel.   |
>>  |                            | kpps   | kpps   | chg.   |
>>  +----------------------------+--------+--------+--------+
>>  | basic forwarding + notrack | 1264.9 | 1277.7 |  1.01x |
>>  | switch to eswitch mode     | 1067.1 | 1071.0 |  1.00x |
>>  | add ingress qdisc          | 1056.0 | 1059.1 |  1.00x |
>>  +----------------------------+--------+--------+--------+
>>  | 1 non-matching rule        |  927.9 | 1057.1 |  1.14x |
>>  | 10 non-matching rules      |  495.8 | 1055.6 |  2.13x |
>>  | 25 non-matching rules      |  280.6 | 1053.5 |  3.75x |
>>  | 50 non-matching rules      |  162.0 | 1055.7 |  6.52x |
>>  | 100 non-matching rules     |   87.7 | 1019.0 | 11.62x |
>>  +----------------------------+--------+--------+--------+
>>
>> perf top (100 n-m skip_sw rules - pre patch):
>>   25.57%  [kernel]  [k] __skb_flow_dissect
>>   20.77%  [kernel]  [k] rhashtable_jhash2
>>   14.26%  [kernel]  [k] fl_classify
>>   13.28%  [kernel]  [k] fl_mask_lookup
>>    6.38%  [kernel]  [k] memset_orig
>>    3.22%  [kernel]  [k] tcf_classify
>>
>> perf top (100 n-m skip_sw rules - post patch):
>>    4.28%  [kernel]  [k] __dev_queue_xmit
>>    3.80%  [kernel]  [k] check_preemption_disabled
>>    3.68%  [kernel]  [k] nft_do_chain
>>    3.08%  [kernel]  [k] __netif_receive_skb_core.constprop.0
>>    2.59%  [kernel]  [k] mlx5e_xmit
>>    2.48%  [kernel]  [k] mlx5e_skb_from_cqe_mpwrq_nonlinear
>>
>
>The concept makes sense - but I am wondering, when you have a mix of
>skip_sw and skip_hw, if it makes more sense to just avoid looking up
>skip_sw at all in the s/w datapath? Potentially by separating the
>hashes for skip_sw/hw. I know it's a deeper surgery - but it would be

Yeah, there could be 2 hashes: skip_sw/rest.
rest is the only one that needs to be looked up in the kernel datapath;
skip_sw is just for the control path.

But is it worth the effort? I mean, until now, nobody seemed to care. If
this patchset solves the problem for this use case, I think it is enough.

In that case, I'm fine with this patch:

Reviewed-by: Jiri Pirko <jiri@nvidia.com>



>more general purpose... unless I am missing something
>
>> Test setup:
>>  DUT: Intel Xeon D-1518 (2.20GHz) w/ Nvidia/Mellanox ConnectX-6 Dx 2x100G
>>  Data rate measured on switch (Extreme X690), and DUT connected as
>>  a router on a stick, with pktgen and pktsink as VLANs.
>>  Pktgen was in range 12.79 - 12.95 Mpps across all tests.
>>
>
>Hrm. Those are "tiny" numbers (25G @64B is about 3x that). What are
>the packet sizes?
>Perhaps the traffic generator is a limitation here?
>Also feels like you are doing exact matches? A sample flower rule
>would have helped.
>
>cheers,
>jamal
>> Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
>> ---
>>  include/net/pkt_cls.h | 5 +++++
>>  net/core/dev.c        | 3 +++
>>  2 files changed, 8 insertions(+)
>>
>> diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
>> index a4ee43f493bb..a065da4df7ff 100644
>> --- a/include/net/pkt_cls.h
>> +++ b/include/net/pkt_cls.h
>> @@ -74,6 +74,11 @@ static inline bool tcf_block_non_null_shared(struct tcf_block *block)
>>         return block && block->index;
>>  }
>>
>> +static inline bool tcf_block_has_skip_sw_only(struct tcf_block *block)
>> +{
>> +       return block && atomic_read(&block->filtercnt) == atomic_read(&block->skipswcnt);
>> +}
>> +
>>  static inline struct Qdisc *tcf_block_q(struct tcf_block *block)
>>  {
>>         WARN_ON(tcf_block_shared(block));
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index d8dd293a7a27..7cd014e5066e 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -3910,6 +3910,9 @@ static int tc_run(struct tcx_entry *entry, struct sk_buff *skb,
>>         if (!miniq)
>>                 return ret;
>>
>> +       if (tcf_block_has_skip_sw_only(miniq->block))
>> +               return ret;
>> +
>>         tc_skb_cb(skb)->mru = 0;
>>         tc_skb_cb(skb)->post_ct = false;
>>         tcf_set_drop_reason(skb, *drop_reason);
>> --
>> 2.43.0
>>
Asbjørn Sloth Tønnesen Feb. 16, 2024, 1:38 p.m. UTC | #4
Hi Jamal,

On 2/15/24 17:49, Jamal Hadi Salim wrote:
> On Thu, Feb 15, 2024 at 11:06 AM Asbjørn Sloth Tønnesen <ast@fiberby.net> wrote:
>>
>> TC filters come in 3 variants:
>> - no flag (no opinion, process wherever possible)
>> - skip_hw (do not process filter by hardware)
>> - skip_sw (do not process filter by software)
>>
>> However, skip_sw is implemented such that the skip_sw
>> flag can only be checked after the filter has been matched.
>>
>> IMHO it's common, when using skip_sw, to use it on all rules.
>>
>> So if all filters in a block are skip_sw filters, then
>> we can bail out early, and thus avoid having to match
>> the filters just to check for the skip_sw flag.
>>
>>   +----------------------------+--------+--------+--------+
>>   | Test description           | Pre    | Post   | Rel.   |
>>   |                            | kpps   | kpps   | chg.   |
>>   +----------------------------+--------+--------+--------+
>>   | basic forwarding + notrack | 1264.9 | 1277.7 |  1.01x |
>>   | switch to eswitch mode     | 1067.1 | 1071.0 |  1.00x |
>>   | add ingress qdisc          | 1056.0 | 1059.1 |  1.00x |
>>   +----------------------------+--------+--------+--------+
>>   | 1 non-matching rule        |  927.9 | 1057.1 |  1.14x |
>>   | 10 non-matching rules      |  495.8 | 1055.6 |  2.13x |
>>   | 25 non-matching rules      |  280.6 | 1053.5 |  3.75x |
>>   | 50 non-matching rules      |  162.0 | 1055.7 |  6.52x |
>>   | 100 non-matching rules     |   87.7 | 1019.0 | 11.62x |
>>   +----------------------------+--------+--------+--------+
>>
>> perf top (100 n-m skip_sw rules - pre patch):
>>    25.57%  [kernel]  [k] __skb_flow_dissect
>>    20.77%  [kernel]  [k] rhashtable_jhash2
>>    14.26%  [kernel]  [k] fl_classify
>>    13.28%  [kernel]  [k] fl_mask_lookup
>>     6.38%  [kernel]  [k] memset_orig
>>     3.22%  [kernel]  [k] tcf_classify
>>
>> perf top (100 n-m skip_sw rules - post patch):
>>     4.28%  [kernel]  [k] __dev_queue_xmit
>>     3.80%  [kernel]  [k] check_preemption_disabled
>>     3.68%  [kernel]  [k] nft_do_chain
>>     3.08%  [kernel]  [k] __netif_receive_skb_core.constprop.0
>>     2.59%  [kernel]  [k] mlx5e_xmit
>>     2.48%  [kernel]  [k] mlx5e_skb_from_cqe_mpwrq_nonlinear
>>
> 
> The concept makes sense - but I am wondering, when you have a mix of
> skip_sw and skip_hw, if it makes more sense to just avoid looking up
> skip_sw at all in the s/w datapath? Potentially by separating the
> hashes for skip_sw/hw. I know it's a deeper surgery - but it would be
> more general purpose... unless I am missing something
 >
>> Test setup:
>>   DUT: Intel Xeon D-1518 (2.20GHz) w/ Nvidia/Mellanox ConnectX-6 Dx 2x100G
>>   Data rate measured on switch (Extreme X690), and DUT connected as
>>   a router on a stick, with pktgen and pktsink as VLANs.
>>   Pktgen was in range 12.79 - 12.95 Mpps across all tests.
>>
> 
> Hrm. Those are "tiny" numbers (25G @64B is about 3x that). What are
> the packet sizes?
> Perhaps the traffic generator is a limitation here?
> Also feels like you are doing exact matches? A sample flower rule
> would have helped.

Yeah, I would also have liked those numbers to be higher. Sorry, I forgot to mention that it is 64B packets.

Sadly, I used two machines to compensate for my lack of pktgen skills.
I know that there are faster packet generators, but I just used the in-kernel one
and haven't invested much time into it. The usual ethtool params mentioned in
the doc didn't change much, and since it was still more than enough packets for
the DUT to go to 100% CPU, I didn't pursue it further.

pktgen A: Xeon E5-1620 v2 @ 3.70GHz w/ ConnectX-6 Dx and 100G link.
pktgen B: Xeon(R) D-2123IT @ 2.20GHz w/ Intel X722 and 10G link.

Both pktgen boxes are running a stock Debian bullseye kernel (v6.1).

I found this to work best:
./pktgen_sample05_flow_per_thread.sh -i enp8s0f0np0 -s 64 -d 10.53.22.3 -m 2a:11:22:33:21:11 -p 1024-65000 -t 8 -n 0

Data rates monitored through SNMP using:
   snmpdelta -Cp 10 -c public -v 2c -Cs $hostname IF-MIB::ifInUcastPkts.1057 IF-MIB::ifOutUcastPkts.1057

VLAN config, MAC and IP addressing:
   v21:
     2a:11:22:33:21:11 - 10.53.21.1/27 - dut v21      (tagged VLAN)
     2a:11:22:33:21:12 - 10.53.21.2/27 - pktgen-a     (untagged)
     2a:11:22:33:21:13 - 10.53.21.3/27 - pktgen-b     (untagged)

   v22:
     2a:11:22:33:22:21 - 10.53.22.2/31 - dut v22      (tagged VLAN)
     2a:11:22:33:22:22 - 10.53.22.3/31 - packet drop  (untagged)


Switch MAC address table config:
   # static entry
   create fdb 2a:11:22:33:21:11 vlan "v21" port 57

   # blackhole pktsink
   create fdb 2a:11:22:33:22:22 vlan "v22" blackhole

Flow control is disabled on all links:
   ethtool -A $dev rx off tx off

I have uploaded the script used for the flower rules in the tests here:
   https://files.fiberby.net/ast/2024/tc_skip_sw/flower_placement_tests.sh


> cheers,
> jamal
>> Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
>> ---
>>   include/net/pkt_cls.h | 5 +++++
>>   net/core/dev.c        | 3 +++
>>   2 files changed, 8 insertions(+)
>>
>> diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
>> index a4ee43f493bb..a065da4df7ff 100644
>> --- a/include/net/pkt_cls.h
>> +++ b/include/net/pkt_cls.h
>> @@ -74,6 +74,11 @@ static inline bool tcf_block_non_null_shared(struct tcf_block *block)
>>          return block && block->index;
>>   }
>>
>> +static inline bool tcf_block_has_skip_sw_only(struct tcf_block *block)
>> +{
>> +       return block && atomic_read(&block->filtercnt) == atomic_read(&block->skipswcnt);
>> +}
>> +
>>   static inline struct Qdisc *tcf_block_q(struct tcf_block *block)
>>   {
>>          WARN_ON(tcf_block_shared(block));
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index d8dd293a7a27..7cd014e5066e 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -3910,6 +3910,9 @@ static int tc_run(struct tcx_entry *entry, struct sk_buff *skb,
>>          if (!miniq)
>>                  return ret;
>>
>> +       if (tcf_block_has_skip_sw_only(miniq->block))
>> +               return ret;
>> +
>>          tc_skb_cb(skb)->mru = 0;
>>          tc_skb_cb(skb)->post_ct = false;
>>          tcf_set_drop_reason(skb, *drop_reason);
>> --
>> 2.43.0
>>
Asbjørn Sloth Tønnesen Feb. 16, 2024, 2:01 p.m. UTC | #5
Hi Vlad,

On 2/16/24 08:47, Vlad Buslov wrote:
> On Thu 15 Feb 2024 at 16:04, Asbjørn Sloth Tønnesen <ast@fiberby.net> wrote:
>> TC filters come in 3 variants:
>> - no flag (no opinion, process wherever possible)
>> - skip_hw (do not process filter by hardware)
>> - skip_sw (do not process filter by software)
>>
>> However, skip_sw is implemented such that the skip_sw
>> flag can only be checked after the filter has been matched.
>>
>> IMHO it's common, when using skip_sw, to use it on all rules.
>>
>> So if all filters in a block are skip_sw filters, then
>> we can bail out early, and thus avoid having to match
>> the filters just to check for the skip_sw flag.
>>
>>   +----------------------------+--------+--------+--------+
>>   | Test description           | Pre    | Post   | Rel.   |
>>   |                            | kpps   | kpps   | chg.   |
>>   +----------------------------+--------+--------+--------+
>>   | basic forwarding + notrack | 1264.9 | 1277.7 |  1.01x |
>>   | switch to eswitch mode     | 1067.1 | 1071.0 |  1.00x |
>>   | add ingress qdisc          | 1056.0 | 1059.1 |  1.00x |
>>   +----------------------------+--------+--------+--------+
>>   | 1 non-matching rule        |  927.9 | 1057.1 |  1.14x |
>>   | 10 non-matching rules      |  495.8 | 1055.6 |  2.13x |
>>   | 25 non-matching rules      |  280.6 | 1053.5 |  3.75x |
>>   | 50 non-matching rules      |  162.0 | 1055.7 |  6.52x |
>>   | 100 non-matching rules     |   87.7 | 1019.0 | 11.62x |
>>   +----------------------------+--------+--------+--------+
>>
>> perf top (100 n-m skip_sw rules - pre patch):
>>    25.57%  [kernel]  [k] __skb_flow_dissect
>>    20.77%  [kernel]  [k] rhashtable_jhash2
>>    14.26%  [kernel]  [k] fl_classify
>>    13.28%  [kernel]  [k] fl_mask_lookup
>>     6.38%  [kernel]  [k] memset_orig
>>     3.22%  [kernel]  [k] tcf_classify
>>
>> perf top (100 n-m skip_sw rules - post patch):
>>     4.28%  [kernel]  [k] __dev_queue_xmit
>>     3.80%  [kernel]  [k] check_preemption_disabled
>>     3.68%  [kernel]  [k] nft_do_chain
>>     3.08%  [kernel]  [k] __netif_receive_skb_core.constprop.0
>>     2.59%  [kernel]  [k] mlx5e_xmit
>>     2.48%  [kernel]  [k] mlx5e_skb_from_cqe_mpwrq_nonlinear
>>
>> Test setup:
>>   DUT: Intel Xeon D-1518 (2.20GHz) w/ Nvidia/Mellanox ConnectX-6 Dx 2x100G
>>   Data rate measured on switch (Extreme X690), and DUT connected as
>>   a router on a stick, with pktgen and pktsink as VLANs.
>>   Pktgen was in range 12.79 - 12.95 Mpps across all tests.
>>
>> Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
>> ---
>>   include/net/pkt_cls.h | 5 +++++
>>   net/core/dev.c        | 3 +++
>>   2 files changed, 8 insertions(+)
>>
>> diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
>> index a4ee43f493bb..a065da4df7ff 100644
>> --- a/include/net/pkt_cls.h
>> +++ b/include/net/pkt_cls.h
>> @@ -74,6 +74,11 @@ static inline bool tcf_block_non_null_shared(struct tcf_block *block)
>>   	return block && block->index;
>>   }
>>   
>> +static inline bool tcf_block_has_skip_sw_only(struct tcf_block *block)
>> +{
>> +	return block && atomic_read(&block->filtercnt) == atomic_read(&block->skipswcnt);
>> +}
> 
> Note that this introduces a read from a heavily contended cache line on
> the data path for all classifiers, including the ones that don't support
> offloads. Wonder if this is a concern for users running purely software tc.

Unfortunately, I don't have access to any multi-CPU machines, so I haven't been
able to test the impact of that.

Alternatively, I guess I could also maintain a static key in the counter update logic.
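
A minimal sketch of that static key idea, assuming one global key
flipped from the same counter update logic (all names below are made
up, not existing kernel symbols):

  /* Sketch: gate the software datapath behind a static key so the
   * per-packet check avoids reading contended counters entirely.
   */
  #include <linux/jump_label.h>

  DEFINE_STATIC_KEY_FALSE(demo_tcf_sw_needed);

  /* Called from the counter update logic whenever a block transitions
   * into or out of the "all filters are skip_sw" state.
   */
  static void demo_tcf_sw_needed_toggle(bool sw_needed)
  {
          if (sw_needed)
                  static_branch_inc(&demo_tcf_sw_needed);
          else
                  static_branch_dec(&demo_tcf_sw_needed);
  }

  /* In tc_run(): compiles down to a patched jump, not a load. */
  static inline bool demo_tcf_sw_datapath_needed(void)
  {
          return static_branch_unlikely(&demo_tcf_sw_needed);
  }

static_branch_inc()/dec() keep a reference count, so the key only flips
when the first block starts (or the last block stops) needing the
software path.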


>> +
>>   static inline struct Qdisc *tcf_block_q(struct tcf_block *block)
>>   {
>>   	WARN_ON(tcf_block_shared(block));
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index d8dd293a7a27..7cd014e5066e 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -3910,6 +3910,9 @@ static int tc_run(struct tcx_entry *entry, struct sk_buff *skb,
>>   	if (!miniq)
>>   		return ret;
>>   
>> +	if (tcf_block_has_skip_sw_only(miniq->block))
>> +		return ret;
>> +
>>   	tc_skb_cb(skb)->mru = 0;
>>   	tc_skb_cb(skb)->post_ct = false;
>>   	tcf_set_drop_reason(skb, *drop_reason);
>
Jamal Hadi Salim Feb. 16, 2024, 3:07 p.m. UTC | #6
On Fri, Feb 16, 2024 at 7:57 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Thu, Feb 15, 2024 at 06:49:05PM CET, jhs@mojatatu.com wrote:
> >On Thu, Feb 15, 2024 at 11:06 AM Asbjørn Sloth Tønnesen <ast@fiberby.net> wrote:
> >>
> >> TC filters come in 3 variants:
> >> - no flag (no opinion, process wherever possible)
> >> - skip_hw (do not process filter by hardware)
> >> - skip_sw (do not process filter by software)
> >>
> >> However, skip_sw is implemented such that the skip_sw
> >> flag can only be checked after the filter has been matched.
> >>
> >> IMHO it's common, when using skip_sw, to use it on all rules.
> >>
> >> So if all filters in a block are skip_sw filters, then
> >> we can bail out early, and thus avoid having to match
> >> the filters just to check for the skip_sw flag.
> >>
> >>  +----------------------------+--------+--------+--------+
> >>  | Test description           | Pre    | Post   | Rel.   |
> >>  |                            | kpps   | kpps   | chg.   |
> >>  +----------------------------+--------+--------+--------+
> >>  | basic forwarding + notrack | 1264.9 | 1277.7 |  1.01x |
> >>  | switch to eswitch mode     | 1067.1 | 1071.0 |  1.00x |
> >>  | add ingress qdisc          | 1056.0 | 1059.1 |  1.00x |
> >>  +----------------------------+--------+--------+--------+
> >>  | 1 non-matching rule        |  927.9 | 1057.1 |  1.14x |
> >>  | 10 non-matching rules      |  495.8 | 1055.6 |  2.13x |
> >>  | 25 non-matching rules      |  280.6 | 1053.5 |  3.75x |
> >>  | 50 non-matching rules      |  162.0 | 1055.7 |  6.52x |
> >>  | 100 non-matching rules     |   87.7 | 1019.0 | 11.62x |
> >>  +----------------------------+--------+--------+--------+
> >>
> >> perf top (100 n-m skip_sw rules - pre patch):
> >>   25.57%  [kernel]  [k] __skb_flow_dissect
> >>   20.77%  [kernel]  [k] rhashtable_jhash2
> >>   14.26%  [kernel]  [k] fl_classify
> >>   13.28%  [kernel]  [k] fl_mask_lookup
> >>    6.38%  [kernel]  [k] memset_orig
> >>    3.22%  [kernel]  [k] tcf_classify
> >>
> >> perf top (100 n-m skip_sw rules - post patch):
> >>    4.28%  [kernel]  [k] __dev_queue_xmit
> >>    3.80%  [kernel]  [k] check_preemption_disabled
> >>    3.68%  [kernel]  [k] nft_do_chain
> >>    3.08%  [kernel]  [k] __netif_receive_skb_core.constprop.0
> >>    2.59%  [kernel]  [k] mlx5e_xmit
> >>    2.48%  [kernel]  [k] mlx5e_skb_from_cqe_mpwrq_nonlinear
> >>
> >
> >The concept makes sense - but I am wondering, when you have a mix of
> >skip_sw and skip_hw, if it makes more sense to just avoid looking up
> >skip_sw at all in the s/w datapath? Potentially by separating the
> >hashes for skip_sw/hw. I know it's a deeper surgery - but it would be
>
> Yeah, there could be 2 hashes: skip_sw/rest.
> rest is the only one that needs to be looked up in the kernel datapath;
> skip_sw is just for the control path.
>
> But is it worth the effort? I mean, until now, nobody seemed to care. If
> this patchset solves the problem for this use case, I think it is enough.
>

May not be worth the effort - and this is a reasonable use case. The
approach is a hack nonetheless and kills at least some insects. To
address the issues Vlad brought up, perhaps we should wrap it under
some kconfig.

cheers,
jamal

> In that case, I'm fine with this patch:
>
> Reviewed-by: Jiri Pirko <jiri@nvidia.com>
>
>
>
> >more general purpose... unless I am missing something
> >
> >> Test setup:
> >>  DUT: Intel Xeon D-1518 (2.20GHz) w/ Nvidia/Mellanox ConnectX-6 Dx 2x100G
> >>  Data rate measured on switch (Extreme X690), and DUT connected as
> >>  a router on a stick, with pktgen and pktsink as VLANs.
> >>  Pktgen was in range 12.79 - 12.95 Mpps across all tests.
> >>
> >
> >Hrm. Those are "tiny" numbers (25G @64B is about 3x that). What are
> >the packet sizes?
> >Perhaps the traffic generator is a limitation here?
> >Also feels like you are doing exact matches? A sample flower rule
> >would have helped.
> >
> >cheers,
> >jamal
> >> Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
> >> ---
> >>  include/net/pkt_cls.h | 5 +++++
> >>  net/core/dev.c        | 3 +++
> >>  2 files changed, 8 insertions(+)
> >>
> >> diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
> >> index a4ee43f493bb..a065da4df7ff 100644
> >> --- a/include/net/pkt_cls.h
> >> +++ b/include/net/pkt_cls.h
> >> @@ -74,6 +74,11 @@ static inline bool tcf_block_non_null_shared(struct tcf_block *block)
> >>         return block && block->index;
> >>  }
> >>
> >> +static inline bool tcf_block_has_skip_sw_only(struct tcf_block *block)
> >> +{
> >> +       return block && atomic_read(&block->filtercnt) == atomic_read(&block->skipswcnt);
> >> +}
> >> +
> >>  static inline struct Qdisc *tcf_block_q(struct tcf_block *block)
> >>  {
> >>         WARN_ON(tcf_block_shared(block));
> >> diff --git a/net/core/dev.c b/net/core/dev.c
> >> index d8dd293a7a27..7cd014e5066e 100644
> >> --- a/net/core/dev.c
> >> +++ b/net/core/dev.c
> >> @@ -3910,6 +3910,9 @@ static int tc_run(struct tcx_entry *entry, struct sk_buff *skb,
> >>         if (!miniq)
> >>                 return ret;
> >>
> >> +       if (tcf_block_has_skip_sw_only(miniq->block))
> >> +               return ret;
> >> +
> >>         tc_skb_cb(skb)->mru = 0;
> >>         tc_skb_cb(skb)->post_ct = false;
> >>         tcf_set_drop_reason(skb, *drop_reason);
> >> --
> >> 2.43.0
> >>

Patch

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index a4ee43f493bb..a065da4df7ff 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -74,6 +74,11 @@  static inline bool tcf_block_non_null_shared(struct tcf_block *block)
 	return block && block->index;
 }
 
+static inline bool tcf_block_has_skip_sw_only(struct tcf_block *block)
+{
+	return block && atomic_read(&block->filtercnt) == atomic_read(&block->skipswcnt);
+}
+
 static inline struct Qdisc *tcf_block_q(struct tcf_block *block)
 {
 	WARN_ON(tcf_block_shared(block));
diff --git a/net/core/dev.c b/net/core/dev.c
index d8dd293a7a27..7cd014e5066e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3910,6 +3910,9 @@  static int tc_run(struct tcx_entry *entry, struct sk_buff *skb,
 	if (!miniq)
 		return ret;
 
+	if (tcf_block_has_skip_sw_only(miniq->block))
+		return ret;
+
 	tc_skb_cb(skb)->mru = 0;
 	tc_skb_cb(skb)->post_ct = false;
 	tcf_set_drop_reason(skb, *drop_reason);