
[RFC/PATCH,net-next,v3,0/8] allow user to offload tc action to net device

Message ID: 20211028110646.13791-1-simon.horman@corigine.com

Message

Simon Horman Oct. 28, 2021, 11:06 a.m. UTC
Baowen Zheng says:

Allow use of flow_indr_dev_register/flow_indr_dev_setup_offload to offload
tc actions independent of flows.

The motivation for this work is to prepare for using TC police action
instances to provide hardware offload of OVS metering feature - which calls
for policers that may be used by multiple flows and whose lifecycle is
independent of any flows that use them.

This patchset includes basic changes to offload drivers to return EOPNOTSUPP
if this feature is used - it is not yet supported by any driver.

TC CLI commands to offload and query an action:

tc qdisc del dev $DEV ingress && sleep 1 || true
tc actions delete action police index 99 || true

tc qdisc add dev $DEV ingress
tc qdisc show dev $DEV ingress

tc actions add action police index 99 rate 1mbit burst 100k skip_sw
tc actions list action police

tc filter add dev $DEV protocol ip parent ffff: \
flower ip_proto tcp action police index 99
tc -s -d filter show dev $DEV protocol ip parent ffff:
tc filter add dev $DEV protocol ipv6 parent ffff: \
flower skip_sw ip_proto tcp action police index 99
tc -s -d filter show dev $DEV protocol ipv6 parent ffff:
tc actions list action police

tc qdisc del dev $DEV ingress && sleep 1
tc actions delete action police index 99
tc actions list action police

Changes compared to v2 patches:

* Made changes according to the review comments.
* Deleted the in_hw and not_in_hw flags; users can judge whether an action
  is offloaded to any hardware by its in_hw_count (see the example after
  this list).
* Split the main action offload patch into three separate patches to
  facilitate code review.
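
As an illustration of the last point above, the offload state of the
police action created in the example commands can be inspected with the
following - a hedged example; the exact fields printed depend on the
iproute2 version in use:

tc -s actions get action police index 99

The in_hw_count reported there indicates how many hardware devices have
accepted the action.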

We are posting this revision of the patchset as an RFC because, while we
feel it is ready for review, we would like an opportunity to conduct
further testing before acceptance upstream.

Baowen Zheng (8):
  flow_offload: fill flags to action structure
  flow_offload: reject to offload tc actions in offload drivers
  flow_offload: allow user to offload tc action to net device
  flow_offload: add skip_hw and skip_sw to control if offload the action
  flow_offload: add process to update action stats from hardware
  net: sched: save full flags for tc action
  flow_offload: add reoffload process to update hw_count
  flow_offload: validate flags of filter and actions

 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c  |   2 +-
 .../ethernet/mellanox/mlx5/core/en/rep/tc.c   |   3 +
 .../ethernet/netronome/nfp/flower/offload.c   |   3 +
 include/linux/netdevice.h                     |   1 +
 include/net/act_api.h                         |  34 +-
 include/net/flow_offload.h                    |  17 +
 include/net/pkt_cls.h                         |  61 ++-
 include/uapi/linux/pkt_cls.h                  |   9 +-
 net/core/flow_offload.c                       |  48 +-
 net/sched/act_api.c                           | 440 +++++++++++++++++-
 net/sched/act_bpf.c                           |   2 +-
 net/sched/act_connmark.c                      |   2 +-
 net/sched/act_ctinfo.c                        |   2 +-
 net/sched/act_gate.c                          |   2 +-
 net/sched/act_ife.c                           |   2 +-
 net/sched/act_ipt.c                           |   2 +-
 net/sched/act_mpls.c                          |   2 +-
 net/sched/act_nat.c                           |   2 +-
 net/sched/act_pedit.c                         |   2 +-
 net/sched/act_police.c                        |   2 +-
 net/sched/act_sample.c                        |   2 +-
 net/sched/act_simple.c                        |   2 +-
 net/sched/act_skbedit.c                       |   2 +-
 net/sched/act_skbmod.c                        |   2 +-
 net/sched/cls_api.c                           |  55 ++-
 net/sched/cls_flower.c                        |   3 +-
 net/sched/cls_matchall.c                      |   4 +-
 net/sched/cls_u32.c                           |   7 +-
 28 files changed, 661 insertions(+), 54 deletions(-)

Comments

Jamal Hadi Salim Oct. 28, 2021, 2:23 p.m. UTC | #1
On 2021-10-28 07:06, Simon Horman wrote:
> [...]
>
> TC CLI commands to offload and query an action:
> 
> tc qdisc del dev $DEV ingress && sleep 1 || true
> tc actions delete action police index 99 || true
> 
> tc qdisc add dev $DEV ingress
> tc qdisc show dev $DEV ingress
> 
> tc actions add action police index 99 rate 1mbit burst 100k skip_sw
> tc actions list action police
> 
> tc filter add dev $DEV protocol ip parent ffff: \
> flower ip_proto tcp action police index 99
> tc -s -d filter show dev $DEV protocol ip parent ffff:
> tc filter add dev $DEV protocol ipv6 parent ffff: \
> flower skip_sw ip_proto tcp action police index 99
> tc -s -d filter show dev $DEV protocol ipv6 parent ffff:
> tc actions list action police
> 
> tc qdisc del dev $DEV ingress && sleep 1
> tc actions delete action police index 99
> tc actions list action police


It will be helpful to display the output of the show commands in the
cover letter....

cheers,
jamal
Jamal Hadi Salim Oct. 28, 2021, 2:39 p.m. UTC | #2
On 2021-10-28 10:23, Jamal Hadi Salim wrote:
[..]
> 
> It will be helpful to display the output of the show commands in the
> cover letter....

Also some tdc tests please...
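
For reference, the tdc scripts live under tools/testing/selftests/tc-testing
in the kernel tree; a hedged sketch of running the existing police tests
(the exact invocation may vary between trees):

cd tools/testing/selftests/tc-testing
./tdc.py -c police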

cheers,
jamal
Oz Shlomo Oct. 31, 2021, 9:50 a.m. UTC | #3
On 10/28/2021 2:06 PM, Simon Horman wrote:
> [...]
>
> tc actions add action police index 99 rate 1mbit burst 100k skip_sw
>
> tc filter add dev $DEV protocol ip parent ffff: \
> flower ip_proto tcp action police index 99
> tc -s -d filter show dev $DEV protocol ip parent ffff:
> tc filter add dev $DEV protocol ipv6 parent ffff: \
> flower skip_sw ip_proto tcp action police index 99
> tc -s -d filter show dev $DEV protocol ipv6 parent ffff:
> tc actions list action police
> 
> tc qdisc del dev $DEV ingress && sleep 1
> tc actions delete action police index 99
> tc actions list action police
> 

Actions are also (implicitly) instantiated when filters are created.
In the following example the mirred action instance (created by the first
filter) is shared by the second filter:

tc filter add dev $DEV1 proto ip parent ffff: flower \
	ip_proto tcp action mirred egress redirect dev $DEV3

tc filter add dev $DEV2 proto ip parent ffff: flower \
	ip_proto tcp action mirred index 1
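
The sharing can be verified by listing the action and checking its ref
and bind counts - a hedged example (output details vary with the
iproute2 version):

tc -s actions ls action mirred

If both filters bind to mirred index 1, its bind count should read 2.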


Dave Taht Oct. 31, 2021, 12:03 p.m. UTC | #4
On Sun, Oct 31, 2021 at 2:51 AM Oz Shlomo <ozsh@nvidia.com> wrote:
> [...]

Just as an on-going grump: it has been my hope that policing as a
technique would have died a horrible death by now. Seeing it come back
as an "easy to offload" operation here - fresh from the 1990s! - does
not mean it's a good idea, and I'd rather like it if we were finding
ways to offload newer things that work better, such as modern AQM,
fair queueing, and the shaping technologies that are in pie, fq_codel,
and cake.

Policing leads to bursty loss, especially at higher rates; BBR has a
specific mode designed to defeat it, and I ripped it out of wondershaper
long ago for very good reasons:
https://www.bufferbloat.net/projects/bloat/wiki/Wondershaper_Must_Die/

A long time ago I started working on a better policing idea based on
some good AQM ideas like AFD, but dropped it, figuring that policing was
going to vanish from the planet. It's baaaaaack.
Jamal Hadi Salim Oct. 31, 2021, 1:40 p.m. UTC | #5
On 2021-10-31 05:50, Oz Shlomo wrote:
> 
> 
> On 10/28/2021 2:06 PM, Simon Horman wrote:
>> [...]
> 
> Actions are also (implicitly) instantiated when filters are created.
> In the following example the mirred action instance (created by the 
> first filter) is shared by the second filter:
> 
> tc filter add dev $DEV1 proto ip parent ffff: flower \
>      ip_proto tcp action mirred egress redirect dev $DEV3
> 
> tc filter add dev $DEV2 proto ip parent ffff: flower \
>      ip_proto tcp action mirred index 1
> 
> 

I sure hope this is supported. At least the discussions so far
are a nod in that direction...
I know there is hardware that is not capable of achieving this
(little CPE type devices) but let's not make that the common case.

cheers,
jamal
Jamal Hadi Salim Oct. 31, 2021, 2:14 p.m. UTC | #6
On 2021-10-31 08:03, Dave Taht wrote:
[..]

> 
> Just as an on-going grump: it has been my hope that policing as a
> technique would have died a horrible death by now. Seeing it come back
> as an "easy to offload" operation here - fresh from the 1990s! - does
> not mean it's a good idea, and I'd rather like it if we were finding
> ways to offload newer things that work better, such as modern AQM,
> fair queueing, and the shaping technologies that are in pie, fq_codel,
> and cake.
> 
> Policing leads to bursty loss, especially at higher rates; BBR has a
> specific mode designed to defeat it, and I ripped it out of wondershaper
> long ago for very good reasons:
> https://www.bufferbloat.net/projects/bloat/wiki/Wondershaper_Must_Die/
> 
> A long time ago I started working on a better policing idea based on
> some good AQM ideas like AFD, but dropped it, figuring that policing was
> going to vanish from the planet. It's baaaaaack.

A lot of enthusiasm for fq_codel in that link ;->
The root cause of burstiness is typically large transient queues (which
are sometimes not under your admin control); and of course, if you use a
policer and don't have your double leaky buckets set properly to
compensate for both short- and long-term rates, you will have bursts of
drops with the policer. It would be the same with a shaper if the packet
burst shows up when the queue is full.
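
For illustration, a two-bucket police configuration in tc might look like
the following hedged sketch, with rate bounding the long-term average and
peakrate bounding short-term bursts (all values illustrative):

tc filter add dev $DEV parent ffff: protocol ip flower ip_proto tcp \
	action police rate 1mbit burst 100k peakrate 2mbit mtu 1500 \
	conform-exceed drop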

Intuitively it would seem that, for non-work-conserving approaches,
delaying a packet (as in shaping) is better than dropping it (as in
policing) - but I don't have a study which scientifically proves it.
Any pointers in that regard?
TCP would recover either way (by detecting sequence gaps or via RTO).

At the Linux kernel level I am not sure I see much difference in either,
since we actually feed an indicator back to TCP to signal a local drop
(as opposed to guessing when it is dropped in the network), and the TCP
code is smart enough to utilize that knowledge.
For hardware offload there is no such feedback for either of those two
approaches (so no difference: the drop goes into a black hole).

As to "policer must die" - not possible i am afraid;-> I mean there
has to be strong evidence that it is a bad idea and besides that
_a lot of hardware_ supports it;-> Ergo, we have to support it as well.
Note: RED for example has been proven almost impossible to configure
properly but we still support it and there's a good set of hardware
offload support for it. For RED - and i should say the policer as well -
if you configure properly, _it works_.


BTW, some mellanox NICs offload HTB. See for example:
https://legacy.netdevconf.info/0x14/session.html?talk-hierarchical-QoS-hardware-offload

cheers,
jamal
Jamal Hadi Salim Oct. 31, 2021, 2:19 p.m. UTC | #7
On 2021-10-31 10:14, Jamal Hadi Salim wrote:

> BTW, some mellanox NICs offload HTB. See for example:
             ^^^^^^^^

Sorry, to be politically correct s/mellanox/Nvidia ;->

cheers,
jamal
Vlad Buslov Nov. 1, 2021, 8:01 a.m. UTC | #8
On Sun 31 Oct 2021 at 15:40, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On 2021-10-31 05:50, Oz Shlomo wrote:
>> 
>> On 10/28/2021 2:06 PM, Simon Horman wrote:
>>> [...]
>> Actions are also (implicitly) instantiated when filters are created.
>> In the following example the mirred action instance (created by the first
>> filter) is shared by the second filter:
>> tc filter add dev $DEV1 proto ip parent ffff: flower \
>>      ip_proto tcp action mirred egress redirect dev $DEV3
>> tc filter add dev $DEV2 proto ip parent ffff: flower \
>>      ip_proto tcp action mirred index 1
>> 
>
> I sure hope this is supported. At least the discussions so far
> are a nod in that direction...
> I know there is hardware that is not capable of achieving this
> (little CPE type devices) but let's not make that the common case.

It looks like it isn't supported in this change, since
tcf_action_offload_add() is only called by tcf_action_init() when the BIND
flag is not set (the flag is always set when called from cls code).
Moreover, I don't think it is a good idea to support such a use-case,
because that would require increasing the number of calls to the driver
offload infrastructure from 1 per filter to 1 + number_of_actions, which
would significantly impact insertion rate.
Dave Taht Nov. 1, 2021, 2:27 p.m. UTC | #9
On Sun, Oct 31, 2021 at 7:14 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On 2021-10-31 08:03, Dave Taht wrote:
> [..]
>
> >
> > [...]
> > https://www.bufferbloat.net/projects/bloat/wiki/Wondershaper_Must_Die/
> > [...]
>
> A lot of enthusiasm for fq_codel in that link ;->

I wrote that in 2013. It's not every day you solve TCP global
synchronization, achieve a queue depth of 5ms no matter the rate, develop
something that has zero latency for sparse packets, only shoots at the
fat flows, drops from head so there's always an immediate signal of
congestion from the packet just behind, makes Opus's PLC and simpler
forms of FEC "just work", and requires near zero configuration.
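
That near-zero-configuration point is easy to show; a hedged one-liner
(device name illustrative) is all it takes, with the defaults intended
to be left alone:

tc qdisc add dev eth0 root fq_codel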

The plots at the end made a very convincing case for abandoning policing.

> The root cause of burstiness is typically large transient queues (which
> are sometimes not under your admin control); and of course, if you use a
> policer and don't have your double leaky buckets set properly to
> compensate for both short- and long-term rates, you will have bursts of
> drops with the policer.

I would really like to see a good configuration guide for policing at
multiple real-world bandwidths and at real-world workloads.

> It would be the same with a shaper if the packet burst shows up when
> the queue is full.

Queues are shock absorbers, as Van always says. We do still drop packets
on the rx ring. The default queue depth of codel is 32MB. It takes a
really, really, really large burst to overwhelm that.

I wonder where all the userspace wireguard VPNs are dropping packets
nowadays.

> Intuitively it would seem that, for non-work-conserving approaches,
> delaying a packet (as in shaping) is better than dropping it (as in

It's shaping + flow queueing that's the win, if you are going to queue.
It gets all the flows statistically multiplexed and into flow balance
orders of magnitude faster than a policer could (flow queueing is
different from classic fair queueing).

The tiny flows also pass through untouched, at zero delay.

At the time, I was considering applying a codel-like technique to
policing - I'd called it "bobbie" - where once you exceed the rate, a
virtual clock moves forward by how long you would have delayed packet
delivery if you were queueing, and it starts shooting at packets once
your burst tolerance is exceeded.

But inbound fq+shaping did wonders faster, and selfishly I didn't feel
like abandoning floating point to work in the kernel.

That said, it's taken breaking the qdisc lock and xpf to make inbound
shaping scale decently (see:
https://github.com/rchac/LibreQoS#how-do-cake-and-fq_codel-work )
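
The usual recipe for inbound shaping of that kind redirects ingress
traffic to an intermediate device and shapes there - a hedged sketch
(device names and bandwidth illustrative):

ip link add ifb0 type ifb
ip link set ifb0 up
tc qdisc add dev $DEV handle ffff: ingress
tc filter add dev $DEV parent ffff: protocol all matchall \
	action mirred egress redirect dev ifb0
tc qdisc add dev ifb0 root cake bandwidth 100mbit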

> policing) - but I don't have a study which scientifically proves it.
> Any pointers in that regard?

Neither do I. Matt Mathis has ranted about it, and certainly the
workarounds in BBRv1 to defeat others' desperate attempts to control
their bandwidth with a policer are obvious from their data.

If there really is a resurgence of interest in policing, a good paper
would compare a classic 3-color policer to bobbie, and to shaping,
vis-à-vis BBR and cubic.

I'm low on students at the moment...

> TCP would recover either way (by detecting sequence gaps or via RTO).

Yes, it does. But policing is often devastating to VoIP and
videoconferencing traffic.

> At the Linux kernel level I am not sure I see much difference in either,
> since we actually feed an indicator back to TCP to signal a local drop
> (as opposed to guessing when it is dropped in the network), and the TCP
> code is smart enough to utilize that knowledge.

There is an extremely long and difficult conversation I had about the
differences between sch_fq and fq_codel, and between a server and a
router, for real-world applications, over here:

https://github.com/systemd/systemd/issues/9725#issuecomment-413369212

> For hardware offload there is no such feedback for either of those two
> approaches (so no difference: the drop goes into a black hole).

Yes, now you've built a *router* and lost the local control loop.
TSQ, sch_fq's pacing, and other host optimizations no longer work.

I encourage more folk to regularly take packet captures of the end
results of offloads vis-à-vis network latency.

Look! MORE BANDWIDTH for a single flow! Wait! There's 600ms of latency
and new flows can't even get started!

>
> As to "policer must die" - not possible i am afraid;-> I mean there
> has to be strong evidence that it is a bad idea and besides that
> _a lot of hardware_ supports it;-> Ergo, we have to support it as well.

I agree that supporting hardware features is good. I merely wish that
certain other software features were making it into modern hardware.

I'm encouraged by this work in P4, at least:

https://arxiv.org/pdf/2010.04528.pdf

> Note: RED, for example, has been proven almost impossible to configure
> properly, but we still support it, and there's a good set of hardware
> offload support for it. For RED - and I should say the policer as well -
> if you configure it properly, _it works_.

I have no idea how often RED is used nowadays. The *only* requests for
offloading it I've heard are for configuring it as a brick-wall ECN
marking tool, which does indeed work for DCTCP.
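
That brick-wall configuration is typically RED with min and max nearly
equal and probability 1.0, so everything past the knee gets marked - a
hedged sketch (all values illustrative):

tc qdisc add dev $DEV root red limit 400000 min 75000 max 75001 \
	avpkt 1000 burst 76 probability 1.0 bandwidth 10mbit ecn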

The hope was that pie, being similar in construction, would end up
implemented in hardware; however, it has so far turned out that codel
was easier to implement in hardware and more effective.
>
>
> BTW, some mellanox NICs offload HTB. See for example:
> https://legacy.netdevconf.info/0x14/session.html?talk-hierarchical-QoS-hardware-offload

Yes. Too bad they then attach it to FIFOs. I'd so love to help a
hardware company willing to do the work to put modern algorithms in
hardware...

> cheers,
> jamal

I note that although I've enjoyed ranting, I don't actually have any
objections to this particular patch.
--
I tried to build a better future, a few times:
https://wayforward.archive.org/?site=https%3A%2F%2Fwww.icei.org

Dave Täht CEO, TekLibre, LLC
Simon Horman Nov. 2, 2021, 12:51 p.m. UTC | #10
On Mon, Nov 01, 2021 at 10:01:28AM +0200, Vlad Buslov wrote:
> On Sun 31 Oct 2021 at 15:40, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > On 2021-10-31 05:50, Oz Shlomo wrote:
> >> 
> >> On 10/28/2021 2:06 PM, Simon Horman wrote:

...

> >> Actions are also (implicitly) instantiated when filters are created.
> >> In the following example the mirred action instance (created by the first
> >> filter) is shared by the second filter:
> >> tc filter add dev $DEV1 proto ip parent ffff: flower \
> >>      ip_proto tcp action mirred egress redirect dev $DEV3
> >> tc filter add dev $DEV2 proto ip parent ffff: flower \
> >>      ip_proto tcp action mirred index 1
> >> 
> >
> > I sure hope this is supported. At least the discussions so far
> > are a nod in that direction...
> > I know there is hardware that is not capable of achieving this
> > (little CPE type devices) but let's not make that the common case.
> 
> It looks like it isn't supported in this change, since
> tcf_action_offload_add() is only called by tcf_action_init() when the BIND
> flag is not set (the flag is always set when called from cls code).
> Moreover, I don't think it is a good idea to support such a use-case,
> because that would require increasing the number of calls to the driver
> offload infrastructure from 1 per filter to 1 + number_of_actions, which
> would significantly impact insertion rate.

Hi,

I feel that I am missing some very obvious point here.

But from my perspective the use case described by Oz is supported
by existing offload of the flower classifier (since ~4.13 IIRC).
Vlad Buslov Nov. 2, 2021, 3:33 p.m. UTC | #11
On Tue 02 Nov 2021 at 14:51, Simon Horman <simon.horman@corigine.com> wrote:
> On Mon, Nov 01, 2021 at 10:01:28AM +0200, Vlad Buslov wrote:
>> [...]
>> 
>> It looks like it isn't supported in this change, since
>> tcf_action_offload_add() is only called by tcf_action_init() when the BIND
>> flag is not set (the flag is always set when called from cls code).
>> Moreover, I don't think it is a good idea to support such a use-case,
>> because that would require increasing the number of calls to the driver
>> offload infrastructure from 1 per filter to 1 + number_of_actions, which
>> would significantly impact insertion rate.
>
> Hi,
>
> I feel that I am missing some very obvious point here.
>
> But from my perspective the use case described by Oz is supported
> by existing offload of the flower classifier (since ~4.13 IIRC).

The mlx5 driver can't support such a case without infrastructure changes
in the kernel, for the following reasons:

- Action index is not provided by the flow_action offload infrastructure
  for most of the actions, so there is no way for the driver to determine
  whether the action is shared.

- If we extend the infrastructure to always provide tcfa_index (a
  trivial change), there would not be much use for it, because there is
  no way to properly update shared action counters without infrastructure
  code similar to what you implemented as part of this series.

How do you support shared actions created through cls_api in your
driver, considering the described limitations?
Simon Horman Nov. 2, 2021, 4:15 p.m. UTC | #12
On Tue, Nov 02, 2021 at 05:33:14PM +0200, Vlad Buslov wrote:
> On Tue 02 Nov 2021 at 14:51, Simon Horman <simon.horman@corigine.com> wrote:
> > [...]
> >
> > Hi,
> >
> > I feel that I am missing some very obvious point here.
> >
> > But from my perspective the use case described by Oz is supported
> > by existing offload of the flower classifier (since ~4.13 IIRC).
> 
> The mlx5 driver can't support such a case without infrastructure changes
> in the kernel, for the following reasons:
> 
> - Action index is not provided by the flow_action offload infrastructure
>   for most of the actions, so there is no way for the driver to determine
>   whether the action is shared.
> 
> - If we extend the infrastructure to always provide tcfa_index (a
>   trivial change), there would not be much use for it, because there is
>   no way to properly update shared action counters without infrastructure
>   code similar to what you implemented as part of this series.
> 
> How do you support shared actions created through cls_api in your
> driver, considering the described limitations?

Thanks,

I misread the use case described by Oz, but I believe I understand it now.

I agree that the case described is neither currently supported, nor
supported by this patchset (to be honest I for one had not considered it).

So, I think the question is: does supporting this use-case make sense - from
implementation, use-case, and consistency perspectives - in the context of
this patchset?

Am I on the right track?
Oz Shlomo Nov. 3, 2021, 10:56 a.m. UTC | #13
On 11/2/2021 6:15 PM, Simon Horman wrote:
> [...]
> 
> Thanks,
> 
> I misread the use case described by Oz, but I believe I understand it now.
> 
> I agree that the case described is neither currently supported, nor
> supported by this patchset (to be honest I for one had not considered it).
> 
> So, I think the question is: does supporting this use-case make sense - from
> implementation, use-case, and consistency perspectives - in the context of
> this patchset?
> 
> Am I on the right track?
> 

Currently we don't have a specific application use case for sharing
actions that were created by tc filters. However, we do have a future use
case in mind. We could add such functionality on top of this series when
a use case materializes. Perhaps, at that point, we can also introduce a
control flag in order to avoid unnecessary insertion-rate performance
degradation.