diff mbox series

[net-next,08/15] net/mlx5e: CT: Preparation for offloading +trk+new ct rules

Message ID 20210108053054.660499-9-saeed@kernel.org (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series [net-next,01/15] net/mlx5: Add HW definition of reg_c_preserve | expand

Checks

Context Check Description
netdev/cover_letter success Pull request
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for net-next
netdev/subject_prefix success Link
netdev/cc_maintainers warning 6 maintainers not CCed: lariel@nvidia.com elibr@mellanox.com ozsh@mellanox.com paulb@mellanox.com linux-rdma@vger.kernel.org leon@kernel.org
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 146 lines checked
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/header_inline success Link
netdev/stable success Stable not CCed

Commit Message

Saeed Mahameed Jan. 8, 2021, 5:30 a.m. UTC
From: Roi Dayan <roid@nvidia.com>

Connection tracking associates the connection state per packet. The
first packet of a connection is assigned with the +trk+new state. The
connection enters the established state once a packet is seen on the
other direction.

Currently we offload only the established flows. However, UDP traffic
using source port entropy (e.g. vxlan, RoCE) will never enter the
established state. Such protocols do not require stateful processing,
and therefore could be offloaded.

The change in the model is that a miss on the CT table will be forwarded
to a new +trk+new ct table and a miss there will be forwarded to the slow
path table.

Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../ethernet/mellanox/mlx5/core/en/tc_ct.c    | 104 ++++++++++++++++--
 1 file changed, 96 insertions(+), 8 deletions(-)

Comments

Marcelo Ricardo Leitner Jan. 8, 2021, 9:48 p.m. UTC | #1
Hi,

On Thu, Jan 07, 2021 at 09:30:47PM -0800, Saeed Mahameed wrote:
> From: Roi Dayan <roid@nvidia.com>
> 
> Connection tracking associates the connection state per packet. The
> first packet of a connection is assigned with the +trk+new state. The
> connection enters the established state once a packet is seen on the
> other direction.
> 
> Currently we offload only the established flows. However, UDP traffic
> using source port entropy (e.g. vxlan, RoCE) will never enter the
> established state. Such protocols do not require stateful processing,
> and therefore could be offloaded.

If it doesn't require stateful processing, please enlight me on why
conntrack is being used in the first place. What's the use case here?

> 
> The change in the model is that a miss on the CT table will be forwarded
> to a new +trk+new ct table and a miss there will be forwarded to the slow
> path table.

AFAICU this new +trk+new ct table is a wildcard match on sport with
specific dports. Also AFAICU, such entries will not be visible to the
userspace then. Is this right?

  Marcelo
Roi Dayan Jan. 10, 2021, 7:45 a.m. UTC | #2
On 2021-01-08 11:48 PM, Marcelo Ricardo Leitner wrote:
> Hi,
> 
> On Thu, Jan 07, 2021 at 09:30:47PM -0800, Saeed Mahameed wrote:
>> From: Roi Dayan <roid@nvidia.com>
>>
>> Connection tracking associates the connection state per packet. The
>> first packet of a connection is assigned with the +trk+new state. The
>> connection enters the established state once a packet is seen on the
>> other direction.
>>
>> Currently we offload only the established flows. However, UDP traffic
>> using source port entropy (e.g. vxlan, RoCE) will never enter the
>> established state. Such protocols do not require stateful processing,
>> and therefore could be offloaded.
> 
> If it doesn't require stateful processing, please enlight me on why
> conntrack is being used in the first place. What's the use case here?
> 

The use case for example is when we have vxlan traffic but we do
conntrack on the inner packet (rules on the physical port) so
we never get established but on miss we can still offload as normal
vxlan traffic.

>>
>> The change in the model is that a miss on the CT table will be forwarded
>> to a new +trk+new ct table and a miss there will be forwarded to the slow
>> path table.
> 
> AFAICU this new +trk+new ct table is a wildcard match on sport with
> specific dports. Also AFAICU, such entries will not be visible to the
> userspace then. Is this right?
> 
>    Marcelo
> 

right.
Roi Dayan Jan. 10, 2021, 7:52 a.m. UTC | #3
On 2021-01-10 9:45 AM, Roi Dayan wrote:
> 
> 
> On 2021-01-08 11:48 PM, Marcelo Ricardo Leitner wrote:
>> Hi,
>>
>> On Thu, Jan 07, 2021 at 09:30:47PM -0800, Saeed Mahameed wrote:
>>> From: Roi Dayan <roid@nvidia.com>
>>>
>>> Connection tracking associates the connection state per packet. The
>>> first packet of a connection is assigned with the +trk+new state. The
>>> connection enters the established state once a packet is seen on the
>>> other direction.
>>>
>>> Currently we offload only the established flows. However, UDP traffic
>>> using source port entropy (e.g. vxlan, RoCE) will never enter the
>>> established state. Such protocols do not require stateful processing,
>>> and therefore could be offloaded.
>>
>> If it doesn't require stateful processing, please enlight me on why
>> conntrack is being used in the first place. What's the use case here?
>>
> 
> The use case for example is when we have vxlan traffic but we do
> conntrack on the inner packet (rules on the physical port) so
> we never get established but on miss we can still offload as normal
> vxlan traffic.
> 

my mistake about "inner packet". we do CT on the underlay network, i.e.
the outer header.

>>>
>>> The change in the model is that a miss on the CT table will be forwarded
>>> to a new +trk+new ct table and a miss there will be forwarded to the 
>>> slow
>>> path table.
>>
>> AFAICU this new +trk+new ct table is a wildcard match on sport with
>> specific dports. Also AFAICU, such entries will not be visible to the
>> userspace then. Is this right?
>>
>>    Marcelo
>>
> 
> right.
Marcelo Ricardo Leitner Jan. 11, 2021, 11:51 p.m. UTC | #4
On Sun, Jan 10, 2021 at 09:52:55AM +0200, Roi Dayan wrote:
> 
> 
> On 2021-01-10 9:45 AM, Roi Dayan wrote:
> > 
> > 
> > On 2021-01-08 11:48 PM, Marcelo Ricardo Leitner wrote:
> > > Hi,
> > > 
> > > On Thu, Jan 07, 2021 at 09:30:47PM -0800, Saeed Mahameed wrote:
> > > > From: Roi Dayan <roid@nvidia.com>
> > > > 
> > > > Connection tracking associates the connection state per packet. The
> > > > first packet of a connection is assigned with the +trk+new state. The
> > > > connection enters the established state once a packet is seen on the
> > > > other direction.
> > > > 
> > > > Currently we offload only the established flows. However, UDP traffic
> > > > using source port entropy (e.g. vxlan, RoCE) will never enter the
> > > > established state. Such protocols do not require stateful processing,
> > > > and therefore could be offloaded.
> > > 
> > > If it doesn't require stateful processing, please enlight me on why
> > > conntrack is being used in the first place. What's the use case here?
> > > 
> > 
> > The use case for example is when we have vxlan traffic but we do
> > conntrack on the inner packet (rules on the physical port) so
> > we never get established but on miss we can still offload as normal
> > vxlan traffic.
> > 
> 
> my mistake about "inner packet". we do CT on the underlay network, i.e.
> the outer header.

I miss why the CT match is being used there then. Isn't it a config
issue/waste of resources? What is CT adding to the matches/actions
being done on these flows?

> 
> > > > 
> > > > The change in the model is that a miss on the CT table will be forwarded
> > > > to a new +trk+new ct table and a miss there will be forwarded to
> > > > the slow
> > > > path table.
> > > 
> > > AFAICU this new +trk+new ct table is a wildcard match on sport with
> > > specific dports. Also AFAICU, such entries will not be visible to the
> > > userspace then. Is this right?
> > > 
> > >    Marcelo
> > > 
> > 
> > right.

Thanks,
Marcelo
Oz Shlomo Jan. 12, 2021, 9:27 a.m. UTC | #5
On 1/12/2021 1:51 AM, Marcelo Ricardo Leitner wrote:
> On Sun, Jan 10, 2021 at 09:52:55AM +0200, Roi Dayan wrote:
>>
>>
>> On 2021-01-10 9:45 AM, Roi Dayan wrote:
>>>
>>>
>>> On 2021-01-08 11:48 PM, Marcelo Ricardo Leitner wrote:
>>>> Hi,
>>>>
>>>> On Thu, Jan 07, 2021 at 09:30:47PM -0800, Saeed Mahameed wrote:
>>>>> From: Roi Dayan <roid@nvidia.com>
>>>>>
>>>>> Connection tracking associates the connection state per packet. The
>>>>> first packet of a connection is assigned with the +trk+new state. The
>>>>> connection enters the established state once a packet is seen on the
>>>>> other direction.
>>>>>
>>>>> Currently we offload only the established flows. However, UDP traffic
>>>>> using source port entropy (e.g. vxlan, RoCE) will never enter the
>>>>> established state. Such protocols do not require stateful processing,
>>>>> and therefore could be offloaded.
>>>>
>>>> If it doesn't require stateful processing, please enlight me on why
>>>> conntrack is being used in the first place. What's the use case here?
>>>>
>>>
>>> The use case for example is when we have vxlan traffic but we do
>>> conntrack on the inner packet (rules on the physical port) so
>>> we never get established but on miss we can still offload as normal
>>> vxlan traffic.
>>>
>>
>> my mistake about "inner packet". we do CT on the underlay network, i.e.
>> the outer header.
> 
> I miss why the CT match is being used there then. Isn't it a config
> issue/waste of resources? What is CT adding to the matches/actions
> being done on these flows?
> 

Consider a use case where the network port receives both east-west encapsulated traffic and 
north-south non-encapsulated traffic that requires NAT.

One possible configuration is to first apply the CT-NAT action.
Established north-south connections will successfully execute the nat action and will set the +est 
ct state.
However, the +new state may apply either for valid east-west traffic (e.g. vxlan) due to source port 
entropy, or to insecure north-south traffic that the fw should block. The user may distinguish 
between the two cases, for example, by matching on the dest udp port.


>>
>>>>>
>>>>> The change in the model is that a miss on the CT table will be forwarded
>>>>> to a new +trk+new ct table and a miss there will be forwarded to
>>>>> the slow
>>>>> path table.
>>>>
>>>> AFAICU this new +trk+new ct table is a wildcard match on sport with
>>>> specific dports. Also AFAICU, such entries will not be visible to the
>>>> userspace then. Is this right?
>>>>
>>>>     Marcelo
>>>>
>>>
>>> right.
> 
> Thanks,
> Marcelo
>
Marcelo Ricardo Leitner Jan. 14, 2021, 1:02 p.m. UTC | #6
On Tue, Jan 12, 2021 at 11:27:04AM +0200, Oz Shlomo wrote:
> 
> 
> On 1/12/2021 1:51 AM, Marcelo Ricardo Leitner wrote:
> > On Sun, Jan 10, 2021 at 09:52:55AM +0200, Roi Dayan wrote:
> > > 
> > > 
> > > On 2021-01-10 9:45 AM, Roi Dayan wrote:
> > > > 
> > > > 
> > > > On 2021-01-08 11:48 PM, Marcelo Ricardo Leitner wrote:
> > > > > Hi,
> > > > > 
> > > > > On Thu, Jan 07, 2021 at 09:30:47PM -0800, Saeed Mahameed wrote:
> > > > > > From: Roi Dayan <roid@nvidia.com>
> > > > > > 
> > > > > > Connection tracking associates the connection state per packet. The
> > > > > > first packet of a connection is assigned with the +trk+new state. The
> > > > > > connection enters the established state once a packet is seen on the
> > > > > > other direction.
> > > > > > 
> > > > > > Currently we offload only the established flows. However, UDP traffic
> > > > > > using source port entropy (e.g. vxlan, RoCE) will never enter the
> > > > > > established state. Such protocols do not require stateful processing,
> > > > > > and therefore could be offloaded.
> > > > > 
> > > > > If it doesn't require stateful processing, please enlight me on why
> > > > > conntrack is being used in the first place. What's the use case here?
> > > > > 
> > > > 
> > > > The use case for example is when we have vxlan traffic but we do
> > > > conntrack on the inner packet (rules on the physical port) so
> > > > we never get established but on miss we can still offload as normal
> > > > vxlan traffic.
> > > > 
> > > 
> > > my mistake about "inner packet". we do CT on the underlay network, i.e.
> > > the outer header.
> > 
> > I miss why the CT match is being used there then. Isn't it a config
> > issue/waste of resources? What is CT adding to the matches/actions
> > being done on these flows?
> > 
> 
> Consider a use case where the network port receives both east-west
> encapsulated traffic and north-south non-encapsulated traffic that requires
> NAT.
> 
> One possible configuration is to first apply the CT-NAT action.
> Established north-south connections will successfully execute the nat action
> and will set the +est ct state.
> However, the +new state may apply either for valid east-west traffic (e.g.
> vxlan) due to source port entropy, or to insecure north-south traffic that
> the fw should block. The user may distinguish between the two cases, for
> example, by matching on the dest udp port.

Sorry but I still don't see the big picture. :-]

What do you consider as east-west and north-south traffic? My initial
understanding of east-west is traffic between VFs and north-south
would be in and out to the wire. You mentioned that north-south is
insecure, it would match, but then, non-encapsulated?

So it seems you referred to the datacenter. East-west is traffic
between hosts on the same datacenter, and north-south is traffic that
goes out of it. This seems to match.

Assuming it's the latter, then it seems that the idea is to work
around a config simplification that was done by the user.  As
mentioned on the changelog, such protocols do not require stateful
processing, and AFAICU this patch twists conntrack so that the user
can have simplified rules. Why can't the user have specific rules for
the tunnels, and other for dealing with north-south traffic? The fw
would still be able to block unwanted traffic.

My main problems with this is this, that it is making conntrack do
stuff that the user may not be expecting it to do, and that packets
may get matched (maybe even unintentionally) and the system won't have
visibility on them. Maybe I'm just missing something?

> 
> 
> > > 
> > > > > > 
> > > > > > The change in the model is that a miss on the CT table will be forwarded
> > > > > > to a new +trk+new ct table and a miss there will be forwarded to
> > > > > > the slow
> > > > > > path table.
> > > > > 
> > > > > AFAICU this new +trk+new ct table is a wildcard match on sport with
> > > > > specific dports. Also AFAICU, such entries will not be visible to the
> > > > > userspace then. Is this right?
> > > > > 
> > > > >     Marcelo
> > > > > 
> > > > 
> > > > right.
> > 
> > Thanks,
> > Marcelo
> > 
>
Oz Shlomo Jan. 14, 2021, 2:03 p.m. UTC | #7
On 1/14/2021 3:02 PM, Marcelo Ricardo Leitner wrote:
> On Tue, Jan 12, 2021 at 11:27:04AM +0200, Oz Shlomo wrote:
>>
>>
>> On 1/12/2021 1:51 AM, Marcelo Ricardo Leitner wrote:
>>> On Sun, Jan 10, 2021 at 09:52:55AM +0200, Roi Dayan wrote:
>>>>
>>>>
>>>> On 2021-01-10 9:45 AM, Roi Dayan wrote:
>>>>>
>>>>>
>>>>> On 2021-01-08 11:48 PM, Marcelo Ricardo Leitner wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On Thu, Jan 07, 2021 at 09:30:47PM -0800, Saeed Mahameed wrote:
>>>>>>> From: Roi Dayan <roid@nvidia.com>
>>>>>>>
>>>>>>> Connection tracking associates the connection state per packet. The
>>>>>>> first packet of a connection is assigned with the +trk+new state. The
>>>>>>> connection enters the established state once a packet is seen on the
>>>>>>> other direction.
>>>>>>>
>>>>>>> Currently we offload only the established flows. However, UDP traffic
>>>>>>> using source port entropy (e.g. vxlan, RoCE) will never enter the
>>>>>>> established state. Such protocols do not require stateful processing,
>>>>>>> and therefore could be offloaded.
>>>>>>
>>>>>> If it doesn't require stateful processing, please enlight me on why
>>>>>> conntrack is being used in the first place. What's the use case here?
>>>>>>
>>>>>
>>>>> The use case for example is when we have vxlan traffic but we do
>>>>> conntrack on the inner packet (rules on the physical port) so
>>>>> we never get established but on miss we can still offload as normal
>>>>> vxlan traffic.
>>>>>
>>>>
>>>> my mistake about "inner packet". we do CT on the underlay network, i.e.
>>>> the outer header.
>>>
>>> I miss why the CT match is being used there then. Isn't it a config
>>> issue/waste of resources? What is CT adding to the matches/actions
>>> being done on these flows?
>>>
>>
>> Consider a use case where the network port receives both east-west
>> encapsulated traffic and north-south non-encapsulated traffic that requires
>> NAT.
>>
>> One possible configuration is to first apply the CT-NAT action.
>> Established north-south connections will successfully execute the nat action
>> and will set the +est ct state.
>> However, the +new state may apply either for valid east-west traffic (e.g.
>> vxlan) due to source port entropy, or to insecure north-south traffic that
>> the fw should block. The user may distinguish between the two cases, for
>> example, by matching on the dest udp port.
> 
> Sorry but I still don't see the big picture. :-]
> 
> What do you consider as east-west and north-south traffic? My initial
> understanding of east-west is traffic between VFs and north-south
> would be in and out to the wire. You mentioned that north-south is
> insecure, it would match, but then, non-encapsulated?
> 
> So it seems you referred to the datacenter. East-west is traffic
> between hosts on the same datacenter, and north-south is traffic that
> goes out of it. This seems to match.

Right.

> 
> Assuming it's the latter, then it seems that the idea is to work
> around a config simplification that was done by the user.  As
> mentioned on the changelog, such protocols do not require stateful
> processing, and AFAICU this patch twists conntrack so that the user
> can have simplified rules. Why can't the user have specific rules for
> the tunnels, and other for dealing with north-south traffic? The fw
> would still be able to block unwanted traffic.

We cannot control what the user is doing.
This is a valid tc configuration and would work using tc software datapath.
However, in such configurations vxlan packets would not be processed in hardware because they are 
marked as new connections.

> 
> My main problems with this is this, that it is making conntrack do
> stuff that the user may not be expecting it to do, and that packets
> may get matched (maybe even unintentionally) and the system won't have
> visibility on them. Maybe I'm just missing something?
> 

This is why we restricted this feature to udp protocols that will never enter established state due 
to source port entropy.
Do you see a problematic use case that can arise?

>>
>>
>>>>
>>>>>>>
>>>>>>> The change in the model is that a miss on the CT table will be forwarded
>>>>>>> to a new +trk+new ct table and a miss there will be forwarded to
>>>>>>> the slow
>>>>>>> path table.
>>>>>>
>>>>>> AFAICU this new +trk+new ct table is a wildcard match on sport with
>>>>>> specific dports. Also AFAICU, such entries will not be visible to the
>>>>>> userspace then. Is this right?
>>>>>>
>>>>>>      Marcelo
>>>>>>
>>>>>
>>>>> right.
>>>
>>> Thanks,
>>> Marcelo
>>>
>>
Marcelo Ricardo Leitner Jan. 14, 2021, 9:50 p.m. UTC | #8
On Thu, Jan 14, 2021 at 04:03:43PM +0200, Oz Shlomo wrote:
> 
> 
> On 1/14/2021 3:02 PM, Marcelo Ricardo Leitner wrote:
> > On Tue, Jan 12, 2021 at 11:27:04AM +0200, Oz Shlomo wrote:
> > > 
> > > 
> > > On 1/12/2021 1:51 AM, Marcelo Ricardo Leitner wrote:
> > > > On Sun, Jan 10, 2021 at 09:52:55AM +0200, Roi Dayan wrote:
> > > > > 
> > > > > 
> > > > > On 2021-01-10 9:45 AM, Roi Dayan wrote:
> > > > > > 
> > > > > > 
> > > > > > On 2021-01-08 11:48 PM, Marcelo Ricardo Leitner wrote:
> > > > > > > Hi,
> > > > > > > 
> > > > > > > On Thu, Jan 07, 2021 at 09:30:47PM -0800, Saeed Mahameed wrote:
> > > > > > > > From: Roi Dayan <roid@nvidia.com>
> > > > > > > > 
> > > > > > > > Connection tracking associates the connection state per packet. The
> > > > > > > > first packet of a connection is assigned with the +trk+new state. The
> > > > > > > > connection enters the established state once a packet is seen on the
> > > > > > > > other direction.
> > > > > > > > 
> > > > > > > > Currently we offload only the established flows. However, UDP traffic
> > > > > > > > using source port entropy (e.g. vxlan, RoCE) will never enter the
> > > > > > > > established state. Such protocols do not require stateful processing,
> > > > > > > > and therefore could be offloaded.
> > > > > > > 
> > > > > > > If it doesn't require stateful processing, please enlight me on why
> > > > > > > conntrack is being used in the first place. What's the use case here?
> > > > > > > 
> > > > > > 
> > > > > > The use case for example is when we have vxlan traffic but we do
> > > > > > conntrack on the inner packet (rules on the physical port) so
> > > > > > we never get established but on miss we can still offload as normal
> > > > > > vxlan traffic.
> > > > > > 
> > > > > 
> > > > > my mistake about "inner packet". we do CT on the underlay network, i.e.
> > > > > the outer header.
> > > > 
> > > > I miss why the CT match is being used there then. Isn't it a config
> > > > issue/waste of resources? What is CT adding to the matches/actions
> > > > being done on these flows?
> > > > 
> > > 
> > > Consider a use case where the network port receives both east-west
> > > encapsulated traffic and north-south non-encapsulated traffic that requires
> > > NAT.
> > > 
> > > One possible configuration is to first apply the CT-NAT action.
> > > Established north-south connections will successfully execute the nat action
> > > and will set the +est ct state.
> > > However, the +new state may apply either for valid east-west traffic (e.g.
> > > vxlan) due to source port entropy, or to insecure north-south traffic that
> > > the fw should block. The user may distinguish between the two cases, for
> > > example, by matching on the dest udp port.
> > 
> > Sorry but I still don't see the big picture. :-]
> > 
> > What do you consider as east-west and north-south traffic? My initial
> > understanding of east-west is traffic between VFs and north-south
> > would be in and out to the wire. You mentioned that north-south is
> > insecure, it would match, but then, non-encapsulated?
> > 
> > So it seems you referred to the datacenter. East-west is traffic
> > between hosts on the same datacenter, and north-south is traffic that
> > goes out of it. This seems to match.
> 
> Right.
> 
> > 
> > Assuming it's the latter, then it seems that the idea is to work
> > around a config simplification that was done by the user.  As
> > mentioned on the changelog, such protocols do not require stateful
> > processing, and AFAICU this patch twists conntrack so that the user
> > can have simplified rules. Why can't the user have specific rules for
> > the tunnels, and other for dealing with north-south traffic? The fw
> > would still be able to block unwanted traffic.
> 
> We cannot control what the user is doing.

Right, but we can educate and point them towards better configs. With
non-optimal configs it's fair to expect non-optimal effects.

> This is a valid tc configuration and would work using tc software datapath.
> However, in such configurations vxlan packets would not be processed in
> hardware because they are marked as new connections.

Makes sense.

> 
> > 
> > My main problems with this is this, that it is making conntrack do
> > stuff that the user may not be expecting it to do, and that packets
> > may get matched (maybe even unintentionally) and the system won't have
> > visibility on them. Maybe I'm just missing something?
> > 
> 
> This is why we restricted this feature to udp protocols that will never
> enter established state due to source port entropy.
> Do you see a problematic use case that can arise?

For use case, the only one I see is if someone wants to use this
feature for another application/dstport. It's hardcoded to tunnels
ones.

It feels that the problem is not being solved at the right place. It
will work well for hardware processing, while for software it will
work while having a ton of conntrack entries. Different behaviors that
can lead to people wasting time. Like, trying to debug on why srcport
is not getting randomized when offloaded, while in fact they are, it's
just masked.

As this is a fallback (iow, search is done in 2 levels at least), I
wonder what other approaches were considered. I'm thinking two for
now. One is to add a flag to conntrack entries that allow them to be
this generic. Finding the right conntrack entry probably gets harder,
but when the user dumps /proc/net/nf_conntrack, it says something. On
how/when to add this flag, maybe act_ct can do it if dstport matches
something and/or a sysctl specifying a port list.

The other one may sound an overkill, but is to work with conntrack
expectations somehow.

The first one is closer to the current proposal. It basically makes
the port list configurable and move the "do it" decision to outside
the driver, where the admin can have more control. If conntrack itself
can also leverage it and avoid having tons of entries, even better, as
then we have both behaviors in sync.

Thoughts?

> 
> > > 
> > > 
> > > > > 
> > > > > > > > 
> > > > > > > > The change in the model is that a miss on the CT table will be forwarded
> > > > > > > > to a new +trk+new ct table and a miss there will be forwarded to
> > > > > > > > the slow
> > > > > > > > path table.
> > > > > > > 
> > > > > > > AFAICU this new +trk+new ct table is a wildcard match on sport with
> > > > > > > specific dports. Also AFAICU, such entries will not be visible to the
> > > > > > > userspace then. Is this right?
> > > > > > > 
> > > > > > >      Marcelo
> > > > > > > 
> > > > > > 
> > > > > > right.
> > > > 
> > > > Thanks,
> > > > Marcelo
> > > > 
> > > 
>
Oz Shlomo Jan. 20, 2021, 4:09 p.m. UTC | #9
On 1/14/2021 11:50 PM, Marcelo Ricardo Leitner wrote:
> On Thu, Jan 14, 2021 at 04:03:43PM +0200, Oz Shlomo wrote:
>>
>>
>> On 1/14/2021 3:02 PM, Marcelo Ricardo Leitner wrote:
>>> On Tue, Jan 12, 2021 at 11:27:04AM +0200, Oz Shlomo wrote:
>>>>
>>>>
>>>> On 1/12/2021 1:51 AM, Marcelo Ricardo Leitner wrote:
>>>>> On Sun, Jan 10, 2021 at 09:52:55AM +0200, Roi Dayan wrote:
>>>>>>
>>>>>>
>>>>>> On 2021-01-10 9:45 AM, Roi Dayan wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2021-01-08 11:48 PM, Marcelo Ricardo Leitner wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On Thu, Jan 07, 2021 at 09:30:47PM -0800, Saeed Mahameed wrote:
>>>>>>>>> From: Roi Dayan <roid@nvidia.com>
>>>>>>>>>
>>>>>>>>> Connection tracking associates the connection state per packet. The
>>>>>>>>> first packet of a connection is assigned with the +trk+new state. The
>>>>>>>>> connection enters the established state once a packet is seen on the
>>>>>>>>> other direction.
>>>>>>>>>
>>>>>>>>> Currently we offload only the established flows. However, UDP traffic
>>>>>>>>> using source port entropy (e.g. vxlan, RoCE) will never enter the
>>>>>>>>> established state. Such protocols do not require stateful processing,
>>>>>>>>> and therefore could be offloaded.
>>>>>>>>
>>>>>>>> If it doesn't require stateful processing, please enlight me on why
>>>>>>>> conntrack is being used in the first place. What's the use case here?
>>>>>>>>
>>>>>>>
>>>>>>> The use case for example is when we have vxlan traffic but we do
>>>>>>> conntrack on the inner packet (rules on the physical port) so
>>>>>>> we never get established but on miss we can still offload as normal
>>>>>>> vxlan traffic.
>>>>>>>
>>>>>>
>>>>>> my mistake about "inner packet". we do CT on the underlay network, i.e.
>>>>>> the outer header.
>>>>>
>>>>> I miss why the CT match is being used there then. Isn't it a config
>>>>> issue/waste of resources? What is CT adding to the matches/actions
>>>>> being done on these flows?
>>>>>
>>>>
>>>> Consider a use case where the network port receives both east-west
>>>> encapsulated traffic and north-south non-encapsulated traffic that requires
>>>> NAT.
>>>>
>>>> One possible configuration is to first apply the CT-NAT action.
>>>> Established north-south connections will successfully execute the nat action
>>>> and will set the +est ct state.
>>>> However, the +new state may apply either for valid east-west traffic (e.g.
>>>> vxlan) due to source port entropy, or to insecure north-south traffic that
>>>> the fw should block. The user may distinguish between the two cases, for
>>>> example, by matching on the dest udp port.
>>>
>>> Sorry but I still don't see the big picture. :-]
>>>
>>> What do you consider as east-west and north-south traffic? My initial
>>> understanding of east-west is traffic between VFs and north-south
>>> would be in and out to the wire. You mentioned that north-south is
>>> insecure, it would match, but then, non-encapsulated?
>>>
>>> So it seems you referred to the datacenter. East-west is traffic
>>> between hosts on the same datacenter, and north-south is traffic that
>>> goes out of it. This seems to match.
>>
>> Right.
>>
>>>
>>> Assuming it's the latter, then it seems that the idea is to work
>>> around a config simplification that was done by the user.  As
>>> mentioned on the changelog, such protocols do not require stateful
>>> processing, and AFAICU this patch twists conntrack so that the user
>>> can have simplified rules. Why can't the user have specific rules for
>>> the tunnels, and other for dealing with north-south traffic? The fw
>>> would still be able to block unwanted traffic.
>>
>> We cannot control what the user is doing.
> 
> Right, but we can educate and point them towards better configs. With
> non-optimal configs it's fair to expect non-optimal effects.
> 
>> This is a valid tc configuration and would work using tc software datapath.
>> However, in such configurations vxlan packets would not be processed in
>> hardware because they are marked as new connections.
> 
> Makes sense.
> 
>>
>>>
>>> My main problems with this is this, that it is making conntrack do
>>> stuff that the user may not be expecting it to do, and that packets
>>> may get matched (maybe even unintentionally) and the system won't have
>>> visibility on them. Maybe I'm just missing something?
>>>
>>
>> This is why we restricted this feature to udp protocols that will never
>> enter established state due to source port entropy.
>> Do you see a problematic use case that can arise?
> 
> For use case, the only one I see is if someone wants to use this
> feature for another application/dstport. It's hardcoded to tunnels
> ones.

It's a hardware offload optimization feature.
This is why we chose to support specific protocols that explicitly define source port entropy.

> 
> It feels that the problem is not being solved at the right place. It
> will work well for hardware processing, while for software it will
> work while having a ton of conntrack entries. Different behaviors that
> can lead to people wasting time. Like, trying to debug on why srcport
> is not getting randomized when offloaded, while in fact they are, it's
> just masked.

The SW and HW offload are functionally identical.
You are correct that with this patch the UNREPLIED CT entries will not be visible to the user 
through /proc/net/nf_conntrack

> 
> As this is a fallback (iow, search is done in 2 levels at least), I
> wonder what other approaches were considered. I'm thinking two for
> now. One is to add a flag to conntrack entries that allow them to be
> this generic. Finding the right conntrack entry probably gets harder,
> but when the user dumps /proc/net/nf_conntrack, it says something. On
> how/when to add this flag, maybe act_ct can do it if dstport matches
> something and/or a sysctl specifying a port list.
> 
> The other one may sound an overkill, but is to work with conntrack
> expectations somehow.
> 
> The first one is closer to the current proposal. It basically makes
> the port list configurable and move the "do it" decision to outside
> the driver, where the admin can have more control. If conntrack itself
> can also leverage it and avoid having tons of entries, even better, as
> then we have both behaviors in sync.

IIUC you propose a mechanism for avoiding CT processing of packets with a certain mask (e.g. based 
on dst udp port). Configured by admin and enforced by act_ct or even conntrack itself.

If so, this seems like a fundamental change to nf conntrack requiring it to add packet 
classification engines.

> 
> Thoughts?
> 

I wonder if we should develop a generic mechanism to optimize CT software for a use case that is 
faulty by design.
This has limited value for software as it would only reduce the conntrack table size (packet 
classification is still required).
However, this feature may have a big impact on hardware offload.
Normally hardware offload relies on software to handle new connections. Causing all new connections 
to be processed by software.
With this patch the hardware may autonomously set the +new connection state for the relevant 
connections.



>>
>>>>
>>>>
>>>>>>
>>>>>>>>>
>>>>>>>>> The change in the model is that a miss on the CT table will be forwarded
>>>>>>>>> to a new +trk+new ct table and a miss there will be forwarded to
>>>>>>>>> the slow
>>>>>>>>> path table.
>>>>>>>>
>>>>>>>> AFAICU this new +trk+new ct table is a wildcard match on sport with
>>>>>>>> specific dports. Also AFAICU, such entries will not be visible to the
>>>>>>>> userspace then. Is this right?
>>>>>>>>
>>>>>>>>       Marcelo
>>>>>>>>
>>>>>>>
>>>>>>> right.
>>>>>
>>>>> Thanks,
>>>>> Marcelo
>>>>>
>>>>
>>
Pablo Neira Ayuso Jan. 22, 2021, 1:18 a.m. UTC | #10
Hi Oz,

On Wed, Jan 20, 2021 at 06:09:48PM +0200, Oz Shlomo wrote:
> On 1/14/2021 11:50 PM, Marcelo Ricardo Leitner wrote:
> > 
> > Thoughts?
> > 
> 
> I wonder if we should develop a generic mechanism to optimize CT software
> for a use case that is faulty by design.
> This has limited value for software as it would only reduce the conntrack
> table size (packet classification is still required).
> However, this feature may have a big impact on hardware offload.
> Normally hardware offload relies on software to handle new connections.
> Causing all new connections to be processed by software.
> With this patch the hardware may autonomously set the +new connection state
> for the relevant connections.

Could you fix this issue with unidirectional flows by checking for
IPS_CONFIRMED status bit? The idea is to hardware offload the entry
after the first packet goes through software successfully. Then, there
is no need to wait for the established state that requires to see
traffic in both directions.
Marcelo Ricardo Leitner Jan. 22, 2021, 2:16 a.m. UTC | #11
On Fri, Jan 22, 2021 at 02:18:34AM +0100, Pablo Neira Ayuso wrote:
> Hi Oz,
> 
> On Wed, Jan 20, 2021 at 06:09:48PM +0200, Oz Shlomo wrote:
> > On 1/14/2021 11:50 PM, Marcelo Ricardo Leitner wrote:
> > > 
> > > Thoughts?
> > > 
> > 
> > I wonder if we should develop a generic mechanism to optimize CT software
> > for a use case that is faulty by design.
> > This has limited value for software as it would only reduce the conntrack
> > table size (packet classification is still required).
> > However, this feature may have a big impact on hardware offload.
> > Normally hardware offload relies on software to handle new connections.
> > Causing all new connections to be processed by software.
> > With this patch the hardware may autonomously set the +new connection state
> > for the relevant connections.
> 
> Could you fix this issue with unidirectional flows by checking for
> IPS_CONFIRMED status bit? The idea is to hardware offload the entry
> after the first packet goes through software successfully. Then, there
> is no need to wait for the established state that requires to see
> traffic in both directions.

That's an interesting idea. This way, basically all that needs to be
changed is tcf_ct_flow_table_process_conn() to handle this new
condition for UDP packets and on tcf_ct_act().

It has a small performance penaulty if compared to the original
solution, as now the first packet(s) goes to sw, but looks like a good
compromise between supporting a (from what I could understand)
somewhat lazy flow design (as I still think these didn't need to go
through conntrack), an uniform system behavior (with and without
offload, with mlx5 or another driver) and a more generic approach.
Other situations that rely on unidirectional UDP flows will benefit
from it as well.

This way I even think it doesn't need to be configurable right now.
It will be easier to add a knob to switch back to the old behavior if
needed later on, if anything.

  Marcelo
Oz Shlomo Jan. 25, 2021, 9:15 a.m. UTC | #12
On 1/22/2021 4:16 AM, Marcelo Ricardo Leitner wrote:
> On Fri, Jan 22, 2021 at 02:18:34AM +0100, Pablo Neira Ayuso wrote:
>> Hi Oz,
>>
>> On Wed, Jan 20, 2021 at 06:09:48PM +0200, Oz Shlomo wrote:
>>> On 1/14/2021 11:50 PM, Marcelo Ricardo Leitner wrote:
>>>>
>>>> Thoughts?
>>>>
>>>
>>> I wonder if we should develop a generic mechanism to optimize CT software
>>> for a use case that is faulty by design.
>>> This has limited value for software as it would only reduce the conntrack
>>> table size (packet classification is still required).
>>> However, this feature may have a big impact on hardware offload.
>>> Normally hardware offload relies on software to handle new connections.
>>> Causing all new connections to be processed by software.
>>> With this patch the hardware may autonomously set the +new connection state
>>> for the relevant connections.
>>
>> Could you fix this issue with unidirectional flows by checking for
>> IPS_CONFIRMED status bit? The idea is to hardware offload the entry
>> after the first packet goes through software successfully. Then, there
>> is no need to wait for the established state that requires to see
>> traffic in both directions.
> 
> That's an interesting idea. This way, basically all that needs to be
> changed is tcf_ct_flow_table_process_conn() to handle this new
> condition for UDP packets and on tcf_ct_act().

Will act_ct need to maintain a port list and classify the packet to realize whether the udp packet 
is part of a unidirection or biderectional udp connection?


> 
> It has a small performance penaulty if compared to the original
> solution, as now the first packet(s) goes to sw, but looks like a good
> compromise between supporting a (from what I could understand)
> somewhat lazy flow design (as I still think these didn't need to go
> through conntrack), an uniform system behavior (with and without
> offload, with mlx5 or another driver) and a more generic approach.
> Other situations that rely on unidirectional UDP flows will benefit
> from it as well.

The hardware offload perspective is a bit different.
With this approach the system will offload a rule per connection instead of offloading one mega-flow 
rule on dst udp port.
This will increase the hardware scale requirements in terms of number of offloaded rules.
In addition, a counter will need to be instantiated per rule and the software will need to manage 
the aging of these connections.

We hoped that the hardware can fully offload this scenario, avoiding the need for sw processing at all.


> 
> This way I even think it doesn't need to be configurable right now.
> It will be easier to add a knob to switch back to the old behavior if
> needed later on, if anything.
> 
>    Marcelo
>
diff mbox series

Patch

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c
index d7ecd5e5f7c4..6dac2fabb7f5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c
@@ -21,6 +21,7 @@ 
 #include "en.h"
 #include "en_tc.h"
 #include "en_rep.h"
+#include "fs_core.h"
 
 #define MLX5_CT_ZONE_BITS (mlx5e_tc_attr_to_reg_mappings[ZONE_TO_REG].mlen * 8)
 #define MLX5_CT_ZONE_MASK GENMASK(MLX5_CT_ZONE_BITS - 1, 0)
@@ -50,6 +51,9 @@  struct mlx5_tc_ct_priv {
 	struct mlx5_flow_table *ct;
 	struct mlx5_flow_table *ct_nat;
 	struct mlx5_flow_table *post_ct;
+	struct mlx5_flow_table *trk_new_ct;
+	struct mlx5_flow_group *miss_grp;
+	struct mlx5_flow_handle *miss_rule;
 	struct mutex control_lock; /* guards parallel adds/dels */
 	struct mutex shared_counter_lock;
 	struct mapping_ctx *zone_mapping;
@@ -1490,14 +1494,14 @@  mlx5_tc_ct_del_ft_cb(struct mlx5_tc_ct_priv *ct_priv, struct mlx5_ct_ft *ft)
  *      | set zone
  *      v
  * +--------------------+
- * + CT (nat or no nat) +
- * + tuple + zone match +
- * +--------------------+
- *      | set mark
- *      | set labels_id
- *      | set established
- *	| set zone_restore
- *      | do nat (if needed)
+ * + CT (nat or no nat) +    miss          +---------------------+  miss
+ * + tuple + zone match +----------------->+ trk_new_ct          +-------> SW
+ * +--------------------+                  + vxlan||roce match   +
+ *      | set mark                         +---------------------+
+ *      | set labels_id                             | set ct_state +trk+new
+ *      | set established                           | set zone_restore
+ *	| set zone_restore                          v
+ *      | do nat (if needed)                      post_ct
  *      v
  * +--------------+
  * + post_ct      + original filter actions
@@ -1893,6 +1897,72 @@  mlx5_tc_ct_init_check_support(struct mlx5e_priv *priv,
 		return mlx5_tc_ct_init_check_nic_support(priv, err_msg);
 }
 
+static struct mlx5_flow_handle *
+tc_ct_add_miss_rule(struct mlx5_flow_table *ft,
+		    struct mlx5_flow_table *next_ft)
+{
+	struct mlx5_flow_destination dest = {};
+	struct mlx5_flow_act act = {};
+
+	act.flags  = FLOW_ACT_IGNORE_FLOW_LEVEL | FLOW_ACT_NO_APPEND;
+	act.action = MLX5_FLOW_CONTEXT_ACTION_FWD_DEST;
+	dest.type  = MLX5_FLOW_DESTINATION_TYPE_FLOW_TABLE;
+	dest.ft = next_ft;
+
+	return mlx5_add_flow_rules(ft, NULL, &act, &dest, 1);
+}
+
+static int
+tc_ct_add_ct_table_miss_rule(struct mlx5_tc_ct_priv *ct_priv)
+{
+	int inlen = MLX5_ST_SZ_BYTES(create_flow_group_in);
+	struct mlx5_flow_handle *miss_rule;
+	struct mlx5_flow_group *miss_group;
+	int max_fte = ct_priv->ct->max_fte;
+	u32 *flow_group_in;
+	int err = 0;
+
+	flow_group_in = kvzalloc(inlen, GFP_KERNEL);
+	if (!flow_group_in)
+		return -ENOMEM;
+
+	/* create miss group */
+	MLX5_SET(create_flow_group_in, flow_group_in, start_flow_index,
+		 max_fte - 2);
+	MLX5_SET(create_flow_group_in, flow_group_in, end_flow_index,
+		 max_fte - 1);
+	miss_group = mlx5_create_flow_group(ct_priv->ct, flow_group_in);
+	if (IS_ERR(miss_group)) {
+		err = PTR_ERR(miss_group);
+		goto err_miss_grp;
+	}
+
+	/* add miss rule to next fdb */
+	miss_rule = tc_ct_add_miss_rule(ct_priv->ct, ct_priv->trk_new_ct);
+	if (IS_ERR(miss_rule)) {
+		err = PTR_ERR(miss_rule);
+		goto err_miss_rule;
+	}
+
+	ct_priv->miss_grp = miss_group;
+	ct_priv->miss_rule = miss_rule;
+	kvfree(flow_group_in);
+	return 0;
+
+err_miss_rule:
+	mlx5_destroy_flow_group(miss_group);
+err_miss_grp:
+	kvfree(flow_group_in);
+	return err;
+}
+
+static void
+tc_ct_del_ct_table_miss_rule(struct mlx5_tc_ct_priv *ct_priv)
+{
+	mlx5_del_flow_rules(ct_priv->miss_rule);
+	mlx5_destroy_flow_group(ct_priv->miss_grp);
+}
+
 #define INIT_ERR_PREFIX "tc ct offload init failed"
 
 struct mlx5_tc_ct_priv *
@@ -1962,6 +2032,18 @@  mlx5_tc_ct_init(struct mlx5e_priv *priv, struct mlx5_fs_chains *chains,
 		goto err_post_ct_tbl;
 	}
 
+	ct_priv->trk_new_ct = mlx5_chains_create_global_table(chains);
+	if (IS_ERR(ct_priv->trk_new_ct)) {
+		err = PTR_ERR(ct_priv->trk_new_ct);
+		mlx5_core_warn(dev, "%s, failed to create trk new ct table err: %d",
+			       INIT_ERR_PREFIX, err);
+		goto err_trk_new_ct_tbl;
+	}
+
+	err = tc_ct_add_ct_table_miss_rule(ct_priv);
+	if (err)
+		goto err_init_ct_tbl;
+
 	idr_init(&ct_priv->fte_ids);
 	mutex_init(&ct_priv->control_lock);
 	mutex_init(&ct_priv->shared_counter_lock);
@@ -1971,6 +2053,10 @@  mlx5_tc_ct_init(struct mlx5e_priv *priv, struct mlx5_fs_chains *chains,
 
 	return ct_priv;
 
+err_init_ct_tbl:
+	mlx5_chains_destroy_global_table(chains, ct_priv->trk_new_ct);
+err_trk_new_ct_tbl:
+	mlx5_chains_destroy_global_table(chains, ct_priv->post_ct);
 err_post_ct_tbl:
 	mlx5_chains_destroy_global_table(chains, ct_priv->ct_nat);
 err_ct_nat_tbl:
@@ -1997,6 +2083,8 @@  mlx5_tc_ct_clean(struct mlx5_tc_ct_priv *ct_priv)
 
 	chains = ct_priv->chains;
 
+	tc_ct_del_ct_table_miss_rule(ct_priv);
+	mlx5_chains_destroy_global_table(chains, ct_priv->trk_new_ct);
 	mlx5_chains_destroy_global_table(chains, ct_priv->post_ct);
 	mlx5_chains_destroy_global_table(chains, ct_priv->ct_nat);
 	mlx5_chains_destroy_global_table(chains, ct_priv->ct);