[xfrm-next,v3,0/6] Extend XFRM core to allow full offload configuration

Message ID: cover.1661260787.git.leonro@nvidia.com

Message

Leon Romanovsky Aug. 23, 2022, 1:31 p.m. UTC
From: Leon Romanovsky <leonro@nvidia.com>

Changelog:
v3:
 * I didn't hear any suggestion for a term to use instead of
   "full offload", so I left it as is. It appears only in commit
   messages and documentation and is easy to rename.
 * Added performance data and background info to the cover letter
 * Reused the xfrm_output_resume() function to support multiple XFRM transformations
 * Added a PMTU check in addition to the driver .xdo_dev_offload_ok validation
 * Documentation is in progress, but is not part of this series yet.
v2: https://lore.kernel.org/all/cover.1660639789.git.leonro@nvidia.com
 * Rebased to latest 6.0-rc1
 * Added an extra check in the TX datapath patch to validate packets
   before forwarding them to HW.
 * Added policy cleanup logic in case of a netdev down event
v1: https://lore.kernel.org/all/cover.1652851393.git.leonro@nvidia.com
 * Moved comment to be before if (...) in third patch.
v0: https://lore.kernel.org/all/cover.1652176932.git.leonro@nvidia.com
-----------------------------------------------------------------------

The following series extends XFRM core code to handle a new type of IPsec
offload - full offload.

In this mode, the HW is responsible for the whole data path, so both
the policy and the state have to be offloaded.

IPsec full offload is an improved version of the IPsec crypto mode.
In full mode, the HW is responsible for trimming/adding headers in
addition to decryption/encryption. In this mode, the packet arrives
at the stack already decrypted, and on TX it leaves the stack
unencrypted and is encrypted by the HW.

Devices that implement IPsec full offload mode offload policies too.
In the RX path, this means that the HW can't effectively handle a mix
of SW and HW priorities unless users make sure that the HW-offloaded
policies have higher priorities.

To make sure that users have a coherent picture, we require that
HW-offloaded policies always have higher priorities (in both RX and TX)
than SW ones.

To avoid over-engineering the code, HW policies are treated like SW ones
and don't take the netdev into account, which allows reusing the same
priorities for different devices.
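
As an illustration only (the priority values below are hypothetical; in
xfrm, a numerically lower priority value takes precedence), a HW-offloaded
policy would be given a priority that sorts before any SW policy:

  # hypothetical priorities: the HW-offloaded policy is matched first
  ip xfrm policy add src 192.168.50.1 dst 192.168.50.2 dir out priority 10 \
     offload full dev eth0 tmpl proto esp mode transport
  ip xfrm policy add src 192.168.50.1 dst 192.168.50.2 dir out priority 1000 \
     tmpl proto esp mode transport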

There are several deliberate limitations:
 * No software fallback
 * Fragments are dropped, both in RX and TX
 * No socket policies
 * Only IPsec transport mode is implemented

================================================================================
Configuration:

iproute2: https://lore.kernel.org/netdev/cover.1652179360.git.leonro@nvidia.com/

Full offload mode:
  ip xfrm state offload full dev <if-name> dir <in|out>
  ip xfrm policy .... offload full dev <if-name>
Crypto offload mode:
  ip xfrm state offload crypto dev <if-name> dir <in|out>
or (backward compatibility)
  ip xfrm state offload dev <if-name> dir <in|out>
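
A minimal end-to-end sketch (the addresses, SPI and key below are
placeholders, and the "offload full" syntax comes from the iproute2
series linked above):

  # state with full offload (dummy SPI/key)
  ip xfrm state add src 192.168.50.1 dst 192.168.50.2 proto esp spi 0x1000 \
     reqid 1 mode transport aead 'rfc4106(gcm(aes))' \
     0x6162636465666768696a6b6c6d6e6f70deadbeef 128 \
     offload full dev eth0 dir out
  # matching policy, offloaded to the same device
  ip xfrm policy add src 192.168.50.1 dst 192.168.50.2 dir out \
     tmpl proto esp reqid 1 mode transport offload full dev eth0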

================================================================================
Performance results:

TCP multi-stream, using one iperf3 instance per CPU.
+----------------------+--------+--------+--------+--------+---------+---------+
|                      | 1 CPU  | 2 CPUs | 4 CPUs | 8 CPUs | 16 CPUs | 32 CPUs |
|                      +--------+--------+--------+--------+---------+---------+
|                      |                       BW (Gbps)                       |
+----------------------+--------+--------+--------+--------+---------+---------+
| Baseline             | 27.9   | 59     | 93.1   | 92.8   | 93.7    | 94.4    |
+----------------------+--------+--------+--------+--------+---------+---------+
| Software IPsec       | 6      | 11.9   | 23.3   | 45.9   | 83.8    | 91.8    |
+----------------------+--------+--------+--------+--------+---------+---------+
| IPsec crypto offload | 15     | 29.7   | 58.5   | 89.6   | 90.4    | 90.8    |
+----------------------+--------+--------+--------+--------+---------+---------+
| IPsec full offload   | 28     | 57     | 90.7   | 91     | 91.3    | 91.9    |
+----------------------+--------+--------+--------+--------+---------+---------+

IPsec full offload mode behaves like the baseline and reaches line rate
with the same number of CPUs.

Setup details (similar on both sides):
* NIC: ConnectX6-DX dual port, 100 Gbps each.
  Single port used in the tests.
* CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz

================================================================================
Series together with mlx5 part:
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=xfrm-next

Thanks

Leon Romanovsky (6):
  xfrm: add new full offload flag
  xfrm: allow state full offload mode
  xfrm: add an interface to offload policy
  xfrm: add TX datapath support for IPsec full offload mode
  xfrm: add RX datapath protection for IPsec full offload mode
  xfrm: enforce separation between priorities of HW/SW policies

 .../inline_crypto/ch_ipsec/chcr_ipsec.c       |   4 +
 .../net/ethernet/intel/ixgbe/ixgbe_ipsec.c    |   5 +
 drivers/net/ethernet/intel/ixgbevf/ipsec.c    |   5 +
 .../mellanox/mlx5/core/en_accel/ipsec.c       |   4 +
 drivers/net/netdevsim/ipsec.c                 |   5 +
 include/linux/netdevice.h                     |   3 +
 include/net/netns/xfrm.h                      |   8 +-
 include/net/xfrm.h                            | 104 +++++++---
 include/uapi/linux/xfrm.h                     |   6 +
 net/xfrm/xfrm_device.c                        | 102 +++++++++-
 net/xfrm/xfrm_output.c                        |  12 +-
 net/xfrm/xfrm_policy.c                        | 181 ++++++++++++++++++
 net/xfrm/xfrm_user.c                          |  19 ++
 13 files changed, 425 insertions(+), 33 deletions(-)

Comments

Jakub Kicinski Aug. 25, 2022, 9:36 p.m. UTC | #1
On Tue, 23 Aug 2022 16:31:57 +0300 Leon Romanovsky wrote:
>  * I didn't hear any suggestion what term to use instead of
>    "full offload", so left it as is. It is used in commit messages
>    and documentation only and easy to rename.
>  * Added performance data and background info to cover letter
>  * Reused xfrm_output_resume() function to support multiple XFRM transformations
>  * Add PMTU check in addition to driver .xdo_dev_offload_ok validation
>  * Documentation is in progress, but not part of this series yet.

Since the use case is somewhat in question, perhaps switch to RFC
postings until the drivers side incl. tc forwarding is implemented?
Also the perf traces, I don't see them here.
Leon Romanovsky Aug. 26, 2022, 6:26 a.m. UTC | #2
On Thu, Aug 25, 2022 at 02:36:10PM -0700, Jakub Kicinski wrote:
> On Tue, 23 Aug 2022 16:31:57 +0300 Leon Romanovsky wrote:
> >  * I didn't hear any suggestion what term to use instead of
> >    "full offload", so left it as is. It is used in commit messages
> >    and documentation only and easy to rename.
> >  * Added performance data and background info to cover letter
> >  * Reused xfrm_output_resume() function to support multiple XFRM transformations
> >  * Add PMTU check in addition to driver .xdo_dev_offload_ok validation
> >  * Documentation is in progress, but not part of this series yet.
> 
> Since the use case is somewhat in question, perhaps switch to RFC
> postings until the drivers side incl. tc forwarding is implemented?

The proposed driver implementation works fine with eswitch representors.
All our flow steering magic is performed on the local table entry, and it
ensures that the representors receive/send "clean" traffic.

We are using the following configuration snippet to achieve that.

---------------------------------------------------------------------
#!/bin/bash
P0_OUTER_REMOTE_IP=192.168.50.2
P0_OUTER_LOCAL_IP=192.168.50.1
PF0=enp8s0f0
VF0_REP=enp8s0f0_0

set -v
# Configure IP and turn VF_REP on
ifconfig $PF0 $P0_OUTER_LOCAL_IP/24 up
ifconfig $VF0_REP up

# Clean all TC rules, start fresh
tc qdisc del dev enp8s0f0 ingress >/dev/null 2>&1
tc qdisc del dev enp8s0f0_0 ingress >/dev/null 2>&1

# Make sure steering mode is dmfs(FW) and eswitch encap is none
devlink dev param set pci/0000:08:00.0 name flow_steering_mode value dmfs cmode runtime
devlink dev eswitch set pci/0000:08:00.0 mode legacy
devlink dev eswitch set pci/0000:08:00.0 encap none
devlink dev eswitch set pci/0000:08:00.0 mode switchdev

sleep 2

tc qdisc add dev enp8s0f0 ingress
tc qdisc add dev enp8s0f0_0 ingress

# Add TC rules
tc filter add dev $PF0 parent ffff: protocol 802.1q chain 0 flower vlan_id 10 vlan_ethtype 802.1q cvlan_id 5 action vlan pop action vlan pop  action mirred egress redirect dev $VF0_REP
tc filter add dev $VF0_REP parent ffff: protocol all chain 0 flower action vlan push protocol 802.1q id 5 action vlan push protocol 802.1q id 10 action mirred egress redirect dev $PF0
tc filter show dev $PF0 ingress
----------------------------------------------------------------------------------------------------

We also don't offload anything related to routing as we can't
differentiate between local traffic.

> Also the perf traces, I don't see them here.

It is worth separating that into a standalone discussion with the title
"why is crypto not fast enough?". I don't think that mixing the discussion
about full offload, which Steffen said he is interested in, with research
into crypto bottlenecks will be productive. These discussions are
orthogonal.

Thanks
Jakub Kicinski Aug. 26, 2022, 11:45 p.m. UTC | #3
On Fri, 26 Aug 2022 09:26:57 +0300 Leon Romanovsky wrote:
> On Thu, Aug 25, 2022 at 02:36:10PM -0700, Jakub Kicinski wrote:
> > On Tue, 23 Aug 2022 16:31:57 +0300 Leon Romanovsky wrote:  
> > >  * I didn't hear any suggestion what term to use instead of
> > >    "full offload", so left it as is. It is used in commit messages
> > >    and documentation only and easy to rename.
> > >  * Added performance data and background info to cover letter
> > >  * Reused xfrm_output_resume() function to support multiple XFRM transformations
> > >  * Add PMTU check in addition to driver .xdo_dev_offload_ok validation
> > >  * Documentation is in progress, but not part of this series yet.  
> > 
> > Since the use case is somewhat in question, perhaps switch to RFC
> > postings until the drivers side incl. tc forwarding is implemented?  
> 
> Proposed driver implementation works fine with eswitch representors.
> All our flow steering magic is performed on local table entry and it
> ensures that representors receives/sends "clean" traffic.
> 
> We are using the following configuration snippet to achieve that.
> 
> ---------------------------------------------------------------------
> #!/bin/bash
> P0_OUTER_REMOTE_IP=192.168.50.2
> P0_OUTER_LOCAL_IP=192.168.50.1
> PF0=enp8s0f0
> VF0_REP=enp8s0f0_0
> 
> set -v
> # Configure IP and turn VF_REP on
> ifconfig $PF0 $P0_OUTER_LOCAL_IP/24 up
> ifconfig $VF0_REP up
> 
> # Clean all TC rules, start fresh
> tc qdisc del dev enp8s0f0 ingress >/dev/null 2>&1
> tc qdisc del dev enp8s0f0_0 ingress >/dev/null 2>&1
> 
> # Make sure steering mode is dmfs(FW) and eswitch encap is none
> devlink dev param set pci/0000:08:00.0 name flow_steering_mode value dmfs cmode runtime
> devlink dev eswitch set pci/0000:08:00.0 mode legacy
> devlink dev eswitch set pci/0000:08:00.0 encap none
> devlink dev eswitch set pci/0000:08:00.0 mode switchdev
> 
> sleep 2
> 
> tc qdisc add dev enp8s0f0 ingress
> tc qdisc add dev enp8s0f0_0 ingress
> 
> # Add TC rules
> tc filter add dev $PF0 parent ffff: protocol 802.1q chain 0 flower vlan_id 10 vlan_ethtype 802.1q cvlan_id 5 action vlan pop action vlan pop  action mirred egress redirect dev $VF0_REP
> tc filter add dev $VF0_REP parent ffff: protocol all chain 0 flower action vlan push protocol 802.1q id 5 action vlan push protocol 802.1q id 10 action mirred egress redirect dev $PF0
> tc filter show dev $PF0 ingress
> ----------------------------------------------------------------------------------------------------
> 
> We also don't offload anything related to routing as we can't
> differentiate between local traffic.

Yeah, nah, that's not what I'm asking for.
I said forwarding, not sending traffic through a different virtual
interface. The TC rules must forward from or to the IPsec ifc.

That was the use case Jason mentioned.

> > Also the perf traces, I don't see them here.  
> 
> It is worth to separate it to standalone discussion with a title:
> "why crypto is not fast enough?". I don't think that mixed discussions
> about full offload which Steffen said that he is interested and
> research about crypto bottlenecks will be productive. These discussions
> are orthogonal.

What do you mean by crypto bottlenecks?

Please use more precise language. crypto here may mean "crypto only
offload" or "crypto as done by CPU". I have no idea which one you mean.

We are very much interested in the former, the latter is indeed out of
scope here.
Leon Romanovsky Aug. 28, 2022, 9:28 a.m. UTC | #4
On Fri, Aug 26, 2022 at 04:45:22PM -0700, Jakub Kicinski wrote:
> On Fri, 26 Aug 2022 09:26:57 +0300 Leon Romanovsky wrote:
> > On Thu, Aug 25, 2022 at 02:36:10PM -0700, Jakub Kicinski wrote:
> > > On Tue, 23 Aug 2022 16:31:57 +0300 Leon Romanovsky wrote:  
> > > >  * I didn't hear any suggestion what term to use instead of
> > > >    "full offload", so left it as is. It is used in commit messages
> > > >    and documentation only and easy to rename.
> > > >  * Added performance data and background info to cover letter
> > > >  * Reused xfrm_output_resume() function to support multiple XFRM transformations
> > > >  * Add PMTU check in addition to driver .xdo_dev_offload_ok validation
> > > >  * Documentation is in progress, but not part of this series yet.  
> > > 
> > > Since the use case is somewhat in question, perhaps switch to RFC
> > > postings until the drivers side incl. tc forwarding is implemented?  

<...>

> > We also don't offload anything related to routing as we can't
> > differentiate between local traffic.
> 
> Yeah, nah, that's not what I'm asking for.
> I said forwarding, not sending traffic thru a different virtual
> interface. The TC rules must forward from or two the IPSec ifc.
> 
> That was the use case Jason mentioned.

I see, the word "TC" confused me, sorry about that.

My next mlx5-related task, after this IPsec full offload is accepted,
will be to extend mlx5 with extra eswitch logic.

There is no change in the API, xfrm code or behavior, just an internal
change in where the IPsec flow steering tables are created and how they
are created/destroyed.

Unfortunately, this "just..." change is a complicated task due to the
mlx5 core internal implementation and will take time, but as I said,
I will do it.

> 
> > > Also the perf traces, I don't see them here.  
> > 
> > It is worth to separate it to standalone discussion with a title:
> > "why crypto is not fast enough?". I don't think that mixed discussions
> > about full offload which Steffen said that he is interested and
> > research about crypto bottlenecks will be productive. These discussions
> > are orthogonal.
> 
> What do you mean by crypto bottlenecks?

I think that I have used the same terms consistently:
* IPsec SW - software path
* IPsec crypto - HW offload of the crypto part
* IPsec full offload - state and policy offloads to the HW.

I will try to be clearer next time.

> 
> Please use more precise language. crypto here may mean "crypto only
> offload" or "crypto as done by CPU". I have no idea which one you mean.
> 
> We are very much interested in the former, the latter is indeed out of
> scope here.
Steffen Klassert Aug. 29, 2022, 7:54 a.m. UTC | #5
On Tue, Aug 23, 2022 at 04:31:57PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Changelog:
> v3:
>  * I didn't hear any suggestion what term to use instead of
>    "full offload", so left it as is. It is used in commit messages
>    and documentation only and easy to rename.
>  * Added performance data and background info to cover letter
>  * Reused xfrm_output_resume() function to support multiple XFRM transformations
>  * Add PMTU check in addition to driver .xdo_dev_offload_ok validation
>  * Documentation is in progress, but not part of this series yet.
> v2: https://lore.kernel.org/all/cover.1660639789.git.leonro@nvidia.com
>  * Rebased to latest 6.0-rc1
>  * Add an extra check in TX datapath patch to validate packets before
>    forwarding to HW.
>  * Added policy cleanup logic in case of netdev down event
> v1: https://lore.kernel.org/all/cover.1652851393.git.leonro@nvidia.com
>  * Moved comment to be before if (...) in third patch.
> v0: https://lore.kernel.org/all/cover.1652176932.git.leonro@nvidia.com
> -----------------------------------------------------------------------
> 
> The following series extends XFRM core code to handle a new type of IPsec
> offload - full offload.
> 
> In this mode, the HW is going to be responsible for the whole data path,
> so both policy and state should be offloaded.
> 
> IPsec full offload is an improved version of IPsec crypto mode,
> In full mode, HW is responsible to trim/add headers in addition to
> decrypt/encrypt. In this mode, the packet arrives to the stack as already
> decrypted and vice versa for TX (exits to HW as not-encrypted).
> 
> Devices that implement IPsec full offload mode offload policies too.
> In the RX path, it causes the situation that HW can't effectively
> handle mixed SW and HW priorities unless users make sure that HW offloaded
> policies have higher priorities.
> 
> To make sure that users have a coherent picture, we require that
> HW offloaded policies have always (both RX and TX) higher priorities
> than SW ones.
> 
> To not over-engineer the code, HW policies are treated as SW ones and
> don't take into account netdev to allow reuse of the same priorities for
> different devices.
> 
> There are several deliberate limitations:
>  * No software fallback
>  * Fragments are dropped, both in RX and TX
>  * No sockets policies
>  * Only IPsec transport mode is implemented

... and you still have not answered the fundamental questions:

As implemented, the software does not hold any state.
I.e. there is no sync between hardware and software
regarding stats, lifetime, lifebytes, packet counts
and the replay window. IKE rekeying and auditing are
based on these; how should this be done?

How can tunnel mode work with this offload?

I want to see the full picture before I consider
applying this.
Steffen Klassert Aug. 29, 2022, 8:07 a.m. UTC | #6
On Fri, Aug 26, 2022 at 09:26:57AM +0300, Leon Romanovsky wrote:
> On Thu, Aug 25, 2022 at 02:36:10PM -0700, Jakub Kicinski wrote:
> > On Tue, 23 Aug 2022 16:31:57 +0300 Leon Romanovsky wrote:
> > >  * I didn't hear any suggestion what term to use instead of
> > >    "full offload", so left it as is. It is used in commit messages
> > >    and documentation only and easy to rename.
> > >  * Added performance data and background info to cover letter
> > >  * Reused xfrm_output_resume() function to support multiple XFRM transformations
> > >  * Add PMTU check in addition to driver .xdo_dev_offload_ok validation
> > >  * Documentation is in progress, but not part of this series yet.
> > 
> > Since the use case is somewhat in question, perhaps switch to RFC
> > postings until the drivers side incl. tc forwarding is implemented?

+1

Please mark it as RFC as long as the full picture is not yet clear.
This is still far away from being ready for merging.

> 
> > Also the perf traces, I don't see them here.
> 
> It is worth to separate it to standalone discussion with a title:
> "why crypto is not fast enough?".

This is not the question. I want to know whether the performance gain
comes from the encapsulation/decapsulation offload, the lookup offload,
or something else.

Please provide the traces.

> I don't think that mixed discussions
> about full offload which Steffen said that he is interested and
> research about crypto bottlenecks will be productive. These discussions
> are orthogonal.

I'm interested in a 'full offload', but we need to make sure
this 'full offload' is able to support most of the common
use cases. It is still not clear whether the particular
offload you are proposing is the one that can make it.
Leon Romanovsky Aug. 30, 2022, 6:48 a.m. UTC | #7
On Mon, Aug 29, 2022 at 09:54:03AM +0200, Steffen Klassert wrote:
> On Tue, Aug 23, 2022 at 04:31:57PM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > Changelog:
> > v3:
> >  * I didn't hear any suggestion what term to use instead of
> >    "full offload", so left it as is. It is used in commit messages
> >    and documentation only and easy to rename.
> >  * Added performance data and background info to cover letter
> >  * Reused xfrm_output_resume() function to support multiple XFRM transformations
> >  * Add PMTU check in addition to driver .xdo_dev_offload_ok validation
> >  * Documentation is in progress, but not part of this series yet.
> > v2: https://lore.kernel.org/all/cover.1660639789.git.leonro@nvidia.com
> >  * Rebased to latest 6.0-rc1
> >  * Add an extra check in TX datapath patch to validate packets before
> >    forwarding to HW.
> >  * Added policy cleanup logic in case of netdev down event
> > v1: https://lore.kernel.org/all/cover.1652851393.git.leonro@nvidia.com
> >  * Moved comment to be before if (...) in third patch.
> > v0: https://lore.kernel.org/all/cover.1652176932.git.leonro@nvidia.com
> > -----------------------------------------------------------------------
> > 
> > The following series extends XFRM core code to handle a new type of IPsec
> > offload - full offload.
> > 
> > In this mode, the HW is going to be responsible for the whole data path,
> > so both policy and state should be offloaded.
> > 
> > IPsec full offload is an improved version of IPsec crypto mode,
> > In full mode, HW is responsible to trim/add headers in addition to
> > decrypt/encrypt. In this mode, the packet arrives to the stack as already
> > decrypted and vice versa for TX (exits to HW as not-encrypted).
> > 
> > Devices that implement IPsec full offload mode offload policies too.
> > In the RX path, it causes the situation that HW can't effectively
> > handle mixed SW and HW priorities unless users make sure that HW offloaded
> > policies have higher priorities.
> > 
> > To make sure that users have a coherent picture, we require that
> > HW offloaded policies have always (both RX and TX) higher priorities
> > than SW ones.
> > 
> > To not over-engineer the code, HW policies are treated as SW ones and
> > don't take into account netdev to allow reuse of the same priorities for
> > different devices.
> > 
> > There are several deliberate limitations:
> >  * No software fallback
> >  * Fragments are dropped, both in RX and TX
> >  * No sockets policies
> >  * Only IPsec transport mode is implemented
> 
> ... and you still have not answered the fundamental questions:
> 
> As implemented, the software does not hold any state.
> I.e. there is no sync between hardware and software
> regarding stats, liftetime, lifebyte, packet counts
> and replay window. IKE rekeying and auditing is based
> on these, how should this be done?
> 
> How can tunnel mode work with this offload?
> 
> I want to see the full picture before I consider to
> apply this.

Hi,

Unfortunately, due to vacation time, it is taking longer to get internal
answers than I anticipated.

I will continue to collect the needed data and will add everything to the
cover letter/documentation when I respin.

Thanks
Jason Gunthorpe Aug. 30, 2022, 5:36 p.m. UTC | #8
On Fri, Aug 26, 2022 at 04:45:22PM -0700, Jakub Kicinski wrote:

> > # Add TC rules
> > tc filter add dev $PF0 parent ffff: protocol 802.1q chain 0 flower vlan_id 10 vlan_ethtype 802.1q cvlan_id 5 action vlan pop action vlan pop  action mirred egress redirect dev $VF0_REP
> > tc filter add dev $VF0_REP parent ffff: protocol all chain 0 flower action vlan push protocol 802.1q id 5 action vlan push protocol 802.1q id 10 action mirred egress redirect dev $PF0
> > tc filter show dev $PF0 ingress
> > ----------------------------------------------------------------------------------------------------
> > 
> > We also don't offload anything related to routing as we can't
> > differentiate between local traffic.
> 
> Yeah, nah, that's not what I'm asking for.
> I said forwarding, not sending traffic thru a different virtual
> interface. The TC rules must forward from or two the IPSec ifc.
> 
> That was the use case Jason mentioned.

I meant, rather generically, handling the packets on the
hypervisor side without involving the CPU.

We have customers deploying many different models for this in their
hypervisor, including a significant deployment using a model like the
above. 

It achieves a kind of connectivity to a VM with zero hypervisor CPU
involvement, with the vlan push/pop done in HW.

We have other use models, like the offloaded OVS switching model you
are alluding to; that is what Leon has as a follow-up.

Jason
Jakub Kicinski Aug. 30, 2022, 6:56 p.m. UTC | #9
On Tue, 30 Aug 2022 14:36:52 -0300 Jason Gunthorpe wrote:
> I was meaning rather generically handling the packets in the
> hypervisor side without involving the CPU. 
> 
> We have customers deploying many different models for this in their
> hypervisor, including a significant deployment using a model like the
> above. 
> 
> It achieves a kind of connectivity to a VM with 0 hypervisor CPU
> involvement with the vlan push/pop done in HW.

I don't see how that necessitates the full IPsec offload.

Sorry, perhaps I assumed we were on the same page too hastily.

I was referring to a model where the TC rules redirect to an IPsec
tunnel which is then routed to the uplink. All offloaded. Like
we do today for VxLAN encap/decap.

That necessitates full offload, and therefore can be used as a strong
argument for having a full offload in netdev.
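
(For reference, a rough sketch of the VxLAN-style TC offload pattern
being referred to; the device names and IDs below are made up:)

  # decap side: match on the tunnel key and redirect to a representor
  ip link add vxlan0 type vxlan dstport 4789 external
  tc filter add dev vxlan0 ingress flower enc_key_id 100 enc_dst_port 4789 \
     action tunnel_key unset action mirred egress redirect dev pf0vf0
  # encap side: set the tunnel key and redirect back to the vxlan device
  tc filter add dev pf0vf0 ingress flower action tunnel_key set id 100 \
     src_ip 192.168.50.1 dst_ip 192.168.50.2 dst_port 4789 \
     action mirred egress redirect dev vxlan0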

> We other use-models, like the offloaded OVS switching model you are
> alluding to, that is Leon has as a followup.
Jason Gunthorpe Aug. 30, 2022, 10:34 p.m. UTC | #10
On Tue, Aug 30, 2022 at 11:56:27AM -0700, Jakub Kicinski wrote:
> On Tue, 30 Aug 2022 14:36:52 -0300 Jason Gunthorpe wrote:
> > I was meaning rather generically handling the packets in the
> > hypervisor side without involving the CPU. 
> > 
> > We have customers deploying many different models for this in their
> > hypervisor, including a significant deployment using a model like the
> > above. 
> > 
> > It achieves a kind of connectivity to a VM with 0 hypervisor CPU
> > involvement with the vlan push/pop done in HW.
> 
> I don't see how that necessitates the full IPsec offload.

I think it would help if Leon showed the xfrm commands in his
configuration snippet and explained the traffic flow that is achieved
by it.

He has a configuration that is as I described, 0 hypervisor CPU
involvement, IPSEC, and VMs.

> Sorry, perhaps I assumed we were on the same page too hastily.

Well, I think we all agree that if a hypervisor environment is
controlling the IPSEC then we'd like a path similar to others where
the hypervisor doesn't process the packets on the CPU - the traffic
just flows in HW right out of the VM.

It seems obvious, but this means that the hypervisor controls HW that
does the crypto, the anti-replay, the counting and limiting
"autonomously" from the hypervisor CPU. (the so-called "full offload")

> I was referring to a model where the TC rules redirect to a IPsec
> tunnel which is then routed to the uplink. All offloaded. Like
> we do today for VxLAN encap/decap.

I think there are several models with TC that can do this. It is
probably worth talking about exactly which models should and need to
have it.

E.g. what exactly do you mean when you say "TC rules redirect to an
IPsec tunnel"? How does a TC rule redirect to an xfrm?

Jason
Leon Romanovsky Sept. 4, 2022, 4:45 p.m. UTC | #11
On Mon, Aug 29, 2022 at 09:54:03AM +0200, Steffen Klassert wrote:
> On Tue, Aug 23, 2022 at 04:31:57PM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > Changelog:
> > v3:
> >  * I didn't hear any suggestion what term to use instead of
> >    "full offload", so left it as is. It is used in commit messages
> >    and documentation only and easy to rename.
> >  * Added performance data and background info to cover letter
> >  * Reused xfrm_output_resume() function to support multiple XFRM transformations
> >  * Add PMTU check in addition to driver .xdo_dev_offload_ok validation
> >  * Documentation is in progress, but not part of this series yet.
> > v2: https://lore.kernel.org/all/cover.1660639789.git.leonro@nvidia.com
> >  * Rebased to latest 6.0-rc1
> >  * Add an extra check in TX datapath patch to validate packets before
> >    forwarding to HW.
> >  * Added policy cleanup logic in case of netdev down event
> > v1: https://lore.kernel.org/all/cover.1652851393.git.leonro@nvidia.com
> >  * Moved comment to be before if (...) in third patch.
> > v0: https://lore.kernel.org/all/cover.1652176932.git.leonro@nvidia.com
> > -----------------------------------------------------------------------
> > 
> > The following series extends XFRM core code to handle a new type of IPsec
> > offload - full offload.
> > 
> > In this mode, the HW is going to be responsible for the whole data path,
> > so both policy and state should be offloaded.
> > 
> > IPsec full offload is an improved version of IPsec crypto mode,
> > In full mode, HW is responsible to trim/add headers in addition to
> > decrypt/encrypt. In this mode, the packet arrives to the stack as already
> > decrypted and vice versa for TX (exits to HW as not-encrypted).
> > 
> > Devices that implement IPsec full offload mode offload policies too.
> > In the RX path, it causes the situation that HW can't effectively
> > handle mixed SW and HW priorities unless users make sure that HW offloaded
> > policies have higher priorities.
> > 
> > To make sure that users have a coherent picture, we require that
> > HW offloaded policies have always (both RX and TX) higher priorities
> > than SW ones.
> > 
> > To not over-engineer the code, HW policies are treated as SW ones and
> > don't take into account netdev to allow reuse of the same priorities for
> > different devices.
> > 
> > There are several deliberate limitations:
> >  * No software fallback
> >  * Fragments are dropped, both in RX and TX
> >  * No sockets policies
> >  * Only IPsec transport mode is implemented
> 
> ... and you still have not answered the fundamental questions:
> 
> As implemented, the software does not hold any state.
> I.e. there is no sync between hardware and software
> regarding stats, liftetime, lifebyte, packet counts
> and replay window. IKE rekeying and auditing is based
> on these, how should this be done?

I hope that the patch added in v4 clarifies it. There is a sync between
the HW and the core regarding packet counts. The HW generates an event,
and xfrm_state_check_expire() is called so that the already existing
logic performs the rekeying.

The replay window will be handled in a similar way: the HW will generate
an event.
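
A rough sketch of what such a driver event handler could look like (the
driver-side names here are hypothetical; xfrm_state_check_expire() is the
existing core helper that signals expiry/rekeying when the soft/hard
limits are reached):

#include <net/xfrm.h>

/* Hypothetical handler for a HW "SA counters threshold reached" event:
 * refresh the SW lifetime counters from the HW and let the existing
 * core logic decide whether rekeying has to be signalled.
 */
static void my_ipsec_event_handler(struct my_ipsec_sa *sa)
{
	struct xfrm_state *x = sa->xs;	/* xfrm state backing this HW SA */
	u64 packets, bytes;

	my_hw_read_sa_counters(sa, &packets, &bytes); /* device-specific */

	spin_lock_bh(&x->lock);
	x->curlft.packets = packets;
	x->curlft.bytes = bytes;
	xfrm_state_check_expire(x);	/* signals expiry if limits were hit */
	spin_unlock_bh(&x->lock);
}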

> 
> How can tunnel mode work with this offload?

I don't see any issues here. The same rules will apply.

> 
> I want to see the full picture before I consider to
> apply this.