diff mbox series

[xfrm-next,v7,6/8] xfrm: speed-up lookup of HW policies

Message ID f611857594c5c53918d782f104d6f4e028ba465d.1667997522.git.leonro@nvidia.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series Extend XFRM core to allow packet offload configuration | expand

Checks

Context Check Description
netdev/tree_selection success Guessed tree name to be net-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 113 this patch: 113
netdev/cc_maintainers warning 1 maintainers not CCed: pabeni@redhat.com
netdev/build_clang success Errors and warnings before: 5 this patch: 5
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 113 this patch: 113
netdev/checkpatch warning CHECK: Macro argument '_type' may be better as '(_type)' to avoid precedence issues
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Leon Romanovsky Nov. 9, 2022, 12:54 p.m. UTC
From: Leon Romanovsky <leonro@nvidia.com>

Devices that implement IPsec packet offload mode should offload SA and
policies too. In RX path, it causes to the situation that HW will always
have higher priority over any SW policies.

It means that we don't need to perform any search of inexact policies
and/or priority checks if HW policy was discovered. In such situation,
the HW will catch the packets anyway and HW can still implement inexact
lookups.

In case specific policy is not found, we will continue with packet lookup and
check for existence of HW policies in inexact list.

HW policies are added to the head of SPD to ensure fast lookup, as XFRM
iterates over all policies in the loop.

The same solution of adding HW SAs at the begging of the list is applied
to SA database too. However, we don't need to change lookups as they are
sorted by insertion order and not priority.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 net/xfrm/xfrm_policy.c | 16 ++++++----
 net/xfrm/xfrm_state.c  | 66 ++++++++++++++++++++++++++++++++----------
 2 files changed, 62 insertions(+), 20 deletions(-)

Comments

Steffen Klassert Nov. 17, 2022, 12:12 p.m. UTC | #1
On Wed, Nov 09, 2022 at 02:54:34PM +0200, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> @@ -1166,16 +1187,24 @@ xfrm_state_find(const xfrm_address_t *daddr, const xfrm_address_t *saddr,
>  			spin_lock_bh(&net->xfrm.xfrm_state_lock);
>  			x->km.state = XFRM_STATE_ACQ;
>  			list_add(&x->km.all, &net->xfrm.state_all);
> -			hlist_add_head_rcu(&x->bydst, net->xfrm.state_bydst + h);
> +			XFRM_STATE_INSERT(bydst, &x->bydst,
> +					  net->xfrm.state_bydst + h,
> +					  x->xso.type);
>  			h = xfrm_src_hash(net, daddr, saddr, encap_family);
> -			hlist_add_head_rcu(&x->bysrc, net->xfrm.state_bysrc + h);
> +			XFRM_STATE_INSERT(bysrc, &x->bysrc,
> +					  net->xfrm.state_bysrc + h,
> +					  x->xso.type);
>  			if (x->id.spi) {
>  				h = xfrm_spi_hash(net, &x->id.daddr, x->id.spi, x->id.proto, encap_family);
> -				hlist_add_head_rcu(&x->byspi, net->xfrm.state_byspi + h);
> +				XFRM_STATE_INSERT(byspi, &x->byspi,
> +						  net->xfrm.state_byspi + h,
> +						  x->xso.type);
>  			}
>  			if (x->km.seq) {
>  				h = xfrm_seq_hash(net, x->km.seq);
> -				hlist_add_head_rcu(&x->byseq, net->xfrm.state_byseq + h);
> +				XFRM_STATE_INSERT(byseq, &x->byseq,
> +						  net->xfrm.state_byseq + h,
> +						  x->xso.type);
>  			}

This does not work. A larval state will never have a x->xso.type set.
So this raises the question how to handle acquires with this packet
offload. You could place the type and offload device to the template,
but we also have to make sure not to mess too much with the non
offloaded codepath.

This is yet another corner case where the concept of doing policy and
state lookup in software for a HW offload does not work so well. I
fear this is not the last corner case that comes up once you put this
into a real network.
Leon Romanovsky Nov. 17, 2022, 12:51 p.m. UTC | #2
On Thu, Nov 17, 2022 at 01:12:43PM +0100, Steffen Klassert wrote:
> On Wed, Nov 09, 2022 at 02:54:34PM +0200, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > @@ -1166,16 +1187,24 @@ xfrm_state_find(const xfrm_address_t *daddr, const xfrm_address_t *saddr,
> >  			spin_lock_bh(&net->xfrm.xfrm_state_lock);
> >  			x->km.state = XFRM_STATE_ACQ;
> >  			list_add(&x->km.all, &net->xfrm.state_all);
> > -			hlist_add_head_rcu(&x->bydst, net->xfrm.state_bydst + h);
> > +			XFRM_STATE_INSERT(bydst, &x->bydst,
> > +					  net->xfrm.state_bydst + h,
> > +					  x->xso.type);
> >  			h = xfrm_src_hash(net, daddr, saddr, encap_family);
> > -			hlist_add_head_rcu(&x->bysrc, net->xfrm.state_bysrc + h);
> > +			XFRM_STATE_INSERT(bysrc, &x->bysrc,
> > +					  net->xfrm.state_bysrc + h,
> > +					  x->xso.type);
> >  			if (x->id.spi) {
> >  				h = xfrm_spi_hash(net, &x->id.daddr, x->id.spi, x->id.proto, encap_family);
> > -				hlist_add_head_rcu(&x->byspi, net->xfrm.state_byspi + h);
> > +				XFRM_STATE_INSERT(byspi, &x->byspi,
> > +						  net->xfrm.state_byspi + h,
> > +						  x->xso.type);
> >  			}
> >  			if (x->km.seq) {
> >  				h = xfrm_seq_hash(net, x->km.seq);
> > -				hlist_add_head_rcu(&x->byseq, net->xfrm.state_byseq + h);
> > +				XFRM_STATE_INSERT(byseq, &x->byseq,
> > +						  net->xfrm.state_byseq + h,
> > +						  x->xso.type);
> >  			}
> 
> This does not work. A larval state will never have a x->xso.type set.

x->xso.type always exists. Default is 0, which is XFRM_DEV_OFFLOAD_UNSPECIFIED.
It means this XFRM_STATE_INSERT() will behave exactly as hlist_add_head_rcu() before.

> So this raises the question how to handle acquires with this packet
> offload. 

We handle acquires as SW policies and don't offload them.


> You could place the type and offload device to the template,
> but we also have to make sure not to mess too much with the non
> offloaded codepath.
> 
> This is yet another corner case where the concept of doing policy and
> state lookup in software for a HW offload does not work so well. I
> fear this is not the last corner case that comes up once you put this
> into a real network.
> 

It is not different from any other kernel code, bugs will be fixed.

BTW, IPsec packet offload mode is in use for almost two years already
in real networks.
https://docs.nvidia.com/networking/display/OFEDv521040/Changes+and+New+Features

Thanks
Steffen Klassert Nov. 18, 2022, 10:49 a.m. UTC | #3
On Thu, Nov 17, 2022 at 02:51:33PM +0200, Leon Romanovsky wrote:
> On Thu, Nov 17, 2022 at 01:12:43PM +0100, Steffen Klassert wrote:
> > On Wed, Nov 09, 2022 at 02:54:34PM +0200, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > This does not work. A larval state will never have a x->xso.type set.
> 
> x->xso.type always exists. Default is 0, which is XFRM_DEV_OFFLOAD_UNSPECIFIED.
> It means this XFRM_STATE_INSERT() will behave exactly as hlist_add_head_rcu() before.

Sure it exists, and is always 0 here.

> 
> > So this raises the question how to handle acquires with this packet
> > offload. 
> 
> We handle acquires as SW policies and don't offload them.

We trigger acquires with states, not policies. The thing is,
we might match a HW policy but create a SW acquire state.
This will not match anymore as soon as the lookup is
implemented correctly.

> > You could place the type and offload device to the template,
> > but we also have to make sure not to mess too much with the non
> > offloaded codepath.
> > 
> > This is yet another corner case where the concept of doing policy and
> > state lookup in software for a HW offload does not work so well. I
> > fear this is not the last corner case that comes up once you put this
> > into a real network.
> > 
> 
> It is not different from any other kernel code, bugs will be fixed.

The thing that is different here is, that the concept is already
broken. We can't split the datapath to be partially handled in
SW and HW in any sane way, this becomes clearer and clearer.

The full protocol offload simply does not fit well into HW,
but we try to make it fit with a hammer. This is the problem
why I do not really like this, and is also the reason why this
is still not merged. We might be much better of by doing a
HW frindly redesign of the protocol and offload this then.
But, yes that takes time and will have issues too.
Leon Romanovsky Nov. 20, 2022, 7:17 p.m. UTC | #4
On Fri, Nov 18, 2022 at 11:49:07AM +0100, Steffen Klassert wrote:
> On Thu, Nov 17, 2022 at 02:51:33PM +0200, Leon Romanovsky wrote:
> > On Thu, Nov 17, 2022 at 01:12:43PM +0100, Steffen Klassert wrote:
> > > On Wed, Nov 09, 2022 at 02:54:34PM +0200, Leon Romanovsky wrote:
> > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > 
> > > This does not work. A larval state will never have a x->xso.type set.
> > 
> > x->xso.type always exists. Default is 0, which is XFRM_DEV_OFFLOAD_UNSPECIFIED.
> > It means this XFRM_STATE_INSERT() will behave exactly as hlist_add_head_rcu() before.
> 
> Sure it exists, and is always 0 here.
> 
> > 
> > > So this raises the question how to handle acquires with this packet
> > > offload. 
> > 
> > We handle acquires as SW policies and don't offload them.
> 
> We trigger acquires with states, not policies. The thing is,
> we might match a HW policy but create a SW acquire state.
> This will not match anymore as soon as the lookup is
> implemented correctly.

For now, all such packets will be dropped as we have offlaoded
policy but not SA.

Like we said in one of our IPsec coffee hours, this flow will be
supported later. Right now, it is not important.

> 
> > > You could place the type and offload device to the template,
> > > but we also have to make sure not to mess too much with the non
> > > offloaded codepath.
> > > 
> > > This is yet another corner case where the concept of doing policy and
> > > state lookup in software for a HW offload does not work so well. I
> > > fear this is not the last corner case that comes up once you put this
> > > into a real network.
> > > 
> > 
> > It is not different from any other kernel code, bugs will be fixed.
> 
> The thing that is different here is, that the concept is already
> broken. We can't split the datapath to be partially handled in
> SW and HW in any sane way, this becomes clearer and clearer.
> 
> The full protocol offload simply does not fit well into HW,
> but we try to make it fit with a hammer. This is the problem
> why I do not really like this, and is also the reason why this
> is still not merged. We might be much better of by doing a
> HW frindly redesign of the protocol and offload this then.
> But, yes that takes time and will have issues too.

When you say "protocol", what do you mean? Many users, who have
deployed IPsec solutions, just want to have same look and feel
but much faster.

I truly believe that this packet offload fits SW model and the
(small) amount of changes supports it. There are almost no changes
to the stack to natively support this offload.

As long as HW involved, you will never have solution without issues,
and like you said even redesign "will have issues".

Plus, we called it packet offload, better and redesigned version will
be called something else. My patches which enhance XFRM to support more
than type of offload are exactly for that.

Thanks
Steffen Klassert Nov. 21, 2022, 9:44 a.m. UTC | #5
On Sun, Nov 20, 2022 at 09:17:02PM +0200, Leon Romanovsky wrote:
> On Fri, Nov 18, 2022 at 11:49:07AM +0100, Steffen Klassert wrote:
> > On Thu, Nov 17, 2022 at 02:51:33PM +0200, Leon Romanovsky wrote:
> > > On Thu, Nov 17, 2022 at 01:12:43PM +0100, Steffen Klassert wrote:
> > > > On Wed, Nov 09, 2022 at 02:54:34PM +0200, Leon Romanovsky wrote:
> > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > 
> > > > So this raises the question how to handle acquires with this packet
> > > > offload. 
> > > 
> > > We handle acquires as SW policies and don't offload them.
> > 
> > We trigger acquires with states, not policies. The thing is,
> > we might match a HW policy but create a SW acquire state.
> > This will not match anymore as soon as the lookup is
> > implemented correctly.
> 
> For now, all such packets will be dropped as we have offlaoded
> policy but not SA.

I think you missed my point. If the HW policy does not match
the SW acquire state, then each packet will geneate a new
acquire. So you need to make sure that policy and acquire
state will match to send the acquire just once to userspace.

> > > It is not different from any other kernel code, bugs will be fixed.
> > 
> > The thing that is different here is, that the concept is already
> > broken. We can't split the datapath to be partially handled in
> > SW and HW in any sane way, this becomes clearer and clearer.
> > 
> > The full protocol offload simply does not fit well into HW,
> > but we try to make it fit with a hammer. This is the problem
> > why I do not really like this, and is also the reason why this
> > is still not merged. We might be much better of by doing a
> > HW frindly redesign of the protocol and offload this then.
> > But, yes that takes time and will have issues too.
> 
> When you say "protocol", what do you mean? Many users, who have
> deployed IPsec solutions, just want to have same look and feel
> but much faster.
> 
> I truly believe that this packet offload fits SW model and the
> (small) amount of changes supports it. There are almost no changes
> to the stack to natively support this offload.
> 
> As long as HW involved, you will never have solution without issues,
> and like you said even redesign "will have issues".

Things would be much easier, if we don't need to add HW policies
and states to SW databases. But yes, a redesign might have issues
too. That's why we are still working on the current soluion :)
Leon Romanovsky Nov. 21, 2022, 10:27 a.m. UTC | #6
On Mon, Nov 21, 2022 at 10:44:04AM +0100, Steffen Klassert wrote:
> On Sun, Nov 20, 2022 at 09:17:02PM +0200, Leon Romanovsky wrote:
> > On Fri, Nov 18, 2022 at 11:49:07AM +0100, Steffen Klassert wrote:
> > > On Thu, Nov 17, 2022 at 02:51:33PM +0200, Leon Romanovsky wrote:
> > > > On Thu, Nov 17, 2022 at 01:12:43PM +0100, Steffen Klassert wrote:
> > > > > On Wed, Nov 09, 2022 at 02:54:34PM +0200, Leon Romanovsky wrote:
> > > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > 
> > > > > So this raises the question how to handle acquires with this packet
> > > > > offload. 
> > > > 
> > > > We handle acquires as SW policies and don't offload them.
> > > 
> > > We trigger acquires with states, not policies. The thing is,
> > > we might match a HW policy but create a SW acquire state.
> > > This will not match anymore as soon as the lookup is
> > > implemented correctly.
> > 
> > For now, all such packets will be dropped as we have offlaoded
> > policy but not SA.
> 
> I think you missed my point. If the HW policy does not match
> the SW acquire state, then each packet will geneate a new
> acquire. So you need to make sure that policy and acquire
> state will match to send the acquire just once to userspace.

I think that I'm still missing the point.

We require both policy and SA to be offloaded. It means that once
we hit HW policy, we must hit SA too (at least this is how mlx5 part
is implemented).

If it doesn't hit HW policy, it means that whole path is not offloaded,
so no harm will be if we create SW SA for this path.

> 
> > > > It is not different from any other kernel code, bugs will be fixed.
> > > 
> > > The thing that is different here is, that the concept is already
> > > broken. We can't split the datapath to be partially handled in
> > > SW and HW in any sane way, this becomes clearer and clearer.
> > > 
> > > The full protocol offload simply does not fit well into HW,
> > > but we try to make it fit with a hammer. This is the problem
> > > why I do not really like this, and is also the reason why this
> > > is still not merged. We might be much better of by doing a
> > > HW frindly redesign of the protocol and offload this then.
> > > But, yes that takes time and will have issues too.
> > 
> > When you say "protocol", what do you mean? Many users, who have
> > deployed IPsec solutions, just want to have same look and feel
> > but much faster.
> > 
> > I truly believe that this packet offload fits SW model and the
> > (small) amount of changes supports it. There are almost no changes
> > to the stack to natively support this offload.
> > 
> > As long as HW involved, you will never have solution without issues,
> > and like you said even redesign "will have issues".
> 
> Things would be much easier, if we don't need to add HW policies
> and states to SW databases. But yes, a redesign might have issues
> too. That's why we are still working on the current soluion :)
> 

There will be PSP for that, but it is not IPsec.

Thanks
Steffen Klassert Nov. 21, 2022, 11:09 a.m. UTC | #7
On Mon, Nov 21, 2022 at 12:27:01PM +0200, Leon Romanovsky wrote:
> On Mon, Nov 21, 2022 at 10:44:04AM +0100, Steffen Klassert wrote:
> > On Sun, Nov 20, 2022 at 09:17:02PM +0200, Leon Romanovsky wrote:
> > > On Fri, Nov 18, 2022 at 11:49:07AM +0100, Steffen Klassert wrote:
> > > > On Thu, Nov 17, 2022 at 02:51:33PM +0200, Leon Romanovsky wrote:
> > > > > On Thu, Nov 17, 2022 at 01:12:43PM +0100, Steffen Klassert wrote:
> > > > > > On Wed, Nov 09, 2022 at 02:54:34PM +0200, Leon Romanovsky wrote:
> > > > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > > 
> > > > > > So this raises the question how to handle acquires with this packet
> > > > > > offload. 
> > > > > 
> > > > > We handle acquires as SW policies and don't offload them.
> > > > 
> > > > We trigger acquires with states, not policies. The thing is,
> > > > we might match a HW policy but create a SW acquire state.
> > > > This will not match anymore as soon as the lookup is
> > > > implemented correctly.
> > > 
> > > For now, all such packets will be dropped as we have offlaoded
> > > policy but not SA.
> > 
> > I think you missed my point. If the HW policy does not match
> > the SW acquire state, then each packet will geneate a new
> > acquire. So you need to make sure that policy and acquire
> > state will match to send the acquire just once to userspace.
> 
> I think that I'm still missing the point.
> 
> We require both policy and SA to be offloaded. It means that once
> we hit HW policy, we must hit SA too (at least this is how mlx5 part
> is implemented).

Let's assume a packet hits a HW policy. Then this HW policy must match
a HW state. In case there is no matching HW state, we generate an acquire
and insert a larval state. Currently, larval states are never marked as HW.

Now, the next packet from the same flow maches again this HW policy,
but it does not find the larval state because it is not marked as
a HW state. So we generate another acquire and insert another
larval state. Same happens for packets 3,4,5...

Expected behaviour for subsequent packets is that the lookup will
find a matching HW larval state and the packet is dropped without
creating another acquire + larval state for the same flow.
Leon Romanovsky Nov. 21, 2022, 11:15 a.m. UTC | #8
On Mon, Nov 21, 2022 at 12:09:26PM +0100, Steffen Klassert wrote:
> On Mon, Nov 21, 2022 at 12:27:01PM +0200, Leon Romanovsky wrote:
> > On Mon, Nov 21, 2022 at 10:44:04AM +0100, Steffen Klassert wrote:
> > > On Sun, Nov 20, 2022 at 09:17:02PM +0200, Leon Romanovsky wrote:
> > > > On Fri, Nov 18, 2022 at 11:49:07AM +0100, Steffen Klassert wrote:
> > > > > On Thu, Nov 17, 2022 at 02:51:33PM +0200, Leon Romanovsky wrote:
> > > > > > On Thu, Nov 17, 2022 at 01:12:43PM +0100, Steffen Klassert wrote:
> > > > > > > On Wed, Nov 09, 2022 at 02:54:34PM +0200, Leon Romanovsky wrote:
> > > > > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > > > 
> > > > > > > So this raises the question how to handle acquires with this packet
> > > > > > > offload. 
> > > > > > 
> > > > > > We handle acquires as SW policies and don't offload them.
> > > > > 
> > > > > We trigger acquires with states, not policies. The thing is,
> > > > > we might match a HW policy but create a SW acquire state.
> > > > > This will not match anymore as soon as the lookup is
> > > > > implemented correctly.
> > > > 
> > > > For now, all such packets will be dropped as we have offlaoded
> > > > policy but not SA.
> > > 
> > > I think you missed my point. If the HW policy does not match
> > > the SW acquire state, then each packet will geneate a new
> > > acquire. So you need to make sure that policy and acquire
> > > state will match to send the acquire just once to userspace.
> > 
> > I think that I'm still missing the point.
> > 
> > We require both policy and SA to be offloaded. It means that once
> > we hit HW policy, we must hit SA too (at least this is how mlx5 part
> > is implemented).
> 
> Let's assume a packet hits a HW policy. Then this HW policy must match
> a HW state. In case there is no matching HW state, we generate an acquire
> and insert a larval state. Currently, larval states are never marked as HW.

And this is there our views are different. If HW (in RX) sees policy but
doesn't have state, this packet will be dropped in HW. It won't get to
stack and no acquire request will be issues.

> 
> Now, the next packet from the same flow maches again this HW policy,
> but it does not find the larval state because it is not marked as
> a HW state. So we generate another acquire and insert another
> larval state. Same happens for packets 3,4,5...
> 
> Expected behaviour for subsequent packets is that the lookup will
> find a matching HW larval state and the packet is dropped without
> creating another acquire + larval state for the same flow.
> 

This is why we don't support acquire for now as it will require mixing
HW and SW paths which we don't want for now.

Thanks
Steffen Klassert Nov. 21, 2022, 11:25 a.m. UTC | #9
On Mon, Nov 21, 2022 at 01:15:36PM +0200, Leon Romanovsky wrote:
> On Mon, Nov 21, 2022 at 12:09:26PM +0100, Steffen Klassert wrote:
> > On Mon, Nov 21, 2022 at 12:27:01PM +0200, Leon Romanovsky wrote:
> > > On Mon, Nov 21, 2022 at 10:44:04AM +0100, Steffen Klassert wrote:
> > > > On Sun, Nov 20, 2022 at 09:17:02PM +0200, Leon Romanovsky wrote:
> > > > > On Fri, Nov 18, 2022 at 11:49:07AM +0100, Steffen Klassert wrote:
> > > > > > On Thu, Nov 17, 2022 at 02:51:33PM +0200, Leon Romanovsky wrote:
> > > > > > > On Thu, Nov 17, 2022 at 01:12:43PM +0100, Steffen Klassert wrote:
> > > > > > > > On Wed, Nov 09, 2022 at 02:54:34PM +0200, Leon Romanovsky wrote:
> > > > > > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > > > > 
> > > > > > > > So this raises the question how to handle acquires with this packet
> > > > > > > > offload. 
> > > > > > > 
> > > > > > > We handle acquires as SW policies and don't offload them.
> > > > > > 
> > > > > > We trigger acquires with states, not policies. The thing is,
> > > > > > we might match a HW policy but create a SW acquire state.
> > > > > > This will not match anymore as soon as the lookup is
> > > > > > implemented correctly.
> > > > > 
> > > > > For now, all such packets will be dropped as we have offlaoded
> > > > > policy but not SA.
> > > > 
> > > > I think you missed my point. If the HW policy does not match
> > > > the SW acquire state, then each packet will geneate a new
> > > > acquire. So you need to make sure that policy and acquire
> > > > state will match to send the acquire just once to userspace.
> > > 
> > > I think that I'm still missing the point.
> > > 
> > > We require both policy and SA to be offloaded. It means that once
> > > we hit HW policy, we must hit SA too (at least this is how mlx5 part
> > > is implemented).
> > 
> > Let's assume a packet hits a HW policy. Then this HW policy must match
> > a HW state. In case there is no matching HW state, we generate an acquire
> > and insert a larval state. Currently, larval states are never marked as HW.
> 
> And this is there our views are different. If HW (in RX) sees policy but
> doesn't have state, this packet will be dropped in HW. It won't get to
> stack and no acquire request will be issues.

This makes no sense. Acquires are always generated at TX, never at RX.

On RX, the state lookup happens first, the policy must match to the
decapsulated packet.
Leon Romanovsky Nov. 21, 2022, 11:34 a.m. UTC | #10
On Mon, Nov 21, 2022 at 12:25:21PM +0100, Steffen Klassert wrote:
> On Mon, Nov 21, 2022 at 01:15:36PM +0200, Leon Romanovsky wrote:
> > On Mon, Nov 21, 2022 at 12:09:26PM +0100, Steffen Klassert wrote:
> > > On Mon, Nov 21, 2022 at 12:27:01PM +0200, Leon Romanovsky wrote:
> > > > On Mon, Nov 21, 2022 at 10:44:04AM +0100, Steffen Klassert wrote:
> > > > > On Sun, Nov 20, 2022 at 09:17:02PM +0200, Leon Romanovsky wrote:
> > > > > > On Fri, Nov 18, 2022 at 11:49:07AM +0100, Steffen Klassert wrote:
> > > > > > > On Thu, Nov 17, 2022 at 02:51:33PM +0200, Leon Romanovsky wrote:
> > > > > > > > On Thu, Nov 17, 2022 at 01:12:43PM +0100, Steffen Klassert wrote:
> > > > > > > > > On Wed, Nov 09, 2022 at 02:54:34PM +0200, Leon Romanovsky wrote:
> > > > > > > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > > > > > 
> > > > > > > > > So this raises the question how to handle acquires with this packet
> > > > > > > > > offload. 
> > > > > > > > 
> > > > > > > > We handle acquires as SW policies and don't offload them.
> > > > > > > 
> > > > > > > We trigger acquires with states, not policies. The thing is,
> > > > > > > we might match a HW policy but create a SW acquire state.
> > > > > > > This will not match anymore as soon as the lookup is
> > > > > > > implemented correctly.
> > > > > > 
> > > > > > For now, all such packets will be dropped as we have offlaoded
> > > > > > policy but not SA.
> > > > > 
> > > > > I think you missed my point. If the HW policy does not match
> > > > > the SW acquire state, then each packet will geneate a new
> > > > > acquire. So you need to make sure that policy and acquire
> > > > > state will match to send the acquire just once to userspace.
> > > > 
> > > > I think that I'm still missing the point.
> > > > 
> > > > We require both policy and SA to be offloaded. It means that once
> > > > we hit HW policy, we must hit SA too (at least this is how mlx5 part
> > > > is implemented).
> > > 
> > > Let's assume a packet hits a HW policy. Then this HW policy must match
> > > a HW state. In case there is no matching HW state, we generate an acquire
> > > and insert a larval state. Currently, larval states are never marked as HW.
> > 
> > And this is there our views are different. If HW (in RX) sees policy but
> > doesn't have state, this packet will be dropped in HW. It won't get to
> > stack and no acquire request will be issues.
> 
> This makes no sense. Acquires are always generated at TX, never at RX.

Sorry, my bad. But why can't we drop all packets that don't have HW
state? Why do we need to add larval?

> 
> On RX, the state lookup happens first, the policy must match to the
> decapsulated packet.
>
Leon Romanovsky Nov. 21, 2022, 12:02 p.m. UTC | #11
On Mon, Nov 21, 2022 at 01:34:30PM +0200, Leon Romanovsky wrote:
> On Mon, Nov 21, 2022 at 12:25:21PM +0100, Steffen Klassert wrote:
> > On Mon, Nov 21, 2022 at 01:15:36PM +0200, Leon Romanovsky wrote:
> > > On Mon, Nov 21, 2022 at 12:09:26PM +0100, Steffen Klassert wrote:
> > > > On Mon, Nov 21, 2022 at 12:27:01PM +0200, Leon Romanovsky wrote:
> > > > > On Mon, Nov 21, 2022 at 10:44:04AM +0100, Steffen Klassert wrote:
> > > > > > On Sun, Nov 20, 2022 at 09:17:02PM +0200, Leon Romanovsky wrote:
> > > > > > > On Fri, Nov 18, 2022 at 11:49:07AM +0100, Steffen Klassert wrote:
> > > > > > > > On Thu, Nov 17, 2022 at 02:51:33PM +0200, Leon Romanovsky wrote:
> > > > > > > > > On Thu, Nov 17, 2022 at 01:12:43PM +0100, Steffen Klassert wrote:
> > > > > > > > > > On Wed, Nov 09, 2022 at 02:54:34PM +0200, Leon Romanovsky wrote:
> > > > > > > > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > > > > > > 
> > > > > > > > > > So this raises the question how to handle acquires with this packet
> > > > > > > > > > offload. 
> > > > > > > > > 
> > > > > > > > > We handle acquires as SW policies and don't offload them.
> > > > > > > > 
> > > > > > > > We trigger acquires with states, not policies. The thing is,
> > > > > > > > we might match a HW policy but create a SW acquire state.
> > > > > > > > This will not match anymore as soon as the lookup is
> > > > > > > > implemented correctly.
> > > > > > > 
> > > > > > > For now, all such packets will be dropped as we have offlaoded
> > > > > > > policy but not SA.
> > > > > > 
> > > > > > I think you missed my point. If the HW policy does not match
> > > > > > the SW acquire state, then each packet will geneate a new
> > > > > > acquire. So you need to make sure that policy and acquire
> > > > > > state will match to send the acquire just once to userspace.
> > > > > 
> > > > > I think that I'm still missing the point.
> > > > > 
> > > > > We require both policy and SA to be offloaded. It means that once
> > > > > we hit HW policy, we must hit SA too (at least this is how mlx5 part
> > > > > is implemented).
> > > > 
> > > > Let's assume a packet hits a HW policy. Then this HW policy must match
> > > > a HW state. In case there is no matching HW state, we generate an acquire
> > > > and insert a larval state. Currently, larval states are never marked as HW.
> > > 
> > > And this is there our views are different. If HW (in RX) sees policy but
> > > doesn't have state, this packet will be dropped in HW. It won't get to
> > > stack and no acquire request will be issues.
> > 
> > This makes no sense. Acquires are always generated at TX, never at RX.
> 
> Sorry, my bad. But why can't we drop all packets that don't have HW
> state? Why do we need to add larval?

I think that something like this will do the trick.

diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 5076f9d7a752..d1c9ef857755 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -1090,6 +1090,28 @@ static void xfrm_state_look_at(struct xfrm_policy *pol, struct xfrm_state *x,
        }
 }

+static bool xfrm_state_and_policy_mixed(struct xfrm_state *x,
+                                       struct xfrm_policy *p)
+{
+       /* Packet offload: both policy and SA should be offloaded */
+       if (p->xdo.type == XFRM_DEV_OFFLOAD_PACKET &&
+           x->xso.type != XFRM_DEV_OFFLOAD_PACKET)
+               return true;
+
+       if (p->xdo.type != XFRM_DEV_OFFLOAD_PACKET &&
+           x->xso.type == XFRM_DEV_OFFLOAD_PACKET)
+               return true;
+
+       if (p->xdo.type != XFRM_DEV_OFFLOAD_PACKET)
+               return false;
+
+       /* Packet offload: both policy and SA should have same device */
+       if (p->xdo.dev != x->xso.dev)
+               return true;
+
+       return false;
+}
+
 struct xfrm_state *
 xfrm_state_find(const xfrm_address_t *daddr, const xfrm_address_t *saddr,
                const struct flowi *fl, struct xfrm_tmpl *tmpl,
@@ -1147,7 +1169,8 @@ xfrm_state_find(const xfrm_address_t *daddr, const xfrm_address_t *saddr,

 found:
        x = best;
-       if (!x && !error && !acquire_in_progress) {
+       if (!x && !error && !acquire_in_progress &&
+           pol->xdo.type != XFRM_DEV_OFFLOAD_PACKET) {
                if (tmpl->id.spi &&
                    (x0 = __xfrm_state_lookup(net, mark, daddr, tmpl->id.spi,
                                              tmpl->id.proto, encap_family)) != NULL) {
@@ -1228,7 +1251,14 @@ xfrm_state_find(const xfrm_address_t *daddr, const xfrm_address_t *saddr,
                        *err = -EAGAIN;
                        x = NULL;
                }
+               if (x && xfrm_state_and_policy_mixed(x, pol)) {
+                       *err = -EINVAL;
+                       x = NULL;
+               }
        } else {
+               if (pol->xdo.type == XFRM_DEV_OFFLOAD_PACKET)
+                       error = -EINVAL;
+
                *err = acquire_in_progress ? -EAGAIN : error;
        }
        rcu_read_unlock();
(END)


> 
> > 
> > On RX, the state lookup happens first, the policy must match to the
> > decapsulated packet.
> >
Steffen Klassert Nov. 21, 2022, 12:10 p.m. UTC | #12
On Mon, Nov 21, 2022 at 01:34:30PM +0200, Leon Romanovsky wrote:
> 
> Sorry, my bad. But why can't we drop all packets that don't have HW
> state? Why do we need to add larval?

The first packet of a flow tiggers an acquire and inserts a larval
state. On a traffic triggered connection, we need this to get
a state with keys installed.

We need this larval state then, because that tells us we sent already an
acquire to userspace. All subsequent packets of that flow will be
dropped without sending another acquire. Otherwise each subsequent
packet will generate another acquire until the keys are negotiated.
If a flow starts sending on a high rate, this would be not so nice
for userspace :)
Steffen Klassert Nov. 21, 2022, 12:43 p.m. UTC | #13
On Mon, Nov 21, 2022 at 02:02:52PM +0200, Leon Romanovsky wrote:
> 
> I think that something like this will do the trick.
> 
> diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
> index 5076f9d7a752..d1c9ef857755 100644
> --- a/net/xfrm/xfrm_state.c
> +++ b/net/xfrm/xfrm_state.c
> @@ -1090,6 +1090,28 @@ static void xfrm_state_look_at(struct xfrm_policy *pol, struct xfrm_state *x,
>         }
>  }
> 
> +static bool xfrm_state_and_policy_mixed(struct xfrm_state *x,
> +                                       struct xfrm_policy *p)
> +{
> +       /* Packet offload: both policy and SA should be offloaded */
> +       if (p->xdo.type == XFRM_DEV_OFFLOAD_PACKET &&
> +           x->xso.type != XFRM_DEV_OFFLOAD_PACKET)
> +               return true;
> +
> +       if (p->xdo.type != XFRM_DEV_OFFLOAD_PACKET &&
> +           x->xso.type == XFRM_DEV_OFFLOAD_PACKET)
> +               return true;
> +
> +       if (p->xdo.type != XFRM_DEV_OFFLOAD_PACKET)
> +               return false;
> +
> +       /* Packet offload: both policy and SA should have same device */
> +       if (p->xdo.dev != x->xso.dev)
> +               return true;
> +
> +       return false;
> +}
> +
>  struct xfrm_state *
>  xfrm_state_find(const xfrm_address_t *daddr, const xfrm_address_t *saddr,
>                 const struct flowi *fl, struct xfrm_tmpl *tmpl,
> @@ -1147,7 +1169,8 @@ xfrm_state_find(const xfrm_address_t *daddr, const xfrm_address_t *saddr,
> 
>  found:
>         x = best;
> -       if (!x && !error && !acquire_in_progress) {
> +       if (!x && !error && !acquire_in_progress &&
> +           pol->xdo.type != XFRM_DEV_OFFLOAD_PACKET) {
>                 if (tmpl->id.spi &&
>                     (x0 = __xfrm_state_lookup(net, mark, daddr, tmpl->id.spi,
>                                               tmpl->id.proto, encap_family)) != NULL) {
> @@ -1228,7 +1251,14 @@ xfrm_state_find(const xfrm_address_t *daddr, const xfrm_address_t *saddr,
>                         *err = -EAGAIN;
>                         x = NULL;
>                 }
> +               if (x && xfrm_state_and_policy_mixed(x, pol)) {
> +                       *err = -EINVAL;
> +                       x = NULL;

If policy and state do not match here, this means the lookup
returned the wrong state. The correct state might still sit
in the database. At this point, you should either have found
a matching state, or no state at all.
Leon Romanovsky Nov. 21, 2022, 1:01 p.m. UTC | #14
On Mon, Nov 21, 2022 at 01:43:49PM +0100, Steffen Klassert wrote:
> On Mon, Nov 21, 2022 at 02:02:52PM +0200, Leon Romanovsky wrote:
> > 
> > I think that something like this will do the trick.
> > 
> > diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
> > index 5076f9d7a752..d1c9ef857755 100644
> > --- a/net/xfrm/xfrm_state.c
> > +++ b/net/xfrm/xfrm_state.c
> > @@ -1090,6 +1090,28 @@ static void xfrm_state_look_at(struct xfrm_policy *pol, struct xfrm_state *x,
> >         }
> >  }
> > 
> > +static bool xfrm_state_and_policy_mixed(struct xfrm_state *x,
> > +                                       struct xfrm_policy *p)
> > +{
> > +       /* Packet offload: both policy and SA should be offloaded */
> > +       if (p->xdo.type == XFRM_DEV_OFFLOAD_PACKET &&
> > +           x->xso.type != XFRM_DEV_OFFLOAD_PACKET)
> > +               return true;
> > +
> > +       if (p->xdo.type != XFRM_DEV_OFFLOAD_PACKET &&
> > +           x->xso.type == XFRM_DEV_OFFLOAD_PACKET)
> > +               return true;
> > +
> > +       if (p->xdo.type != XFRM_DEV_OFFLOAD_PACKET)
> > +               return false;
> > +
> > +       /* Packet offload: both policy and SA should have same device */
> > +       if (p->xdo.dev != x->xso.dev)
> > +               return true;
> > +
> > +       return false;
> > +}
> > +
> >  struct xfrm_state *
> >  xfrm_state_find(const xfrm_address_t *daddr, const xfrm_address_t *saddr,
> >                 const struct flowi *fl, struct xfrm_tmpl *tmpl,
> > @@ -1147,7 +1169,8 @@ xfrm_state_find(const xfrm_address_t *daddr, const xfrm_address_t *saddr,
> > 
> >  found:
> >         x = best;
> > -       if (!x && !error && !acquire_in_progress) {
> > +       if (!x && !error && !acquire_in_progress &&
> > +           pol->xdo.type != XFRM_DEV_OFFLOAD_PACKET) {
> >                 if (tmpl->id.spi &&
> >                     (x0 = __xfrm_state_lookup(net, mark, daddr, tmpl->id.spi,
> >                                               tmpl->id.proto, encap_family)) != NULL) {
> > @@ -1228,7 +1251,14 @@ xfrm_state_find(const xfrm_address_t *daddr, const xfrm_address_t *saddr,
> >                         *err = -EAGAIN;
> >                         x = NULL;
> >                 }
> > +               if (x && xfrm_state_and_policy_mixed(x, pol)) {
> > +                       *err = -EINVAL;
> > +                       x = NULL;
> 
> If policy and state do not match here, this means the lookup
> returned the wrong state. The correct state might still sit
> in the database. At this point, you should either have found
> a matching state, or no state at all.

I check for "x" because of "x = NULL" above.

Thanks
Leon Romanovsky Nov. 21, 2022, 1:21 p.m. UTC | #15
On Mon, Nov 21, 2022 at 01:10:40PM +0100, Steffen Klassert wrote:
> On Mon, Nov 21, 2022 at 01:34:30PM +0200, Leon Romanovsky wrote:
> > 
> > Sorry, my bad. But why can't we drop all packets that don't have HW
> > state? Why do we need to add larval?
> 
> The first packet of a flow tiggers an acquire and inserts a larval
> state. On a traffic triggered connection, we need this to get
> a state with keys installed.
> 
> We need this larval state then, because that tells us we sent already an
> acquire to userspace. All subsequent packets of that flow will be
> dropped without sending another acquire. Otherwise each subsequent
> packet will generate another acquire until the keys are negotiated.
> If a flow starts sending on a high rate, this would be not so nice
> for userspace :)

The thing is that this SW acquire flow is a fraction case, as it applies
to locally generated traffic.

In my mind, there are other cases, like eswitch mode and tunnel mode. In
these cases, the packets are arrived to HW without even passing SW stack.

What we want to do is to catch in HW all TX packets, which don't have SAs
(applicable for all types of traffic), mark them and route back to the
driver. The driver will be responsible to talk with XFRM core to
generate acquire.

The same logic of rerouting packets is required for audit and will be done
later. Right now, we rely on *swan implementations which configure everything
in advance.

Also larval is default to 1 (drop) in all distros.

I hope that this larval/acquire is not must for this series to be merged.
And it is going to be implemented later as I'm assigned to work on this
offload feature till feature complete :).

Thanks
Herbert Xu Nov. 22, 2022, 4:29 a.m. UTC | #16
On Mon, Nov 21, 2022 at 03:21:45PM +0200, Leon Romanovsky wrote:
>
> The thing is that this SW acquire flow is a fraction case, as it applies
> to locally generated traffic.

A router can trigger an acquire on forwarded packets too.  Without
larvals this could quickly overwhelm the router.

Cheers,
Leon Romanovsky Nov. 22, 2022, 6:27 a.m. UTC | #17
On Tue, Nov 22, 2022 at 12:29:12PM +0800, Herbert Xu wrote:
> On Mon, Nov 21, 2022 at 03:21:45PM +0200, Leon Romanovsky wrote:
> >
> > The thing is that this SW acquire flow is a fraction case, as it applies
> > to locally generated traffic.
> 
> A router can trigger an acquire on forwarded packets too.  Without
> larvals this could quickly overwhelm the router.

This series doesn't support tunnel mode yet.

Maybe I was not clear, but I wanted to say what in eswitch case and
tunnel mode, the packets will be handled purely by HW without raising
into SW core.

It is so called transparent IPsec, where all configuration is done on
hypervisor, so VMs connected through eswitch will get already decrypted
traffic which is routed through eswitch NIC logic without passing
hypervisor data path.

Steffen expected to see changes to acquire logic as part of this series
and in my explanation, I tried to explain why it is not needed now and
how will it be implemented later.

Thanks

> 
> Cheers,
> -- 
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Steffen Klassert Nov. 22, 2022, 1 p.m. UTC | #18
On Tue, Nov 22, 2022 at 08:27:48AM +0200, Leon Romanovsky wrote:
> On Tue, Nov 22, 2022 at 12:29:12PM +0800, Herbert Xu wrote:
> > On Mon, Nov 21, 2022 at 03:21:45PM +0200, Leon Romanovsky wrote:
> > >
> > > The thing is that this SW acquire flow is a fraction case, as it applies
> > > to locally generated traffic.
> > 
> > A router can trigger an acquire on forwarded packets too.  Without
> > larvals this could quickly overwhelm the router.
> 
> This series doesn't support tunnel mode yet.

It does not matter if tunnel or transport mode, acquires must
work as expected. This is a fundamental concept of IPsec, there
is no way to tell userspace that we don't support this.

> Maybe I was not clear, but I wanted to say what in eswitch case and
> tunnel mode, the packets will be handled purely by HW without raising
> into SW core.

Can you please explain why we need host interaction for
transport, but not for tunnel mode?

Staying away with HW policies and states from SW databases is what
I wanted to have from the beginning. If that is possible for tunnel
mode, it should be possible for transport mode too.

And as said already, I want to see the full picture (transport
+ tunnel mode) before we merge it.
Steffen Klassert Nov. 22, 2022, 1:10 p.m. UTC | #19
On Mon, Nov 21, 2022 at 03:01:42PM +0200, Leon Romanovsky wrote:
> On Mon, Nov 21, 2022 at 01:43:49PM +0100, Steffen Klassert wrote:
> > On Mon, Nov 21, 2022 at 02:02:52PM +0200, Leon Romanovsky wrote:
> > > 
> > > I think that something like this will do the trick.
> > > 
> > > diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
> > > index 5076f9d7a752..d1c9ef857755 100644
> > > --- a/net/xfrm/xfrm_state.c
> > > +++ b/net/xfrm/xfrm_state.c
> > > @@ -1090,6 +1090,28 @@ static void xfrm_state_look_at(struct xfrm_policy *pol, struct xfrm_state *x,
> > >         }
> > >  }
> > > 
> > > +static bool xfrm_state_and_policy_mixed(struct xfrm_state *x,
> > > +                                       struct xfrm_policy *p)
> > > +{
> > > +       /* Packet offload: both policy and SA should be offloaded */
> > > +       if (p->xdo.type == XFRM_DEV_OFFLOAD_PACKET &&
> > > +           x->xso.type != XFRM_DEV_OFFLOAD_PACKET)
> > > +               return true;
> > > +
> > > +       if (p->xdo.type != XFRM_DEV_OFFLOAD_PACKET &&
> > > +           x->xso.type == XFRM_DEV_OFFLOAD_PACKET)
> > > +               return true;
> > > +
> > > +       if (p->xdo.type != XFRM_DEV_OFFLOAD_PACKET)
> > > +               return false;
> > > +
> > > +       /* Packet offload: both policy and SA should have same device */
> > > +       if (p->xdo.dev != x->xso.dev)
> > > +               return true;
> > > +
> > > +       return false;
> > > +}
> > > +
> > >  struct xfrm_state *
> > >  xfrm_state_find(const xfrm_address_t *daddr, const xfrm_address_t *saddr,
> > >                 const struct flowi *fl, struct xfrm_tmpl *tmpl,
> > > @@ -1147,7 +1169,8 @@ xfrm_state_find(const xfrm_address_t *daddr, const xfrm_address_t *saddr,
> > > 
> > >  found:
> > >         x = best;
> > > -       if (!x && !error && !acquire_in_progress) {
> > > +       if (!x && !error && !acquire_in_progress &&
> > > +           pol->xdo.type != XFRM_DEV_OFFLOAD_PACKET) {
> > >                 if (tmpl->id.spi &&
> > >                     (x0 = __xfrm_state_lookup(net, mark, daddr, tmpl->id.spi,
> > >                                               tmpl->id.proto, encap_family)) != NULL) {
> > > @@ -1228,7 +1251,14 @@ xfrm_state_find(const xfrm_address_t *daddr, const xfrm_address_t *saddr,
> > >                         *err = -EAGAIN;
> > >                         x = NULL;
> > >                 }
> > > +               if (x && xfrm_state_and_policy_mixed(x, pol)) {
> > > +                       *err = -EINVAL;
> > > +                       x = NULL;
> > 
> > If policy and state do not match here, this means the lookup
> > returned the wrong state. The correct state might still sit
> > in the database. At this point, you should either have found
> > a matching state, or no state at all.
> 
> I check for "x" because of "x = NULL" above.

This does not change the fact that the lookup returned the wrong state.
Leon Romanovsky Nov. 22, 2022, 1:54 p.m. UTC | #20
On Tue, Nov 22, 2022 at 02:00:02PM +0100, Steffen Klassert wrote:
> On Tue, Nov 22, 2022 at 08:27:48AM +0200, Leon Romanovsky wrote:
> > On Tue, Nov 22, 2022 at 12:29:12PM +0800, Herbert Xu wrote:
> > > On Mon, Nov 21, 2022 at 03:21:45PM +0200, Leon Romanovsky wrote:
> > > >
> > > > The thing is that this SW acquire flow is a fraction case, as it applies
> > > > to locally generated traffic.
> > > 
> > > A router can trigger an acquire on forwarded packets too.  Without
> > > larvals this could quickly overwhelm the router.
> > 
> > This series doesn't support tunnel mode yet.
> 
> It does not matter if tunnel or transport mode, acquires must
> work as expected. This is a fundamental concept of IPsec, there
> is no way to tell userspace that we don't support this.
> 
> > Maybe I was not clear, but I wanted to say what in eswitch case and
> > tunnel mode, the packets will be handled purely by HW without raising
> > into SW core.
> 
> Can you please explain why we need host interaction for
> transport, but not for tunnel mode?

The main difference is that in transport mode, you must bring packet
to the kernel in which you configured SA/policy. It means that we must
ensure that such packets won't be checked again in SW because all packets
(encrypted and not) pass XFRM logic.

 - wire -> RX NIC -> kernel -> XFRM stack (we need HW DB here to skip this stage) -> ....
 ... -> kernel -> XFRM stack (skip for HW SA/policies) -> TX NIC -> wire.

In tunnel mode, we arrive to XFRM when nothing IPsec related is configured.

 - wire -> RX PF NIC -> eswitch NIC logic -> TX uplink NIC -> RX
   representors -> XFRM stack in VM (nothing configured here) -> kernel

The most troublesome part is in TX path, where you must skip "double check"
before NIC. This check doesn't exist in tunnel mode.

In RX, there is also difference between modes due to how we are supposed
to treat headers.

Raed will add more details.

> 
> And as said already, I want to see the full picture (transport
> + tunnel mode) before we merge it.

It looks like we already have this picture.

Thanks
Leon Romanovsky Nov. 22, 2022, 1:57 p.m. UTC | #21
On Tue, Nov 22, 2022 at 02:10:02PM +0100, Steffen Klassert wrote:
> On Mon, Nov 21, 2022 at 03:01:42PM +0200, Leon Romanovsky wrote:
> > On Mon, Nov 21, 2022 at 01:43:49PM +0100, Steffen Klassert wrote:
> > > On Mon, Nov 21, 2022 at 02:02:52PM +0200, Leon Romanovsky wrote:
> > > > 
> > > > I think that something like this will do the trick.
> > > > 
> > > > diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
> > > > index 5076f9d7a752..d1c9ef857755 100644
> > > > --- a/net/xfrm/xfrm_state.c
> > > > +++ b/net/xfrm/xfrm_state.c
> > > > @@ -1090,6 +1090,28 @@ static void xfrm_state_look_at(struct xfrm_policy *pol, struct xfrm_state *x,
> > > >         }
> > > >  }
> > > > 
> > > > +static bool xfrm_state_and_policy_mixed(struct xfrm_state *x,
> > > > +                                       struct xfrm_policy *p)
> > > > +{
> > > > +       /* Packet offload: both policy and SA should be offloaded */
> > > > +       if (p->xdo.type == XFRM_DEV_OFFLOAD_PACKET &&
> > > > +           x->xso.type != XFRM_DEV_OFFLOAD_PACKET)
> > > > +               return true;
> > > > +
> > > > +       if (p->xdo.type != XFRM_DEV_OFFLOAD_PACKET &&
> > > > +           x->xso.type == XFRM_DEV_OFFLOAD_PACKET)
> > > > +               return true;
> > > > +
> > > > +       if (p->xdo.type != XFRM_DEV_OFFLOAD_PACKET)
> > > > +               return false;
> > > > +
> > > > +       /* Packet offload: both policy and SA should have same device */
> > > > +       if (p->xdo.dev != x->xso.dev)
> > > > +               return true;
> > > > +
> > > > +       return false;
> > > > +}
> > > > +
> > > >  struct xfrm_state *
> > > >  xfrm_state_find(const xfrm_address_t *daddr, const xfrm_address_t *saddr,
> > > >                 const struct flowi *fl, struct xfrm_tmpl *tmpl,
> > > > @@ -1147,7 +1169,8 @@ xfrm_state_find(const xfrm_address_t *daddr, const xfrm_address_t *saddr,
> > > > 
> > > >  found:
> > > >         x = best;
> > > > -       if (!x && !error && !acquire_in_progress) {
> > > > +       if (!x && !error && !acquire_in_progress &&
> > > > +           pol->xdo.type != XFRM_DEV_OFFLOAD_PACKET) {
> > > >                 if (tmpl->id.spi &&
> > > >                     (x0 = __xfrm_state_lookup(net, mark, daddr, tmpl->id.spi,
> > > >                                               tmpl->id.proto, encap_family)) != NULL) {
> > > > @@ -1228,7 +1251,14 @@ xfrm_state_find(const xfrm_address_t *daddr, const xfrm_address_t *saddr,
> > > >                         *err = -EAGAIN;
> > > >                         x = NULL;
> > > >                 }
> > > > +               if (x && xfrm_state_and_policy_mixed(x, pol)) {
> > > > +                       *err = -EINVAL;
> > > > +                       x = NULL;
> > > 
> > > If policy and state do not match here, this means the lookup
> > > returned the wrong state. The correct state might still sit
> > > in the database. At this point, you should either have found
> > > a matching state, or no state at all.
> > 
> > I check for "x" because of "x = NULL" above.
> 
> This does not change the fact that the lookup returned the wrong state.

Steffen, but this is exactly why we added this check - to catch wrong
states and configurations. 

If lookup was successful, we know that HW handles this packet, if lookup
failed to find SA and we have HW policy, we should drop such packet.

Thanks
Steffen Klassert Nov. 23, 2022, 8:23 a.m. UTC | #22
On Tue, Nov 22, 2022 at 03:54:42PM +0200, Leon Romanovsky wrote:
> On Tue, Nov 22, 2022 at 02:00:02PM +0100, Steffen Klassert wrote:
> > On Tue, Nov 22, 2022 at 08:27:48AM +0200, Leon Romanovsky wrote:
> > > On Tue, Nov 22, 2022 at 12:29:12PM +0800, Herbert Xu wrote:
> > > > On Mon, Nov 21, 2022 at 03:21:45PM +0200, Leon Romanovsky wrote:
> > 
> > Can you please explain why we need host interaction for
> > transport, but not for tunnel mode?
> 
> The main difference is that in transport mode, you must bring packet
> to the kernel in which you configured SA/policy. It means that we must
> ensure that such packets won't be checked again in SW because all packets
> (encrypted and not) pass XFRM logic.
> 
>  - wire -> RX NIC -> kernel -> XFRM stack (we need HW DB here to skip this stage) -> ....
>  ... -> kernel -> XFRM stack (skip for HW SA/policies) -> TX NIC -> wire.
> 
> In tunnel mode, we arrive to XFRM when nothing IPsec related is configured.
> 
>  - wire -> RX PF NIC -> eswitch NIC logic -> TX uplink NIC -> RX
>    representors -> XFRM stack in VM (nothing configured here) -> kernel

Forget about eswitch, VM, etc. for a moment. I'm interested how the
simplest possible tunnel mode cases will work.

Forwarding:

wire -> random NIC RX -> kernel -> IPsec tunnel offload NIC TX -> wire
wire -> IPsec tunnel offload NIC RX -> kernel -> random NIC TX -> wire

Local endpoints:

Application -> kernel -> IPsec tunnel offload NIC TX -> wire
wire -> IPsec tunnel offload NIC RX -> kernel -> Application

These two must work, so how are these cases handled?

If you can do more fancy things with tunnel mode and special NICs
at TX and RX, that's fine but not absolutely required.
Steffen Klassert Nov. 23, 2022, 8:37 a.m. UTC | #23
On Tue, Nov 22, 2022 at 03:57:43PM +0200, Leon Romanovsky wrote:
> On Tue, Nov 22, 2022 at 02:10:02PM +0100, Steffen Klassert wrote:
> > On Mon, Nov 21, 2022 at 03:01:42PM +0200, Leon Romanovsky wrote:
> > > On Mon, Nov 21, 2022 at 01:43:49PM +0100, Steffen Klassert wrote:
> > > > On Mon, Nov 21, 2022 at 02:02:52PM +0200, Leon Romanovsky wrote:
> > > > 
> > > > If policy and state do not match here, this means the lookup
> > > > returned the wrong state. The correct state might still sit
> > > > in the database. At this point, you should either have found
> > > > a matching state, or no state at all.
> > > 
> > > I check for "x" because of "x = NULL" above.
> > 
> > This does not change the fact that the lookup returned the wrong state.
> 
> Steffen, but this is exactly why we added this check - to catch wrong
> states and configurations. 

No, you have to adjust the lookup so that this can't happen.
This is not a missconfiguration, The lookup found the wrong
SA, this is a difference.

Use the offload type and dev as a lookup key and don't consider
SAs that don't match this in the lookup.

This is really not too hard to do. The thing that could be a bit
more difficult is that the lookup should be only adjusted when
we really have HW policies installed. Otherwise this affects
even systems that don't use this kind of offload.
Leon Romanovsky Nov. 23, 2022, 9:36 a.m. UTC | #24
On Wed, Nov 23, 2022 at 09:37:20AM +0100, Steffen Klassert wrote:
> On Tue, Nov 22, 2022 at 03:57:43PM +0200, Leon Romanovsky wrote:
> > On Tue, Nov 22, 2022 at 02:10:02PM +0100, Steffen Klassert wrote:
> > > On Mon, Nov 21, 2022 at 03:01:42PM +0200, Leon Romanovsky wrote:
> > > > On Mon, Nov 21, 2022 at 01:43:49PM +0100, Steffen Klassert wrote:
> > > > > On Mon, Nov 21, 2022 at 02:02:52PM +0200, Leon Romanovsky wrote:
> > > > > 
> > > > > If policy and state do not match here, this means the lookup
> > > > > returned the wrong state. The correct state might still sit
> > > > > in the database. At this point, you should either have found
> > > > > a matching state, or no state at all.
> > > > 
> > > > I check for "x" because of "x = NULL" above.
> > > 
> > > This does not change the fact that the lookup returned the wrong state.
> > 
> > Steffen, but this is exactly why we added this check - to catch wrong
> > states and configurations. 
> 
> No, you have to adjust the lookup so that this can't happen.
> This is not a missconfiguration, The lookup found the wrong
> SA, this is a difference.
> 
> Use the offload type and dev as a lookup key and don't consider
> SAs that don't match this in the lookup.
> 
> This is really not too hard to do. The thing that could be a bit
> more difficult is that the lookup should be only adjusted when
> we really have HW policies installed. Otherwise this affects
> even systems that don't use this kind of offload.

Thanks for an explanation, trying it now.
Leon Romanovsky Nov. 23, 2022, 10:25 a.m. UTC | #25
On Wed, Nov 23, 2022 at 09:23:58AM +0100, Steffen Klassert wrote:
> On Tue, Nov 22, 2022 at 03:54:42PM +0200, Leon Romanovsky wrote:
> > On Tue, Nov 22, 2022 at 02:00:02PM +0100, Steffen Klassert wrote:
> > > On Tue, Nov 22, 2022 at 08:27:48AM +0200, Leon Romanovsky wrote:
> > > > On Tue, Nov 22, 2022 at 12:29:12PM +0800, Herbert Xu wrote:
> > > > > On Mon, Nov 21, 2022 at 03:21:45PM +0200, Leon Romanovsky wrote:
> > > 
> > > Can you please explain why we need host interaction for
> > > transport, but not for tunnel mode?
> > 
> > The main difference is that in transport mode, you must bring packet
> > to the kernel in which you configured SA/policy. It means that we must
> > ensure that such packets won't be checked again in SW because all packets
> > (encrypted and not) pass XFRM logic.
> > 
> >  - wire -> RX NIC -> kernel -> XFRM stack (we need HW DB here to skip this stage) -> ....
> >  ... -> kernel -> XFRM stack (skip for HW SA/policies) -> TX NIC -> wire.
> > 
> > In tunnel mode, we arrive to XFRM when nothing IPsec related is configured.
> > 
> >  - wire -> RX PF NIC -> eswitch NIC logic -> TX uplink NIC -> RX
> >    representors -> XFRM stack in VM (nothing configured here) -> kernel
> 
> Forget about eswitch, VM, etc. for a moment. I'm interested how the
> simplest possible tunnel mode cases will work.
> 
> Forwarding:
> 
> wire -> random NIC RX -> kernel -> IPsec tunnel offload NIC TX -> wire
> wire -> IPsec tunnel offload NIC RX -> kernel -> random NIC TX -> wire
> 
> Local endpoints:
> 
> Application -> kernel -> IPsec tunnel offload NIC TX -> wire
> wire -> IPsec tunnel offload NIC RX -> kernel -> Application
> 
> These two must work, so how are these cases handled?

These two cases conceptually no different from transport modes.
The difference is how HW handles IP packets.

If packet comes from RX, it will be received as plain packet in the
kernel. If packet goes to TX, it must be skipped in the XFRM.

For all "wire -> IPsec tunnel offload NIC RX ...", everything works
as you would expect. HW handles everything, and feeds the kernel with
plain packet. These packets will have CRYPTO_DONE and status so they
can skip all XFRM logic.

All this complexity is For "... kernel -> IPsec tunnel offload NIC TX -> wire"
flow. You need a way to say to the kernel that XFRM should be skipped.


In TX path, we will need to perform neighbor resolution to fill proper
MAC address for outer IP header.
In RX path, once the packet is decrypted, there is a need to change MAC
address for the inner IP header. This will be done by kernel as HW
doesn't have such knowledge.

Of course, there are many possible implementations of how to have right
MAC address (static during SA creations, or dynamic if we listen to ARP
events), but it is not XFRM related.

Thanks

> 
> If you can do more fancy things with tunnel mode and special NICs
> at TX and RX, that's fine but not absolutely required.
Leon Romanovsky Nov. 23, 2022, 12:53 p.m. UTC | #26
On Wed, Nov 23, 2022 at 11:36:19AM +0200, Leon Romanovsky wrote:
> On Wed, Nov 23, 2022 at 09:37:20AM +0100, Steffen Klassert wrote:
> > On Tue, Nov 22, 2022 at 03:57:43PM +0200, Leon Romanovsky wrote:
> > > On Tue, Nov 22, 2022 at 02:10:02PM +0100, Steffen Klassert wrote:
> > > > On Mon, Nov 21, 2022 at 03:01:42PM +0200, Leon Romanovsky wrote:
> > > > > On Mon, Nov 21, 2022 at 01:43:49PM +0100, Steffen Klassert wrote:
> > > > > > On Mon, Nov 21, 2022 at 02:02:52PM +0200, Leon Romanovsky wrote:
> > > > > > 
> > > > > > If policy and state do not match here, this means the lookup
> > > > > > returned the wrong state. The correct state might still sit
> > > > > > in the database. At this point, you should either have found
> > > > > > a matching state, or no state at all.
> > > > > 
> > > > > I check for "x" because of "x = NULL" above.
> > > > 
> > > > This does not change the fact that the lookup returned the wrong state.
> > > 
> > > Steffen, but this is exactly why we added this check - to catch wrong
> > > states and configurations. 
> > 
> > No, you have to adjust the lookup so that this can't happen.
> > This is not a missconfiguration, The lookup found the wrong
> > SA, this is a difference.
> > 
> > Use the offload type and dev as a lookup key and don't consider
> > SAs that don't match this in the lookup.
> > 
> > This is really not too hard to do. The thing that could be a bit
> > more difficult is that the lookup should be only adjusted when
> > we really have HW policies installed. Otherwise this affects
> > even systems that don't use this kind of offload.
> 
> Thanks for an explanation, trying it now.

Something like that?

The code is untested yet.

diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 5076f9d7a752..5819023c32ba 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -1115,6 +1115,19 @@ xfrm_state_find(const xfrm_address_t *daddr, const xfrm_address_t *saddr,
 	rcu_read_lock();
 	h = xfrm_dst_hash(net, daddr, saddr, tmpl->reqid, encap_family);
 	hlist_for_each_entry_rcu(x, net->xfrm.state_bydst + h, bydst) {
+		if (IS_ENABLED(CONFIG_XFRM_OFFLOAD) &&
+		    pol->xdo.type == XFRM_DEV_OFFLOAD_PACKET) {
+			if (x->xso.type != XFRM_DEV_OFFLOAD_PACKET)
+				/* HW states are in the head of list, there is no need
+				 * to iterate further.
+				 */
+				break;
+
+			/* Packet offload: both policy and SA should have same device */
+			if (pol->xdo.dev != x->xso.dev)
+				continue;
+		}
+
 		if (x->props.family == encap_family &&
 		    x->props.reqid == tmpl->reqid &&
 		    (mark & x->mark.m) == x->mark.v &&
@@ -1132,6 +1145,19 @@ xfrm_state_find(const xfrm_address_t *daddr, const xfrm_address_t *saddr,
 
 	h_wildcard = xfrm_dst_hash(net, daddr, &saddr_wildcard, tmpl->reqid, encap_family);
 	hlist_for_each_entry_rcu(x, net->xfrm.state_bydst + h_wildcard, bydst) {
+		if (IS_ENABLED(CONFIG_XFRM_OFFLOAD) &&
+		    pol->xdo.type == XFRM_DEV_OFFLOAD_PACKET) {
+			if (x->xso.type != XFRM_DEV_OFFLOAD_PACKET)
+				/* HW states are in the head of list, there is no need
+				 * to iterate further.
+				 */
+				break;
+
+			/* Packet offload: both policy and SA should have same device */
+			if (pol->xdo.dev != x->xso.dev)
+				continue;
+		}
+
 		if (x->props.family == encap_family &&
 		    x->props.reqid == tmpl->reqid &&
 		    (mark & x->mark.m) == x->mark.v &&
@@ -1185,6 +1211,17 @@ xfrm_state_find(const xfrm_address_t *daddr, const xfrm_address_t *saddr,
 			goto out;
 		}
 
+		if (pol->xdo.type == XFRM_DEV_OFFLOAD_PACKET) {
+			memcpy(&x->xso, &pol->xdo, sizeof(x->xso));
+			error = pol->xdo.dev->xfrmdev_ops->xdo_dev_state_add(x);
+			if (error) {
+				x->km.state = XFRM_STATE_DEAD;
+				to_put = x;
+				x = NULL;
+				goto out;
+			}
+		}
+
 		if (km_query(x, tmpl, pol) == 0) {
 			spin_lock_bh(&net->xfrm.xfrm_state_lock);
 			x->km.state = XFRM_STATE_ACQ;
Steffen Klassert Nov. 24, 2022, 11:07 a.m. UTC | #27
On Wed, Nov 23, 2022 at 02:53:10PM +0200, Leon Romanovsky wrote:
> On Wed, Nov 23, 2022 at 11:36:19AM +0200, Leon Romanovsky wrote:
> > Thanks for an explanation, trying it now.
> 
> Something like that?

Yes :)

> 
> The code is untested yet.
> 
> diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
> index 5076f9d7a752..5819023c32ba 100644
> --- a/net/xfrm/xfrm_state.c
> +++ b/net/xfrm/xfrm_state.c
> @@ -1115,6 +1115,19 @@ xfrm_state_find(const xfrm_address_t *daddr, const xfrm_address_t *saddr,
>  	rcu_read_lock();
>  	h = xfrm_dst_hash(net, daddr, saddr, tmpl->reqid, encap_family);
>  	hlist_for_each_entry_rcu(x, net->xfrm.state_bydst + h, bydst) {
> +		if (IS_ENABLED(CONFIG_XFRM_OFFLOAD) &&
> +		    pol->xdo.type == XFRM_DEV_OFFLOAD_PACKET) {

Please try to avoid that check for every state in the list.
Maybe enable this code with a static key if packet offload
is used?

> +			if (x->xso.type != XFRM_DEV_OFFLOAD_PACKET)
> +				/* HW states are in the head of list, there is no need
> +				 * to iterate further.
> +				 */
> +				break;
> +
> +			/* Packet offload: both policy and SA should have same device */
> +			if (pol->xdo.dev != x->xso.dev)
> +				continue;
> +		}
> +
>  		if (x->props.family == encap_family &&
>  		    x->props.reqid == tmpl->reqid &&
>  		    (mark & x->mark.m) == x->mark.v &&
> @@ -1132,6 +1145,19 @@ xfrm_state_find(const xfrm_address_t *daddr, const xfrm_address_t *saddr,
>  
>  	h_wildcard = xfrm_dst_hash(net, daddr, &saddr_wildcard, tmpl->reqid, encap_family);
>  	hlist_for_each_entry_rcu(x, net->xfrm.state_bydst + h_wildcard, bydst) {
> +		if (IS_ENABLED(CONFIG_XFRM_OFFLOAD) &&
> +		    pol->xdo.type == XFRM_DEV_OFFLOAD_PACKET) {
> +			if (x->xso.type != XFRM_DEV_OFFLOAD_PACKET)
> +				/* HW states are in the head of list, there is no need
> +				 * to iterate further.
> +				 */
> +				break;
> +
> +			/* Packet offload: both policy and SA should have same device */
> +			if (pol->xdo.dev != x->xso.dev)
> +				continue;
> +		}
> +
>  		if (x->props.family == encap_family &&
>  		    x->props.reqid == tmpl->reqid &&
>  		    (mark & x->mark.m) == x->mark.v &&
> @@ -1185,6 +1211,17 @@ xfrm_state_find(const xfrm_address_t *daddr, const xfrm_address_t *saddr,
>  			goto out;
>  		}
>  
> +		if (pol->xdo.type == XFRM_DEV_OFFLOAD_PACKET) {
> +			memcpy(&x->xso, &pol->xdo, sizeof(x->xso));
> +			error = pol->xdo.dev->xfrmdev_ops->xdo_dev_state_add(x);
> +			if (error) {
> +				x->km.state = XFRM_STATE_DEAD;
> +				to_put = x;
> +				x = NULL;
> +				goto out;
> +			}
> +		}

I guess that is to handle acquires, right?
What is the idea behind that? xdo_dev_state_add sets
offload type and dev?
Leon Romanovsky Nov. 25, 2022, 6:23 a.m. UTC | #28
On Thu, Nov 24, 2022 at 12:07:48PM +0100, Steffen Klassert wrote:
> On Wed, Nov 23, 2022 at 02:53:10PM +0200, Leon Romanovsky wrote:
> > On Wed, Nov 23, 2022 at 11:36:19AM +0200, Leon Romanovsky wrote:
> > > Thanks for an explanation, trying it now.
> > 
> > Something like that?
> 
> Yes :)

Great, will send proper version on Sunday.

> 
> > 
> > The code is untested yet.
> > 
> > diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
> > index 5076f9d7a752..5819023c32ba 100644
> > --- a/net/xfrm/xfrm_state.c
> > +++ b/net/xfrm/xfrm_state.c
> > @@ -1115,6 +1115,19 @@ xfrm_state_find(const xfrm_address_t *daddr, const xfrm_address_t *saddr,
> >  	rcu_read_lock();
> >  	h = xfrm_dst_hash(net, daddr, saddr, tmpl->reqid, encap_family);
> >  	hlist_for_each_entry_rcu(x, net->xfrm.state_bydst + h, bydst) {
> > +		if (IS_ENABLED(CONFIG_XFRM_OFFLOAD) &&
> > +		    pol->xdo.type == XFRM_DEV_OFFLOAD_PACKET) {
> 
> Please try to avoid that check for every state in the list.
> Maybe enable this code with a static key if packet offload
> is used?

I assumed that it will be optimized by compiled because "pol->xdo.type ==
XFRM_DEV_OFFLOAD_PACKET)" is constant here. I'll take a look for more fancy
solutions, but I have serious doubts if they give any benefits.

> 
> > +			if (x->xso.type != XFRM_DEV_OFFLOAD_PACKET)
> > +				/* HW states are in the head of list, there is no need
> > +				 * to iterate further.
> > +				 */
> > +				break;
> > +
> > +			/* Packet offload: both policy and SA should have same device */
> > +			if (pol->xdo.dev != x->xso.dev)
> > +				continue;
> > +		}
> > +
> >  		if (x->props.family == encap_family &&
> >  		    x->props.reqid == tmpl->reqid &&
> >  		    (mark & x->mark.m) == x->mark.v &&
> > @@ -1132,6 +1145,19 @@ xfrm_state_find(const xfrm_address_t *daddr, const xfrm_address_t *saddr,
> >  
> >  	h_wildcard = xfrm_dst_hash(net, daddr, &saddr_wildcard, tmpl->reqid, encap_family);
> >  	hlist_for_each_entry_rcu(x, net->xfrm.state_bydst + h_wildcard, bydst) {
> > +		if (IS_ENABLED(CONFIG_XFRM_OFFLOAD) &&
> > +		    pol->xdo.type == XFRM_DEV_OFFLOAD_PACKET) {
> > +			if (x->xso.type != XFRM_DEV_OFFLOAD_PACKET)
> > +				/* HW states are in the head of list, there is no need
> > +				 * to iterate further.
> > +				 */
> > +				break;
> > +
> > +			/* Packet offload: both policy and SA should have same device */
> > +			if (pol->xdo.dev != x->xso.dev)
> > +				continue;
> > +		}
> > +
> >  		if (x->props.family == encap_family &&
> >  		    x->props.reqid == tmpl->reqid &&
> >  		    (mark & x->mark.m) == x->mark.v &&
> > @@ -1185,6 +1211,17 @@ xfrm_state_find(const xfrm_address_t *daddr, const xfrm_address_t *saddr,
> >  			goto out;
> >  		}
> >  
> > +		if (pol->xdo.type == XFRM_DEV_OFFLOAD_PACKET) {
> > +			memcpy(&x->xso, &pol->xdo, sizeof(x->xso));
> > +			error = pol->xdo.dev->xfrmdev_ops->xdo_dev_state_add(x);
> > +			if (error) {
> > +				x->km.state = XFRM_STATE_DEAD;
> > +				to_put = x;
> > +				x = NULL;
> > +				goto out;
> > +			}
> > +		}
> 
> I guess that is to handle acquires, right?

Yes

> What is the idea behind that?

In previous patches, I made sure that policy and SA uses same
struct xfrm_dev_offload and set fields exactly the same.

This line sets all properties::
memcpy(&x->xso, &pol->xdo, sizeof(x->xso));

And .xdo_dev_state_add() gets everything pre-configured.

But yes, it will be different in final patch to make sure that
offload_handle is cleared and dev_tracker is valid.

Thanks
diff mbox series

Patch

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 06226942a152..93a4a9149f8c 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -536,7 +536,7 @@  static void xfrm_dst_hash_transfer(struct net *net,
 		__get_hash_thresh(net, pol->family, dir, &dbits, &sbits);
 		h = __addr_hash(&pol->selector.daddr, &pol->selector.saddr,
 				pol->family, nhashmask, dbits, sbits);
-		if (!entry0) {
+		if (!entry0 || pol->xdo.type == XFRM_DEV_OFFLOAD_PACKET) {
 			hlist_del_rcu(&pol->bydst);
 			hlist_add_head_rcu(&pol->bydst, ndsttable + h);
 			h0 = h;
@@ -867,7 +867,7 @@  static void xfrm_policy_inexact_list_reinsert(struct net *net,
 				break;
 		}
 
-		if (newpos)
+		if (newpos && policy->xdo.type != XFRM_DEV_OFFLOAD_PACKET)
 			hlist_add_behind_rcu(&policy->bydst, newpos);
 		else
 			hlist_add_head_rcu(&policy->bydst, &n->hhead);
@@ -1348,7 +1348,7 @@  static void xfrm_hash_rebuild(struct work_struct *work)
 			else
 				break;
 		}
-		if (newpos)
+		if (newpos && policy->xdo.type != XFRM_DEV_OFFLOAD_PACKET)
 			hlist_add_behind_rcu(&policy->bydst, newpos);
 		else
 			hlist_add_head_rcu(&policy->bydst, chain);
@@ -1525,7 +1525,7 @@  static void xfrm_policy_insert_inexact_list(struct hlist_head *chain,
 			break;
 	}
 
-	if (newpos)
+	if (newpos && policy->xdo.type != XFRM_DEV_OFFLOAD_PACKET)
 		hlist_add_behind_rcu(&policy->bydst_inexact_list, newpos);
 	else
 		hlist_add_head_rcu(&policy->bydst_inexact_list, chain);
@@ -1562,9 +1562,12 @@  static struct xfrm_policy *xfrm_policy_insert_list(struct hlist_head *chain,
 			break;
 	}
 
-	if (newpos)
+	if (newpos && policy->xdo.type != XFRM_DEV_OFFLOAD_PACKET)
 		hlist_add_behind_rcu(&policy->bydst, &newpos->bydst);
 	else
+		/* Packet offload policies enter to the head
+		 * to speed-up lookups.
+		 */
 		hlist_add_head_rcu(&policy->bydst, chain);
 
 	return delpol;
@@ -2181,6 +2184,9 @@  static struct xfrm_policy *xfrm_policy_lookup_bytype(struct net *net, u8 type,
 			break;
 		}
 	}
+	if (ret && ret->xdo.type == XFRM_DEV_OFFLOAD_PACKET)
+		goto skip_inexact;
+
 	bin = xfrm_policy_inexact_lookup_rcu(net, type, family, dir, if_id);
 	if (!bin || !xfrm_policy_find_inexact_candidates(&cand, bin, saddr,
 							 daddr))
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 3d2fe7712ac5..cfc8c72b173d 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -84,6 +84,25 @@  static unsigned int xfrm_seq_hash(struct net *net, u32 seq)
 	return __xfrm_seq_hash(seq, net->xfrm.state_hmask);
 }
 
+#define XFRM_STATE_INSERT(by, _n, _h, _type)                               \
+	{                                                                  \
+		struct xfrm_state *_x = NULL;                              \
+									   \
+		if (_type != XFRM_DEV_OFFLOAD_PACKET) {                    \
+			hlist_for_each_entry_rcu(_x, _h, by) {             \
+				if (_x->xso.type == XFRM_DEV_OFFLOAD_PACKET) \
+					continue;                          \
+				break;                                     \
+			}                                                  \
+		}                                                          \
+									   \
+		if (!_x || _x->xso.type == XFRM_DEV_OFFLOAD_PACKET)        \
+			/* SAD is empty or consist from HW SAs only */     \
+			hlist_add_head_rcu(_n, _h);                        \
+		else                                                       \
+			hlist_add_before_rcu(_n, &_x->by);                 \
+	}
+
 static void xfrm_hash_transfer(struct hlist_head *list,
 			       struct hlist_head *ndsttable,
 			       struct hlist_head *nsrctable,
@@ -100,23 +119,25 @@  static void xfrm_hash_transfer(struct hlist_head *list,
 		h = __xfrm_dst_hash(&x->id.daddr, &x->props.saddr,
 				    x->props.reqid, x->props.family,
 				    nhashmask);
-		hlist_add_head_rcu(&x->bydst, ndsttable + h);
+		XFRM_STATE_INSERT(bydst, &x->bydst, ndsttable + h, x->xso.type);
 
 		h = __xfrm_src_hash(&x->id.daddr, &x->props.saddr,
 				    x->props.family,
 				    nhashmask);
-		hlist_add_head_rcu(&x->bysrc, nsrctable + h);
+		XFRM_STATE_INSERT(bysrc, &x->bysrc, nsrctable + h, x->xso.type);
 
 		if (x->id.spi) {
 			h = __xfrm_spi_hash(&x->id.daddr, x->id.spi,
 					    x->id.proto, x->props.family,
 					    nhashmask);
-			hlist_add_head_rcu(&x->byspi, nspitable + h);
+			XFRM_STATE_INSERT(byspi, &x->byspi, nspitable + h,
+					  x->xso.type);
 		}
 
 		if (x->km.seq) {
 			h = __xfrm_seq_hash(x->km.seq, nhashmask);
-			hlist_add_head_rcu(&x->byseq, nseqtable + h);
+			XFRM_STATE_INSERT(byseq, &x->byseq, nseqtable + h,
+					  x->xso.type);
 		}
 	}
 }
@@ -1166,16 +1187,24 @@  xfrm_state_find(const xfrm_address_t *daddr, const xfrm_address_t *saddr,
 			spin_lock_bh(&net->xfrm.xfrm_state_lock);
 			x->km.state = XFRM_STATE_ACQ;
 			list_add(&x->km.all, &net->xfrm.state_all);
-			hlist_add_head_rcu(&x->bydst, net->xfrm.state_bydst + h);
+			XFRM_STATE_INSERT(bydst, &x->bydst,
+					  net->xfrm.state_bydst + h,
+					  x->xso.type);
 			h = xfrm_src_hash(net, daddr, saddr, encap_family);
-			hlist_add_head_rcu(&x->bysrc, net->xfrm.state_bysrc + h);
+			XFRM_STATE_INSERT(bysrc, &x->bysrc,
+					  net->xfrm.state_bysrc + h,
+					  x->xso.type);
 			if (x->id.spi) {
 				h = xfrm_spi_hash(net, &x->id.daddr, x->id.spi, x->id.proto, encap_family);
-				hlist_add_head_rcu(&x->byspi, net->xfrm.state_byspi + h);
+				XFRM_STATE_INSERT(byspi, &x->byspi,
+						  net->xfrm.state_byspi + h,
+						  x->xso.type);
 			}
 			if (x->km.seq) {
 				h = xfrm_seq_hash(net, x->km.seq);
-				hlist_add_head_rcu(&x->byseq, net->xfrm.state_byseq + h);
+				XFRM_STATE_INSERT(byseq, &x->byseq,
+						  net->xfrm.state_byseq + h,
+						  x->xso.type);
 			}
 			x->lft.hard_add_expires_seconds = net->xfrm.sysctl_acq_expires;
 			hrtimer_start(&x->mtimer,
@@ -1280,22 +1309,26 @@  static void __xfrm_state_insert(struct xfrm_state *x)
 
 	h = xfrm_dst_hash(net, &x->id.daddr, &x->props.saddr,
 			  x->props.reqid, x->props.family);
-	hlist_add_head_rcu(&x->bydst, net->xfrm.state_bydst + h);
+	XFRM_STATE_INSERT(bydst, &x->bydst, net->xfrm.state_bydst + h,
+			  x->xso.type);
 
 	h = xfrm_src_hash(net, &x->id.daddr, &x->props.saddr, x->props.family);
-	hlist_add_head_rcu(&x->bysrc, net->xfrm.state_bysrc + h);
+	XFRM_STATE_INSERT(bysrc, &x->bysrc, net->xfrm.state_bysrc + h,
+			  x->xso.type);
 
 	if (x->id.spi) {
 		h = xfrm_spi_hash(net, &x->id.daddr, x->id.spi, x->id.proto,
 				  x->props.family);
 
-		hlist_add_head_rcu(&x->byspi, net->xfrm.state_byspi + h);
+		XFRM_STATE_INSERT(byspi, &x->byspi, net->xfrm.state_byspi + h,
+				  x->xso.type);
 	}
 
 	if (x->km.seq) {
 		h = xfrm_seq_hash(net, x->km.seq);
 
-		hlist_add_head_rcu(&x->byseq, net->xfrm.state_byseq + h);
+		XFRM_STATE_INSERT(byseq, &x->byseq, net->xfrm.state_byseq + h,
+				  x->xso.type);
 	}
 
 	hrtimer_start(&x->mtimer, ktime_set(1, 0), HRTIMER_MODE_REL_SOFT);
@@ -1409,9 +1442,11 @@  static struct xfrm_state *__find_acq_core(struct net *net,
 			      ktime_set(net->xfrm.sysctl_acq_expires, 0),
 			      HRTIMER_MODE_REL_SOFT);
 		list_add(&x->km.all, &net->xfrm.state_all);
-		hlist_add_head_rcu(&x->bydst, net->xfrm.state_bydst + h);
+		XFRM_STATE_INSERT(bydst, &x->bydst, net->xfrm.state_bydst + h,
+				  x->xso.type);
 		h = xfrm_src_hash(net, daddr, saddr, family);
-		hlist_add_head_rcu(&x->bysrc, net->xfrm.state_bysrc + h);
+		XFRM_STATE_INSERT(bysrc, &x->bysrc, net->xfrm.state_bysrc + h,
+				  x->xso.type);
 
 		net->xfrm.state_num++;
 
@@ -2085,7 +2120,8 @@  int xfrm_alloc_spi(struct xfrm_state *x, u32 low, u32 high)
 		spin_lock_bh(&net->xfrm.xfrm_state_lock);
 		x->id.spi = newspi;
 		h = xfrm_spi_hash(net, &x->id.daddr, x->id.spi, x->id.proto, x->props.family);
-		hlist_add_head_rcu(&x->byspi, net->xfrm.state_byspi + h);
+		XFRM_STATE_INSERT(byspi, &x->byspi, net->xfrm.state_byspi + h,
+				  x->xso.type);
 		spin_unlock_bh(&net->xfrm.xfrm_state_lock);
 
 		err = 0;