diff mbox series

[RFC,xfrm-next,v3,6/8] xfrm: enforce separation between priorities of HW/SW policies

Message ID 1b9d865971972a63eaa2c076afd71743952bd3c8.1662295929.git.leonro@nvidia.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Headers show
Series Extend XFRM core to allow full offload configuration | expand

Checks

Context Check Description
netdev/tree_selection success Guessed tree name to be net-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 4610 this patch: 4610
netdev/cc_maintainers warning 2 maintainers not CCed: antony.antony@secunet.com nicolas.dichtel@6wind.com
netdev/build_clang success Errors and warnings before: 1117 this patch: 1117
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 4768 this patch: 4768
netdev/checkpatch warning WARNING: line length of 81 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Leon Romanovsky Sept. 4, 2022, 1:15 p.m. UTC
From: Leon Romanovsky <leonro@nvidia.com>

Devices that implement IPsec full offload mode offload policies too.
In RX path, it causes to the situation that HW can't effectively handle
mixed SW and HW priorities unless users make sure that HW offloaded
policies have higher priorities.

In order to make sure that users have coherent picture, let's require
that HW offloaded policies have always (both RX and TX) higher priorities
than SW ones.

To do not over engineer the code, HW policies are treated as SW ones and
don't take into account netdev to allow reuse of same priorities for
different devices.

Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/net/netns/xfrm.h |   8 ++-
 net/xfrm/xfrm_policy.c   | 113 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 120 insertions(+), 1 deletion(-)

Comments

Steffen Klassert Sept. 25, 2022, 9:34 a.m. UTC | #1
On Sun, Sep 04, 2022 at 04:15:40PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Devices that implement IPsec full offload mode offload policies too.
> In RX path, it causes to the situation that HW can't effectively handle
> mixed SW and HW priorities unless users make sure that HW offloaded
> policies have higher priorities.
> 
> In order to make sure that users have coherent picture, let's require
> that HW offloaded policies have always (both RX and TX) higher priorities
> than SW ones.
> 
> To do not over engineer the code, HW policies are treated as SW ones and
> don't take into account netdev to allow reuse of same priorities for
> different devices.

I think we should split HW and SW SPD (and maybe even SAD) and priorize
over the SPDs instead of doing that in one single SPD. Each NIC should
maintain its own databases and we should do the lookups from SW with
a callback. With the current approach, we still do the costly full
policy and state lookup on the TX side in software. On a 'full offload'
that should happen in HW too. Also, that will make things easier with
tunnel mode whre we can have overlapping traffic selectors.

We can keep a HW SPD in software as a fallback for devices that don't
support the offloaded lookup, but on the long run lookups for the  RX
anf TX path should happen in HW.
Leon Romanovsky Sept. 26, 2022, 6:38 a.m. UTC | #2
On Sun, Sep 25, 2022 at 11:34:54AM +0200, Steffen Klassert wrote:
> On Sun, Sep 04, 2022 at 04:15:40PM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > Devices that implement IPsec full offload mode offload policies too.
> > In RX path, it causes to the situation that HW can't effectively handle
> > mixed SW and HW priorities unless users make sure that HW offloaded
> > policies have higher priorities.
> > 
> > In order to make sure that users have coherent picture, let's require
> > that HW offloaded policies have always (both RX and TX) higher priorities
> > than SW ones.
> > 
> > To do not over engineer the code, HW policies are treated as SW ones and
> > don't take into account netdev to allow reuse of same priorities for
> > different devices.
> 
> I think we should split HW and SW SPD (and maybe even SAD) and priorize
> over the SPDs instead of doing that in one single SPD. Each NIC should
> maintain its own databases and we should do the lookups from SW with
> a callback. 

I don't understand how will it work and scale.

Every packet needs to be classified if it is in offloaded path or not.
To ensure that, we will need to have two identifiers: targeted device
(part of skb, so ok) and relevant SP/SA.

The latter is needed to make sure that we perform lookup only on
offloaded SP/SA. It means that we will do lookup twice now. First
to see if SP/SA is offloaded and second to see if it in HW.

HW lookup is also misleading name, as the lookup will be in driver code
in very similar way to how SADB managed for crypto mode. It makes no
sense to convert data from XFRM representation to HW format, execute in
HW and convert returned result back. It will be also slow because lookup
of SP/SA based on XFRM properties is not datapath.

In any case, you will have double lookup. You will need to query XFRM
core DBs and later call to driver DB or vice versa.

Unless you want to perform HW lookup always without relation to SP/SA
state and hurt performance for non-offloaded path.

> With the current approach, we still do the costly full
> policy and state lookup on the TX side in software. On a 'full offload'
> that should happen in HW too.

In proposed approach, you have only one lookup which is better than two.
I'm not even talking about "quality" of driver lookups implementations.

> Also, that will make things easier with tunnel mode whre we can have overlapping
> traffic selectors.

Can we please put tunnel mode aside? It is a long journey.

> 
> We can keep a HW SPD in software as a fallback for devices that don't
> support the offloaded lookup, but on the long run lookups for the  RX
> anf TX path should happen in HW.

I doubt about it.

>
Steffen Klassert Sept. 27, 2022, 5:48 a.m. UTC | #3
On Mon, Sep 26, 2022 at 09:38:10AM +0300, Leon Romanovsky wrote:
> On Sun, Sep 25, 2022 at 11:34:54AM +0200, Steffen Klassert wrote:
> > On Sun, Sep 04, 2022 at 04:15:40PM +0300, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <leonro@nvidia.com>
> > > 
> > > Devices that implement IPsec full offload mode offload policies too.
> > > In RX path, it causes to the situation that HW can't effectively handle
> > > mixed SW and HW priorities unless users make sure that HW offloaded
> > > policies have higher priorities.
> > > 
> > > In order to make sure that users have coherent picture, let's require
> > > that HW offloaded policies have always (both RX and TX) higher priorities
> > > than SW ones.
> > > 
> > > To do not over engineer the code, HW policies are treated as SW ones and
> > > don't take into account netdev to allow reuse of same priorities for
> > > different devices.
> > 
> > I think we should split HW and SW SPD (and maybe even SAD) and priorize
> > over the SPDs instead of doing that in one single SPD. Each NIC should
> > maintain its own databases and we should do the lookups from SW with
> > a callback. 
> 
> I don't understand how will it work and scale.

That is rather easy. HW offload devices register their databases
at the xfrm layer with a certain priority higher than the one
of the SW databases. The lookup will happen consecutively based
on the database priorities. If there are no HW databases are
registered everything is like it was before. It gives us a clear
separation between HW and SW.

This has the advantage that you don't need to mess with policy
priorites. User can keep the priorites as they were before. This
is in particular important because usually the IKE daemon chosses
the priorities based on some heuristics.

The HW offload has also the advantage that we don't need to
search through all SW policies and states in that case.

> Every packet needs to be classified if it is in offloaded path or not.
> To ensure that, we will need to have two identifiers: targeted device
> (part of skb, so ok) and relevant SP/SA.
> 
> The latter is needed to make sure that we perform lookup only on
> offloaded SP/SA. It means that we will do lookup twice now. First
> to see if SP/SA is offloaded and second to see if it in HW.

I think you did not get my point, see above.

> HW lookup is also misleading name, as the lookup will be in driver code
> in very similar way to how SADB managed for crypto mode.

No, HW lookup means we do the lookup in the hardware and return
the result to software. The whole point of a 'full offload' is
to get rid of the costly SW database lookups.

> It makes no
> sense to convert data from XFRM representation to HW format, execute in
> HW and convert returned result back. It will be also slow because lookup
> of SP/SA based on XFRM properties is not datapath.

In case the HW can't do the lookup itself or is considered to be slower
than in software, a separated database for HW offload devices can be
maintained.

> In any case, you will have double lookup. You will need to query XFRM
> core DBs and later call to driver DB or vice versa.
> 
> Unless you want to perform HW lookup always without relation to SP/SA
> state and hurt performance for non-offloaded path.
> 
> > With the current approach, we still do the costly full
> > policy and state lookup on the TX side in software. On a 'full offload'
> > that should happen in HW too.
> 
> In proposed approach, you have only one lookup which is better than two.

In your approach, you still do the lookups in SW. This is not a problem
if you have just a handfull policies and SAs, but is problematic when
you have 100K policies and SAs. Even if the current HW can not support
this, we need to make sure our design allows for it.

> I'm not even talking about "quality" of driver lookups implementations.

You don't need to to reimplement the lookups, you just have to create an
additional database.

> 
> > Also, that will make things easier with tunnel mode whre we can have overlapping
> > traffic selectors.
> 
> Can we please put tunnel mode aside? It is a long journey.

No, we can't. A good design should include transport and tunnel mode.
I don't want to change everyting later just because we notice then that
it does not work at all for tunnel mode.
Leon Romanovsky Sept. 27, 2022, 10:21 a.m. UTC | #4
On Tue, Sep 27, 2022 at 07:48:38AM +0200, Steffen Klassert wrote:
> On Mon, Sep 26, 2022 at 09:38:10AM +0300, Leon Romanovsky wrote:
> > On Sun, Sep 25, 2022 at 11:34:54AM +0200, Steffen Klassert wrote:
> > > On Sun, Sep 04, 2022 at 04:15:40PM +0300, Leon Romanovsky wrote:
> > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > 
> > > > Devices that implement IPsec full offload mode offload policies too.
> > > > In RX path, it causes to the situation that HW can't effectively handle
> > > > mixed SW and HW priorities unless users make sure that HW offloaded
> > > > policies have higher priorities.
> > > > 
> > > > In order to make sure that users have coherent picture, let's require
> > > > that HW offloaded policies have always (both RX and TX) higher priorities
> > > > than SW ones.
> > > > 
> > > > To do not over engineer the code, HW policies are treated as SW ones and
> > > > don't take into account netdev to allow reuse of same priorities for
> > > > different devices.
> > > 
> > > I think we should split HW and SW SPD (and maybe even SAD) and priorize
> > > over the SPDs instead of doing that in one single SPD. Each NIC should
> > > maintain its own databases and we should do the lookups from SW with
> > > a callback. 
> > 
> > I don't understand how will it work and scale.
> 
> That is rather easy. HW offload devices register their databases
> at the xfrm layer with a certain priority higher than the one
> of the SW databases. The lookup will happen consecutively based
> on the database priorities. If there are no HW databases are
> registered everything is like it was before. It gives us a clear
> separation between HW and SW.
> 
> This has the advantage that you don't need to mess with policy
> priorites. User can keep the priorites as they were before. This
> is in particular important because usually the IKE daemon chosses
> the priorities based on some heuristics.
> 
> The HW offload has also the advantage that we don't need to
> search through all SW policies and states in that case.

And disadvantage for SW policies, because once you register HW DB, you
will first lookup there, won't find and fallback to perform another
lookup in SW.

<...>

> > It makes no
> > sense to convert data from XFRM representation to HW format, execute in
> > HW and convert returned result back. It will be also slow because lookup
> > of SP/SA based on XFRM properties is not datapath.
> 
> In case the HW can't do the lookup itself or is considered to be slower
> than in software, a separated database for HW offload devices can be
> maintained.

ok, this is my case.
I'll try to see what I can do here.

Thanks
diff mbox series

Patch

diff --git a/include/net/netns/xfrm.h b/include/net/netns/xfrm.h
index bd7c3be4af5d..f0cfa0faf611 100644
--- a/include/net/netns/xfrm.h
+++ b/include/net/netns/xfrm.h
@@ -29,6 +29,11 @@  struct xfrm_policy_hthresh {
 	u8			rbits6;
 };
 
+struct xfrm_policy_prio {
+	u32 max_sw_prio;
+	u32 min_hw_prio;
+};
+
 struct netns_xfrm {
 	struct list_head	state_all;
 	/*
@@ -52,7 +57,7 @@  struct netns_xfrm {
 	unsigned int		policy_idx_hmask;
 	struct hlist_head	policy_inexact[XFRM_POLICY_MAX];
 	struct xfrm_policy_hash	policy_bydst[XFRM_POLICY_MAX];
-	unsigned int		policy_count[XFRM_POLICY_MAX * 2];
+	unsigned int		policy_count[XFRM_POLICY_MAX * 3];
 	struct work_struct	policy_hash_work;
 	struct xfrm_policy_hthresh policy_hthresh;
 	struct list_head	inexact_bins;
@@ -67,6 +72,7 @@  struct netns_xfrm {
 	u32			sysctl_acq_expires;
 
 	u8			policy_default[XFRM_POLICY_MAX];
+	struct xfrm_policy_prio	policy_prio[XFRM_POLICY_MAX];
 
 #ifdef CONFIG_SYSCTL
 	struct ctl_table_header	*sysctl_hdr;
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 56ccedbb7212..69ef85424d4b 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1570,13 +1570,70 @@  static struct xfrm_policy *xfrm_policy_insert_list(struct hlist_head *chain,
 	return delpol;
 }
 
+static int __xfrm_policy_check_hw_priority(struct net *net,
+					   struct xfrm_policy *policy, int dir)
+{
+	int left, right;
+
+	lockdep_assert_held(&net->xfrm.xfrm_policy_lock);
+
+	if (!net->xfrm.policy_count[dir])
+		/* Adding first policy */
+		return 0;
+
+	if (policy->xdo.type != XFRM_DEV_OFFLOAD_FULL) {
+		/* SW priority */
+		if (!net->xfrm.policy_count[2 * XFRM_POLICY_MAX + dir])
+			/* Special case to allow reuse maximum priority
+			 * (U32_MAX) for SW policies, when no HW policy exist.
+			 */
+			return 0;
+
+		left = policy->priority;
+		right = net->xfrm.policy_prio[dir].min_hw_prio;
+	} else {
+		/* HW priority */
+		left = net->xfrm.policy_prio[dir].max_sw_prio;
+		right = policy->priority;
+	}
+	if (left >= right)
+		return -EINVAL;
+
+	return 0;
+}
+
+static void __xfrm_policy_update_hw_priority(struct net *net,
+					     struct xfrm_policy *policy,
+					     int dir)
+{
+	u32 *hw_prio, *sw_prio;
+
+	lockdep_assert_held(&net->xfrm.xfrm_policy_lock);
+
+	if (policy->xdo.type != XFRM_DEV_OFFLOAD_FULL) {
+		sw_prio = &net->xfrm.policy_prio[dir].max_sw_prio;
+		*sw_prio = max(*sw_prio, policy->priority);
+		return;
+	}
+
+	hw_prio = &net->xfrm.policy_prio[dir].min_hw_prio;
+	*hw_prio = min(*hw_prio, policy->priority);
+}
+
 int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl)
 {
 	struct net *net = xp_net(policy);
 	struct xfrm_policy *delpol;
 	struct hlist_head *chain;
+	int ret;
 
 	spin_lock_bh(&net->xfrm.xfrm_policy_lock);
+	ret = __xfrm_policy_check_hw_priority(net, policy, dir);
+	if (ret) {
+		spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
+		return ret;
+	}
+
 	chain = policy_hash_bysel(net, &policy->selector, policy->family, dir);
 	if (chain)
 		delpol = xfrm_policy_insert_list(chain, policy, excl);
@@ -1606,6 +1663,7 @@  int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl)
 	policy->curlft.use_time = 0;
 	if (!mod_timer(&policy->timer, jiffies + HZ))
 		xfrm_pol_hold(policy);
+	__xfrm_policy_update_hw_priority(net, policy, dir);
 	spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
 
 	if (delpol)
@@ -2271,6 +2329,8 @@  static void __xfrm_policy_link(struct xfrm_policy *pol, int dir)
 
 	list_add(&pol->walk.all, &net->xfrm.policy_all);
 	net->xfrm.policy_count[dir]++;
+	if (pol->xdo.type == XFRM_DEV_OFFLOAD_FULL)
+		net->xfrm.policy_count[2 * XFRM_POLICY_MAX + dir]++;
 	xfrm_pol_hold(pol);
 }
 
@@ -2290,6 +2350,8 @@  static struct xfrm_policy *__xfrm_policy_unlink(struct xfrm_policy *pol,
 	}
 
 	list_del_init(&pol->walk.all);
+	if (pol->xdo.type == XFRM_DEV_OFFLOAD_FULL)
+		net->xfrm.policy_count[2 * XFRM_POLICY_MAX + dir]--;
 	net->xfrm.policy_count[dir]--;
 
 	return pol;
@@ -2305,12 +2367,58 @@  static void xfrm_sk_policy_unlink(struct xfrm_policy *pol, int dir)
 	__xfrm_policy_unlink(pol, XFRM_POLICY_MAX + dir);
 }
 
+static void __xfrm_policy_delete_prio(struct net *net,
+				      struct xfrm_policy *policy, int dir)
+{
+	struct xfrm_policy *pol;
+	u32 sw_prio = 0;
+
+	lockdep_assert_held(&net->xfrm.xfrm_policy_lock);
+
+	if (!net->xfrm.policy_count[dir]) {
+		net->xfrm.policy_prio[dir].max_sw_prio = sw_prio;
+		net->xfrm.policy_prio[dir].min_hw_prio = U32_MAX;
+		return;
+	}
+
+	if (policy->xdo.type == XFRM_DEV_OFFLOAD_FULL &&
+	    !net->xfrm.policy_count[2 * XFRM_POLICY_MAX + dir]) {
+		net->xfrm.policy_prio[dir].min_hw_prio = U32_MAX;
+		return;
+	}
+
+	list_for_each_entry(pol, &net->xfrm.policy_all, walk.all) {
+		if (pol->walk.dead)
+			continue;
+
+		if (policy->xdo.type != XFRM_DEV_OFFLOAD_FULL) {
+			/* SW priority */
+			if (pol->xdo.type == XFRM_DEV_OFFLOAD_FULL) {
+				net->xfrm.policy_prio[dir].max_sw_prio = sw_prio;
+				return;
+			}
+			sw_prio = pol->priority;
+			continue;
+		}
+		/* HW priority */
+		if (pol->xdo.type != XFRM_DEV_OFFLOAD_FULL)
+			continue;
+
+		net->xfrm.policy_prio[dir].min_hw_prio = pol->priority;
+		return;
+	}
+
+	net->xfrm.policy_prio[dir].max_sw_prio = sw_prio;
+}
+
 int xfrm_policy_delete(struct xfrm_policy *pol, int dir)
 {
 	struct net *net = xp_net(pol);
 
 	spin_lock_bh(&net->xfrm.xfrm_policy_lock);
 	pol = __xfrm_policy_unlink(pol, dir);
+	if (pol)
+		__xfrm_policy_delete_prio(net, pol, dir);
 	spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
 	if (pol) {
 		xfrm_dev_policy_delete(pol);
@@ -4112,6 +4220,7 @@  static int __net_init xfrm_policy_init(struct net *net)
 
 		net->xfrm.policy_count[dir] = 0;
 		net->xfrm.policy_count[XFRM_POLICY_MAX + dir] = 0;
+		net->xfrm.policy_count[2 * XFRM_POLICY_MAX + dir] = 0;
 		INIT_HLIST_HEAD(&net->xfrm.policy_inexact[dir]);
 
 		htab = &net->xfrm.policy_bydst[dir];
@@ -4197,6 +4306,10 @@  static int __net_init xfrm_net_init(struct net *net)
 	net->xfrm.policy_default[XFRM_POLICY_FWD] = XFRM_USERPOLICY_ACCEPT;
 	net->xfrm.policy_default[XFRM_POLICY_OUT] = XFRM_USERPOLICY_ACCEPT;
 
+	net->xfrm.policy_prio[XFRM_POLICY_IN].min_hw_prio = U32_MAX;
+	net->xfrm.policy_prio[XFRM_POLICY_FWD].min_hw_prio = U32_MAX;
+	net->xfrm.policy_prio[XFRM_POLICY_OUT].min_hw_prio = U32_MAX;
+
 	rv = xfrm_statistics_init(net);
 	if (rv < 0)
 		goto out_statistics;