diff mbox series

[ipsec-next] xfrm: interface: support collect metadata mode

Message ID 20201121142823.3629805-1-eyal.birger@gmail.com (mailing list archive)
State Awaiting Upstream
Delegated to: Netdev Maintainers
Headers show
Series [ipsec-next] xfrm: interface: support collect metadata mode | expand

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Guessed tree name to be net-next
netdev/subject_prefix warning Target tree name not specified in the subject
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 7622 this patch: 7622
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 312 lines checked
netdev/build_allmodconfig_warn success Errors and warnings before: 8046 this patch: 8046
netdev/header_inline success Link
netdev/stable success Stable not CCed

Commit Message

Eyal Birger Nov. 21, 2020, 2:28 p.m. UTC
This commit adds support for 'collect_md' mode on xfrm interfaces.

Each net can have one collect_md device, created by providing the
IFLA_XFRM_COLLECT_METADATA flag at creation. This device cannot be
altered and has no if_id or link device attributes.

On transmit to this device, the if_id is fetched from the attached dst
metadata on the skb. The dst metadata type used is METADATA_IP_TUNNEL
since the only needed property is the if_id stored in the tun_id member
of the ip_tunnel_info->key.

On the receive side, xfrmi_rcv_cb() populates a dst metadata for each
packet received and attaches it to the skb. The if_id used in this case is
fetched from the xfrm state. This can later be used by upper layers such
as tc, ebpf, and ip rules.

Because the skb is scrubed in xfrmi_rcv_cb(), the attachment of the dst
metadata is postponed until after scrubing. Similarly, xfrm_input() is
adapted to avoid dropping metadata dsts by only dropping 'valid'
(skb_valid_dst(skb) == true) dsts.

Policy matching on packets arriving from collect_md xfrmi devices is
done by using the xfrm state existing in the skb's sec_path.
The xfrm_if_cb.decode_cb() interface implemented by xfrmi_decode_session()
is changed to keep the details of the if_id extraction tucked away
in xfrm_interface.c.

Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
---
 include/net/xfrm.h           |  11 +++-
 include/uapi/linux/if_link.h |   1 +
 net/xfrm/xfrm_input.c        |   7 ++-
 net/xfrm/xfrm_interface.c    | 114 ++++++++++++++++++++++++++++++-----
 net/xfrm/xfrm_policy.c       |  10 +--
 5 files changed, 119 insertions(+), 24 deletions(-)

Comments

Steffen Klassert Nov. 27, 2020, 9:44 a.m. UTC | #1
On Sat, Nov 21, 2020 at 04:28:23PM +0200, Eyal Birger wrote:
> This commit adds support for 'collect_md' mode on xfrm interfaces.
> 
> Each net can have one collect_md device, created by providing the
> IFLA_XFRM_COLLECT_METADATA flag at creation. This device cannot be
> altered and has no if_id or link device attributes.
> 
> On transmit to this device, the if_id is fetched from the attached dst
> metadata on the skb. The dst metadata type used is METADATA_IP_TUNNEL
> since the only needed property is the if_id stored in the tun_id member
> of the ip_tunnel_info->key.

Can we please have a separate metadata type for xfrm interfaces?

Sharing such structures turned already out to be a bad idea
on vti interfaces, let's try to avoid that misstake with
xfrm interfaces.

> On the receive side, xfrmi_rcv_cb() populates a dst metadata for each
> packet received and attaches it to the skb. The if_id used in this case is
> fetched from the xfrm state. This can later be used by upper layers such
> as tc, ebpf, and ip rules.
> 
> Because the skb is scrubed in xfrmi_rcv_cb(), the attachment of the dst
> metadata is postponed until after scrubing. Similarly, xfrm_input() is
> adapted to avoid dropping metadata dsts by only dropping 'valid'
> (skb_valid_dst(skb) == true) dsts.
> 
> Policy matching on packets arriving from collect_md xfrmi devices is
> done by using the xfrm state existing in the skb's sec_path.
> The xfrm_if_cb.decode_cb() interface implemented by xfrmi_decode_session()
> is changed to keep the details of the if_id extraction tucked away
> in xfrm_interface.c.
> 
> Signed-off-by: Eyal Birger <eyal.birger@gmail.com>

The rest of the patch looks good.

Thanks!
Eyal Birger Nov. 27, 2020, 12:32 p.m. UTC | #2
Hi Steffen,

On Fri, Nov 27, 2020 at 11:44 AM Steffen Klassert
<steffen.klassert@secunet.com> wrote:
>
> On Sat, Nov 21, 2020 at 04:28:23PM +0200, Eyal Birger wrote:
> > This commit adds support for 'collect_md' mode on xfrm interfaces.
> >
> > Each net can have one collect_md device, created by providing the
> > IFLA_XFRM_COLLECT_METADATA flag at creation. This device cannot be
> > altered and has no if_id or link device attributes.
> >
> > On transmit to this device, the if_id is fetched from the attached dst
> > metadata on the skb. The dst metadata type used is METADATA_IP_TUNNEL
> > since the only needed property is the if_id stored in the tun_id member
> > of the ip_tunnel_info->key.
>
> Can we please have a separate metadata type for xfrm interfaces?
>
> Sharing such structures turned already out to be a bad idea
> on vti interfaces, let's try to avoid that misstake with
> xfrm interfaces.

My initial thought was to do that, but it looks like most of the constructs
surrounding this facility - tc, nft, ovs, ebpf, ip routing - are built around
struct ip_tunnel_info and don't regard other possible metadata types.

For xfrm interfaces, the only metadata used is the if_id, which is stored
in the metadata tun_id, so I think other than naming consideration, the use
of struct ip_tunnel_info does not imply tunneling and does not limit the
use of xfrmi to a specific mode of operation.

On the other hand, adding a new metadata type would require changing all
other places to regard the new metadata type, with a large number of
userspace visible changes.

>
> > On the receive side, xfrmi_rcv_cb() populates a dst metadata for each
> > packet received and attaches it to the skb. The if_id used in this case is
> > fetched from the xfrm state. This can later be used by upper layers such
> > as tc, ebpf, and ip rules.
> >
> > Because the skb is scrubed in xfrmi_rcv_cb(), the attachment of the dst
> > metadata is postponed until after scrubing. Similarly, xfrm_input() is
> > adapted to avoid dropping metadata dsts by only dropping 'valid'
> > (skb_valid_dst(skb) == true) dsts.
> >
> > Policy matching on packets arriving from collect_md xfrmi devices is
> > done by using the xfrm state existing in the skb's sec_path.
> > The xfrm_if_cb.decode_cb() interface implemented by xfrmi_decode_session()
> > is changed to keep the details of the if_id extraction tucked away
> > in xfrm_interface.c.
> >
> > Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
>
> The rest of the patch looks good.

Thanks for the review!
Eyal.
Steffen Klassert Dec. 2, 2020, 9:37 a.m. UTC | #3
On Fri, Nov 27, 2020 at 02:32:44PM +0200, Eyal Birger wrote:
> Hi Steffen,
> 
> On Fri, Nov 27, 2020 at 11:44 AM Steffen Klassert
> <steffen.klassert@secunet.com> wrote:
> >
> > On Sat, Nov 21, 2020 at 04:28:23PM +0200, Eyal Birger wrote:
> > > This commit adds support for 'collect_md' mode on xfrm interfaces.
> > >
> > > Each net can have one collect_md device, created by providing the
> > > IFLA_XFRM_COLLECT_METADATA flag at creation. This device cannot be
> > > altered and has no if_id or link device attributes.
> > >
> > > On transmit to this device, the if_id is fetched from the attached dst
> > > metadata on the skb. The dst metadata type used is METADATA_IP_TUNNEL
> > > since the only needed property is the if_id stored in the tun_id member
> > > of the ip_tunnel_info->key.
> >
> > Can we please have a separate metadata type for xfrm interfaces?
> >
> > Sharing such structures turned already out to be a bad idea
> > on vti interfaces, let's try to avoid that misstake with
> > xfrm interfaces.
> 
> My initial thought was to do that, but it looks like most of the constructs
> surrounding this facility - tc, nft, ovs, ebpf, ip routing - are built around
> struct ip_tunnel_info and don't regard other possible metadata types.

That is likely because most objects that have a collect_md mode are
tunnels. We have already a second metadata type, and I don't see
why we can't have a third one. Maybe we can create something more
generic so that it can have other users too.

> For xfrm interfaces, the only metadata used is the if_id, which is stored
> in the metadata tun_id, so I think other than naming consideration, the use
> of struct ip_tunnel_info does not imply tunneling and does not limit the
> use of xfrmi to a specific mode of operation.

I agree that this can work, but it is a first step into a wrong direction.
Using a __be64 field of a completely unrelated structure as an u32 if_id
is bad style IMO.

> On the other hand, adding a new metadata type would require changing all
> other places to regard the new metadata type, with a large number of
> userspace visible changes.

I admit that this might have some disadvantages too, but I'm not convinced
that this justifies the 'ip_tunnel_info' hack.
Eyal Birger Dec. 28, 2020, 7:03 a.m. UTC | #4
Hi Steffen,

On Mon, Dec 7, 2020 at 11:55 AM Steffen Klassert
<steffen.klassert@secunet.com> wrote:
>
> On Fri, Nov 27, 2020 at 02:32:44PM +0200, Eyal Birger wrote:
> > Hi Steffen,
> >
> > On Fri, Nov 27, 2020 at 11:44 AM Steffen Klassert
> > <steffen.klassert@secunet.com> wrote:
> > >
> > > On Sat, Nov 21, 2020 at 04:28:23PM +0200, Eyal Birger wrote:
> > > > This commit adds support for 'collect_md' mode on xfrm interfaces.
> > > >
> > > > Each net can have one collect_md device, created by providing the
> > > > IFLA_XFRM_COLLECT_METADATA flag at creation. This device cannot be
> > > > altered and has no if_id or link device attributes.
> > > >
> > > > On transmit to this device, the if_id is fetched from the attached dst
> > > > metadata on the skb. The dst metadata type used is METADATA_IP_TUNNEL
> > > > since the only needed property is the if_id stored in the tun_id member
> > > > of the ip_tunnel_info->key.
> > >
> > > Can we please have a separate metadata type for xfrm interfaces?
> > >
> > > Sharing such structures turned already out to be a bad idea
> > > on vti interfaces, let's try to avoid that misstake with
> > > xfrm interfaces.
> >
> > My initial thought was to do that, but it looks like most of the constructs
> > surrounding this facility - tc, nft, ovs, ebpf, ip routing - are built around
> > struct ip_tunnel_info and don't regard other possible metadata types.
>
> That is likely because most objects that have a collect_md mode are
> tunnels. We have already a second metadata type, and I don't see
> why we can't have a third one. Maybe we can create something more
> generic so that it can have other users too.
>
> > For xfrm interfaces, the only metadata used is the if_id, which is stored
> > in the metadata tun_id, so I think other than naming consideration, the use
> > of struct ip_tunnel_info does not imply tunneling and does not limit the
> > use of xfrmi to a specific mode of operation.
>
> I agree that this can work, but it is a first step into a wrong direction.
> Using a __be64 field of a completely unrelated structure as an u32 if_id
> is bad style IMO.
>
> > On the other hand, adding a new metadata type would require changing all
> > other places to regard the new metadata type, with a large number of
> > userspace visible changes.
>
> I admit that this might have some disadvantages too, but I'm not convinced
> that this justifies the 'ip_tunnel_info' hack.
>

I understand. I'll try to come up with something more generic.
I hope I can find a way to still utilize the existing userspace
constructs.

Thanks!
Eyal.
Steffen Klassert Jan. 5, 2021, 8:13 a.m. UTC | #5
On Mon, Dec 28, 2020 at 09:03:28AM +0200, Eyal Birger wrote:
> Hi Steffen,
> 
> On Mon, Dec 7, 2020 at 11:55 AM Steffen Klassert
> <steffen.klassert@secunet.com> wrote:
> >
> > On Fri, Nov 27, 2020 at 02:32:44PM +0200, Eyal Birger wrote:
> > > Hi Steffen,
> > >
> > > On Fri, Nov 27, 2020 at 11:44 AM Steffen Klassert
> > > <steffen.klassert@secunet.com> wrote:
> > > >
> > > > On Sat, Nov 21, 2020 at 04:28:23PM +0200, Eyal Birger wrote:
> > > > > This commit adds support for 'collect_md' mode on xfrm interfaces.
> > > > >
> > > > > Each net can have one collect_md device, created by providing the
> > > > > IFLA_XFRM_COLLECT_METADATA flag at creation. This device cannot be
> > > > > altered and has no if_id or link device attributes.
> > > > >
> > > > > On transmit to this device, the if_id is fetched from the attached dst
> > > > > metadata on the skb. The dst metadata type used is METADATA_IP_TUNNEL
> > > > > since the only needed property is the if_id stored in the tun_id member
> > > > > of the ip_tunnel_info->key.
> > > >
> > > > Can we please have a separate metadata type for xfrm interfaces?
> > > >
> > > > Sharing such structures turned already out to be a bad idea
> > > > on vti interfaces, let's try to avoid that misstake with
> > > > xfrm interfaces.
> > >
> > > My initial thought was to do that, but it looks like most of the constructs
> > > surrounding this facility - tc, nft, ovs, ebpf, ip routing - are built around
> > > struct ip_tunnel_info and don't regard other possible metadata types.
> >
> > That is likely because most objects that have a collect_md mode are
> > tunnels. We have already a second metadata type, and I don't see
> > why we can't have a third one. Maybe we can create something more
> > generic so that it can have other users too.
> >
> > > For xfrm interfaces, the only metadata used is the if_id, which is stored
> > > in the metadata tun_id, so I think other than naming consideration, the use
> > > of struct ip_tunnel_info does not imply tunneling and does not limit the
> > > use of xfrmi to a specific mode of operation.
> >
> > I agree that this can work, but it is a first step into a wrong direction.
> > Using a __be64 field of a completely unrelated structure as an u32 if_id
> > is bad style IMO.
> >
> > > On the other hand, adding a new metadata type would require changing all
> > > other places to regard the new metadata type, with a large number of
> > > userspace visible changes.
> >
> > I admit that this might have some disadvantages too, but I'm not convinced
> > that this justifies the 'ip_tunnel_info' hack.
> >
> 
> I understand. I'll try to come up with something more generic.
> I hope I can find a way to still utilize the existing userspace
> constructs.

Great, thanks!
diff mbox series

Patch

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index b2a06f10b62c..925f8dcdd0db 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -308,9 +308,15 @@  struct xfrm_replay {
 	int	(*overflow)(struct xfrm_state *x, struct sk_buff *skb);
 };
 
+struct xfrm_if_decode_session_params {
+	struct net *net;
+	u32 if_id;
+};
+
 struct xfrm_if_cb {
-	struct xfrm_if	*(*decode_session)(struct sk_buff *skb,
-					   unsigned short family);
+	bool (*decode_session)(struct sk_buff *skb,
+			       unsigned short family,
+			       struct xfrm_if_decode_session_params *params);
 };
 
 void xfrm_if_register_cb(const struct xfrm_if_cb *ifcb);
@@ -984,6 +990,7 @@  void xfrm_dst_ifdown(struct dst_entry *dst, struct net_device *dev);
 struct xfrm_if_parms {
 	int link;		/* ifindex of underlying L2 interface */
 	u32 if_id;		/* interface identifyer */
+	bool collect_md;
 };
 
 struct xfrm_if {
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index c4b23f06f69e..ff04a06c2b69 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -655,6 +655,7 @@  enum {
 	IFLA_XFRM_UNSPEC,
 	IFLA_XFRM_LINK,
 	IFLA_XFRM_IF_ID,
+	IFLA_XFRM_COLLECT_METADATA,
 	__IFLA_XFRM_MAX
 };
 
diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index be6351e3f3cd..c7de46d30697 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -20,6 +20,7 @@ 
 #include <net/xfrm.h>
 #include <net/ip_tunnels.h>
 #include <net/ip6_tunnel.h>
+#include <net/dst_metadata.h>
 
 #include "xfrm_inout.h"
 
@@ -719,7 +720,8 @@  int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type)
 		sp = skb_sec_path(skb);
 		if (sp)
 			sp->olen = 0;
-		skb_dst_drop(skb);
+		if (skb_valid_dst(skb))
+			skb_dst_drop(skb);
 		gro_cells_receive(&gro_cells, skb);
 		return 0;
 	} else {
@@ -737,7 +739,8 @@  int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type)
 			sp = skb_sec_path(skb);
 			if (sp)
 				sp->olen = 0;
-			skb_dst_drop(skb);
+			if (skb_valid_dst(skb))
+				skb_dst_drop(skb);
 			gro_cells_receive(&gro_cells, skb);
 			return err;
 		}
diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c
index 9b8e292a7c6a..10c14967d305 100644
--- a/net/xfrm/xfrm_interface.c
+++ b/net/xfrm/xfrm_interface.c
@@ -41,6 +41,7 @@ 
 #include <net/addrconf.h>
 #include <net/xfrm.h>
 #include <net/net_namespace.h>
+#include <net/dst_metadata.h>
 #include <net/netns/generic.h>
 #include <linux/etherdevice.h>
 
@@ -56,11 +57,22 @@  static const struct net_device_ops xfrmi_netdev_ops;
 struct xfrmi_net {
 	/* lists for storing interfaces in use */
 	struct xfrm_if __rcu *xfrmi[XFRMI_HASH_SIZE];
+	struct xfrm_if __rcu *collect_md_xfrmi;
 };
 
 #define for_each_xfrmi_rcu(start, xi) \
 	for (xi = rcu_dereference(start); xi; xi = rcu_dereference(xi->next))
 
+static u32 tunnel_id_to_if_id(__be64 tun_id)
+{
+	return ntohl(tunnel_id_to_key32(tun_id));
+}
+
+static __be64 if_id_to_tunnel_id(u32 if_id)
+{
+	return key32_to_tunnel_id(htonl(if_id));
+}
+
 static u32 xfrmi_hash(u32 if_id)
 {
 	return hash_32(if_id, XFRMI_HASH_BITS);
@@ -77,17 +89,23 @@  static struct xfrm_if *xfrmi_lookup(struct net *net, struct xfrm_state *x)
 			return xi;
 	}
 
+	xi = rcu_dereference(xfrmn->collect_md_xfrmi);
+	if (xi && xi->dev->flags & IFF_UP)
+		return xi;
+
 	return NULL;
 }
 
-static struct xfrm_if *xfrmi_decode_session(struct sk_buff *skb,
-					    unsigned short family)
+static bool xfrmi_decode_session(struct sk_buff *skb,
+				 unsigned short family,
+				 struct xfrm_if_decode_session_params *params)
 {
 	struct net_device *dev;
+	struct xfrm_if *xi;
 	int ifindex = 0;
 
 	if (!secpath_exists(skb) || !skb->dev)
-		return NULL;
+		return false;
 
 	switch (family) {
 	case AF_INET6:
@@ -107,11 +125,18 @@  static struct xfrm_if *xfrmi_decode_session(struct sk_buff *skb,
 	}
 
 	if (!dev || !(dev->flags & IFF_UP))
-		return NULL;
+		return false;
 	if (dev->netdev_ops != &xfrmi_netdev_ops)
-		return NULL;
+		return false;
+
+	xi = netdev_priv(dev);
+	params->net = xi->net;
 
-	return netdev_priv(dev);
+	if (xi->p.collect_md)
+		params->if_id = xfrm_input_state(skb)->if_id;
+	else
+		params->if_id = xi->p.if_id;
+	return true;
 }
 
 static void xfrmi_link(struct xfrmi_net *xfrmn, struct xfrm_if *xi)
@@ -157,7 +182,10 @@  static int xfrmi_create(struct net_device *dev)
 	if (err < 0)
 		goto out;
 
-	xfrmi_link(xfrmn, xi);
+	if (xi->p.collect_md)
+		rcu_assign_pointer(xfrmn->collect_md_xfrmi, xi);
+	else
+		xfrmi_link(xfrmn, xi);
 
 	return 0;
 
@@ -185,7 +213,10 @@  static void xfrmi_dev_uninit(struct net_device *dev)
 	struct xfrm_if *xi = netdev_priv(dev);
 	struct xfrmi_net *xfrmn = net_generic(xi->net, xfrmi_net_id);
 
-	xfrmi_unlink(xfrmn, xi);
+	if (xi->p.collect_md)
+		rcu_assign_pointer(xfrmn->collect_md_xfrmi, NULL);
+	else
+		xfrmi_unlink(xfrmn, xi);
 }
 
 static void xfrmi_scrub_packet(struct sk_buff *skb, bool xnet)
@@ -254,6 +285,16 @@  static int xfrmi_rcv_cb(struct sk_buff *skb, int err)
 	}
 
 	xfrmi_scrub_packet(skb, xnet);
+	if (xi->p.collect_md) {
+		struct metadata_dst *tun_dst;
+
+		tun_dst = tun_rx_dst(0);
+		if (!tun_dst)
+			return -ENOMEM;
+
+		tun_dst->u.tun_info.key.tun_id = if_id_to_tunnel_id(x->if_id);
+		skb_dst_set(skb, (struct dst_entry *)tun_dst);
+	}
 	dev_sw_netstats_rx_add(dev, skb->len);
 
 	return 0;
@@ -269,10 +310,24 @@  xfrmi_xmit2(struct sk_buff *skb, struct net_device *dev, struct flowi *fl)
 	struct net_device *tdev;
 	struct xfrm_state *x;
 	int err = -1;
+	u32 if_id;
 	int mtu;
 
+	if (xi->p.collect_md) {
+		struct ip_tunnel_info *tun_info = skb_tunnel_info(skb);
+
+		if (unlikely(!tun_info ||
+			     !(tun_info->mode & IP_TUNNEL_INFO_TX))) {
+			return -EINVAL;
+		}
+
+		if_id = tunnel_id_to_if_id(tun_info->key.tun_id);
+	} else {
+		if_id = xi->p.if_id;
+	}
+
 	dst_hold(dst);
-	dst = xfrm_lookup_with_ifid(xi->net, dst, fl, NULL, 0, xi->p.if_id);
+	dst = xfrm_lookup_with_ifid(xi->net, dst, fl, NULL, 0, if_id);
 	if (IS_ERR(dst)) {
 		err = PTR_ERR(dst);
 		dst = NULL;
@@ -283,7 +338,7 @@  xfrmi_xmit2(struct sk_buff *skb, struct net_device *dev, struct flowi *fl)
 	if (!x)
 		goto tx_err_link_failure;
 
-	if (x->if_id != xi->p.if_id)
+	if (x->if_id != if_id)
 		goto tx_err_link_failure;
 
 	tdev = dst->dev;
@@ -633,6 +688,9 @@  static void xfrmi_netlink_parms(struct nlattr *data[],
 
 	if (data[IFLA_XFRM_IF_ID])
 		parms->if_id = nla_get_u32(data[IFLA_XFRM_IF_ID]);
+
+	if (data[IFLA_XFRM_COLLECT_METADATA])
+		parms->collect_md = true;
 }
 
 static int xfrmi_newlink(struct net *src_net, struct net_device *dev,
@@ -645,9 +703,20 @@  static int xfrmi_newlink(struct net *src_net, struct net_device *dev,
 	int err;
 
 	xfrmi_netlink_parms(data, &p);
-	xi = xfrmi_locate(net, &p);
-	if (xi)
-		return -EEXIST;
+	if (p.collect_md) {
+		struct xfrmi_net *xfrmn = net_generic(net, xfrmi_net_id);
+
+		if (p.link || p.if_id)
+			return -EINVAL;
+
+		if (rtnl_dereference(xfrmn->collect_md_xfrmi))
+			return -EEXIST;
+
+	} else {
+		xi = xfrmi_locate(net, &p);
+		if (xi)
+			return -EEXIST;
+	}
 
 	xi = netdev_priv(dev);
 	xi->p = p;
@@ -672,12 +741,17 @@  static int xfrmi_changelink(struct net_device *dev, struct nlattr *tb[],
 	struct xfrm_if_parms p;
 
 	xfrmi_netlink_parms(data, &p);
+	if (p.collect_md)
+		return -EINVAL;
+
 	xi = xfrmi_locate(net, &p);
 	if (!xi) {
 		xi = netdev_priv(dev);
 	} else {
 		if (xi->dev != dev)
 			return -EEXIST;
+		if (xi->p.collect_md)
+			return -EINVAL;
 	}
 
 	return xfrmi_update(xi, &p);
@@ -690,6 +764,8 @@  static size_t xfrmi_get_size(const struct net_device *dev)
 		nla_total_size(4) +
 		/* IFLA_XFRM_IF_ID */
 		nla_total_size(4) +
+		/* IFLA_XFRM_COLLECT_METADATA */
+		nla_total_size(0) +
 		0;
 }
 
@@ -701,6 +777,10 @@  static int xfrmi_fill_info(struct sk_buff *skb, const struct net_device *dev)
 	if (nla_put_u32(skb, IFLA_XFRM_LINK, parm->link) ||
 	    nla_put_u32(skb, IFLA_XFRM_IF_ID, parm->if_id))
 		goto nla_put_failure;
+	if (xi->p.collect_md) {
+		if (nla_put_flag(skb, IFLA_XFRM_COLLECT_METADATA))
+			goto nla_put_failure;
+	}
 	return 0;
 
 nla_put_failure:
@@ -715,8 +795,9 @@  static struct net *xfrmi_get_link_net(const struct net_device *dev)
 }
 
 static const struct nla_policy xfrmi_policy[IFLA_XFRM_MAX + 1] = {
-	[IFLA_XFRM_LINK]	= { .type = NLA_U32 },
-	[IFLA_XFRM_IF_ID]	= { .type = NLA_U32 },
+	[IFLA_XFRM_LINK]		= { .type = NLA_U32 },
+	[IFLA_XFRM_IF_ID]		= { .type = NLA_U32 },
+	[IFLA_XFRM_COLLECT_METADATA]	= { .type = NLA_FLAG },
 };
 
 static struct rtnl_link_ops xfrmi_link_ops __read_mostly = {
@@ -752,6 +833,9 @@  static void __net_exit xfrmi_exit_batch_net(struct list_head *net_exit_list)
 			     xip = &xi->next)
 				unregister_netdevice_queue(xi->dev, &list);
 		}
+		xi = rcu_dereference(xfrmn->collect_md_xfrmi);
+		if (xi)
+			unregister_netdevice_queue(xi->dev, &list);
 	}
 	unregister_netdevice_many(&list);
 	rtnl_unlock();
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index d622c2548d22..4da5ea277500 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -3519,17 +3519,17 @@  int __xfrm_policy_check(struct sock *sk, int dir, struct sk_buff *skb,
 	int xerr_idx = -1;
 	const struct xfrm_if_cb *ifcb;
 	struct sec_path *sp;
-	struct xfrm_if *xi;
 	u32 if_id = 0;
 
 	rcu_read_lock();
 	ifcb = xfrm_if_get_cb();
 
 	if (ifcb) {
-		xi = ifcb->decode_session(skb, family);
-		if (xi) {
-			if_id = xi->p.if_id;
-			net = xi->net;
+		struct xfrm_if_decode_session_params p;
+
+		if (ifcb->decode_session(skb, family, &p)) {
+			if_id = p.if_id;
+			net = p.net;
 		}
 	}
 	rcu_read_unlock();