Message ID | 20240528032914.2551267-1-eyal.birger@gmail.com (mailing list archive) |
---|---|
State | Awaiting Upstream |
Delegated to: | Netdev Maintainers |
Series | [ipsec-next,v4] xfrm: support sending NAT keepalives in ESP in UDP states |
On Mon, May 27, 2024 at 08:29:14PM -0700, Eyal Birger wrote:
> Add the ability to send out RFC-3948 NAT keepalives from the xfrm stack.
>
> To use, Userspace sets an XFRM_NAT_KEEPALIVE_INTERVAL integer property when
> creating XFRM outbound states which denotes the number of seconds between
> keepalive messages.
>
> Keepalive messages are sent from a per net delayed work which iterates over
> the xfrm states. The logic is guarded by the xfrm state spinlock due to the
> xfrm state walk iterator.
>
> Possible future enhancements:
>
> - Adding counters to keep track of sent keepalives.
> - deduplicate NAT keepalives between states sharing the same nat keepalive
>   parameters.
> - provisioning hardware offloads for devices capable of implementing this.
> - revise xfrm state list to use an rcu list in order to avoid running this
>   under spinlock.
>
> Suggested-by: Paul Wouters <paul.wouters@aiven.io>
> Signed-off-by: Eyal Birger <eyal.birger@gmail.com>

Paul, you said you wanted to test this, if I remember correctly. Do you still
plan to do a test?
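For readers skimming the thread, the per-state decision that the delayed work
makes boils down to the condensed sketch below. It mirrors
nat_keepalive_work_single() from the patch further down this page; the timings
in the comments are illustrative only, and the real code additionally stores
the computed deadline in x->nat_keepalive_expiration.

    /* Condensed sketch of the keepalive scheduling decision; not the patch code. */
    static time64_t nat_ka_next_run(time64_t now, time64_t lastused,
                                    time64_t expiration, u32 interval,
                                    bool *send)
    {
            time64_t delta = now - lastused;

            *send = false;
            if (delta < interval)                   /* ESP traffic seen recently, e.g. 5s ago */
                    return now + interval - delta;  /* with interval=20: revisit in 15s */
            if (expiration > now)                   /* a deadline from an earlier pass is pending */
                    return expiration;
            *send = true;                           /* idle past the deadline: send one now */
            return now + interval;                  /* and re-arm a full interval away */
    }

So an SA that keeps carrying ESP traffic never generates keepalives; only an
SA that has been idle for a full interval does.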
Hi Eyal,

On Mon, May 27, 2024 at 08:29:14PM -0700, Eyal Birger wrote:
> Add the ability to send out RFC-3948 NAT keepalives from the xfrm stack.
>
> To use, Userspace sets an XFRM_NAT_KEEPALIVE_INTERVAL integer property when
> creating XFRM outbound states which denotes the number of seconds between
> keepalive messages.
>
> Keepalive messages are sent from a per net delayed work which iterates over
> the xfrm states. The logic is guarded by the xfrm state spinlock due to the
> xfrm state walk iterator.
>
> Possible future enhancements:
>
> - Adding counters to keep track of sent keepalives.
> - deduplicate NAT keepalives between states sharing the same nat keepalive
>   parameters.
> - provisioning hardware offloads for devices capable of implementing this.
> - revise xfrm state list to use an rcu list in order to avoid running this
>   under spinlock.
>
> Suggested-by: Paul Wouters <paul.wouters@aiven.io>
> Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
>
> ---
> v4: rebase and explicitly check that keepalive is only configured on
>     outbound SAs
> v3: add missing ip6_checksum header
> v2: change xfrm compat to include the new attribute
> ---
>  include/net/ipv6_stubs.h | 3 +
>  include/net/netns/xfrm.h | 1 +
>  include/net/xfrm.h | 10 ++
>  include/uapi/linux/xfrm.h | 1 +
>  net/ipv6/af_inet6.c | 1 +
>  net/ipv6/xfrm6_policy.c | 7 +
>  net/xfrm/Makefile | 3 +-
>  net/xfrm/xfrm_compat.c | 6 +-
>  net/xfrm/xfrm_nat_keepalive.c | 292 ++++++++++++++++++++++++++++++++++
>  net/xfrm/xfrm_policy.c | 8 +
>  net/xfrm/xfrm_state.c | 17 ++
>  net/xfrm/xfrm_user.c | 15 ++
>  12 files changed, 361 insertions(+), 3 deletions(-)
>  create mode 100644 net/xfrm/xfrm_nat_keepalive.c
>
> [...]
>
> @@ -3165,6 +3176,7 @@ const struct nla_policy xfrma_policy[XFRMA_MAX+1] = {
>  [XFRMA_IF_ID] = { .type = NLA_U32 },
>  [XFRMA_MTIMER_THRESH] = { .type = NLA_U32 },
>  [XFRMA_SA_DIR] = NLA_POLICY_RANGE(NLA_U8, XFRM_SA_DIR_IN, XFRM_SA_DIR_OUT),
> + [XFRMA_NAT_KEEPALIVE_INTERVAL] = { .type = NLA_U32 },

What would happen if the value is set to 0? Should it be allowed? I don't see
any use for it. If not, should we make the range from 1 to UINT32_MAX? This
way, we can avoid any potential issues with the value 0, ensuring proper
functionality.

I also wonder about limiting the maximum value to
x->lft.hard_add_expires_seconds. Adding XFRMA_NAT_KEEPALIVE_INTERVAL >
x->lft.hard_add_expires_seconds does not make sense to me. Just another
sanity check.

> };
> EXPORT_SYMBOL_GPL(xfrma_policy);
>
> [...]
>
> --
> 2.34.1

One curious question: What happens if the NAT gateway in between returns an
ICMP unreachable error as a response? If the IKE daemon was sending it, IKEd
would notice the ICMP errors and possibly take action. This is something to
consider. For example, this might be important to handle when offloading on
an Android phone use case. Somehow, the IKE daemon should be woken up to
handle these errors; otherwise, you risk unnecessary battery drain. Or, if
you are a server, flooding a lot of NAT keepalives.

07:33:30.839377 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 29)
    192.1.3.33.4500 > 192.1.2.23.4500: [no cksum] isakmp-nat-keep-alive
07:33:30.840014 IP (tos 0xc0, ttl 63, id 17350, offset 0, flags [none], proto ICMP (1), length 57)
    192.1.2.23 > 192.1.3.33: ICMP 192.1.2.23 udp port 4500 unreachable, length 37
    IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 29)
    192.1.3.33.4500 > 192.1.2.23.4500: [no cksum] isakmp-nat-keep-alive

I ran a few quick tests and the patch works well.

thanks,
-antony
Hi Antony,

On Mon, Jun 10, 2024 at 10:45 PM Antony Antony <antony@phenome.org> wrote:
>
> Hi Eyal,
>
> On Mon, May 27, 2024 at 08:29:14PM -0700, Eyal Birger wrote:
> > Add the ability to send out RFC-3948 NAT keepalives from the xfrm stack.
> >
> > [...]
> >
> > @@ -3165,6 +3176,7 @@ const struct nla_policy xfrma_policy[XFRMA_MAX+1] = {
> >  [XFRMA_IF_ID] = { .type = NLA_U32 },
> >  [XFRMA_MTIMER_THRESH] = { .type = NLA_U32 },
> >  [XFRMA_SA_DIR] = NLA_POLICY_RANGE(NLA_U8, XFRM_SA_DIR_IN, XFRM_SA_DIR_OUT),
> > + [XFRMA_NAT_KEEPALIVE_INTERVAL] = { .type = NLA_U32 },
>
> What would happen if the value is set to 0? Should it be allowed? I don't
> see any use for it. If not, should we make the range from 1 to UINT32_MAX?
> This way, we can avoid any potential issues with the value 0, ensuring
> proper functionality.

The xfrm_state only has x->nat_keepalive_interval, so if you set the value
to '0' it would be the same as not setting it.

> I also wonder about limiting the maximum value to
> x->lft.hard_add_expires_seconds. Adding XFRMA_NAT_KEEPALIVE_INTERVAL >
> x->lft.hard_add_expires_seconds does not make sense to me. Just another
> sanity check.

I don't understand the relation between the NAT-T keepalive and
"hard_add_expires_seconds". That number seems to be derived from the acquire
expiration sysctl whereas NAT-T is related to the network between hosts and
firewall timeouts there. As such, I'm not sure what the appropriate upper
bound for this configuration would be or whether it's needed.

> [...]
>
> One curious question: What happens if the NAT gateway in between returns an
> ICMP unreachable error as a response? If the IKE daemon was sending it, IKEd
> would notice the ICMP errors and possibly take action. This is something to
> consider. For example, this might be important to handle when offloading on
> an Android phone use case. Somehow, the IKE daemon should be woken up to
> handle these errors; otherwise, you risk unnecessary battery drain. Or, if
> you are a server, flooding a lot of NAT keepalives.
>
> 07:33:30.839377 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 29)
>     192.1.3.33.4500 > 192.1.2.23.4500: [no cksum] isakmp-nat-keep-alive
> 07:33:30.840014 IP (tos 0xc0, ttl 63, id 17350, offset 0, flags [none], proto ICMP (1), length 57)
>     192.1.2.23 > 192.1.3.33: ICMP 192.1.2.23 udp port 4500 unreachable, length 37
>     IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 29)
>     192.1.3.33.4500 > 192.1.2.23.4500: [no cksum] isakmp-nat-keep-alive

That's an interesting observation. Do IKE daemons currently handle this
situation?

If the route between hosts isn't available, any traffic related to the xfrm
state would fail in a similar way, which the IKE daemon could observe as
well. Is there such handling done today?

> I ran a few quick tests and the patch works well.

Thanks!
Eyal.
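A side note on the zero-value discussion above: if one did want to reject 0
(or impose any range) at netlink validation time rather than treating it as
"not set", the generic nla_policy helpers already allow that. A minimal
sketch follows; this is a hypothetical alternative, not what the applied
patch does (it keeps a plain NLA_U32 and treats 0 as "no keepalives").

    /* Hypothetical tightening of the attribute policy; not the applied patch. */
    static const struct nla_policy xfrma_policy_sketch[XFRMA_MAX + 1] = {
            /* reject 0 during netlink validation instead of silently ignoring it */
            [XFRMA_NAT_KEEPALIVE_INTERVAL] = NLA_POLICY_MIN(NLA_U32, 1),
    };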
On Tue, Jun 11, 2024 at 10:58 AM Paul Wouters <paul.wouters@aiven.io> wrote:
>
> On Tue, Jun 11, 2024 at 1:08 PM Eyal Birger <eyal.birger@gmail.com> wrote:
>
>> > One curious question: What happens if the NAT gateway in between returns an
>> > ICMP unreachable error as a response? If the IKE daemon was sending it,
>> > IKEd would notice the ICMP errors and possibly take action. This is
>> > something to consider. For example, this might be important to handle when
>> > offloading on an Android phone use case. Somehow, the IKE daemon should be
>> > woken up to handle these errors; otherwise, you risk unnecessary battery
>> > drain. Or, if you are a server, flooding a lot of NAT keepalives.
>> >
>> > 07:33:30.839377 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 29)
>> >     192.1.3.33.4500 > 192.1.2.23.4500: [no cksum] isakmp-nat-keep-alive
>> > 07:33:30.840014 IP (tos 0xc0, ttl 63, id 17350, offset 0, flags [none], proto ICMP (1), length 57)
>> >     192.1.2.23 > 192.1.3.33: ICMP 192.1.2.23 udp port 4500 unreachable, length 37
>> >     IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 29)
>> >     192.1.3.33.4500 > 192.1.2.23.4500: [no cksum] isakmp-nat-keep-alive
>>
>> That's an interesting observation. Do IKE daemons currently handle this
>> situation?
>
> libreswan logs these messages but since the ICMPs are not encrypted, it
> cannot act on it by itself.
>
>> If the route between hosts isn't available, any traffic related to the xfrm
>> state would fail in a similar way, which the IKE daemon could observe as
>> well. Is there such handling done today?
>
> I would let the regular IKE liveness/DPD handle this. Unless the keepalives
> could cause the kernel to suppress further messages as some sort of rate
> limit protection? In which case it should just ignore these keepalives and
> not do icmp for them?

I also feel this handling should be done based on IKE and not as part of this
feature.

> Note that it is perfectly valid to send these keepalives with a TTL=2, in
> which case these don't traverse the whole internet but only pass your NAT
> gateway so it knows to keep the mapping open. The other end just silently
> eats these anyway without any other side effects.

Personally I feel TTL=2 is a bit too aggressive, as we do have setups with
multiply NATted VMs and CGNs.

Currently the patch uses ip_select_ttl() under the hood, which relies on the
system defaults and routing configuration. If needed, we can later add support
for tuning the TTL for our socket based on a sysctl.

Eyal.
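On the TTL point: as noted above, the IPv4 transmit path used by the patch
(ip_build_and_send_pkt()) picks the TTL via ip_select_ttl(), which prefers the
socket's unicast TTL when one has been set. A follow-up knob could therefore
be as small as the sketch below; the sysctl (or other mechanism) that would
supply 'ttl' is hypothetical and not part of the applied patch.

    /* Hypothetical follow-up, not in the applied patch: pin a fixed TTL on a
     * per-CPU IPv4 keepalive socket so that ip_select_ttl() uses it instead
     * of the route/default hop limit.
     */
    static void nat_keepalive_set_ttl(struct sock *sk, int ttl)
    {
            if (ttl > 0 && ttl <= 255)      /* same bounds the IP_TTL sockopt accepts */
                    inet_sk(sk)->uc_ttl = ttl;
    }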
On Mon, May 27, 2024 at 20:29:14 -0700, Eyal Birger via Devel wrote:
> Add the ability to send out RFC-3948 NAT keepalives from the xfrm stack.
>
> To use, Userspace sets an XFRM_NAT_KEEPALIVE_INTERVAL integer property when
> creating XFRM outbound states which denotes the number of seconds between
> keepalive messages.
>
> Keepalive messages are sent from a per net delayed work which iterates over
> the xfrm states. The logic is guarded by the xfrm state spinlock due to the
> xfrm state walk iterator.
>
> Possible future enhancements:
>
> - Adding counters to keep track of sent keepalives.
> - deduplicate NAT keepalives between states sharing the same nat keepalive
>   parameters.
> - provisioning hardware offloads for devices capable of implementing this.
> - revise xfrm state list to use an rcu list in order to avoid running this
>   under spinlock.
>
> Suggested-by: Paul Wouters <paul.wouters@aiven.io>
> Signed-off-by: Eyal Birger <eyal.birger@gmail.com>

Tested-by: Antony Antony <antony.antony@secunet.com>

thanks,
-antony
On Tue, Jun 25, 2024 at 12:21:28PM +0200, Antony Antony wrote:
> On Mon, May 27, 2024 at 20:29:14 -0700, Eyal Birger via Devel wrote:
> > Add the ability to send out RFC-3948 NAT keepalives from the xfrm stack.
> >
> > To use, Userspace sets an XFRM_NAT_KEEPALIVE_INTERVAL integer property when
> > creating XFRM outbound states which denotes the number of seconds between
> > keepalive messages.
> >
> > Keepalive messages are sent from a per net delayed work which iterates over
> > the xfrm states. The logic is guarded by the xfrm state spinlock due to the
> > xfrm state walk iterator.
> >
> > Possible future enhancements:
> >
> > - Adding counters to keep track of sent keepalives.
> > - deduplicate NAT keepalives between states sharing the same nat keepalive
> >   parameters.
> > - provisioning hardware offloads for devices capable of implementing this.
> > - revise xfrm state list to use an rcu list in order to avoid running this
> >   under spinlock.
> >
> > Suggested-by: Paul Wouters <paul.wouters@aiven.io>
> > Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
>
> Tested-by: Antony Antony <antony.antony@secunet.com>

Now applied to ipsec-next, thanks everyone!
diff --git a/include/net/ipv6_stubs.h b/include/net/ipv6_stubs.h index 485c39a89866..11cefd50704d 100644 --- a/include/net/ipv6_stubs.h +++ b/include/net/ipv6_stubs.h @@ -9,6 +9,7 @@ #include <net/flow.h> #include <net/neighbour.h> #include <net/sock.h> +#include <net/ipv6.h> /* structs from net/ip6_fib.h */ struct fib6_info; @@ -72,6 +73,8 @@ struct ipv6_stub { int (*output)(struct net *, struct sock *, struct sk_buff *)); struct net_device *(*ipv6_dev_find)(struct net *net, const struct in6_addr *addr, struct net_device *dev); + int (*ip6_xmit)(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6, + __u32 mark, struct ipv6_txoptions *opt, int tclass, u32 priority); }; extern const struct ipv6_stub *ipv6_stub __read_mostly; diff --git a/include/net/netns/xfrm.h b/include/net/netns/xfrm.h index 423b52eca908..d489d9250bff 100644 --- a/include/net/netns/xfrm.h +++ b/include/net/netns/xfrm.h @@ -83,6 +83,7 @@ struct netns_xfrm { spinlock_t xfrm_policy_lock; struct mutex xfrm_cfg_mutex; + struct delayed_work nat_keepalive_work; }; #endif diff --git a/include/net/xfrm.h b/include/net/xfrm.h index 7c9be06f8302..e208907b1a00 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -229,6 +229,10 @@ struct xfrm_state { struct xfrm_encap_tmpl *encap; struct sock __rcu *encap_sk; + /* NAT keepalive */ + u32 nat_keepalive_interval; /* seconds */ + time64_t nat_keepalive_expiration; + /* Data for care-of address */ xfrm_address_t *coaddr; @@ -2200,4 +2204,10 @@ static inline int register_xfrm_state_bpf(void) } #endif +int xfrm_nat_keepalive_init(unsigned short family); +void xfrm_nat_keepalive_fini(unsigned short family); +int xfrm_nat_keepalive_net_init(struct net *net); +int xfrm_nat_keepalive_net_fini(struct net *net); +void xfrm_nat_keepalive_state_updated(struct xfrm_state *x); + #endif /* _NET_XFRM_H */ diff --git a/include/uapi/linux/xfrm.h b/include/uapi/linux/xfrm.h index 18ceaba8486e..7744441c8d5f 100644 --- a/include/uapi/linux/xfrm.h +++ b/include/uapi/linux/xfrm.h @@ -321,6 +321,7 @@ enum xfrm_attr_type_t { XFRMA_IF_ID, /* __u32 */ XFRMA_MTIMER_THRESH, /* __u32 in seconds for input SA */ XFRMA_SA_DIR, /* __u8 */ + XFRMA_NAT_KEEPALIVE_INTERVAL, /* __u32 in seconds for NAT keepalive */ __XFRMA_MAX #define XFRMA_OUTPUT_MARK XFRMA_SET_MARK /* Compatibility */ diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c index 8041dc181bd4..2b893858b9a9 100644 --- a/net/ipv6/af_inet6.c +++ b/net/ipv6/af_inet6.c @@ -1060,6 +1060,7 @@ static const struct ipv6_stub ipv6_stub_impl = { .nd_tbl = &nd_tbl, .ipv6_fragment = ip6_fragment, .ipv6_dev_find = ipv6_dev_find, + .ip6_xmit = ip6_xmit, }; static const struct ipv6_bpf_stub ipv6_bpf_stub_impl = { diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c index 42fb6996b077..f03dbc011e65 100644 --- a/net/ipv6/xfrm6_policy.c +++ b/net/ipv6/xfrm6_policy.c @@ -285,8 +285,14 @@ int __init xfrm6_init(void) ret = register_pernet_subsys(&xfrm6_net_ops); if (ret) goto out_protocol; + + ret = xfrm_nat_keepalive_init(AF_INET6); + if (ret) + goto out_nat_keepalive; out: return ret; +out_nat_keepalive: + unregister_pernet_subsys(&xfrm6_net_ops); out_protocol: xfrm6_protocol_fini(); out_state: @@ -298,6 +304,7 @@ int __init xfrm6_init(void) void xfrm6_fini(void) { + xfrm_nat_keepalive_fini(AF_INET6); unregister_pernet_subsys(&xfrm6_net_ops); xfrm6_protocol_fini(); xfrm6_policy_fini(); diff --git a/net/xfrm/Makefile b/net/xfrm/Makefile index 547cec77ba03..512e0b2f8514 100644 --- a/net/xfrm/Makefile +++ b/net/xfrm/Makefile @@ -13,7 +13,8 @@ endif 
 obj-$(CONFIG_XFRM) := xfrm_policy.o xfrm_state.o xfrm_hash.o \
		      xfrm_input.o xfrm_output.o \
-		      xfrm_sysctl.o xfrm_replay.o xfrm_device.o
+		      xfrm_sysctl.o xfrm_replay.o xfrm_device.o \
+		      xfrm_nat_keepalive.o
 obj-$(CONFIG_XFRM_STATISTICS) += xfrm_proc.o
 obj-$(CONFIG_XFRM_ALGO) += xfrm_algo.o
 obj-$(CONFIG_XFRM_USER) += xfrm_user.o
diff --git a/net/xfrm/xfrm_compat.c b/net/xfrm/xfrm_compat.c
index 703d4172c7d7..91357ccaf4af 100644
--- a/net/xfrm/xfrm_compat.c
+++ b/net/xfrm/xfrm_compat.c
@@ -131,6 +131,7 @@ static const struct nla_policy compat_policy[XFRMA_MAX+1] = {
 	[XFRMA_IF_ID]		= { .type = NLA_U32 },
 	[XFRMA_MTIMER_THRESH]	= { .type = NLA_U32 },
 	[XFRMA_SA_DIR]          = NLA_POLICY_RANGE(NLA_U8, XFRM_SA_DIR_IN, XFRM_SA_DIR_OUT),
+	[XFRMA_NAT_KEEPALIVE_INTERVAL] = { .type = NLA_U32 },
 };
 
 static struct nlmsghdr *xfrm_nlmsg_put_compat(struct sk_buff *skb,
@@ -280,9 +281,10 @@ static int xfrm_xlate64_attr(struct sk_buff *dst, const struct nlattr *src)
 	case XFRMA_IF_ID:
 	case XFRMA_MTIMER_THRESH:
 	case XFRMA_SA_DIR:
+	case XFRMA_NAT_KEEPALIVE_INTERVAL:
 		return xfrm_nla_cpy(dst, src, nla_len(src));
 	default:
-		BUILD_BUG_ON(XFRMA_MAX != XFRMA_SA_DIR);
+		BUILD_BUG_ON(XFRMA_MAX != XFRMA_NAT_KEEPALIVE_INTERVAL);
 		pr_warn_once("unsupported nla_type %d\n", src->nla_type);
 		return -EOPNOTSUPP;
 	}
@@ -437,7 +439,7 @@ static int xfrm_xlate32_attr(void *dst, const struct nlattr *nla,
 	int err;
 
 	if (type > XFRMA_MAX) {
-		BUILD_BUG_ON(XFRMA_MAX != XFRMA_SA_DIR);
+		BUILD_BUG_ON(XFRMA_MAX != XFRMA_NAT_KEEPALIVE_INTERVAL);
 		NL_SET_ERR_MSG(extack, "Bad attribute");
 		return -EOPNOTSUPP;
 	}
diff --git a/net/xfrm/xfrm_nat_keepalive.c b/net/xfrm/xfrm_nat_keepalive.c
new file mode 100644
index 000000000000..82f0a301683f
--- /dev/null
+++ b/net/xfrm/xfrm_nat_keepalive.c
@@ -0,0 +1,292 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * xfrm_nat_keepalive.c
+ *
+ * (c) 2024 Eyal Birger <eyal.birger@gmail.com>
+ */
+
+#include <net/inet_common.h>
+#include <net/ip6_checksum.h>
+#include <net/xfrm.h>
+
+static DEFINE_PER_CPU(struct sock *, nat_keepalive_sk_ipv4);
+#if IS_ENABLED(CONFIG_IPV6)
+static DEFINE_PER_CPU(struct sock *, nat_keepalive_sk_ipv6);
+#endif
+
+struct nat_keepalive {
+	struct net *net;
+	u16 family;
+	xfrm_address_t saddr;
+	xfrm_address_t daddr;
+	__be16 encap_sport;
+	__be16 encap_dport;
+	__u32 smark;
+};
+
+static void nat_keepalive_init(struct nat_keepalive *ka, struct xfrm_state *x)
+{
+	ka->net = xs_net(x);
+	ka->family = x->props.family;
+	ka->saddr = x->props.saddr;
+	ka->daddr = x->id.daddr;
+	ka->encap_sport = x->encap->encap_sport;
+	ka->encap_dport = x->encap->encap_dport;
+	ka->smark = xfrm_smark_get(0, x);
+}
+
+static int nat_keepalive_send_ipv4(struct sk_buff *skb,
+				   struct nat_keepalive *ka)
+{
+	struct net *net = ka->net;
+	struct flowi4 fl4;
+	struct rtable *rt;
+	struct sock *sk;
+	__u8 tos = 0;
+	int err;
+
+	flowi4_init_output(&fl4, 0 /* oif */, skb->mark, tos,
+			   RT_SCOPE_UNIVERSE, IPPROTO_UDP, 0,
+			   ka->daddr.a4, ka->saddr.a4, ka->encap_dport,
+			   ka->encap_sport, sock_net_uid(net, NULL));
+
+	rt = ip_route_output_key(net, &fl4);
+	if (IS_ERR(rt))
+		return PTR_ERR(rt);
+
+	skb_dst_set(skb, &rt->dst);
+
+	sk = *this_cpu_ptr(&nat_keepalive_sk_ipv4);
+	sock_net_set(sk, net);
+	err = ip_build_and_send_pkt(skb, sk, fl4.saddr, fl4.daddr, NULL, tos);
+	sock_net_set(sk, &init_net);
+	return err;
+}
+
+#if IS_ENABLED(CONFIG_IPV6)
+static int nat_keepalive_send_ipv6(struct sk_buff *skb,
+				   struct nat_keepalive *ka,
+				   struct udphdr *uh)
+{
+	struct net *net = ka->net;
+	struct dst_entry *dst;
+	struct flowi6 fl6;
+	struct sock *sk;
+	__wsum csum;
+	int err;
+
+	csum = skb_checksum(skb, 0, skb->len, 0);
+	uh->check = csum_ipv6_magic(&ka->saddr.in6, &ka->daddr.in6,
+				    skb->len, IPPROTO_UDP, csum);
+	if (uh->check == 0)
+		uh->check = CSUM_MANGLED_0;
+
+	memset(&fl6, 0, sizeof(fl6));
+	fl6.flowi6_mark = skb->mark;
+	fl6.saddr = ka->saddr.in6;
+	fl6.daddr = ka->daddr.in6;
+	fl6.flowi6_proto = IPPROTO_UDP;
+	fl6.fl6_sport = ka->encap_sport;
+	fl6.fl6_dport = ka->encap_dport;
+
+	sk = *this_cpu_ptr(&nat_keepalive_sk_ipv6);
+	sock_net_set(sk, net);
+	dst = ipv6_stub->ipv6_dst_lookup_flow(net, sk, &fl6, NULL);
+	if (IS_ERR(dst))
+		return PTR_ERR(dst);
+
+	skb_dst_set(skb, dst);
+	err = ipv6_stub->ip6_xmit(sk, skb, &fl6, skb->mark, NULL, 0, 0);
+	sock_net_set(sk, &init_net);
+	return err;
+}
+#endif
+
+static void nat_keepalive_send(struct nat_keepalive *ka)
+{
+	const int nat_ka_hdrs_len = max(sizeof(struct iphdr),
+					sizeof(struct ipv6hdr)) +
+				    sizeof(struct udphdr);
+	const u8 nat_ka_payload = 0xFF;
+	int err = -EAFNOSUPPORT;
+	struct sk_buff *skb;
+	struct udphdr *uh;
+
+	skb = alloc_skb(nat_ka_hdrs_len + sizeof(nat_ka_payload), GFP_ATOMIC);
+	if (unlikely(!skb))
+		return;
+
+	skb_reserve(skb, nat_ka_hdrs_len);
+
+	skb_put_u8(skb, nat_ka_payload);
+
+	uh = skb_push(skb, sizeof(*uh));
+	uh->source = ka->encap_sport;
+	uh->dest = ka->encap_dport;
+	uh->len = htons(skb->len);
+	uh->check = 0;
+
+	skb->mark = ka->smark;
+
+	switch (ka->family) {
+	case AF_INET:
+		err = nat_keepalive_send_ipv4(skb, ka);
+		break;
+#if IS_ENABLED(CONFIG_IPV6)
+	case AF_INET6:
+		err = nat_keepalive_send_ipv6(skb, ka, uh);
+		break;
+#endif
+	}
+	if (err)
+		kfree_skb(skb);
+}
+
+struct nat_keepalive_work_ctx {
+	time64_t next_run;
+	time64_t now;
+};
+
+static int nat_keepalive_work_single(struct xfrm_state *x, int count, void *ptr)
+{
+	struct nat_keepalive_work_ctx *ctx = ptr;
+	bool send_keepalive = false;
+	struct nat_keepalive ka;
+	time64_t next_run;
+	u32 interval;
+	int delta;
+
+	interval = x->nat_keepalive_interval;
+	if (!interval)
+		return 0;
+
+	spin_lock(&x->lock);
+
+	delta = (int)(ctx->now - x->lastused);
+	if (delta < interval) {
+		x->nat_keepalive_expiration = ctx->now + interval - delta;
+		next_run = x->nat_keepalive_expiration;
+	} else if (x->nat_keepalive_expiration > ctx->now) {
+		next_run = x->nat_keepalive_expiration;
+	} else {
+		next_run = ctx->now + interval;
+		nat_keepalive_init(&ka, x);
+		send_keepalive = true;
+	}
+
+	spin_unlock(&x->lock);
+
+	if (send_keepalive)
+		nat_keepalive_send(&ka);
+
+	if (!ctx->next_run || next_run < ctx->next_run)
+		ctx->next_run = next_run;
+	return 0;
+}
+
+static void nat_keepalive_work(struct work_struct *work)
+{
+	struct nat_keepalive_work_ctx ctx;
+	struct xfrm_state_walk walk;
+	struct net *net;
+
+	ctx.next_run = 0;
+	ctx.now = ktime_get_real_seconds();
+
+	net = container_of(work, struct net, xfrm.nat_keepalive_work.work);
+	xfrm_state_walk_init(&walk, IPPROTO_ESP, NULL);
+	xfrm_state_walk(net, &walk, nat_keepalive_work_single, &ctx);
+	xfrm_state_walk_done(&walk, net);
+	if (ctx.next_run)
+		schedule_delayed_work(&net->xfrm.nat_keepalive_work,
+				      (ctx.next_run - ctx.now) * HZ);
+}
+
+static int nat_keepalive_sk_init(struct sock * __percpu *socks,
+				 unsigned short family)
+{
+	struct sock *sk;
+	int err, i;
+
+	for_each_possible_cpu(i) {
+		err = inet_ctl_sock_create(&sk, family, SOCK_RAW, IPPROTO_UDP,
+					   &init_net);
+		if (err < 0)
+			goto err;
+
+		*per_cpu_ptr(socks, i) = sk;
+	}
+
+	return 0;
+err:
+	for_each_possible_cpu(i)
+		inet_ctl_sock_destroy(*per_cpu_ptr(socks, i));
+	return err;
+}
+
+static void nat_keepalive_sk_fini(struct sock * __percpu *socks)
+{
+	int i;
+
+	for_each_possible_cpu(i)
+		inet_ctl_sock_destroy(*per_cpu_ptr(socks, i));
+}
+
+void xfrm_nat_keepalive_state_updated(struct xfrm_state *x)
+{
+	struct net *net;
+
+	if (!x->nat_keepalive_interval)
+		return;
+
+	net = xs_net(x);
+	schedule_delayed_work(&net->xfrm.nat_keepalive_work, 0);
+}
+
+int __net_init xfrm_nat_keepalive_net_init(struct net *net)
+{
+	INIT_DELAYED_WORK(&net->xfrm.nat_keepalive_work, nat_keepalive_work);
+	return 0;
+}
+
+int xfrm_nat_keepalive_net_fini(struct net *net)
+{
+	cancel_delayed_work_sync(&net->xfrm.nat_keepalive_work);
+	return 0;
+}
+
+int xfrm_nat_keepalive_init(unsigned short family)
+{
+	int err = -EAFNOSUPPORT;
+
+	switch (family) {
+	case AF_INET:
+		err = nat_keepalive_sk_init(&nat_keepalive_sk_ipv4, PF_INET);
+		break;
+#if IS_ENABLED(CONFIG_IPV6)
+	case AF_INET6:
+		err = nat_keepalive_sk_init(&nat_keepalive_sk_ipv6, PF_INET6);
+		break;
+#endif
+	}
+
+	if (err)
+		pr_err("xfrm nat keepalive init: failed to init err:%d\n", err);
+	return err;
+}
+EXPORT_SYMBOL_GPL(xfrm_nat_keepalive_init);
+
+void xfrm_nat_keepalive_fini(unsigned short family)
+{
+	switch (family) {
+	case AF_INET:
+		nat_keepalive_sk_fini(&nat_keepalive_sk_ipv4);
+		break;
+#if IS_ENABLED(CONFIG_IPV6)
+	case AF_INET6:
+		nat_keepalive_sk_fini(&nat_keepalive_sk_ipv6);
+		break;
+#endif
+	}
+}
+EXPORT_SYMBOL_GPL(xfrm_nat_keepalive_fini);
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 298b3a9eb48d..580c27ac7778 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -4288,8 +4288,14 @@ static int __net_init xfrm_net_init(struct net *net)
 	if (rv < 0)
 		goto out_sysctl;
 
+	rv = xfrm_nat_keepalive_net_init(net);
+	if (rv < 0)
+		goto out_nat_keepalive;
+
 	return 0;
 
+out_nat_keepalive:
+	xfrm_sysctl_fini(net);
 out_sysctl:
 	xfrm_policy_fini(net);
 out_policy:
@@ -4302,6 +4308,7 @@ static int __net_init xfrm_net_init(struct net *net)
 
 static void __net_exit xfrm_net_exit(struct net *net)
 {
+	xfrm_nat_keepalive_net_fini(net);
 	xfrm_sysctl_fini(net);
 	xfrm_policy_fini(net);
 	xfrm_state_fini(net);
@@ -4363,6 +4370,7 @@ void __init xfrm_init(void)
 #endif
 
 	register_xfrm_state_bpf();
+	xfrm_nat_keepalive_init(AF_INET);
 }
 
 #ifdef CONFIG_AUDITSYSCALL
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 649bb739df0d..abadc857cd45 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -715,6 +715,7 @@ int __xfrm_state_delete(struct xfrm_state *x)
 		if (x->id.spi)
 			hlist_del_rcu(&x->byspi);
 		net->xfrm.state_num--;
+		xfrm_nat_keepalive_state_updated(x);
 		spin_unlock(&net->xfrm.xfrm_state_lock);
 
 		if (x->encap_sk)
@@ -1453,6 +1454,7 @@ static void __xfrm_state_insert(struct xfrm_state *x)
 	net->xfrm.state_num++;
 
 	xfrm_hash_grow_check(net, x->bydst.next != NULL);
+	xfrm_nat_keepalive_state_updated(x);
 }
 
 /* net->xfrm.xfrm_state_lock is held */
@@ -2871,6 +2873,21 @@ int __xfrm_init_state(struct xfrm_state *x, bool init_replay, bool offload,
 		goto error;
 	}
 
+	if (x->nat_keepalive_interval) {
+		if (x->dir != XFRM_SA_DIR_OUT) {
+			NL_SET_ERR_MSG(extack, "NAT keepalive is only supported for outbound SAs");
+			err = -EINVAL;
+			goto error;
+		}
+
+		if (!x->encap || x->encap->encap_type != UDP_ENCAP_ESPINUDP) {
+			NL_SET_ERR_MSG(extack,
+				       "NAT keepalive is only supported for UDP encapsulation");
+			err = -EINVAL;
+			goto error;
+		}
+	}
+
 error:
 	return err;
 }
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index e83c687bd64e..a552cfa623ea 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -833,6 +833,10 @@ static struct xfrm_state *xfrm_state_construct(struct net *net,
 	if (attrs[XFRMA_SA_DIR])
 		x->dir = nla_get_u8(attrs[XFRMA_SA_DIR]);
 
+	if (attrs[XFRMA_NAT_KEEPALIVE_INTERVAL])
+		x->nat_keepalive_interval =
+			nla_get_u32(attrs[XFRMA_NAT_KEEPALIVE_INTERVAL]);
+
 	err = __xfrm_init_state(x, false, attrs[XFRMA_OFFLOAD_DEV], extack);
 	if (err)
 		goto error;
@@ -1288,6 +1292,13 @@ static int copy_to_user_state_extra(struct xfrm_state *x,
 	}
 	if (x->dir)
 		ret = nla_put_u8(skb, XFRMA_SA_DIR, x->dir);
+
+	if (x->nat_keepalive_interval) {
+		ret = nla_put_u32(skb, XFRMA_NAT_KEEPALIVE_INTERVAL,
+				  x->nat_keepalive_interval);
+		if (ret)
+			goto out;
+	}
 out:
 	return ret;
 }
@@ -3165,6 +3176,7 @@ const struct nla_policy xfrma_policy[XFRMA_MAX+1] = {
 	[XFRMA_IF_ID]		= { .type = NLA_U32 },
 	[XFRMA_MTIMER_THRESH]	= { .type = NLA_U32 },
 	[XFRMA_SA_DIR]          = NLA_POLICY_RANGE(NLA_U8, XFRM_SA_DIR_IN, XFRM_SA_DIR_OUT),
+	[XFRMA_NAT_KEEPALIVE_INTERVAL] = { .type = NLA_U32 },
 };
 EXPORT_SYMBOL_GPL(xfrma_policy);
 
@@ -3474,6 +3486,9 @@ static inline unsigned int xfrm_sa_len(struct xfrm_state *x)
 	if (x->dir)
 		l += nla_total_size(sizeof(x->dir));
 
+	if (x->nat_keepalive_interval)
+		l += nla_total_size(sizeof(x->nat_keepalive_interval));
+
 	return l;
 }
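For context, the packet nat_keepalive_send() above builds is exactly the RFC 3948 NAT-keepalive: a UDP datagram between the SA's encapsulation ports whose payload is a single 0xFF octet. A minimal userspace sketch of an equivalent probe, purely illustrative and not part of this patch (the peer address is a placeholder and port 4500 is just the conventional NAT-T port; the kernel uses the SA's configured encap ports):

/* Illustrative only: send one RFC 3948 NAT-keepalive (a single 0xFF byte)
 * from UDP port 4500 to a placeholder peer's port 4500.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	const unsigned char payload = 0xFF;	/* RFC 3948 keepalive octet */
	struct sockaddr_in src = { 0 }, dst = { 0 };
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0) {
		perror("socket");
		return 1;
	}

	src.sin_family = AF_INET;
	src.sin_port = htons(4500);		/* local NAT-T encap port */
	src.sin_addr.s_addr = htonl(INADDR_ANY);
	if (bind(fd, (struct sockaddr *)&src, sizeof(src)) < 0)
		perror("bind");			/* non-fatal for the demo */

	dst.sin_family = AF_INET;
	dst.sin_port = htons(4500);		/* peer's encap port */
	inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);	/* placeholder peer */

	if (sendto(fd, &payload, sizeof(payload), 0,
		   (struct sockaddr *)&dst, sizeof(dst)) < 0)
		perror("sendto");

	close(fd);
	return 0;
}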
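For completeness, a rough sketch of how userspace might attach the new attribute when installing an outbound SA over netlink. This is not taken from the patch: it assumes libmnl and uapi headers that already define XFRMA_NAT_KEEPALIVE_INTERVAL, and it only shows the attribute handling. A usable XFRM_MSG_NEWSA would also need the selector, SPI, algorithms, XFRMA_SA_DIR and the UDP_ENCAP_ESPINUDP encapsulation template filled in, which is omitted here.

/* Sketch: append XFRMA_NAT_KEEPALIVE_INTERVAL to an XFRM_MSG_NEWSA request.
 * The rest of the SA configuration is elided; the message is only built,
 * not sent.
 */
#include <libmnl/libmnl.h>
#include <linux/netlink.h>
#include <linux/xfrm.h>
#include <netinet/in.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
	char buf[MNL_SOCKET_BUFFER_SIZE];
	struct nlmsghdr *nlh;
	struct xfrm_usersa_info *sa;

	nlh = mnl_nlmsg_put_header(buf);
	nlh->nlmsg_type = XFRM_MSG_NEWSA;
	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
	nlh->nlmsg_seq = time(NULL);

	sa = mnl_nlmsg_put_extra_header(nlh, sizeof(*sa));
	sa->id.proto = IPPROTO_ESP;
	sa->family = AF_INET;
	/* ... addresses, SPI, mode, algorithms, XFRMA_ENCAP, XFRMA_SA_DIR ... */

	/* At most one keepalive per 20 seconds of outbound idle time. */
	mnl_attr_put_u32(nlh, XFRMA_NAT_KEEPALIVE_INTERVAL, 20);

	printf("built XFRM_MSG_NEWSA, %u bytes\n", nlh->nlmsg_len);
	/* A real client would now send this over a NETLINK_XFRM socket. */
	return 0;
}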