diff mbox series

[bpf-next,v2,3/8] bpf, net, frags: Add bpf_ip_check_defrag() kfunc

Message ID 7145c9891791db1c868a326476fef590f22b352b.1677526810.git.dxu@dxuuu.xyz (mailing list archive)
State Changes Requested
Delegated to: BPF
Headers show
Series Support defragmenting IPv(4|6) packets in BPF | expand

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1787 this patch: 1787
netdev/cc_maintainers success CCed 7 of 7 maintainers
netdev/build_clang success Errors and warnings before: 228 this patch: 228
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1918 this patch: 1918
netdev/checkpatch warning WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-2 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-3 success Logs for build for aarch64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-5 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-6 success Logs for build for x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-7 success Logs for llvm-toolchain
bpf/vmtest-bpf-next-VM_Test-8 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-4 success Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-11 pending Logs for test_maps on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-16 pending Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-28 success Logs for test_progs_no_alu32_parallel on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-31 pending Logs for test_progs_parallel on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-34 success Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-35 success Logs for test_verifier on aarch64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-37 success Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-38 success Logs for test_verifier on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-9 success Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for test_maps on aarch64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-12 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-13 success Logs for test_maps on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-14 fail Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-15 success Logs for test_progs on aarch64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-17 fail Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-18 success Logs for test_progs on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-19 fail Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-20 success Logs for test_progs_no_alu32 on aarch64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-22 fail Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for test_progs_no_alu32 on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-24 success Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-25 success Logs for test_progs_no_alu32_parallel on aarch64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-27 success Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-29 success Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-30 success Logs for test_progs_parallel on aarch64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-32 success Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-33 success Logs for test_progs_parallel on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-36 success Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-21 success Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-26 success Logs for test_progs_no_alu32_parallel on s390x with gcc
bpf/vmtest-bpf-next-PR fail PR summary

Commit Message

Daniel Xu Feb. 27, 2023, 7:51 p.m. UTC
This kfunc is used to defragment IPv4 packets. The idea is that if you
see a fragmented packet, you call this kfunc. If the kfunc returns 0,
then the skb has been updated to contain the entire reassembled packet.

If the kfunc returns an error (most likely -EINPROGRESS), then it means
the skb is part of a yet-incomplete original packet. A reasonable
response to -EINPROGRESS is to drop the packet, as the ip defrag
infrastructure is already hanging onto the frag for future reassembly.

Care has been taken to ensure the prog skb remains valid no matter what
the underlying ip_check_defrag() call does. This is in contrast to
ip_defrag(), which may consume the skb if the skb is part of a
yet-incomplete original packet.

So far this kfunc is only callable from TC clsact progs.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 include/net/ip.h           | 11 +++++
 net/ipv4/Makefile          |  1 +
 net/ipv4/ip_fragment.c     |  2 +
 net/ipv4/ip_fragment_bpf.c | 98 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 112 insertions(+)
 create mode 100644 net/ipv4/ip_fragment_bpf.c

Comments

Stanislav Fomichev Feb. 28, 2023, 7:37 p.m. UTC | #1
On 02/27, Daniel Xu wrote:
> This kfunc is used to defragment IPv4 packets. The idea is that if you
> see a fragmented packet, you call this kfunc. If the kfunc returns 0,
> then the skb has been updated to contain the entire reassembled packet.

> If the kfunc returns an error (most likely -EINPROGRESS), then it means
> the skb is part of a yet-incomplete original packet. A reasonable
> response to -EINPROGRESS is to drop the packet, as the ip defrag
> infrastructure is already hanging onto the frag for future reassembly.

> Care has been taken to ensure the prog skb remains valid no matter what
> the underlying ip_check_defrag() call does. This is in contrast to
> ip_defrag(), which may consume the skb if the skb is part of a
> yet-incomplete original packet.

> So far this kfunc is only callable from TC clsact progs.

> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
> ---
>   include/net/ip.h           | 11 +++++
>   net/ipv4/Makefile          |  1 +
>   net/ipv4/ip_fragment.c     |  2 +
>   net/ipv4/ip_fragment_bpf.c | 98 ++++++++++++++++++++++++++++++++++++++
>   4 files changed, 112 insertions(+)
>   create mode 100644 net/ipv4/ip_fragment_bpf.c

> diff --git a/include/net/ip.h b/include/net/ip.h
> index c3fffaa92d6e..f3796b1b5cac 100644
> --- a/include/net/ip.h
> +++ b/include/net/ip.h
> @@ -680,6 +680,7 @@ enum ip_defrag_users {
>   	IP_DEFRAG_VS_FWD,
>   	IP_DEFRAG_AF_PACKET,
>   	IP_DEFRAG_MACVLAN,
> +	IP_DEFRAG_BPF,
>   };

>   /* Return true if the value of 'user' is between 'lower_bond'
> @@ -693,6 +694,16 @@ static inline bool ip_defrag_user_in_between(u32  
> user,
>   }

>   int ip_defrag(struct net *net, struct sk_buff *skb, u32 user);
> +
> +#ifdef CONFIG_DEBUG_INFO_BTF
> +int register_ip_frag_bpf(void);
> +#else
> +static inline int register_ip_frag_bpf(void)
> +{
> +	return 0;
> +}
> +#endif
> +
>   #ifdef CONFIG_INET
>   struct sk_buff *ip_check_defrag(struct net *net, struct sk_buff *skb,  
> u32 user);
>   #else
> diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
> index 880277c9fd07..950efb166d37 100644
> --- a/net/ipv4/Makefile
> +++ b/net/ipv4/Makefile
> @@ -65,6 +65,7 @@ obj-$(CONFIG_TCP_CONG_ILLINOIS) += tcp_illinois.o
>   obj-$(CONFIG_NET_SOCK_MSG) += tcp_bpf.o
>   obj-$(CONFIG_BPF_SYSCALL) += udp_bpf.o
>   obj-$(CONFIG_NETLABEL) += cipso_ipv4.o
> +obj-$(CONFIG_DEBUG_INFO_BTF) += ip_fragment_bpf.o

>   obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \
>   		      xfrm4_output.o xfrm4_protocol.o
> diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
> index 959d2c4260ea..e3fda5203f09 100644
> --- a/net/ipv4/ip_fragment.c
> +++ b/net/ipv4/ip_fragment.c
> @@ -759,5 +759,7 @@ void __init ipfrag_init(void)
>   	if (inet_frags_init(&ip4_frags))
>   		panic("IP: failed to allocate ip4_frags cache\n");
>   	ip4_frags_ctl_register();
> +	if (register_ip_frag_bpf())
> +		panic("IP: bpf: failed to register ip_frag_bpf\n");
>   	register_pernet_subsys(&ip4_frags_ops);
>   }
> diff --git a/net/ipv4/ip_fragment_bpf.c b/net/ipv4/ip_fragment_bpf.c
> new file mode 100644
> index 000000000000..a9e5908ed216
> --- /dev/null
> +++ b/net/ipv4/ip_fragment_bpf.c
> @@ -0,0 +1,98 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Unstable ipv4 fragmentation helpers for TC-BPF hook
> + *
> + * These are called from SCHED_CLS BPF programs. Note that it is allowed  
> to
> + * break compatibility for these functions since the interface they are  
> exposed
> + * through to BPF programs is explicitly unstable.
> + */
> +
> +#include <linux/bpf.h>
> +#include <linux/btf_ids.h>
> +#include <linux/ip.h>
> +#include <linux/filter.h>
> +#include <linux/netdevice.h>
> +#include <net/ip.h>
> +#include <net/sock.h>
> +
> +__diag_push();
> +__diag_ignore_all("-Wmissing-prototypes",
> +		  "Global functions as their definitions will be in ip_fragment BTF");
> +
> +/* bpf_ip_check_defrag - Defragment an ipv4 packet
> + *
> + * This helper takes an skb as input. If this skb successfully  
> reassembles
> + * the original packet, the skb is updated to contain the original,  
> reassembled
> + * packet.
> + *
> + * Otherwise (on error or incomplete reassembly), the input skb remains
> + * unmodified.
> + *
> + * Parameters:
> + * @ctx		- Pointer to program context (skb)
> + * @netns	- Child network namespace id. If value is a negative signed
> + *		  32-bit integer, the netns of the device in the skb is used.
> + *
> + * Return:
> + * 0 on successfully reassembly or non-fragmented packet. Negative value  
> on
> + * error or incomplete reassembly.
> + */
> +int bpf_ip_check_defrag(struct __sk_buff *ctx, u64 netns)

Needs a __bpf_kfunc tag as well?

> +{
> +	struct sk_buff *skb = (struct sk_buff *)ctx;
> +	struct sk_buff *skb_cpy, *skb_out;
> +	struct net *caller_net;
> +	struct net *net;
> +	int mac_len;
> +	void *mac;
> +

[..]

> +	if (unlikely(!((s32)netns < 0 || netns <= S32_MAX)))
> +		return -EINVAL;

Can you explain what it does? Is it checking for -1 explicitly? Not sure
it works :-/

Maybe we can spell out the cases explicitly?

if (unlikely(
	     ((s32)netns < 0 && netns != S32_MAX) || /* -1 */
	     netns > U32_MAX /* higher 4 bytes */
	    )
	return -EINVAL;

> +
> +	caller_net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
> +	if ((s32)netns < 0) {
> +		net = caller_net;
> +	} else {
> +		net = get_net_ns_by_id(caller_net, netns);
> +		if (unlikely(!net))
> +			return -EINVAL;
> +	}
> +
> +	mac_len = skb->mac_len;
> +	skb_cpy = skb_copy(skb, GFP_ATOMIC);
> +	if (!skb_cpy)
> +		return -ENOMEM;
> +
> +	skb_out = ip_check_defrag(net, skb_cpy, IP_DEFRAG_BPF);
> +	if (IS_ERR(skb_out))
> +		return PTR_ERR(skb_out);
> +
> +	skb_morph(skb, skb_out);
> +	kfree_skb(skb_out);
> +
> +	/* ip_check_defrag() does not maintain mac header, so push empty header
> +	 * in so prog sees the correct layout. The empty mac header will be
> +	 * later pulled from cls_bpf.
> +	 */
> +	mac = skb_push(skb, mac_len);
> +	memset(mac, 0, mac_len);
> +	bpf_compute_data_pointers(skb);
> +
> +	return 0;
> +}
> +
> +__diag_pop()
> +
> +BTF_SET8_START(ip_frag_kfunc_set)
> +BTF_ID_FLAGS(func, bpf_ip_check_defrag, KF_CHANGES_PKT)
> +BTF_SET8_END(ip_frag_kfunc_set)
> +
> +static const struct btf_kfunc_id_set ip_frag_bpf_kfunc_set = {
> +	.owner = THIS_MODULE,
> +	.set   = &ip_frag_kfunc_set,
> +};
> +
> +int register_ip_frag_bpf(void)
> +{
> +	return register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS,
> +					 &ip_frag_bpf_kfunc_set);
> +}
> --
> 2.39.1
Daniel Xu Feb. 28, 2023, 10 p.m. UTC | #2
Hi Stanislav,

On Tue, Feb 28, 2023 at 11:37:16AM -0800, Stanislav Fomichev wrote:
> On 02/27, Daniel Xu wrote:
> > This kfunc is used to defragment IPv4 packets. The idea is that if you
> > see a fragmented packet, you call this kfunc. If the kfunc returns 0,
> > then the skb has been updated to contain the entire reassembled packet.
> 
> > If the kfunc returns an error (most likely -EINPROGRESS), then it means
> > the skb is part of a yet-incomplete original packet. A reasonable
> > response to -EINPROGRESS is to drop the packet, as the ip defrag
> > infrastructure is already hanging onto the frag for future reassembly.
> 
> > Care has been taken to ensure the prog skb remains valid no matter what
> > the underlying ip_check_defrag() call does. This is in contrast to
> > ip_defrag(), which may consume the skb if the skb is part of a
> > yet-incomplete original packet.
> 
> > So far this kfunc is only callable from TC clsact progs.
> 
> > Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
> > ---
> >   include/net/ip.h           | 11 +++++
> >   net/ipv4/Makefile          |  1 +
> >   net/ipv4/ip_fragment.c     |  2 +
> >   net/ipv4/ip_fragment_bpf.c | 98 ++++++++++++++++++++++++++++++++++++++
> >   4 files changed, 112 insertions(+)
> >   create mode 100644 net/ipv4/ip_fragment_bpf.c
> 
> > diff --git a/include/net/ip.h b/include/net/ip.h
> > index c3fffaa92d6e..f3796b1b5cac 100644
> > --- a/include/net/ip.h
> > +++ b/include/net/ip.h
> > @@ -680,6 +680,7 @@ enum ip_defrag_users {
> >   	IP_DEFRAG_VS_FWD,
> >   	IP_DEFRAG_AF_PACKET,
> >   	IP_DEFRAG_MACVLAN,
> > +	IP_DEFRAG_BPF,
> >   };
> 
> >   /* Return true if the value of 'user' is between 'lower_bond'
> > @@ -693,6 +694,16 @@ static inline bool ip_defrag_user_in_between(u32
> > user,
> >   }
> 
> >   int ip_defrag(struct net *net, struct sk_buff *skb, u32 user);
> > +
> > +#ifdef CONFIG_DEBUG_INFO_BTF
> > +int register_ip_frag_bpf(void);
> > +#else
> > +static inline int register_ip_frag_bpf(void)
> > +{
> > +	return 0;
> > +}
> > +#endif
> > +
> >   #ifdef CONFIG_INET
> >   struct sk_buff *ip_check_defrag(struct net *net, struct sk_buff *skb,
> > u32 user);
> >   #else
> > diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
> > index 880277c9fd07..950efb166d37 100644
> > --- a/net/ipv4/Makefile
> > +++ b/net/ipv4/Makefile
> > @@ -65,6 +65,7 @@ obj-$(CONFIG_TCP_CONG_ILLINOIS) += tcp_illinois.o
> >   obj-$(CONFIG_NET_SOCK_MSG) += tcp_bpf.o
> >   obj-$(CONFIG_BPF_SYSCALL) += udp_bpf.o
> >   obj-$(CONFIG_NETLABEL) += cipso_ipv4.o
> > +obj-$(CONFIG_DEBUG_INFO_BTF) += ip_fragment_bpf.o
> 
> >   obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \
> >   		      xfrm4_output.o xfrm4_protocol.o
> > diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
> > index 959d2c4260ea..e3fda5203f09 100644
> > --- a/net/ipv4/ip_fragment.c
> > +++ b/net/ipv4/ip_fragment.c
> > @@ -759,5 +759,7 @@ void __init ipfrag_init(void)
> >   	if (inet_frags_init(&ip4_frags))
> >   		panic("IP: failed to allocate ip4_frags cache\n");
> >   	ip4_frags_ctl_register();
> > +	if (register_ip_frag_bpf())
> > +		panic("IP: bpf: failed to register ip_frag_bpf\n");
> >   	register_pernet_subsys(&ip4_frags_ops);
> >   }
> > diff --git a/net/ipv4/ip_fragment_bpf.c b/net/ipv4/ip_fragment_bpf.c
> > new file mode 100644
> > index 000000000000..a9e5908ed216
> > --- /dev/null
> > +++ b/net/ipv4/ip_fragment_bpf.c
> > @@ -0,0 +1,98 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/* Unstable ipv4 fragmentation helpers for TC-BPF hook
> > + *
> > + * These are called from SCHED_CLS BPF programs. Note that it is
> > allowed to
> > + * break compatibility for these functions since the interface they are
> > exposed
> > + * through to BPF programs is explicitly unstable.
> > + */
> > +
> > +#include <linux/bpf.h>
> > +#include <linux/btf_ids.h>
> > +#include <linux/ip.h>
> > +#include <linux/filter.h>
> > +#include <linux/netdevice.h>
> > +#include <net/ip.h>
> > +#include <net/sock.h>
> > +
> > +__diag_push();
> > +__diag_ignore_all("-Wmissing-prototypes",
> > +		  "Global functions as their definitions will be in ip_fragment BTF");
> > +
> > +/* bpf_ip_check_defrag - Defragment an ipv4 packet
> > + *
> > + * This helper takes an skb as input. If this skb successfully
> > reassembles
> > + * the original packet, the skb is updated to contain the original,
> > reassembled
> > + * packet.
> > + *
> > + * Otherwise (on error or incomplete reassembly), the input skb remains
> > + * unmodified.
> > + *
> > + * Parameters:
> > + * @ctx		- Pointer to program context (skb)
> > + * @netns	- Child network namespace id. If value is a negative signed
> > + *		  32-bit integer, the netns of the device in the skb is used.
> > + *
> > + * Return:
> > + * 0 on successfully reassembly or non-fragmented packet. Negative
> > value on
> > + * error or incomplete reassembly.
> > + */
> > +int bpf_ip_check_defrag(struct __sk_buff *ctx, u64 netns)
> 
> Needs a __bpf_kfunc tag as well?

Ack.

> > +{
> > +	struct sk_buff *skb = (struct sk_buff *)ctx;
> > +	struct sk_buff *skb_cpy, *skb_out;
> > +	struct net *caller_net;
> > +	struct net *net;
> > +	int mac_len;
> > +	void *mac;
> > +
> 
> [..]
> 
> > +	if (unlikely(!((s32)netns < 0 || netns <= S32_MAX)))
> > +		return -EINVAL;
> 
> Can you explain what it does? Is it checking for -1 explicitly? Not sure
> it works :-/
> 
> Maybe we can spell out the cases explicitly?
> if (unlikely(
> 	     ((s32)netns < 0 && netns != S32_MAX) || /* -1 */
> 	     netns > U32_MAX /* higher 4 bytes */
> 	    )
> 	return -EINVAL;
> 

I copied this from net/core/filter.c:__bpf_skc_lookup:

	if (unlikely(flags || !((s32)netns_id < 0 || netns_id <= S32_MAX)))
		goto out;

The semantics are a bit odd, but I thought it'd be good to maintain
consistency. I believe the code correctly checks what the docs describe:

        @netns	- Child network namespace id. If value is a negative signed
                  32-bit integer, the netns of the device in the skb is used.

I can pull out the logic into a helper for v3.

[...]


Thanks,
Daniel
Stanislav Fomichev Feb. 28, 2023, 10:18 p.m. UTC | #3
On Tue, Feb 28, 2023 at 2:00 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
>
> Hi Stanislav,
>
> On Tue, Feb 28, 2023 at 11:37:16AM -0800, Stanislav Fomichev wrote:
> > On 02/27, Daniel Xu wrote:
> > > This kfunc is used to defragment IPv4 packets. The idea is that if you
> > > see a fragmented packet, you call this kfunc. If the kfunc returns 0,
> > > then the skb has been updated to contain the entire reassembled packet.
> >
> > > If the kfunc returns an error (most likely -EINPROGRESS), then it means
> > > the skb is part of a yet-incomplete original packet. A reasonable
> > > response to -EINPROGRESS is to drop the packet, as the ip defrag
> > > infrastructure is already hanging onto the frag for future reassembly.
> >
> > > Care has been taken to ensure the prog skb remains valid no matter what
> > > the underlying ip_check_defrag() call does. This is in contrast to
> > > ip_defrag(), which may consume the skb if the skb is part of a
> > > yet-incomplete original packet.
> >
> > > So far this kfunc is only callable from TC clsact progs.
> >
> > > Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
> > > ---
> > >   include/net/ip.h           | 11 +++++
> > >   net/ipv4/Makefile          |  1 +
> > >   net/ipv4/ip_fragment.c     |  2 +
> > >   net/ipv4/ip_fragment_bpf.c | 98 ++++++++++++++++++++++++++++++++++++++
> > >   4 files changed, 112 insertions(+)
> > >   create mode 100644 net/ipv4/ip_fragment_bpf.c
> >
> > > diff --git a/include/net/ip.h b/include/net/ip.h
> > > index c3fffaa92d6e..f3796b1b5cac 100644
> > > --- a/include/net/ip.h
> > > +++ b/include/net/ip.h
> > > @@ -680,6 +680,7 @@ enum ip_defrag_users {
> > >     IP_DEFRAG_VS_FWD,
> > >     IP_DEFRAG_AF_PACKET,
> > >     IP_DEFRAG_MACVLAN,
> > > +   IP_DEFRAG_BPF,
> > >   };
> >
> > >   /* Return true if the value of 'user' is between 'lower_bond'
> > > @@ -693,6 +694,16 @@ static inline bool ip_defrag_user_in_between(u32
> > > user,
> > >   }
> >
> > >   int ip_defrag(struct net *net, struct sk_buff *skb, u32 user);
> > > +
> > > +#ifdef CONFIG_DEBUG_INFO_BTF
> > > +int register_ip_frag_bpf(void);
> > > +#else
> > > +static inline int register_ip_frag_bpf(void)
> > > +{
> > > +   return 0;
> > > +}
> > > +#endif
> > > +
> > >   #ifdef CONFIG_INET
> > >   struct sk_buff *ip_check_defrag(struct net *net, struct sk_buff *skb,
> > > u32 user);
> > >   #else
> > > diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
> > > index 880277c9fd07..950efb166d37 100644
> > > --- a/net/ipv4/Makefile
> > > +++ b/net/ipv4/Makefile
> > > @@ -65,6 +65,7 @@ obj-$(CONFIG_TCP_CONG_ILLINOIS) += tcp_illinois.o
> > >   obj-$(CONFIG_NET_SOCK_MSG) += tcp_bpf.o
> > >   obj-$(CONFIG_BPF_SYSCALL) += udp_bpf.o
> > >   obj-$(CONFIG_NETLABEL) += cipso_ipv4.o
> > > +obj-$(CONFIG_DEBUG_INFO_BTF) += ip_fragment_bpf.o
> >
> > >   obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \
> > >                   xfrm4_output.o xfrm4_protocol.o
> > > diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
> > > index 959d2c4260ea..e3fda5203f09 100644
> > > --- a/net/ipv4/ip_fragment.c
> > > +++ b/net/ipv4/ip_fragment.c
> > > @@ -759,5 +759,7 @@ void __init ipfrag_init(void)
> > >     if (inet_frags_init(&ip4_frags))
> > >             panic("IP: failed to allocate ip4_frags cache\n");
> > >     ip4_frags_ctl_register();
> > > +   if (register_ip_frag_bpf())
> > > +           panic("IP: bpf: failed to register ip_frag_bpf\n");
> > >     register_pernet_subsys(&ip4_frags_ops);
> > >   }
> > > diff --git a/net/ipv4/ip_fragment_bpf.c b/net/ipv4/ip_fragment_bpf.c
> > > new file mode 100644
> > > index 000000000000..a9e5908ed216
> > > --- /dev/null
> > > +++ b/net/ipv4/ip_fragment_bpf.c
> > > @@ -0,0 +1,98 @@
> > > +// SPDX-License-Identifier: GPL-2.0-only
> > > +/* Unstable ipv4 fragmentation helpers for TC-BPF hook
> > > + *
> > > + * These are called from SCHED_CLS BPF programs. Note that it is
> > > allowed to
> > > + * break compatibility for these functions since the interface they are
> > > exposed
> > > + * through to BPF programs is explicitly unstable.
> > > + */
> > > +
> > > +#include <linux/bpf.h>
> > > +#include <linux/btf_ids.h>
> > > +#include <linux/ip.h>
> > > +#include <linux/filter.h>
> > > +#include <linux/netdevice.h>
> > > +#include <net/ip.h>
> > > +#include <net/sock.h>
> > > +
> > > +__diag_push();
> > > +__diag_ignore_all("-Wmissing-prototypes",
> > > +             "Global functions as their definitions will be in ip_fragment BTF");
> > > +
> > > +/* bpf_ip_check_defrag - Defragment an ipv4 packet
> > > + *
> > > + * This helper takes an skb as input. If this skb successfully
> > > reassembles
> > > + * the original packet, the skb is updated to contain the original,
> > > reassembled
> > > + * packet.
> > > + *
> > > + * Otherwise (on error or incomplete reassembly), the input skb remains
> > > + * unmodified.
> > > + *
> > > + * Parameters:
> > > + * @ctx            - Pointer to program context (skb)
> > > + * @netns  - Child network namespace id. If value is a negative signed
> > > + *           32-bit integer, the netns of the device in the skb is used.
> > > + *
> > > + * Return:
> > > + * 0 on successfully reassembly or non-fragmented packet. Negative
> > > value on
> > > + * error or incomplete reassembly.
> > > + */
> > > +int bpf_ip_check_defrag(struct __sk_buff *ctx, u64 netns)
> >
> > Needs a __bpf_kfunc tag as well?
>
> Ack.
>
> > > +{
> > > +   struct sk_buff *skb = (struct sk_buff *)ctx;
> > > +   struct sk_buff *skb_cpy, *skb_out;
> > > +   struct net *caller_net;
> > > +   struct net *net;
> > > +   int mac_len;
> > > +   void *mac;
> > > +
> >
> > [..]
> >
> > > +   if (unlikely(!((s32)netns < 0 || netns <= S32_MAX)))
> > > +           return -EINVAL;
> >
> > Can you explain what it does? Is it checking for -1 explicitly? Not sure
> > it works :-/
> >
> > Maybe we can spell out the cases explicitly?
> > if (unlikely(
> >            ((s32)netns < 0 && netns != S32_MAX) || /* -1 */
> >            netns > U32_MAX /* higher 4 bytes */
> >           )
> >       return -EINVAL;
> >
>
> I copied this from net/core/filter.c:__bpf_skc_lookup:
>
>         if (unlikely(flags || !((s32)netns_id < 0 || netns_id <= S32_MAX)))
>                 goto out;
>
> The semantics are a bit odd, but I thought it'd be good to maintain
> consistency. I believe the code correctly checks what the docs describe:
>
>         @netns  - Child network namespace id. If value is a negative signed
>                   32-bit integer, the netns of the device in the skb is used.
>
> I can pull out the logic into a helper for v3.
>
> [...]

Ah, so this comes from commit f71c6143c203 ("bpf: Support sk lookup in
netns with id 0") which explicitly treats everything <0 as
current_netns, makes sense.
In this case agreed, let's keep for consistency. Up to you on whether
to pull it out in the helper or keep as is.

>
>
> Thanks,
> Daniel
diff mbox series

Patch

diff --git a/include/net/ip.h b/include/net/ip.h
index c3fffaa92d6e..f3796b1b5cac 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -680,6 +680,7 @@  enum ip_defrag_users {
 	IP_DEFRAG_VS_FWD,
 	IP_DEFRAG_AF_PACKET,
 	IP_DEFRAG_MACVLAN,
+	IP_DEFRAG_BPF,
 };
 
 /* Return true if the value of 'user' is between 'lower_bond'
@@ -693,6 +694,16 @@  static inline bool ip_defrag_user_in_between(u32 user,
 }
 
 int ip_defrag(struct net *net, struct sk_buff *skb, u32 user);
+
+#ifdef CONFIG_DEBUG_INFO_BTF
+int register_ip_frag_bpf(void);
+#else
+static inline int register_ip_frag_bpf(void)
+{
+	return 0;
+}
+#endif
+
 #ifdef CONFIG_INET
 struct sk_buff *ip_check_defrag(struct net *net, struct sk_buff *skb, u32 user);
 #else
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index 880277c9fd07..950efb166d37 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -65,6 +65,7 @@  obj-$(CONFIG_TCP_CONG_ILLINOIS) += tcp_illinois.o
 obj-$(CONFIG_NET_SOCK_MSG) += tcp_bpf.o
 obj-$(CONFIG_BPF_SYSCALL) += udp_bpf.o
 obj-$(CONFIG_NETLABEL) += cipso_ipv4.o
+obj-$(CONFIG_DEBUG_INFO_BTF) += ip_fragment_bpf.o
 
 obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \
 		      xfrm4_output.o xfrm4_protocol.o
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 959d2c4260ea..e3fda5203f09 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -759,5 +759,7 @@  void __init ipfrag_init(void)
 	if (inet_frags_init(&ip4_frags))
 		panic("IP: failed to allocate ip4_frags cache\n");
 	ip4_frags_ctl_register();
+	if (register_ip_frag_bpf())
+		panic("IP: bpf: failed to register ip_frag_bpf\n");
 	register_pernet_subsys(&ip4_frags_ops);
 }
diff --git a/net/ipv4/ip_fragment_bpf.c b/net/ipv4/ip_fragment_bpf.c
new file mode 100644
index 000000000000..a9e5908ed216
--- /dev/null
+++ b/net/ipv4/ip_fragment_bpf.c
@@ -0,0 +1,98 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/* Unstable ipv4 fragmentation helpers for TC-BPF hook
+ *
+ * These are called from SCHED_CLS BPF programs. Note that it is allowed to
+ * break compatibility for these functions since the interface they are exposed
+ * through to BPF programs is explicitly unstable.
+ */
+
+#include <linux/bpf.h>
+#include <linux/btf_ids.h>
+#include <linux/ip.h>
+#include <linux/filter.h>
+#include <linux/netdevice.h>
+#include <net/ip.h>
+#include <net/sock.h>
+
+__diag_push();
+__diag_ignore_all("-Wmissing-prototypes",
+		  "Global functions as their definitions will be in ip_fragment BTF");
+
+/* bpf_ip_check_defrag - Defragment an ipv4 packet
+ *
+ * This helper takes an skb as input. If this skb successfully reassembles
+ * the original packet, the skb is updated to contain the original, reassembled
+ * packet.
+ *
+ * Otherwise (on error or incomplete reassembly), the input skb remains
+ * unmodified.
+ *
+ * Parameters:
+ * @ctx		- Pointer to program context (skb)
+ * @netns	- Child network namespace id. If value is a negative signed
+ *		  32-bit integer, the netns of the device in the skb is used.
+ *
+ * Return:
+ * 0 on successfully reassembly or non-fragmented packet. Negative value on
+ * error or incomplete reassembly.
+ */
+int bpf_ip_check_defrag(struct __sk_buff *ctx, u64 netns)
+{
+	struct sk_buff *skb = (struct sk_buff *)ctx;
+	struct sk_buff *skb_cpy, *skb_out;
+	struct net *caller_net;
+	struct net *net;
+	int mac_len;
+	void *mac;
+
+	if (unlikely(!((s32)netns < 0 || netns <= S32_MAX)))
+		return -EINVAL;
+
+	caller_net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
+	if ((s32)netns < 0) {
+		net = caller_net;
+	} else {
+		net = get_net_ns_by_id(caller_net, netns);
+		if (unlikely(!net))
+			return -EINVAL;
+	}
+
+	mac_len = skb->mac_len;
+	skb_cpy = skb_copy(skb, GFP_ATOMIC);
+	if (!skb_cpy)
+		return -ENOMEM;
+
+	skb_out = ip_check_defrag(net, skb_cpy, IP_DEFRAG_BPF);
+	if (IS_ERR(skb_out))
+		return PTR_ERR(skb_out);
+
+	skb_morph(skb, skb_out);
+	kfree_skb(skb_out);
+
+	/* ip_check_defrag() does not maintain mac header, so push empty header
+	 * in so prog sees the correct layout. The empty mac header will be
+	 * later pulled from cls_bpf.
+	 */
+	mac = skb_push(skb, mac_len);
+	memset(mac, 0, mac_len);
+	bpf_compute_data_pointers(skb);
+
+	return 0;
+}
+
+__diag_pop()
+
+BTF_SET8_START(ip_frag_kfunc_set)
+BTF_ID_FLAGS(func, bpf_ip_check_defrag, KF_CHANGES_PKT)
+BTF_SET8_END(ip_frag_kfunc_set)
+
+static const struct btf_kfunc_id_set ip_frag_bpf_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &ip_frag_kfunc_set,
+};
+
+int register_ip_frag_bpf(void)
+{
+	return register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS,
+					 &ip_frag_bpf_kfunc_set);
+}