diff mbox series

[bpf-next,V11,4/7] bpf: add BPF-helper for MTU checking

Message ID 161047352084.4003084.16468571234023057969.stgit@firesoul (mailing list archive)
State Superseded
Delegated to: BPF
Headers show
Series bpf: New approach for BPF MTU handling | expand

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for bpf-next
netdev/subject_prefix success Link
netdev/cc_maintainers warning 10 maintainers not CCed: kafai@fb.com ast@kernel.org daniel@iogearbox.net yhs@fb.com kpsingh@kernel.org quentin@isovalent.com andrii@kernel.org songliubraving@fb.com hawk@kernel.org davem@davemloft.net
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 12257 this patch: 12257
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch warning WARNING: line length of 83 exceeds 80 columns
netdev/build_allmodconfig_warn success Errors and warnings before: 12910 this patch: 12910
netdev/header_inline success Link
netdev/stable success Stable not CCed

Commit Message

Jesper Dangaard Brouer Jan. 12, 2021, 5:45 p.m. UTC
This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs.

The SKB object is complex and the skb->len value (accessible from
BPF-prog) also include the length of any extra GRO/GSO segments, but
without taking into account that these GRO/GSO segments get added
transport (L4) and network (L3) headers before being transmitted. Thus,
this BPF-helper is created such that the BPF-programmer don't need to
handle these details in the BPF-prog.

The API is designed to help the BPF-programmer, that want to do packet
context size changes, which involves other helpers. These other helpers
usually does a delta size adjustment. This helper also support a delta
size (len_diff), which allow BPF-programmer to reuse arguments needed by
these other helpers, and perform the MTU check prior to doing any actual
size adjustment of the packet context.

It is on purpose, that we allow the len adjustment to become a negative
result, that will pass the MTU check. This might seem weird, but it's not
this helpers responsibility to "catch" wrong len_diff adjustments. Other
helpers will take care of these checks, if BPF-programmer chooses to do
actual size adjustment.

V9:
- Use dev->hard_header_len (instead of ETH_HLEN)
- Annotate with unlikely req from Daniel
- Fix logic error using skb_gso_validate_network_len from Daniel

V6:
- Took John's advice and dropped BPF_MTU_CHK_RELAX
- Returned MTU is kept at L3-level (like fib_lookup)

V4: Lot of changes
 - ifindex 0 now use current netdev for MTU lookup
 - rename helper from bpf_mtu_check to bpf_check_mtu
 - fix bug for GSO pkt length (as skb->len is total len)
 - remove __bpf_len_adj_positive, simply allow negative len adj

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/uapi/linux/bpf.h       |   67 ++++++++++++++++++++++
 net/core/filter.c              |  122 ++++++++++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |   67 ++++++++++++++++++++++
 3 files changed, 256 insertions(+)

Comments

Andrii Nakryiko Jan. 12, 2021, 7:23 p.m. UTC | #1
On Tue, Jan 12, 2021 at 9:49 AM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs.
>
> The SKB object is complex and the skb->len value (accessible from
> BPF-prog) also include the length of any extra GRO/GSO segments, but
> without taking into account that these GRO/GSO segments get added
> transport (L4) and network (L3) headers before being transmitted. Thus,
> this BPF-helper is created such that the BPF-programmer don't need to
> handle these details in the BPF-prog.
>
> The API is designed to help the BPF-programmer, that want to do packet
> context size changes, which involves other helpers. These other helpers
> usually does a delta size adjustment. This helper also support a delta
> size (len_diff), which allow BPF-programmer to reuse arguments needed by
> these other helpers, and perform the MTU check prior to doing any actual
> size adjustment of the packet context.
>
> It is on purpose, that we allow the len adjustment to become a negative
> result, that will pass the MTU check. This might seem weird, but it's not
> this helpers responsibility to "catch" wrong len_diff adjustments. Other
> helpers will take care of these checks, if BPF-programmer chooses to do
> actual size adjustment.
>
> V9:
> - Use dev->hard_header_len (instead of ETH_HLEN)
> - Annotate with unlikely req from Daniel
> - Fix logic error using skb_gso_validate_network_len from Daniel
>
> V6:
> - Took John's advice and dropped BPF_MTU_CHK_RELAX
> - Returned MTU is kept at L3-level (like fib_lookup)
>
> V4: Lot of changes
>  - ifindex 0 now use current netdev for MTU lookup
>  - rename helper from bpf_mtu_check to bpf_check_mtu
>  - fix bug for GSO pkt length (as skb->len is total len)
>  - remove __bpf_len_adj_positive, simply allow negative len adj
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  include/uapi/linux/bpf.h       |   67 ++++++++++++++++++++++
>  net/core/filter.c              |  122 ++++++++++++++++++++++++++++++++++++++++
>  tools/include/uapi/linux/bpf.h |   67 ++++++++++++++++++++++
>  3 files changed, 256 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 649586d656b6..fa2e99351758 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -3833,6 +3833,61 @@ union bpf_attr {
>   *     Return
>   *             A pointer to a struct socket on success or NULL if the file is
>   *             not a socket.
> + *
> + * int bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, u64 flags)

should return long, same as most other helpers

> + *     Description
> + *             Check ctx packet size against MTU of net device (based on
> + *             *ifindex*).  This helper will likely be used in combination with
> + *             helpers that adjust/change the packet size.  The argument
> + *             *len_diff* can be used for querying with a planned size
> + *             change. This allows to check MTU prior to changing packet ctx.
> + *

[...]
Daniel Borkmann Jan. 13, 2021, 11:07 p.m. UTC | #2
On 1/12/21 6:45 PM, Jesper Dangaard Brouer wrote:
> This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs.
[...]
> + * int bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, u64 flags)
> + *	Description
> + *		Check ctx packet size against MTU of net device (based on
> + *		*ifindex*).  This helper will likely be used in combination with
> + *		helpers that adjust/change the packet size.  The argument
> + *		*len_diff* can be used for querying with a planned size
> + *		change. This allows to check MTU prior to changing packet ctx.
> + *
> + *		Specifying *ifindex* zero means the MTU check is performed
> + *		against the current net device.  This is practical if this isn't
> + *		used prior to redirect.
> + *
> + *		The Linux kernel route table can configure MTUs on a more
> + *		specific per route level, which is not provided by this helper.
> + *		For route level MTU checks use the **bpf_fib_lookup**\ ()
> + *		helper.
> + *
> + *		*ctx* is either **struct xdp_md** for XDP programs or
> + *		**struct sk_buff** for tc cls_act programs.
> + *
> + *		The *flags* argument can be a combination of one or more of the
> + *		following values:
> + *
> + *		**BPF_MTU_CHK_SEGS**
> + *			This flag will only works for *ctx* **struct sk_buff**.
> + *			If packet context contains extra packet segment buffers
> + *			(often knows as GSO skb), then MTU check is harder to
> + *			check at this point, because in transmit path it is
> + *			possible for the skb packet to get re-segmented
> + *			(depending on net device features).  This could still be
> + *			a MTU violation, so this flag enables performing MTU
> + *			check against segments, with a different violation
> + *			return code to tell it apart. Check cannot use len_diff.
> + *
> + *		On return *mtu_len* pointer contains the MTU value of the net
> + *		device.  Remember the net device configured MTU is the L3 size,
> + *		which is returned here and XDP and TX length operate at L2.
> + *		Helper take this into account for you, but remember when using
> + *		MTU value in your BPF-code.  On input *mtu_len* must be a valid
> + *		pointer and be initialized (to zero), else verifier will reject
> + *		BPF program.
> + *
> + *	Return
> + *		* 0 on success, and populate MTU value in *mtu_len* pointer.
> + *
> + *		* < 0 if any input argument is invalid (*mtu_len* not updated)
> + *
> + *		MTU violations return positive values, but also populate MTU
> + *		value in *mtu_len* pointer, as this can be needed for
> + *		implementing PMTU handing:
> + *
> + *		* **BPF_MTU_CHK_RET_FRAG_NEEDED**
> + *		* **BPF_MTU_CHK_RET_SEGS_TOOBIG**
> + *
>    */
>   #define __BPF_FUNC_MAPPER(FN)		\
>   	FN(unspec),			\
> @@ -3998,6 +4053,7 @@ union bpf_attr {
>   	FN(ktime_get_coarse_ns),	\
>   	FN(ima_inode_hash),		\
>   	FN(sock_from_file),		\
> +	FN(check_mtu),			\
>   	/* */
>   
>   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> @@ -5030,6 +5086,17 @@ struct bpf_redir_neigh {
>   	};
>   };
>   
> +/* bpf_check_mtu flags*/
> +enum  bpf_check_mtu_flags {
> +	BPF_MTU_CHK_SEGS  = (1U << 0),
> +};
> +
> +enum bpf_check_mtu_ret {
> +	BPF_MTU_CHK_RET_SUCCESS,      /* check and lookup successful */
> +	BPF_MTU_CHK_RET_FRAG_NEEDED,  /* fragmentation required to fwd */
> +	BPF_MTU_CHK_RET_SEGS_TOOBIG,  /* GSO re-segmentation needed to fwd */
> +};
> +
>   enum bpf_task_fd_type {
>   	BPF_FD_TYPE_RAW_TRACEPOINT,	/* tp name */
>   	BPF_FD_TYPE_TRACEPOINT,		/* tp name */
> diff --git a/net/core/filter.c b/net/core/filter.c
> index db59ab55572c..3f2e593244ca 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5604,6 +5604,124 @@ static const struct bpf_func_proto bpf_skb_fib_lookup_proto = {
>   	.arg4_type	= ARG_ANYTHING,
>   };
>   
> +static struct net_device *__dev_via_ifindex(struct net_device *dev_curr,
> +					    u32 ifindex)
> +{
> +	struct net *netns = dev_net(dev_curr);
> +
> +	/* Non-redirect use-cases can use ifindex=0 and save ifindex lookup */
> +	if (ifindex == 0)
> +		return dev_curr;
> +
> +	return dev_get_by_index_rcu(netns, ifindex);
> +}
> +
> +BPF_CALL_5(bpf_skb_check_mtu, struct sk_buff *, skb,
> +	   u32, ifindex, u32 *, mtu_len, s32, len_diff, u64, flags)
> +{
> +	int ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
> +	struct net_device *dev = skb->dev;
> +	int skb_len, dev_len;
> +	int mtu;
> +
> +	if (unlikely(flags & ~(BPF_MTU_CHK_SEGS)))
> +		return -EINVAL;
> +
> +	dev = __dev_via_ifindex(dev, ifindex);
> +	if (unlikely(!dev))
> +		return -ENODEV;
> +
> +	mtu = READ_ONCE(dev->mtu);
> +
> +	dev_len = mtu + dev->hard_header_len;
> +	skb_len = skb->len + len_diff; /* minus result pass check */
> +	if (skb_len <= dev_len) {
> +		ret = BPF_MTU_CHK_RET_SUCCESS;
> +		goto out;
> +	}
> +	/* At this point, skb->len exceed MTU, but as it include length of all
> +	 * segments, it can still be below MTU.  The SKB can possibly get
> +	 * re-segmented in transmit path (see validate_xmit_skb).  Thus, user
> +	 * must choose if segs are to be MTU checked.  Last SKB "headlen" is
> +	 * checked against MTU.
> +	 */
> +	if (skb_is_gso(skb)) {
> +		ret = BPF_MTU_CHK_RET_SUCCESS;
> +
> +		if (!(flags & BPF_MTU_CHK_SEGS))
> +			goto out;
> +
> +		if (!skb_gso_validate_network_len(skb, mtu)) {
> +			ret = BPF_MTU_CHK_RET_SEGS_TOOBIG;
> +			goto out;
> +		}
> +
> +		skb_len = skb_headlen(skb) + len_diff;
> +		if (skb_len > dev_len) {

This is still not universally correct given drivers could cook up non-linear
skbs (e.g. page frags) on rx. So the result from BPF_MTU_CHK_SEGS flag cannot
be relied on. Do you have a particular use case for the BPF_MTU_CHK_SEGS? I
also don't see the flag being used anywhere in your selftests, so I presume
not as otherwise you would have added an example there? I would just drop the
flag altogether for the tc helper..

> +			ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
> +			goto out;
> +		}
> +	}
> +out:
> +	/* BPF verifier guarantees valid pointer */
> +	*mtu_len = mtu;
> +
> +	return ret;
> +}
Jesper Dangaard Brouer Jan. 14, 2021, 2:36 p.m. UTC | #3
On Thu, 14 Jan 2021 00:07:14 +0100
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 1/12/21 6:45 PM, Jesper Dangaard Brouer wrote:
> > This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs.  
> [...]
> > + * int bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, u64 flags)
> > + *	Description
> > + *		Check ctx packet size against MTU of net device (based on
> > + *		*ifindex*).  This helper will likely be used in combination with
> > + *		helpers that adjust/change the packet size.  The argument
> > + *		*len_diff* can be used for querying with a planned size
> > + *		change. This allows to check MTU prior to changing packet ctx.
> > + *
> > + *		Specifying *ifindex* zero means the MTU check is performed
> > + *		against the current net device.  This is practical if this isn't
> > + *		used prior to redirect.
> > + *
> > + *		The Linux kernel route table can configure MTUs on a more
> > + *		specific per route level, which is not provided by this helper.
> > + *		For route level MTU checks use the **bpf_fib_lookup**\ ()
> > + *		helper.
> > + *
> > + *		*ctx* is either **struct xdp_md** for XDP programs or
> > + *		**struct sk_buff** for tc cls_act programs.
> > + *
> > + *		The *flags* argument can be a combination of one or more of the
> > + *		following values:
> > + *
> > + *		**BPF_MTU_CHK_SEGS**
> > + *			This flag will only works for *ctx* **struct sk_buff**.
> > + *			If packet context contains extra packet segment buffers
> > + *			(often knows as GSO skb), then MTU check is harder to
> > + *			check at this point, because in transmit path it is
> > + *			possible for the skb packet to get re-segmented
> > + *			(depending on net device features).  This could still be
> > + *			a MTU violation, so this flag enables performing MTU
> > + *			check against segments, with a different violation
> > + *			return code to tell it apart. Check cannot use len_diff.
> > + *
> > + *		On return *mtu_len* pointer contains the MTU value of the net
> > + *		device.  Remember the net device configured MTU is the L3 size,
> > + *		which is returned here and XDP and TX length operate at L2.
> > + *		Helper take this into account for you, but remember when using
> > + *		MTU value in your BPF-code.  On input *mtu_len* must be a valid
> > + *		pointer and be initialized (to zero), else verifier will reject
> > + *		BPF program.
> > + *
> > + *	Return
> > + *		* 0 on success, and populate MTU value in *mtu_len* pointer.
> > + *
> > + *		* < 0 if any input argument is invalid (*mtu_len* not updated)
> > + *
> > + *		MTU violations return positive values, but also populate MTU
> > + *		value in *mtu_len* pointer, as this can be needed for
> > + *		implementing PMTU handing:
> > + *
> > + *		* **BPF_MTU_CHK_RET_FRAG_NEEDED**
> > + *		* **BPF_MTU_CHK_RET_SEGS_TOOBIG**
> > + *
> >    */
> >   #define __BPF_FUNC_MAPPER(FN)		\
> >   	FN(unspec),			\
> > @@ -3998,6 +4053,7 @@ union bpf_attr {
> >   	FN(ktime_get_coarse_ns),	\
> >   	FN(ima_inode_hash),		\
> >   	FN(sock_from_file),		\
> > +	FN(check_mtu),			\
> >   	/* */
> >   
> >   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> > @@ -5030,6 +5086,17 @@ struct bpf_redir_neigh {
> >   	};
> >   };
> >   
> > +/* bpf_check_mtu flags*/
> > +enum  bpf_check_mtu_flags {
> > +	BPF_MTU_CHK_SEGS  = (1U << 0),
> > +};
> > +
> > +enum bpf_check_mtu_ret {
> > +	BPF_MTU_CHK_RET_SUCCESS,      /* check and lookup successful */
> > +	BPF_MTU_CHK_RET_FRAG_NEEDED,  /* fragmentation required to fwd */
> > +	BPF_MTU_CHK_RET_SEGS_TOOBIG,  /* GSO re-segmentation needed to fwd */
> > +};
> > +
> >   enum bpf_task_fd_type {
> >   	BPF_FD_TYPE_RAW_TRACEPOINT,	/* tp name */
> >   	BPF_FD_TYPE_TRACEPOINT,		/* tp name */
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index db59ab55572c..3f2e593244ca 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -5604,6 +5604,124 @@ static const struct bpf_func_proto bpf_skb_fib_lookup_proto = {
> >   	.arg4_type	= ARG_ANYTHING,
> >   };
> >   
> > +static struct net_device *__dev_via_ifindex(struct net_device *dev_curr,
> > +					    u32 ifindex)
> > +{
> > +	struct net *netns = dev_net(dev_curr);
> > +
> > +	/* Non-redirect use-cases can use ifindex=0 and save ifindex lookup */
> > +	if (ifindex == 0)
> > +		return dev_curr;
> > +
> > +	return dev_get_by_index_rcu(netns, ifindex);
> > +}
> > +
> > +BPF_CALL_5(bpf_skb_check_mtu, struct sk_buff *, skb,
> > +	   u32, ifindex, u32 *, mtu_len, s32, len_diff, u64, flags)
> > +{
> > +	int ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
> > +	struct net_device *dev = skb->dev;
> > +	int skb_len, dev_len;
> > +	int mtu;
> > +
> > +	if (unlikely(flags & ~(BPF_MTU_CHK_SEGS)))
> > +		return -EINVAL;
> > +
> > +	dev = __dev_via_ifindex(dev, ifindex);
> > +	if (unlikely(!dev))
> > +		return -ENODEV;
> > +
> > +	mtu = READ_ONCE(dev->mtu);
> > +
> > +	dev_len = mtu + dev->hard_header_len;
> > +	skb_len = skb->len + len_diff; /* minus result pass check */
> > +	if (skb_len <= dev_len) {
> > +		ret = BPF_MTU_CHK_RET_SUCCESS;
> > +		goto out;
> > +	}
> > +	/* At this point, skb->len exceed MTU, but as it include length of all
> > +	 * segments, it can still be below MTU.  The SKB can possibly get
> > +	 * re-segmented in transmit path (see validate_xmit_skb).  Thus, user
> > +	 * must choose if segs are to be MTU checked.  Last SKB "headlen" is
> > +	 * checked against MTU.
> > +	 */
> > +	if (skb_is_gso(skb)) {
> > +		ret = BPF_MTU_CHK_RET_SUCCESS;
> > +
> > +		if (!(flags & BPF_MTU_CHK_SEGS))
> > +			goto out;
> > +
> > +		if (!skb_gso_validate_network_len(skb, mtu)) {
> > +			ret = BPF_MTU_CHK_RET_SEGS_TOOBIG;
> > +			goto out;
> > +		}
> > +
> > +		skb_len = skb_headlen(skb) + len_diff;
> > +		if (skb_len > dev_len) {

Maybe I'm misunderstanding you below?  Do you just want the above two
lines moved from the patch? (sure I can do that... as it is just an
extra check of the "head"/first segment of the packet, and only done if
BPF_MTU_CHK_SEGS is set)

> 
> This is still not universally correct given drivers could cook up non-linear
> skbs (e.g. page frags) on rx. So the result from BPF_MTU_CHK_SEGS flag cannot
> be relied on. 

That is why it is a flag, that need to be explicitly set.

> Do you have a particular use case for the BPF_MTU_CHK_SEGS?

The complaint from Maze (and others) were that when skb_is_gso then all
the MTU checks are bypassed.  This flag enables checking the GSO part
via skb_gso_validate_network_len().  We cannot enable it per default,
as you say, it is universally correct in all cases.

> I also don't see the flag being used anywhere in your selftests, so I presume
> not as otherwise you would have added an example there?

I'm using the flag in the bpf-examples code[1], this is how I've tested
the code path.

I've not found a way to generate GSO packet via the selftests
infrastructure via bpf_prog_test_run_xattr().  I'm 

[1] https://github.com/xdp-project/bpf-examples/blob/master/MTU-tests/tc_mtu_enforce.c


> I would just drop the flag altogether for the tc helper..

As explain I cannot drop the flag altogether, I would also have to
remove the code then.  Sorry, but I don't 100% understand the change
you are requesting.


> > +			ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
> > +			goto out;
> > +		}
> > +	}
> > +out:
> > +	/* BPF verifier guarantees valid pointer */
> > +	*mtu_len = mtu;
> > +
> > +	return ret;
> > +}  
>
Jesper Dangaard Brouer Jan. 14, 2021, 2:52 p.m. UTC | #4
On Tue, 12 Jan 2021 11:23:33 -0800
Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:

> On Tue, Jan 12, 2021 at 9:49 AM Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> >
> > This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs.
> >
> > The SKB object is complex and the skb->len value (accessible from
> > BPF-prog) also include the length of any extra GRO/GSO segments, but
> > without taking into account that these GRO/GSO segments get added
> > transport (L4) and network (L3) headers before being transmitted. Thus,
> > this BPF-helper is created such that the BPF-programmer don't need to
> > handle these details in the BPF-prog.
> >
> > The API is designed to help the BPF-programmer, that want to do packet
> > context size changes, which involves other helpers. These other helpers
> > usually does a delta size adjustment. This helper also support a delta
> > size (len_diff), which allow BPF-programmer to reuse arguments needed by
> > these other helpers, and perform the MTU check prior to doing any actual
> > size adjustment of the packet context.
> >
> > It is on purpose, that we allow the len adjustment to become a negative
> > result, that will pass the MTU check. This might seem weird, but it's not
> > this helpers responsibility to "catch" wrong len_diff adjustments. Other
> > helpers will take care of these checks, if BPF-programmer chooses to do
> > actual size adjustment.
> >
> > V9:
> > - Use dev->hard_header_len (instead of ETH_HLEN)
> > - Annotate with unlikely req from Daniel
> > - Fix logic error using skb_gso_validate_network_len from Daniel
> >
> > V6:
> > - Took John's advice and dropped BPF_MTU_CHK_RELAX
> > - Returned MTU is kept at L3-level (like fib_lookup)
> >
> > V4: Lot of changes
> >  - ifindex 0 now use current netdev for MTU lookup
> >  - rename helper from bpf_mtu_check to bpf_check_mtu
> >  - fix bug for GSO pkt length (as skb->len is total len)
> >  - remove __bpf_len_adj_positive, simply allow negative len adj
> >
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> >  include/uapi/linux/bpf.h       |   67 ++++++++++++++++++++++
> >  net/core/filter.c              |  122 ++++++++++++++++++++++++++++++++++++++++
> >  tools/include/uapi/linux/bpf.h |   67 ++++++++++++++++++++++
> >  3 files changed, 256 insertions(+)
> >
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 649586d656b6..fa2e99351758 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -3833,6 +3833,61 @@ union bpf_attr {
> >   *     Return
> >   *             A pointer to a struct socket on success or NULL if the file is
> >   *             not a socket.
> > + *
> > + * int bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, u64 flags)  
> 
> should return long, same as most other helpers

Is it enough to change it here?
(as this will be used for generating the helpers header file,
via ./scripts/bpf_helpers_doc.py --header)

Or do I also need to change bpf_func_proto.ret_type ?

> > + *     Description
> > + *             Check ctx packet size against MTU of net device (based on
> > + *             *ifindex*).  This helper will likely be used in combination with
> > + *             helpers that adjust/change the packet size.  The argument
> > + *             *len_diff* can be used for querying with a planned size
> > + *             change. This allows to check MTU prior to changing packet ctx.
> > + *  
> 
> [...]
>
Yonghong Song Jan. 14, 2021, 3:33 p.m. UTC | #5
On 1/14/21 6:52 AM, Jesper Dangaard Brouer wrote:
> On Tue, 12 Jan 2021 11:23:33 -0800
> Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> 
>> On Tue, Jan 12, 2021 at 9:49 AM Jesper Dangaard Brouer
>> <brouer@redhat.com> wrote:
>>>
>>> This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs.
>>>
>>> The SKB object is complex and the skb->len value (accessible from
>>> BPF-prog) also include the length of any extra GRO/GSO segments, but
>>> without taking into account that these GRO/GSO segments get added
>>> transport (L4) and network (L3) headers before being transmitted. Thus,
>>> this BPF-helper is created such that the BPF-programmer don't need to
>>> handle these details in the BPF-prog.
>>>
>>> The API is designed to help the BPF-programmer, that want to do packet
>>> context size changes, which involves other helpers. These other helpers
>>> usually does a delta size adjustment. This helper also support a delta
>>> size (len_diff), which allow BPF-programmer to reuse arguments needed by
>>> these other helpers, and perform the MTU check prior to doing any actual
>>> size adjustment of the packet context.
>>>
>>> It is on purpose, that we allow the len adjustment to become a negative
>>> result, that will pass the MTU check. This might seem weird, but it's not
>>> this helpers responsibility to "catch" wrong len_diff adjustments. Other
>>> helpers will take care of these checks, if BPF-programmer chooses to do
>>> actual size adjustment.
>>>
>>> V9:
>>> - Use dev->hard_header_len (instead of ETH_HLEN)
>>> - Annotate with unlikely req from Daniel
>>> - Fix logic error using skb_gso_validate_network_len from Daniel
>>>
>>> V6:
>>> - Took John's advice and dropped BPF_MTU_CHK_RELAX
>>> - Returned MTU is kept at L3-level (like fib_lookup)
>>>
>>> V4: Lot of changes
>>>   - ifindex 0 now use current netdev for MTU lookup
>>>   - rename helper from bpf_mtu_check to bpf_check_mtu
>>>   - fix bug for GSO pkt length (as skb->len is total len)
>>>   - remove __bpf_len_adj_positive, simply allow negative len adj
>>>
>>> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
>>> ---
>>>   include/uapi/linux/bpf.h       |   67 ++++++++++++++++++++++
>>>   net/core/filter.c              |  122 ++++++++++++++++++++++++++++++++++++++++
>>>   tools/include/uapi/linux/bpf.h |   67 ++++++++++++++++++++++
>>>   3 files changed, 256 insertions(+)
>>>
>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>> index 649586d656b6..fa2e99351758 100644
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -3833,6 +3833,61 @@ union bpf_attr {
>>>    *     Return
>>>    *             A pointer to a struct socket on success or NULL if the file is
>>>    *             not a socket.
>>> + *
>>> + * int bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, u64 flags)
>>
>> should return long, same as most other helpers
> 
> Is it enough to change it here?
> (as this will be used for generating the helpers header file,
> via ./scripts/bpf_helpers_doc.py --header)

Just change here is enough.

> 
> Or do I also need to change bpf_func_proto.ret_type ?

There is no need to change bpf_func_proto.ret_type.
RET_INTEGER already implies 64bit scalar.

> 
>>> + *     Description
>>> + *             Check ctx packet size against MTU of net device (based on
>>> + *             *ifindex*).  This helper will likely be used in combination with
>>> + *             helpers that adjust/change the packet size.  The argument
>>> + *             *len_diff* can be used for querying with a planned size
>>> + *             change. This allows to check MTU prior to changing packet ctx.
>>> + *
>>
>> [...]
>>
> 
> 
>
Daniel Borkmann Jan. 14, 2021, 10:28 p.m. UTC | #6
On 1/14/21 3:36 PM, Jesper Dangaard Brouer wrote:
[...]
>>> +BPF_CALL_5(bpf_skb_check_mtu, struct sk_buff *, skb,
>>> +	   u32, ifindex, u32 *, mtu_len, s32, len_diff, u64, flags)
>>> +{
>>> +	int ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
>>> +	struct net_device *dev = skb->dev;
>>> +	int skb_len, dev_len;
>>> +	int mtu;
>>> +
>>> +	if (unlikely(flags & ~(BPF_MTU_CHK_SEGS)))
>>> +		return -EINVAL;
>>> +
>>> +	dev = __dev_via_ifindex(dev, ifindex);
>>> +	if (unlikely(!dev))
>>> +		return -ENODEV;
>>> +
>>> +	mtu = READ_ONCE(dev->mtu);
>>> +
>>> +	dev_len = mtu + dev->hard_header_len;
>>> +	skb_len = skb->len + len_diff; /* minus result pass check */
>>> +	if (skb_len <= dev_len) {
>>> +		ret = BPF_MTU_CHK_RET_SUCCESS;
>>> +		goto out;
>>> +	}
>>> +	/* At this point, skb->len exceed MTU, but as it include length of all
>>> +	 * segments, it can still be below MTU.  The SKB can possibly get
>>> +	 * re-segmented in transmit path (see validate_xmit_skb).  Thus, user
>>> +	 * must choose if segs are to be MTU checked.  Last SKB "headlen" is
>>> +	 * checked against MTU.
>>> +	 */
>>> +	if (skb_is_gso(skb)) {
>>> +		ret = BPF_MTU_CHK_RET_SUCCESS;
>>> +
>>> +		if (!(flags & BPF_MTU_CHK_SEGS))
>>> +			goto out;
>>> +
>>> +		if (!skb_gso_validate_network_len(skb, mtu)) {
>>> +			ret = BPF_MTU_CHK_RET_SEGS_TOOBIG;
>>> +			goto out;
>>> +		}
>>> +
>>> +		skb_len = skb_headlen(skb) + len_diff;
>>> +		if (skb_len > dev_len) {
[...]
>> Do you have a particular use case for the BPF_MTU_CHK_SEGS?
> 
> The complaint from Maze (and others) were that when skb_is_gso then all
> the MTU checks are bypassed.  This flag enables checking the GSO part
> via skb_gso_validate_network_len().  We cannot enable it per default,
> as you say, it is universally correct in all cases.

If there is a desire to have access to the skb_gso_validate_network_len(), I'd
keep that behind the flag then, but would drop the skb_headlen(skb) + len_diff
case given the mentioned case on rx where it would yield misleading results to
users that might be unintuitive & hard to debug.

>> I also don't see the flag being used anywhere in your selftests, so I presume
>> not as otherwise you would have added an example there?
> 
> I'm using the flag in the bpf-examples code[1], this is how I've tested
> the code path.
> 
> I've not found a way to generate GSO packet via the selftests
> infrastructure via bpf_prog_test_run_xattr().  I'm
> 
> [1] https://github.com/xdp-project/bpf-examples/blob/master/MTU-tests/tc_mtu_enforce.c

Haven't checked but likely something as prog_tests/skb_ctx.c might not be sufficient
to pass it into the helper. For real case you might need a netns + veth setup like
some of the other tests are doing and then generating TCP stream from one end to the
other.
Jesper Dangaard Brouer Jan. 18, 2021, 11:04 a.m. UTC | #7
On Thu, 14 Jan 2021 23:28:57 +0100
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 1/14/21 3:36 PM, Jesper Dangaard Brouer wrote:
> [...]
> >>> +BPF_CALL_5(bpf_skb_check_mtu, struct sk_buff *, skb,
> >>> +	   u32, ifindex, u32 *, mtu_len, s32, len_diff, u64, flags)
> >>> +{
> >>> +	int ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
> >>> +	struct net_device *dev = skb->dev;
> >>> +	int skb_len, dev_len;
> >>> +	int mtu;
> >>> +
> >>> +	if (unlikely(flags & ~(BPF_MTU_CHK_SEGS)))
> >>> +		return -EINVAL;
> >>> +
> >>> +	dev = __dev_via_ifindex(dev, ifindex);
> >>> +	if (unlikely(!dev))
> >>> +		return -ENODEV;
> >>> +
> >>> +	mtu = READ_ONCE(dev->mtu);
> >>> +
> >>> +	dev_len = mtu + dev->hard_header_len;
> >>> +	skb_len = skb->len + len_diff; /* minus result pass check */
> >>> +	if (skb_len <= dev_len) {
> >>> +		ret = BPF_MTU_CHK_RET_SUCCESS;
> >>> +		goto out;
> >>> +	}
> >>> +	/* At this point, skb->len exceed MTU, but as it include length of all
> >>> +	 * segments, it can still be below MTU.  The SKB can possibly get
> >>> +	 * re-segmented in transmit path (see validate_xmit_skb).  Thus, user
> >>> +	 * must choose if segs are to be MTU checked.  Last SKB "headlen" is
> >>> +	 * checked against MTU.
> >>> +	 */
> >>> +	if (skb_is_gso(skb)) {
> >>> +		ret = BPF_MTU_CHK_RET_SUCCESS;
> >>> +
> >>> +		if (!(flags & BPF_MTU_CHK_SEGS))
> >>> +			goto out;
> >>> +
> >>> +		if (!skb_gso_validate_network_len(skb, mtu)) {
> >>> +			ret = BPF_MTU_CHK_RET_SEGS_TOOBIG;
> >>> +			goto out;
> >>> +		}
> >>> +
> >>> +		skb_len = skb_headlen(skb) + len_diff;
> >>> +		if (skb_len > dev_len) {  
> [...]
> >> Do you have a particular use case for the BPF_MTU_CHK_SEGS?  
> > 
> > The complaint from Maze (and others) were that when skb_is_gso then all
> > the MTU checks are bypassed.  This flag enables checking the GSO part
> > via skb_gso_validate_network_len().  We cannot enable it per default,
> > as you say, it is universally correct in all cases.  
> 
> If there is a desire to have access to the skb_gso_validate_network_len(), I'd
> keep that behind the flag then, but would drop the skb_headlen(skb) + len_diff
> case given the mentioned case on rx where it would yield misleading results to
> users that might be unintuitive & hard to debug.

Okay, I will update the patch, and drop those lines.

> >> I also don't see the flag being used anywhere in your selftests, so I presume
> >> not as otherwise you would have added an example there?  
> > 
> > I'm using the flag in the bpf-examples code[1], this is how I've tested
> > the code path.
> > 
> > I've not found a way to generate GSO packet via the selftests
> > infrastructure via bpf_prog_test_run_xattr().  I'm
> > 
> > [1] https://github.com/xdp-project/bpf-examples/blob/master/MTU-tests/tc_mtu_enforce.c  
> 
> Haven't checked but likely something as prog_tests/skb_ctx.c might not be sufficient
> to pass it into the helper. For real case you might need a netns + veth setup like
> some of the other tests are doing and then generating TCP stream from one end to the
> other.

I have looked at prog_tests/skb_ctx.c and (as you say yourself) this is
not sufficient.  I can look into creating a netns+veth setup, but I
will appreciate if we can merge this patchset to make forward progress,
as I'm sure the netns+veth setup will require its own round of nitpicking.

I have created netns+veth test scripts before (see test_xdp_vlan.sh),
but my experience is that people/maintainers forget/don't to run these
separate shell scripts.  Thus, if I create a netns+veth test, then I
will prefer if I can integrate this into the "test_progs", as I know
that will be run by people/maintainers.
diff mbox series

Patch

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 649586d656b6..fa2e99351758 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3833,6 +3833,61 @@  union bpf_attr {
  *	Return
  *		A pointer to a struct socket on success or NULL if the file is
  *		not a socket.
+ *
+ * int bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, u64 flags)
+ *	Description
+ *		Check ctx packet size against MTU of net device (based on
+ *		*ifindex*).  This helper will likely be used in combination with
+ *		helpers that adjust/change the packet size.  The argument
+ *		*len_diff* can be used for querying with a planned size
+ *		change. This allows to check MTU prior to changing packet ctx.
+ *
+ *		Specifying *ifindex* zero means the MTU check is performed
+ *		against the current net device.  This is practical if this isn't
+ *		used prior to redirect.
+ *
+ *		The Linux kernel route table can configure MTUs on a more
+ *		specific per route level, which is not provided by this helper.
+ *		For route level MTU checks use the **bpf_fib_lookup**\ ()
+ *		helper.
+ *
+ *		*ctx* is either **struct xdp_md** for XDP programs or
+ *		**struct sk_buff** for tc cls_act programs.
+ *
+ *		The *flags* argument can be a combination of one or more of the
+ *		following values:
+ *
+ *		**BPF_MTU_CHK_SEGS**
+ *			This flag will only works for *ctx* **struct sk_buff**.
+ *			If packet context contains extra packet segment buffers
+ *			(often knows as GSO skb), then MTU check is harder to
+ *			check at this point, because in transmit path it is
+ *			possible for the skb packet to get re-segmented
+ *			(depending on net device features).  This could still be
+ *			a MTU violation, so this flag enables performing MTU
+ *			check against segments, with a different violation
+ *			return code to tell it apart. Check cannot use len_diff.
+ *
+ *		On return *mtu_len* pointer contains the MTU value of the net
+ *		device.  Remember the net device configured MTU is the L3 size,
+ *		which is returned here and XDP and TX length operate at L2.
+ *		Helper take this into account for you, but remember when using
+ *		MTU value in your BPF-code.  On input *mtu_len* must be a valid
+ *		pointer and be initialized (to zero), else verifier will reject
+ *		BPF program.
+ *
+ *	Return
+ *		* 0 on success, and populate MTU value in *mtu_len* pointer.
+ *
+ *		* < 0 if any input argument is invalid (*mtu_len* not updated)
+ *
+ *		MTU violations return positive values, but also populate MTU
+ *		value in *mtu_len* pointer, as this can be needed for
+ *		implementing PMTU handing:
+ *
+ *		* **BPF_MTU_CHK_RET_FRAG_NEEDED**
+ *		* **BPF_MTU_CHK_RET_SEGS_TOOBIG**
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3998,6 +4053,7 @@  union bpf_attr {
 	FN(ktime_get_coarse_ns),	\
 	FN(ima_inode_hash),		\
 	FN(sock_from_file),		\
+	FN(check_mtu),			\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -5030,6 +5086,17 @@  struct bpf_redir_neigh {
 	};
 };
 
+/* bpf_check_mtu flags*/
+enum  bpf_check_mtu_flags {
+	BPF_MTU_CHK_SEGS  = (1U << 0),
+};
+
+enum bpf_check_mtu_ret {
+	BPF_MTU_CHK_RET_SUCCESS,      /* check and lookup successful */
+	BPF_MTU_CHK_RET_FRAG_NEEDED,  /* fragmentation required to fwd */
+	BPF_MTU_CHK_RET_SEGS_TOOBIG,  /* GSO re-segmentation needed to fwd */
+};
+
 enum bpf_task_fd_type {
 	BPF_FD_TYPE_RAW_TRACEPOINT,	/* tp name */
 	BPF_FD_TYPE_TRACEPOINT,		/* tp name */
diff --git a/net/core/filter.c b/net/core/filter.c
index db59ab55572c..3f2e593244ca 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5604,6 +5604,124 @@  static const struct bpf_func_proto bpf_skb_fib_lookup_proto = {
 	.arg4_type	= ARG_ANYTHING,
 };
 
+static struct net_device *__dev_via_ifindex(struct net_device *dev_curr,
+					    u32 ifindex)
+{
+	struct net *netns = dev_net(dev_curr);
+
+	/* Non-redirect use-cases can use ifindex=0 and save ifindex lookup */
+	if (ifindex == 0)
+		return dev_curr;
+
+	return dev_get_by_index_rcu(netns, ifindex);
+}
+
+BPF_CALL_5(bpf_skb_check_mtu, struct sk_buff *, skb,
+	   u32, ifindex, u32 *, mtu_len, s32, len_diff, u64, flags)
+{
+	int ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
+	struct net_device *dev = skb->dev;
+	int skb_len, dev_len;
+	int mtu;
+
+	if (unlikely(flags & ~(BPF_MTU_CHK_SEGS)))
+		return -EINVAL;
+
+	dev = __dev_via_ifindex(dev, ifindex);
+	if (unlikely(!dev))
+		return -ENODEV;
+
+	mtu = READ_ONCE(dev->mtu);
+
+	dev_len = mtu + dev->hard_header_len;
+	skb_len = skb->len + len_diff; /* minus result pass check */
+	if (skb_len <= dev_len) {
+		ret = BPF_MTU_CHK_RET_SUCCESS;
+		goto out;
+	}
+	/* At this point, skb->len exceed MTU, but as it include length of all
+	 * segments, it can still be below MTU.  The SKB can possibly get
+	 * re-segmented in transmit path (see validate_xmit_skb).  Thus, user
+	 * must choose if segs are to be MTU checked.  Last SKB "headlen" is
+	 * checked against MTU.
+	 */
+	if (skb_is_gso(skb)) {
+		ret = BPF_MTU_CHK_RET_SUCCESS;
+
+		if (!(flags & BPF_MTU_CHK_SEGS))
+			goto out;
+
+		if (!skb_gso_validate_network_len(skb, mtu)) {
+			ret = BPF_MTU_CHK_RET_SEGS_TOOBIG;
+			goto out;
+		}
+
+		skb_len = skb_headlen(skb) + len_diff;
+		if (skb_len > dev_len) {
+			ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
+			goto out;
+		}
+	}
+out:
+	/* BPF verifier guarantees valid pointer */
+	*mtu_len = mtu;
+
+	return ret;
+}
+
+BPF_CALL_5(bpf_xdp_check_mtu, struct xdp_buff *, xdp,
+	   u32, ifindex, u32 *, mtu_len, s32, len_diff, u64, flags)
+{
+	struct net_device *dev = xdp->rxq->dev;
+	int xdp_len = xdp->data_end - xdp->data;
+	int ret = BPF_MTU_CHK_RET_SUCCESS;
+	int mtu, dev_len;
+
+	/* XDP variant doesn't support multi-buffer segment check (yet) */
+	if (unlikely(flags))
+		return -EINVAL;
+
+	dev = __dev_via_ifindex(dev, ifindex);
+	if (unlikely(!dev))
+		return -ENODEV;
+
+	mtu = READ_ONCE(dev->mtu);
+
+	/* Add L2-header as dev MTU is L3 size */
+	dev_len = mtu + dev->hard_header_len;
+
+	xdp_len += len_diff; /* minus result pass check */
+	if (xdp_len > dev_len)
+		ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
+
+	/* BPF verifier guarantees valid pointer */
+	*mtu_len = mtu;
+
+	return ret;
+}
+
+static const struct bpf_func_proto bpf_skb_check_mtu_proto = {
+	.func		= bpf_skb_check_mtu,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type      = ARG_PTR_TO_CTX,
+	.arg2_type      = ARG_ANYTHING,
+	.arg3_type      = ARG_PTR_TO_INT,
+	.arg4_type      = ARG_ANYTHING,
+	.arg5_type      = ARG_ANYTHING,
+};
+
+static const struct bpf_func_proto bpf_xdp_check_mtu_proto = {
+	.func		= bpf_xdp_check_mtu,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type      = ARG_PTR_TO_CTX,
+	.arg2_type      = ARG_ANYTHING,
+	.arg3_type      = ARG_PTR_TO_INT,
+	.arg4_type      = ARG_ANYTHING,
+	.arg5_type      = ARG_ANYTHING,
+};
+
 #if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
 static int bpf_push_seg6_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len)
 {
@@ -7173,6 +7291,8 @@  tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_socket_uid_proto;
 	case BPF_FUNC_fib_lookup:
 		return &bpf_skb_fib_lookup_proto;
+	case BPF_FUNC_check_mtu:
+		return &bpf_skb_check_mtu_proto;
 	case BPF_FUNC_sk_fullsock:
 		return &bpf_sk_fullsock_proto;
 	case BPF_FUNC_sk_storage_get:
@@ -7242,6 +7362,8 @@  xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_xdp_adjust_tail_proto;
 	case BPF_FUNC_fib_lookup:
 		return &bpf_xdp_fib_lookup_proto;
+	case BPF_FUNC_check_mtu:
+		return &bpf_xdp_check_mtu_proto;
 #ifdef CONFIG_INET
 	case BPF_FUNC_sk_lookup_udp:
 		return &bpf_xdp_sk_lookup_udp_proto;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 649586d656b6..fa2e99351758 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3833,6 +3833,61 @@  union bpf_attr {
  *	Return
  *		A pointer to a struct socket on success or NULL if the file is
  *		not a socket.
+ *
+ * int bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, u64 flags)
+ *	Description
+ *		Check ctx packet size against MTU of net device (based on
+ *		*ifindex*).  This helper will likely be used in combination with
+ *		helpers that adjust/change the packet size.  The argument
+ *		*len_diff* can be used for querying with a planned size
+ *		change. This allows to check MTU prior to changing packet ctx.
+ *
+ *		Specifying *ifindex* zero means the MTU check is performed
+ *		against the current net device.  This is practical if this isn't
+ *		used prior to redirect.
+ *
+ *		The Linux kernel route table can configure MTUs on a more
+ *		specific per route level, which is not provided by this helper.
+ *		For route level MTU checks use the **bpf_fib_lookup**\ ()
+ *		helper.
+ *
+ *		*ctx* is either **struct xdp_md** for XDP programs or
+ *		**struct sk_buff** for tc cls_act programs.
+ *
+ *		The *flags* argument can be a combination of one or more of the
+ *		following values:
+ *
+ *		**BPF_MTU_CHK_SEGS**
+ *			This flag will only works for *ctx* **struct sk_buff**.
+ *			If packet context contains extra packet segment buffers
+ *			(often knows as GSO skb), then MTU check is harder to
+ *			check at this point, because in transmit path it is
+ *			possible for the skb packet to get re-segmented
+ *			(depending on net device features).  This could still be
+ *			a MTU violation, so this flag enables performing MTU
+ *			check against segments, with a different violation
+ *			return code to tell it apart. Check cannot use len_diff.
+ *
+ *		On return *mtu_len* pointer contains the MTU value of the net
+ *		device.  Remember the net device configured MTU is the L3 size,
+ *		which is returned here and XDP and TX length operate at L2.
+ *		Helper take this into account for you, but remember when using
+ *		MTU value in your BPF-code.  On input *mtu_len* must be a valid
+ *		pointer and be initialized (to zero), else verifier will reject
+ *		BPF program.
+ *
+ *	Return
+ *		* 0 on success, and populate MTU value in *mtu_len* pointer.
+ *
+ *		* < 0 if any input argument is invalid (*mtu_len* not updated)
+ *
+ *		MTU violations return positive values, but also populate MTU
+ *		value in *mtu_len* pointer, as this can be needed for
+ *		implementing PMTU handing:
+ *
+ *		* **BPF_MTU_CHK_RET_FRAG_NEEDED**
+ *		* **BPF_MTU_CHK_RET_SEGS_TOOBIG**
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3998,6 +4053,7 @@  union bpf_attr {
 	FN(ktime_get_coarse_ns),	\
 	FN(ima_inode_hash),		\
 	FN(sock_from_file),		\
+	FN(check_mtu),			\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -5030,6 +5086,17 @@  struct bpf_redir_neigh {
 	};
 };
 
+/* bpf_check_mtu flags*/
+enum  bpf_check_mtu_flags {
+	BPF_MTU_CHK_SEGS  = (1U << 0),
+};
+
+enum bpf_check_mtu_ret {
+	BPF_MTU_CHK_RET_SUCCESS,      /* check and lookup successful */
+	BPF_MTU_CHK_RET_FRAG_NEEDED,  /* fragmentation required to fwd */
+	BPF_MTU_CHK_RET_SEGS_TOOBIG,  /* GSO re-segmentation needed to fwd */
+};
+
 enum bpf_task_fd_type {
 	BPF_FD_TYPE_RAW_TRACEPOINT,	/* tp name */
 	BPF_FD_TYPE_TRACEPOINT,		/* tp name */